cell2sentence

cell2sentence: create cell sentences from sequencing data




Reframing cells as sentences of genes, ordered by expression. Please read the manuscript on bioRxiv for methodological details and examples.

(https://www.biorxiv.org/content/10.1101/2022.09.18.508438)
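To make the idea concrete, here is a minimal pure-Python sketch of turning one cell's expression counts into a "sentence". It is an illustration only, not the library's implementation: the helper name cell_to_sentence, the toy counts, the choice to drop unexpressed genes, and the alphabetical tie-breaking are all assumptions made for this example.

```python
def cell_to_sentence(expression, gene_names):
    """Return gene names ordered from highest to lowest expression.

    Hypothetical helper for illustration; genes with zero counts are
    dropped, which is an assumption of this sketch.
    """
    expressed = [
        (count, name)
        for count, name in zip(expression, gene_names)
        if count > 0
    ]
    # Sort by descending count; break ties alphabetically so the
    # result is deterministic.
    expressed.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in expressed]


# Toy data: one cell, four genes (made-up counts).
genes = ["CD3E", "CD8B", "MS4A1", "NKG7"]
counts = [5, 12, 0, 5]

sentence = cell_to_sentence(counts, genes)
print(" ".join(sentence))
```

The most highly expressed gene comes first, so the position of a gene in the sentence encodes its expression rank within that cell.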

Stable Setup

Install cell2sentence from PyPI with:

pip install cell2sentence

Convert AnnData Object to Cell Sentences

After your data is loaded into a standard AnnData adata object, you may create a cell2sentence object with:

import cell2sentence as cs

csdata = cs.transforms.csdata_from_adata(adata)

and generate a list of cell sentences with:

sentences = csdata.create_sentence_lists()

A tutorial script is available at tutorials/pbmc3k_cell_sentences.py. It shows how to use pretrained word vectors to analyze the pbmc3k dataset, which Seurat and scanpy use in their guided clustering tutorials.

Training Models with Cell Sentences

The .create_sentence_lists() and .create_sentence_strings() methods can both be used to interface with a wide variety of tools. The exact transformations required will vary from tool to tool.
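As a rough illustration of the two representations, the sketch below uses hand-written stand-in data rather than real library output. It assumes that a sentence string is simply the gene tokens joined by single spaces; the delimiter actually used by .create_sentence_strings() is not confirmed here, so check the library before relying on it.

```python
from collections import Counter

# Hypothetical stand-in for the output of csdata.create_sentence_lists():
# one list of gene tokens per cell.
sentence_lists = [
    ["CD8B", "CD3E", "NKG7"],
    ["MS4A1", "CD79A", "CD3E"],
]

# Tools that expect raw text can be given one string per cell
# (assumed here to be space-delimited).
sentence_strings = [" ".join(tokens) for tokens in sentence_lists]

# Tools that expect pre-tokenized input can recover the tokens.
round_trip = [text.split() for text in sentence_strings]

# A simple vocabulary count: in how many positions each gene appears.
vocab = Counter(token for tokens in sentence_lists for token in tokens)
print(sentence_strings[0])
print(vocab.most_common(1))
```

Token lists suit libraries such as gensim that take pre-tokenized sentences, while strings suit tools that run their own tokenizer.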

gensim

As an example, some guidance on training a Word2Vec model in gensim is provided here. A tutorial from the gensim team is also available here.

For a quickstart, once you have a csdata object, you can run:

import gensim

sentences = csdata.create_sentence_lists()
model = gensim.models.Word2Vec(sentences=sentences,
                               vector_size=400,
                               window=5,
                               min_count=1,
                               workers=4)

The model can then be queried directly. For example, to find the 10 genes most similar to 'CD8B' in the embedding, run:

model.wv.most_similar('CD8B', topn=10)

For more details, consult the gensim documentation.
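To build intuition for what a similarity query returns, the sketch below ranks genes by cosine similarity, the measure gensim uses for most_similar, over a tiny hand-made embedding. The two-dimensional vectors are invented for illustration, and the helper names cosine and rank_similar are hypothetical; this is not gensim's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(vectors, query, topn=10):
    """Return up to topn (gene, similarity) pairs, most similar first."""
    scores = [
        (gene, cosine(vectors[query], vec))
        for gene, vec in vectors.items()
        if gene != query  # the query gene itself is excluded
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:topn]


# Made-up 2-d embeddings; real models use hundreds of dimensions.
toy_vectors = {
    "CD8B": [1.0, 0.0],
    "CD8A": [0.9, 0.1],
    "MS4A1": [0.0, 1.0],
}

ranked = rank_similar(toy_vectors, "CD8B", topn=2)
for gene, score in ranked:
    print(gene, round(score, 3))
```

Genes whose vectors point in nearly the same direction as the query score close to 1, while unrelated genes score near 0.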

Further Notes

As a note, the pretrained models stored in this repository are saved instances of gensim KeyedVectors.

If you train any models on your own data, please submit them as a pull request or through correspondence to rahul.dhodapkar {at} yale.edu so others can use them! If you prototype any new uses for cell sentences, please reach out so they can be included here.

Development Setup

Create a conda environment with Python 3 using Anaconda:

conda create -n cell2sentence python=3.8

Then activate the environment with:

conda activate cell2sentence

Finally, install the latest development version of cell2sentence by running

make install

which simply performs an editable install with pip (pip install -e).

Loading Data

All data used in the bioRxiv manuscript are publicly available, and details are outlined in the DATA.md file in this repository.