A highly scalable and accurate inference of gene expression and structure for single-cell transcriptomes using semi-supervised deep learning.


Keywords
deep-learning, semi-supervised-learning, imputation, transcriptome, single-cell
License
Apache-2.0
Install
pip install disc==1.1.6

Documentation

DISC

An accurate and scalable imputation algorithm based on semi-supervised deep learning for single-cell transcriptomes.

  • Free software: Apache License 2.0

Requirements

Installation

Install TensorFlow

If you have an NVIDIA GPU, first install a version of TensorFlow that supports it; DISC runs much faster on a GPU:

pip install "tensorflow-gpu>= 1.13.1,<2.0.0"

We typically use tensorflow-gpu==1.13.1.

Requirements for the GPU version of TensorFlow:

* Hardware
    * NVIDIA GPU card with CUDA Compute Capability 3.5 or higher.
* Software
    * NVIDIA GPU drivers - CUDA 10.0 requires 410.x or higher.
    * CUDA Toolkit - TensorFlow supports CUDA 10.0 (TensorFlow >= 1.13.0).
    * CUPTI ships with the CUDA Toolkit.
    * cuDNN SDK (>= 7.4.1)

See the TensorFlow GPU support guide for further information.
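Before installing DISC, it can help to confirm that the installed TensorFlow falls inside the range given above (>= 1.13.1, < 2.0.0). The following is a minimal sketch, not part of DISC: the helper names `parse_version`, `in_supported_range` and `report` are hypothetical, and the GPU probe relies on `tf.test.is_gpu_available()`, which is available in TensorFlow 1.x.

```python
def parse_version(text):
    """Turn a version string such as '1.13.1' into a tuple like (1, 13, 1).

    Non-numeric suffixes (for example 'rc0') are ignored.
    """
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def in_supported_range(version, low=(1, 13, 1), high=(2, 0, 0)):
    """Return True if low <= version < high, the range used in the pip command."""
    return low <= parse_version(version) < high


def report():
    """Describe the local TensorFlow install; imports TensorFlow lazily."""
    try:
        import tensorflow as tf
    except ImportError:
        return "TensorFlow is not installed"
    if not in_supported_range(tf.__version__):
        return "TensorFlow {} is outside >=1.13.1,<2.0.0".format(tf.__version__)
    # tf.test.is_gpu_available() exists in TensorFlow 1.x.
    gpu = tf.test.is_gpu_available()
    return "TensorFlow {} (GPU available: {})".format(tf.__version__, gpu)
```

Calling `print(report())` in the target environment gives a one-line summary; a `False` GPU flag usually points to a driver, CUDA or cuDNN mismatch with the requirements listed above.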

Install DISC with pip

To install with pip, run the following from a terminal:

pip install disc

Install DISC from GitHub

To clone the repository and install manually, run the following from a terminal:

git clone https://github.com/iyhaoo/DISC.git

cd DISC

python setup.py install

Usage

Quick Start
  1. Run DISC:

    disc \
    --dataset=matrix.loom \
    --out-dir=out_dir
    

    where matrix.loom is a loom-formatted raw count matrix with genes in rows and cells in columns, and out_dir is the path to the output directory.

  2. Results:

    • log.tsv: a TSV-formatted log file that records training states.
    • summary.pdf: a PDF file that visualizes the fitting line and the optimal point; it is updated in real time during the run.
    • summary.tsv: a TSV-formatted file containing the raw data behind the visualization.
    • result: a directory for imputation results, containing:
      • imputation.loom: a loom-formatted imputed matrix with genes in rows and cells in columns.
      • feature.loom: a loom-formatted, dimensionally reduced feature matrix derived by our method from the imputed matrix above, with features in rows and cells in columns.
      • running_info.hdf5: an HDF5-formatted file that stores basic information about the input dataset, such as library size and the genes used for modelling.
    • models: a directory for the models saved at every save interval.
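The Quick Start steps above can be scripted. The sketch below assembles the `disc` command from the two flags shown (`--dataset`, `--out-dir`) and checks which of the listed output files are present afterwards. The helper names `build_disc_command`, `run_disc` and `missing_outputs` are hypothetical and not part of DISC; the relative paths assume the layout described in the Results list.

```python
import subprocess
from pathlib import Path

# Outputs named in the Results list, relative to the output directory.
EXPECTED_OUTPUTS = (
    "log.tsv",
    "summary.pdf",
    "summary.tsv",
    "result/imputation.loom",
    "result/feature.loom",
    "result/running_info.hdf5",
    "models",
)


def build_disc_command(dataset, out_dir):
    """Return the Quick Start command as an argument list."""
    return [
        "disc",
        "--dataset={}".format(dataset),
        "--out-dir={}".format(out_dir),
    ]


def run_disc(dataset, out_dir):
    """Run DISC and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_disc_command(dataset, out_dir), check=True)


def missing_outputs(out_dir):
    """List the expected outputs that do not exist under out_dir."""
    root = Path(out_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).exists()]
```

For example, `run_disc("matrix.loom", "out_dir")` followed by `missing_outputs("out_dir")` returns an empty list once every file listed above has been written.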

Data availability

We provide loom-formatted original, raw, down-sampled (DS), imputed raw/DS RNA-seq data and FISH data.

  • MELANOMA :

    8,640 cells from the melanoma WM989 cell line were sequenced using Drop-seq, where 32,287 genes were detected (MELANOMA). In addition, an RNA FISH experiment across 7,000-88,000 cells from the same cell line was conducted, in which 26 genes were detected (MELANOMA_FISH).

    The original, raw, DS (0.5), imputed raw/DS RNA-seq data and FISH data are provided here.

  • SSCORTEX :

    The somatosensory cortex of CD-1 mice at ages p28 and p29 was profiled by 10X, where 7,477 cells were detected (SSCORTEX). In addition, an osmFISH experiment on 4,839 cells from the somatosensory cortex, hippocampus and ventricle of a CD-1 mouse at age p22 was conducted, in which 33 genes were detected (SSCORTEX_FISH).

    The original, raw RNA-seq data and FISH data are provided here.

  • PBMC :

    2,700 freeze-thaw peripheral blood mononuclear cells (PBMC) from a healthy donor were profiled by 10X, where 32,738 genes were detected (PBMC).

    The original, raw, DS (0.3), imputed DS RNA-seq data are provided here.

  • CBMC :

    Cord blood mononuclear cells were profiled by CITE-seq, where 8,005 human cells were detected in total (CBMC).

    The original and raw RNA-seq data are provided here.

  • RETINA :

    Retinas of mice at age p14 were profiled in 7 replicates by Drop-seq, where 6,600, 9,000, 6,120, 7,650, 7,650, 8,280, and 4,000 (49,300 in total) STAMPs (single-cell transcriptomes attached to micro-particles) were collected, with 24,658 genes detected in total (RETINA).

    The raw RNA-seq data and the RDS-formatted cluster assignment data from the original study are provided here.

  • BRAIN_SPLiT :

    156,049 mouse nuclei from the developing brain and spinal cord of mice at age p2 or p11 were profiled by SPLiT-seq, where 26,894 genes were detected (BRAIN_SPLiT).

    The raw RNA-seq data and the RDS-formatted cluster assignment data from the original study are provided here.

  • BRAIN_1.3M :

    1,306,127 cells from the combined cortex, hippocampus, and subventricular zone of 2 E18 C57BL/6 mice were profiled by 10X, where 27,998 genes were detected (BRAIN_1.3M).
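For quick reference, the dataset sizes listed above can be collected into a small lookup table. This is only an illustrative sketch: the dictionary layout and the names `DATASETS`, `RETINA_REPLICATES` and `largest_dataset` are assumptions, the numbers are copied from the descriptions above, and `None` marks gene counts that the text does not state.

```python
# STAMPs per RETINA replicate, as listed in the dataset description.
RETINA_REPLICATES = (6600, 9000, 6120, 7650, 7650, 8280, 4000)

# "profiles" counts cells, nuclei or STAMPs depending on the dataset.
DATASETS = {
    "MELANOMA": {"profiles": 8640, "genes": 32287, "platform": "Drop-seq"},
    "SSCORTEX": {"profiles": 7477, "genes": None, "platform": "10X"},
    "PBMC": {"profiles": 2700, "genes": 32738, "platform": "10X"},
    "CBMC": {"profiles": 8005, "genes": None, "platform": "CITE-seq"},
    "RETINA": {
        "profiles": sum(RETINA_REPLICATES),
        "genes": 24658,
        "platform": "Drop-seq",
    },
    "BRAIN_SPLiT": {"profiles": 156049, "genes": 26894, "platform": "SPLiT-seq"},
    "BRAIN_1.3M": {"profiles": 1306127, "genes": 27998, "platform": "10X"},
}


def largest_dataset(datasets=DATASETS):
    """Return the name of the dataset with the most profiles."""
    return max(datasets, key=lambda name: datasets[name]["profiles"])
```

Summing `RETINA_REPLICATES` reproduces the 49,300 total quoted for RETINA, and `largest_dataset()` picks out BRAIN_1.3M, the dataset that motivates the scalability claim.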

Tutorials
  1. Data preparation and imputation
  2. Reproducing our results:
    • Gene expression structures recovery validated by FISH (MELANOMA)
    • Dropout event recovery (MELANOMA)
    • Cell type identification improvement (PBMC)
  3. Supplementary topics:
    • Use DISC compressed features for Seurat clustering (PBMC)
    • Violin plots of marker genes across cell types (script, PBMC, RETINA)

References

Yao He#, Hao Yuan#, Cheng Wu#, Zhi Xie*. "Reliable and efficient imputation and cell type identification for single-cell transcriptomes using a semi-supervised deep learning approach"

History

1.0.2 (2020-01-07)

  • Set default values to those used in the paper.

1.0.1 (2020-01-06)

  • Small bug fixes.

1.0.0 (2019-12-16)

  • First release on PyPI.