scpopcorn

PopCorn is a new method for the identification of sub-populations of cells present within individual single cell experiments and mapping of these sub-populations across the experiments.


License
GPL-3.0
Install
pip install scpopcorn==0.1.20

Documentation

scPopCorn

A Python tool for comparative analysis of multiple single-cell RNA-seq datasets.

1. Installation

$ pip install scpopcorn

2. Input scRNA-seq Data File Format

scPopCorn takes multiple single-cell RNA-seq datasets as input. Each dataset is an expression matrix with genes as rows and cells as columns, as shown below. Example data files can be found in the Data folder.

Cell1ID Cell2ID Cell3ID Cell4ID Cell5ID ...
Gene1 12 0 0 0 ...
Gene2 125 0 298 0 ...
Gene3 0 0 0 0 ...
... ... ... ... ... ...

Ground truth labels for the cells in each dataset can also be provided. The format is as follows:

Cell1ID Label1
Cell2ID Label2
Cell3ID Label3
Cell4ID Label4
... ..
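As a minimal sketch of the two layouts above, the following builds the text of a tiny expression matrix and a matching label file using only plain Python. The tab delimiter, the header line containing only cell IDs, the helper names, and the label assignments are assumptions for illustration; check the example files in the Data folder for the exact layout scPopCorn expects.

```python
def format_expression_matrix(cell_ids, gene_counts, sep="\t"):
    """Return matrix text: one header line of cell IDs, then one line per gene."""
    lines = [sep.join(cell_ids)]
    for gene, counts in gene_counts.items():
        if len(counts) != len(cell_ids):
            raise ValueError(
                f"{gene}: expected {len(cell_ids)} counts, got {len(counts)}"
            )
        lines.append(sep.join([gene] + [str(c) for c in counts]))
    return "\n".join(lines) + "\n"


def format_cell_labels(cell_ids, labels, sep="\t"):
    """Return label text: one 'cell ID, label' pair per line."""
    if len(cell_ids) != len(labels):
        raise ValueError("each cell needs exactly one label")
    return "".join(f"{cell}{sep}{label}\n" for cell, label in zip(cell_ids, labels))


# Hypothetical toy data mirroring the listings above.
cell_ids = ["Cell1ID", "Cell2ID", "Cell3ID", "Cell4ID"]
gene_counts = {
    "Gene1": [12, 0, 0, 0],
    "Gene2": [125, 0, 298, 0],
    "Gene3": [0, 0, 0, 0],
}
labels = ["Label1", "Label2", "Label3", "Label4"]

matrix_text = format_expression_matrix(cell_ids, gene_counts)
labels_text = format_cell_labels(cell_ids, labels)
```

Each string can then be saved to disk with an ordinary `open(path, "w")` call and passed to the reader functions shown in section 3.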

3. How to use

3.1 import scpopcorn package

from scpopcorn import MergeSingleCell
from scpopcorn import SingleCellData

3.2 read in RNA-seq datasets

File1 = "../Data/Human&Mouse_Pancreas/pancreas_human.expressionMatrix.txt"
Test1 = SingleCellData()
Test1.ReadData_SeuratFormat(File1)

File2 = "../Data/Human&Mouse_Pancreas/pancreas_mouse.expressionMatrix.txt"
Test2 = SingleCellData()
Test2.ReadData_SeuratFormat(File2)

3.3 read in ground truth cell labels (this is optional)

File1T = "../Data/Human&Mouse_Pancreas/pancreas_human.CellLabels.txt"
Test1.ReadTurth(File1T, 0, 1)

File2T = "../Data/Human&Mouse_Pancreas/pancreas_mouse.CellLabels.txt"
Test2.ReadTurth(File2T, 0, 1)

3.4 normalize counts data, find highly variable genes, and take the natural logarithm of one plus the counts

Test1.Normalized_per_Cell()
Test1.FindHVG()
Test1.Log1P()

Test2.Normalized_per_Cell()
Test2.FindHVG()
Test2.Log1P()
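For intuition about what the first and last of these steps usually mean in single-cell workflows, here is a minimal pure-Python sketch of per-cell normalization followed by a log1p transform. It is not scPopCorn's implementation: the function names, the scaling target of 10,000 counts per cell, and the cell-by-gene list layout are assumptions for illustration, and the highly-variable-gene selection is omitted.

```python
import math


def normalize_per_cell(counts, target_sum=10_000.0):
    """Scale each cell (row) so that its counts sum to target_sum.

    Cells with zero total counts are left as all zeros.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([value * scale for value in cell])
    return normalized


def log1p_transform(values):
    """Apply the natural logarithm of one plus x to every entry."""
    return [[math.log1p(v) for v in cell] for cell in values]


# Hypothetical toy matrix: rows are cells, columns are genes.
counts = [
    [12, 125, 0],
    [0, 0, 0],
    [0, 298, 2],
]

normalized = normalize_per_cell(counts)
logged = log1p_transform(normalized)
```

Normalizing before the log transform keeps cells with different sequencing depths comparable, and log1p maps zero counts to zero.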

3.5 combine datasets and set number of supercells for each dataset

NumSuperCell_Test1 = 50
NumSuperCell_Test2 = 50
MSingle = MergeSingleCell(Test1, Test2)
MSingle.MultiDefineSuperCell(NumSuperCell_Test1,NumSuperCell_Test2)

In this example, we define 50 supercells for each dataset. The number of supercells can be chosen as follows: if a dataset has N cells, choose the number of supercells M so that N/M is between 20 and 30.
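The rule of thumb above can be turned into a small helper that returns the range of acceptable supercell counts for a dataset. This is an illustrative sketch: the function name and its parameters are hypothetical and not part of the scpopcorn package.

```python
import math


def supercell_count_range(num_cells, min_ratio=20, max_ratio=30):
    """Return (smallest, largest) M with min_ratio <= num_cells / M <= max_ratio."""
    smallest = math.ceil(num_cells / max_ratio)
    largest = math.floor(num_cells / min_ratio)
    if largest < 1 or smallest > largest:
        raise ValueError(
            f"no supercell count puts {num_cells} cells in the "
            f"{min_ratio}-{max_ratio} cells-per-supercell range"
        )
    return smallest, largest


# Hypothetical dataset with 1250 cells.
low, high = supercell_count_range(1250)
```

For a dataset of 1250 cells, any M from `low` to `high` satisfies the rule, and 50 falls inside that range.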

3.6 compute co-membership graph within each dataset and similarity matrix across datasets

MSingle.ConstructWithinSimiarlityMat_SuperCellLevel()
MSingle.ConstructBetweenSimiarlityMat_SuperCellLevel()

3.7 run joint partition

Estimate_NumCluster = 10 # initial guess of the number of corresponding clusters; it does not need to be accurate
MSingle.SDP_NKcut(Estimate_NumCluster)

Estimate_NumCluster is an initial guess of the number of sub-populations you want to find; it only needs to be an approximation.

3.8 rounding the results

NumCluster_Min = 3 
NumCluster_Max = 20
CResult = MSingle.NKcut_Rounding(NumCluster_Min, NumCluster_Max)

scPopCorn screens the number of clusters from NumCluster_Min to NumCluster_Max and automatically selects the best value in [NumCluster_Min, NumCluster_Max].

3.9 evaluate clustering results using ground truth (this is optional)

MSingle.Evaluation(CResult)

3.10 similarity between cell subpopulations across datasets

MSingle.StatResult()

3.11 UMAP plots using the results generated by scPopCorn

MSingle.Umap_Result()

3.12 scPopCorn for sub-clusters

After seeing the UMAP plot, you may want to run a further joint partition on a sub-cluster. You can do so as follows:

ClusterID = 0
NumCluster = 3
MSingle.Deep_Partition(ClusterID, NumCluster) # deep partition for cluster 0 into 3 clusters
NumCluster_Min = 3
NumCluster_Max = 5
MSingle.SDP_Deep_Rounding(NumCluster_Min, NumCluster_Max) # find out best number of clusters for the deep partition
MSingle.Merge_Deep_Partition() # merge the new partitions to the original one
MSingle.Umap_Result() # see the new results

3.13 output the results

MSingle.OutputResult("TestOut.txt")

The results are written to the "TestOut.txt" file.

4. Examples and reproducible results

Jupyter notebooks of examples are provided in the Reproduce folder.