Metacell - Single-cell RNA Sequencing Analysis
The metacell package implements the metacell algorithm for single-cell RNA sequencing (scRNA-seq) data analysis.
The original metacell algorithm was implemented in R. In contrast, the Python package described here implements an improved <https://todo/> version, which uses improved internal algorithms and supports much larger data sets (millions of cells).
Metacell Analysis
Naively, scRNA_seq data is a set of cell profiles, where for each one, for each gene, we get a count of the mRNA molecules that existed in the cell for that gene. This serves as an indicator of how "expressed" or "active" the gene is.
As in any real world technology, the raw data may suffer from technical artifacts (counting the molecules of two cells in one profile, counting the molecules from a ruptured cells, counting only the molecules from the cell nucleus, etc.). This requires pruning the raw data to exclude such artifacts. TODO: Create view should consider these issues.
The current technology scRNA-seq data is also very sparse (typically <10%, sometimes even <5% of the RNA molecules are counted). This introduces large sampling variance on top of the original signal, which itself contains significant inherent biological noise.
Analyzing scRNA-seq data therefore requires processing the profiles in bulk. Classically, this has been done by directly clustering the cells using various methods.
In contrast, the metacell approach groups together profiles of the "same" biological state into groups with the minimal number of profiles (~100) needed for computing robust statistics (in particular, mean gene expression). Each such group is a single "metacell".
By summing profiles together, each metacell greatly reduces the sampling variance, and provides a more robust estimation of some transcription state. In particular, a metacell is not a cell type (multiple metacells may belong to the same type), and is not a parametric model of the cell state.
The metacells should therefore be further analyzed using additional methods to classify cell types, detect cell trajectories and/or lineage, build parametric models for cell behavior, etc. Using metacells as input for such analysis techniques should benefit both from the more robust, less noisy input; and from the (~100-fold) reduction in the number of profiles to analyze.
An obvious technique is to recursively group the metacells into larger groups, and investigate the resulting clustering hierarchy. Methods for further analysis of the data may choose to build upon this readily available hierarchy (which is also provided by this package).
Usage
Installation
Given Python version 3.6 and above, run pip install --user metacell
(or, sudo pip install
metacell
if you have sudo
privileges). This will install this package and two scripts:
metacell_flow
and metacell_call
.
The metacell package is built using the DynaMake package.
The metacell_flow
script is simply the dynamake
script, pre-loaded with the metacell build
steps, and metacell_call
is simply the dynacall
script, pre-loaded with the metacell
computation functions.
Importing
To use the package, create a directory somewhere with a large amount of available space (for ~2M
cells, metacell required up to 200GB of disk space). If using a servers cluster, on some shared file
system, and provide a DynaMake configuration file with an appropriate run_prefix
for
distributing jobs across the cluster; see the DynaMake documentation for details.
Create a raw
sub-directory and download the raw_data into it. Then, run metacell_import
to
import the gene and cell data into the fast-access format used by the metacell package. This is
unfortunately a slow process (in particular, the delaning step). Luckily, you only need to do this
once.
This will create the following directories:
genes
- Describes the genes used in the data.
cells
- A COLLECTION profiles directory containing all the imported cells.
Grouping
TODO: Running metacell_group
.
TODO: excluded_genes_names.txt
, forbidden_genes_names.txt
.
Zoning
TODO: Running metacell_zone
.
Analysis
TODO: Running metacell_dash
.
TODO: colors
directory.
Directory Structure
The provided automated workflow uses the following directory structure. The specifics of each directory type are provided further below.
ROOT
-
A directory which is typically named after the data source (
PBMC
,MOCA
,HCA
, etc.). Will contain:raw
-
A directory containing the raw data from the data source. Should contain one or more:
std
-
The first processing step will convert the
raw
data to the standard input format. The result will contain:name
- A directory in the standard (STD) format ready to be delaned.
Creating the standard format directory is pretty fast. It typically requires either a simple copy or applying something like
sed
to fix minor formatting issues. For HCA files, it is just a matter of dumping the data into a textual format. delaned
-
The next processing step will split the input into distinct lanes. The result will contain:
name
-
A directory for each of the standard format directories. This would be either a standard format directory, or will contain:
lane
- A directory in the standard format containing just the data for a single lane.
When data contains multiple lanes, delaning is a painfully slow process as it needs to split the matrix market file, processing one line at a time (without any parallelization). It would have been much easier if the raw data was already split into lanes.
genes
-
A directory containing fast access per-gene data (GENES) extracted from the
delaned
data.The extraction process first verifies this data is identical for all the
delaned
directories. Then one of these directories is used to actually extract the data. This is very fast, as there are only ~30,000 genes to deal with. cells
-
A directory containing fast access per-gene-per-profile (COLLECTION) data. Will contain:
name
-
A directory for each of the
delaned
directories. This would either be a simple profiles batch (BATCH), or a collection combining:lane
- A profiles batch directory containing the data for a single lane.
Conversion of the (delaned) textual data to the fast-access format isn't very fast, but this is offset by being able to run this conversion in parallel for each lane. It is therefore much less painful than delaning. Once conversion is done, the import process is complete and does not need to be repeated.
views
-
This is where the metacells package performs its analysis, recursively grouping the cells into metacells. Should contain:
name
-
Some view profiles container (VIEW) based on the full
cells
profiles collection, to group. Typically such a view will exclude profiles with too-low total UMIs count, or that suffer from other technical issues (e.g. doublets). Genes may also be excluded.The main function of the metacell package is to recursively compute the groups for such a view (that is, extend it to become a GROUPED directory).
Raw Data
The formats which can be used as the raw input for metacell processing:
- 10X
-
A directory containing 10X Genomics data. Should contain the files:
matrix.mtx
- Matrix market UMIs data.
barcodes.tsv
- The barcode of each profile (possibly with a
-lane
suffix). genes.tsv
- The ensembl ID and name of each gene.
- MOCA
-
A directory containing Mouse Organogenesis Cell Atlas data. Should contain the files:
gene_count.txt
- Matrix market UMIs data.
cell_annotate.csv
- Per-cell data (see list of expected fields in
metacell/storage/moca_format.yaml
). gene_annotate.csv
- Per-gene data (see list of expected fields in
metacell/storage/moca_format.yaml
).
- HCA
- A file containing Human Cell Atlas in HDF format with a
.h5` suffix. Must contain a single top-level key with the sub-keys
barcodes
,indptr
,genes
,indices
anddata
.
These formats are converted to a standard input format:
- STD
-
A directory containing data in a standard metacell input format. Should contain the files:
umis.mtx
- Matrix market UMIs data.
profiles.csv
- Per-profile data.
genes.csv
- Per-gene data.
format.yaml
- A description of the fields of the CSV files.
Fast Access Data
Formats of data allowing fast access during the metacell processing.
- GENES
-
A directory containing per-gene data, accessed using the :py:class:`metacell.storage.genes.Genes` class. Should contain the files:
genes.yaml
- General meta-data describing this set of genes. Should contain the keys
organism
,genes_count
anduuid
, which must be equal to themd5sum
of thegenes.name.txt
file. This UUID uniquely identifies this set of genes. genes.*.txt
- Per-gene string data. These should include
genes.name.txt
containing the (sorted, all-upper-case) gene names, andgenes.ensembl.txt
containing the ensembl identifier of each gene. genes.*.npy
- Per-gene numeric data, if any.
- CONTAINER
-
Either a BATCH, COLLECTION, or VIEW directory, providing access to per-gene-per-profile data for some genes and profiles using the :py:class:`metacell.storage.profiles.ProfilesContainer` class.
Each container directory contains at least the file:
profiles.yaml
-
General meta-data describing this container of profiles. Should contain at least the keys:
organism
- The organism whose cells were sequenced (e.g.
human
ormouse
). genes_uuid
- The UUID of the genes per profile.
-
profiles_count
,genes_count
- The number of profiles, and number of genes per profiles.
profiles_kind
- The kind of profiles (
cells
,metacells
,clans
,tribes
,nations
,hordes
). uuid
- Should be equal to the
md5sum
of thematrix.umis.mmm
ormatrix.umis.npy
file. This UUID uniquely identifies this profiles batch.
metadata.yaml
- Additional arbitrary meta-data collected for this container profile over time.
Additional keys may be specified depending on the container type.
- BATCH
-
A directory containing per-gene-per-profile data for some closely related profiles (a single batch), accessed using the :py:class:`metacell.storage.profiles.ProfilesBatch` class. Barcodes can be safely assumed to be unique within each batch, but not between multiple batches. When investigating batch effects, this is the smallest group of profiles for which these effects can be assumed to be the same. Should contain the additional files:
-
gp.*.mmm
,gp.*.npy
-
Per-gene-per-profile data. The
mmm
format is a metacell-specific compressed sparse data format accessed using the :py:class:`metacell.matrices.MemoryMappedMatrix` class. Each batch should containgp.umis.mmm
orgp.umis.npy
.Additional per-gene-per-profile data may exist, for example:
gp.log_fold.npy
- The log-fold-factor of the gene's UMIs fraction (out of the profile's total), compared to the gene's median UMIs fraction (in all the profiles). This serves as an estimate of how strong a marker the gene is for this specific profile. This should not be used for single cells.
p.*.txt
-
Per-profile string data. Common data includes:
p.barcode.txt
- The unique (in this batch) per-profile barcode (without any lane suffix).
p.name.txt
- The unique (in this batch) per-profile string name, if any.
-
p.batch.txt
,p.lane.txt
- The name of the batch and the name of the lane (if any) per profile. These are the same for all profiles in the batch, but having these files is useful when collecting multiple batches to a single data set (see below).
p.*.npy
-
Per-profile numeric data. Common data includes:
p.total_umis.npy
- The sum of the genes UMIs for each profile.
p.max_log_fold.npy
- The maximal fold factor of a gene in this profile.
-
g.*.txt
,g.*.npy
-
Per-gene string and numeric data, specific for this profiles batch, if any. Common data includes:
g.median_fraction.npy
- The median fraction of the gene's UMIs out of the profile's UMIs. This is used as an estimate of the global gene expression. This should not be used for single cells.
g.min_rank_log_fold.npy
- The minimal rank of the gene's fold factor in some profile. That is, the lower the number, the stronger the marker the gene is for distinguishing between profiles. This is used to filter genes in heatmap visualizations. This should not be used for single cells.
-
- VIEW
-
A directory containing a possibly restricted view of another profiles container accessed using the :py:class:`metacell.storage.profiles.ProfilesView` class. The
profiles.yaml
file should contain the additional keys:base_uuid
- The UUID of the base profiles container this is a view of.
base_path
- The path of the base profiles container this is a view of. If this is a relative path, it is relative to the view directory.
The directory may contain the additional file(s):
p.base_index.npy
- If exist, contain the sorted indices of the base container profiles which are included in this view. Otherwise, the view contains all the base container profiles.
g.base_index.npy
- If exist, contain the sorted indices of the base container genes which are included in this view. Otherwise, the view contains all the base container genes.
When accessing data of a view, if there exists a local file containing the data, it is used. Otherwise, if possible, the data of the base container is fetched and sliced to include only the relevant subset of profiles and/or genes. By default, data is assumed to be slicable. Non-slicable data is specified in :py:attr:`metacell.storage.profiles.ProfilesView.GENES_DEPENDENT_DATA` and :py:attr:`metacell.storage.profiles.ProfilesView.PROFILES_DEPENDENT_DATA`.
- COLLECTION
-
A directory combining a collection of other profiles container directories into a single unified profiles container, accessed using the :py:class:`metacell.storage.profiles.ProfilesCollection` class. The
profiles.yaml
file should contain the additional key:combined
-
A list of mapping, one per combined profile container directories. Should contain the keys:
name
- The name of a sub-directory of the collection directory, containing the profiles container which is combined into the overall collection.
uuid
- The UUID identifying the combined profiles container.
profiles_count
- The number of profiles in the combined profiles container.
For each listed
name
, a sub-directory with that name must exist, and contain a profiles container. When accessing data of a collection, if there exists a local file containing the data, it is used. Otherwise (except for per-gene data), the data is combined from the data of all the collected sub-containers.
Grouped Data
The primary function of the metacell package is to group profiles together, in particular, to group cell profiles into metacells, metacells into clans, etc.
When groups are computed, the following files are created:
p.group_index.npy
- The group index for each of the container's profiles. A negative value indicates the profile belongs to no group (is an "outlier").
grouped
- A profiles VIEW containing just the profiles with a non-negative group index, that is, only the non-outlier profiles.
outliers
- A profiles VIEW containing just the outlier profiles (which are not included in any
of the groups). Typically this would only hold
p.base_index.npy
in addition to theprofiles.yaml
. groups
- A GROUPS profiles container for the computed groups.
zones
- A hierarchy of ZONE (view) containers based on the computed groups.
tmp
-
Temporary files used to compute the groupings, but need not be used once it has been computed. The contents of this directory depends on whether we are grouping a few profiles or many profiles.
If grouping a small number of profiles:
selected
- A view of the profiles container, including only the selected ("feature") genes.
If grouping a large number of profiles, we can't directly compute the groups, since the basic algorithm has a complexity of O(N^2). Computation therefore must works through a divide-and-conquer algorithm. This requires different temporary files:
phase-1
-
A directory containing the second phase's files.
pile-index
- A profiles view of a random subset of a restricted number of to-be-grouped profiles.
To help in sorting file names, the indices are zero-padded so all have the same
width (e.g.
00, .., 10, 11, ...
). Each such pile is independently (directly) grouped. outliers-pile
-
This directory is a profiles view containing all profiles that were marked as an outlier in some indexed pile. It is also independently grouped. Unlike the indexed piles, this view may contain a large number of profiles so grouping it may require an intermediate step of its own.
Rare "types" do not have enough profiles in each of the piles to form groups, so will be marked as outliers. In this outliers-only view, there might be enough of these rare profiles to form a proper group.
When grouping the profiles in the final first phase outliers pile (small enough to be directly grouped), we disable outliers detection. That is, we insist that each of the profiles be assigned to some group by the end of the first phase, as these groups are used to compute the piles for the second phase and we'd like to give each profile a chance to be grouped in this second phase.
grouped
- A view of the grouped profiles container, with
p.group_index.npy
containing the combined group indices from all the phase-1 piles. groups
-
The intermediate grouping combines the groups from all the phase-1 piles, as well as the groups from the phase-1 outliers. This is the less-accurate grouping we will improve upon. To allow doing so, we compute the groups-of-groups (that is, we compute
p.group_index.npy
here), and use these to construct the second phase piles.Similarly to grouping the outliers pile, we disable outlier detection when grouping the final pile, so that each group will be assigned to one of the second phase piles.
phase-2
-
A directory containing the second phase's files. This is similar to the first phase's structure, with the following differences:
- The profiles contained in each pile are not chosen randomly. Instead, they are the profiles contained in one of the computed intermediate groups-of-groups.
- The outliers pile contains the phase-1 outliers in addition to the outliers of the phase-2 indexed piles. This allows very rare profile "types" a second chance at being grouped.
- There are no
grouped
orgroups
directories; the results are written directly to the original (parent) grouped profiles container. These final groups are recursively grouped as long as there is a large number of profiles, as described above.
- GROUPS
-
A profiles container describing groups of profiles. The
gp.umis.npy
per-profile-per-gene UMIs count in this container will contain the sum of the per-profile-per-gene of each of the grouped profiles contained in each (group) profile.A groups profiles container has identical format to a normal profiles batch directory, except that its
profiles.yaml
file contains agrouped_profiles
key with the (relative) path to the grouped profiles container (typically..
), and agrouped_uuid
key with the UUID of the grouped profiles container. For convenience,mean_grouped_count
andmean_cells_count
provide an estimate of the number of cells and profiles in each group.The
profiles_kind
of the groups container should be based on the grouped container profiles kind. In particular, groups ofcells
are calledmetacells
. Such groups are expected to contain cells of the "same" state.When grouping a large number of profiles, the
groups
directory might itself be grouped. This recursion continues as long as the total number of profiles is "large". Groups ofmetacells
are calledclans
, groups ofclans
are calledtribes
, groups oftribes
are callednations
, and groups ofnations
are calledhordes
. All higher-level groups contain profiles with a "similar", but not necessarily the "same" state.Finally, a groups container will hold the files
p.grouped_count.npy
with the number of profiles summed into each group, andp.cells_count.npy
with the number of cells summed into each group. These two numbers are the same for metacells, but differ for higher level groups.
- ZONE
-
A tree directory is used to provide convenient views for exploring the hierarchy created by the computed groups. In each level of the tree, in addition to the
groups
andgrouped
containers, there is also azones
directory that contains views that group the overall profiles into groups-of-groups.The directory structure for a small amount of cells would contain:
zones groups # All the clans (groups of metacells) ... grouped # All the metacells (grouped into clans). ... clan-0 # The 1st clan groups # All the metacells in the 1st clan ... grouped # All the cells in the 1st clan ... clan-1 # The 2nd clan groups # All the metacells in the 2nd clan ... grouped # All the cells in the 2nd clan ... ... ...
And a directory structure for a larger number (several hundred of thousands) of cells would contain:
zones groups # All the tribess (groups of clans) ... grouped # All the clans (grouped into tribes). ... tribe-0 # The 1st tribe groups # All the clans in the 1st tribe ... grouped # All the metacells in the 1st tribe ... clan-000 # The 1st clan in the 1st tribe groups # All the metacells in the 1st clan ... grouped # All the cells in the 1st clan ... clan-001 # The 2nd clan in the 1st tribe ... ... tribe-1 # The 2nd tribe groups # All the clans in the 2nd tribe ... grouped # All the metacells in the 2nd tribe ... clan # The 1st clan in the 2nd tribe groups # All the metacells in the 1st clan ... grouped # All the cells in the 1st clan ... clan-091 # The 2nd clan in the 2nd tribe ... ... ...
A much larger number (tens of millions) of cells would include an additional hierarchy level for each
nation
. Even larges number of cells would (up to a few billion) would require an additional level for eachhorde
. Currently the largest data sets are at most of a few million so this scheme should be sufficient for the foreseeable future.In all cases, the
grouped
directory contains ap.group_index.npy
which specifies the index of the group in the adjacentgroups
directory.