This repository provides a template on standardized scRNA-seq analysis using Python and the Scanpy library. All parameters of the workflow are controlled with a single config file.
We strongly recommend to install MORESCA into a virtual environment. Here, we use Conda:
conda create -n <env_name> python=3.12
conda activate <env_name>
Then, simply install MORESCA with pip:
pip install moresca
Important
If you want to use Python 3.13 on MacOS, make sure to use GCC>=16. This is required for compiling scikit-misc. See this discussion for advice.
Flag | Type | Description | Default |
---|---|---|---|
-d, --data | Path | Path to the h5ad file. | data/adata_raw.h5ad |
-o, --output | Path | Path to the output folder for the processed data. | results |
-p, --parameters | Path | Path to the config file. | config.gin |
-v, --verbose | Flag | Generate verbose logs. |
By default, moresca
expects the data in H5AD
format to be in data
. The output
directory as well as figure folders specified in the config are generated on the fly if they don't exist yet.
Currently, the script will perform the most common operations from doublet removal to DEG analysis of found clusters. If you want to apply ambient RNA correction beforehand, you need to run this separately.
The following example executes the pipeline with the h5ad file example_data.h5ad
and the parameter file config.gin
, saving the output in the folder results
.
moresca -d example_data.h5ad -o results -p config.gin
By default, the used parameter file looks like this:
# config.gin
quality_control:
apply = True
doublet_removal = True
outlier_removal = True
min_genes = 200
min_counts = None
max_counts = None
min_cells = 10
max_genes = None
mt_threshold = 15
rb_threshold = 10
hb_threshold = 1
figures = "figures/"
pre_qc_plots = True
post_qc_plots = True
normalization:
apply = True
method = "log1pPF"
remove_mt = False
remove_rb = False
remove_hb = False
feature_selection:
apply = True
species = "hsapiens"
method = "seurat_v3"
number_features = 2000
scaling:
apply = True
max_value = None
pca:
apply = True
n_comps = 100
use_highly_variable = True
batch_effect_correction:
apply = False
method = "harmony"
batch_key = None
neighborhood_graph:
apply = True
n_neighbors = 30
n_pcs = None
metric = "cosine"
clustering:
apply = True
method = "leiden"
resolution = 1.0
diff_gene_exp:
apply = True
method = "wilcoxon"
groupby = "leiden_r1.0"
use_raw = False
layer = "unscaled"
tables = None
umap:
apply = True
plotting:
apply = True
umap = True
path = "figures/"
The following values of the parameters are currently possible
Note
The parameter species
is only used for the feature selection based on anti-correlation. For available values check here. It defaults to hsapiens
.
Parameter | Values | Description |
---|---|---|
quality_control | ||
apply | bool | Whether to apply the quality control steps or not. |
doublet_removal | bool | Whether to perform doublet removal or not. |
outlier_removal | bool | Whether to remove outliers or not. |
min_genes | int, float, bool, None | The minimum number of genes required for a cell to pass quality control. |
min_counts | int, float, bool, None | The minimum total counts required for a cell to pass quality control. |
max_counts | int, float, bool, None | The maximum total counts allowed for a cell to pass quality control. |
min_cells | int, float, bool, None | The minimum number of cells required for a gene to pass quality control. |
max_genes | int, float, str, bool, None | The maximum number of genes allowed for a cell to pass quality control. |
mt_threshold | int, float, str, bool, None | The threshold for the percentage of counts in mitochondrial genes (maximum). |
rb_threshold | int, float, str, bool, None | The threshold for the percentage of counts in ribosomal genes (minimum). |
hb_threshold | int, float, str, bool, None | The threshold for the percentage of counts in hemoglobin genes (maximum). |
figures | str, Path, None | The path to the output directory for the quality control plots. |
pre_qc_plots | bool, None | Whether to generate plots of QC covariates before quality control or not. |
post_qc_plots | bool, None | Whether to generate plots of QC covariates after quality control or not. |
normalization | ||
apply | bool | Whether to apply the normalization steps or not. |
method | log1pCP10k, log1pPF, PFlog1pPF, analytical_pearson, None, False | The normalization method to use. |
remove_mt | bool, None | Whether to remove mitochondrial genes or not. |
remove_rb | bool, None | Whether to remove ribosomal genes or not. |
remove_hb | bool, None | Whether to remove hemoglobin genes or not. |
feature_selection | ||
apply | bool | Whether to apply the feature selection steps or not. |
method | seurat, seurat_v3, analytical_pearson, anti_correlation, triku, hotspot, None, False | The feature selection method to use. |
species | str | Species of the data. Only used if feature_selection=anti_correlation |
number_features | int, None | The number of top features to select (only applicable for certain methods). |
scaling | ||
apply | bool | Whether to apply the scaling step or not. |
max_value | int, float, None | The maximum value to which the data will be scaled. If None, the data will be scaled to unit variance. |
pca | ||
apply | bool | Whether to apply the PCA or not. |
n_comps | int, float | The number of principal components to compute. A float is interpreted as the proportion of the total variance to retain. |
use_highly_variable | bool | Whether to use highly variable genes for PCA computation. |
batch_effect_correction | ||
apply | bool | Whether to apply the batch effect correction or not. |
method | harmony, None, False | The batch effect correction method to use. |
batch_key | str | The key in adata.obs that identifies the batches. |
neighborhood_graph | ||
apply | bool | Whether to compute the neighborhood graph or not. |
n_neighbors | int | The number of neighbors to consider for each cell. |
n_pcs | int, None | The number of principal components to use for the computation. |
metric | str | The distance metric to use for computing the neighborhood graph. |
clustering | ||
apply | bool | Whether to perform clustering or not. |
method | leiden, phenograph, None, False | The clustering method to use. |
resolution | float, int, list, tuple, auto | The resolution parameter for the clustering method. Can be a single value or a list of values. |
diff_gene_exp | ||
apply | bool | Whether to perform differential gene expression analysis or not. |
method | wilcoxon, t-test, logreg, t-test_overestim_var | The differential gene expression analysis method to use. |
groupby | str | The key in adata.obs that identifies the groups for comparison. |
use_raw | bool, None | Whether to use the raw gene expression data or not. |
layer | str, None | The layer in adata.layers to use for the differential gene expression analysis. |
corr_method | benjamini-hochberg, bonferroni | The method to use for multiple testing correction. |
tables | str, Path, None | The path to the output directory for the differential expression tables. |
umap | ||
apply | bool | Whether to run UMAP or not. |
plotting | ||
apply | bool | Whether to create plots or not. |
umap | bool | Whether to plot the UMAP or not. |
path | str, Path | The path to the output directory for the plots. |
For contribution purposes, you should clone MORESCA from GitHub and install it in dev mode:
git clone git@github.com:claassenlab/MORESCA.git
cd MORESCA
pip install -e ".[dev]"
pre-commit install
This additionally installs ruff
and pytest
, which we use for formatting and code style control. Please run these before you commit new code.
Additionally, it will set up a pre-commit hook to run ruff
.