scPROTEIN

A Versatile Deep Graph Contrastive Learning Framework for Single-cell Proteomics Embedding

scPROTEIN (single-cell PROTeomics EmbeddINg) is a deep contrastive learning framework for single-cell proteomics embedding.

The advance of single-cell proteomics sequencing technology sheds light on protein-protein interactions, posttranslational modifications, and proteoform dynamics of proteins in a cell. However, uncertainty in peptide quantification, data missingness, batch effects and high noise hinder the analysis of single-cell proteomic data. It is important to solve this set of tangled problems together, but the existing methods tailored for single-cell transcriptomes cannot fully address this task. Here, we propose a novel versatile framework for single-cell proteomics data analysis called scPROTEIN, which consists of peptide uncertainty estimation based on a multi-task heteroscedastic regression model and cell embedding generation based on graph contrastive learning. scPROTEIN can estimate the uncertainty of peptide quantification, denoise protein data, remove batch effects and encode single-cell proteomic-specific embeddings in a unified framework. We demonstrate that scPROTEIN is efficient for cell clustering, batch correction, cell type annotation, clinical analysis, and spatially resolved proteomic data exploration.

For more information, please refer to https://www.biorxiv.org/content/10.1101/2022.12.14.520366v1

Dependencies

Input single-cell proteomic data format

A CSV file in the following format is needed for scPROTEIN to learn from stage 1:

Protein	Peptide	Cell 0	Cell 1	Cell 2	Cell 3	Cell 4	Cell 5	Cell 6
P08865	LLVVTDPR_2	0.215943903	1.825849332	0.17106779	0.090752671	0.633329732	-0.044091136	NA
P26447	RTDEAAFQK_3	1.873431237	1.425136257	2.354956659	1.373487482	1.724188343	0.828024968	0.511722654
P26447	LNKSELK_3	NA	NA	NA	NA	NA	-0.164518259	-0.765802428
Q00610	LLYNNVSNFGR_2	-0.452033525	NA	NA	-0.211513228	-0.573607252	-0.593867542	NA
P05120	LNGLYPFR_2	NA	NA	0.245379509	0.923845132	0.300612918	NA	NA

"Protein" is the protein name and "Peptide" is the corresponding constituent peptide sequence. The columns "Cell 0", "Cell 1", ... hold the quantification values for each cell. NA marks a missing value. If a dataset is provided directly at the protein level (without the "Peptide" column), scPROTEIN can start from stage 2.

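As a minimal sketch of how this layout can be read, the snippet below parses a tab-separated excerpt of the table above with the Python standard library, maps NA to a missing value, and counts the missing cells per peptide. The helper name load_peptide_table and the tab delimiter are assumptions made for illustration; they are not part of the scPROTEIN package, and a real CSV file would likely be comma-separated.

```python
import csv
import io

# Excerpt of the example table above (tab-separated here for readability).
EXAMPLE = (
    "Protein\tPeptide\tCell 0\tCell 1\tCell 2\n"
    "P08865\tLLVVTDPR_2\t0.215943903\t1.825849332\t0.17106779\n"
    "P26447\tLNKSELK_3\tNA\tNA\tNA\n"
    "P05120\tLNGLYPFR_2\tNA\tNA\t0.245379509\n"
)


def load_peptide_table(text, delimiter="\t"):
    """Hypothetical helper: parse peptide-level rows, mapping NA to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for record in reader:
        values = {
            name: (None if value == "NA" else float(value))
            for name, value in record.items()
            if name not in ("Protein", "Peptide")
        }
        rows.append((record["Protein"], record["Peptide"], values))
    return rows


table = load_peptide_table(EXAMPLE)
# Number of missing cells for each peptide.
missing = {peptide: sum(v is None for v in values.values())
           for _, peptide, values in table}
# A table without a "Peptide" column is already protein-level (stage 2 input).
has_peptide_column = "Peptide" in EXAMPLE.splitlines()[0].split("\t")
```

The has_peptide_column flag mirrors the rule stated above: peptide-level input goes through stage 1, while protein-level input can start from stage 2.
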
Documentation

Documentation describing the functions of scPROTEIN is provided.

Usage

The recommended usage procedure is as follows.

1. Installation

The running environment of scPROTEIN can be installed from the Docker Hub repository:

Pull the Docker image from Docker Hub

docker pull nkuweili/scprotein:latest

Run the Docker image (GPU is needed)

docker run --name scprotein --gpus all -it --rm nkuweili/scprotein:latest /bin/bash

Download this repository (This usually takes 15 seconds on a normal desktop computer)

# If you encounter a proxy issue when running "git clone" in the Docker environment on your own device, you can first execute the following command before running "git clone":
# git config --global http.proxy ""

git clone https://github.com/TencentAILabHealthcare/scPROTEIN.git
cd scPROTEIN/

After downloading this repository, all the single-cell proteomics datasets used in our study will also be included. We provide these datasets in both .csv and .h5ad formats.

2. Setup of the scPROTEIN Python package

The scPROTEIN package is hosted on PyPI and can be installed via pip.

pip install scprotein

If for some reason this doesn't work on your device, you can also directly install scPROTEIN with the provided .whl file.

pip install docs/scprotein-0.1.1-py3-none-any.whl

You can check whether the scPROTEIN package has been successfully installed via the following command:

python3 -c "import scprotein"
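
The same check can be done from inside a Python script without raising an ImportError. The following is a small sketch using the standard library; the helper name is_installed is an assumption for illustration, not a function shipped with scPROTEIN.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    status = "found" if is_installed("scprotein") else "missing"
    print(f"scprotein: {status}")
```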

3. For datasets provided with a raw peptide-level profile, scPROTEIN starts from stage 1 to learn the peptide uncertainty and obtain the protein-level abundance in an uncertainty-guided manner.

python3 train_stage1.py

After stage 1, the estimated peptide uncertainty array will be saved in the folder './scPROTEIN'.

4. Run stage 2 to obtain the learned cell embeddings.

python3 train_stage2.py --stage1 True

For datasets provided directly with the reconstructed protein-level profile, scPROTEIN will start from stage 2.

python3 train_stage2.py

After stage 2, the learned cell embedding will be saved in folder './scPROTEIN/'.
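
The choice between the two entry points in steps 3 and 4 can be summarized in a short sketch. The function name training_commands is hypothetical; the script names and the --stage1 flag are the ones listed above, and the snippet only builds the command lists rather than running them.

```python
def training_commands(has_peptide_level_data):
    """Return the commands listed in this README for the given input type."""
    if has_peptide_level_data:
        # Stage 1 estimates peptide uncertainty; stage 2 then uses its output.
        return [
            ["python3", "train_stage1.py"],
            ["python3", "train_stage2.py", "--stage1", "True"],
        ]
    # Protein-level input skips stage 1 entirely.
    return [["python3", "train_stage2.py"]]


for command in training_commands(has_peptide_level_data=True):
    print(" ".join(command))
```

Each returned list could be passed to subprocess.run from the repository root if you prefer to drive the pipeline from Python.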

For data integration analysis, you can first use the function integrate_sc_proteomic_features to load the datasets. The subsequent process of running scPROTEIN to learn cell embeddings is similar. You can refer to the tutorials in data_integration for more details.

5. Evaluate the learned cell embeddings.

python3 visualization.py

Expected output

After running visualization.py, a t-SNE plot showing the clustering result will be saved in the folder './scPROTEIN/', and a corresponding evaluation metric table will be displayed.

Use trained scPROTEIN model for evaluation

To load the trained checkpoints of scPROTEIN stage 1 and stage 2 on the SCoPE2_Specht dataset and generate the peptide uncertainty and cell embedding, respectively:

python3 train_stage1.py --use_trained_scPROTEIN True
python3 train_stage2.py --stage1 True --use_trained_scPROTEIN True

Tutorial

The following notebooks show how to run the scPROTEIN model:

tutorial_scPROTEIN_stage1 gives a detailed description of uncertainty estimation in scPROTEIN stage 1.
tutorial_scPROTEIN_stage2 provides an example of using protein-level data from stage 1 to learn cell embeddings in stage 2.
data_integration shows the process of data integration and batch correction across various MS acquisitions.
downstream_application presents the analyses of clinical proteomic data, spatial proteomic data and cell cycle.

Hyperparameters

Hyperparameters for stage 1:

Hyperparameter	Description	Default
batch_size	Batch size	256
kernel_nums	Kernel num of each conv block	[300,200,100]
kernel_size	Kernel size of each conv block	[2,2,2]
max_pool_size	Max pooling size	1
conv_layers	Nums of conv layers	3
hidden_dim	Hidden dim for fc layer	3000
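
For reference, the stage 1 defaults above can be collected in a small container with a sanity check. This is only a sketch: the class name Stage1Defaults is hypothetical, and the check assumes that each conv block corresponds to one conv layer, which the table suggests but does not state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stage1Defaults:
    """Hypothetical container mirroring the stage 1 table above."""
    batch_size: int = 256
    kernel_nums: List[int] = field(default_factory=lambda: [300, 200, 100])
    kernel_size: List[int] = field(default_factory=lambda: [2, 2, 2])
    max_pool_size: int = 1
    conv_layers: int = 3
    hidden_dim: int = 3000

    def is_consistent(self):
        """Assumed rule: one kernel count and one kernel size per conv layer."""
        return (len(self.kernel_nums) == self.conv_layers
                and len(self.kernel_size) == self.conv_layers)


defaults = Stage1Defaults()
```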

Hyperparameters for stage 2:

Hyperparameter	Description	Default
stage1	If scPROTEIN starts from stage1	False
num_hidden	Hidden dimension	400
num_proj_hidden	Dimension of projection head	256
num_layers	Number of GCN layers	2
num_protos	Number of prototypes	2
num_changed_edges	Number of added/removed edges	10
drop_edge_rate_1	Dropedge rate for view1	0.2
drop_edge_rate_2	Dropedge rate for view2	0.4
drop_feature_rate_1	Mask_feature rate for view1	0.4
drop_feature_rate_2	Mask_feature rate for view2	0.2
alpha	Balance factor	0.05
tau	Temperature coefficient	0.4
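
To illustrate what the drop-edge and mask-feature rates control, the sketch below builds two augmented views of a toy graph in plain Python. It is a generic illustration of these two augmentations, not the scPROTEIN implementation: the function names, the toy graph, and the choice to mask whole feature columns are all assumptions.

```python
import random


def drop_edges(edges, drop_rate, rng):
    """Keep each edge independently with probability 1 - drop_rate."""
    return [edge for edge in edges if rng.random() >= drop_rate]


def mask_features(features, mask_rate, rng):
    """Zero out whole feature columns, each with probability mask_rate."""
    num_columns = len(features[0])
    keep = [rng.random() >= mask_rate for _ in range(num_columns)]
    return [[value if kept else 0.0 for value, kept in zip(row, keep)]
            for row in features]


rng = random.Random(0)
# Toy cell graph: 4 cells (nodes) with 3 protein features each.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
features = [[0.2, 1.8, 0.1], [1.9, 1.4, 2.3], [0.5, 0.3, 0.9], [0.1, 0.7, 0.4]]

# Two views built with the default rates from the table above.
view1 = (drop_edges(edges, 0.2, rng), mask_features(features, 0.4, rng))
view2 = (drop_edges(edges, 0.4, rng), mask_features(features, 0.2, rng))
```

Using different rates for the two views makes them differ from each other, which is what a contrastive objective then compares.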

Time cost

Taking the demo SCoPE2_Specht dataset (1490 cells, 3042 proteins) as an example, the typical running time on a "normal" desktop computer is about 40 minutes for stage 1 and about 10 minutes for stage 2.

Disclaimer

This tool is for research purposes only and is not approved for clinical use.

This is not an official Tencent product.

Questions

If you have any suggestions or ideas for scPROTEIN, or run into issues when using it, please don't hesitate to reach out to us. You can post an issue or reach us by email (nkuweili@mail.nankai.edu.cn, fionafyang@tencent.com).

Citation

Li, W., Yang, F., et al. A Versatile Deep Graph Contrastive Learning Framework for Single-cell Proteomics Embedding. https://www.biorxiv.org/content/10.1101/2022.12.14.520366v1
