profile-binr

PROFILE methodology for the binarisation and normalisation of RNA-seq data


Keywords
bioinformatics, computational-biology, pandas, python3, rna-seq-analysis, binarization, single-cell-rna-seq, normalisation, rpy2
License
BSD-1-Clause
Install
pip install profile-binr==0.1.1

Documentation

profile_binr

The PROFILE methodology for the binarisation and normalisation of RNA-seq data.

This is a Python interface to a set of normalisation and binarisation functions for RNA-seq data originally written in R.

This software package is based on the methodology developed by Beal, Jonas; Montagud, Arnau; Traynard, Pauline; Barillot, Emmanuel; and Calzone, Laurence of the Computational Systems Biology of Cancer team at Institut Curie (contact-sysbio@curie.fr). It generalises the original implementation, which is available as Rmarkdown notebooks at https://github.com/sysbio-curie/PROFILE, and offers a Python interface to it.

Installation

Using conda

The tool can be installed from the Conda package profile_binr in the colomoto channel. Note that some of its dependencies require the conda-forge channel.

conda install -c conda-forge colomoto::profile_binr

Using pip

Requirements

  • R (≥4.0)
  • R packages:
    • mclust
    • diptest
    • moments
    • magrittr
    • tidyr
    • dplyr
    • tibble
    • bigmemory
    • doSNOW
    • foreach
    • glue

With R and the packages above installed, install the Python package from PyPI:

pip install profile_binr
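
Before running the tool, it can help to confirm that the R packages listed above are actually visible to R. The following is a minimal sketch, not part of profile_binr: the helper names `build_r_check_expression` and `find_missing_r_packages` are hypothetical, and it assumes that the `Rscript` found on the PATH is the same R installation that rpy2 will use, which may not hold on every system.

```python
import shutil
import subprocess

# R packages listed in the requirements above.
REQUIRED_R_PACKAGES = [
    "mclust", "diptest", "moments", "magrittr", "tidyr", "dplyr",
    "tibble", "bigmemory", "doSNOW", "foreach", "glue",
]


def build_r_check_expression(packages):
    """Return an R expression that prints the packages R cannot find."""
    quoted = ", ".join(f'"{name}"' for name in packages)
    return (
        f"cat(setdiff(c({quoted}), rownames(installed.packages())), "
        'sep = "\\n")'
    )


def find_missing_r_packages(packages=REQUIRED_R_PACKAGES):
    """Hypothetical helper: list required R packages that are not installed.

    Returns None when Rscript is not on the PATH.
    """
    if shutil.which("Rscript") is None:
        return None
    result = subprocess.run(
        ["Rscript", "-e", build_r_check_expression(packages)],
        capture_output=True, text=True, check=True,
    )
    # One missing package name per output line; an empty list means all found.
    return [line for line in result.stdout.splitlines() if line.strip()]
```

Calling `find_missing_r_packages()` returns an empty list when every package is available, a list of names to install otherwise, or `None` if no `Rscript` executable was found.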

Usage

A minimal example:

from profile_binr import ProfileBin
import pandas as pd

# your data is assumed to contain observations (cells) as
# rows and genes as columns; index_col=0 assumes the cell
# identifier is the first column of the file
data = pd.read_csv("path/to/your/data.csv", index_col=0)
data.head()
            Clec1b     Kdm3a    Coro2b  8430408G22Rik  Clec9a      Phf6     Usp14  Tmem167b
cell_id
HSPC_025       0.0  4.891604  1.426148            0.0     0.0  2.599758  2.954035  6.357369
HSPC_031       0.0  6.877725  0.000000            0.0     0.0  2.423483  1.804914  0.000000
HSPC_037       0.0  0.000000  6.913384            0.0     0.0  2.051659  8.265465  0.000000
LT-HSC_001     0.0  0.000000  8.178374            0.0     0.0  6.419817  3.453502  2.579528
HSPC_001       0.0  0.000000  9.475577            0.0     0.0  7.733370  1.478900  0.000000
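
The example above shows cell identifiers as the index and one numeric column per gene. A quick check of that layout before creating the instance could look like the sketch below. The helper `check_expression_frame` is hypothetical, and the conditions it tests (unique identifiers, unique gene names, numeric columns) are reasonable assumptions rather than requirements documented by profile_binr. The small frame is typed in by hand from the first rows shown above.

```python
import pandas as pd

# First three rows and genes of the example above, typed in by hand.
data = pd.DataFrame(
    {
        "Clec1b": [0.0, 0.0, 0.0],
        "Kdm3a": [4.891604, 6.877725, 0.0],
        "Coro2b": [1.426148, 0.0, 6.913384],
    },
    index=pd.Index(["HSPC_025", "HSPC_031", "HSPC_037"], name="cell_id"),
)


def check_expression_frame(frame):
    """Hypothetical helper: return a list of layout problems (empty if none)."""
    problems = []
    if not frame.index.is_unique:
        problems.append("cell identifiers in the index are not unique")
    if not frame.columns.is_unique:
        problems.append("gene names in the columns are not unique")
    # Any non-numeric column usually means the identifier was read as data.
    non_numeric = frame.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    return problems


print(check_expression_frame(data))
```

If the cell identifier ends up as an ordinary column instead of the index, for instance after `data.reset_index()`, the helper reports it as a non-numeric column.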
# create the binarisation instance using the dataframe
# with the index containing the cell identifier
# and the columns being the gene names
probin = ProfileBin(data)

# compute the criteria used to binarise/normalise the data.
# This method uses a parallel implementation; you can specify
# the number of workers with an integer
probin.fit(8) # train using 8 workers

# Look at the computed criteria
probin.criteria.head(8)
                    Dip        BI    Kurtosis  DropOutRate    MeanNZ    DenPeak  Amplitude   Category
Clec1b         0.358107  1.635698   54.017736     0.876208  1.520978  -0.007249   8.852181    ZeroInf
Kdm3a          0.000000  2.407548   -0.784019     0.326087  3.847940   0.209239  10.126676    Bimodal
Coro2b         0.000000  2.320060    7.061604     0.658213  2.383819   0.004597   9.475577    ZeroInf
8430408G22Rik  0.684454  3.121069   21.729044     0.884058  2.983472   0.005663   9.067857    ZeroInf
Clec9a         1.000000  2.081717  140.089285     0.965580  2.280293  -0.009361   9.614233  Discarded
Phf6           0.000000  1.988667   -1.389024     0.035628  5.025501   2.017547  10.135226    Bimodal
Usp14          0.000000  2.208080   -1.224987     0.007850  6.109964   8.245570  11.088750    Bimodal
Tmem167b       0.000000  2.430813    0.093023     0.393720  3.448331   0.072982   9.486826    Bimodal
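
The criteria table assigns each gene a Category. Assuming `probin.criteria` is an ordinary pandas DataFrame (it supports `.head()` above), the categories can be summarised with standard pandas calls. The sketch below does not call profile_binr; it rebuilds two columns of the table above by hand so that it runs on its own.

```python
import pandas as pd

# DropOutRate and Category columns copied from the criteria table above.
criteria = pd.DataFrame(
    {
        "DropOutRate": [0.876208, 0.326087, 0.658213, 0.884058,
                        0.965580, 0.035628, 0.007850, 0.393720],
        "Category": ["ZeroInf", "Bimodal", "ZeroInf", "ZeroInf",
                     "Discarded", "Bimodal", "Bimodal", "Bimodal"],
    },
    index=["Clec1b", "Kdm3a", "Coro2b", "8430408G22Rik",
           "Clec9a", "Phf6", "Usp14", "Tmem167b"],
)

# Number of genes assigned to each category.
counts = criteria["Category"].value_counts()

# Gene names grouped by category, in table order.
genes_by_category = {
    category: group.index.tolist()
    for category, group in criteria.groupby("Category")
}

print(counts)
print(genes_by_category)
```

With a fitted instance, the same two expressions could be applied to `probin.criteria` directly.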
# get binarised data (alternatively .binarise()):
my_bin = probin.binarize()
my_bin.head()
            Clec1b  Kdm3a  Coro2b  8430408G22Rik  Clec9a  Phf6  Usp14  Tmem167b
HSPC_025       NaN    1.0     NaN            NaN     NaN   0.0    0.0       1.0
HSPC_031       NaN    1.0     NaN            NaN     NaN   0.0    0.0       0.0
HSPC_037       NaN    0.0     1.0            NaN     NaN   0.0    1.0       0.0
LT-HSC_001     NaN    0.0     1.0            NaN     NaN   1.0    0.0       0.0
HSPC_001       NaN    0.0     1.0            NaN     NaN   1.0    0.0       0.0
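
The binarised frame above contains NaN entries next to 0.0 and 1.0. The README does not say how these should be interpreted; the sketch below simply assumes they mark entries with no binary call and shows how to measure and filter them with pandas. The frame is copied by hand from the output above, and the variable names `undetermined_fraction` and `informative` are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Binarised values copied from the output above.
my_bin = pd.DataFrame(
    {
        "Clec1b": [nan, nan, nan, nan, nan],
        "Kdm3a": [1.0, 1.0, 0.0, 0.0, 0.0],
        "Coro2b": [nan, nan, 1.0, 1.0, 1.0],
        "8430408G22Rik": [nan, nan, nan, nan, nan],
        "Clec9a": [nan, nan, nan, nan, nan],
        "Phf6": [0.0, 0.0, 0.0, 1.0, 1.0],
        "Usp14": [0.0, 0.0, 1.0, 0.0, 0.0],
        "Tmem167b": [1.0, 0.0, 0.0, 0.0, 0.0],
    },
    index=["HSPC_025", "HSPC_031", "HSPC_037", "LT-HSC_001", "HSPC_001"],
)

# Fraction of cells without a binary call, per gene.
undetermined_fraction = my_bin.isna().mean()

# Keep only genes with at least one binary call in these cells.
informative = my_bin.dropna(axis="columns", how="all")

print(undetermined_fraction)
print(informative.columns.tolist())
```

Whether to drop such genes or keep the NaN entries depends on the downstream analysis; the filter here is only one option.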
# likewise for normalised data:
my_norm = probin.normalize()
my_norm.head()
            Clec1b         Kdm3a    Coro2b  8430408G22Rik  Clec9a      Phf6         Usp14      Tmem167b
HSPC_025       0.0  9.786196e-01  0.184102            0.0     NaN  0.000801  8.318176e-05  9.999970e-01
HSPC_031       0.0  9.999981e-01  0.000000            0.0     NaN  0.000462  8.084114e-07  6.874397e-11
HSPC_037       0.0  4.408417e-09  0.892449            0.0     NaN  0.000145  9.999940e-01  6.874397e-11
LT-HSC_001     0.0  4.408417e-09  1.000000            0.0     NaN  0.991865  6.230178e-04  1.599753e-04
HSPC_001       0.0  4.408417e-09  1.000000            0.0     NaN  0.999865  2.171153e-07  6.874397e-11

References

  • Béal J, Montagud A, Traynard P, Barillot E and Calzone L (2019) Personalization of Logical Models With Multi-Omics Data Allows Clinical Stratification of Patients. Front. Physiol. 9:1965. doi:10.3389/fphys.2018.01965