SCSilicon
SCSilicon is a tool for synthetic single-cell DNA sequencing data generation.
1. Pre-requirements
- python3
- numpy>=1.16.1
- pandas>=0.23.4,<0.24
- tasklogger>=0.4.0
- wget>=3.2
- seaborn>=0.11.1
- matplotlib>=3.0.2
- SCSsim
All Python packages are installed automatically with SCSilicon if they are not already present in your Python environment.
To install SCSsim, please refer to the README of SCSsim.
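To see which of the Python prerequisites listed above are already importable in the current environment, a small check like the following can help. This is only a sketch using the standard library: the helper name check_packages is hypothetical, it assumes each package's import name matches its name in the list, it does not compare versions, and it does not cover SCSsim.

```python
from importlib.util import find_spec

# Import names of the Python prerequisites listed above (assumed to match
# the package names; version constraints are not checked here).
PREREQUISITES = ["numpy", "pandas", "tasklogger", "wget", "seaborn", "matplotlib"]


def check_packages(names):
    """Return {name: True/False} depending on whether each module can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(PREREQUISITES).items():
        print(f"{name}: {'found' if found else 'missing'}")
```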
2. Installation
Installation with pip
To install with pip, run the following from a terminal:
pip install scsilicon
Installation from GitHub
To clone the repository and install manually, run the following from a terminal:
git clone https://github.com/xikanfeng2/SCSilicon.git
cd SCSilicon
python setup.py install
3. Quick start
The following code runs SCSilicon.
import scsilicon as scs
# create SCSiliconParams object
params = scs.SCSiliconParams()
# download all necessary reference files; run this only once, then remove this line after the first run
scs.download_ref_data(params)
# simulate snp samples
snp_simulator = scs.SNPSimulator()
snp_simulator.sim_samples(params)
# simulate snv samples
snv_simulator = scs.SNVSimulator()
snv_simulator.sim_samples(params)
# simulate indel samples
indel_simulator = scs.IndelSimulator()
indel_simulator.sim_samples(params)
# simulate cnv samples
cnv_simulator = scs.CNVSimulator()
cnv_simulator.sim_samples(params)
4. SCSiliconParams object
All the general parameters for the SCSilicon simulation are stored in a SCSiliconParams object. Let’s create a new one.
params = scs.SCSiliconParams()
4.1 All parameters in SCSiliconParams object
- out_dir: string, optional, default: './'. The output directory path.
- ref: string, optional, default: 'hg19'. The reference genome version: hg19 or hg38.
- chrom: string, optional, default: 'chr22'. The chromosome for reads generation: all or a specific chromosome.
- layout: string, optional, default: 'SE'. The read layout: PE or SE (PE for paired-end and SE for single-end).
- coverage: int, optional, default: 5. The sequencing coverage.
- isize: int, optional, default: 260. The mean insert size for paired-end sequencing.
- threads: int, optional, default: 1. The number of threads to use for reads generation.
- verbose: int or boolean, optional, default: 1. If True or > 0, print log messages.
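Before passing values to SCSiliconParams, it can be useful to check them against the constraints listed above. The following is a minimal sketch of such a check, not part of SCSilicon: the function validate_general_params is hypothetical, and the rule that a specific chromosome name starts with "chr" is an assumption based on the default value chr22.

```python
def validate_general_params(ref="hg19", chrom="chr22", layout="SE",
                            coverage=5, isize=260, threads=1):
    """Return a list of problems found in the given values (empty if none)."""
    problems = []
    if ref not in ("hg19", "hg38"):
        problems.append(f"ref must be 'hg19' or 'hg38', got {ref!r}")
    if layout not in ("PE", "SE"):
        problems.append(f"layout must be 'PE' or 'SE', got {layout!r}")
    # Assumption: specific chromosomes are named like the default 'chr22'.
    if chrom != "all" and not str(chrom).startswith("chr"):
        problems.append(f"chrom must be 'all' or start with 'chr', got {chrom!r}")
    for name, value in (("coverage", coverage), ("isize", isize), ("threads", threads)):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{name} must be a positive int, got {value!r}")
    return problems


if __name__ == "__main__":
    print(validate_general_params(ref="hg38", chrom="chr22"))  # no problems
    print(validate_general_params(ref="hg18", layout="XX"))    # two problems
```

If the returned list is empty, the same keyword arguments can then be passed on when creating the SCSiliconParams object.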
4.2 Getting and setting
If we want to look at the values of the parameters, we can extract them using the get_params function:
params.get_params()
# console log: {'out_dir': './', 'ref': 'hg19', 'chrom': 'chr20', 'layout': 'PE', 'coverage': 5, 'isize': 260, 'threads': 10}
Alternatively, to give a parameter a new value we can use the set_params function:
params.set_params(ref='hg38', chrom='chr22')
# console log: {'out_dir': './', 'ref': 'hg38', 'chrom': 'chr22', 'layout': 'PE', 'coverage': 5, 'isize': 260, 'threads': 10}
We can also set parameters directly when we create a new SCSiliconParams object:
params = scs.SCSiliconParams(ref='hg38', chrom='chr22')
5. Simulating reads for SNPs using SNPSimulator object
Once we have a set of parameters we are happy with, we can use SNPSimulator to simulate samples with SNPs in them.
snp_simulator = scs.SNPSimulator()
snp_simulator.sim_samples(params)
5.1 All parameters in SNPSimulator object
- cell_no: int, optional, default: 1. The cell number for this simulation.
- snp_no: int, optional, default: 1000. The SNP number of each sample.
For each sample, SNPSimulator randomly selects SNPs from the dbSNP file; the snp_no parameter controls how many are selected.
5.2 Getting and setting
Similar to SCSiliconParams, SNPSimulator uses the functions get_params and set_params to get or set parameters.
5.3 Generating FASTQ sample
The SNPSimulator object uses the function sim_samples to generate FASTQ files for each sample.
snp_simulator.sim_samples(params)
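To sanity-check a generated FASTQ file, you can count its reads and bases. The sketch below is not part of SCSilicon; the helper count_fastq_reads is hypothetical and assumes plain-text FASTQ with exactly four lines per record (no wrapped sequences, no compression).

```python
def count_fastq_reads(lines):
    """Count reads and bases from an iterable of FASTQ lines.

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    reads = 0
    bases = 0
    for line_no, line in enumerate(lines):
        if line_no % 4 == 1:  # second line of each record is the sequence
            reads += 1
            bases += len(line.strip())
    return reads, bases


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "IIII\n",
               "@read2\n", "ACGTAC\n", "+\n", "IIIIII\n"]
    print(count_fastq_reads(example))
```

For a file on disk, pass an open file handle instead of a list, for example the handle returned by open() on one of the generated .fq files.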
If you want to simulate multiple samples at once, you can use the cell_no parameter to control this.
snp_simulator.set_params(cell_no=10)
# or set the parameter when creating the object
snp_simulator = scs.SNPSimulator(cell_no=10)
# generating reads
snp_simulator.sim_samples(params)
The code above simulates 10 samples in FASTQ format in one run.
5.4 Output files of sim_samples function
The sim_samples function will generate two output files for each sample in your output directory.
- sample{1}-snps.txt: the SNPs included in this sample. This file can be regarded as the ground truth for SNP detection software.
- sample{1}.fq: the reads data of this sample in FASTQ format.
{1} is the sample number, e.g. sample1-snps.txt, sample2-snps.txt.
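When the ground-truth SNP file is used to evaluate an SNP detection tool, a common summary is precision and recall over the set of variant positions. The sketch below works on in-memory sets of (chromosome, position) tuples. The function precision_recall is hypothetical, and parsing sample{1}-snps.txt is left out because its column layout is not described here.

```python
def precision_recall(truth, called):
    """Compare called variants against ground truth.

    Both arguments are iterables of hashable variant keys,
    e.g. (chromosome, position) tuples.
    """
    truth, called = set(truth), set(called)
    true_positives = len(truth & called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = {("chr22", 100), ("chr22", 200), ("chr22", 300)}
    called = {("chr22", 100), ("chr22", 200), ("chr22", 5), ("chr22", 999)}
    precision, recall = precision_recall(truth, called)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```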
6. Simulating reads for CNVs using CNVSimulator object
We can use CNVSimulator to simulate samples with CNVs.
cnv_simulator = scs.CNVSimulator()
cnv_simulator.sim_samples(params)
6.1 All parameters in CNVSimulator object
- cell_no: int, optional, default: 1. The cell number for this simulation.
- bin_len: int, optional, default: 500000. The fixed bin length.
- seg_no: int, optional, default: 10. The segment number for each chromosome.
- cluster_no: int, optional, default: 1. The cell cluster number for multiple sample simulation.
- normal_frac: float, optional, default: 0.4. The fraction of normal cells.
- noise_frac: float, optional, default: 0.1. The noise fraction for the CNV matrix.
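To get a feel for how bin_len and normal_frac scale, the following back-of-the-envelope sketch computes how many fixed-length bins cover a chromosome and how many cells would be normal. Both helpers are hypothetical, and the rounding choices (a partial last bin counts as a bin, the normal cell count is rounded to the nearest integer) are assumptions that may differ from what SCSilicon does internally.

```python
import math


def bin_count(chrom_length, bin_len=500000):
    """Number of fixed-length bins needed to cover a chromosome.

    Assumption: a partial bin at the end of the chromosome counts as one bin.
    """
    return math.ceil(chrom_length / bin_len)


def normal_cell_count(cell_no, normal_frac=0.4):
    """Approximate number of normal cells among cell_no cells.

    Assumption: the product is rounded to the nearest integer.
    """
    return round(cell_no * normal_frac)


if __name__ == "__main__":
    chr22_hg19_length = 51304566  # length of chr22 in hg19, in base pairs
    print(bin_count(chr22_hg19_length))  # bins of 500000 bp
    print(normal_cell_count(10))         # normal cells among 10
```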
6.2 Getting and setting
Similar to SCSiliconParams, CNVSimulator uses the functions get_params and set_params to get or set parameters.
6.3 Generating FASTQ sample
The CNVSimulator object also uses the function sim_samples to generate FASTQ files for each sample.
cnv_simulator.sim_samples(params)
The seg_no parameter can be used to control the number of segments in each chromosome.
cnv_simulator.set_params(seg_no=8)
# or set the parameter when creating the object
cnv_simulator = scs.CNVSimulator(seg_no=8)
# generating reads
cnv_simulator.sim_samples(params)
The code above splits each chromosome into 8 segments, which is useful for segmentation experiments with single-cell CNV detection tools.
If you want to simulate multiple samples at once, you can use the cell_no parameter to control this.
cnv_simulator.set_params(cell_no=10)
# or set the parameter when creating the object
cnv_simulator = scs.CNVSimulator(cell_no=10)
# generating reads
cnv_simulator.sim_samples(params)
The code above simulates 10 samples in FASTQ format in one run.
For multiple-sample simulation, you can use the cluster_no parameter to separate these samples into several clusters.
cnv_simulator.set_params(cluster_no=5)
# or set the parameter when creating the object
cnv_simulator = scs.CNVSimulator(cluster_no=5)
# generating reads
cnv_simulator.sim_samples(params)
6.4 Output files of sim_samples function
The sim_samples function will generate the following output files in your output directory.
- cnv.csv: the CNV matrix with cells as rows and bins as columns. This file can be regarded as the ground truth for CNV detection software.
- segments.csv: the segment information for each chromosome. This file can be regarded as the ground truth for segmentation experiments.
- clusters.csv: the cluster information for each sample. This file can be regarded as the ground truth for cell cluster experiments.
- sample{1}.fq: the reads data of this sample in FASTQ format.
{1} is the sample number, e.g. sample1.fq, sample2.fq.
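When cnv.csv is used as ground truth, one simple way to score a CNV detection tool is the fraction of (cell, bin) entries on which the predicted matrix agrees with the true one. The sketch below works on in-memory matrices given as lists of rows. The function matrix_agreement is hypothetical, and loading cnv.csv is left out because its exact header and index layout are not described here.

```python
def matrix_agreement(truth, predicted):
    """Fraction of (cell, bin) entries where two copy-number matrices agree.

    Both arguments are lists of rows (cells), each a list of values (bins),
    and must have identical shapes.
    """
    if len(truth) != len(predicted):
        raise ValueError("matrices have different numbers of rows")
    total = 0
    matches = 0
    for truth_row, predicted_row in zip(truth, predicted):
        if len(truth_row) != len(predicted_row):
            raise ValueError("rows have different numbers of columns")
        for t, p in zip(truth_row, predicted_row):
            total += 1
            matches += int(t == p)
    if total == 0:
        raise ValueError("matrices are empty")
    return matches / total


if __name__ == "__main__":
    truth = [[2, 2, 3], [2, 1, 1]]
    predicted = [[2, 2, 2], [2, 1, 1]]
    print(f"{matrix_agreement(truth, predicted):.3f}")
```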
6.5 Visualizing the CNV matrix
The CNVSimulator object has the function visualize_cnv_matrix to draw a heatmap of the CNV matrix.
cnv_simulator.visualize_cnv_matrix(out_prefix)
This function saves the heatmap in PDF format to a file named out_prefix.pdf. One example of a CNV heatmap is shown below:
7. Simulating reads for SNVs using SNVSimulator object
Once we have a set of parameters we are happy with, we can use SNVSimulator to simulate samples with SNVs in them.
snv_simulator = scs.SNVSimulator()
snv_simulator.sim_samples(params)
7.1 All parameters in SNVSimulator object
- cell_no: int, optional, default: 1. The cell number for this simulation.
- snv_no: int, optional, default: 1000. The SNV number of each sample.
8. Simulating reads for Indels using IndelSimulator object
Once we have a set of parameters we are happy with, we can use IndelSimulator to simulate samples with Indels in them.
indel_simulator = scs.IndelSimulator()
indel_simulator.sim_samples(params)
8.1 All parameters in IndelSimulator object
- cell_no: int, optional, default: 1. The cell number for this simulation.
- in_no: int, optional, default: 1000. The insertion number of each sample.
- del_no: int, optional, default: 1000. The deletion number of each sample.
Cite us
todo
Help
If you have any questions or require assistance using SCSilicon, please contact us at fxk@nwpu.edu.cn.