BABACHI: Backgroud Allelic Dosage Bayesian Checkpoint Identification
BABACHI is a tool for Background Allelic Dosage (BAD) genomic regions calling from non-phased heterozygous SNVs. It is aimed at estimation of BAD on low-coverage sequencing data, where the precise estimation of allelic copy numbers is not possible.
BAD corresponds to the ratio of Major copy number to Minor copy number.
BABACHI takes in a vcf-like .tsv file with heterozygous SNVs sorted by genome positions (ascending). The input file must contain the following first 7 columns: chromosome, position, ID, reference base, alternative base, reference read count, alternative read count All lines, starting with # are ignored.
The output is a .bed file with BAD annotations.
BABABCHI package requires only a standard computer with enough RAM to support the in-memory operations.
The package can be installed on all major platforms (e.g. BSD, GNU/Linux, OS X, Windows) from Python Package Index (PyPI) and GitHub. The package has been tested on the following systems:
- Windows: Windows 10
- Linux: Ubuntu 18.04
BABACHI mainly depends on the following Python 3 packages:
docopt>=0.6.2 numpy>=1.18.0 schema>=0.7.2 contextlib2>=0.5.5 pandas>=1.0.4 matplotlib>=3.2.1 seaborn>=0.10.1
Install from PyPi
pip3 install babachi
Install from Github
git clone https://github.com/autosome-ru/BABACHI cd BABACHI python3 setup.py install
sudo, if required The package should take less than 1 minute to install.
python >= 3.6
To get full usage description one can execute:
This will produce the following message:
Usage: babachi <file> [-O <path> |--output <path>] [-q | --quiet] [--allele_reads_tr <int>] [--force-sort] [--visualize] [--boundary-penalty <float>] [--states <string>] babachi (--test) [-O <path> |--output <path>] [-q | --quiet] [--allele_reads_tr <int>] [--force-sort] [--visualize] [--boundary-penalty <float>] babachi visualize <file> (-b <badmap>| --badmap <badmap>) [-q | --quiet] [--allele_reads_tr <int>] babachi -h | --help Arguments: <file> Path to input file in tsv format with columns: chr pos ID ref_base alt_base ref_read_count alt_read_count. <badmap> Path to badmap .bed format file <int> Non negative integer <float> Non negative number <states_string> String of states separated with "," (to provide fraction use "/", e.g. 4/3). Each state must be >= 1 Options: -h, --help Show help. -q, --quiet Less log messages during work time. -b <badmap>, --badmap <badmap> Input badmap file -O <path>, --output <path> Output directory or file path. [default: ./] --allele_reads_tr <int> Allelic reads threshold. Input SNPs will be filtered by ref_read_count >= x and alt_read_count >= x. [default: 5] --force-sort Do chromosomes need to be sorted --visualize Perform visualization of SNP-wise AD and BAD for each chromosome. Will create a directory in output path for the .svg visualizations. --boundary-penalty <float> Boundary penalty coefficient [default: 9] --states <states_string> States string [default: 1,2,3,4,5,6,1.5,2.5] --test Run segmentation on test file
To perform a test run:
The test run takes approximately 2 minutes on a standard computer.
The result is a file named
test.bed that will be produced in the root directory of the project (if
-O option is not used).
The contents of the
test.bed file should start as follows:
#chr start end BAD Q1.00 Q1.33 Q1.50 Q2.00 Q2.50 Q3.00 Q4.00 Q5.00 Q6.00 SNP_count sum_cover chr1 1 125183196 2 -63.47825919524621 -24.598710473939718 -8.145646624117944 -2.000888343900442e-11 -30.773041699645546 -78.80480783186977 -189.88685134708248 -299.82657588596703 -401.6012195141575 1325 17280
Each row represents a single segment with a constant estimated BAD. The columns are as follows:
- #chr: chromosome
- start: segment start position
- end: segment end position
- BAD: estimated BAD
- QX: the logarithmic likelyhood of the segment to have BAD = X
- SNP_count: number of SNPs of the segment
- sum_cover: the total read coverage of all SNPs of the segment
The BABACHI tool is maintained by Sergey Abramov and Alexandr Boytsov.