cpdseqer

CPDseq data analysis


License
MIT
Install
pip install cpdseqer==0.2.0

Documentation

CPDSeqer

This package is used to do CPD sequence data analysis.

Prerequisites

Install tabix

apt-get install -y tabix

Installation

Install python main package

pip install git+git://github.com/shengqh/cpdseqer.git

Usage

Demultiplex fastq file

usage: cpdseqer demultiplex [-h] -i [INPUT] -o [OUTPUT] -b [BARCODEFILE]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input fastq gzipped file
  -o [OUTPUT], --output [OUTPUT]
                        Output folder
  -b [BARCODEFILE], --barcodeFile [BARCODEFILE]
                        Tab-delimited file, first column is barcode, second
                        column is sample name

for example:

cpdseqer demultiplex -i sample.fastq.gz -o . -b barcode.txt

Example of barcode.txt, columns are separated by tab

ATCGCGAT        Control
GAACTGAT        UV

Align reads to genome using bowtie2

bowtie2 -p 8  -x hg38/bowtie2_index_2.3.5.1/GRCh38.p12.genome -U Control.fastq.gz -S Control.sam 2> Control.log
  
samtools view -Shu -F 256 Control.sam | samtools sort -o Control.bam -T Control -

samtools index Control.bam

rm Control.sam

You can download hg38 bowtie2 index files:

mkdir hg38
cd hg38
wget https://cqsweb.app.vumc.org/download1/cpdseqer/hg38_bowtie2.tar.gz
tar -xzvf hg38_bowtie2.tar.gz
cd ..

Extract dinucleotide from bam file

usage: cpdseqer bam2dinucleotide [-h] -i [INPUT] -g [GENOME_SEQ_FILE]
                                 [-q [MAPPING_QUALITY]] [-u] -o [OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input BAM file
  -g [GENOME_SEQ_FILE], --genome_seq_file [GENOME_SEQ_FILE]
                        Input genome seq file
  -q [MAPPING_QUALITY], --mapping_quality [MAPPING_QUALITY]
                        Minimum mapping quality of read (default 20)
  -u, --unique_only     Use uniquely mapped read only
  -o [OUTPUT], --output [OUTPUT]
                        Output file name

for example:

cpdseqer bam2dinucleotide \
  -i Control.bam \
  -g hg38/bowtie2_index_2.3.4.1/GRCh38.p12.genome.fa \
  -o Control.dinucleotide.bed.bgz

You can download bam files:

wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.bam
wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.bam.bai
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.bam
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.bam.bai

Get statistic based on bed file

usage: cpdseqer statistic [-h] -i [INPUT] -c [COORDINATE_LIST_FILE] [-s]
                          [--category_index CATEGORY_INDEX] [--add_chr] -o
                          [OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input dinucleotide list file, first column is file
                        location, second column is file name
  -c [COORDINATE_LIST_FILE], --coordinate_list_file [COORDINATE_LIST_FILE]
                        Input coordinate list file, first column is file
                        location, second column is file name
  -s, --space           Use space rather than tab in coordinate files
  --category_index CATEGORY_INDEX
                        Zero-based category column index in coordinate file
  --add_chr             Add chr in chromosome name in coordinate file
  -o [OUTPUT], --output [OUTPUT]
                        Output file name

for example:

cpdseqer statistic --category_index 3 \
  -i cpd__fileList1.list \
  -c cpd__fileList2.list \
  -o cpd.dinucleotide.statistic.tsv

cpd__fileList1.list (separated by tab)

Control.dinucleotide.bed.bgz      Control
UV.dinucleotide.bed.bgz           UV

cpd__fileList2.list (separated by tab)

hg38_promoter.bed      Promoter
hg38_tf.bed            TFBinding

You can download example files as following scripts. The hg38_promoter.bed contains three columns only. So, Promoter (from cpd__fileList2.list definition) will be used as category name for all entries in the hg38_promoter.bed. The hg38_tf.bed contians four columns. The forth column in hg38_tf.bed indicates TF name which will be used as category name (--category_index 3) instead of TFBinding.

wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.dinucleotide.bed.bgz
wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.dinucleotide.bed.bgz.tbi
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.dinucleotide.bed.bgz
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.dinucleotide.bed.bgz.tbi
wget https://cqsweb.app.vumc.org/download1/cpdseqer/cpd__fileList1.list
wget https://cqsweb.app.vumc.org/download1/cpdseqer/cpd__fileList2.list
wget https://github.com/shengqh/cpdseqer/raw/master/data/hg38_promoter.bed
wget https://github.com/shengqh/cpdseqer/raw/master/data/hg38_tf.bed

Draw position figure

usage: cpdseqer position [-h] -i [INPUT] -c [COORDINATE_FILE]
                         [-b [BACKGROUND_FILES]] [-s] [--add_chr] -o [OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input dinucleotide list file, first column is file
                        location, second column is file name
  -c [COORDINATE_FILE], --coordinate_file [COORDINATE_FILE]
                        Input coordinate bed file
  -b [BACKGROUND_FILES], --background_files [BACKGROUND_FILES]
                        Background files, seprated by "," or set "auto" to use
                        default
  -s, --space           Use space rather than tab in coordinate files
  --add_chr             Add chr in chromosome name in coordinate file
  -o [OUTPUT], --output [OUTPUT]
                        Output file name

for example:

cpdseqer position -s -b auto -i cpd__fileList1.list -c hg38 -o cpd_position.txt

Get report

usage: cpdseqer report [-h] -i [INPUT] -g [GROUP] [-b [BLOCK]] [-d [DB]] -o
                       [OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input dinucleotide list file, first column is file
                        location, second column is file name
  -g [GROUP], --group [GROUP]
                        Input group list file, first column is group (0 for
                        control and 1 for case), second column is file name
  -b [BLOCK], --block [BLOCK]
                        Block size for summerize dinucleotide count
  -d [DB], --db [DB]    Input database version, hg38 or hg19, default is hg38
  -o [OUTPUT], --output [OUTPUT]
                        Output file name

for example:

cpdseqer report -i cpd__fileList1.list -g group_definition.txt  -d hg38   -o cpd.report

group_definition.txt (separated by tab)

0      Control
1      Case

Running cpdseqer using singularity

We also build docker container for cpdseqer which can be used by singularity.

Running directly

singularity exec -e docker://shengqh/cpdseqer bowtie2 -h
singularity exec -e docker://shengqh/cpdseqer cpdseqer -h

Convert docker image to singularity image first

singularity build cpdseqer.simg docker://shengqh/cpdseqer
singularity exec -e cpdseqer.simg bowtie2 -h
singularity exec -e cpdseqer.simg cpdseqer -h