CPDSeqer

This package is used to do CPD sequence data analysis.

Prerequisites

Install tabix

apt-get install -y tabix

Installation

Install python main package

pip install git+git://github.com/shengqh/cpdseqer.git

Usage

Demultiplex fastq file

usage: cpdseqer demultiplex [-h] -i [INPUT] -o [OUTPUT] -b [BARCODEFILE]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input fastq gzipped file
  -o [OUTPUT], --output [OUTPUT]
                        Output folder
  -b [BARCODEFILE], --barcodeFile [BARCODEFILE]
                        Tab-delimited file, first column is barcode, second
                        column is sample name

for example:

cpdseqer demultiplex -i sample.fastq.gz -o . -b barcode.txt

Example of barcode.txt, columns are separated by tab

ATCGCGAT        Control
GAACTGAT        UV

Align reads to genome using bowtie2

bowtie2 -p 8  -x hg38/bowtie2_index_2.3.5.1/GRCh38.p12.genome -U Control.fastq.gz -S Control.sam 2> Control.log
  
samtools view -Shu -F 256 Control.sam | samtools sort -o Control.bam -T Control -

samtools index Control.bam

rm Control.sam

You can download hg38 bowtie2 index files:

mkdir hg38
cd hg38
wget https://cqsweb.app.vumc.org/download1/cpdseqer/hg38_bowtie2.tar.gz
tar -xzvf hg38_bowtie2.tar.gz
cd ..

Extract dinucleotide from bam file

usage: cpdseqer bam2dinucleotide [-h] -i [INPUT] -g [GENOME_SEQ_FILE]
                                 [-q [MAPPING_QUALITY]] [-u] -o [OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input BAM file
  -g [GENOME_SEQ_FILE], --genome_seq_file [GENOME_SEQ_FILE]
                        Input genome seq file
  -q [MAPPING_QUALITY], --mapping_quality [MAPPING_QUALITY]
                        Minimum mapping quality of read (default 20)
  -u, --unique_only     Use uniquely mapped read only
  -o [OUTPUT], --output [OUTPUT]
                        Output file name

for example:

cpdseqer bam2dinucleotide \
  -i Control.bam \
  -g hg38/bowtie2_index_2.3.4.1/GRCh38.p12.genome.fa \
  -o Control.dinucleotide.bed.bgz

You can download bam files:

wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.bam
wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.bam.bai
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.bam
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.bam.bai

Get statistic based on bed file

usage: cpdseqer statistic [-h] -i [INPUT] -c [COORDINATE_LIST_FILE] [-s]
                          [--category_index CATEGORY_INDEX] [--add_chr] -o
                          [OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input dinucleotide list file, first column is file
                        location, second column is file name
  -c [COORDINATE_LIST_FILE], --coordinate_list_file [COORDINATE_LIST_FILE]
                        Input coordinate list file, first column is file
                        location, second column is file name
  -s, --space           Use space rather than tab in coordinate files
  --category_index CATEGORY_INDEX
                        Zero-based category column index in coordinate file
  --add_chr             Add chr in chromosome name in coordinate file
  -o [OUTPUT], --output [OUTPUT]
                        Output file name

for example:

cpdseqer statistic --category_index 3 \
  -i cpd__fileList1.list \
  -c cpd__fileList2.list \
  -o cpd.dinucleotide.statistic.tsv

cpd__fileList1.list (separated by tab)

Control.dinucleotide.bed.bgz      Control
UV.dinucleotide.bed.bgz           UV

cpd__fileList2.list (separated by tab)

hg38_promoter.bed      Promoter
hg38_tf.bed            TFBinding

You can download example files as following scripts. The hg38_promoter.bed contains three columns only. So, Promoter (from cpd__fileList2.list definition) will be used as category name for all entries in the hg38_promoter.bed. The hg38_tf.bed contians four columns. The forth column in hg38_tf.bed indicates TF name which will be used as category name (--category_index 3) instead of TFBinding.

wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.dinucleotide.bed.bgz
wget https://cqsweb.app.vumc.org/download1/cpdseqer/UV.dinucleotide.bed.bgz.tbi
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.dinucleotide.bed.bgz
wget https://cqsweb.app.vumc.org/download1/cpdseqer/Control.dinucleotide.bed.bgz.tbi
wget https://cqsweb.app.vumc.org/download1/cpdseqer/cpd__fileList1.list
wget https://cqsweb.app.vumc.org/download1/cpdseqer/cpd__fileList2.list
wget https://github.com/shengqh/cpdseqer/raw/master/data/hg38_promoter.bed
wget https://github.com/shengqh/cpdseqer/raw/master/data/hg38_tf.bed

Draw position figure

usage: cpdseqer position [-h] -i [INPUT] -c [COORDINATE_FILE]
                         [-b [BACKGROUND_FILES]] [-s] [--add_chr] -o [OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input dinucleotide list file, first column is file
                        location, second column is file name
  -c [COORDINATE_FILE], --coordinate_file [COORDINATE_FILE]
                        Input coordinate bed file
  -b [BACKGROUND_FILES], --background_files [BACKGROUND_FILES]
                        Background files, seprated by "," or set "auto" to use
                        default
  -s, --space           Use space rather than tab in coordinate files
  --add_chr             Add chr in chromosome name in coordinate file
  -o [OUTPUT], --output [OUTPUT]
                        Output file name

for example:

cpdseqer position -s -b auto -i cpd__fileList1.list -c hg38 -o cpd_position.txt

Get report

usage: cpdseqer report [-h] -i [INPUT] -g [GROUP] [-b [BLOCK]] [-d [DB]] -o
                       [OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT], --input [INPUT]
                        Input dinucleotide list file, first column is file
                        location, second column is file name
  -g [GROUP], --group [GROUP]
                        Input group list file, first column is group (0 for
                        control and 1 for case), second column is file name
  -b [BLOCK], --block [BLOCK]
                        Block size for summerize dinucleotide count
  -d [DB], --db [DB]    Input database version, hg38 or hg19, default is hg38
  -o [OUTPUT], --output [OUTPUT]
                        Output file name

for example:

cpdseqer report -i cpd__fileList1.list -g group_definition.txt  -d hg38   -o cpd.report

group_definition.txt (separated by tab)

0      Control
1      Case

Running cpdseqer using singularity

We also build docker container for cpdseqer which can be used by singularity.

Running directly

singularity exec -e docker://shengqh/cpdseqer bowtie2 -h
singularity exec -e docker://shengqh/cpdseqer cpdseqer -h

Convert docker image to singularity image first

singularity build cpdseqer.simg docker://shengqh/cpdseqer
singularity exec -e cpdseqer.simg bowtie2 -h
singularity exec -e cpdseqer.simg cpdseqer -h

cpdseqer
Release 0.2.0

Release 0.2.0

0.2.0

0.1.3

Documentation

CPDSeqer

Prerequisites

Installation

Usage

Demultiplex fastq file

Align reads to genome using bowtie2

Extract dinucleotide from bam file

Get statistic based on bed file

Draw position figure

Get report

Running cpdseqer using singularity

Running directly

Convert docker image to singularity image first

Stats

Development practices

Releases

Contributors

cpdseqer Release 0.2.0

Release 0.2.0 Toggle Dropdown 0.2.0 0.1.3

Documentation

CPDSeqer

Prerequisites

Installation

Usage

Demultiplex fastq file

Align reads to genome using bowtie2

Extract dinucleotide from bam file

Get statistic based on bed file

Draw position figure

Get report

Running cpdseqer using singularity

Running directly

Convert docker image to singularity image first

Stats

Development practices

Releases

Contributors

cpdseqer
Release 0.2.0

Release 0.2.0

0.2.0

0.1.3