Pipeline for using IDR to identify a set of reproducible peaks in an eCLIP dataset with two or three replicates.


Keywords
eCLIP-seq, peaks, bioinformatics
License
MIT
Install
pip install eclip-peak==1.0.14

Documentation

eCLIP-Peak


Installation

  • For Van Nostrand Lab

    The pipeline has already been installed. Activate its environment by issuing the following command: source /storage/vannostrand/software/eclip/venv/environment.sh.

  • For all others:

    • Install Python (3.6+)
    • Install peak (pip install eclip-peak)
    • Install IDR (2.0.3+)
    • Install Perl (5.10.1+) with the following packages:
      • Statistics::Basic (cpanm install Statistics::Basic)
      • Statistics::Distributions (cpanm install Statistics::Distributions)
      • Statistics::R (cpanm install Statistics::R)
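
    After installation, it can help to confirm that the required executables are visible on your PATH. The following is a minimal sketch rather than part of the pipeline: the function name find_missing_tools is hypothetical, and the executable names idr and perl are assumed to be the usual command names for IDR and Perl (peak is the command used throughout this README). It does not check versions or Perl packages.

```python
import shutil

# Executable names are assumptions: `peak` comes from this README, while
# `idr` and `perl` are assumed to be the usual command names for IDR and Perl.
REQUIRED_TOOLS = ("peak", "idr", "perl")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

    Calling find_missing_tools() returns an empty list when all three commands can be found.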

Usage

  • For Van Nostrand Lab

    After activating peak's environment, call peak -h to see the detailed usage.

  • For all others:

    After successfully installing Python, peak, IDR, and Perl (with the required packages), call peak -h in your terminal to see the following detailed usage:

$ peak -h
usage: peak [-h] 
            [--ip_bams IP_BAMS [IP_BAMS ...]] 
            [--input_bams INPUT_BAMS [INPUT_BAMS ...]] 
            [--peak_beds PEAK_BEDS [PEAK_BEDS ...]] 
            [--read_type READ_TYPE] [--outdir OUTDIR] 
            [--species SPECIES] 
            [--l2fc L2FC] [--l10p L10P] [--idr IDR] 
            [--dry_run] [--cores] [--debug]

Pipeline for using IDR to identify a set of reproducible peaks given eClIP dataset 
with two or three replicates.

optional arguments:
  -h, --help            show this help message and exit
  --ip_bams IP_BAMS [IP_BAMS ...]
                        Space separated IP bam files (at least 2 files).
  --input_bams INPUT_BAMS [INPUT_BAMS ...]
                        Space separated INPUT bam files (at least 2 files).
  --peak_beds PEAK_BEDS [PEAK_BEDS ...]
                        Space separated peak bed files (at least 2 files).
  --ids IDS [IDS ...]   Optional space separated short IDs (e.g., S1, S2, S3) for datasets.
  --read_type READ_TYPE
                        Read type of eCLIP experiment, either SE or PE.
  --outdir OUTDIR       Path to output directory.
  --species SPECIES     Short code for species, e.g., hg19, mm10.
  --l2fc L2FC           Only consider peaks at or above this l2fc cutoff, default: 3.
  --l10p L10P           Only consider peaks at or above this l10p cutoff, default: 3.
  --idr IDR             Only consider peaks at or above this idr score cutoff, default: 0.01.
  --cores CORES         Maximum number of CPU cores for parallel processing, default: 1.
  --dry_run             Print out steps and inputs/outputs of each step without 
                        actually running the pipeline.
  --debug               Invoke debug mode (only for develop purpose).
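
The --l2fc and --l10p options in the help text above are both "at or above" thresholds with a default of 3. As an illustration only, the sketch below applies that rule to peaks held in memory. The dictionary keys "l2fc" and "l10p" and both function names are hypothetical and do not reflect the pipeline's internal data structures. The --idr cutoff is left out because the help text does not say on which scale the IDR score is compared.

```python
from typing import Dict, List


def passes_cutoffs(peak: Dict[str, float], l2fc: float = 3.0, l10p: float = 3.0) -> bool:
    """Return True if the peak is at or above both cutoffs."""
    return peak["l2fc"] >= l2fc and peak["l10p"] >= l10p


def filter_peaks(peaks: List[Dict[str, float]], l2fc: float = 3.0, l10p: float = 3.0) -> List[Dict[str, float]]:
    """Keep only the peaks that pass both cutoffs, preserving their order."""
    return [peak for peak in peaks if passes_cutoffs(peak, l2fc, l10p)]
```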

Outline of workflow

  • Normalize CLIP IP BAM over INPUT for each replicate
  • Peak compression/merging on input-normalized peaks for each replicate
  • Entropy calculation on IP and INPUT read probabilities within each peak for each replicate
  • Run IDR on peaks ranked by entropy
  • Normalize IP BAM over INPUT using new IDR peak regions
  • Identify reproducible peaks within IDR regions
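
To make the normalization and entropy steps above more concrete, here is a small sketch of one common formulation. It is an assumption, not taken from this README: the read fraction of a peak is its read count divided by the total mapped reads, the fold change is log2 of the IP fraction over the INPUT fraction, and the entropy term is the IP fraction times that log ratio. All function names are hypothetical, and the sketch assumes non-zero read counts, so it does not show how the pipeline handles peaks without INPUT reads.

```python
import math


def read_fractions(ip_reads, ip_total, input_reads, input_total):
    """Fraction of all mapped IP and INPUT reads that fall within one peak."""
    return ip_reads / ip_total, input_reads / input_total


def log2_fold_change(p_ip, p_input):
    """log2 enrichment of the IP read fraction over the INPUT read fraction."""
    return math.log2(p_ip / p_input)


def relative_entropy(p_ip, p_input):
    """Assumed entropy term for one peak: p_ip * log2(p_ip / p_input)."""
    return p_ip * math.log2(p_ip / p_input)
```

Under this formulation, a peak with a high fold change but very few reads gets a lower entropy than an equally enriched peak with many reads, so ranking by entropy favors well-supported peaks.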

Examples

  • eCLIP with 2 replicates

    Assuming the eCLIP pipeline has run successfully and generated the following files for species hg19:

    replicate 1:
        IP BAM: ip1.bam
        INPUT BAM: input1.bam
        Peak BED: clip1.peak.clusters.bed
    replicate 2:
        IP BAM: ip2.bam
        INPUT BAM: input2.bam
        Peak BED: clip2.peak.clusters.bed
    

    The pipeline can then be called as follows to identify reproducible peaks:

    peak \
        --ip_bams ip1.bam ip2.bam \
        --input_bams input1.bam input2.bam \
        --peak_beds clip1.peak.clusters.bed clip2.peak.clusters.bed \
        --species hg19

  • eCLIP with 3 replicates

    Assuming the eCLIP pipeline has run successfully and generated the following files for species hg19:

    replicate 1:
        IP BAM: ip1.bam
        INPUT BAM: input1.bam
        Peak BED: clip1.peak.clusters.bed
    replicate 2:
        IP BAM: ip2.bam
        INPUT BAM: input2.bam
        Peak BED: clip2.peak.clusters.bed
    replicate 3:
        IP BAM: ip3.bam
        INPUT BAM: input3.bam
        Peak BED: clip3.peak.clusters.bed
    

    The pipeline can then be called as follows to identify reproducible peaks:

    peak \
        --ip_bams ip1.bam ip2.bam ip3.bam \
        --input_bams input1.bam input2.bam input3.bam \
        --peak_beds clip1.peak.clusters.bed clip2.peak.clusters.bed clip3.peak.clusters.bed \
        --species hg19

Note:

  • The indentation of the command does not matter; you can also write it on a single line.
  • The order of the BAM and peak files passed to --ip_bams, --input_bams, and --peak_beds DOES matter. Make sure you list the replicates in the same order for all three parameters.
  • Three cutoffs (--l2fc, --l10p, and --idr) can be set to fine-tune the peak filtering; see the Usage section for details.
  • If the pipeline fails, check the log to identify the error and make the necessary changes. Re-running the pipeline skips the parts that were processed successfully and continues with only the failed and unprocessed parts.
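
One way to keep the replicate order consistent across the three parameters is to build the command from per-replicate records. The sketch below is only an illustration: Replicate and build_peak_command are hypothetical names that are not part of the peak package, and only flags shown in the examples above are used.

```python
from typing import List, NamedTuple


class Replicate(NamedTuple):
    """Files that belong to one replicate (hypothetical helper type)."""
    ip_bam: str
    input_bam: str
    peak_bed: str


def build_peak_command(replicates: List[Replicate], species: str) -> List[str]:
    """Build the peak argument list with every file group in replicate order."""
    if len(replicates) not in (2, 3):
        raise ValueError("peak expects two or three replicates")
    command = ["peak"]
    command += ["--ip_bams"] + [rep.ip_bam for rep in replicates]
    command += ["--input_bams"] + [rep.input_bam for rep in replicates]
    command += ["--peak_beds"] + [rep.peak_bed for rep in replicates]
    command += ["--species", species]
    return command
```

The returned list can be passed to subprocess.run(command, check=True). Because each file group is generated from the same list of replicates, the order cannot drift between parameters.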

Output

The peak pipeline writes 5 types of files to the current working directory or to a user-specified output directory (via --outdir):

  1. *.bed: a 6-column or 9-column BED file containing peak information.
  2. *.tsv: tab-separated text file with additional information beyond what the BED file holds.
  3. *.txt: text file containing the mapped read counts.
  4. *.out: tab-separated text file generated by IDR.
  5. *.png: plot generated by IDR.

The names of the output files are self-explanatory. Each replicate is named after the basename of its peak BED file with the .peak.clusters.bed suffix removed.
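
As a small illustration of this naming rule, the sketch below derives a replicate name from a peak BED path. The function name replicate_name is hypothetical. Returning the basename unchanged when the suffix is absent is an assumption, since this README does not describe that case.

```python
import os

PEAK_SUFFIX = ".peak.clusters.bed"


def replicate_name(peak_bed: str) -> str:
    """Return the basename of a peak BED file without the .peak.clusters.bed suffix."""
    name = os.path.basename(peak_bed)
    if name.endswith(PEAK_SUFFIX):
        name = name[: -len(PEAK_SUFFIX)]
    # Assumption: a file without the suffix keeps its basename as is.
    return name
```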

The reproducible peaks can be found in *.reproducible.peaks.bed, and additional information can be found in *.reproducible.peaks.custom.tsv. The former is a 6-column BED file; the latter is a tab-separated text file with the following columns, in order:

  • IDR region (entire IDR identified reproducible region)
  • Peak (reproducible peak region)
  • Geometric mean of the l2fc
  • Columns of log2 fold change (2 or 3 columns for a 2- or 3-replicate experiment, respectively)
  • Columns of -log10 p-value (2 or 3 columns for a 2- or 3-replicate experiment, respectively)
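
The column order above can be read back with a short parser. The sketch below rests on assumptions that this README does not confirm: the IDR region and the peak each occupy a single tab-separated field, and the line being parsed is a data row rather than a header. ReproduciblePeak and parse_row are hypothetical names.

```python
from typing import List, NamedTuple


class ReproduciblePeak(NamedTuple):
    """One row of *.reproducible.peaks.custom.tsv (assumed layout)."""
    idr_region: str
    peak: str
    geomean_l2fc: float
    l2fc: List[float]
    l10p: List[float]


def parse_row(line: str, n_replicates: int) -> ReproduciblePeak:
    """Split one data row into its fields, following the column order above."""
    fields = line.rstrip("\n").split("\t")
    expected = 3 + 2 * n_replicates  # assumes one field each for region and peak
    if len(fields) != expected:
        raise ValueError("expected %d columns, found %d" % (expected, len(fields)))
    l2fc = [float(value) for value in fields[3:3 + n_replicates]]
    l10p = [float(value) for value in fields[3 + n_replicates:]]
    return ReproduciblePeak(fields[0], fields[1], float(fields[2]), l2fc, l10p)
```

If the actual file stores regions across several columns or starts with a header line, the offsets in parse_row need to be adjusted accordingly.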