termseq-peaks

Peak-calling for term-seq data in bacteria


License
MIT
Install
pip install termseq-peaks==0.1

Documentation

Term-seq peak-caller

Homepage: https://github.com/nichd-bspc/termseq-peaks

This tool was designed to call 3'-end peaks from term-seq data in bacteria, handling replicates in a statistically robust manner.

Installation

With pip:

pip install termseq-peaks

With conda:

conda install termseq-peaks --channel conda-forge --channel bioconda

From source:

git clone <repo>
cd <repo>
python setup.py install

Usage:

termseq_peaks <bedgraphs> --peaks out.bed [additional options]

Background

This tool takes two novel approaches that together yield good results for calling precise peaks in term-seq data with biological replicates. These approaches are 1) a signal processing approach and 2) an implementation of multi-way IDR to handle >2 replicates.

For peak-calling, the venerable macs2 peak-caller [0] is the go-to method. However, macs2 is designed for ChIP-seq data and all the assumptions that go along with that (modeling peaks based on fragment size, controlling for open chromatin, comparing IP against a background input, etc). We found that naively applying macs2 to term-seq data resulted in suboptimal peak calls. Term-seq signal is composed of single-bp positions of read ends and has a very high dynamic range since it is coming from trancribed RNA that can have very high copy numbers per cell (in contrast to ChIP-seq which we have just 1 copy per haploid cell).

Furthermore, many peak-callers have limited support for multiple replicates. One general solution to this is to apply the Irreproducible Discovery Rate method (IDR), originally developed for the ENCODE project. By design, the IDR method only takes two replicates at a time.

Usage

Prepare input

Required input is normalized signal bedGraphs for each replicate. If data are stranded, there should be separate files for each strand. Gzipped bedGraphs are supported automatically if the filename ends in .gz.

One way of doing this might be with deepTools bamCoverage. This example makes a bedGraph file out of unique reads on the minus strand (--samFlagInclude 16), uses 1-bp resolution (--binSize 1), only considers unique reads (--minMappingQuality 20), and uses only the first base of each read to build the signal (--Offset 1, as appropriate for a Term-seq library, for example).

bamCoverage \
  --bam rep1.bam \
  -o rep1_minus.bedgraph \
  --outFileFormat bedgraph \
  --binSize 1 \
  --Offset 1 \
  --minMappingQuality 20 \
  --samFlagInclude 16 \

Run

If we do this for each replicate's minus-strand reads, these bedGraphs can then be provided to termseq-peaks:

termseq-peaks rep1_minus.bedgraph rep2_minus.bedgraph peaks_minus --strand -

By default the output peaks_minus.bed will contain peaks falling below an IDR threshold of 0.05. The full oracle file (the leniently-called peaks on a bedgraph that merges all provided bedgraphs) will be output as peaks_minus.bed.oracle.narrowPeak.

These files can be used with IGV or other genome browsers to inspect the peaks alongside the input signals to assess the peak-calling performance.

For more help, run:

termseq-peaks -h

Algorithm

This tool takes multiple normalized bedGraph files representing the normalized signal for each replicate, and calls a set of consistent peaks at a provided IDR [1] cutoff.

  • Peaks are called using scipy.signal.find_peaks [2] with very lenient parameters to intentionally include both real peaks and noise. These peaks are called on each replicate.
  • The score for the peaks is the "prominence" value for each peak; see [2] for details.
  • For each unique pairwise combination of replicates, IDR routines from [1] are run, resulting in an output file containing merged peaks from those two files along with IDR values for each. In practice the tool stores these as temp files. The number of peaks falling below the IDR threshold is counted for each pairwise comparison. The minimum such number, N, across all pairwise combinations of replicates is used as the final number of peaks to select.
  • All bedGraphs are additionally merged together and peaks are similarly called on that merged signal to get the "oracle" peaks.
  • The oracle peaks are then ranked by their score and the top N peaks are selected as the final peaks. The scores in the final peaks are the scores from the oracle peaks, that is, the peak prominences from calling peaks on the merged bedGraphs.

Output

The find_peaks function returns various metrics. Here, we retrieve the prominence and the width. The prominence is the vertical distance between the peak and the lowest contour line, and the width is measured at half the prominence. See these documentation pages for a visualization of these metrics: prominences and widths.

Output files are in the narrowPeak format, which shows the peak width as well as the position of the summit. We report the prominence as the score as well as the signal value. The position of the peak is the 1-bp position of the prominence.

Caveats

The find_peaks function operates on 1-dimensional vectors, and so returns peak positions in terms of indexes into the input vectors. Internally, we interpolate to back-calculate the corresponding genomic coordinates and round to integers. This may potentially have issues where two peaks that are genomically far away have adjacent indexes (for example, if the intervening region has zero reads anywhere). Empirically we do not observe this to be an issue, but a solution would be to pad out the vector to include zeros at every position in the chromosome/plasmid (and increase RAM usage as a result).

The biggest downside currently is speed and RAM. This is not an issue for the small bacterial genomes the tool was designed for; it takes about 30s to run for E. coli data, and pandas DataFrames are used to store the signal. For larger eukaryotic genomes, parallelization across chromosomes may be required and substantial RAM may be required. This tool remains untested on larger genomes, but has worked quite well for term-seq in several bacterial genomes. Furthermore, since we need to perform IDR between all pairwise combinations of replicates, the running time scales as O(nreplicates^2).

References