MAnorm2-utils

To pre-process a set of ChIP-seq samples


Keywords
ChIP-seq, MAnorm2, normalization, python-library
License
GPL-3.0
Install
pip install MAnorm2-utils==1.0.0

Documentation

MAnorm2_utils 1.0.0

Author: Shiqi Tu
Contact: tushiqi@picb.ac.cn
Date: 2018-07-08

MAnorm2_utils is designed to coordinate with MAnorm2, an R package for differential analysis with ChIP-seq signals between two or more groups of replicate samples. MAnorm2_utils is primarily used for processing a set of ChIP-seq samples into a regular table recording the read abundances and enrichment states of a list of genomic bins in each of these samples.

Usage

The primary utility of MAnorm2_utils comes from the two scripts bound with it, named profile_bins and sam2bed, respectively.

Profiling ChIP-seq signals in reference genomic regions

Given the peak regions and mapping positions of reads of each of a set of ChIP-seq samples, profile_bins comes up with a list of reference genomic bins (each being enriched for ChIP-seq signals in at least one of the samples), and deduces the read count as well as enrichment status of each of the bins in each sample. Refer to MACS for more information about the technical terms mentioned above.

We recommend MACS 1.4 for identifying peaks for ChIP-seq samples associated with narrow genomic regions of reads enrichment (e.g., samples for most transcription factors and histone modifications like H3K4me3 and H3K27ac). In fact, although having a general applicability, profile_bins is specifically suited to processing the output files generated by MACS 1.4. For histone modifications constituting broad enriched domains (e.g., H3K9me3 and H3K27me3), we recommend SICER as the peak caller.

The following is a sample usage of profile_bins of the simplest form:

profile_bins --peaks=peak1.bed,peak2.bed \
             --reads=read1.bed,read2.bed \
             --labs=s1,s2 -n example

Note

profile_bins only recognizes BED-formatted input files. For read alignment results stored in SAM files, use first sam2bed to transform them into BED files before calling profile_bins (BED files created by sam2bed have been specifically designed to suit profile_bins; see also the section below). For BAM-formatted files, refer to Samtools for converting them into SAM files.

If everything goes smoothly, the command above will generate two files, named example_profile_bins_log.txt and example_profile_bins.xls, respectively. The former records the full list of parameter settings for calling profile_bins, as well as some summary statistics regarding each of the supplied ChIP-seq samples. The latter gives the read count and enrichment status for each deduced reference genomic bin in each sample, and has a format like the following (data shown here is only for illustration):

Example output of profile_bins
chrom start end s1.read_cnt s2.read_cnt s1.occupancy s2.occupancy
chr1 28112 29788 115 4 1 0
chr1 164156 166417 233 194 1 1
chr1 166417 168417 465 577 1 1
chr1 168417 169906 15 34 0 1

To clarify, a genomic bin is "occupied" by a ChIP-seq sample if and only if its middle point is covered by some peak region of the sample.

profile_bins supports a number of parameters for a customized configuration for deducing reference genomic bins as well as counting the reads falling in them. Type profile_bins --help in the command line for a complete list of these parameters and a brief description of each of them. Among others, several parameters deserve specific attention:

  • By default, profile_bins merges peaks from all the provided ChIP-seq samples into a consensus set of peak regions, and divides up each broad merged peak into consecutive genomic bins. Specify --typical-bin-size to control the size of such genomic bins. Note that the merged peaks having a size comparable to this parameter are left untouched.

    The default value of --typical-bin-size, which is 2000, suits well the ChIP-seq samples of histone modifications. For ChIP-seq samples of transcription factors, setting the parameter to 1000 is recommended.

  • In cases where summit positions of the supplied peaks are available (e.g., when the peaks are called by using MACS 1.4), you may provide profile_bins with this information via specifying --summits. Summit positions will be used to determine an appropriate start point for dividing up a broad merged peak.

  • Alternatively, you can directly specify a set of genomic regions as the reference bins to profile, by setting --bins to a BED file. In this case, profile_bins focuses on these provided bins and suppresses the peak merging procedure.

    --typical-bin-size and --summits are ignored when --bins is specified.

  • Before being assigned to reference bins, each read (or read pair) is converted into a genomic locus representing the middle point of the underlying DNA fragment. By default, profile_bins treats the supplied reads as single-end, and shifts downstream the 5' end of each of them by --shiftsize to reach the putative middle point. --shiftsize defaults to 100, and may be set to half of the practical DNA fragment size selected in the library preparation process.

  • Set --paired to indicate the reads are paired-end. In this case, middle point of the underlying DNA fragment associated with each read pair could be accurately inferred. Note that two reads from the same ChIP-seq sample are considered as a read pair only if they have exactly the same name (i.e., the 4th column in a BED file).

    --shiftsize is ignored when --paired is set.

  • --keep-dup controls the program's behavior regarding duplicate reads (or read pairs) potentially resulting from PCR amplification. For single-end reads, two reads are considered as duplicates if their 5' ends are mapped to the same genomic locus; for paired-end reads, two read pairs are considered as duplicates if their implied DNA fragments occupy the same genomic interval.

    By default, profile_bins preserves all the reads (or read pairs) for the counting procedure. For both paired-end reads and deep-sequencing single-end reads, we strongly recommend setting --keep-dup to 1 to enhance the specificity of downstream analyses. In that case, for each ChIP-seq sample only one read (or read pair) of a set of duplicates is retained for counting. Note also that the output log file records, for each sample, the ratio of reads (or read pairs) that are removed due to --keep-dup.

  • profile_bins supports the idea of using a configuration file to deliver parameters, to avoid repeated typing in the command line. To do that, write a configuration file following the format as demonstrated below, and pass it to --parameters:

    peaks=peak1.bed,peak2.bed
    reads=read1.bed,read2.bed
    labs=s1,s2
    n=example
    paired
    keep-dup=1
    

    Note that --parameters could be used in mixture with the other command-line arguments.

Refer to the Manual of MAnorm2_utils for a full specification of the parameters supported by profile_bins.

Transforming SAM into BED files

sam2bed is designed to coordinate with profile_bins, since the latter only accepts BED-formatted files. The simplest form of calling sam2bed is as follows:

sam2bed -i File.sam -o File.bed

The program will read from the standard input stream if -i is not specified.

In the vast majority of cases, the default setting of most of the parameters supported by sam2bed should be used. The only parameter that may be customized in practice is --min-qual, which controls the program's behavior regarding filtering out the SAM alignment records with a low mapping quality. Type sam2bed --help in the command line for a brief description of each parameter supported by sam2bed.