TRSalgorithm

Determining 3' termini of transcripts from RNAtag-seq data in bacteria


License
MIT
Install
pip install TRSalgorithm==0.0.1

Documentation

TRS

Table of Contents
  1. Overview
  2. Hardware requirements
  3. Software requirements
  4. Running a demo example
  5. Scheme file
  6. Output files
  7. Usage
  8. Contact
  9. License
  10. Acknowledgements
  11. References

Overview

Transcription termination is one of the fundamental processes in the bacterial cell. It determines the 3’ boundary of the transcript and by this its structure and stability, it prevents transcription leakage to downstream genes, and enables cycling of RNA polymerase. Several experimental approaches were developed to identify the 3’ termini of transcripts, however, these were applied only to a limited number of bacteria and growth conditions. Here we present TRS (Termini by Read Starts), a computational approach to determine 3' termini of transcripts from widely available RNA-seq data generated by the RNAtag-seq protocol [1]. We and others [2] observed that in RNATag-seq data, sequencing reads accumulate at the 3' termini of transcripts. This phenomenon results from the RNAtag-seq protocol stems that involve random fragmentation of the RNA transcripts followed by the ligation of 3' end adaptor steps (corresponding to read 1 in the sequencing machine). While the random fragmentation generates 3' termini at random positions, the transcripts' genuine 3' termini is always present and thus overrepresented in the data. Thus, the task of identifying 3' termini requires the diffrentiation between genuine 3' termini and 3' termini generated by the random fragmentation step in the protocol.
Briefly, we rely on sequencing data from RNAtag-seq experiments done in triplicates. For each library we map the reads to the genome, record the number of read ends per genomic position, and model it by the negative binomial distribution. Next, we identify positions that differ statistically significantly in their read start counts (corresponding to fragments 3' termini) from downstream positions. Positions with statistically significant read starts peaks in multiple libraries are determined as 3’ termini. The computational scheme is fully described in [].

Hardware requirements

TRSalgorithm package requires only a standard computer with enough RAM to support the in-memory operations.

Software requirements

OS requirements

This package is supported for macOS and Linux. The package has been tested on the following systems:

  • macOS: Big Sur (11.7)
  • Linux: Debian GNU/Linux 10.11 (buster)

Prerequisites

This package requires the installation of the following packages:

numpy
pandas
scipy
statsmodels
pysam
pyaml
intervaltree

Installation

pip install TRSalgorithm. The installation time of the TRSalgorithm package should not take over a minute, however, dependencies may take few minutes to install on a standard computer.

Required files

The package requires the following files to determine 3' termini:

  • BAM files containing the mapped reads of the sequencing libraries.
  • A scheme file describing how to read the BAM files, assigning them to groups. It enables the processing of sequencing libraries in multiple conditions in a single run (see here for further detail).

Running a demo example

In the demo_example directory, you can find:

  1. A scheme file named scheme.yaml.
  2. Three BAM files named: rep1.bam, rep2.bam, rep3.bam.
  3. A directory named expected results containing two files: (1) significant peaks file - containing all signficant peaks identified. (2) all peaks file - containing all the peaks considered by the algorithm. This directory would use us to verify that everything runs properly.

After the package is installed, you can run in the demo example directory the following command (running the demo example should not take over a minute on a standard computer):

TRSpeaks scheme.yaml -w "."

This will run the algorithm on the demo example BAM files as defined by the scheme file outputing the results in the current directory ("."). You can observe that three folders were created:

  • results - Including the algorithm output for each group. For each group a directory is created with its name, and the outputs are located within. The outputs consists of two files: (1) significant_peaks.csv a tab delimited text file indicating the positions of the signficant peaks (see output file section). (2) all_mean_peaks.csv a tab delimited text file indicating all positions tested by the algorithm.
  • data - Including processed version of the bam files. Since generating the read starts counts per position can take some time with large files, the pipeline generates compressed read start count files. These can be used to rerun the algorithm with different parameters without processing the BAM files again. If you wish to reprocess the BAM files, for example if you made a mistake in the scheme file, you can use the -b flag when running the algorithm, forcing the reprocessing of the bam files.
  • logs - Including text file of the run output.

To verify that the algorithm ran correctly, please confirm the following in the demo example directory:

  • First verify the number of rows is identical between the output and the expected output by running the following commands:
wc -l "results/LB\ RNAtag-seq/significant_peaks.csv"
wc -l "expected_results/significant_peaks.csv"
wc -l "results/LB\ RNAtag-seq/all_mean_peaks.csv"
wc -l "expected_results/all_mean_peaks.csv"
  • Next to compare the identified signficant peaks are identical run the following command: awk 'NR==FNR{a[$1,$2,$3,$4,$5];next} ($1,$2,$3,$4,$5) in a' expected_results/significant_peaks.csv results/LB\ RNAtag-seq/significant_peaks.csv | wc -l. This should output the same number of lines as the previous step. Note that in the comparison we verified only the reported positions and not p-values outputs, as these may slightly differ depending on the machine.

Scheme File

The scheme file contains is a configuration file written in YAML, and contains the following information:

  • Which libraries (BAM files) should be processed together.
  • How to process each BAM file.

A simple scheme file have the following structure:

libs:
  - group: "[GROUP_1]"
    name: "[REPLICATE 1 NAME]"
    count_file:
      path: [PATH TO REPLICATE 1 BAM FILE]

  - group: "[GROUP_1]"
    name: "[REPLICATE 2 NAME]"
    count_file:
      path: [PATH TO REPLICATE 2 BAM FILE]

  - group: "[GROUP_1]"
    name: "[REPLICATE 3 NAME]"
    count_file:
      path: [PATH TO REPLICATE 3 BAM FILE]

Here we defined three sequencing libraries which belong into a single group ([GROUP 1]). In addition we defined for each library a name that will be used in the script output to report putative 3' termini adjusted p-value, and the path to the mapped sequencing reads which will be used to generate per postion read starts.

The scheme file provides multiple options how to process the BAM file. It contains the following boolean options:

  • is_reverse - This indicates that read 1 is actually the 3' end of the fragment.
  • skip_clipped - Wether to ignore reads with soft clipping.

For example, using these fields would look like this:

lib:
  - group: "[GROUP_1]"
    name: "[REPLICATE 1 NAME]"
    count_file:
      path: [PATH TO REPLICATE 1 BAM FILE]
      is_reverse: True
      skip_clipped: False

Here we do not ignore reads with soft clipping however, we treat read 1 as the reverese complement, which is the typical case in the RNAtag-seq protocol.

Output files

Running the script generates three directories:

  • logs - Containing text files with information on the run.
  • data - Containing compressed count files generated by the script. Since processing the BAM files may take a long time, the script saves the count files so reruning the script again (for instance with different parameters) would shorten the running time.
  • results - A directory containing a folder for each group defined by the scheme file. In each group directory, the script outputs two files: (1)significant_peaks.csv - A file containing the peaks which were determined as signficant in multiple replicates. (2) all_mean_peaks.csv - A file containing all the peaks considered by the algorithm.

Significant peaks file format

The file is a tab-delimited text file containing 7 constant columns, and additional columns per library in the group. Here is the description of the columns:

Column Description
Chromosome The chromosome in which the 3' terminus is located.
Strand The strand in which the 3' terminus is located.
Start The peak boundary leftmost position.
End The peak boundary rightmost position.
Dominant The position with the highest statistic value within the peak. Note that this position is not necessarily the position with the highest number of read starts.
Signal The mean statistic value (R_i) computed for the peak containing the 3' termini.
Rank The rank of the mean statistic value (R_i) out of all tested positions.
Library name The adjusted p-value after applying Bonferroni correction for multiple hypotheses (Library name).

All mean peaks file format

The file follows the same structure of the Signficiant peaks file, however, it contains all peaks considered by the algorithm, and is sorted by the mean statistic (R_i).

Usage

usage: TRSpeaks [-h] [-w WORKDIR] [-t THRESHOLD] [-b] [-c MIN_COUNT] [--min_height MIN_HEIGHT] [--window_margin WINDOW_MARGIN] [--merge_distance MERGE_DISTANCE] [--rel_height REL_HEIGHT]
                [--chr_list CHR_LIST] [-d DS_DISTANCE] [--signif_min_lib_count SIGNIF_MIN_LIB_COUNT] [--insignif_min_ratio INSIGNIF_MIN_RATIO] [-l LOG_LEVEL]
                scheme

Uses read starts to determine 3' termini of transcripts.

positional arguments:
  scheme                Path for file containing info of the libs to process.

optional arguments:
  -h, --help            show this help message and exit
  -w WORKDIR, --workdir WORKDIR
                        Working directory. (default: ./runs)
  -t THRESHOLD, --threshold THRESHOLD
                        The signficance level used by the model. (default: 0.01)
  -b, --force_bam       Forces the program to reprocess the bam files. (default: False)
  -c MIN_COUNT, --min_count MIN_COUNT
                        The minimal coverage for a region to be considered. (default: 10)
  --min_height MIN_HEIGHT
                        The minimal ratio to consider as a peak. (default: None)
  --window_margin WINDOW_MARGIN
                        Defines the region which will be used to count the local number of reads starts. The margin is the number of nucleotides (upstream/downstream) that will be added to the region
                        around the considered site. (default: 3)
  --merge_distance MERGE_DISTANCE
                        The distance which below it, peaks will be merged together. (default: 0)
  --rel_height REL_HEIGHT
                        The relative height of peaks to use in the scipy find_peaks function. (default: 0.75)
  --chr_list CHR_LIST   Conversion of chromosome files from the bam to new name in the following format: bam1:new1,bam2:new2,... (default: )
  -d DS_DISTANCE, --ds_distance DS_DISTANCE
                        The distance (downstream) used to compute the read starts ratio. (default: 70)
  --signif_min_lib_count SIGNIF_MIN_LIB_COUNT
                        The minimal number of libraries in which the 3' terminus should be found as significant to report it depending on the threshold. (default: 2)
  --insignif_min_ratio INSIGNIF_MIN_RATIO
                        The minimal mean ratio to accept if the 3' terminus wasn't significant in all repeats. (default: 0.5)
  -l LOG_LEVEL, --log_level LOG_LEVEL
                        The logging level to report to the log file. (default: debug)

Contact

Amir Bar - amir.bar@mail.huji.ac.il

Acknowledgements

We are thankful to:

License

This project is covered under the MIT License.

References

  • [1] Shishkin, A.A. et al. Simultaneous generation of many RNA-seq libraries in a single reaction. Nat Methods 12, 323-325 (2015).
  • [2] Fuchs, M. et al. An RNA-centric global view of Clostridioides difficile reveals broad activity of Hfq in a clinically important gram-positive bacterium. Proc Natl Acad Sci U S A 118 (2021).
  • [3] Adams, P.P. et al. Regulatory roles of Escherichia coli 5' UTR and ORF-internal RNAs detected by 3' end mapping. eLife 10 (2021).