A simple tool to extract damaged reads from BAM files


Keywords
dmg-reads, GTDB
License
MIT
Install
pip install dmg-reads==1.0.5

Documentation

dReads: a tool to extract damaged reads from BAM files

GitHub release (latest by date including pre-releases) dmg-reads PyPI Conda

A simple tool to extract damaged reads from BAM files

Installation

We recommend having conda installed to manage the virtual environments

Using pip

First, we create a conda virtual environment with:

wget https://raw.githubusercontent.com/genomewalker/dmg-reads/master/environment.yml
conda env create -f environment.yml

Then we proceed to install using pip:

pip install dmg-reads

Using mamba

mamba install -c conda-forge -c bioconda -c genomewalker dmg-reads

Install from source to use the development version

Using pip

pip install git+ssh://git@github.com/genomewalker/dmg-reads.git

By cloning in a dedicated conda environment

git clone git@github.com:genomewalker/dmg-reads.git
cd dmg-reads
conda env create -f environment.yml
conda activate dmg-reads
pip install -e .

Usage

dReads will take a TSV file produced from metaDMG and extract the reads from a BAM file. One can select a list of taxa and ranks to extract the reads from.

For a complete list of options:

$ dReads --help
usage: dReads [-h] -b BAM -m METADMG_RESULTS -f METADMG_FILTER [--fb-data FB_DATA] [--fb-filter FB_FILTER]
              [-p PREFIX] [--combine] [--only-damaged] [-T TAXONOMY_FILE] [-r RANK] [-M SORT_MEMORY] [-t THREADS]
              [--chunk-size CHUNK_SIZE] [--debug] [--version]

A simple tool to extract damaged reads from BAM files

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  -b BAM, --bam BAM     The BAM file used to generate the metaDMG results (default: None)
  -m METADMG_RESULTS, --metaDMG-results METADMG_RESULTS
                        A file from metaDMG ran in local mode (default: None)
  -f METADMG_FILTER, --metaDMG-filter METADMG_FILTER
                        Which filter to use for metaDMG results (default: None)

optional arguments:
  --fb-data FB_DATA     A file from filterBAM ran in local mode (default: None)
  --fb-filter FB_FILTER
                        Which filter to use for filterBAM results (default: None)
  -p PREFIX, --prefix PREFIX
                        Prefix used for the output files (default: None)
  --combine             If set, the reads damaged and non-damaged will be combined in one fastq file (default:
                        False)
  --only-damaged        If set, only the reads damaged will be extracted (default: False)
  -T TAXONOMY_FILE, --taxonomy-file TAXONOMY_FILE
                        A file containing the taxonomy of the BAM references in the format
                        d__;p__;c__;o__;f__;g__;s__. (default: None)
  -r RANK, --rank RANK  Which taxonomic group and rank we want to get the reads extracted. (default: None)
  -M SORT_MEMORY, --sort-memory SORT_MEMORY
                        Set maximum memory per thread for sorting; suffix K/M/G recognized (default: 1G)
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)
  --chunk-size CHUNK_SIZE
                        Chunk size for parallel processing (default: None)
  --debug               Print debug messages (default: False)
  --version             Print program version

One would run dReads as:

dReads -m RISE505_MA873_L1.tp-mdmg.local.weight-1.csv.gz -b RISE505_MA873_L1.dedup.filtered.sorted.bam -f '{ "damage": 0.1, "significance": 2 }' --prefix RISE505_MA873_L1 --taxonomy-file gtdb-r202-organelles-viruses.tax.tsv --rank '{"genus": "Yersinia", "class":"Bacilli"}

The filter is a JSON object where the keys are one of the metaDMG results headers. If --taxonomy-file and --rank are set, the reads will be extracted from the selected taxonomic group and rank. If --only-damaged is set, only the damaged reads will be extracted. If --combine is set, the damaged, non-damaged and multi-mapped reads will be combined in one fastq file.

The previous command will produce the following files:

├── RISE505_MA873_L1.c__Bacilli.non-damaged.fastq.gz
└── RISE505_MA873_L1.g__Yersinia.damaged.fastq.gz

If the --combine and the --only-damaged flag are not set, dReads will produce three files per taxa/rank or BAM file:

  • *.damaged.fastq.gz: The reads mapped to a reference that shows damage
  • *.non-damaged.fastq.gz: The reads mapped to a reference that does not show damage
  • *.multi.fastq.gz: The reads mapped to multiple references which are damaged and non-damaged

Using taxonomies

To be able to extract reads from specific taxa and/or ranks, one needs to provide a taxonomy file. This file should be a TSV file with the following format:

ACCESSION\td__Bacteria;l__Bacteria;k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Yersinia;s__Yersinia pestis

ACCESSION is the reference accession in the BAM file. The taxonomy is separated by ; and the taxonomic groups are separated by __. The taxonomic groups recognized by dReads in --taxonomy-file and --rank are:

  • domain: d__
  • lineage: l__
  • kingdom: k__
  • phylum: p__
  • class: c__
  • order: o__
  • family: f__
  • genus: g__
  • species: s__

Note: The taxonomic groups are case sensitive and one can include as many as desired. For example, if one wants to extract the reads from the genus Yersinia and the class Bacilli, one would use --rank '{"genus": "Yersinia", "class":"Bacilli"}.

Using the results from filterBAM

If the results from filterBAM are available, one can use them to extract the reads. To do so, one needs to provide the --fb-data and --fb-filter arguments. The --fb-data argument should be the path to the filterBAM results file and the --fb-filter argument should be the filter we want to use to filter the references. For example:

dReads -m RISE505_MA873_L1.tp-mdmg.local.weight-1.csv.gz -b RISE505_MA873_L1.dedup.filtered.sorted.bam -f '{ "damage": 0.1, "significance": 2 }' --prefix RISE505_MA873_L1 --taxonomy-file gtdb-r202-organelles-viruses.tax.tsv --rank '{"genus": "Yersinia", "class":"Bacilli"}' --fb-data RISE505_MA873_L1.dedup_stats-filtered.tsv.gz --fb-filter '{"breadth": 0.171, "n_alns": 249}' --threads 4

Note: The final number number of reads might not correspond to the number of reads in the BAM file. The reason is that if you are allowing multiple alignments for each read, the reads might be mapped to multiple references. In this case, the reads will be counted multiple times, for example, a read might map to a certain references, but also map to a reference that might be discarded. In this case, the read will be counted twice, once for the reference that is not discarded and once for the reference that is discarded.