dReads: a tool to extract damaged reads from BAM files
A simple tool to extract damaged reads from BAM files
Installation
We recommend having conda installed to manage the virtual environments
Using pip
First, we create a conda virtual environment with:
wget https://raw.githubusercontent.com/genomewalker/dmg-reads/master/environment.yml
conda env create -f environment.yml
Then we proceed to install using pip:
pip install dmg-reads
Using mamba
mamba install -c conda-forge -c bioconda -c genomewalker dmg-reads
Install from source to use the development version
Using pip
pip install git+ssh://git@github.com/genomewalker/dmg-reads.git
By cloning in a dedicated conda environment
git clone git@github.com:genomewalker/dmg-reads.git
cd dmg-reads
conda env create -f environment.yml
conda activate dmg-reads
pip install -e .
Usage
dReads will take a TSV file produced from metaDMG and extract the reads from a BAM file. One can select a list of taxa and ranks to extract the reads from.
For a complete list of options:
$ dReads --help
usage: dReads [-h] -b BAM -m METADMG_RESULTS -f METADMG_FILTER [--fb-data FB_DATA] [--fb-filter FB_FILTER]
[-p PREFIX] [--combine] [--only-damaged] [-T TAXONOMY_FILE] [-r RANK] [-M SORT_MEMORY] [-t THREADS]
[--chunk-size CHUNK_SIZE] [--debug] [--version]
A simple tool to extract damaged reads from BAM files
optional arguments:
-h, --help show this help message and exit
required arguments:
-b BAM, --bam BAM The BAM file used to generate the metaDMG results (default: None)
-m METADMG_RESULTS, --metaDMG-results METADMG_RESULTS
A file from metaDMG ran in local mode (default: None)
-f METADMG_FILTER, --metaDMG-filter METADMG_FILTER
Which filter to use for metaDMG results (default: None)
optional arguments:
--fb-data FB_DATA A file from filterBAM ran in local mode (default: None)
--fb-filter FB_FILTER
Which filter to use for filterBAM results (default: None)
-p PREFIX, --prefix PREFIX
Prefix used for the output files (default: None)
--combine If set, the reads damaged and non-damaged will be combined in one fastq file (default:
False)
--only-damaged If set, only the reads damaged will be extracted (default: False)
-T TAXONOMY_FILE, --taxonomy-file TAXONOMY_FILE
A file containing the taxonomy of the BAM references in the format
d__;p__;c__;o__;f__;g__;s__. (default: None)
-r RANK, --rank RANK Which taxonomic group and rank we want to get the reads extracted. (default: None)
-M SORT_MEMORY, --sort-memory SORT_MEMORY
Set maximum memory per thread for sorting; suffix K/M/G recognized (default: 1G)
-t THREADS, --threads THREADS
Number of threads (default: 1)
--chunk-size CHUNK_SIZE
Chunk size for parallel processing (default: None)
--debug Print debug messages (default: False)
--version Print program version
One would run dReads
as:
dReads -m RISE505_MA873_L1.tp-mdmg.local.weight-1.csv.gz -b RISE505_MA873_L1.dedup.filtered.sorted.bam -f '{ "damage": 0.1, "significance": 2 }' --prefix RISE505_MA873_L1 --taxonomy-file gtdb-r202-organelles-viruses.tax.tsv --rank '{"genus": "Yersinia", "class":"Bacilli"}
The filter is a JSON object where the keys are one of the metaDMG results headers. If --taxonomy-file
and --rank
are set, the reads will be extracted from the selected taxonomic group and rank. If --only-damaged
is set, only the damaged reads will be extracted. If --combine
is set, the damaged, non-damaged and multi-mapped reads will be combined in one fastq file.
The previous command will produce the following files:
├── RISE505_MA873_L1.c__Bacilli.non-damaged.fastq.gz
└── RISE505_MA873_L1.g__Yersinia.damaged.fastq.gz
If the --combine
and the --only-damaged
flag are not set, dReads
will produce three files per taxa/rank or BAM file:
-
*.damaged.fastq.gz
: The reads mapped to a reference that shows damage -
*.non-damaged.fastq.gz
: The reads mapped to a reference that does not show damage -
*.multi.fastq.gz
: The reads mapped to multiple references which are damaged and non-damaged
Using taxonomies
To be able to extract reads from specific taxa and/or ranks, one needs to provide a taxonomy file. This file should be a TSV file with the following format:
ACCESSION\td__Bacteria;l__Bacteria;k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Yersinia;s__Yersinia pestis
ACCESSION
is the reference accession in the BAM file. The taxonomy is separated by ;
and the taxonomic groups are separated by __
. The taxonomic groups recognized by dReads
in --taxonomy-file
and --rank
are:
-
domain:
d__
-
lineage:
l__
-
kingdom:
k__
-
phylum:
p__
-
class:
c__
-
order:
o__
-
family:
f__
-
genus:
g__
-
species:
s__
Note: The taxonomic groups are case sensitive and one can include as many as desired. For example, if one wants to extract the reads from the genus Yersinia and the class Bacilli, one would use
--rank '{"genus": "Yersinia", "class":"Bacilli"}
.
Using the results from filterBAM
If the results from filterBAM
are available, one can use them to extract the reads. To do so, one needs to provide the --fb-data
and --fb-filter
arguments. The --fb-data
argument should be the path to the filterBAM
results file and the --fb-filter
argument should be the filter we want to use to filter the references. For example:
dReads -m RISE505_MA873_L1.tp-mdmg.local.weight-1.csv.gz -b RISE505_MA873_L1.dedup.filtered.sorted.bam -f '{ "damage": 0.1, "significance": 2 }' --prefix RISE505_MA873_L1 --taxonomy-file gtdb-r202-organelles-viruses.tax.tsv --rank '{"genus": "Yersinia", "class":"Bacilli"}' --fb-data RISE505_MA873_L1.dedup_stats-filtered.tsv.gz --fb-filter '{"breadth": 0.171, "n_alns": 249}' --threads 4
Note: The final number number of reads might not correspond to the number of reads in the BAM file. The reason is that if you are allowing multiple alignments for each read, the reads might be mapped to multiple references. In this case, the reads will be counted multiple times, for example, a read might map to a certain references, but also map to a reference that might be discarded. In this case, the read will be counted twice, once for the reference that is not discarded and once for the reference that is discarded.