HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data
haystac
is a light-weight, fast, and user-friendly species identification tool. It evaluates the presence of a
particular species of interest in a metagenomic sample, and provides statistical support for the species assignment.
The method is designed to estimate the probability that a specific taxon is present in a metagenomic sample given a set
of sequencing reads and a database of reference genomes. It works equally well with both modern and ancient DNA sequence
data.
haystac
can be run on either macOS or Linux based systems.
The recommended way to install haystac
, and all if its dependencies, is via the mamba
package manager (a fast replacement for conda).
You can install haystac
using conda
, however, it will take significantly longer to install and analyses will run slower.
If you do not have either mamba
or conda
already installed, please refer to the install instructions for mambaforge
.
If you have conda
installed, but not mamba
, then install mamba
into the base environment:
conda install -n base -c conda-forge mamba
Then use mamba
to install haystac
into a new environment:
mamba create -c conda-forge -c bioconda -n haystac haystac
And activate the environment:
conda activate haystac
We recommend that you install haystac
into a new environment to avoid dependency conflicts with other software.
haystac
consists of three main modules:
-
database
for building a database of reference genomes -
sample
for pre-processing of samples prior to analysis -
analyse
for analysing a sample against a database
To begin using haystac
we firstly need to construct a database containing all species of interest to our study. In our
preprint, we show that haystac
makes robust species
identifications with genus specific databases (for prokaryotes), allowing for very fast hypothesis driven analyses.
In this example, we will build a database containing all species in the Yersinia genus, by supplying haystac
with a
simple NBCI search query.
haystac database \
--mode build \
--query '"Yersinia"[Organism] AND "complete genome"[All Fields]' \
--output yersinia_db
To construct an NCBI search query for your area of interest, visit the NCBI Nucleotide database and use the search feature to obtain a correctly formatted query string from
the "Search details" box. This search query can be used directly with haystac
to automatically download and build
a reference database based on the accession codes present in the resultset returned by the query.
For more exhaustive analyses, you can build a database containing the 5,681 species present in the RefSeq representative database of prokaryotic species by running:
haystac database \
--mode build \
--refseq-rep prokaryote_rep \
--output refseq_db
Note: Building a database this big is not recommended on a laptop computer.
The second step in using haystac
is to prepare a sample for analysis.
In this example, we will download an aDNA library from Rasmussen et al. (2015),
by giving haystac
the SRA accession code ERR1018966.
Most published genomics papers include a BioProject code (e.g. PRJEB10885), from which you can obtain SRA accessions for each sequencing
library.
haystac sample \
--sra ERR1018966 \
--output ERR1018966
To prepare a sample of your own, you will need either single-end or paired-end short read sequencing data in
fastq
format.
For a paired-end library, you specify the location of the fastq
files and the name of
the output directory. You may also choose to collapse overlapping mate pairs (e.g. for an aDNA library).
haystac sample \
--fastq-r1 /path/to/sample1_R1.fq.gz \
--fastq-r2 /path/to/sample1_R2.fq.gz \
--collapse True \
--output sample1
By default, haystac
will scan the supplied library, identify adapter sequences, and automatically remove them.
The third step in using haystac
is to perform an analysis of a sample against a database.
Here, we will use haystac
to calculate the mean posterior abundance of all species in the Yersinia genus found within
the sample ERR1018966
.
haystac analyse \
--mode abundances \
--database yersinia_db\
--sample ERR1018966 \
--output yersinia_ERR1018966
When the analysis is complete, there will be several new sub-folders in the output directory yersinia_ERR1018966/
. To
determine if sample ERR1018966
contains Yersinia pestis (i.e. the plague) we can consult the spreadsheet containing
the mean posterior abundance estimates for all species in the Yersinia database (i.e.,
yersinia_ERR1018966/probabilities/ERR1018966/ERR1018966_posterior_abundance.tsv
). From this, we can see that 3,266
reads were uniquely assigned to Yersinia pestis, with an overall abundance of 0.047%, and that the chi-squared test
indicates that the reads are spread evenly across the genome.
Before we can confidently conclude that ERR1018966
contains ancient Yersinia pestis, we may want to perform a
damage pattern analysis.
haystac analyse \
--mode abundances \
--database yersinia_db\
--sample ERR1018966 \
--output yersinia_ERR1018966 \
--mapdamage True
haystac
has many features and potential uses, and we encourage you to use module help menus (e.g. haystac database --help
)
to explore these options. The full user documentation is available here: https://haystac.readthedocs.io/en/master/
haystac
is under active development and we encourage you to report any issues you encounter via the GitHub issue
tracker.
A preprint describing haystac
is available on bioRxiv:
Dimopoulos, E.A.*, Carmagnini, A.*, Velsko, I.M., Warinner, C., Larson, G., Frantz, L.A.F., Irving-Pease, E.K., 2020. HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data. bioRxiv 2020.12.16.419085. https://www.biorxiv.org/content/10.1101/2020.12.16.419085v1
MIT