spacegraphcats
Explore large, annoying graphs using hierarchies of dominating sets - because in space, no one can hear you miao!
This is a collaboration between the Theory In Practice lab at University of Utah, the Lab for Data Intensive Biology at UC Davis, and Dr. Felix Reidl at Birkbeck University of London. Initial development of spacegraphcats was generously supported by the Moore Foundation's Data Driven Discovery Initiative.
Documentation
This README file contains quickstart information. For use cases and other information, please see the spacegraphcats documentation at https://spacegraphcats.github.io/spacegraphcats.
Installation and execution quickstart
See installation instructions and the run guide.
For help or support with this software, please file an issue on GitHub. Thank you!
Quickstart
There are two quickstart examples available! Please see dory-example and twofoo-example. The latter example includes a snakemake Snakefile.
Notable dependencies
spacegraphcats uses code from BBHash, a C++ library for building minimal perfect hash functions (Guillaume Rizk, Antoine Limasset, Rayan Chikhi; see Limasset et al., 2017, arXiv), as wrapped by pybbhash.
spacegraphcats also uses functionality from khmer and sourmash.
Citation information
See the Genome Biology publication Exploring neighborhoods in large metagenome assembly graphs using spacegraphcats reveals hidden sequence diversity, Brown et al., 2020, doi: https://doi.org/10.1186/s13059-020-02066-4.
Pointers to interesting code
Interesting algorithms
The rdomset code for efficiently calculating a dominating set of a graph at a given radius R is in spacegraphcats/catlas/rdomset.py.
The graph denoising code for removing low-abundance pendants from BCALM cDBGs is in the function contract_degree_two in cdbg/bcalm_to_gxt.py.
Part of the indexPieces code for indexing cDBG nodes by dominating nodes is in cdbg/index_cdbg_by_kmer.py. The remainder is implemented in search, below.
The search code for extracting query neighborhoods is in search/query_by_sequence.py; see especially the call to kmer_idx.count_cdbg_matches(...).
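For readers unfamiliar with the term, a dominating set at radius R is a set of nodes such that every node in the graph is within R hops of some member of the set. The following is a minimal, illustrative sketch using a simple greedy heuristic on a small adjacency dict. It is not the algorithm implemented in spacegraphcats/catlas/rdomset.py, and the names ball, greedy_rdomset, and is_rdominating are hypothetical helpers introduced only for this example.

```python
from collections import deque


def ball(adj, source, radius):
    """Return all nodes within `radius` hops of `source` (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand past the radius
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)


def greedy_rdomset(adj, radius):
    """Pick nodes greedily until every node is within `radius` of a pick."""
    balls = {v: ball(adj, v, radius) for v in adj}
    uncovered = set(adj)
    domset = set()
    while uncovered:
        # Choose the node whose ball covers the most still-uncovered nodes.
        best = max(adj, key=lambda v: len(balls[v] & uncovered))
        domset.add(best)
        uncovered -= balls[best]
    return domset


def is_rdominating(adj, domset, radius):
    """Check that every node lies within `radius` hops of the set."""
    covered = set()
    for d in domset:
        covered |= ball(adj, d, radius)
    return covered == set(adj)


# Toy input: a path graph 0-1-2-3-4-5-6 as an adjacency dict.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
doms = greedy_rdomset(path, radius=1)
print(sorted(doms), is_rdominating(path, doms, 1))
```

The greedy choice gives a valid but not necessarily minimum dominating set, and precomputing every ball is only practical for small graphs like this toy example.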
Interesting library functionality
Code for indexing large FASTQ/FASTA read files by cDBG unitig, and for extracting the reads corresponding to individual unitigs from BGZF files, is available in cdbg/label_cdbg.py and in the get_reads_by_cdbg function of search/search_utils.py, respectively.
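The general idea of indexing reads by unitig and then seeking directly to them can be sketched with a toy byte-offset index. This example makes several simplifying assumptions: it uses an uncompressed in-memory FASTA rather than a BGZF file, the read-to-unitig mapping (labels) is made up for illustration, and index_reads and get_reads are hypothetical names. It does not reproduce the code in cdbg/label_cdbg.py or get_reads_by_cdbg.

```python
import io
from collections import defaultdict


def index_reads(handle, read_to_unitig):
    """Map unitig ID -> list of byte offsets where its FASTA records start."""
    index = defaultdict(list)
    while True:
        offset = handle.tell()  # position of the line about to be read
        header = handle.readline()
        if not header:
            break  # end of file
        if header.startswith(b">"):
            name = header[1:].split()[0].decode()
            index[read_to_unitig[name]].append(offset)
    return dict(index)


def get_reads(handle, index, unitig_id):
    """Yield (name, sequence) for each read indexed under `unitig_id`."""
    for offset in index.get(unitig_id, []):
        handle.seek(offset)  # jump straight to the record
        name = handle.readline()[1:].split()[0].decode()
        seq = []
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):
                break  # next record begins
            seq.append(line.strip().decode())
        yield name, "".join(seq)


# Toy data: three reads, with a made-up assignment to unitig IDs 7 and 9.
fasta = io.BytesIO(b">r1\nACGT\n>r2\nGGCC\nTTAA\n>r3\nCATG\n")
labels = {"r1": 7, "r2": 9, "r3": 7}
idx = index_reads(fasta, labels)
reads_for_7 = list(get_reads(fasta, idx, 7))
print(reads_for_7)
```

Storing offsets rather than sequences keeps the index small, and each lookup reads only the records it needs instead of scanning the whole file.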