org.genomicsdb:genomicsdb

Java API for GenomicsDB


Keywords
bioinformatics, cpp, gatk, genomics, genomicsdb, java, mpi, precision-medicine, scala, spark, variant, variant-calling
Licenses
MIT/GFDL-1.3-or-later

Documentation

GenomicsDB is a high-performance, scalable data store written in C++ for importing, querying and transforming genomic variant data. It is built on top of a fork of htslib and a tile-based array storage system. Variant data is sparse relative to the whole genome, so a sparse array data store is a natural fit for it. See genomicsdb.readthedocs.io for documentation and usage.
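
Why sparse storage suits variant data can be illustrated with a toy example. The sketch below is not GenomicsDB's storage layout or API; `SparseVariantColumn` and its methods are hypothetical names invented for illustration. It keeps only the positions that carry a variant in a sorted map, so memory grows with the number of variants rather than with the length of the genome.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Toy sparse store (hypothetical): only positions that carry a variant use memory. */
final class SparseVariantColumn {
  // Sorted map from genomic position to the alternate allele observed there.
  private final TreeMap<Long, String> cells = new TreeMap<>();

  /** Records a variant call at the given position. */
  void put(long position, String altAllele) {
    cells.put(position, altAllele);
  }

  /** Returns the variant at a position, or empty if that cell was never written. */
  Optional<String> get(long position) {
    return Optional.ofNullable(cells.get(position));
  }

  /** Returns all stored variants whose position lies in [start, end]. */
  Map<Long, String> query(long start, long end) {
    return cells.subMap(start, true, end, true);
  }

  /** Number of cells actually stored, independent of the genome length. */
  int storedCells() {
    return cells.size();
  }
}
```

For instance, after storing variants at positions 1,000, 2,500,000 and 3,000,000,000, `storedCells()` returns 3 and `query(0, 3_000_000)` returns only the two variants that fall in that range.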

  • Supported platforms: Linux and macOS.
  • Supported filesystems: POSIX, HDFS, EMRFS (S3), GCS and Azure Blob.

Included are

  • JVM/Spark wrappers that, among other functions, allow streaming VariantContext buffers to and from the C++ layer. GenomicsDB jars that bundle the native libraries and depend only on zlib are regularly published on Maven Central.
  • High-performance native tools for incremental ingestion of VCF/BCF/CSV variant data into GenomicsDB.
  • MPI and Spark support for parallel querying of GenomicsDB.

GenomicsDB is packaged into gatk4, and its quality benefits from gatk4's large user base.

External Contributions

GenomicsDB is open source and all participation is welcome. GenomicsDB is released under the MIT License and all external contributors are expected to grant an MIT License for their contributions.

Checklist before creating Pull Request

Please ensure that Java/Scala code is well documented in Javadoc style. For Java/C/C++ code formatting, roughly adhere to the Google Style Guides. See the GenomicsDB Style Guide.
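
As a rough illustration of the documentation style requested above, the sketch below shows a Javadoc-commented method with two-space indentation in the spirit of the Google Java Style Guide. `IntervalUtils` and `overlaps` are hypothetical names, not part of the GenomicsDB code base, and the project's own style guide takes precedence where it differs.

```java
/** Helpers for closed genomic intervals (hypothetical example class). */
final class IntervalUtils {
  private IntervalUtils() {}

  /**
   * Checks whether two closed intervals share at least one position.
   *
   * @param startA first position of interval A (inclusive)
   * @param endA last position of interval A (inclusive)
   * @param startB first position of interval B (inclusive)
   * @param endB last position of interval B (inclusive)
   * @return {@code true} if the intervals overlap, {@code false} otherwise
   * @throws IllegalArgumentException if either interval starts after it ends
   */
  static boolean overlaps(long startA, long endA, long startB, long endB) {
    if (startA > endA || startB > endB) {
      throw new IllegalArgumentException("interval start must not exceed end");
    }
    // Two closed intervals overlap when each one starts before the other ends.
    return startA <= endB && startB <= endA;
  }
}
```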