github.com/will-rowe/hulk

Histosketching Using Little Kmers


Keywords
genomics, hashing, microbiome, sketching
License
MIT
Install
go get github.com/will-rowe/hulk

Documentation

UPDATE: JULY 2019

I no longer work for STFC. All versions of HULK pre 1.0.0 have been renamed and archived to the STFC github. The STFC Hartree Centre are building genomic solutions based on these and other tools - if you are interested, please contact them.

This repo now hosts HULK >= version 1.0.0, which is a complete re-implementation of HULK based solely on the method described in the open-access paper.

I've tried to keep much of the syntax and existing functionality, but make sure to check the change log below. It's a work in progress, but the master branch should be a close drop-in replacement for the old HULK (for sketching at least). There are a few algorithmic differences, mainly that HULK now uses minimizer frequencies to represent the underlying microbiome sample.

Importantly, this project is now fully open source!

Overview

HULK is a tool that creates small, fixed-size sketches from streaming microbiome sequencing data, enabling rapid metagenomic dissimilarity analysis. HULK approximates a k-mer spectrum from a FASTQ data stream, incrementally sketches it and makes similarity search queries against other microbiome sketches.

HULK works by collecting minimizers from sequences. Minimizers are assigned to a finite number of histogram bins using a consistent jump hash; these bins are incremented as their corresponding minimizers are found. At set intervals (i.e. after X sequences have been processed), the bins are histosketched by HULK. Similarly to MinHash sketches, histosketches can be used to estimate similarity between sequence data sets.
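As a rough illustration of the binning step described above, here is a minimal Go sketch. It is not HULK's actual code: the function names, the FNV-1a k-mer hash, the window size and the bin count are all assumptions made for this example, and it ignores reverse complements. The jump hash follows the published jump consistent hash algorithm (Lamping and Veach). The periodic histosketching of the bins is not shown.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// jumpHash maps a 64-bit key to a bucket in [0, numBuckets) using the
// jump consistent hash of Lamping and Veach.
func jumpHash(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

// hashKmer hashes a k-mer with FNV-1a (an assumption for this example).
func hashKmer(kmer string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(kmer))
	return h.Sum64()
}

// minimizers returns, for each window of w consecutive k-mers, the k-mer
// with the smallest hash. A minimizer is recorded once per position, so
// overlapping windows that share a minimizer do not duplicate it.
func minimizers(seq string, k, w int) []string {
	var out []string
	numKmers := len(seq) - k + 1
	last := -1
	for start := 0; start+w <= numKmers; start++ {
		best := start
		bestHash := hashKmer(seq[start : start+k])
		for i := start + 1; i < start+w; i++ {
			if h := hashKmer(seq[i : i+k]); h < bestHash {
				best, bestHash = i, h
			}
		}
		if best != last {
			out = append(out, seq[best:best+k])
			last = best
		}
	}
	return out
}

// binMinimizers increments one histogram bin per minimizer found.
func binMinimizers(seqs []string, k, w int, numBins int32) []uint32 {
	bins := make([]uint32, numBins)
	for _, s := range seqs {
		for _, m := range minimizers(s, k, w) {
			bins[jumpHash(hashKmer(m), numBins)]++
		}
	}
	return bins
}

func main() {
	// Toy reads; real input would be a FASTQ stream.
	reads := []string{"ACGTTGCATGCA", "TTGCATGCAACG"}
	fmt.Println(binMinimizers(reads, 5, 4, 16))
}
```

Because the histogram has a fixed number of bins, its memory footprint stays constant however many sequences are streamed through it.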

The advantages of HULK include:

  • it's fast and can run on a laptop
  • hulk sketches are compact, fixed size and incorporate k-mer frequency information
  • it works on data streams and does not require complete data instances
  • it can use concept drift for histosketching
  • you get to type hulk smash into the command line...

Finally, you can use hulk sketches with a machine learning classifier to predict microbiome sample origin (see the paper and BANNER).

Change log

version 1.0.1 (dev branch)

  • WASM interface
    • run HULK locally and from a browser
    • based on my baby-GROOT user interface
  • HULK will output additional sketches
    • KMV MinHash
    • HyperMinHash
  • Indexing
    • re-implementation of the LSH Forest index

version 1.0.0 (current release)

  • fully re-written codebase
    • I've aimed for it to be largely backwards compatible with previous releases
  • fully open-sourced!
  • algorithm changes
    • underlying histogram is now based on minimizer frequencies
    • count-min sketch for k-mer frequencies is now replaced with a fixed-size array and a jump-hash for minimizer placement
  • changes to the sketch subcommand:
    • sketches saved to JSON by default (à la sourmash)
    • histosketch count-min sketch is no longer configurable by the user (this was Epsilon and Delta)
    • spectrum size is determined based on k-mer size
    • minCount for k-mer frequencies is removed
  • changes to the smash subcommand:
    • operates on JSON input
    • outputs matrix as csv
  • replaced some unnecessary features
    • the functionality of the print and distance subcommands is available in the smash subcommand

pre version 1.0.0

  • all versions of HULK (and BANNER) pre v1.0.0 have been moved to the UKRI github and renamed. I can no longer work on these code bases.

Installation

Check out the releases to download a binary. Alternatively, install using Bioconda or compile the software from source.

Bioconda

For versions <1.0.0, use bioconda. I will add the recipe for HULK 1.0.0 asap.

conda install -c bioconda hulk

Source

HULK is written in Go (v1.12). To compile from source you will first need the Go toolchain. Once you have it, try something like this:

# Clone this repository
git clone https://github.com/will-rowe/hulk.git

# Go into the repository and get the package dependencies
cd hulk
go get -d -t -v ./...

# Run the unit tests
go test -v ./...

# Compile the program
go build ./

# Call the program
./hulk --help

Quick Start

HULK is called by typing hulk, followed by the subcommand you wish to run. The main subcommands are sketch and smash:

# Create a hulk sketch
gunzip -c microbiome.fq.gz | hulk sketch -o sketches/sampleA

# Get a pairwise weighted Jaccard similarity matrix for a set of hulk histosketches
hulk smash -k 31 -m weightedjaccard -d ./sketches -o myOutfile
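To make the similarity matrix more concrete, the following Go sketch computes an exact weighted Jaccard similarity, sum(min)/sum(max), between plain count vectors and fills a symmetric pairwise matrix. This is an assumption about what the weightedjaccard metric name refers to, not HULK's implementation: hulk smash works on histosketches and estimates similarity from them, whereas this example works directly on toy histograms. The function names are hypothetical.

```go
package main

import "fmt"

// weightedJaccard returns sum(min(a_i, b_i)) / sum(max(a_i, b_i)) for two
// equal-length vectors of non-negative weights.
func weightedJaccard(a, b []float64) (float64, error) {
	if len(a) != len(b) {
		return 0, fmt.Errorf("length mismatch: %d vs %d", len(a), len(b))
	}
	var minSum, maxSum float64
	for i := range a {
		if a[i] < b[i] {
			minSum += a[i]
			maxSum += b[i]
		} else {
			minSum += b[i]
			maxSum += a[i]
		}
	}
	if maxSum == 0 {
		// Two all-zero vectors are treated as identical here.
		return 1, nil
	}
	return minSum / maxSum, nil
}

// pairwise builds a symmetric similarity matrix for a set of samples.
func pairwise(samples [][]float64) ([][]float64, error) {
	n := len(samples)
	matrix := make([][]float64, n)
	for i := range matrix {
		matrix[i] = make([]float64, n)
	}
	for i := 0; i < n; i++ {
		for j := i; j < n; j++ {
			s, err := weightedJaccard(samples[i], samples[j])
			if err != nil {
				return nil, err
			}
			matrix[i][j], matrix[j][i] = s, s
		}
	}
	return matrix, nil
}

func main() {
	// Toy histograms; samples 0 and 1 score (1+1+0)/(1+2+1) = 0.5.
	samples := [][]float64{{1, 2, 0}, {1, 1, 1}, {0, 4, 2}}
	matrix, err := pairwise(samples)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, row := range matrix {
		fmt.Println(row)
	}
}
```

Unlike the plain Jaccard index, this measure takes bin counts into account, which is why frequency information in the sketch matters.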

Further Information & Citing

I'm working on some new documentation and this will be available on readthedocs soon.

A paper describing the HULK method is published in Microbiome:

Rowe WPM et al. Streaming histogram sketching for rapid microbiome analytics. Microbiome. 2019.