espresso-caller

Automated and reproducible tool for identifying genomic variants at scale


Keywords
bioinformatics, cromwell, ga4gh, openapi, tes, wes
License
MIT
Install
pip install espresso-caller==1.2.2

Documentation

Automated and reproducible tool for identifying genomic variants at scale

In the past, the biggest challenge in processing high-throughput sequencing data was dealing with small datasets of large samples. Genomics projects with thousands of samples have become popular due to decreasing sequencing costs. Now we are facing a new challenge: how to process and store large-scale datasets efficiently.

The Sequence Alignment/Map Format Specification (SAM) and its binary version (BAM) are the most popular and recommended file formats for storing sequence alignment data. However, due to the increasing throughput of sequencing platforms, BAM files have lost their efficiency, as we deal more and more frequently with deep whole-genome sequencing (WGS), whole-exome sequencing (WES) and other large-scale genomics data.

The CRAM file format is an improved version of the SAM format. Instead of storing DNA sequences aligned to a reference genome, it stores only the differences between the sample and the genome sequence. This feature reduces file sizes dramatically. To use this format, common tools, such as Samtools, require the original reference genome file (normally in FASTA format).
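
For example, assuming Samtools is installed, the conversion looks like this (file names, including the reference path, are hypothetical):

# Convert an alignment from BAM to CRAM; the reference FASTA is required
samtools view -C -T /home/data/ref/b37/reference.fasta -o sample.cram sample.bam
# Reading the CRAM back requires the same reference
samtools view -T /home/data/ref/b37/reference.fasta sample.cram | head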

On another front, different computational tools have been developed to address different problems, and they are often chained together in a pipeline to solve a bigger problem. For example, a basic variant calling workflow will:

  1. map the reads from a FASTQ file to a reference genome;
  2. use the allelic counts at each site to call a genotype.
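
As a minimal sketch of these two steps, assuming BWA, Samtools and GATK 4 are installed and using hypothetical file names (read groups and reference indexing are omitted for brevity):

# 1. map paired-end reads to the reference genome and sort the alignments
bwa mem ref.fasta sample_R1.fastq.gz sample_R2.fastq.gz | samtools sort -o sample.bam
samtools index sample.bam
# 2. call genotypes from the allelic counts at each site
gatk HaplotypeCaller -R ref.fasta -I sample.bam -O sample.vcf.gz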

It is often the case that each of the aforementioned steps will be performed by a different tool (like BWA + GATK, as in the sketch above). To assist with this matter (combining different tools, linking the output of one step to the input of the next), platform-agnostic languages were developed. These languages allow us to describe tasks and combine them into workflows, using a dialect closer to the research domain and distant from the computer science domain. Common Workflow Language (CWL), Nextflow and Workflow Description Language (WDL) are the most popular languages for defining workflows that process genomics data. It is important to highlight that these languages only describe what needs to be done.

As an actual user, one expects that the aforementioned tasks/workflows are executed to produce results of interest. For this reason, researchers developed workflow execution systems, which combine workflow files (.cwl, .nf or .wdl) with parameter files (.json or .yaml) and input data to generate a dependency tree and actually execute the processing tasks. The most popular workflow executors are Toil and Cromwell. These tools are flexible enough to support different computing backends, such as local, cluster, custom environments and cloud.

Distributing bioinformatics tools and keeping track of their versions across several workflows and computing nodes has become a big problem, both for system administrators (system updates break user tools) and for users (an updated version of a tool outputs different results, compromising the reproducibility of the research -- for more information see the FAIR Guiding Principles). To solve this problem the research community adopted container technologies, mainly Singularity and Docker. Projects like BioContainers provide thousands of container images of bioinformatics tools. Also, Docker Hub is the main repository for popular open-source projects.
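
For instance, a pinned tool version can be fetched from Docker Hub with either runtime (the image name and tag below are illustrative):

# Docker: pull a specific, reproducible GATK version
docker pull broadinstitute/gatk:4.1.0.0
# Singularity: build a local image from the same Docker image
singularity pull docker://broadinstitute/gatk:4.1.0.0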

Back to workflows... the workflow file + inputs file combination has been working well for small datasets. With our wftools software, we can submit workflows (and their inputs) to execution services, such as Cromwell in server mode, check workflow execution status and logs, and collect output files, copying them to the desired directory. The inputs JSON file must contain workflow-specific parameters to work; for example, for a workflow that requires FASTQ files as input we have to provide a list of absolute paths to FASTQ files (two lists if paired-end), while other workflows may require several resource files that are specific to different genome versions. It is important to note that Cromwell checks file existence immediately prior to use within a task, which is associated with several downstream errors that force the system to abort the execution of the workflow. Additionally, the inputs JSON files are produced manually.
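
As an illustration of such a manually produced file, a minimal inputs JSON for a hypothetical paired-end workflow (all names below are made up for the example) could be written like this:

cat > inputs.json <<'EOF'
{
  "MyWorkflow.fastq_r1": ["/home/data/raw/sampleA_R1.fastq.gz"],
  "MyWorkflow.fastq_r2": ["/home/data/raw/sampleA_R2.fastq.gz"],
  "MyWorkflow.reference_fasta": "/home/data/ref/b37/reference.fasta"
}
EOF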

Espresso-Caller is a tool that automates the execution of workflows for identifying genomic variants in whole-exome and whole-genome datasets. It collects raw paired-end FASTQ or raw gVCF files, together with resource files (b37 or hg38 versions), checking that they exist, to generate the JSON file that is used as input of the workflows for processing WES/WGS data. Our software adopts some conventions to automatically find the required files and to define sample metadata. Internally, it submits WDL files to a Cromwell server through the Cromwell API. The next section introduces the workflows, the section after it provides examples of command lines and explains arguments, conventions and outputs, and the last section shows advanced uses of espresso. This document ends with the actual usage help.
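
Submitting through the Cromwell API is roughly equivalent to the following request (assuming a Cromwell server listening on localhost:8000 and the file names generated by espresso):

curl -X POST http://localhost:8000/api/workflows/v1 \
	-F workflowSource=@haplotype-calling.wdl \
	-F workflowInputs=@haplotype-calling.b37.inputs.json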

Workflows

espresso provides three workflows: hc (HaplotypeCalling) takes raw FASTQ files and their metadata as input, plus resource files, and generates per-sample raw gVCF alongside unmapped BAM and analysis-ready CRAM files; joint (JointDiscovery) takes raw gVCF files and merges them into a single unified VCF; all executes all data processing steps, from raw FASTQs to unified VCF.

Our tool bundles in-house workflows and workflows defined by the Broad Institute as GATK Best Practices, all in WDL format. The workflow files are stored inside the tool package. The list below presents each workflow in the order in which it is executed by this tool.

HaplotypeCalling runs the following workflows:

  1. Convert raw paired-end FASTQ files to unmapped BAM format (uBAM)
  2. Map sequences to reference genome, recalibrate alignments and generate analysis-ready BAM files
  3. Produce sample-specific sufficient statistics to call variants
  4. Validate BAM files
  5. Convert BAM files to CRAM format

In JointDiscovery, we combine the sample-specific sufficient statistics to produce the final genotype calls. For this step, we use the original WDL file produced by the Broad Institute.

Espresso-Caller follows convention over configuration (where the configuration is the inputs JSON file). Your data files must follow the conventions below to work:

For all and hc:

  • Partial FASTQ files were previously merged into a single one
  • Each sample is paired-end sequencing data split into two FASTQ files by read direction: forward and reverse
  • FASTQ files are located in the same directory, one directory for each library name/batch
  • FASTQ file names match this pattern: (sample_name)_R?[12].fastq(.gz)?
  • FASTQ sequence headers match this pattern: @_:_:(sample_id):(flowcell):_:_:_:_:_:(primer), which is merged as sample_id.flowcell.primer (see the example listing after this list)
  • Resource files, including reference genome files, are in the same directory, one directory for each version
  • Resource files have the same names as in their download URLs
  • The destination directory does not exist or is empty; any conflicting file will be silently overwritten
  • Relative paths are allowed in command arguments
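
For example, a batch directory that follows these conventions could look like this (sample names and header fields are hypothetical):

$ ls /home/data/raw/batch1
NA12878_R1.fastq.gz  NA12878_R2.fastq.gz
NA12891_R1.fastq.gz  NA12891_R2.fastq.gz
$ zcat /home/data/raw/batch1/NA12878_R1.fastq.gz | head -n 1
@machine:run:NA12878:HGV2YDSXX:1:1101:1000:2000:1:ACGT

Here the sample name is NA12878, and the platform unit would be assembled as NA12878.HGV2YDSXX.ACGT.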

Resource files can be downloaded using the following scripts. Downloaded files will have the right names to work with espresso.

To download the resource files, run one of the commands below. It is important to inform the absolute path to the destination directory.

bash download_resources_b37_files.sh /home/data/ref/b37
bash download_resources_hg38_files.sh /home/data/ref/hg38

The following command will run espresso with two directories containing raw FASTQ files. Sample files can be separated into directories in order to use different metadata for each group of samples (or batch). If the --fastq <fastq directory> argument is defined twice, you have to inform each of the following arguments twice:

  • --library <library name> Library (LB SAM tag)
  • --date Sequencing run date (DT SAM tag) - must be in ISO8601 format (YYYY-MM-DD)!
  • --platform Sequencing platform (PL SAM tag)
  • --center Sequencing center (CN SAM tag)

Also, the sample name (SM) and platform unit (PU) are extracted from the FASTQ files automatically, using the predefined conventions.

Next, we have to inform the path to the resource files (--reference <resources directory>) and the reference genome version (--version <genome version>; only b37 and hg38 are supported).

Some computing environments do not support container technology. In this case, we have to inform the absolute paths to the required software. See the installing required software script.

  • --gatk_path_override path to GATK version 4; it must point to the gatk wrapper script (not the JAR file)
  • --gotc_path_override path to the directory containing all the tools (bwa, picard.jar, etc.)
  • --samtools_path_override path to samtools

The --dont_run flag makes espresso not submit workflows to the Cromwell server. In this mode, the tool will check all required files, copy the required workflow files and generate the inputs JSON file, writing both to the destination directory. It is highly recommended that you run our tool in this mode and check the JSON file before running in production. It is also useful when there are changes to make to the default JSON file before submitting the workflow through espresso.

The optional --move flag will tell espresso to move output files from the Cromwell execution directory to the destination directory. It is useful when processing large-scale genomics datasets, as it avoids file duplication.

The last two arguments are required: the callset name and the destination directory where all files are written.

espresso all \
	--fastq ~/raw/batch1 \
	--fastq ~/raw/batch2 \
	--library Batch1 \
	--library Batch2 \
	--date 2018-10-19 \
	--date 2019-02-07 \
	--platform ILLUMINA \
	--platform ILLUMINA \
	--center MyCenter \
	--center MyCenter \
	--reference ~/ref/b37 \
	--version b37 \
	--gatk_path_override ~/bin/gatk-4.1.0.0/gatk \
	--gotc_path_override ~/bin \
	--samtools_path_override ~/bin/samtools-1.9/samtools \
	--move \
	--dont_run \
	MyDataset.b37 \
	~/res/my_dataset
Starting the haplotype-calling workflow with reference genome version b37
Workflow file: /home/data/res/my_dataset/haplotype-calling.wdl
Inputs JSON file: /home/data/res/my_dataset/haplotype-calling.b37.inputs.json
Workflow will not be submitted to Cromwell. See workflow files in /home/data/res/my_dataset

If you run without the --dont_run argument you will see the text below. As you can see, after the workflow is submitted no output is presented until execution is finished. Then the tool will print the collected output files and exit.

In this version, all runs hc first and then joint. TODO: write a WDL file that does both.

Starting haplotype-calling workflow with reference genome version b37
Workflow file: /home/data/res/my_dataset/haplotype-calling.wdl
Inputs JSON file: /home/data/res/my_dataset/haplotype-calling.b37.inputs.json
Workflow submitted to Cromwell Server (http://localhost:8000)
Workflow id: 9977f400-d1b6-41ff-ab92-7ebbbf7a30a8

Expected outputs

At the end of the execution, espresso will collect the output files, copying them to the destination directory. These are the files you should expect to see.

  • {callset}.vcf.gz is the final unified VCF file where callset was previously defined (index {callset}.vcf.gz.tbi)
  • haplotype-calling.wdl is hc workflow in WDL
  • joint-discovery-gatk4-local.wdl is joint workflow in WDL
  • haplotype-calling.{version}.inputs.json is inputs for hc where version is b37 or hg38
  • joint-discovery.{version}.inputs.json is inputs for joint
  • out.intervals is an intervals file used by joint to call variants on all samples
  • {callset}.variant_calling_detail_metrics is the detailed variant calling metrics file
  • {callset}.variant_calling_summary_metrics is the summary variant calling metrics file

For each sample you should have:

  • {sample}.{version}.g.vcf.gz is the raw gVCF file, where sample was extracted from the FASTQ file name (index {sample}.{version}.g.vcf.gz.tbi)
  • {sample}.{version}.cram is the analysis-ready CRAM file (contains both aligned sequences and, at the end of the file, sequences that weren't mapped)
  • {sample}.{version}.duplicate_metrics is the duplication metrics file
  • {sample}.{version}.validation_.txt is the BAM validation output file
  • {sample}.unmapped.bam contains the unmapped sequences in BAM format (exactly the same sequences as in the FASTQ files but with recalibrated qualities)
  • {sample}.{version}.recal_data.csv is the file used to recalibrate PHRED qualities
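
Putting it together, for a hypothetical callset MyDataset.b37 with a single sample NA12878, the destination directory would contain files such as these (plus the metrics and inputs JSON files listed above):

$ ls ~/res/my_dataset
MyDataset.b37.vcf.gz      NA12878.b37.cram          haplotype-calling.wdl
MyDataset.b37.vcf.gz.tbi  NA12878.b37.g.vcf.gz      joint-discovery-gatk4-local.wdl
NA12878.unmapped.bam      NA12878.b37.g.vcf.gz.tbi  out.intervals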

Running hc or joint individually

It is possible to run only hc with espresso hc ... and only joint with espresso joint .... Running hc is useful when we don't want to generate the unified VCF for these FASTQ files. joint is useful when we already have raw gVCF files and we want to merge data from two or more batches or studies. However, there are some conventions that your VCF files must follow (only if they weren't generated by espresso):

  • VCF file names must match this pattern: (sample_name)(.version)?.g.vcf(.gz)? (the .g.vcf extension is required, so that unified VCFs that may exist in the same directory are skipped)
  • Index files must have the same name plus the .tbi extension and be in the same directory (see the example listing below)
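
For example, a directory of externally produced gVCFs that joint can consume could look like this (sample names are hypothetical):

$ ls /home/data/gvcfs
NA12878.b37.g.vcf.gz      NA12891.b37.g.vcf.gz
NA12878.b37.g.vcf.gz.tbi  NA12891.b37.g.vcf.gz.tbi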

Reproduce data processing

With the WDL and JSON files written to the results directory, it is possible to re-execute the data processing without espresso. To do that, we need the Cromwell binary file. It should also work on a different workflow engine, such as miniwdl, or a workflow submission tool, such as wf.

Run the workflow with Cromwell in local mode

java -jar cromwell.jar run \
	-i /home/data/res/my_dataset/haplotype-calling.b37.inputs.json \
	/home/data/res/my_dataset/haplotype-calling.wdl
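
Assuming miniwdl is installed, the same files should also run with a command along these lines (an untested sketch; miniwdl accepts an inputs JSON via its -i option):

miniwdl run /home/data/res/my_dataset/haplotype-calling.wdl \
	-i /home/data/res/my_dataset/haplotype-calling.b37.inputs.json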