Pipeline for joint calling, sample and variant QC for WGS germline variant calling data


Keywords
bioinformatics
License
MIT
Install
pip install joint-calling==0.1.60

Documentation

Joint calling pipeline

A Hail based pipeline for post-processing and filtering of large scale genomic variant calling datasets.

  1. Combines GVCFs (generated by GATK4) to a Hail Matrix Table.
  2. Performs sample-level QC.
  3. Performs variant QC using a random forest model.
  4. Performs variant QC using a allele-specific VQSR model.

Usage

The workflow should be run using the CPG analysis runner.

Install the CPG analysis runner:

mamba install -c cpg -c conda-forge analysis-runner

Assuming the name of the dataset is fewgenomes,

export DATASET=fewgenomes

Add joint-calling as a submodule to your analysis repositry. E.g., to a branch develop:

cd $DATASET  # change to the dataset repository
git submodule add -f -b develop https://github.com/populationgenomics/joint-calling
git submodule init
git submodule update

Test the workflow using the analysis runner on a given dataset:

# Assuming we already changed to the dataset repository root
DATASET=$(basename $(pwd))
analysis-runner \
--dataset ${DATASET} \
--output-dir "gs://cpg-${DATASET}-hail/joint-calling/test" \
--description "Joint calling, test" \
--access-level test \
joint-calling/driver_for_analysis_runner.sh workflows/batch_workflow.py\
--scatter-count 10 \
--from test \
--to tmp \
--callset ${DATASET} \
--version test-$(VERSION)

This command will use the test access level, which means finding the input GVCFs in the test bucket gs://cpg-$DATASET-test/gvcf/batch1/*.g.vcf.gz, writing the resulting matrix tables to the test-tmp bucket, gs://cpg-fewgenomes-test-tmp/mt/v0.mt, and writing plots to gs://cpg-fewgenomes-test-tmp/joint-calling/v0.

--scatter-count controls the number of secondary workers for Dataproc clusters, as well as the number of shards to parition data for the AS-VQSR analysis.

To use the main bucket for input and output, run the workflow with the full access level:

DATASET=$(basename $(pwd))
analysis-runner \
--dataset ${DATASET} \
--output-dir "gs://cpg-${DATASET}-hail/joint-calling" \
--description "Joint calling, full" \
--access-level full \
joint-calling/driver_for_analysis_runner.sh workflows/batch_workflow.py\
--scatter-count 200 \
--from main \
--to main \
--callset ${DATASET} \
--version $(VERSION) \
--batch batch1 \
--batch batch2 \
--batch batch3

It will find input GVCFs in the main bucket, assuming the batch IDs are batch1, batch2, batch3: gs://cpg-$DATASET-main/gvcf/{batch1,batch2,batch3}/*.g.vcf.gz; write the resulting matrix table to the main bucket: gs://cpg-fewgenomes-main/mt/v0.mt, and plots to gs://cpg-fewgenomes-analysis/joint-calling/v0.

Overview of the pipeline steps

  1. Find inputs. According to the specified --dataset and --batch parameters, look at gs://cpg-<dataset>-main/gvcf/<batch-id>/*.g.vcf.gz (orgs://cpg-<dataset>-test/gvcf/<batch-id>/*.g.vcf.gz) for the GVCFs and a CSV file with QC metadata.

  2. Post-process the GVCFs:

    • Run GATK ReblockGVCFs to annotate with allele-specific VCF INFO fields required for recalibration (QUALapprox, VarDP, RAW_MQandDP),
    • Subset GVCF to non-alt chromosomes.
  3. Run the GVCF combiner using scripts/combine_gvcfs.py. The script merges GVCFs into a sparse Matrix Table using Hail's vcf_combiner.

  4. Run the scripts/sample_qc.py script, that performs the sample-level QC, and generates a Table with the filtered sample IDs, as well as a metadata Table with metrics that were used for filtering (coverage, sex, ancestry, contamination, variant numbers/distributions, etc).

  5. Run the random forest approach to perform the variant QC.

  6. Run the allele-specific VQSR approach to perform the variant QC.

Sample QC

The sample QC and random forest variant QC pipelines are largely a re-implementation and orchestration of the Hail methods used for the quality control of GnomAD release. Good summaries of gnomAD QC pipeline can be found in gnomAD update blog posts:

Here we give a brief overview of the sample QC steps:

  1. Compute sample QC metrics using Hail’s sample_qc module on all autosomal bi-allelic SNVs.

  2. Filter outlier samples using the following cutoffs. Note that the most up to date cutoffs are speified in the configuration file filter_cutoffs.yaml, which can be overridden with --filter-cutoffs-file.

  3. Filter using BAM-level metrics was performed when such metrics were available. We removed samples that were outliers for:

    • Contamination: freemix > 5% (call-UnmappedBamToAlignedBam/UnmappedBamToAlignedBam/*/call-CheckContamination/*.selfSM/FREEMIX)
    • Chimeras: > 5% (call-AggregatedBamQC/AggregatedBamQC/*/call-CollectAggregationMetrics/*.alignment_summary_metrics/PCT_CHIMERAS)
    • Duplication: > 30% (call-UnmappedBamToAlignedBam/UnmappedBamToAlignedBam/*/call-MarkDuplicates/*.duplicate_metrics/PERCENT_DUPLICATION)
    • Median insert size: < 250 (call-AggregatedBamQC/AggregatedBamQC/*/call-CollectAggregationMetrics/*.insert_size_metrics/MEDIAN_INSERT_SIZE)
    • Median coverage < 18X (calculated from the GVCFs).
  4. Sex inferred for each sample with Hail's impute_sex. Filter samples with sex chromosome aneuploidies or ambiguous sex assignment.

  5. Note that all filtering above makes it exclude samples from the variant QC modelling, as well as from the AC/AF/AN frequency calculation. However, it keeps the samples in the final matrix table, with labels in mt.meta.hardfilter.

  6. Relatedness inferred between samples using Hail'spc_relate. Identified pairs of 1st and 2nd degree relatives. Filter to a set of unrelated individuals using Hail's maximal_independent_set that tries to keep as many samples as possible. When multiple samples could be selected, we kept the sample with the highest coverage.

  7. PCA was a ran on high-quality variants, and RF was trained using 16 principal components as features on samples with known ancestry. Ancestry was assigned to all samples for which the probability of that ancestry was >75%.

  8. Hail sample_qc was used stratified by 8 ancestry assignment PCs. Within each PC, outliers were filtered if they are 4 median absolute deviations (MADs) away from the median for the following metrics: n_snp, r_ti_tv, r_insertion_deletion, n_insertion, n_deletion, r_het_hom_var, n_het, n_hom_var, n_transition, n_transversion, or 8 MADs away from the median number of singletons (n_singleton metric).

Allele-specific VQSR

  1. Export variants into a sites-only VCF and split it into SNPs and indels, as well as region-wise for parallel processing.

  2. Run Gnarly Genotyper to perform "quick and dirty" joint genotyping.

  3. Create SNP and indel recalibration models using the allele-specific version of GATK Variant Quality Score Recalibration VQSR, using the standard GATK training resources (HapMap, Omni, 1000 Genomes, Mills indels), with the following features:

    • SNVs: AS_FS, AS_SOR, AS_ReadPosRankSum, AS_MQRankSum, AS_QD, AS_MQ
    • Indels: AS_FS, AS_SOR, AS_ReadPosRankSum, AS_MQRankSum, AS_QD
    • No sample had a high quality genotype at this variant site (GQ>=20, DP>=10, and AB>=0.2 for heterozygotes) (all fields are populated by GATK)
    • InbreedingCoeff < -0.3 (there was an excess of heterozygotes at the site compared to Hardy-Weinberg expectations) (InbreedingCoeff is populated by GATK)
  4. Apply the models to the VCFs and combine them back into one VCF.

  5. Import the VCF back to a matrix table.

VQSR pipeline is a compilation from the following 2 WDL workflows:

  1. hail-ukbb-200k-callset/GenotypeAndFilter.AS.wdl
  2. The Broad VQSR workflow documented here, translated from WDL with a help of Janis.

Random forest variant QC

  1. Gather information for the random forest model

  2. Impute missing entries

  3. Select variants for training examples

  4. Train random forests model

  5. Test resulting model on chr20

  6. Save training data with metadata describing random forest parameters used

  7. Apply random forest model to the full variant set.