click-annotvcf

✏️ Convert and annotate of VCFs into tables.


License
BSD-3-Clause
Install
pip install click-annotvcf==1.0.9

Documentation

click_annotvcf

pypi badge travis badge codecov badge code formatting docker badge docker badge

This pipeline enables the conversion and annotation of VCF files into tables (TSV format). It has several steps of annotations:

  • Annotates the VCF with VAGrENT, if a cache is provided (--vagrent).
  • Annotates the VCF using ensembl-vep, keeping the most deleterious transcript for each variant. If a cache is provided (--vep-dir).
  • Annotates against custom databases using vep if --custom and --reference are provided.
  • Converts the VCF to TSV, calculating info about the frequencies of the variants, for specific detected callers (caveman, pindel, mutect, strelka)
  • Annotates with COSMIC, if --cosmic is provided.
  • Annotates against a Panel of Normals (PON) if --normal_tsv with variants from several normals is provided as TSV format.
  • Filters the TSV file in an specific region, if a bed file is provided (--bedfile)
  • Filters coding variants with annotated consequences of Vagrent.
  • Annotates against a Panel of Normals, if several annotated files are provided (-tsv).

Installation

This package can be installed using pypi, or cloning it from its github repo:

    # install package from pypi
    pip install click_annotvcf

    # install from github repo
    pip install git+https://github.com/papaemmelab/click_annotvcf

Annotation

Run all steps of the annotation pipeline:

click_annotvcf annotvcf
    --input_vcf /ifs/work/leukgen/home/zhouy1/repos/projects/p221/merged/snv/I-H-118632-T1-1-D1-1.flagged.snv.vcf.gz
    --outdir .
    --output_prefix annotvcf
    --assembly GRCH37D5
    --reference /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/genome/gr37.fasta
    --vagrent /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/vagrent/Homo_sapiens_KnC.GRCh37.75.vagrent.cache.gz
    --vep-dir /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/vep/cache/
    --ensembl-version 91
    --custom /ifs/work/leukgen/home/leukbot/tests/vep/gnomad_genomes/gnomad.genomes.r2.0.1.sites.noVEP.vcf.gz gnomAD_genome AC_AFR,AC_AMR,AC_ASJ,AC_EAS,AC_FIN,AC_NFE,AC_OTH,AC_Male,AC_Female,AN_AFR,AN_AMR,AN_ASJ,AN_EAS,AN_FIN,AN_NFE,AN_OTH,AN_Male,AN_Female,AF_AFR,AF_AMR,AF_ASJ,AF_EAS,AF_FIN,AF_NFE,AF_OTH,AF_Male,AF_Female,Hom_HomR,Hom_AMR,Hom_ASJ,Hom_EAS,Hom_FIN,Hom_NFE,Hom_OTH,Hom_Male,Hom_Female \
    --custom /ifs/work/leukgen/home/leukbot/tests/vep/gnomad_exomes/gnomad.exomes.r2.0.1.sites.noVEP.vcf.gz gnomAD_exome AC_AFR,AC_AMR,AC_ASJ,AC_EAS,AC_FIN,AC_NFE,AC_OTH,AC_Male,AC_Female,AN_AFR,AN_AMR,AN_ASJ,AN_EAS,AN_FIN,AN_NFE,AN_OTH,AN_Male,AN_Female,AF_AFR,AF_AMR,AF_ASJ,AF_EAS,AF_FIN,AF_NFE,AF_OTH,AF_Male,AF_Female,Hom_HomR,Hom_AMR,Hom_ASJ,Hom_EAS,Hom_FIN,Hom_NFE,Hom_OTH,Hom_Male,Hom_Female \
    --custom /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/cosmic/81/CosmicMergedVariants.vcf.gz COSMIC GENE,STRAND,CDS,AA,CNT,SNP
    --cosmic /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/cosmic/81
    --normal_tsv several.normals.variants.tsv.gz
    --bedfile /ifs/work/leukgen/ref/bedfiles/IMPACT-468.bed

Vep Annotation

Information about each column annotation can be found in vep documentation, as well as how to interpret some of the consequences annotations, according of how vep predicts data.

Tables Cross Referencing

Required Columns

Your initial tsv file must have the following columns:

Column Description
ID_VARIANT Unque identifier per variant
CHR Chromosome
START Variant start position
END For SNPs same as START. For Indels, START + indel length
REF Reference sequence
ALT Alternative sequence

Cosmic Table Cross Referencing

Annotate against cosmic using:

click_annotvcf crossref_cosmic_tabix
    --input_file_name input.tsv.gz
    --cosmic_reference /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/cosmic/81
    --input_file_type generic
    --output_file_format raw
    --output_file_name output.annot.tsv.gz
    --loglevel error

Bed Filtering

Filter bed region using:

click_annotvcf filter_non_bait_design_variants
    --input_file_name input.tsv.gz
    --ref_file_name /ifs/work/leukgen/ref/bedfiles/IMPACT-468.bed
    --input_file_type generic
    --output_file_format raw
    --nbases 5
    --output_file_name output.bed.tsv.gz
    --loglevel error

Annotate Normals

Annotate the tsv file against several tsv files used as panel of normals. All files passed as --normal_tsv will be aggregated upon unique variants and a output.pon.normals.tsv.gz will be generated. The --threshold is the window to annotate close variant matches, by default is 10.

click_annotvcf annotate_normals
    --input_file_name input.tsv.gz
    --normal_tsv normal1.tsv.gz
    --normal_tsv normal2.tsv.gz
    --normal_tsv normal3.tsv.gz
    --normal_tsv normal4.tsv.gz
    --output_file_name output.pon.tsv.gz
    --threshold 10
    --loglevel error

Output files

When the main command click_annotvcf annotvcf [options] is used, considering input_file -> input.vcf and output_prefix -> my_output, the following files will be in the output directory outdir:

  • The following VCF files:

    Output file Description
    input.annot_vagrent.vcf.gz annotated with vagrent if --vagrent was passed.
    input.annot_vagrent.annot-vep.vcf.gz annotated with vep if --vep-dir was passed.
  • The following TSV files:

    Output file Description
    my_output.most-severe.tsv.gz VEP outputs all transcripts for each variant, this file only has the transcript for the most severe effect.
    my_output.annot.tsv.gz annotated with cosmic if --cosmic was passed.
    my_output.annot.pon.tsv.gz annotated against a Panel of Normals if there is one available for the panel
    pon.normals.tsv.gz aggregated panel of normals used to annotate
    my_output.annot.pon.bed.tsv.gz filtered region variants if --bedfile was passed.
    my_output.annot.pon.bed.coding.tsv.gz filtered coding variants.

Callers VAFs

For each caller the following methods are used:

  • CAVEMAN: VAF is calculated with the number of reads of the mutation in the forward and reverse strand, divided over the total reads in both strands. Ex: if ALT is G the method is (FGZ+RGZ)/(FAZ+RAZ+FCZ+RCZ+FGZ+RGZ+FTZ+RTZ)

  • STRELKA: every value is a tuple of 2 tiers: tier1 are reads with Q >= 40 and tier2 are reads with Q >=5. For SNV, the VAF is calculated using the number of reads in the tier1, divided by the depth in the tier1 (DP). Tier1 is used as depth for tier2 is not available. For indels, the reads are taken from the TIR field (Total Indel Reads) for the tier1 as well. Both for snv and indels, no info for dirprop is available.

  • MUTECT: depth is given as AD by mutect as tuple. And VAF is taken from reference depth and allele depth last tuple value. For the directionality of the mutations F1R2 and F2R1 is used, as are reads in the +/- strands respectively, and are given as tuples (REF, ALL1, ALL2, ..ALLN) so always take the first allele. This method is used both for snv and indels.

  • PINDEL: VAF is calculated with the number of reads of the mutation in the forward and reverse strand, divided over the total reads. PU and NU are the unique reads of the mutation in the +/- strands, and PR and NR are the total reads in both strands.

  • FREEBAYES GERMLINE: VAF is calculated with the number of reads AD in the allele, over the read depth DP. Both for snvs and indels.

  • STRELKA GERMLINE: VAF is calculated with the number of reads AD in the allele, over the read depth DP for snvs, and DPI for indels.

Testing

Some of the tests are not available to run in every environmente as the files used for testing can't be reduced to smaller sizes for testing. That's why travis can't run all the tests in the cloud.

To run tests execute tox where /ifs is available.