click_annotvcf
This pipeline enables the conversion and annotation of VCF files into tables (TSV format). It has several steps of annotations:
- Annotates the VCF with VAGrENT, if a cache is provided (
--vagrent
). - Annotates the VCF using ensembl-vep, keeping the most deleterious transcript for each variant. If a cache is provided (
--vep-dir
). - Annotates against custom databases using vep if
--custom
and--reference
are provided. - Converts the VCF to TSV, calculating info about the frequencies of the variants, for specific detected callers (caveman, pindel, mutect, strelka)
- Annotates with COSMIC, if
--cosmic
is provided. - Annotates against a Panel of Normals (PON) if
--normal_tsv
with variants from several normals is provided as TSV format. - Filters the TSV file in an specific region, if a bed file is provided (
--bedfile
) - Filters coding variants with annotated consequences of Vagrent.
- Annotates against a Panel of Normals, if several annotated files are provided (
-tsv
).
Installation
This package can be installed using pypi, or cloning it from its github repo:
# install package from pypi
pip install click_annotvcf
# install from github repo
pip install git+https://github.com/papaemmelab/click_annotvcf
Annotation
Run all steps of the annotation pipeline:
click_annotvcf annotvcf
--input_vcf /ifs/work/leukgen/home/zhouy1/repos/projects/p221/merged/snv/I-H-118632-T1-1-D1-1.flagged.snv.vcf.gz
--outdir .
--output_prefix annotvcf
--assembly GRCH37D5
--reference /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/genome/gr37.fasta
--vagrent /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/vagrent/Homo_sapiens_KnC.GRCh37.75.vagrent.cache.gz
--vep-dir /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/vep/cache/
--ensembl-version 91
--custom /ifs/work/leukgen/home/leukbot/tests/vep/gnomad_genomes/gnomad.genomes.r2.0.1.sites.noVEP.vcf.gz gnomAD_genome AC_AFR,AC_AMR,AC_ASJ,AC_EAS,AC_FIN,AC_NFE,AC_OTH,AC_Male,AC_Female,AN_AFR,AN_AMR,AN_ASJ,AN_EAS,AN_FIN,AN_NFE,AN_OTH,AN_Male,AN_Female,AF_AFR,AF_AMR,AF_ASJ,AF_EAS,AF_FIN,AF_NFE,AF_OTH,AF_Male,AF_Female,Hom_HomR,Hom_AMR,Hom_ASJ,Hom_EAS,Hom_FIN,Hom_NFE,Hom_OTH,Hom_Male,Hom_Female \
--custom /ifs/work/leukgen/home/leukbot/tests/vep/gnomad_exomes/gnomad.exomes.r2.0.1.sites.noVEP.vcf.gz gnomAD_exome AC_AFR,AC_AMR,AC_ASJ,AC_EAS,AC_FIN,AC_NFE,AC_OTH,AC_Male,AC_Female,AN_AFR,AN_AMR,AN_ASJ,AN_EAS,AN_FIN,AN_NFE,AN_OTH,AN_Male,AN_Female,AF_AFR,AF_AMR,AF_ASJ,AF_EAS,AF_FIN,AF_NFE,AF_OTH,AF_Male,AF_Female,Hom_HomR,Hom_AMR,Hom_ASJ,Hom_EAS,Hom_FIN,Hom_NFE,Hom_OTH,Hom_Male,Hom_Female \
--custom /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/cosmic/81/CosmicMergedVariants.vcf.gz COSMIC GENE,STRAND,CDS,AA,CNT,SNP
--cosmic /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/cosmic/81
--normal_tsv several.normals.variants.tsv.gz
--bedfile /ifs/work/leukgen/ref/bedfiles/IMPACT-468.bed
Vep Annotation
Information about each column annotation can be found in vep documentation, as well as how to interpret some of the consequences annotations, according of how vep predicts data.
Tables Cross Referencing
Required Columns
Your initial tsv
file must have the following columns:
Column | Description |
---|---|
ID_VARIANT | Unque identifier per variant |
CHR | Chromosome |
START | Variant start position |
END | For SNPs same as START. For Indels, START + indel length |
REF | Reference sequence |
ALT | Alternative sequence |
Cosmic Table Cross Referencing
Annotate against cosmic using:
click_annotvcf crossref_cosmic_tabix
--input_file_name input.tsv.gz
--cosmic_reference /ifs/work/leukgen/ref/homo_sapiens/GRCh37d5/cosmic/81
--input_file_type generic
--output_file_format raw
--output_file_name output.annot.tsv.gz
--loglevel error
Bed Filtering
Filter bed region using:
click_annotvcf filter_non_bait_design_variants
--input_file_name input.tsv.gz
--ref_file_name /ifs/work/leukgen/ref/bedfiles/IMPACT-468.bed
--input_file_type generic
--output_file_format raw
--nbases 5
--output_file_name output.bed.tsv.gz
--loglevel error
Annotate Normals
Annotate the tsv file against several tsv files used as panel of normals. All files passed as --normal_tsv
will be aggregated upon unique variants and a output.pon.normals.tsv.gz
will be generated. The --threshold
is the window to annotate close variant matches, by default is 10.
click_annotvcf annotate_normals
--input_file_name input.tsv.gz
--normal_tsv normal1.tsv.gz
--normal_tsv normal2.tsv.gz
--normal_tsv normal3.tsv.gz
--normal_tsv normal4.tsv.gz
--output_file_name output.pon.tsv.gz
--threshold 10
--loglevel error
Output files
When the main command click_annotvcf annotvcf [options]
is used, considering input_file
-> input.vcf
and output_prefix
-> my_output
, the following files will be in the output directory outdir
:
-
The following VCF files:
Output file Description input.annot_vagrent.vcf.gz annotated with vagrent if --vagrent
was passed.input.annot_vagrent.annot-vep.vcf.gz annotated with vep if --vep-dir
was passed. -
The following TSV files:
Output file Description my_output.most-severe.tsv.gz VEP outputs all transcripts for each variant, this file only has the transcript for the most severe effect. my_output.annot.tsv.gz annotated with cosmic if --cosmic
was passed.my_output.annot.pon.tsv.gz annotated against a Panel of Normals if there is one available for the panel pon.normals.tsv.gz aggregated panel of normals used to annotate my_output.annot.pon.bed.tsv.gz filtered region variants if --bedfile
was passed.my_output.annot.pon.bed.coding.tsv.gz filtered coding variants.
Callers VAFs
For each caller the following methods are used:
-
CAVEMAN: VAF is calculated with the number of reads of the mutation in the forward and reverse strand, divided over the total reads in both strands. Ex: if ALT is G the method is (FGZ+RGZ)/(FAZ+RAZ+FCZ+RCZ+FGZ+RGZ+FTZ+RTZ)
-
STRELKA: every value is a tuple of 2 tiers: tier1 are reads with Q >= 40 and tier2 are reads with Q >=5. For SNV, the VAF is calculated using the number of reads in the tier1, divided by the depth in the tier1 (DP). Tier1 is used as depth for tier2 is not available. For indels, the reads are taken from the TIR field (Total Indel Reads) for the tier1 as well. Both for snv and indels, no info for dirprop is available.
-
MUTECT: depth is given as AD by mutect as tuple. And VAF is taken from reference depth and allele depth last tuple value. For the directionality of the mutations F1R2 and F2R1 is used, as are reads in the +/- strands respectively, and are given as tuples (REF, ALL1, ALL2, ..ALLN) so always take the first allele. This method is used both for snv and indels.
-
PINDEL: VAF is calculated with the number of reads of the mutation in the forward and reverse strand, divided over the total reads. PU and NU are the unique reads of the mutation in the +/- strands, and PR and NR are the total reads in both strands.
-
FREEBAYES GERMLINE: VAF is calculated with the number of reads AD in the allele, over the read depth DP. Both for snvs and indels.
-
STRELKA GERMLINE: VAF is calculated with the number of reads AD in the allele, over the read depth DP for snvs, and DPI for indels.
Testing
Some of the tests are not available to run in every environmente as the files used for testing can't be reduced to smaller sizes for testing. That's why travis can't run all the tests in the cloud.
To run tests execute tox
where /ifs
is available.