vcf-consensus-builder

Build a consensus sequence from a VCF and ref sequence masking low and no coverage positions.


Keywords
vcf_consensus_builder
License
MIT
Install
pip install vcf-consensus-builder==0.1.0

Documentation

VCF Consensus Builder

Build a consensus sequence from a VCF and a reference sequence, masking low and no coverage positions.

You could use bcftools consensus instead, but you would then need to mask the low and no coverage positions yourself after bcftools has generated the consensus, which can be tricky.

Features

  • Masks no coverage (0X) and low coverage (<5X) positions in the reference with - and N, respectively, by default
  • No need to bgzip or index the VCF file, as bcftools consensus requires

Usage

Install

Install from PyPI with pip:

pip install vcf_consensus_builder

Show Help

Help message:

$ vcf_consensus_builder --help
Usage: vcf_consensus_builder [OPTIONS]

  Build a consensus sequence from a VCF and ref sequence masking low and no
  coverage positions.

Options:
  -v, --vcf-file PATH      VCF file path (v4)  [required]
  -d, --depths-file PATH   samtools depth output file (no headers)  [required]
  -r, --ref-fasta PATH     Reference sequence FASTA file (single sequence
                           entry only!)  [required]
  -o, --output-fasta TEXT  Output consensus sequence FASTA file path (default
                           write to stdout)
  --low-coverage INTEGER   Low coverage threshold; replace positions with less
                           than this depth with "N" by default
  --no-coverage INTEGER    No coverage threshold; replace positions with less
                           than or equal this depth with "-" by default
  --low-cov-char TEXT      Low coverage character ("N" by default)
  --no-cov-char TEXT       No coverage character ("-" by default)
  --sample-name TEXT       Optional sample name for output fasta header ID
  --help                   Show this message and exit.

Basic usage

Run on the test data included in the repo:

# Clone this repo and enter it
$ git clone https://github.com/peterk87/vcf_consensus_builder.git --depth=1
$ cd vcf_consensus_builder/
# run vcf_consensus_builder on test data
$ vcf_consensus_builder -v tests/data/test.vcf \
    -d tests/data/test-depths.tsv \
    -r tests/data/ref.fa
# produces the following to stdout
>sample1 ref="ref ref"
NACCGTANACAATAN--

Masking of no and low coverage positions in reference sequence

vcf_consensus_builder first masks no and low coverage positions in the reference sequence file and then applies the ALT variants in the VCF.
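The second step, applying ALT variants to an already masked sequence, can be illustrated with a minimal Python sketch. This is not the package's actual implementation: the function names read_variants and apply_variants, the two example variants, and the choice to take only the first ALT allele are all assumptions made for illustration.

```python
def read_variants(vcf_text):
    """Collect (pos, ref, alt) tuples from VCF text (hypothetical helper)."""
    variants = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.split("\t")
        # VCF columns: CHROM, POS, ID, REF, ALT, ...; POS is 1-based
        pos, ref, alt = int(fields[1]), fields[3], fields[4]
        # Assumption: if several ALT alleles are listed, use the first one
        variants.append((pos, ref, alt.split(",")[0]))
    return variants


def apply_variants(masked_seq, variants):
    """Replace REF with ALT at each variant position in the masked sequence."""
    seq = masked_seq
    # Work right to left so indels do not shift positions still to be edited
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


# Made-up variants: a SNP at position 3 and a 2 bp insertion at position 13
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "ref\t3\t.\tC\tT\t60\tPASS\t.\n"
    "ref\t13\t.\tG\tGAA\t60\tPASS\t.\n"
)

masked = "NGCCAAGTCTNCGACATN-"
print(apply_variants(masked, read_variants(VCF_TEXT)))
# NGTCAAGTCTNCGAAACATN-
```

Because masking happens first in this sketch, a variant that falls on a masked position would overwrite the mask character with its ALT allele.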

NOTE: vcf_consensus_builder does not perform any VCF variant filtering. It assumes that the input VCF file contains only the variants you want represented in your consensus sequence. Use bcftools filter with appropriate filtering/exclusion expressions beforehand to select those variants (see https://samtools.github.io/bcftools/howtos/filtering.html for more information on filtering a VCF file).

Given this reference sequence

>ref
NGCCAAGTCTNCGACATN-

And this samtools depth output

sample1     ref     1       4
sample1     ref     2       9
sample1     ref     3       9
sample1     ref     4       9
sample1     ref     5       9
sample1     ref     6       9
sample1     ref     7       10
sample1     ref     8       10
sample1     ref     9       10
sample1     ref     10      10
sample1     ref     11      3
sample1     ref     12      9
sample1     ref     13      9
sample1     ref     14      9
sample1     ref     15      9
sample1     ref     16      9
sample1     ref     17      5
sample1     ref     18      4
sample1     ref     19      0
sample1     ref     20      0

The low (below 5X) and no (0X) coverage positions in the reference sequence will be replaced with N and -, respectively.

The masked reference sequence will be:

>ref
NGCCAAGTCTNCGACATN-

This masked sequence will be used for generating the consensus sequence.
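The masking rule can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the package's code: the names parse_depths and mask_reference are hypothetical, the 19-base unmasked reference is made up, and positions missing from the depth table are assumed to have zero coverage. The thresholds mirror the --low-coverage and --no-coverage help text (depth less than or equal to the no coverage threshold gives "-", depth below the low coverage threshold gives "N").

```python
DEPTHS_TEXT = """\
sample1 ref 1 4
sample1 ref 2 9
sample1 ref 3 9
sample1 ref 4 9
sample1 ref 5 9
sample1 ref 6 9
sample1 ref 7 10
sample1 ref 8 10
sample1 ref 9 10
sample1 ref 10 10
sample1 ref 11 3
sample1 ref 12 9
sample1 ref 13 9
sample1 ref 14 9
sample1 ref 15 9
sample1 ref 16 9
sample1 ref 17 5
sample1 ref 18 4
sample1 ref 19 0
sample1 ref 20 0
"""


def parse_depths(text):
    """Map 1-based position to depth, reading the last two columns per row."""
    depths = {}
    for line in text.strip().splitlines():
        fields = line.split()
        depths[int(fields[-2])] = int(fields[-1])
    return depths


def mask_reference(seq, depths, low_coverage=5, no_coverage=0,
                   low_cov_char="N", no_cov_char="-"):
    """Replace no and low coverage positions with the masking characters."""
    masked = []
    for pos, base in enumerate(seq, start=1):
        depth = depths.get(pos, 0)  # assumption: missing position means 0X
        if depth <= no_coverage:
            masked.append(no_cov_char)
        elif depth < low_coverage:
            masked.append(low_cov_char)
        else:
            masked.append(base)
    return "".join(masked)


# Hypothetical unmasked reference; depth rows past its end are simply unused
reference = "AGCCAAGTCTACGACATGC"
print(mask_reference(reference, parse_depths(DEPTHS_TEXT)))
# NGCCAAGTCTNCGACATN-
```

Checking the no coverage threshold before the low coverage one matters here, since a 0X position also satisfies the "below 5X" condition.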

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.