Frequently used commands in bioinformatics


Keywords
visualization, api, cli, bioinformatics, sam, vcf, fasta, bed, maf, bam, fastq, cram
License
MIT
Install
pip install fuc==0.13.0

Introduction

The main goal of the fuc package (pronounced "eff-you-see") is to wrap some of the most frequently used commands in the field of bioinformatics into one place.

The package supports both a command line interface (CLI) and an application programming interface (API); documentation for both is available at Read the Docs.

Currently, fuc can be used to analyze, summarize, visualize, and manipulate the following file formats:

  • Sequence Alignment/Map (SAM)
  • Binary Alignment/Map (BAM)
  • CRAM
  • Variant Call Format (VCF)
  • Mutation Annotation Format (MAF)
  • Browser Extensible Data (BED)
  • FASTQ
  • FASTA
  • Delimiter-separated values (e.g. comma-separated values, or CSV)

Additionally, fuc can be used to parse output data from the following programs:

  • Ensembl Variant Effect Predictor (VEP)
  • SnpEff
  • bcl2fastq and bcl2fastq2

Your contributions (e.g. feature ideas, pull requests) are most welcome.

Author: Seung-been "Steven" Lee
License: MIT License

Installation

The following packages are required to run fuc:

biopython
lxml
matplotlib
matplotlib-venn
numpy
pandas
pyranges
pysam
scipy
seaborn

There are various ways you can install fuc. The recommended way is via conda:

$ conda install -c bioconda fuc

The command above will automatically download and install all the dependencies as well. Alternatively, you can use pip to install fuc and all of its dependencies:

$ pip install fuc

Finally, you can clone the GitHub repository and then install fuc locally:

$ git clone https://github.com/sbslee/fuc
$ cd fuc
$ pip install .

The advantage of this approach is that you have access to development versions that are not yet available on Anaconda or PyPI. For example, you can switch to a development branch with the git checkout command. When you do this, please make sure your environment already has all the dependencies installed.
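
Before installing from a clone, it can help to confirm which of the required packages are already importable. The following is a minimal sketch using only the standard library; it is not part of fuc, and the helper name missing_dependencies is hypothetical. It relies on two of the listed packages being imported under a different name than they are installed under (biopython as Bio, matplotlib-venn as matplotlib_venn).

```python
from importlib.util import find_spec

# Distribution name (as listed above) -> module name used at import time.
# Only biopython and matplotlib-venn differ between the two.
REQUIRED = {
    "biopython": "Bio",
    "lxml": "lxml",
    "matplotlib": "matplotlib",
    "matplotlib-venn": "matplotlib_venn",
    "numpy": "numpy",
    "pandas": "pandas",
    "pyranges": "pyranges",
    "pysam": "pysam",
    "scipy": "scipy",
    "seaborn": "seaborn",
}


def missing_dependencies(required=REQUIRED):
    """Return the distribution names whose import module cannot be found."""
    return [dist for dist, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    print("Missing:", ", ".join(missing) if missing else "none")
```

This check only looks for the modules; it does not verify their versions.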

Getting Help

For detailed documentation on the fuc package's CLI and API, please refer to Read the Docs.

To get help on the fuc CLI:

$ fuc -h
usage: fuc [-h] [-v] COMMAND ...

positional arguments:
  COMMAND        name of the command
    bam_head     [BAM] print the header of a SAM/BAM/CRAM file
    bam_index    [BAM] index a SAM/BAM/CRAM file
    bam_rename   [BAM] rename the samples in a SAM/BAM/CRAM file
    bam_slice    [BAM] slice a SAM/BAM/CRAM file
    bed_intxn    [BED] find intersection of two or more BED files
    bed_sum      [BED] summarize a BED file
    fq_count     [FASTQ] count sequence reads in FASTQ files
    fq_sum       [FASTQ] summarize a FASTQ file
    fuc_compf    [FUC] compare contents of two files
    fuc_demux    [FUC] parse Reports directory from bcl2fastq or bcl2fastq2
    fuc_exist    [FUC] check whether files/directories exist
    fuc_find     [FUC] find files with certain extension recursively
    maf_maf2vcf  [MAF] convert a MAF file to a VCF file
    maf_oncoplt  [MAF] create an oncoplot with a MAF file
    maf_sumplt   [MAF] create a summary plot with a MAF file
    maf_vcf2maf  [MAF] convert an annotated VCF file to a MAF file
    tbl_merge    [TABLE] merge two table files
    tbl_sum      [TABLE] summarize a table file
    vcf_merge    [VCF] merge two or more VCF files
    vcf_rename   [VCF] rename the samples in a VCF file.
    vcf_slice    [VCF] slice a VCF file
    vcf_vcf2bed  [VCF] convert a VCF file to a BED file
    vcf_vep      [VCF] filter a VCF file annotated by Ensemble VEP

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show the version number and exit

To get help on a specific command (e.g. vcf_merge):

$ fuc vcf_merge -h

Below is the list of submodules available in the fuc API:

  • common : The common submodule is used by other fuc submodules such as pyvcf and pybed. It also provides many day-to-day actions used in the field of bioinformatics.
  • pybam : The pybam submodule is designed for working with sequence alignment files (SAM/BAM/CRAM). It essentially wraps the pysam package to allow fast computation and easy manipulation.
  • pybed : The pybed submodule is designed for working with BED files. It implements pybed.BedFrame which stores BED data as pandas.DataFrame via the pyranges package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard BED specification.
  • pycov : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements pycov.CovFrame which stores read depth data as pandas.DataFrame via the pysam package to allow fast computation and easy manipulation.
  • pyfq : The pyfq submodule is designed for working with FASTQ files. It implements pyfq.FqFrame which stores FASTQ data as pandas.DataFrame to allow fast computation and easy manipulation.
  • pymaf : The pymaf submodule is designed for working with MAF files. It implements pymaf.MafFrame which stores MAF data as pandas.DataFrame to allow fast computation and easy manipulation. The pymaf.MafFrame class also contains many useful plotting methods such as MafFrame.plot_oncoplot and MafFrame.plot_summary. The submodule strictly adheres to the standard MAF specification.
  • pysnpeff : The pysnpeff submodule is designed for parsing VCF annotation data from the SnpEff program. It should be used with pyvcf.VcfFrame.
  • pyvcf : The pyvcf submodule is designed for working with VCF files. It implements pyvcf.VcfFrame which stores VCF data as pandas.DataFrame to allow fast computation and easy manipulation. The pyvcf.VcfFrame class also contains many useful plotting methods such as VcfFrame.plot_comparison and VcfFrame.plot_tmb. The submodule strictly adheres to the standard VCF specification.
  • pyvep : The pyvep submodule is designed for parsing VCF annotation data from the Ensembl VEP program. It should be used with pyvcf.VcfFrame.

To get help on a specific submodule (e.g. pyvcf):

>>> from fuc import pyvcf
>>> help(pyvcf)

CLI Examples

BAM

To print the header of a SAM file:

$ fuc bam_head in.sam

To index a CRAM file:

$ fuc bam_index in.cram

To slice a BAM file:

$ fuc bam_slice in.bam chr1:100-200 out.bam
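
The region argument in the command above has the form chrom:start-end. As an illustration of that syntax only, here is a small standard-library sketch that splits such a string into its parts. The function parse_region is hypothetical and is not fuc's own parser, which may accept more forms (for example, a chromosome name alone).

```python
import re

# Assumed syntax: "chrom:start-end", as in the bam_slice example above.
_REGION = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end) with integer bounds."""
    match = _REGION.match(region)
    if match is None:
        raise ValueError(f"not a 'chrom:start-end' region: {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is greater than end in {region!r}")
    return match["chrom"], start, end


print(parse_region("chr1:100-200"))  # prints ('chr1', 100, 200)
```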

BED

To find the intersection of two or more BED files:

$ fuc bed_intxn 1.bed 2.bed 3.bed > intersect.bed
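
To make the idea of intersecting BED files concrete, the sketch below intersects lists of (chrom, start, end) intervals in pure Python, treating them as 0-based and half-open as in the BED format. It is a conceptual illustration, not fuc's implementation (the pybed submodule works through pyranges), and details such as how touching intervals are handled by bed_intxn are not assumed here. The function name intersect is hypothetical.

```python
from functools import reduce


def intersect(a, b):
    """Return the overlaps between two lists of (chrom, start, end) intervals.

    Intervals are half-open, so intervals that merely touch do not overlap.
    """
    result = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                result.append((chrom_a, start, end))
    return sorted(result)


bed1 = [("chr1", 100, 200), ("chr2", 50, 80)]
bed2 = [("chr1", 150, 300)]
bed3 = [("chr1", 0, 180)]

# Intersecting more than two inputs is a left fold over pairwise intersection.
print(reduce(intersect, [bed1, bed2, bed3]))  # prints [('chr1', 150, 180)]
```

The nested loop is quadratic in the number of intervals, which is fine for a demonstration but not for genome-scale inputs.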

FASTQ

To count sequence reads in a FASTQ file:

$ fuc fq_count example.fastq
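
For reference, counting reads in an uncompressed FASTQ file amounts to counting four-line records. The sketch below shows that idea with the standard library only; count_fastq_records is a hypothetical helper, not fuc's code, and it assumes each record occupies exactly four lines (no wrapped sequences).

```python
import io


def count_fastq_records(lines):
    """Count records in an iterable of FASTQ lines (four lines per record)."""
    n_lines = sum(1 for line in lines if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(
            f"expected a multiple of 4 non-empty lines, got {n_lines}")
    return n_lines // 4


example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
)
print(count_fastq_records(example))  # prints 2
```

Any iterable of lines works, so an open file handle can be passed directly; for a gzip-compressed file, gzip.open(path, "rt") from the standard library yields text lines in the same way.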

FUC

To check whether a file exists in the operating system:

$ fuc fuc_exist example.txt

To find all VCF files within the current directory recursively:

$ fuc fuc_find .vcf.gz
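
A rough Python equivalent of a recursive search by extension is shown below, using pathlib from the standard library. It is only a sketch of the idea; find_files is a hypothetical name, and fuc_find may differ in output format and options.

```python
from pathlib import Path


def find_files(root, extension):
    """Return all files under root whose name ends with the given extension."""
    return sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and path.name.endswith(extension)
    )


# List gzip-compressed VCF files under the current directory.
for path in find_files(".", ".vcf.gz"):
    print(path)
```

Matching on the end of the file name rather than on Path.suffix keeps compound extensions such as .vcf.gz intact.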

TABLE

To merge two tab-delimited files:

$ fuc tbl_merge left.tsv right.tsv > merged.tsv

VCF

To merge VCF files:

$ fuc vcf_merge 1.vcf 2.vcf 3.vcf > merged.vcf

To filter a VCF file annotated by Ensembl VEP:

$ fuc vcf_vep in.vcf 'SYMBOL == "TP53"' > out.vcf

API Examples

BAM

To create a read depth profile of a region from a CRAM file:

>>> from fuc import pycov
>>> cf = pycov.CovFrame.from_file('HG00525.final.cram', zero=True,
...    region='chr12:21161194-21239796', names=['HG00525'])
>>> cf.plot_region('chr12', start=21161194, end=21239796)

https://raw.githubusercontent.com/sbslee/fuc-data/main/images/coverage.png

VCF

To filter a VCF file based on a BED file:

>>> from fuc import pyvcf
>>> vf = pyvcf.VcfFrame.from_file('original.vcf')
>>> filtered_vf = vf.filter_bed('targets.bed')
>>> filtered_vf.to_file('filtered.vcf')

To remove indels from a VCF file:

>>> from fuc import pyvcf
>>> vf = pyvcf.VcfFrame.from_file('with_indels.vcf')
>>> filtered_vf = vf.filter_indel()
>>> filtered_vf.to_file('no_indels.vcf')
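
To show what removing indels means at the record level, the sketch below filters plain VCF lines using one common working definition: a record contains an indel if any ALT allele differs in length from REF. This is an assumption made for illustration. The rule applied by VcfFrame.filter_indel may differ (for example, in how multi-nucleotide variants or symbolic alleles are treated), and the helper names is_indel and drop_indels are hypothetical here.

```python
def is_indel(ref, alt):
    """Return True if any comma-separated ALT allele differs in length from REF."""
    return any(len(allele) != len(ref) for allele in alt.split(","))


def drop_indels(lines):
    """Yield header lines and records that contain no indel allele."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]  # REF and ALT are columns 4 and 5
        if not is_indel(ref, alt):
            yield line


records = [
    "##fileformat=VCFv4.3\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\t.\t.\n",      # SNV, kept
    "chr1\t200\t.\tAT\tA\t.\t.\t.\n",     # deletion, dropped
    "chr1\t300\t.\tC\tCTT,G\t.\t.\t.\n",  # multiallelic with insertion, dropped
]
kept = list(drop_indels(records))
print(len(kept))  # prints 3 (two header lines and one SNV)
```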

To create a Venn diagram showing genotype concordance between groups:

>>> from fuc import pyvcf, common
>>> common.load_dataset('pyvcf')
>>> f = '~/fuc-data/pyvcf/plot_comparison.vcf'
>>> vf = pyvcf.VcfFrame.from_file(f)
>>> a = ['Steven_A', 'John_A', 'Sara_A']
>>> b = ['Steven_B', 'John_B', 'Sara_B']
>>> c = ['Steven_C', 'John_C', 'Sara_C']
>>> vf.plot_comparison(a, b, c)

https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_comparison.png

To create various figures for normal-tumor analysis:

>>> import matplotlib.pyplot as plt
>>> from fuc import common, pyvcf
>>> common.load_dataset('pyvcf')
>>> vf = pyvcf.VcfFrame.from_file('~/fuc-data/pyvcf/normal-tumor.vcf')
>>> af = pyvcf.AnnFrame.from_file('~/fuc-data/pyvcf/normal-tumor-annot.tsv', 'Sample')
>>> normal = af.df[af.df.Tissue == 'Normal'].index
>>> tumor = af.df[af.df.Tissue == 'Tumor'].index
>>> fig, [[ax1, ax2], [ax3, ax4]] = plt.subplots(2, 2, figsize=(10, 10))
>>> vf.plot_tmb(ax=ax1)
>>> vf.plot_tmb(ax=ax2, af=af, hue='Tissue')
>>> vf.plot_hist('DP', ax=ax3, af=af, hue='Tissue')
>>> vf.plot_regplot(normal, tumor, ax=ax4)
>>> plt.tight_layout()

https://raw.githubusercontent.com/sbslee/fuc-data/main/images/normal-tumor.png

MAF

To create an oncoplot with a MAF file:

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(f)
>>> mf.plot_oncoplot()

https://raw.githubusercontent.com/sbslee/fuc-data/main/images/oncoplot.png

To create a customized oncoplot with a MAF file, see the 'Create customized oncoplot' tutorial:

https://raw.githubusercontent.com/sbslee/fuc-data/main/images/customized_oncoplot.png

To create a summary figure for a MAF file:

>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
>>> mf = pymaf.MafFrame.from_file(f)
>>> mf.plot_summary()

https://raw.githubusercontent.com/sbslee/fuc-data/main/images/maf_summary.png