biogrouper

Graph-based clustering and annotation for improved de novo transcriptome analysis


License
BSD-3-Clause-Attribution
Install
pip install biogrouper==0.1.4

Documentation

Grouper

============

Grouper is a tool for clustering and annotating contigs from de novo transcriptome assemblies. There are two main modules in Grouper: the clustering module and the labeling module. The former is based on the tool, RapClust, and is designed to be run downstream of the Sailfish or Salmon tools for rapid transcript-level quantification. It relies on the fragment equivalence classes, orphaned read mappings and quantification information computed by these tools in order to determine how contigs in the assembly are potentially related and cluster them accordingly. The labeling module in Grouper is able to incorporate information from annotated genomes of closely related species to annotate contigs in the de novo assembly. Hence, the different modules of Grouper are able to efficiently utilize information from multiple sources to accurately cluster and annotate contigs from transcriptome assemblies.

Dependencies


The clustering module of Grouper depends on the MCL clustering tool (to be available in the environment where it runs).

Similarly, the labeling module depends on the Junto library for label propagation (to be available in the environment where it runs). This will require the relevant Java version. You can add this by cloning the repository and running the following commands:

export JUNTO_DIR=<path to junto folder>
export PATH="$PATH:$JUNTO_DIR/bin"

Further, Grouper depends on the following Python packages:

  1. Click
  2. PyYAML
  3. Pandas
  4. NumPy
  5. Networkx v1.11

However, you should be able to install Grouper via pip and have these python dependencies installed automatically. To install Grouper via pip, you can use:

> pip install biogrouper

You should now have a Grouper executable in your path. You can test this with the following command:

> Grouper --help

You should see the following output:

Usage: Grouper [OPTIONS]

Options:
  --config TEXT  Config file describing the experimental setup
  --help         Show this message and exit.

Using Grouper


Grouper is written in Python, is easy to use. Below, we explain how to use it with Salmon. There are two main steps involved in running Grouper:

  1. Run Salmon on each sample in your experiment, passing it the --dumpEq option. This will tell Salmon to dump a representation of the fragment equivalence classes that it computed during quasi-mapping of each sample. If you wish to use orphan read information for joining contigs in Grouper, use the --writeOrphanLinks option as well, which will dump orphan read pair information to a file. Apart from these additional option, Salmon should be run normally (i.e. passing in whatever other options are appropriate for your samples).
  2. Run Grouper, providing it with a configuration file that describes the experimental setup of your samples, and where the Salmon quantification results have been written. You can also choose whether or not to use the additional filters and the labeling module in Grouper.

Let's illustrate this pipeline with a particular example, the following experimental data from the Trapnell et al. paper:

Accession Condition Replicate
SRR493366 scramble 1
SRR493367 scramble 2
SRR493368 scramble 3
SRR493369 HOXA1KD 1
SRR493370 HOXA1KD 2
SRR493371 HOXA1KD 3

We'll assume that the raw read files reside in the directory reads. Assuming that you've already built the index on the transcriptome you wish to quantify, a typical run of Salmon on this data would look something like.

> parallel -j 6 "samp={}; salmon quant -i index -l a -1 <(gunzip -c reads/{$samp}_1.fq.gz) -2 <(gunzip -c reads/{$samp}_2.fq.gz) -o {$samp}_quant --dumpEq --writeOrphanLinks -p 4" ::: SRR493366 SRR493367 SRR493368 SRR493369 SRR493370 SRR493371

This will quantify each sample, and write the result to the directory samplename_quant. Given this setup, we're now ready to run the clustering module in Grouper. First, we have to make an appropriate config file. We demonstrate one using both the optinal filters in Grouper:

conditions:
    - Control
    - HOXA1 Knockdown
samples:
    Control:
        - SRR493366_quant
        - SRR493367_quant
        - SRR493368_quant
    HOXA1 Knockdown:
        - SRR493369_quant
        - SRR493370_quant
        - SRR493371_quant
outdir: human_grouper
orphan: True
mincut: True

you can place this in a file called config.yaml. Grouper uses YAML to specify its configuration files. The configuration file must contain the following three entries; conditions, samples, and outdir. The conditions entry lists the conditions present in the sample. The samples entry is a nested dictionary of lists; there is a key corrseponding to each condition listed in the conditions entry, and the value associated with this key is a list of quantification directories of the samples for this condition. Finally, the outdir entry specifies where the Grouper output and intermediate files should be stored. Optionally, the orphan and mincut entries tell Grouper which extra filters to use. If these lines are not added to the config file, by default, the filters are not applied. Given the above, we can run Grouper as:

> Grouper --config config.yaml

This will process the samples, generate the mapping ambiguity graph, filter it according to the conditions and the optional filter, and cluster the resuling graph (Grouper uses MCL internally for clustering). Once Grouper is finished, the human_grouper directory should exist. It will contain the following files:

mag.clust, mag.filt.net, mag.flat.clust, mag.net, stats.json, log.txt, mag.orphan.net

The most important file for downstream processing is mag.flat.clust. It contains the computed cluster information in a "transcript-to-gene" mapping formation that is compatible with downstream tools like tximport. The other files may be useful for exploration, but they are more intended for Grouper's internal use (e.g. mag.filt.net contains the filtered mapping ambiguity graph that is used for clustering).

Labeling Module


In order to annotate contigs in the assembly, the labeling module of Grouper requires information from a closely related species. Our test species in this example is human and the closely related annotated species is chimp. This information can be added to the config file in in any one of the following formats:

  1. You can pass the FASTA files to Grouper in the following way and it will run a two-way BLAST assigning seed labels to contigs. Ensure that the FASTA files are passed in the following order (the first is from the test species, second from the annotated species)
fasta:
    - human.transcripts.fa
    - chimp.transcripts.fa
  1. If you have already run BLAST, you can pass the output files (in BLAST outfmt 6). Again, ensure that the first one is BLAST of contigs from test species against the annotated species and the second is BLAST of contigs from annotated species against the test species.
labels:
    - human.chimpdb.txt
    - chimp.humandb.txt
  1. If you wish to use a pre-processed label file, you can pass a two-column file where the first is the set of contigs from the test species and second the label. If a contig has multiple labels in the input file, one will be chosen arbitrarily as seed.
labels:
    - human.labels.txt

So a sample config file provided with the FASTA files (example 1) would look something like this:

conditions:
    - Control
    - HOXA1 Knockdown
samples:
    Control:
        - SRR493366_quant
        - SRR493367_quant
        - SRR493368_quant
    HOXA1 Knockdown:
        - SRR493369_quant
        - SRR493370_quant
        - SRR493371_quant
fasta:
    - human.transcripts.fa
    - chimp.transcripts.fa
outdir: human_grouper
orphan: True
mincut: True
threads: 12

This also uses the optional filters in Grouper to generate the mapping ambiguity graph and runs BLAST using 12 threads (if this is not specified, it is run using 8 threads by default). The ouput directory in this case will contain a sub-folder Annotated with the following files:

final.labels.txt, label.graph.txt, label.mag.clust, label.mag.flat.clust, label.stats.json, raw.label.graph.txt, seed.labels.txt

Once again, the mag.flat.clust contains the computed cluster information in a "transcript-to-gene" mapping formation that can be used for downstream analyses. The file seed.labels.txt contains the initial contig to gene labeling. More importantly, the file final.labels.txt contains the labels after running Grouper and a contig may have multiple labels in this file, each with an associated score. The rest of the files are for internal use in the algorithm.

Citations:


Experiments in Graph-based Semi-Supervised Learning Methods for Class-Instance Acquisition. Partha Pratim Talukdar, Fernando Pereira, ACL 2010

Differential analysis of gene regulation at transcript resolution with RNA-seq by Cole Trapnell, David G Henderickson, Martin Savageau, Loyal Goff, John L Rinn and Lior Pachter, Nature Biotechnology 31, 46–53 (2013).

Stijn van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000

Charlotte Soneson, Michael I Love, and Mark D Robinson. Differential analyses for rna-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4, 2015.