RFPlasmid Predicting plasmid contigs


License
GPL-3.0
Install
pip install rfplasmid==1.0

Documentation

License: GPL v3 PIP CONDA

RFPlasmid

Predicting plasmid contigs from assemblies

Webinterface

A web-interface to test single fasta files is available here: http://klif.uu.nl/rfplasmid/

Table of Contents

Abstract

Predicting plasmid contigs from assemblies using single copy marker genes, plasmid genes, kmers

Linda van der Graaf-van Bloois, Jaap Wagenaar, Aldert Zomer

Introduction: Antimicrobial resistant (AMR) genes in bacteria are often carried on plasmids. Since these plasmids can spread the AMR genes between bacteria, it is important to know if the genes are located on highly transferable plasmids or in the more stable chromosomes. Whole genome sequence (WGS) analysis makes it easy to determine if a strain contains a resistance gene, however, it is not easy to determine if the gene is located on the chromosome or on a plasmid as genome sequence assembly generally results in 50-300 DNA fragments (contigs). With our newly developed prediction tool, we analyze the composition of these contigs to predict their likely source, plasmid or chromosomal. This information can be used to determine if a resistant gene is chromosomally located or on a plasmid. The tool is optimized for 19 different bacterial species, including Campylobacter, E. coli, and Salmonella, and can also be used for metagenomic assemblies.

Methods: The tool identifies the number of chromosomal marker genes, plasmid replication genes and plasmid typing genes using CheckM and DIAMOND Blast, and determines pentamer frequencies and contig sizes per contig. A prediction model was trained using Random Forest on an extensive set of plasmids and chromosomes from 19 different bacterial species and validated on separate test sets of known chromosomal and plasmid contigs of the different bacteria.

Results: We show that RFplasmid is able to predict chromosomal and plasmid contigs with error rates ranging from 0.002% to 4.66% and that the use of taxon specific models can be superior to a general plasmid prediction model. Single copy chromosomal marker genes, plasmid genes, k-mer content and length of contig all appear to be informative, however k-mer content is highly specific for taxa. Prediction of small contigs remains unreliable, since these contigs consists primarily of repeated sequences present in both plasmid and chromosome, e.g. transposases or because k-mer content or marker genes cannot be easily identified.

Conclusion: The newly developed tool is able to determine if contigs are chromosomal or plasmid with a very high specificity and sensitivity (up to 99%) and can be very useful to analyze WGS data of bacterial genomes and their antimicrobial resistance genes.

Running RFPlasmid

$ rfplasmid --initialize # Only once after installing it. See "Getting the software" below
$ rfplasmid

Error; no arguments. Required to specificy --input and --species
usage: rfplasmid.py [-h] [--species SPECIES] [--input INPUT] [--specieslist] [--jelly] [--out OUT] [--debug] [--training] [--threads THREADS] [--version]

optional arguments:
  -h, --help         show this help message and exit
  --species SPECIES  define species (required)
  --input INPUT      directory with input fasta files (required)
  --specieslist      list of available species models
  --jelly            run jellyfish as kmer-count (faster)
  --out OUT          specify output directory
  --debug            no cleanup of intermediate files
  --training         trainings mode Random Forest
  --threads THREADS  specify number of threads to be used, default is max available threads up to 16 threads
  --version          print version number

# Example
rfplasmid --species Campylobacter --input inputfolder --jelly --threads 8 --out outputfolder

A folder containing .fasta file is required as input.

--jelly requires a functional jellyfish install. Greatly speeds up the analysis. Strongly recommended as our kmer profiling method in Python is slow

Read specieslist.txt or run rfplasmid --specieslist for species specific models. We have a general Enterobacteriaceae model instead of a species model. All others are species except for the "Generic" model which can be used for unknown or metagenomics samples.

Getting the software

Using Conda

thanks to https://github.com/rpetit3. Installs CheckM database as well. A Google Colab notebook in this repository gives an example. The script rfplasmid is placed in ~/.local/bin and assumes that is in your PATH, which is according to the systemd specification (https://www.freedesktop.org/software/systemd/man/file-hierarchy.html). If not, please run the export PATH line.

$ conda install -c bioconda rfplasmid 
$ # or alternatively: conda create -n rfplasmid -c conda-forge -c bioconda rfplasmid ; conda activate rfplasmid
$ rfplasmid --initialize # Bash helper script to locate rfplasmid.py and initialize the plasmid databases
$ export PATH=$PATH:~/.local/bin/ #only necessary if you have not included ~/.local/bin in your path (unusual but it has been observed). 
$ rfplasmid

Using Pip

Installs most requirements except DIAMOND and JellyFish and R (see below). You need to download HMMER, Prodigal and a database for CheckM if you have never installed it. A Google Colab notebook in this repository gives an example if you want to do this systemwide. This is much more work than Conda and requires more skills.

$  pip3 install rfplasmid
$  export PATH=$PATH:~/.local/bin # pip installs in ~/.local/bin and it should be in your path but some distros don't have this set (even though they should).
$  rfplasmid --initialize #We makes use of a bash helper script to locate the rfplasmid.py file and to download the plasmid databases as they are too large for pip
$  export PATH=$PATH:~/.local/bin/ #only necessary if you have not included ~/.local/bin in your path (unusual but it has been observed). 
$  rfplasmid

Required if you have never installed CheckM before

CheckM relies on a number of precalculated data files which can be downloaded from https://data.ace.uq.edu.au/public/CheckM_databases/. Decompress the file to an appropriate folder and run the following to inform CheckM of where the files have been placed. The example below uses wget to download the an archive of file and installs them in your homedir.

$  sudo apt install hmmer # CheckM needs HMMER. See http://hmmer.org/documentation.html for other methods of downloading and installing it
$  cd ~
$  wget https://github.com/hyattpd/Prodigal/releases/download/v2.6.3/prodigal.linux
$  cp prodigal.linux ~/bin/prodigal #we assume you have a ~/bin/ folder and it's in your path. 
$  chmod +x ~/bin/prodigal  
$  mkdir .checkm
$  cd .checkm
$  wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
$  tar xzvf checkm_data_2015_01_16.tar.gz
$  checkm data setRoot ~/.checkm

Dependencies you need to install when installing RFPlasmid using Pip

RandomForest package in R ( https://cran.r-project.org/web/packages/randomForest/index.html ) (likely already installed).

$  R
> install.packages("randomForest") #likely also already installed

DIAMOND ( https://github.com/bbuchfink/diamond )

$ wget http://github.com/bbuchfink/diamond/releases/download/v0.9.24/diamond-linux64.tar.gz
$ tar xzf diamond-linux64.tar.gz
$ cp diamond ~/bin/diamond

Strongly recommended: Jellyfish ( http://www.genome.umd.edu/jellyfish.html )

$ wget https://github.com/gmarcais/Jellyfish/releases/download/v2.2.10/jellyfish-linux
$ cp jellyfish-linux ~/bin/jellyfish
$ chmod +x ~/bin/jellyfish

For advanced users that want to install the latest version from Github

You can get the source and using git and run from the folder you downloaded it to. You will need to install the requirements by hand as well

$ git clone https://github.com/aldertzomer/RFPlasmid.git
$ cd RFPlasmid
$ bash getdb.sh # downloads and formats the plasmid DBs
$ python3 rfplasmid.py

Installing requirements. Assumes you have ~/bin/ in your PATH. Depending on your setup you may need to follow the systemwide version (see further below)

Python 3 with pandas ( https://pandas.pydata.org/)

$  pip3 install pandas

CheckM ( https://ecogenomics.github.io/CheckM/ ). According to the github page of CheckM:

$  pip3 install numpy
$  pip3 install scipy
$  pip3 install pysam
$  pip3 install checkm-genome
$  cd ~
$  mkdir checkm_data
$  cd checkm_data
$  wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
$  tar xzvf checkm_data_2015_01_16.tar.gz
$  checkm data setRoot ~/checkm_data

RandomForest package in R ( https://cran.r-project.org/web/packages/randomForest/index.html )

$  R
> install.packages("randomForest")

DIAMOND ( https://github.com/bbuchfink/diamond )

$ wget http://github.com/bbuchfink/diamond/releases/download/v0.9.24/diamond-linux64.tar.gz
$ tar xzf diamond-linux64.tar.gz
$ cp diamond ~/bin/diamond

Strongly recommended: Jellyfish ( http://www.genome.umd.edu/jellyfish.html )

$ wget https://github.com/gmarcais/Jellyfish/releases/download/v2.2.10/jellyfish-linux
$ cp jellyfish-linux ~/bin/jellyfish
$ chmod +x ~/bin/jellyfish

Systemwide install

Only if you are a system administrator and you know what you are doing.

$ sudo pip3 install rfplasmid
$ sudo rfplasmid --initialize # Should install the databases as well provided all requirements are met
$ rfplasmid

Requirements if you want to install the requirements systemwide. Let your system administrator do this.

Python 3 with pandas ( https://pandas.pydata.org/)

$ sudo pip3 install pandas

CheckM ( https://ecogenomics.github.io/CheckM/ ). According to the Wiki page of CheckM:

$  sudo su -
$  apt install hmmer # CheckM needs HMMER. See http://hmmer.org/documentation.html for other methods of downloading and installing it
$  wget https://github.com/hyattpd/Prodigal/releases/download/v2.6.3/prodigal.linux
$  cp prodigal.linux /usr/local/bin/prodigal
$  chmod +x /usr/local/bin/prodigal 
$  sudo pip3 install numpy
$  sudo pip3 install scipy
$  sudo pip3 install pysam
$  sudo pip3 install checkm-genome 
$  mkdir /usr/local/checkm_data
$  cd /usr/local/checkm_data
$  wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
$  tar xzvf checkm_data_2015_01_16.tar.gz
$  checkm data setRoot/usr/local/checkm_data

RandomForest package in R ( https://cran.r-project.org/web/packages/randomForest/index.html )

$ sudo R
> install.packages("randomForest")

DIAMOND ( https://github.com/bbuchfink/diamond )

$ wget http://github.com/bbuchfink/diamond/releases/download/v0.9.24/diamond-linux64.tar.gz
$ tar xzf diamond-linux64.tar.gz
$ sudo cp diamond /usr/local/bin/diamond

Strongly recommended: Jellyfish ( http://www.genome.umd.edu/jellyfish.html )

$ wget https://github.com/gmarcais/Jellyfish/releases/download/v2.2.10/jellyfish-linux
$ sudo cp jellyfish-linux /usr/local/bin/jellyfish
$ sudo chmod +x /usr/local/bin/jellyfish

Google Colab Notebook

An alternative to installing RFPlasmid on your own server is running it from Google Colab. We have provided a Notebook suitable for Google Colab. Please navigate to

https://github.com/aldertzomer/RFPlasmid/blob/master/RFPlasmid_Conda_install_and_run.ipynb

and press "Open in Google Colab" and follow the instructions on the Google Colab notebook.

The direct link is available here but only works if you have a signed in to your Google account: https://colab.research.google.com/github/aldertzomer/RFPlasmid/blob/master/RFPlasmid_Conda_install_and_run.ipynb

Output Files

File Description
prediction.csv This is the primary output as comma delimited file. c = chromosome, p=plasmid.
prediction_full.csv This is the primary output as comma delimited file including all input data.
classification.RData The R object containing input and output.
outputdataframe.csv Output from the various scripts used as input by the randomForest classifier.
other files Temporary input files. Can be safely removed.

Output explained

The file prediction.csv contains the contig number (column 1), the prediction wether it's chromosomal or plasmid (column 2), the votes for chromosome or plasmids (columns 3 and 4, and the original contig ID (column 5).

Contig Prediction Votes chromosomal Votes plasmid ContigID
Kp1_1 p 0.100 0.900 Kp1_ctg1

The file prediction_full.csv contains the data as before, but also includes the input data.

Contig prediction Votes chromosomal Votes plasmid contigID Genome contig length % SCM % plasmid genes %ID plasmidfinder Number of kmers SCM genes plasmid genes kmer fractions etc
Kp1_1 p 0.101 0.899 Kp1_ctg1 Kp1 50000 0.12 0.51 69.5 138845 12 32 0.01477

All headers are explained below

Header Header contents Explanation
Contig Contig number
prediction prediction Plasmid or chromosome
votes chromosomal Votes chromosomal Random forest votes for the chromosome class (0-1)
votes plasmid Votes plasmid Random forest votes for the plasmid class (0-1)
contigID contigID Original contig ID
genome Genome Name of genome
contig_length contig length length of contig in bases
SCM_genes % SCM Percentage of genes that is classified as chromosomal marker gene based on CheckM
plasmid_genes_genes % plasmid genes Percentage of genes that is classified as chromosomal marker gene based on alignment against a plasmid proteins database
plasmidcge_id %ID plasmidfinder Highest identity percentage of hits against the PlasmidFinder database
kmer_number Number of kmers Number of kmers in the contig
SCM SCM genes Number of genes that is classified as chromosomal marker gene based on CheckM
plasmid_genes plasmid genes Number of genes that is classified as chromosomal marker gene based on alignment against a plasmid proteins database
kmer1 kmer fractions Fractions of the 1024 4-mers from AAAA to TTTTT
etc

Training your own model

RFPlasmid comes with an option to train your own model if you provide it with contigs of chromosomes and plasmids. The following steps will generate the model for you and make it available for your RFplasmid installation. We will use the genus Lactococcus as an example.

The following steps need to be taken.

  1. Get and install RFPlasmid from Github
  2. Prepare a folder containing your plasmid contigs and your chromosome contigs. Draft genome assemblies are highly recommended, don't use completed genomes, as the genomes you want to analyze also won't be complete. Make sure that the plasmid contig file(s) start with "p" and the file(s) with the chromosomal contigs start with "c". In the example they are called plasmids.fasta and chromosomes.fasta, but as long as the files start with p or c your are good. They may also be separate files per genome/plasmid. You need about 20 plasmids and 20 chromosomes for a decent classification.
  3. Add the name of your genus/family/etc to specieslist.txt. Use the name exactly as it is listed in CheckM for the appropriate taxon level. It is not recommended to use a species level model, else you will have to deal with spaces in names. Use only genus level or higher.
  4. Run RFplasmid in training mode. If you have very large numbers of contigs (>5000) it is recommended to adjust the sampsize option in the training.R script. It is automatically set to the maximum value (2/3rds of the smallest number of contigs per class).
  5. Move the resulting training.rfo R object in the output folder to the RFplasmid folder and name it appropriately
  6. Test your new plasmid prediction model

Below an example of the commands. Replace Lactococcus with the genus of interest and ofcourse change "/mnt/data/files" to the folder containing the input

$ git clone https://github.com/aldertzomer/RFPlasmid.git
$ cd RFPlasmid
$ bash getdb.sh
$ mkdir Lactococcus
$ cat /mnt/data/files/p*.fasta > Lactococcus/plasmids.fasta
$ cat /mnt/data/files/c*.fasta > Lactococcus/chromosomes.fasta
$ checkm taxon_list |grep Lactococcus
$ echo "Lactococcus genus" >> specieslist.txt # this is important. Don't skip this step.
$ python3 rfplasmid.py --training --jelly --input Lactococcus --threads 16 --out Lactococcus.out --species Lactococcus
$ cp Lactococcus.out/training.rfo Lactococcus.rfo

It is recommended to load the object in R and explore it to check how well it performs using the OOB output and the confusion matrix. Generally you want to have the number of chromosome contigs that are predicted as being plasmid contigs as low as possible. It is also recommended to check the sizes of the contigs that are incorrectly predicted, as contigs <1 kb are more difficult to predict and at some point no improvements are possible. Modifying the ratio of the contigs sampled in the sampsize option in the training.R script may push the values to more chromosomal or plasmid contigs correctly predicted. In all our models the ratio was kept 1:1.

$ R
> library(randomForest)
> load("Lactococcus.rfo")
> rf

Test your model using

python3 rfplasmid.py --jelly --input Lactococcus --threads 16 --out Lactococcus.out2 --species Lactococcus

Training data

Plasmid databases can be downloaded from: http://klif.uu.nl/download/plasmid_db/ Data used for training can be downloaded here: http://klif.uu.nl/download/plasmid_db/trainingsets2/