deepsig-biocomp

DeepSig - Predictor of signal peptides in proteins based on deep learning


Keywords
bioinformatics, annotation, bacteria, signal, peptides
License
GPL-3.0
Install
pip install deepsig-biocomp==0.9

Documentation

Publication

Savojardo C., Martelli P.L., Fariselli P., Casadio R. DeepSig: deep learning improves signal peptide detection in proteins. Bioinformatics (2017) 34(10): 1690-1696.

The DeepSig Docker image

Image available on DockerHub: https://hub.docker.com/r/bolognabiocomp/deepsig

Usage of the image

The first step to run the DeepSig Docker container is to pull the container image. To do so, run:

$ docker pull bolognabiocomp/deepsig

The DeepSig Docker image is now installed in your local Docker environment and ready to be used. To show the DeepSig help page, run:

$ docker run bolognabiocomp/deepsig -h

Using TensorFlow backend.
usage: deepsig.py [-h] -f FASTA -o OUTF -k {euk,gramp,gramn} [-a CPU]

DeepSig: Predictor of signal peptides in proteins

optional arguments:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        The input multi-FASTA file name
  -o OUTF, --outf OUTF  The output tabular file
  -k {euk,gramp,gramn}, --organism {euk,gramp,gramn}
                        The organism the sequences belongs to

The program accepts three mandatory arguments:

  • The full path of the input FASTA file containing protein sequences to be predicted;
  • The kingdom the sequences belong to. You must specify "euk" for Eukaryotes, "gramp" for Gram-positive bacteria or "gramn" for Gram-negative bacteria;
  • The output file where predictions will be stored.

Let's now try a concrete example. First, let's download an example sequence from UniProtKB, e.g. the Transthyretin-like protein 52 from Caenorhabditis elegans with accession G5ED35:

$ wget https://www.uniprot.org/uniprot/G5ED35.fasta

Now, we are ready to predict the signal peptide of our input protein. Run:

$ docker run -v $(pwd):/data/ bolognabiocomp/deepsig -f G5ED35.fasta -o G5ED35.out -k euk

In the example above, we are mapping the current working directory ($(pwd)) to the /data/ folder inside the container. This allows the container to see the external FASTA file G5ED35.fasta. The file G5ED35.out now contains the DeepSig prediction, in GFF3 format:

$ cat G5ED35.out

sp|G5ED35|TTR52_CAEEL	DeepSig	Signal peptide	1	20	0.98	.	.	evidence=ECO:0000256
sp|G5ED35|TTR52_CAEEL	DeepSig	Chain	21	135	.	.	.	evidence=ECO:0000256

Columns are as follows:

  • Column 1: the protein ID/accession as reported in the FASTA input file;
  • Column 2: the name of the tool performing the annotation (i.e. DeepSig);
  • Column 3: the annotated feature along the sequence. Can be "Signal peptide" or "Chain" (indicating the mature protein). When no signal peptide is detected, the entire protein sequence is annotated as "Chain";
  • Column 4: start position of the feature;
  • Column 5: end position of the feature;
  • Column 6: feature annotation score (as assigned by DeepSig);
  • Columns 7, 8: always empty (reported as "."), included for compliance with the GFF3 format;
  • Column 9: description field. Reports the evidence code for the annotation (i.e. ECO:0000256, automatic annotation).
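
The column layout above maps directly onto a small record type. The following is a minimal sketch of a parser for this output, assuming the columns are tab-separated (as GFF3 requires) and that a "." in column 6 means no score. The names Feature, parse_deepsig_line and parse_deepsig_output are illustrative and are not part of DeepSig.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    """One row of DeepSig output (hypothetical helper type, not part of DeepSig)."""
    seq_id: str             # column 1: protein ID/accession
    source: str             # column 2: annotating tool
    feature: str            # column 3: "Signal peptide" or "Chain"
    start: int              # column 4
    end: int                # column 5
    score: Optional[float]  # column 6: None when reported as "."
    attributes: dict        # column 9: key=value pairs


def parse_deepsig_line(line):
    """Parse a single tab-separated output line into a Feature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seq_id, source, feature, start, end, score, _strand, _phase, attrs = cols
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
    return Feature(
        seq_id, source, feature, int(start), int(end),
        None if score == "." else float(score),
        attributes,
    )


def parse_deepsig_output(text):
    """Parse a whole output file, skipping blank lines and '#' header lines."""
    return [
        parse_deepsig_line(line)
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]


# The two rows of the G5ED35 example, written with explicit tabs.
example = (
    "sp|G5ED35|TTR52_CAEEL\tDeepSig\tSignal peptide\t1\t20\t0.98\t.\t.\tevidence=ECO:0000256\n"
    "sp|G5ED35|TTR52_CAEEL\tDeepSig\tChain\t21\t135\t.\t.\t.\tevidence=ECO:0000256\n"
)
for feat in parse_deepsig_output(example):
    print(feat.feature, feat.start, feat.end, feat.score)
```
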

Install and use DeepSig from source

Source code available on GitHub at https://github.com/BolognaBiocomp/deepsig

Installation and configuration

DeepSig is designed to run on Unix/Linux platforms. The software is written in Python and was tested under Python 3.

To obtain DeepSig, clone the repository from GitHub:

$ git clone https://github.com/BolognaBiocomp/deepsig.git

This will create a directory named “deepsig”. Before running DeepSig, set and export an environment variable named DEEPSIG_ROOT that points to the deepsig installation directory:

$ export DEEPSIG_ROOT='/path/to/deepsig'
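
When DeepSig is launched from a wrapper script, it can help to check this variable before starting a run. The sketch below only verifies that DEEPSIG_ROOT is set and points to an existing directory. The function name resolve_deepsig_root is hypothetical, and how DeepSig itself reads the variable is not shown here.

```python
import os


def resolve_deepsig_root(environ=None):
    """Return the DEEPSIG_ROOT directory, or raise if it is unset or missing.

    `environ` defaults to the process environment; passing a dict makes the
    check easy to exercise without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    root = environ.get("DEEPSIG_ROOT")
    if not root:
        raise RuntimeError("DEEPSIG_ROOT is not set; export it before running DeepSig")
    if not os.path.isdir(root):
        raise RuntimeError(f"DEEPSIG_ROOT does not point to a directory: {root}")
    return root


# Demonstration with an explicit mapping (the current directory stands in
# for the real installation directory).
print(resolve_deepsig_root({"DEEPSIG_ROOT": os.getcwd()}))
```
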

Before running the program, you need to install the DeepSig dependencies. We suggest using Conda (Miniconda3 in particular) to create a Python virtual environment and activate it.

To create a conda env for deepsig:

$ conda create -n deepsig

To activate the environment:

$ conda activate deepsig

The following Python libraries are required:

  • biopython (version 1.78)
  • Keras (version 2.4.3)
  • Tensorflow (version 2.2)

To install all requirements:

$ conda install --yes nomkl keras==2.4.3 biopython==1.78 tensorflow==2.2.0

You can now use DeepSig (see the next section). Remember to keep the environment active. If you wish, you can copy the “deepsig.py” script to a directory in your PATH.

Usage

The program accepts three mandatory arguments:

  • The full path of the input FASTA file containing protein sequences to be predicted;
  • The kingdom the sequences belong to. You must specify "euk" for Eukaryotes, "gramp" for Gram-positive bacteria or "gramn" for Gram-negative bacteria;
  • The output file where predictions will be stored.

As an example, run the program on the eukaryotic example FASTA file contained in the folder "testdata":

$ ./deepsig.py -f testdata/SPEuk.nr.fasta -k euk -o testdata/SPEuk.nr.out

This will run deepsig on sequences contained in the "testdata/SPEuk.nr.fasta" file, using the Eukaryotes models and storing the output in the "testdata/SPEuk.nr.out" file.
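
For batch runs, the same invocation can be assembled from Python. This is a minimal sketch: only the -f, -o and -k flags and the three organism values come from the help text, while the helper names build_deepsig_command and run_deepsig and the default script path are assumptions made for illustration.

```python
import shlex
import subprocess

# Organism values accepted by the -k option, as listed in the help text.
VALID_ORGANISMS = ("euk", "gramp", "gramn")


def build_deepsig_command(fasta, outf, organism, script="./deepsig.py"):
    """Return the argument list for one DeepSig run (hypothetical helper)."""
    if organism not in VALID_ORGANISMS:
        raise ValueError(f"organism must be one of {VALID_ORGANISMS}, got {organism!r}")
    return [script, "-f", fasta, "-k", organism, "-o", outf]


def run_deepsig(fasta, outf, organism, script="./deepsig.py"):
    """Run DeepSig; subprocess raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_deepsig_command(fasta, outf, organism, script), check=True)


# Build (but do not execute) the command for the bundled eukaryotic test set.
cmd = build_deepsig_command("testdata/SPEuk.nr.fasta", "testdata/SPEuk.nr.out", "euk")
print(shlex.join(cmd))
```

Validating the organism value before launching avoids starting a run that the argument parser would reject anyway.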

Once the prediction is done, the GFF3 output should look like the following:

$ cat testdata/SPEuk.nr.out
##gff-version 3
G5ED35	DeepSig	Signal peptide	1	20	0.98	.	.	evidence=ECO:0000256
G5ED35	DeepSig	Chain	21	135	.	.	.	evidence=ECO:0000256
Q59XX2	DeepSig	Signal peptide	1	21	1.0	.	.	evidence=ECO:0000256
Q59XX2	DeepSig	Chain	22	378	.	.	.	evidence=ECO:0000256
Q9VMD9	DeepSig	Signal peptide	1	18	0.98	.	.	evidence=ECO:0000256
Q9VMD9	DeepSig	Chain	19	2188	.	.	.	evidence=ECO:0000256
Q4V4I9	DeepSig	Signal peptide	1	22	0.98	.	.	evidence=ECO:0000256
Q4V4I9	DeepSig	Chain	23	182	.	.	.	evidence=ECO:0000256
Q8SXL2	DeepSig	Signal peptide	1	18	1.0	.	.	evidence=ECO:0000256
Q8SXL2	DeepSig	Chain	19	136	.	.	.	evidence=ECO:0000256
F1NSM7	DeepSig	Signal peptide	1	18	1.0	.	.	evidence=ECO:0000256
F1NSM7	DeepSig	Chain	19	743	.	.	.	evidence=ECO:0000256
Q9SUQ8	DeepSig	Chain	1	187	.	.	.	evidence=ECO:0000256
P0DKU2	DeepSig	Chain	1	145	.	.	.	evidence=ECO:0000256
C9K4X8	DeepSig	Signal peptide	1	29	1.0	.	.	evidence=ECO:0000256
C9K4X8	DeepSig	Chain	30	116	.	.	.	evidence=ECO:0000256
....

Columns are as follows:

  • Column 1: the protein ID/accession as reported in the FASTA input file;
  • Column 2: the name of the tool performing the annotation (i.e. DeepSig);
  • Column 3: the annotated feature along the sequence. Can be "Signal peptide" or "Chain" (indicating the mature protein). When no signal peptide is detected, the entire protein sequence is annotated as "Chain";
  • Column 4: start position of the feature;
  • Column 5: end position of the feature;
  • Column 6: feature annotation score (as assigned by DeepSig);
  • Columns 7, 8: always empty (reported as "."), included for compliance with the GFF3 format;
  • Column 9: description field. Reports the evidence code for the annotation (i.e. ECO:0000256, automatic annotation).
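
On a multi-sequence output, a common next step is to list which proteins carry a predicted signal peptide and where it ends. The sketch below assumes tab-separated rows and relies on the rule stated above, that proteins without a signal peptide are reported with a single "Chain" row. The function name summarize_predictions is illustrative.

```python
def summarize_predictions(gff_text):
    """Map each protein ID to (signal_peptide_end, score), or to None when
    only a "Chain" row is reported for it."""
    summary = {}
    for line in gff_text.splitlines():
        cols = line.split("\t")
        # Skip the "##gff-version" header and anything that is not a 9-column row.
        if line.startswith("#") or len(cols) != 9:
            continue
        seq_id, feature, end, score = cols[0], cols[2], cols[4], cols[5]
        if feature == "Signal peptide":
            summary[seq_id] = (int(end), float(score))
        else:
            # Keep an existing signal peptide entry; otherwise record "no hit".
            summary.setdefault(seq_id, None)
    return summary


# Three rows taken from the example output, written with explicit tabs.
example = "\n".join([
    "##gff-version 3",
    "G5ED35\tDeepSig\tSignal peptide\t1\t20\t0.98\t.\t.\tevidence=ECO:0000256",
    "G5ED35\tDeepSig\tChain\t21\t135\t.\t.\t.\tevidence=ECO:0000256",
    "Q9SUQ8\tDeepSig\tChain\t1\t187\t.\t.\t.\tevidence=ECO:0000256",
])
summary = summarize_predictions(example)
with_sp = [pid for pid, hit in summary.items() if hit is not None]
print(f"{len(with_sp)} of {len(summary)} proteins have a predicted signal peptide")
```
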

Please report bugs to: castrense.savojardo2@unibo.it