PCALF - Retrieve Calcyanin protein and ccyA gene from genomes.

PCALF stand for Python CALcyanin Finder.

Benzerara, K., Duprat, E., Bitard-Feildel, T., Caumes, G., Cassier-Chauvat, C., Chauvat, F., ... & Callebaut, I. (2022). A new gene family diagnostic for intracellular biomineralization of amorphous Ca carbonates by cyanobacteria. Genome Biology and Evolution, 14(3), evac026.

The ccyA gene is searched at the protein level following a simple decision tree based on the specific composition of the C-ter extremity of this protein. We use a weighted HMM profile specific of the glycine zipper triplication (aka GlyX3) to detect sequences with a putative glycine triplication. Sequences with at least one hit against this profile are annotated with three HMM profiles specific of each Glycine zipper : Gly1, Gly2 and Gly3. A set of known calcyanins is used to infere the type of the N-ter extremity (X, Z, Y, CoBaHMA or ?). Finally, a flag is assign to each sequence depending on its N-terminus type and its C-terminus modular organization.

Decision tree :

Installation :

mamba create -n pcalf -c k2sohigh pcalf;

pip3 install pcalf;

Dependencies :

python==3.9
pyhmmer==0.7.1
biopython==1.81
numpy>=1.21
pyyaml>=6
snakemake==7.22
pandas==1.5.3
tqdm==4.64.1
plotly==5.11.0
python-igraph==0.10.4

External dependency :

blast

Usage :

Pcalf is composed of several commands :

pcalf
pcalf-datasets-workflow
pcalf-annotate-workflow
pcalf-report

pcalf :

This command can be used to look quickly for the presence of calcyanin in a set of amino acid sequences. It take one or more fasta files as input and output several files including a summary, a list of features and a list of raw hits produced during the search. It also produce updated HMMs and updated MSAs with newly detected calcyanin tagged as Calcyanin with known N-ter.

pcalf -i proteins.fasta 
      -o pcalf_results 
      --iterative-search
      --iterative-update 
      --gly1-msa custom_gly1_msa.fasta

--iterative-search : If True, the search will be performed with profiles produced by the previous iteration until there is no new sequence detected or if --max-iteration is reached.

--iterative-update : True calcyanins (if any) will be added to the HMMs profiles iteratively starting with the best sequence (based on feature' score).

--gly1-msa : Use another MSA instead of the default one. A HMM profile will be built from the given MSA and Glycine weight will be increased.

Thresholds:

By default, coverage and E-value threshold are infered from the HMM profiles that will be used for the search. To define the thresholds, a MSA is converted into a simple fasta file by deleting all gaps and aligned against its own HMM profile. A soft filtering is made using a coverage threshold of 0.5. The maximum E-values and the minimum coverage values are used to define the thresholds as follows : max(E-value)*10, min(coverage)-0.1

pcalf-datasets-workflow :

pcalf-datasets-workflow can be used to retrieve genomes from NCBI databases such as RefSeq and GenBank based on accession (e.g GCA_012769535.1) or TaxID.

Genomes and annotations (genes and CDS) will be downloaded using the new command line tools from NCBI. If annotations does not exists for a genome, then genes and CDS will be predicted using Prodigal.

A yaml file is also produced and can be used as input for pcalf-annotate-workflow.

GCA_012769535.1:
  genome : /path/to/genome.gz
  cds_faa: /path/to/cds_faa.gz
  cds_fna: /path/to/cds_fna.gz
GCF_012769535.1:
...

example :

pcalf-datasets-workflow -t 1117 
                        -o pcalf_datasets_results 
                        -a file_with_accession.txt 
                        -e file_with_accession_to_ignore.txt

The command above will download all cyanobacteria genome (taxid : 1117) and genomes specified in file_with_accession.txt. If any of them are specified in file_with_accession_to_ignore.txt, then they will be ignored.

pcalf-annotate-workflow :

pcalf-annotate-workflow is actually the workflow that will help you recover calcyanin proteins and ccyA genes from a set of genomes. This workflow is composed of multiple steps :

genome taxonomic classification using GTDB-TK V2
genome quality assessment using CheckM
calcyanin detection using pcalf
calcyanin / ccyA linking using a specific python script.
NCBI metadatas recovery using NCBI

Note, that GTDB-TK and checkM requires external databases, respectively GTDB and CheckM datas. In addition, it's advised to run GTDB-TK and CheckM on a computer cluster. Because pcalf-annotate-workflow rely on snakemake you can easily provide a snakemake profile through the --snakargs option to run it on your favorite cluster. On the other hand, you can skip the genome taxonomic classification and the quality assessment with the --quick flag.

pcalf-annotate-workflow take as input a yaml file with a specific format, see pcalf-datasets-workflow for details.

example :

pcalf-annotate-workflow -i input_file.yaml 
                        -o pcalf_annotate_results 
                        --db db_file_from_another_run.sqlite3 
                        --snakargs "--profile my_slurm_profile --use-conda --jobs 50" 
                        --gtdb my_gtdb_directory 
                        --checkm my_checkm_datas_directory

The command above will process all the genome specified in input_file.yaml through the pcalf-annotate-workflow including checkm and gtdb-tk steps. The sqlite3 file produced will be merged with db_file_from_another_run.sqlite3. The workflow will be ran on your computer cluster with 50 jobs at a time. See snakemake documentation for details about cluster execution.

Several output files for each step will be produced but the final output is a sqlite3 database storing multiple tables:

- genome          # NCBI metadatas
- gtdbtk          # GTDB-TK classification results
- checkm          # Checkm Results 
- summary         # PCALF summary table
- features        # PCALF features table
- hits            # PCALF hits table
- ccyA            # ccyA table
- gly1            # Gly1 MSA
- gly2            # Gly2 MSA
- gly3            # Gly3 MSA
- glyx3           # Glyx3 MSA
- nterdb          # N-ter table

You can use the sqlite3 database from another run as a basis for a new one. In this case, MSAs stored in the sqlite3 file will be used to generate HMM profiles for pcalf.

pcalf-report :

This command produce an HTML report from a sqlite3 database given by pcalf annotate workflow.

example :

pcalf-report --db sqlite3_file.sqlite3 --out report.html

pcalf
Release 1.2.6

Release 1.2.6

0.0.1

0.0.2

0.0.3

1.1.0

1.2.6

1.2.5

1.2.4

1.2.3

1.2.2

1.2.1

Documentation

PCALF - Retrieve Calcyanin protein and ccyA gene from genomes.

Table of Contents :

Calcyanin detection :

Decision tree :

Installation :

Dependencies :

External dependency :

Usage :

pcalf :

Thresholds:

pcalf-datasets-workflow :

pcalf-annotate-workflow :

pcalf-report :

Workflow :

Stats

Development practices

Releases

Contributors

pcalf Release 1.2.6

Release 1.2.6 Toggle Dropdown 0.0.1 0.0.2 0.0.3 1.1.0 1.2.6 1.2.5 1.2.4 1.2.3 1.2.2 1.2.1

Documentation

Stats

Development practices

Releases

Contributors

pcalf
Release 1.2.6

Release 1.2.6

0.0.1

0.0.2

0.0.3

1.1.0

1.2.6

1.2.5

1.2.4

1.2.3

1.2.2

1.2.1