
A computational pipeline enabling functional annotation using amino acid recoding.

A pipeline for applying encoded kmer analysis to protein sequences.
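
To make "encoded kmer analysis" concrete, the sketch below recodes a protein sequence into a reduced alphabet and counts the kmers of the recoded string. The mapping `RECODING` and the helpers `recode`, `kmer_counts`, and `kmer_vector` are hypothetical names chosen for illustration. They are not Snekmer's alphabets or API, and the real pipeline may differ in its details.

```python
from collections import Counter

# Hypothetical reduced alphabet, for illustration only (NOT a Snekmer alphabet):
# each of the 20 standard amino acids is mapped to a coarse class symbol.
RECODING = {
    **dict.fromkeys("AVLIMC", "H"),  # hydrophobic
    **dict.fromkeys("FWY", "R"),     # aromatic
    **dict.fromkeys("STNQGP", "P"),  # polar or small
    **dict.fromkeys("DE", "N"),      # negatively charged
    **dict.fromkeys("KRH", "C"),     # positively charged
}


def recode(sequence):
    """Translate a protein sequence into the reduced alphabet.

    Residues outside the 20 standard amino acids map to the placeholder "X".
    """
    return "".join(RECODING.get(aa, "X") for aa in sequence.upper())


def kmer_counts(sequence, k):
    """Count overlapping kmers of length k in the recoded sequence."""
    encoded = recode(sequence)
    return Counter(encoded[i:i + k] for i in range(len(encoded) - k + 1))


def kmer_vector(sequence, k, basis):
    """Express a sequence as counts over a fixed, ordered list of kmers."""
    counts = kmer_counts(sequence, k)
    return [counts.get(kmer, 0) for kmer in basis]
```

For example, `recode("MKVLAD")` gives `"HCHHHN"`, and `kmer_vector("MKVLAD", 2, ["HH", "HC", "NN"])` gives `[2, 1, 0]`. Recoding shrinks the alphabet, so sequences that differ only by similar residues share more kmers.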

Model mode:

  • Input: FASTA files containing protein sequences in known families
  • Output: Models of the known protein families based on kmer vector analysis
    • Evaluation output: Assessment of model performance

Cluster mode:

  • Input: FASTA files containing protein sequences
  • Output: Clusters of similar proteins
    • Evaluation output: Assessment of how well the clusters of similar proteins represent functions

Search mode:

  • Input: FASTA files containing protein sequences; trained models (output from snekmer model mode)
  • Output: Predictions of family membership


We recommend using Anaconda to create a virtual environment. Anaconda handles dependencies and versioning, which simplifies installation.


Use conda to install an environment from the YML file with all required dependencies. (Note: Users may either download the YML file directly from the repository, or clone the repository beforehand using the git clone command below.)

conda env create -f environment.yml

Activate the environment:

conda activate snekmer

Install Snekmer using pip (note: git clone step is optional if you already have the repo cloned locally):

# option 1: clone repository (if you haven't already) and install
git clone
pip install Snekmer

# option 2: direct install
pip install git+

The package should now be ready to use!

Troubleshooting Notes

For Windows users: If you are running into conflicts/errors when creating the conda environment in Windows, you may need to install the minimal version of Snakemake instead:

conda create -n snekmer -c conda-forge -c bioconda biopython matplotlib numpy pandas seaborn snakemake-minimal scikit-learn

Command-Line Interface

To run Snekmer, create a config.yaml file containing the desired parameters. A template is provided at snekmer/config.yaml. Note that the config.yaml file should be stored in the same directory as the input directory.

Snekmer assumes that input files are stored in the input directory, and automatically creates an output directory to save all output files. Snekmer also assumes background files, if any, are stored in input/background/. An example of the assumed directory structure is shown below:

├── config.yaml
├── input/
│   ├── background/
│   │   ├── X.fasta
│   │   ├── Y.fasta
│   │   └── etc.
│   ├── A.fasta
│   ├── B.fasta
│   └── etc.
├── output/
│   ├── ...
│   └── ...
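
The layout above can be checked before running anything. The following is a minimal sketch, not part of Snekmer: `check_layout` is a hypothetical helper, and it assumes input files use the `.fasta` extension shown in the example tree (Snekmer may accept other extensions).

```python
from pathlib import Path


def check_layout(project_dir, config_name="config.yaml"):
    """Return a list of problems found in the assumed directory layout.

    An empty list means the configuration file and at least one FASTA
    file in input/ were found. The output/ directory is not checked,
    since Snekmer creates it automatically.
    """
    root = Path(project_dir)
    problems = []
    if not (root / config_name).is_file():
        problems.append(f"missing {config_name}")
    input_dir = root / "input"
    if not input_dir.is_dir():
        problems.append("missing input/ directory")
    elif not any(input_dir.glob("*.fasta")):
        # Assumption: inputs are named *.fasta, as in the example tree.
        problems.append("no .fasta files found in input/")
    return problems
```

Calling `check_layout(".")` from the project directory returns an empty list when the structure matches. For search mode, pass `config_name="search.yaml"`.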


Snekmer has three operation modes: model (supervised modeling), cluster (unsupervised clustering), and search (application of a model to new sequences). We refer to the first two as learning modes because they learn relationships between protein family input files. Users may choose the mode that best suits their use case.

The mode must be specified on the command line. For example, to run model mode:

snekmer model [--options]

In the resources directory, two example configuration files are included:

  • resources/config.yaml: Configuration file for snekmer model and snekmer cluster modes.
  • resources/search.yaml: Configuration file for snekmer search mode. Note that the Snekmer CLI automatically assumes that the configuration file will be named config.yaml, so to use the provided file, run snekmer search --configfile search.yaml.

To check that the pipeline is configured correctly before processing any files, perform a dry run:

snekmer [mode] --dryrun

(For instance, in supervised mode, run snekmer model --dryrun.)

The output of the dry run shows you the files that will be created by the pipeline. If no files are generated, double-check that your directory structure matches the format specified above.

When you are ready to process your files, run:

snekmer [mode]


Each step in the Snekmer pipeline generates its own associated output files. Both operation modes will preprocess parameters, generate labels, and vectorize sequences based on labels. The associated output files can be found in the respective directories.

The following output directories and files are created in both operation modes:

├── input/
│   ├── A.fasta
│   └── B.fasta
└── output/
    ├── processed/
    │   ├── A.json             # processed parameter values for A
    │   ├── B.json             # processed parameter values for B
    │   ├── A_description.csv  # summary of sequences in A.fasta
    │   └── B_description.csv  # summary of sequences in B.fasta
    ├── labels/
    │   ├── A.txt              # kmer labels for A
    │   └── B.txt              # kmer labels for B
    ├── features/
    └── ...

Model Mode

Executing snekmer model produces the following output files and directories in addition to the files described previously.

└── output/
    ├── ...
    ├── features/
    │   ├── A/            # kmer vectors in A kmer space
    │   │   ├── A.json.gz
    │   │   └── B.json.gz
    │   └── B/            # kmer vectors in B kmer space
    │       ├── A.json.gz
    │       └── B.json.gz
    ├── score/
    │   ├── A.pkl         # A sequences, scored
    │   ├── B.pkl         # B sequences, scored
    │   └── weights/
    │       ├── A.csv.gz  # kmer score weights in A kmer space
    │       └── B.csv.gz  # kmer score weights in B kmer space
    └── model/
        ├── A.pkl         # (A/not A) classification model
        ├── B.pkl         # (B/not B) classification model
        ├── results/      # cross-validation results table
        │   ├── A.csv
        │   └── B.csv
        └── figures/      # cross-validation results figures
            ├── A/
            └── B/
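
The "(A/not A)" and "(B/not B)" comments above suggest one binary classifier per family. As a rough illustration of that labeling scheme only, the sketch below builds one-vs-rest labels from the family of each sequence. `one_vs_rest_labels` is a hypothetical helper, and Snekmer's own labeling code may work differently.

```python
def one_vs_rest_labels(sequence_families, target_family):
    """Label each sequence 1 if it belongs to target_family, else 0."""
    return [int(family == target_family) for family in sequence_families]


# Hypothetical example: the family of each sequence is taken from the
# name of the FASTA file it came from (A.fasta or B.fasta).
sequence_families = ["A", "A", "B", "A", "B"]
labels_for_A = one_vs_rest_labels(sequence_families, "A")  # A / not A
labels_for_B = one_vs_rest_labels(sequence_families, "B")  # B / not B
```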

Cluster Mode

Executing snekmer cluster produces the following output files and directories in addition to the files described previously.

└── output/
    ├── ...
    ├── features/
    │   └── full/     # kmer vectors in full kmer space for (alphabet, k)
    │       ├── A.json.gz
    │       └── B.json.gz
    └── cluster/
        ├── A.pkl     # A cluster model
        ├── B.pkl     # B cluster model
        └── figures/  # cluster figures (t-SNE)
            ├── A/
            └── B/

Search Mode

The snekmer search mode assumes that the user has pre-generated family models using the snekmer model workflow, and thus operates as an independent workflow. The location of the basis sets, scorers, and models must be specified in the configuration file (see: resources/search.yaml).

For instance, suppose the output examples above have already been produced, and the user would now like to search a set of unknown sequences against those families.

In a separate directory, the user should place files in an input directory with the appropriate YAML file. The assumed input file structure is as follows:

├── search.yaml
├── input/
│   ├── unknown_1.fasta
│   ├── unknown_2.fasta
│   └── etc.
├── output/
│   ├── ...
│   └── ...

The user should then modify search.yaml to point toward the appropriate basis set, scorer, and model directories.

Executing snekmer search --configfile search.yaml produces the following output files and directories in addition to the files described previously.

└── output/
    ├── features/
    │   ├── A/
    │   │   ├── unknown_1.json.gz
    │   │   └── unknown_2.json.gz
    │   └── B/
    │       ├── unknown_1.json.gz
    │       └── unknown_2.json.gz
    └── search/
        ├── A.csv  # A probabilities and predictions for unknown sequences
        └── B.csv  # B probabilities and predictions for unknown sequences
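
Once the search has finished, the per-family CSV files can be collected for inspection. The sketch below is one possible way to do this with the standard library. `load_search_results` is a hypothetical helper, and because the column names of these files are not documented here, it reads whatever header each file provides.

```python
import csv
from pathlib import Path


def load_search_results(search_dir):
    """Map each family name (CSV file stem) to its rows as dictionaries.

    No column names are assumed: each row is keyed by the header
    found in its own file.
    """
    results = {}
    for csv_path in sorted(Path(search_dir).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            results[csv_path.stem] = list(csv.DictReader(handle))
    return results
```

With the tree above, `load_search_results("output/search")` returns a dictionary with the keys `"A"` and `"B"`.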

Partial Workflow

To execute only part of the workflow, use the --until option. For instance, to execute the workflow only through the kmer vector generation step, run:

snekmer [mode] --until vectorize
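
When Snekmer is driven from a script, the command lines described in this document can be assembled programmatically. The following is a minimal sketch: `build_command` and `run_snekmer` are hypothetical helpers, not part of Snekmer, and they use only the modes and options mentioned above (`--configfile`, `--dryrun`, `--until`).

```python
import subprocess

MODES = ("model", "cluster", "search")


def build_command(mode, configfile=None, dryrun=False, until=None):
    """Assemble a snekmer command line as a list of arguments."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    command = ["snekmer", mode]
    if configfile:
        command += ["--configfile", configfile]
    if dryrun:
        command.append("--dryrun")
    if until:
        command += ["--until", until]
    return command


def run_snekmer(mode, **options):
    """Run snekmer and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(mode, **options), check=True)
```

For example, `build_command("model", until="vectorize")` returns `["snekmer", "model", "--until", "vectorize"]`. Running `run_snekmer("model", dryrun=True)` before the full run mirrors the dry-run check recommended earlier.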