pyswrd

Cython bindings and Python interface to SWORD (Smith Waterman On Reduced Database), a heuristic method for fast database search.


Keywords
bioinformatics, sequence, alignment, opal, sword
License
GPL-3.0-only
Install
pip install pyswrd==0.2.0

Documentation

🐍🗡️ PySWRD Stars

Cython bindings and Python interface to SWORD (Smith Waterman On Reduced Database), a method for fast database search.

Actions Coverage License PyPI Bioconda AUR Wheel Python Versions Python Implementations Source Mirror Issues Docs Changelog Downloads

🗺️ Overview

Searching a sequence inside a database of target sequences involves aligning the sequence to all the targets to find the highest scoring ones, which has a high computational cost. Several methods have been proposed over the years that use a pre-filter to select. In BLAST[1], k-mers are extracted from the query, and only targets containing high-scoring k-mers, with respect to the scoring matrix, are actually aligned.

SWORD[2] proposes a pre-filter built on perfect hashing of short mismatching k-mers. The k-mers generated from the query sequence also include k-mers with mismatches to improve sensitivity. When a k-mer is found in a target sequence, SWORD computes the diagonal where it is located, similarly to FASTA[3]. Target sequences are then selected based on the number of hits they have on the same diagonal. The pairwise alignment is then handled by the platform-accelerated Opal[4] library.

PySWRD is a Python module that provides bindings to the heuristic filter part of SWORD using Cython. It implements a user-friendly, Pythonic interface to build a heuristic filter, process a database in chunks, and produce the indices of targets passing the filter. The resulting indices can be used to filter a PyOpal database, using Opal for pairwise alignment like the original C++ implementation.

  • no binary dependency: PySWRD is distributed as a Python package, so you can add it as a dependency to your project, and stop worrying about the SWORD binary being present on the end-user machine.
  • no intermediate files: Everything happens in memory, in a Python object you control, so you don't have to invoke the SWORD CLI using a sub-process and temporary files.
  • better portability: Using only the heuristic filter of SWORD allows the code to be independent of the local CPU features, unlike SWORD and Opal which require SIMD. PySWRD delegates the SIMD compilation and dynamic dispatch to PyOpal to make the package easier to install. It also benefits from the wider platform support of PyOpal compared to the original Opal, featuring support for Windows and for Aarch64 CPUs.

🔧 Installing

PySWRD is available for all modern Python versions (3.6+).

It can be installed directly from PyPI, which hosts some pre-built x86-64 wheels for Linux, MacOS, and Windows, as well as the code required to compile from source with Cython:

$ pip install pyswrd

💡 Example

PySWRD does not provide I/O, so the sequences to be used have to be loaded through another library, such as Biopython. PySWRD only requires the sequences to be available as Python strings:

targets = [
    'MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPK', 
    'MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASG', 
    'MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWKMNRE', 
    'MYQAINPCPQSWYGSPQLEREIVCKMSGAPHYPNYYPVHPNALGGAWFDT', 
    'MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSN'
]
queries = [
    'MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPG', 
    'MSFKVYDPIAELIATQFPTSNPDLQIINNDVLVVSPHKIT', 
    'MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGK'
]

Use the high-level search function, which wraps the internal classes in a single function to quickly run many-to-many searches in the event all your sequences are in memory. It expects the sequences as iterable of Python strings, and yields hits passing E-value and alignment thresholds:

import pyswrd
for hit in pyswrd.search(queries, targets):
    print(hit.query_index, hit.target_index, hit.score, hit.evalue)

Different parameters can be passed to pyswrd.search and are passed to the SWORD filter and Opal alignment. For instance, to run SWORD in fast mode instead of the default sensitive mode, and using the PAM70 matrix instead of BLOSUM62, use:

for hit in pyswrd.search(queries, targets, scorer_name="PAM70", score_threshold=0, kmer_length=5):
    print(hit.query_index, hit.target_index, hit.score, hit.evalue)

By default multithreading is supported, using one thread per CPU on the local machine as reported by os.cpu_count, but it can be changed with the threads argument:

for hit in pyswrd.search(queries, targets, threads=1):
    print(hit.query_index, hit.target_index, hit.score, hit.evalue)

You can also use the pyswrd.HeuristicFilter class directly if you wish to manage the data yourself, or if you want to use a different aligner.

⏱️ Benchmarks

The table below shows the time for running pyswrd.search using 196 proteins as queries (uniprot_sprot196.fasta) against a database of 12,701 proteins (uniprot_sprot12071.fasta) pre-loaded into memory:

threads=1 threads=2 threads=4 threads=8 threads=12
max_candidates=10 0.87s 0.83s 0.83s 0.80s 0.76s
max_candidates=50 0.98s 0.91s 0.98s 0.97s 1.04s
max_candidates=100 1.24s 1.33s 1.44s 1.63s 1.67s
max_candidates=500 1.86s 1.83s 1.95s 2.09s 2.15s
max_candidates=1000 2.87s 2.64s 2.83s 2.82s 2.90s
max_candidates=5000 9.33s 8.11s 7.59s 6.60s 6.06s
max_candidates=15000 21.50s 15.85s 14.74s 11.83s 11.34s
max_candidates=30000 23.44s 16.13s 14.61s 12.47s 11.08s
no filter (Opal) 31.38s 23.60s 19.57s 15.43s 14.60s
BLAST+ (blastp) 7.46s 4.97s 4.01s 3.63s 3.66s

The max_candidates parameter controls the strictness of the SWORD heuristic filter, and reduces the total number of alignments made by Opal, at the cost of a lowered sensivity (see SWORD Supplementary Figs. S1 and S2.).

SWORD uses 15,000 candidates in fast mode and 30,000 candidates in sensitive mode by default. This was benchmarked against the NCBI NR database, which contains more than 54M sequences; it is likely a smaller max_candidates value can be selected for smaller databases and/or databases with less redundant sequences without loss of sensitivity.

💭 Feedback

⚠️ Issue Tracker

Found a bug ? Have an enhancement request ? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing in on a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the GNU General Public License v3.0. SWORD was written by Robert Vaser and is distributed under the terms of the GPLv3 as well. See vendor/sword/LICENSE for more information. SWORD redistributes additional libraries under the terms of the MIT License.

This project is in no way not affiliated, sponsored, or otherwise endorsed by the SWORD authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

📚 References