Cython bindings and Python interface to SWORD (Smith Waterman On Reduced Database), a method for fast database search.
Searching a sequence inside a database of target sequences involves aligning the sequence to all the targets to find the highest scoring ones, which has a high computational cost. Several methods have been proposed over the years that use a pre-filter to select. In BLAST[1], k-mers are extracted from the query, and only targets containing high-scoring k-mers, with respect to the scoring matrix, are actually aligned.
SWORD[2] proposes a pre-filter built on perfect hashing of short mismatching k-mers. The k-mers generated from the query sequence also include k-mers with mismatches to improve sensitivity. When a k-mer is found in a target sequence, SWORD computes the diagonal where it is located, similarly to FASTA[3]. Target sequences are then selected based on the number of hits they have on the same diagonal. The pairwise alignment is then handled by the platform-accelerated Opal[4] library.
PySWRD is a Python module that provides bindings to the heuristic filter part of SWORD using Cython. It implements a user-friendly, Pythonic interface to build a heuristic filter, process a database in chunks, and produce the indices of targets passing the filter. The resulting indices can be used to filter a PyOpal database, using Opal for pairwise alignment like the original C++ implementation.
- no binary dependency: PySWRD is distributed as a Python package, so you can add it as a dependency to your project, and stop worrying about the SWORD binary being present on the end-user machine.
- no intermediate files: Everything happens in memory, in a Python object you control, so you don't have to invoke the SWORD CLI using a sub-process and temporary files.
- better portability: Using only the heuristic filter of SWORD allows the code to be independent of the local CPU features, unlike SWORD and Opal which require SIMD. PySWRD delegates the SIMD compilation and dynamic dispatch to PyOpal to make the package easier to install. It also benefits from the wider platform support of PyOpal compared to the original Opal, featuring support for Windows and for Aarch64 CPUs.
PySWRD is available for all modern Python versions (3.6+).
It can be installed directly from PyPI, which hosts some pre-built x86-64 wheels for Linux, MacOS, and Windows, as well as the code required to compile from source with Cython:
$ pip install pyswrd
PySWRD does not provide I/O, so the sequences to be used have to be loaded through another library, such as Biopython. PySWRD only requires the sequences to be available as Python strings:
targets = [
'MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPK',
'MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASG',
'MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWKMNRE',
'MYQAINPCPQSWYGSPQLEREIVCKMSGAPHYPNYYPVHPNALGGAWFDT',
'MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSN'
]
queries = [
'MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPG',
'MSFKVYDPIAELIATQFPTSNPDLQIINNDVLVVSPHKIT',
'MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGK'
]
Use the high-level search
function, which wraps the internal classes in a single
function to quickly run many-to-many searches in the event all your sequences are in
memory. It expects the sequences as iterable of Python strings, and yields hits
passing E-value and alignment thresholds:
import pyswrd
for hit in pyswrd.search(queries, targets):
print(hit.query_index, hit.target_index, hit.score, hit.evalue)
Different parameters can be passed to pyswrd.search
and are passed to the
SWORD filter and Opal alignment. For instance, to run SWORD in fast mode
instead of the default sensitive mode, and using the PAM70 matrix instead
of BLOSUM62, use:
for hit in pyswrd.search(queries, targets, scorer_name="PAM70", score_threshold=0, kmer_length=5):
print(hit.query_index, hit.target_index, hit.score, hit.evalue)
By default multithreading is supported, using one thread per CPU on the local
machine as reported by os.cpu_count
, but it can be changed with the threads
argument:
for hit in pyswrd.search(queries, targets, threads=1):
print(hit.query_index, hit.target_index, hit.score, hit.evalue)
You can also use the pyswrd.HeuristicFilter
class directly if you wish to
manage the data yourself, or if you want to use a different aligner.
The table below shows the time for running pyswrd.search
using 196 proteins
as queries (uniprot_sprot196.fasta
) against a database of 12,701 proteins
(uniprot_sprot12071.fasta
) pre-loaded into memory:
threads=1 |
threads=2 |
threads=4 |
threads=8 |
threads=12 |
|
---|---|---|---|---|---|
max_candidates=10 |
0.87s | 0.83s | 0.83s | 0.80s | 0.76s |
max_candidates=50 |
0.98s | 0.91s | 0.98s | 0.97s | 1.04s |
max_candidates=100 |
1.24s | 1.33s | 1.44s | 1.63s | 1.67s |
max_candidates=500 |
1.86s | 1.83s | 1.95s | 2.09s | 2.15s |
max_candidates=1000 |
2.87s | 2.64s | 2.83s | 2.82s | 2.90s |
max_candidates=5000 |
9.33s | 8.11s | 7.59s | 6.60s | 6.06s |
max_candidates=15000 |
21.50s | 15.85s | 14.74s | 11.83s | 11.34s |
max_candidates=30000 |
23.44s | 16.13s | 14.61s | 12.47s | 11.08s |
no filter (Opal) | 31.38s | 23.60s | 19.57s | 15.43s | 14.60s |
BLAST+ (blastp ) |
7.46s | 4.97s | 4.01s | 3.63s | 3.66s |
The max_candidates
parameter controls the strictness of the SWORD heuristic filter, and reduces
the total number of alignments made by Opal, at the cost of a lowered sensivity
(see SWORD Supplementary Figs. S1 and S2.).
SWORD uses 15,000 candidates in fast mode and 30,000 candidates in sensitive mode by default.
This was benchmarked against the NCBI NR
database, which contains more than 54M sequences; it is likely a smaller max_candidates
value can
be selected for smaller databases and/or databases with less redundant sequences without loss of sensitivity.
Found a bug ? Have an enhancement request ? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing in on a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
Contributions are more than welcome! See
CONTRIBUTING.md
for more details.
This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.
This library is provided under the GNU General Public License v3.0.
SWORD was written by Robert Vaser and is distributed under the terms of the
GPLv3 as well. See vendor/sword/LICENSE
for more information. SWORD redistributes additional
libraries under the terms of the MIT License.
This project is in no way not affiliated, sponsored, or otherwise endorsed by the SWORD authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.
- [1] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, David J. Lipman. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. doi:10.1016/S0022-2836(05)80360-2. PMID:2231712.
- [2] Robert Vaser, Dario Pavlović, Mile Šikić. SWORD—a highly efficient protein database search. Bioinformatics, Volume 32, Issue 17, September 2016, Pages i680–i684, doi:10.1093/bioinformatics/btw445.
- [3] David J. Lipman, William R. Pearson. Rapid and sensitive protein similarity searches. Science. 1985 Mar 22;227(4693):1435-41. doi:10.1126/science.2983426. PMID:2983426.
- [4] Korpar Matija, Martin Šošić, Dino Blažeka, Mile Šikić. SW#db: ‘GPU-Accelerated Exact Sequence Similarity Database Search’. PLoS One. 2015 Dec 31;10(12):e0145857. doi:10.1371/journal.pone.0145857. PMID:26719890. PMC4699916.