A NumPy port of the foldseek code for encoding structures to 3di.
foldseek is a method developed by van Kempen et al.[1] for the fast and accurate search of protein structures. To search protein structures at a large scale, it first encodes each 3D structure into a sequence over a structural alphabet, 3di, which captures tertiary amino acid interactions.
mini3di is a pure-Python package to encode 3D structures of proteins into the 3di alphabet, using the trained weights from the foldseek VQ-VAE model.
This library only depends on NumPy and is available for all modern Python versions (3.7+).
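To give a feel for what encoding into a structural alphabet means, the sketch below shows only the final quantization step of a VQ-VAE-style encoder: each residue embedding is assigned the index of its nearest codebook centroid. The centroids and embeddings are made-up toy values, not the foldseek weights, and `nearest_state` is a hypothetical helper rather than part of mini3di.

```python
# Toy 2D codebook: made-up values for illustration, NOT the foldseek weights.
CENTROIDS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def nearest_state(embedding, centroids=CENTROIDS):
    """Return the index of the centroid closest to `embedding`."""
    # Squared Euclidean distance is enough to rank the centroids.
    distances = [
        sum((e - c) ** 2 for e, c in zip(embedding, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


# One (toy) embedding per residue; each gets a discrete state index.
embeddings = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.7)]
states = [nearest_state(e) for e in embeddings]
print(states)
```

In the real package, the embedding and the codebook both come from the trained foldseek model; only the idea of picking the closest state carries over from this sketch.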
Install the mini3di package directly from PyPI, which hosts universal wheels that can be installed with pip:
$ pip install mini3di
mini3di provides a single Encoder class, which expects the 3D coordinates of the Cα, Cβ, N and C atoms of each peptide residue. For residues without Cβ (Gly), simply write the coordinates as math.nan. Call the encode_atoms method to get a sequence of 3di states:
from math import nan
import mini3di

encoder = mini3di.Encoder()
# One [x, y, z] row per residue; "..." stands for the remaining residues.
states = encoder.encode_atoms(
    ca=[[32.9, 51.9, 28.8], [35.0, 51.9, 26.6], ...],
    cb=[[ nan,  nan,  nan], [35.3, 53.3, 26.4], ...],
    n=[[32.1, 51.2, 29.8], [35.3, 51.5, 28.1], ...],
    c=[[34.4, 51.7, 29.1], [36.1, 51.1, 25.8], ...],
)
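When coordinates are stored per residue rather than per atom type, they first need to be regrouped into the four arguments that encode_atoms expects. The sketch below assumes each residue is a plain dict mapping atom names to `[x, y, z]` lists; `split_backbone` is a hypothetical helper, not part of mini3di, and it only fills in NaN for a missing Cβ.

```python
from math import nan


def split_backbone(residues):
    """Regroup per-residue atom dicts into CA, CB, N and C coordinate lists."""
    ca, cb, n, c = [], [], [], []
    for residue in residues:
        # CA, N and C are required here; a missing one raises KeyError.
        ca.append(list(residue["CA"]))
        n.append(list(residue["N"]))
        c.append(list(residue["C"]))
        # Residues without a Cβ (Gly) get NaN coordinates instead.
        cb.append(list(residue.get("CB", [nan, nan, nan])))
    return ca, cb, n, c


residues = [
    {"CA": [32.9, 51.9, 28.8], "N": [32.1, 51.2, 29.8], "C": [34.4, 51.7, 29.1]},
    {"CA": [35.0, 51.9, 26.6], "CB": [35.3, 53.3, 26.4],
     "N": [35.3, 51.5, 28.1], "C": [36.1, 51.1, 25.8]},
]
ca, cb, n, c = split_backbone(residues)
```

The resulting lists can then be passed as `encoder.encode_atoms(ca=ca, cb=cb, n=n, c=c)`.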
The returned states are a NumPy array of state indices. To turn them into a sequence, use the build_sequence method of the encoder:
sequence = encoder.build_sequence(states)
print(sequence)
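Conceptually, building the sequence amounts to looking up one letter per state index. The sketch below illustrates that idea only: `to_sequence` is a hypothetical helper, and the letter ordering in `ALPHABET` is an assumption made for illustration, so use build_sequence for the actual mapping.

```python
# Assumed letter ordering, for illustration only.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def to_sequence(states, alphabet=ALPHABET, unknown="X"):
    """Map integer state indices to letters; out-of-range indices become `unknown`."""
    return "".join(
        alphabet[s] if 0 <= s < len(alphabet) else unknown
        for s in states
    )


print(to_sequence([0, 3, 19]))
```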
The encoder can work directly with Biopython objects, if Biopython is available.
A helper method, encode_chain, extracts the atom coordinates from a Bio.PDB.Chain and encodes them directly. For instance, to encode all the chains from a PDB file:
import pathlib

import mini3di
from Bio.PDB import PDBParser

encoder = mini3di.Encoder()
parser = PDBParser(QUIET=True)
struct = parser.get_structure("8crb", pathlib.Path("tests", "data", "8crb.pdb"))

# Encode every chain of the structure and print its 3di sequence.
for chain in struct.get_chains():
    states = encoder.encode_chain(chain)
    sequence = encoder.build_sequence(states)
    print(chain.get_id(), sequence)
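If the coordinates need to be filtered or inspected before encoding, they can also be gathered by hand and passed to encode_atoms. The sketch below assumes Biopython's usual residue interface (`name in residue` and `residue[name].coord`). `chain_to_atoms` is a hypothetical helper, and its choice to skip residues lacking CA, N or C is an assumption that may differ from what encode_chain does.

```python
from math import nan


def chain_to_atoms(chain):
    """Collect CA, CB, N and C coordinates from a Bio.PDB.Chain-like object."""
    ca, cb, n, c = [], [], [], []
    for residue in chain:
        # Skip waters, ions and ligands, which lack a peptide backbone.
        if not all(name in residue for name in ("CA", "N", "C")):
            continue
        ca.append([float(x) for x in residue["CA"].coord])
        n.append([float(x) for x in residue["N"].coord])
        c.append([float(x) for x in residue["C"].coord])
        if "CB" in residue:
            cb.append([float(x) for x in residue["CB"].coord])
        else:
            # No Cβ (e.g. Gly): use NaN coordinates.
            cb.append([nan, nan, nan])
    return ca, cb, n, c
```

With a parsed chain, the result can be unpacked and forwarded as `encoder.encode_atoms(ca=ca, cb=cb, n=n, c=c)`.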
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker to report it or ask a question. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
Contributions are more than welcome! See CONTRIBUTING.md for more details.
This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.
This library is provided under the GNU General Public License v3.0. It includes some code ported from foldseek, which is licensed under the GNU General Public License v3.0 as well.
This project is in no way affiliated with, sponsored by, or otherwise endorsed by the original foldseek authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.
- [1] Kempen, Michel van, Stephanie S. Kim, Charlotte Tumescheit, Milot Mirdita, Jeongjae Lee, Cameron L. M. Gilchrist, Johannes Söding, and Martin Steinegger. ‘Fast and Accurate Protein Structure Search with Foldseek’. Nature Biotechnology, 8 May 2023, 1–4. doi:10.1038/s41587-023-01773-0.