Our tool is under active development and feedback is very much appreciated.
learnMSA formulates multiple sequence alignment as a statistical machine learning problem: an optimal profile hidden Markov model (pHMM) for a potentially ultra-large family of protein sequences is learned from the unaligned sequences, and an alignment is then decoded from the model. We use a novel, automatically differentiable variant of the Forward algorithm to train pHMMs via gradient descent.
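To make the training idea concrete, here is a minimal, self-contained sketch of a differentiable Forward recursion in TensorFlow. It is not learnMSA's implementation: the toy HMM, its size, the softmax parameterization and the example sequences are all assumptions chosen for illustration; learnMSA's real model is a profile HMM with match, insert and delete states plus additional machinery.

```python
import tensorflow as tf

num_states, alphabet_size = 5, 20   # toy sizes, not learnMSA's architecture

# Unconstrained parameters; log_softmax keeps each row a valid distribution.
trans_logits = tf.Variable(tf.random.normal([num_states, num_states]))
emit_logits  = tf.Variable(tf.random.normal([num_states, alphabet_size]))
init_logits  = tf.Variable(tf.zeros([num_states]))

def forward_log_likelihood(seq):
    """Forward algorithm in log space; seq is a 1-D tensor of symbol indices."""
    log_trans = tf.nn.log_softmax(trans_logits, axis=-1)
    log_emit  = tf.nn.log_softmax(emit_logits, axis=-1)
    log_init  = tf.nn.log_softmax(init_logits)
    alpha = log_init + tf.gather(log_emit, seq[0], axis=1)            # [num_states]
    for t in range(1, int(seq.shape[0])):
        # alpha_new[j] = logsumexp_i(alpha[i] + log_trans[i, j]) + log_emit[j, x_t]
        alpha = (tf.reduce_logsumexp(alpha[:, None] + log_trans, axis=0)
                 + tf.gather(log_emit, seq[t], axis=1))
    return tf.reduce_logsumexp(alpha)                                  # log P(seq | model)

# Because every operation above is differentiable, the negative log-likelihood
# can be minimized by plain gradient descent.
optimizer = tf.keras.optimizers.Adam(0.05)
toy_batch = tf.constant([[0, 3, 3, 7, 1], [0, 3, 7, 7, 1]])           # made-up sequences
for _ in range(100):
    with tf.GradientTape() as tape:
        loss = -tf.reduce_mean(tf.stack([forward_log_likelihood(s) for s in toy_batch]))
    grads = tape.gradient(loss, [trans_logits, emit_logits, init_logits])
    optimizer.apply_gradients(zip(grads, [trans_logits, emit_logits, init_logits]))
```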
- Aligns large numbers of protein sequences with state-of-the-art accuracy
- Enables ultra-large alignment of millions of sequences
- GPU acceleration, multi-GPU support
- Scales linearly in the number of sequences (does not require a guide tree)
- Memory efficient (depending on sequence length, aligning millions of sequences on a laptop is possible)
- Visualize a profile HMM or a sequence logo of the consensus motif
- Experimental use of large protein language models to improve alignment accuracy
- Requires many sequences to achieve state-of-the-art accuracy (typically at least 1000; a few hundred may still be enough)
- Only for protein sequences
- Becomes increasingly slow for long proteins (> 1000 residues)
Choose an installation method according to your preference:
If you haven't done it yet, set up Bioconda channels first.
Recommended way to install learnMSA:

```
conda install mamba
mamba create -n learnMSA_env learnMSA
```

which creates an environment called `learnMSA_env` and installs learnMSA in it. To run learnMSA, you have to activate the environment first:

```
conda activate learnMSA_env
```
While in principle we attempt to support all TensorFlow versions since 2.5.0, there are known incompatibilities with tf >= 2.12.0. We recommend TensorFlow 2.10.0 unless there is a particular reason to use something else.
```
pip install learnMSA
```
Optional, but recommended for proteins longer than 100 residues. The install instructions above may already be sufficient to support GPU, depending on your system. learnMSA will notify you whether it finds any GPUs it can use, or it will fall back to CPU.
You have to meet the TensorFlow GPU requirements and may need to follow the CUDA setup steps.
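If you are unsure whether TensorFlow can see your GPU, a quick check that is independent of learnMSA is:

```python
import tensorflow as tf
# prints a non-empty list if TensorFlow can use at least one GPU, [] otherwise
print(tf.config.list_physical_devices("GPU"))
```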
```
learnMSA -i INPUT_FILE -o OUTPUT_FILE
learnMSA -h
```
Since learnMSA version 1.2.0, insertions are aligned with famsa, which improves overall accuracy. The old behavior can be restored with the `--unaligned_insertions` flag.
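As a rough illustration of what the underlying insertion alignment step looks like, the following sketch aligns a few made-up insertion segments with famsa via the `pyfamsa` package (one of learnMSA's dependencies). This is a conceptual example, not learnMSA's internal code; the segment names and sequences are invented.

```python
from pyfamsa import Aligner, Sequence

# hypothetical insertion segments observed between two consensus columns
segments = [
    Sequence(b"seq1", b"GKLSTAR"),
    Sequence(b"seq2", b"GKISAR"),
    Sequence(b"seq3", b"AKLSAR"),
]

aligner = Aligner()            # famsa's default settings
msa = aligner.align(segments)
for record in msa:
    print(record.id.decode(), record.sequence.decode())
```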
Requirements:
- TensorFlow (we recommend 2.10.0, tested versions: 2.5, >=2.7)
- networkx
- logomaker
- seaborn
- biopython (>=1.69)
- pyfamsa
- transformers
- python 3.9 (there are known issues with 3.7, which is deprecated; 3.8 is untested)
- Clone the repository:

  ```
  git clone https://github.com/Gaius-Augustus/learnMSA
  ```

- Install dependencies with pip or conda.
- Run:

  ```
  cd learnMSA
  python3 learnMSA.py --help
  ```

Run the notebooks `learnMSA_demo.ipynb` or `learnMSA_with_language_model_demo.ipynb` with Jupyter.
- Use `pyfamsa` to align insertions; aligning insertions is now the default behavior (added the `--unaligned_insertions` flag to restore the old behavior).
- Use `biopython` for data parsing. Many more input file formats are now available, as well as the experimental `indexed_data` flag for large datasets, which allows constant-memory model training.
- Multi-GPU training works now (see the generic sketch after this list). It is mostly beneficial for large datasets with long sequences; it can negatively affect performance otherwise.
- Added the experimental `--use_language_model` flag, which uses a large, pretrained protein language model to guide the MSA and improve alignment accuracy.
- Insertions that were left unaligned by learnMSA can now be aligned retroactively by a third-party aligner, which improves accuracy on the HomFam benchmark by about 2 percentage points.
- Parallel training of multiple models and a reduced memory footprint (train more models in less time).
- Customize learnMSA via code (e.g. by changing the emission type, the prior, or the number of rate matrices used to compute ancestral probabilities).
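For the multi-GPU point above: learnMSA builds on TensorFlow, and data-parallel training of the kind described typically follows TensorFlow's standard distribution pattern. The sketch below only illustrates that generic mechanism with a trivial Keras stand-in model; it is not learnMSA's code and makes no claim about how learnMSA configures its replicas.

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and splits each batch across them.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # trivial stand-in model; variables created in this scope are mirrored on all replicas
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset, ...) would now run data-parallel across the available GPUs
```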
Becker F, Stanke M. learnMSA: learning and aligning large protein families. GigaScience. 2022