candiy-lemon

Mine data from the PDB in minutes


Keywords
chemistry, computational, cheminformatics, proteins, structural, biology, computational-chemistry, data-mining, machine-learning, structural-biology
License
BSD-3-Clause
Install
pip install candiy-lemon==0.3.0

Documentation

Lemon: A framework for rapidly mining structural information from the Protein Data Bank


What is Lemon's purpose?

Lemon is a tool for mining features used in downstream structural biology software that deals with 3D macromolecules. It is designed to be fast and flexible, allowing users to quickly query the 3D features of a given collection of structures. To do so, the user writes a workflow function, which is applied to every structure in the PDB. Currently, these workflows can be developed in C++ (typically through lambdas) and Python.

Because the MMTF format parses extremely quickly, Lemon uses it by default. This lets Lemon query the entire Protein Data Bank in under 10 minutes on an 8-core machine. Lemon handles all the threading, compression, and MMTF parsing, leaving the rest up to the user!
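The workflow idea can be illustrated with a minimal sketch. It does not use Lemon's API: `TOY_ARCHIVE`, `count_residue`, and `run_workflow` are hypothetical names, the structures are toy lists of residue names in place of parsed MMTF entries, and a standard-library thread pool stands in for Lemon's own threading.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for parsed structures: PDB ID -> residue names.
# In Lemon, parsed MMTF entries would be supplied by the framework instead.
TOY_ARCHIVE = {
    "1ABC": ["ALA", "GLY", "SAM", "HOH"],
    "2DEF": ["LYS", "HOH"],
    "3GHI": ["SAM", "SAM", "ATP"],
}


def count_residue(pdb_id, residues, target="SAM"):
    """Workflow function: runs once per structure and returns a line of text."""
    return f"{pdb_id}\t{residues.count(target)}\n"


def run_workflow(worker, archive, n_threads=2):
    """Apply `worker` to every entry on a thread pool, in sorted PDB ID order."""
    ids = sorted(archive)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map yields results in the same order as `ids`.
        results = pool.map(lambda pdb_id: worker(pdb_id, archive[pdb_id]), ids)
        return "".join(results)


if __name__ == "__main__":
    print(run_workflow(count_residue, TOY_ARCHIVE), end="")
```

The point of the pattern is that only `count_residue` is specific to the question being asked; the driver that iterates over the archive stays the same for every workflow.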

With these ideas in mind, Lemon's central role is the creation of standardized workflows for mining structural features. Since Lemon handles the rest, these workflows can be reused on any future version of the PDB. We hope the structural biology community can use our software to replace custom, in-house scripts that must be rerun on the ever-growing PDB!

You can read more about Lemon in our publication.

How do I obtain Lemon?

C++ Library

Technically speaking, Lemon is a header-only library. To use Lemon in your own chemfiles-based project, just copy the include/lemon directory into your project and include the file lemon/lemon.hpp. There is no need to link a special library or package.

Lemon is developed to have as few dependencies as possible. You only need a recent compiler that supports C++11. If you plan on building Python support, you will also need a copy of the Python interpreter and the accompanying libraries and header files. All other dependencies are installed for you by the build system.

git clone https://github.com/chopralab/lemon.git
cd lemon
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j 2

Python Module

Pre-built modules for Python 3.5+ are on PyPI under the name candiy-lemon. You can install them with pip using the following command:

python3 -m pip install candiy-lemon
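To confirm that the install succeeded, one option is to ask Python for the installed distribution's version. This is a minimal sketch that assumes Python 3.8 or newer (for `importlib.metadata`); `installed_version` is a hypothetical helper, and only the PyPI name candiy-lemon comes from this README.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("candiy-lemon")
    print(found if found else "candiy-lemon is not installed")
```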

For details on how to use this module, please see the Getting Started page of the documentation.

How does one use Lemon?

The Protein Data Bank is used to test Lemon's capabilities and is the source of the majority of structural biology benchmarking sets. Therefore, we have included a script to download the entire PDB archive. It is recommended to use the latest Hadoop sequence files located here.

Currently, the archive takes ~9 GB of space.

To run Lemon, select a program. For example, to find all the small molecules that interact with SAM, use the following commands:

tar xf full.tar -C /dev/shm/
/path/to/lemon/build/progs/count_sam_small_molecules -w /dev/shm/full -n <number of cores>

The results for this program are printed to stdout.
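Since the results go to stdout, they can be captured from a script. The sketch below assembles the same command line shown above and redirects its output to a file; `build_command` and `run_and_save` are hypothetical helpers, the output filename is arbitrary, and the program path is the placeholder from this README rather than a real location. It assumes Python 3.8 or newer for `shlex.join`.

```python
import os
import shlex
import subprocess


def build_command(program, work_dir, cores=None):
    """Assemble the argument list, using the -w and -n flags shown above."""
    if cores is None:
        # Fall back to a single core if the count cannot be determined.
        cores = os.cpu_count() or 1
    return [program, "-w", work_dir, "-n", str(cores)]


def run_and_save(command, out_path):
    """Run the command and write everything it prints to stdout into out_path."""
    with open(out_path, "w") as out:
        subprocess.run(command, stdout=out, check=True)


if __name__ == "__main__":
    cmd = build_command(
        "/path/to/lemon/build/progs/count_sam_small_molecules",
        "/dev/shm/full",
    )
    # Show the command that would be executed.
    print(shlex.join(cmd))
    # After replacing the placeholder path with a real build directory:
    # run_and_save(cmd, "sam_small_molecules.txt")
```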

Citation

If you find Lemon useful, please cite:

Jonathan Fine, Gaurav Chopra, Lemon: a framework for rapidly mining structural information from the Protein Data Bank, Bioinformatics, btz178, https://doi.org/10.1093/bioinformatics/btz178

Copyright

Lemon is © 2018 Chopra Lab and Purdue University. It was developed by Jonathan Fine and is available as open source under the terms of the BSD 3-Clause License.