gb-io

A Python interface to gb-io, a fast GenBank parser and serializer written in Rust.


Keywords
genbank, parser, sequence, record
License
MIT
Install
pip install gb-io==0.3.3


🧬🏦 gb-io.py


🗺️ Overview

gb-io.py is a Python package that provides an interface to gb-io, a very fast GenBank format parser implemented in Rust. It can reach much higher speed than the Biopython or the scikit-bio parsers.

This library has no external dependencies and is available for all modern Python versions (3.7+).

To improve performance, the library implements a copy-on-access pattern, so that data is only copied to the Python heap when it is actually accessed, rather than on object creation. For instance, if the consumer of the parser only requires the GenBank features and not the record sequence, the sequence will not be copied to a Python bytes object.
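The copy-on-access idea can be illustrated with a small pure-Python sketch. This is not the library's implementation: `LazyRecord`, its `copies` counter, and the caching of the copied value are all hypothetical, chosen only to show that the copy is deferred from object creation to first attribute access.

```python
class LazyRecord:
    """Toy stand-in for a record whose data lives outside the Python heap."""

    def __init__(self, native_sequence):
        # Pretend this buffer is owned by native (non-Python) code.
        self._native_sequence = native_sequence
        self._cached = None
        self.copies = 0  # how many times the data has been copied

    @property
    def sequence(self):
        # The copy to a Python bytes object happens here, on first access,
        # and not in __init__. Caching the result is a choice of this sketch.
        if self._cached is None:
            self._cached = bytes(self._native_sequence)
            self.copies += 1
        return self._cached


record = LazyRecord(bytearray(b"ATGCATGC"))
copies_at_creation = record.copies  # nothing has been copied yet
prefix = record.sequence[:4]        # first access triggers the copy
again = record.sequence             # later accesses reuse the cached copy
print(copies_at_creation, record.copies, prefix)
```

A consumer that never touches `sequence` never pays for the copy, which is the situation described above for code that only reads features.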

🔧 Installing

Install the gb-io package directly from PyPI, which hosts pre-compiled wheels that can be installed with pip:

$ pip install gb-io

Wheels are provided for common platforms, such as x86-64 Linux, Windows and macOS, as well as Aarch64 Linux and macOS. If no wheel is available, the source distribution will be downloaded, and a local copy of the Rust compiler will be downloaded to build the package, unless one is already installed on the host machine.

📖 Documentation

A complete API reference can be found in the online documentation, or directly from the command line using pydoc:

$ pydoc gb_io

💡 Usage

Use the gb_io.load function to obtain a list of all GenBank records in a file:

records = gb_io.load("tests/data/AY048670.1.gb")

Reading from a file-like object is supported as well, both in text and binary mode:

with open("tests/data/AY048670.1.gb") as file:
    records = gb_io.load(file)

It is also possible to iterate over the records in a file without loading the entire file contents into memory by using the gb_io.iter function, which returns an iterator instead of a list:

for record in gb_io.iter("tests/data/AY048670.1.gb"):
    print(record.name, record.sequence[:10])
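To make the difference between a list and an iterator concrete, here is a minimal standard-library sketch of streaming records from a handle. It is not how gb-io parses files: `iter_raw_records` is a hypothetical helper that only relies on the GenBank convention that a line containing `//` ends a record, and the two sample records are abbreviated toy data rather than valid GenBank entries.

```python
import io


def iter_raw_records(handle):
    """Yield the raw text of each record, one at a time, from a text handle."""
    lines = []
    for line in handle:
        lines.append(line)
        # A line containing only "//" terminates a GenBank record.
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []


# Toy input: two abbreviated records held in memory.
sample = io.StringIO(
    "LOCUS       REC1\nORIGIN\n        1 atgc\n//\n"
    "LOCUS       REC2\nORIGIN\n        1 ggcc\n//\n"
)

stream = iter_raw_records(sample)
first = next(stream)  # only the first record has been read at this point
rest = list(stream)   # consuming the iterator reads the remaining records
print(len(first.splitlines()), len(rest))
```

Because the generator holds at most one record's lines at a time, memory use depends on the largest record rather than on the size of the file.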

You can use the gb_io.dump function to write one or more records to a file (given either as a path or as a file-like handle):

with open("tests/data/AY048670.1.gb", "wb") as file:
    gb_io.dump(records, file)

📝 Example

The following small script will extract all the CDS features from a GenBank file, and write them in FASTA format to an output file:

import gb_io

with open("tests/data/AY048670.1.faa", "w") as dst:
    for record in gb_io.iter("tests/data/AY048670.1.gb"):
        for feature in filter(lambda feat: feat.type == "CDS", record.features):
            qualifiers = feature.qualifiers.to_dict()
            dst.write(">{}\n".format(qualifiers["locus_tag"][0]))
            dst.write("{}\n".format(qualifiers["translation"][0]))
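The script above indexes `locus_tag` and `translation` directly, which raises a `KeyError` for a CDS that lacks either qualifier. A more defensive formatting step could look like the sketch below. It assumes only that the qualifiers are available as a plain dict mapping each key to a list of values, as in the script; `cds_to_fasta` and its `width` parameter are hypothetical names, and the 60-column wrapping is a common convention rather than a requirement.

```python
def cds_to_fasta(qualifiers, width=60):
    """Format one CDS as a FASTA entry, or return None if data is missing."""
    locus_tags = qualifiers.get("locus_tag")
    translations = qualifiers.get("translation")
    # Skip features without a locus tag or a translation.
    if not locus_tags or not translations:
        return None
    protein = translations[0]
    # Wrap the protein sequence into fixed-width lines.
    chunks = [protein[i:i + width] for i in range(0, len(protein), width)]
    return ">{}\n{}\n".format(locus_tags[0], "\n".join(chunks))


entry = cds_to_fasta(
    {"locus_tag": ["geneA"], "translation": ["MKTAYIAKQR"]}, width=4
)
skipped = cds_to_fasta({"locus_tag": ["geneB"]})
print(entry, skipped)
```

In the extraction loop, the result would be written to the output file only when it is not `None`.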

Compared to similar implementations of this script using Bio.SeqIO.parse, Bio.GenBank.parse and Bio.GenBank.Scanner.GenBankScanner.parse_cds_features, the performance is as follows:

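|               | gb_io.iter | GenBankScanner | GenBank.parse | SeqIO.parse |
|---------------|------------|----------------|---------------|-------------|
| Time (s)      | 2.264      | 7.982          | 15.259        | 19.351      |
| Speed (MiB/s) | 136.5      | 37.1           | 20.5          | 16.2        |
| Speedup       | x8.55      | x2.42          | x1.27         | -           |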
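The Speedup row is consistent with dividing the SeqIO.parse time by each parser's time. The short sketch below recomputes it from the Time row. The baseline choice is inferred from the table rather than stated in it, and the variable names are illustrative.

```python
# Wall-clock times in seconds, copied from the Time row of the table.
times = {
    "gb_io.iter": 2.264,
    "GenBankScanner": 7.982,
    "GenBank.parse": 15.259,
    "SeqIO.parse": 19.351,
}

# Assumption: speedups are relative to the slowest parser, SeqIO.parse.
baseline = times["SeqIO.parse"]
speedups = {name: round(baseline / t, 2) for name, t in times.items()}

for name, factor in speedups.items():
    print("{:<15} x{:.2f}".format(name, factor))
```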

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

⚖️ License

This library is provided under the MIT License. The gb-io Rust crate was written by David Leslie and is licensed under the terms of the MIT License. This package vendors the source of several additional packages that are licensed under the Apache-2.0, MIT or BSD-3-Clause licenses; see the license file distributed with the source copy of each vendored dependency for more information.

This project is in no way affiliated with, sponsored by, or otherwise endorsed by the original gb-io authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.