Installation:
There are multiple options available for installation.
# From GitHub
git clone https://github.com/betteridiot/pygff.git
cd pygff
python3 ./setup.py install
Optionally, you may want to test the program. A very small testing suite has been provided. Note: pytest is required for testing.
If you choose to test the program in your current environment/build:
# If installed from conda-forge or the PyPI
pytest --pyargs pygff
# Or, if installed from source
python ./setup.py test
'As-is' Warranty:
This program was written expressly for the purposes of education at the University of Michigan's Department of Computational Medicine & Bioinformatics. It was only intended to work for a specific subset of GFF/GTF files: GFF3 files. No promises are made as to its wider functionality.
Background:
The General Feature Format (GFF) was developed as a way to succinctly represent genomic features (e.g. exons, introns, genes, etc.). A GFF file is a 9-column, tab-delimited plain-text file, optionally gzip-compressed. The 9 columns are as follows:
Column | Content | Description |
---|---|---|
1 | seqid | The ID of the landmark used to establish the coordinate system for the current feature |
2 | source | The source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature |
3 | type | The type of the feature (previously called the "method") |
4 | start | The start coordinate of the feature, given in positive 1-based integer coordinates, relative to the landmark given in column one |
5 | end | The end coordinate of the feature, given in positive 1-based integer coordinates, relative to the landmark given in column one |
6 | score | The score of the feature, a floating point number |
7 | strand | The strand of the feature. '+' for positive strand (relative to the landmark), '-' for minus strand, and '.' for features that are not stranded |
8 | phase | For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame |
9 | attributes | A list of feature attributes in the format tag=value |
Note: this information was taken, in part, from The Sequence Ontology group's GitHub.
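As a rough illustration of the nine columns above, the sketch below splits a single GFF3 line into named fields. It is not pygff's parser: parse_gff3_line, GFF3_COLUMNS, and the example line are hypothetical names and data made up for this illustration. It assumes that undefined values are written as '.' and that attribute tags and values may be percent-encoded, as the GFF3 specification describes.

```python
from urllib.parse import unquote

# Column names in file order (hypothetical helper constant, not part of pygff).
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))

    # Coordinates are 1-based integers.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    # A '.' marks an undefined score or phase.
    record["score"] = None if record["score"] == "." else float(record["score"])
    record["phase"] = None if record["phase"] == "." else int(record["phase"])

    # Column 9 is a ';'-separated list of tag=value pairs.
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if pair:
                tag, _, value = pair.partition("=")
                attributes[unquote(tag)] = unquote(value)
    record["attributes"] = attributes
    return record


# Made-up example line, for illustration only.
example = "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=EDEN"
record = parse_gff3_line(example)
```

In this sketch the numeric columns come back as int or float (or None when undefined), and the attributes column comes back as a plain dictionary.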
Description:
pygff was written to provide a helpful interface when handling GFF3 files. It allows the user to easily process the parsed GFF3 content, entry-by-entry, in a lazily generated way.
In addition, pygff can build a GFF3 index on the fly, which allows for pseudo-random access. Because the index is ephemeral (it exists only for the lifetime of the running process), running this program directly from the command line is not recommended.
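To make the idea of an ephemeral, in-memory index concrete, here is a minimal sketch of one possible approach: record the byte offset at which each seqid first appears, then seek to that offset later. This is not how pygff builds its index (the periods parameter, for instance, is not modeled). The function names and the demo data are hypothetical, and the sketch assumes an uncompressed file whose features are grouped by seqid with no comment lines in between.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each seqid to the byte offset of its first feature line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            # Skip directives, comments, and blank lines.
            if line.startswith(b"#") or not line.strip():
                continue
            seqid = line.split(b"\t", 1)[0].decode()
            index.setdefault(seqid, offset)
    return index


def read_seqid(path, index, seqid):
    """Yield the consecutive lines of one seqid by seeking to its offset."""
    with open(path, "rb") as handle:
        handle.seek(index[seqid])
        for line in handle:
            if line.split(b"\t", 1)[0].decode() != seqid:
                break
            yield line.decode().rstrip("\r\n")


# Made-up demo file, for illustration only.
demo = ("##gff-version 3\n"
        "chr1\tdemo\tgene\t100\t200\t.\t+\t.\tID=g1\n"
        "chr2\tdemo\tgene\t50\t80\t.\t-\t.\tID=g2\n"
        "chr2\tdemo\tgene\t90\t120\t.\t-\t.\tID=g3\n")
with tempfile.NamedTemporaryFile("w", suffix=".gff", delete=False) as tmp:
    tmp.write(demo)

index = build_offset_index(tmp.name)
chr2_lines = list(read_seqid(tmp.name, index, "chr2"))
os.remove(tmp.name)
```

Because the index lives in a dictionary, it disappears when the process exits, which is the sense in which such an index is ephemeral.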
pygff exposes two classes for dealing with GFF3 data:
pygff.GffFile:
Main class of the pygff package.
Handles the opening, iterating, and closing of GFF3 files. Can handle both gzip-compressed and uncompressed GFF3 files.
When iterated, it lazily returns pygff.GffEntry objects. These objects can be compared against each other and programmatically accessed for all traits.
Args:
- filename (str): /path/to/file.gff[.gz]
- periods (int): For indexing purposes, the period to determine thresholding (default: 3)
Raises:
- TypeError if GFF file is not version 3
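The two behaviours described above (reading either compressed or uncompressed input, and rejecting files that are not version 3) could be sketched as follows. This is an assumption-laden illustration rather than pygff's code: open_maybe_gzip and require_gff3 are hypothetical helpers, gzip detection by magic bytes is one possible choice, and the check relies on the GFF3 convention that a file starts with a "##gff-version 3" directive.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed file in text mode.

    Detection uses the two gzip magic bytes (an assumption of this sketch;
    checking the '.gz' extension would be another option).
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def require_gff3(first_line):
    """Raise TypeError unless the header line declares GFF version 3."""
    parts = first_line.split()
    is_version_3 = (
        len(parts) >= 2
        and parts[0] == "##gff-version"
        and parts[1].split(".")[0] == "3"
    )
    if not is_version_3:
        raise TypeError(f"not a GFF3 header: {first_line!r}")
```

A caller would typically read the first line from the handle returned by open_maybe_gzip and pass it to require_gff3 before iterating over the remaining lines.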
And exposes the pygff.GffFile.fetch(seqid, start, stop) method:
Generator that fetches all GFF entries within a given region. It can optionally return only entries of a specific GFF feature type (if supplied).
Args:
- seqid (str): name of the chromosome or scaffold
- start (int): start position of the region of interest (1-indexed)
- end (int): end position of the region of interest (1-indexed)
- type (str): GFF feature type (default: None)
Yields:
- (pygff.GffEntry): A given GFF entry from the region of interest
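The region and type filtering that fetch describes can be illustrated with a small standalone generator. This is only a sketch: Feature and fetch_region are hypothetical names, the data is made up, and the code assumes that "within a given region" means any overlap with the closed interval [start, end]. pygff itself may instead require full containment, and it uses its index rather than a linear scan.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed entry; not pygff.GffEntry.
Feature = namedtuple("Feature", "seqid type start end")


def fetch_region(features, seqid, start, end, feature_type=None):
    """Yield features on seqid that overlap [start, end] (1-based, inclusive)."""
    for feature in features:
        if feature.seqid != seqid:
            continue
        # No overlap if the feature ends before the region or starts after it.
        if feature.end < start or feature.start > end:
            continue
        if feature_type is not None and feature.type != feature_type:
            continue
        yield feature


# Made-up features, for illustration only.
features = [
    Feature("chr1", "gene", 100, 200),
    Feature("chr1", "exon", 100, 150),
    Feature("chr1", "gene", 500, 900),
    Feature("chr2", "gene", 100, 200),
]
all_hits = list(fetch_region(features, "chr1", 150, 400))
gene_hits = list(fetch_region(features, "chr1", 150, 400, feature_type="gene"))
```

Like the method described above, the sketch is a generator, so nothing is scanned until the caller starts iterating.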
pygff.GffEntry:
An object that represents a single GFF entry.
This object supports totally ordered comparison operations (<, <=, ==, !=, >=, >) based on seqid first, then start position, and finally end position.
Attributes:
- seqid (str): name of the chromosome or scaffold
- source (str): name of the program that generated the feature
- type (str): type of feature
- start (int): start position of the feature (1-indexed)
- end (int): end position of the feature (1-indexed)
- score (float): a quality score of the feature
- strand (str): either '+' (forward), '-' (reverse), or '.' (not stranded)
- phase (int): 0, 1, or 2; the number of bases to skip from the start of the feature to reach the first base of the next codon
- attributes (dict): a dictionary of all tag/value pairs
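The comparison order described for pygff.GffEntry (seqid, then start, then end) can be reproduced with a tuple key. The sketch below uses a hypothetical Region class rather than the real GffEntry, and it assumes seqids compare as plain strings, so "chr10" sorts before "chr2".

```python
from dataclasses import dataclass
from functools import total_ordering


@total_ordering
@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for an entry; only the ordering fields are kept."""
    seqid: str
    start: int
    end: int

    def _key(self):
        # Compare by seqid first, then start, then end.
        return (self.seqid, self.start, self.end)

    def __lt__(self, other):
        if not isinstance(other, Region):
            return NotImplemented
        return self._key() < other._key()


# Made-up regions, for illustration only.
regions = [Region("chr2", 5, 10), Region("chr1", 100, 200), Region("chr1", 100, 150)]
ordered = sorted(regions)
```

Here the dataclass supplies ==, and functools.total_ordering derives <=, >, and >= from the single __lt__ method.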
Quickstart
Importing
import pygff
Sequential Iteration
with pygff.GffFile('/path/to/file.gff[.gz]') as gff:
    for entry in gff:
        do_something(entry)
Pseudo-Random Access
gff = pygff.GffFile('/path/to/file.gff[.gz]')
for entry in gff.fetch('chr1', 123040, 128040):
    do_something(entry)
Output
# Open in text mode ('w'): print() writes str and fails on a binary handle
with open('outfile.gff', 'w') as outfile:
    with pygff.GffFile('/path/to/file.gff[.gz]') as gff:
        for entry in gff:
            # Some filtering
            print(entry, file=outfile)
Contributing & Code of Conduct:
This project is built on Open Science, Open Source, and Open Minds. To encourage an environment of inclusivity and positivity, please see our Code of Conduct.
If you are interested in contributing to the project, please see the CONTRIBUTING guidelines.