Installation:

Multiple installation options are available.

# From GitHub
git clone https://github.com/betteridiot/pygff.git
cd pygff
python3 ./setup.py install

Optionally, you may want to test the program. A very small testing suite has been provided. Note: pytest is required for testing.

If you choose to test the program in your current environment/build:

# If installed from conda-forge or PyPI
pytest --pyargs pygff

# Or, if installed from source
python ./setup.py test

'As-is' Warranty:

This program was written expressly for the purposes of education at the University of Michigan's Department of Computational Medicine & Bioinformatics. It was only intended to work for a specific subset of GFF/GTF files: GFF3 files. No promises are made as to its wider functionality.

Background:

The General Feature Format (GFF) was developed as a way to succinctly represent genomic features (e.g., exons, introns, and genes). A GFF file is a 9-column, tab-delimited plain-text file, optionally gzip-compressed. The 9 columns are as follows:

Column	Content	Description
1	seqid	The ID of the landmark used to establish the coordinate system for the current feature
2	source	The source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature
3	type	The type of the feature (previously called the "method")
4	start	The start coordinate of the feature, given in positive 1-based integer coordinates, relative to the landmark given in column one
5	end	The end coordinate of the feature, given in positive 1-based integer coordinates, relative to the landmark given in column one
6	score	The score of the feature, a floating point number
7	strand	The strand of the feature. '+' for positive strand (relative to the landmark), '-' for minus strand, and '.' for features that are not stranded
8	phase	For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame
9	attributes	A list of feature attributes in the format tag=value

Note: this information was taken, in part, from The Sequence Ontology group's GitHub.
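To make the column layout above concrete, here is a minimal sketch of splitting one GFF3 line into its 9 fields using only the standard library. It does not use pygff; the function name `parse_gff3_line` and the example line are hypothetical. It assumes that "." marks an undefined score or phase and that attributes are separated by semicolons, as in the GFF3 specification.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name.

    Hypothetical helper for illustration; not part of pygff.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    # Coordinates are positive, 1-based integers.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # "." marks an undefined score or phase.
    record["score"] = None if record["score"] == "." else float(record["score"])
    record["phase"] = None if record["phase"] == "." else int(record["phase"])
    # Column 9 holds semicolon-separated tag=value pairs.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attributes[unquote(tag)] = unquote(value)
    record["attributes"] = attributes
    return record


example = "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN"
parsed = parse_gff3_line(example)
print(parsed["seqid"], parsed["start"], parsed["end"], parsed["attributes"])
```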

Description:

pygff was written to provide a helpful interface when handling GFF3 files. It allows the user to easily process the parsed GFF3 content, entry-by-entry, in a lazily generated way.

In addition, pygff builds a GFF3 index on the fly, which allows for pseudo-random access. Because the index is ephemeral by nature, running pygff directly from the command line is not recommended; it is better suited to use as a library inside a longer-running Python program.
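The idea of an on-the-fly index can be sketched as follows. This is not pygff's implementation, and it does not model the `periods` argument; the helper names `build_index` and `fetch_lines` are hypothetical. The sketch scans the data once, records the byte offset of every feature line per seqid, and then seeks straight to matching lines. The index lives only in memory, so it has to be rebuilt in every new process.

```python
import io
from collections import defaultdict


def build_index(handle):
    """Map each seqid to a list of (start, end, byte_offset) tuples."""
    index = defaultdict(list)
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        fields = line.split(b"\t")
        index[fields[0].decode()].append((int(fields[3]), int(fields[4]), offset))
    return index


def fetch_lines(handle, index, seqid, start, end):
    """Yield raw lines of features that overlap [start, end] on seqid."""
    for feat_start, feat_end, offset in index.get(seqid, []):
        if feat_start <= end and feat_end >= start:
            handle.seek(offset)
            yield handle.readline().decode().rstrip("\n")


# In-memory stand-in for an uncompressed GFF3 file opened in binary mode.
data = io.BytesIO(
    b"##gff-version 3\n"
    b"chr1\tdemo\tgene\t100\t200\t.\t+\t.\tID=g1\n"
    b"chr1\tdemo\tgene\t500\t900\t.\t-\t.\tID=g2\n"
    b"chr2\tdemo\tgene\t150\t250\t.\t+\t.\tID=g3\n"
)
index = build_index(data)
hits = list(fetch_lines(data, index, "chr1", 150, 600))
for hit in hits:
    print(hit)
```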

pygff exposes two classes for dealing with GFF3 data:

`pygff.GffFile`:

Main class of the pygff package

Handles opening, iterating over, and closing GFF3 files. Can handle both gzip-compressed and uncompressed GFF3 files.

When iterated, it lazily returns pygff.GffEntry objects. These objects can be compared against each other, and all of their fields can be accessed programmatically.

Args:

filename (str): /path/to/file.gff[.gz]
periods (int): For indexing purposes, the period to determine thresholding (default: 3)

Raises:

TypeError if GFF file is not version 3

It also exposes the pygff.GffFile.fetch(seqid, start, stop) method:

Generator that fetches all GFF entries within a given region.

Can optionally return only entries of a specific GFF feature type (if supplied).

Args:

seqid (str): name of the chromosome or scaffold
start (int): start position of the feature (1-indexed)
end (int): end position of the feature (1-indexed)
type (str): GFF feature type (default: None)

Yields:

(pygff.GffEntry): A given GFF entry from the region of interest
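The behavior described above (lazy iteration, gzip support, and region and type filtering) can be approximated with the standard library alone. The sketch below is an illustration, not pygff's code: `open_gff` and `iter_region` are hypothetical names, it scans the file linearly instead of using an index, and it assumes "within a given region" means that a feature overlaps the closed interval [start, end].

```python
import gzip
import os
import tempfile


def open_gff(path):
    """Open a plain or gzip-compressed file in text mode."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic number
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def iter_region(path, seqid, start, end, feature_type=None):
    """Lazily yield the fields of features on seqid overlapping [start, end]."""
    with open_gff(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments, and blank lines
            fields = line.rstrip("\n").split("\t")
            if fields[0] != seqid:
                continue
            if feature_type is not None and fields[2] != feature_type:
                continue
            if int(fields[3]) <= end and int(fields[4]) >= start:
                yield fields


# Demo on a small temporary gzip-compressed file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gff.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write("##gff-version 3\n")
        out.write("chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=g1\n")
        out.write("chr1\tdemo\texon\t100\t300\t.\t+\t.\tParent=g1\n")
        out.write("chr1\tdemo\texon\t700\t900\t.\t+\t.\tParent=g1\n")
    exons = list(iter_region(demo_path, "chr1", 1, 400, feature_type="exon"))

print(exons)
```

Because `iter_region` is a generator, nothing is read until the caller starts iterating, which keeps memory use flat even for large files.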

`pygff.GffEntry`:

An object that represents a single GFF entry.

This object also supports totally ordered comparison operations (<, <=, ==, !=, >=, >), based on seqid first, then start position, and finally end position.

Attributes:

seqid (str): name of the chromosome or scaffold
source (str): name of the program that generated the feature
type (str): type of feature
start (int): start position of the feature (1-indexed)
end (int): end position of the feature (1-indexed)
score (float): a quality score of the feature
strand (str): either '+' (forward), '-' (reverse), or '.' (not stranded)
phase (int): 0, 1, or 2; the number of bases to remove from the start of the feature to reach the first base of the next codon
attributes (dict): a dictionary of all tag/value pairs
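The comparison rule described above (seqid, then start, then end) can be illustrated with a small stand-in class. `Feature` here is hypothetical and is not `pygff.GffEntry`; it only shows one common way to get all six comparison operators from a single sort key, using `functools.total_ordering`.

```python
from functools import total_ordering


@total_ordering
class Feature:
    """Hypothetical stand-in for a GFF entry; not the pygff.GffEntry class."""

    def __init__(self, seqid, start, end):
        self.seqid = seqid
        self.start = start
        self.end = end

    def _key(self):
        # Compare by seqid first, then start, then end.
        return (self.seqid, self.start, self.end)

    def __eq__(self, other):
        if not isinstance(other, Feature):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, Feature):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        return hash(self._key())

    def __repr__(self):
        return f"Feature({self.seqid!r}, {self.start}, {self.end})"


features = [Feature("chr2", 5, 50), Feature("chr1", 300, 400),
            Feature("chr1", 100, 900), Feature("chr1", 100, 200)]
ordered = sorted(features)
print(ordered)
```

Note that in this sketch seqids are compared as plain strings, so "chr10" sorts before "chr2"; whether pygff orders seqids the same way is not stated here.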

Quickstart

Importing

import pygff

Sequential Iteration

with pygff.GffFile('/path/to/file.gff[.gz]') as gff:
    for entry in gff:
        do_something(entry)

Pseudo-Random Access

gff = pygff.GffFile('/path/to/file.gff[.gz]')
for entry in gff.fetch('chr1', 123040, 128040):
    do_something(entry)

Output

# Open in text mode: print() writes str, which a binary-mode file rejects
with open('outfile.gff', 'w') as outfile:
    with pygff.GffFile('/path/to/file.gff[.gz]') as gff:
        for entry in gff:
            # Some filtering
            print(entry, file=outfile)

Contributing & Code of Conduct:

This project is built on Open Science, Open Source, and Open Minds. To encourage an environment of inclusivity and positivity, please see our Code of Conduct.

If you are interested in contributing to the project, please see the CONTRIBUTING guidelines.
