A bunch of bioinformatics utilities.

gene prediction, prokaryotes, effectors
pip install biotools==1.2.12





This module is used to align sequences. Currently, only a single alignment algorithm is implemented; it is a hybrid between Needleman-Wunsch and Smith-Waterman and is used to find the subsequence within a larger sequence that best aligns to a reference.

biotools.align.OptimalCTether(reference, translation, extend=1, create=10)

This function will take two sequences: a reference sequence and another protein sequence (translation; usually, this is an open reading frame that has been translated). Needleman-Wunsch alignment will be performed and the substring of translation with the highest identity that begins with a start codon [default: ['ATG']] is reported.

This function returns a dictionary of relevant information from the alignment; specifically, the alignment itself [keys: query, subject], the score [key: score], the length of the alignment [key: length], the length of the substring of translation used [key: sublength], the number of identities [key: identities], and the number of gaps [key: gaps].
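To make the shape of the returned dictionary concrete, here is a minimal sketch that derives the same keys from two already-aligned strings. It is not the library's code: the helper name alignment_summary is hypothetical, and it assumes that query holds the translation, that gaps are written as '-', and that gaps are counted per column rather than per gap opening.

```python
def alignment_summary(query, subject, score):
    """Summarise two gapped alignment strings of equal length."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    pairs = list(zip(query, subject))
    return {
        "query": query,
        "subject": subject,
        "score": score,
        "length": len(pairs),
        # Assumption: 'query' is the translation; count its non-gap residues.
        "sublength": sum(1 for q, _ in pairs if q != "-"),
        # A column is an identity when both residues match and neither is a gap.
        "identities": sum(1 for q, s in pairs if q == s and q != "-"),
        # Assumption: every column containing a gap counts once.
        "gaps": sum(1 for q, s in pairs if q == "-" or s == "-"),
    }
```

For example, alignment_summary("MKT-A", "MKTLA", 10) reports a length of 5 with 4 identities and 1 gap.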


This module is used to create annotation files (currently, only GFF files). The annotations can be arranged in a hierarchy (e.g., genes contain exons, introns, etc.).

biotools.annotation.Annotation(self, ref, src, type, start, end, score, strand, phase, attr, name_token='ID', gff_token='=')

An object to help with reading and writing GFF files.


A module to manage BLAST databases and interface with the BLAST+ standalone program available from NCBI.

biotools.BLAST.Result(self, file)

A class that takes the raw output from BLAST and generates dictionaries from it. Each dictionary includes the alignment, percent identity, gaps, e-value, score, length of subject, length of query, and start and stop positions for both sequences. This class should be used in a for loop like so:

    for res in Result(file_or_data):

The class instance has a single other property, headers, which are the lines in BLAST results before the BLAST hits (e.g., citation info, etc.).

biotools.BLAST.run(db, sfile, mega_blast=False, **kwargs)

Takes a database and a query and runs the appropriate type of BLAST on them. The database can be an existing BLAST database or a fasta/fastq file. If it is a sequence file, this function will look in the places where BLAST would look for an existing database created from that file and use that instead. If there is no such database, this function will make one for you and then use the newly created database with BLAST.

Optional named arguments can currently only be evalue, num_threads, gapopen, or gapextend. They correspond to the BLAST options of the same name.
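As an illustration of how the four named arguments could map onto BLAST+ command-line options, the sketch below builds an argument list without running anything. The helper blast_command and its program parameter are hypothetical; how biotools.BLAST.run actually selects the program and invokes it is not documented here.

```python
def blast_command(program, db, query, **kwargs):
    """Build a BLAST+ argument list from the supported named options."""
    allowed = ("evalue", "num_threads", "gapopen", "gapextend")
    unknown = sorted(set(kwargs) - set(allowed))
    if unknown:
        raise TypeError("unsupported options: " + ", ".join(unknown))
    cmd = [program, "-db", db, "-query", query]
    # Each supported keyword becomes the BLAST+ flag of the same name.
    for name in allowed:
        if name in kwargs:
            cmd += ["-" + name, str(kwargs[name])]
    return cmd
```

The resulting list could be handed to subprocess.run, assuming the BLAST+ executables are installed and on the PATH.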


Needs documentation

biotools.clustal.run(infile, outfile, **kwargs)

Needs documentation


Needs documentation


Creates the complement of a sequence, which can then be reversed by using seq[::-1], if it needs to be reversed. This function accepts either Sequences or strings.
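The complement-then-reverse idiom described above can be sketched with plain strings. This is a stand-alone illustration, not the library's implementation; the names complement and reverse_complement are chosen for the example, and only the unambiguous bases A, C, G, and T are handled.

```python
# Translation table for unambiguous DNA bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Return the complement of a DNA string (not reversed)."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq):
    """Complement the sequence, then reverse it with seq[::-1]."""
    return complement(seq)[::-1]
```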


Needs documentation

biotools.sequence.annotation(seq, source, type, **kwargs)

Creates an Annotation object for the given sequence from a source (e.g., "phytozome7.0") of a particular type (e.g., "gene").

biotools.sequence.chop(seq, length=70)

Yields successive chunks of a sequence, each no more than length characters long. It is meant to be used when printing FASTA files.
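A generator of this kind can be sketched in a few lines. The name chunks is hypothetical and the code works on plain strings; it only illustrates the fixed-width slicing behavior described above.

```python
def chunks(seq, length=70):
    """Yield consecutive slices of seq, each at most `length` characters."""
    for start in range(0, len(seq), length):
        yield seq[start:start + length]
```

Printing each yielded chunk on its own line gives the wrapped sequence body of a FASTA record.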

biotools.sequence.Sequence(self, name, seq, **kwargs)

A wrapper class for sequences.


Needs documentation


Needs documentation


Translate a nucleotide sequence using the standard genetic code. The sequence parameter can be either a string or a Sequence object. Stop codons are denoted with an asterisk (*).
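For reference, translation with the standard genetic code can be sketched as a table lookup over consecutive codons. This is an independent illustration on plain strings; dropping a trailing partial codon and writing unknown codons as 'X' are assumptions of the sketch, not documented behavior of the library.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a nucleotide string; stop codons become '*'."""
    seq = seq.upper().replace("U", "T")
    # Step through complete codons only; a trailing partial codon is ignored.
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    # Assumption: codons with ambiguous bases are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```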


Needs documentation


Needs documentation

biotools.analysis.cluster.run(direc, inputs)

Takes a collection of files generated by gene prediction, creates clusters based on the genes that have homology to those predicted genes, and creates new fasta files in the clusters subdirectory under the given directory, separated according to whether they are nucleotide or amino acid sequences. These new fasta files are then used to create clustalw alignments of the genes if more than one sequence exists in the fasta file.


This is a pretty simple JSON-like parser. Specifically, it can load Python-like objects, lists, and other literals, i.e., the sort of stuff you'd get if you dumped the string representation of some data into a file.

The real difference is that you must specify a variable name, e.g.:

my_stuff = { ... }

These variable names don't need to be on a new line or anything like that; you should be able to omit any and all whitespace. The result of a successful parse is a dictionary:

{'my_stuff': { ... }}

This function really only works for None, True, False, numbers, strings, dictionaries, and lists.
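The same name-to-literal mapping can be approximated with the standard library's ast module. This is a sketch, not the parser described above: the function name load_assignments is hypothetical, and because it relies on Python's own grammar, assignments must be separated by newlines or semicolons, which is stricter than the whitespace-free input this module accepts.

```python
import ast


def load_assignments(text):
    """Parse 'name = literal' statements into a {name: value} dictionary."""
    result = {}
    for node in ast.parse(text).body:
        # Accept only simple assignments to a single plain variable name.
        if (
            not isinstance(node, ast.Assign)
            or len(node.targets) != 1
            or not isinstance(node.targets[0], ast.Name)
        ):
            raise ValueError("expected statements of the form 'name = literal'")
        # literal_eval evaluates literals only; other expressions raise ValueError.
        result[node.targets[0].id] = ast.literal_eval(node.value)
    return result
```

For example, load_assignments("my_stuff = {'a': [1, None]}") returns {'my_stuff': {'a': [1, None]}}.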


Needs documentation


Needs documentation


Needs documentation


Prints the usage.


Parses pargs and sets global variables to be accessible to other modules.

These variables are:

  • args


Needs documentation

biotools.analysis.plot.axes(bottom, side, bound, fig, **kwargs)

Needs documentation

biotools.analysis.plot.draw(x, y, ax, color, **kwargs)

Needs documentation

biotools.analysis.plot.models(starts, ends, counts, bound, ax, **kwargs)

Needs documentation

biotools.analysis.plot.plot(plotdata, directory, bottom=True, side=True, legend=True, save=True, filename='untitled.pdf', upperbound=0.05, factor=21, fig=<matplotlib.figure.Figure object at 0x1021f8610>, **kwargs)

Needs documentation

biotools.analysis.plot.report(ntvar, aavar, lnt, laa)

Needs documentation

biotools.analysis.plot.smoothed(unsmoothed, factor)

Needs documentation


Needs documentation

biotools.analysis.predict.GeneFromBLAST(db, sequences, pref, names)

BLASTs database against sequences and, for those results that pass the length and percent identity requirements, attempts to locate the full gene that corresponds to that BLAST hit. Genes that are found are saved in the subdirectory sequences under the given directory, divided depending on whether the sequence is amino acid or nucleotide.


Scans both strands of the given sequence and yields the longest subsequence that starts with a start codon and contains no stop codon other than the final codon.
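A two-strand open reading frame scan of this kind can be sketched as follows. It is an illustration only: the function name longest_orf is hypothetical, it returns a single string instead of yielding, it ignores candidates that lack a terminating stop codon, and the default start and stop codons are taken from the prok-geneseek defaults (ATG; TAG, TAA, TGA).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def longest_orf(seq, starts=("ATG",), stops=("TAG", "TAA", "TGA")):
    """Return the longest start-to-stop ORF found on either strand."""
    seq = seq.upper()
    best = ""
    # Scan the forward strand, then its reverse complement.
    for strand in (seq, seq.translate(_COMPLEMENT)[::-1]):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon in starts:
                    start = i  # first start codon opens a candidate ORF
                elif start is not None and codon in stops:
                    orf = strand[start:i + 3]  # include the stop codon
                    if len(orf) > len(best):
                        best = orf
                    start = None
    return best
```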

biotools.analysis.predict.run(subject, query, prefix, names)

Needs documentation


Needs documentation


Needs documentation

biotools.analysis.renamer.rename(direc, db, files)

This isn't really for bioinformatics; it is more for the pipeline, to rename the files generated by cluster.py with a little human interaction.


Needs documentation

biotools.analysis.report.plot(plotdata, directory, bottom=True, side=True, legend=True, save=True, filename='untitled.pdf', upperbound=0.05, factor=21, fig=<matplotlib.figure.Figure object at 0x10225f8d0>, **kwargs)

Needs documentation

biotools.analysis.report.report(plotdata, **kwargs)

Needs documentation


Needs documentation

biotools.analysis.run.run(infile, strains)

Run several instances of genepredict.run at once.


Needs documentation


Takes a clustalw alignment and returns a dictionary of data relevant to plotting the sequence variance for the sequences in the given clustalw alignment. These data are:

  • var: the measure of sequence variation,
  • starts: the starting positions for each gene model in amino acids,
  • ends: the ending positions for each gene model in amino acids, and
  • count: the number of sequences with a particular gene model.

The values given in starts, ends, and counts are sorted so that the nth element in starts corresponds to the nth value in ends and the nth value in counts.

biotools.analysis.variance.var(strain, fmt)

Returns plot data and metadata for plotting later on in the pipeline.


A module for reading and writing to sequence and annotation files. Currently supported file types are: FASTA, FASTQ, CLUSTAL alignments, and GFF3 files.


Methods for manipulating clustalw alignment files.


Needs documentation


Needs documentation


Needs documentation


Functions for manipulating FASTA files.


Probe a file to determine whether or not it is a FASTA file. That is, the first non-empty line should begin with a greater-than sign ('>'). If that line does not begin with a greater-than sign, we conclude that it is not a FASTA file and return False; otherwise, we return a dictionary with information relevant to the FASTA file type.


Read sequences in FASTA format; identifiers (names and definition lines) are on lines that begin with greater-than signs ('>') and the sequence is on the lines between them. This function is a generator that yields Sequence objects.
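The reading logic can be sketched as a small generator. This is not the library's reader: the name read_fasta is hypothetical, and it yields plain (name, defline, sequence) tuples rather than Sequence objects.

```python
import io


def read_fasta(handle):
    """Yield (name, defline, sequence) tuples from a FASTA file handle."""
    name, defline, parts = None, "", []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, defline, "".join(parts)
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            defline = header[1] if len(header) > 1 else ""
            parts = []
        elif name is not None:
            parts.append(line)
    if name is not None:
        yield name, defline, "".join(parts)


# Usage with an in-memory file:
records = list(read_fasta(io.StringIO(">seq1 test gene\nATG\nCCC\n>seq2\nGGG\n")))
```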

biotools.IO.fasta.write(fh, s)

Write sequences in FASTA format, i.e.,

>name defline
sequence ...

Sequences are wrapped to 70 characters by default.


Functions for manipulating FASTQ files.


Probe a file to determine whether or not it is a FASTQ file. That is, the first non-empty line should begin with an at symbol ('@') and the 3rd line following that first non-empty line should contain no character with ordinal value less than 32. If none of the characters have ordinal value less than 64, then the file is guessed to be encoded in Phred64; otherwise, it is encoded in Phred32. This function will return False if the file is not in FASTQ format and will return a dictionary with the phred score and type ('fastq') if the file is FASTQ.
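The probing heuristic described above can be sketched as follows. The function name probe_fastq and the dictionary keys 'type' and 'phred' are assumptions made for the example. Note that any heuristic based on a single quality line can misclassify a file whose first read happens to contain only high-quality scores.

```python
import io


def probe_fastq(handle):
    """Guess whether a handle holds FASTQ data and which Phred offset it uses."""
    record = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not record and not line.strip():
            continue  # skip leading blank lines
        record.append(line)
        if len(record) == 4:
            break
    if len(record) < 4 or not record[0].startswith("@"):
        return False
    quality = record[3]
    # Control characters cannot be quality scores.
    if not quality or any(ord(ch) < 32 for ch in quality):
        return False
    # All characters at or above 64 suggests the 64 offset; otherwise use 32.
    phred = 64 if all(ord(ch) >= 64 for ch in quality) else 32
    return {"type": "fastq", "phred": phred}
```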


Read sequences in FASTQ format; identifiers are on lines that begin with at symbols ('@'), the sequence follows on the next line, then a line that begins with a plus sign ('+'), and finally the quality scores on the subsequent line. Quality scores are encoded in Phred format, the type of which (either 32 or 64) is determined when the file is probed for opening. The scores are decoded into a list of integers. This function is a generator that yields Sequence objects.
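Decoding the quality line amounts to subtracting the offset from each character's ordinal value. The sketch below assumes strict four-line records and yields plain tuples instead of Sequence objects; the name read_fastq and its offset parameter are hypothetical.

```python
import io


def read_fastq(handle, offset=32):
    """Yield (name, sequence, scores) tuples from four-line FASTQ records."""
    lines = (line.rstrip("\r\n") for line in handle if line.strip())
    for header in lines:
        sequence = next(lines, None)
        separator = next(lines, None)
        quality = next(lines, None)
        if (
            not header.startswith("@")
            or quality is None
            or not separator.startswith("+")
        ):
            raise ValueError("malformed FASTQ record")
        # Decode each quality character into an integer score.
        scores = [ord(ch) - offset for ch in quality]
        yield header[1:], sequence, scores
```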

biotools.IO.fastq.write(fh, s)

Write sequences in FASTQ format, i.e.,

sequence ...
quality scores


Needs documentation


Needs documentation


Needs documentation


Needs documentation


Needs documentation

biotools.IO.gff.write(fh, a)

Needs documentation

biotools.IO.IOBase(self, name, mode)

Generic IO class for sequence files.


Close the file handle.

biotools.IO.IOBase.format(self, fmt)

Forces a file to be parsed as a particular format. By default, the values for fmt can be any recognized format.


This module is home to the IOManager class, which manages the various input and output formats (specifically, FASTA, FASTQ, CLUSTAL alignments, and GFF files, currently).

biotools.IO.manager.IOManager(self, methods=None)

A class used by the IOBase class to manage the various input and output methods for the different file types. Additional file types can be added to the manager by using

manager[format] = methods

In the above example, methods is a dictionary with keys rhook, read, whook, write, and probe. Each of the values must be a callable object:

  • rhook => takes a file handle opened for reading; called before reading of the file has begun,
  • whook => takes a file handle opened for writing; called before writing to the file has begun,
  • read => takes a file handle opened for reading; should be a generator that yields entries,
  • write => takes a file handle opened for writing and a single entry; writes the entry to the file,
  • probe => takes a file handle opened for reading; returns a dictionary of attributes to be applied to the IOBase instance.

This class behaves similarly to a dictionary, except that the get method will default to the default method (which does nothing) if no truthy second parameter is passed.
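The dictionary-with-fallback behavior can be sketched with a small dict subclass. This is an illustration of the pattern, not the IOManager source: the class name FormatRegistry is hypothetical, and validating the five keys on assignment is an assumption of the sketch.

```python
class FormatRegistry(dict):
    """Map format names to dictionaries of I/O callables, with a no-op fallback."""

    HOOKS = ("rhook", "read", "whook", "write", "probe")

    @staticmethod
    def _noop(*args, **kwargs):
        """Default method: accept anything and do nothing."""
        return None

    def __init__(self):
        super().__init__()
        self.default = {hook: self._noop for hook in self.HOOKS}

    def __setitem__(self, fmt, methods):
        # Assumption: reject registrations that lack any of the five callables.
        missing = [h for h in self.HOOKS if not callable(methods.get(h))]
        if missing:
            raise ValueError("missing callables: " + ", ".join(missing))
        super().__setitem__(fmt, methods)

    def get(self, key, default=None):
        # Fall back to the do-nothing methods unless a truthy default is given.
        return super().get(key, default or self.default)
```

A format is registered with registry['fasta'] = methods, and registry.get('unknown') then returns the do-nothing defaults instead of None.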

biotools.IO.manager.IOManager.get(self, key, default=None)

Try to get a set of methods via format (e.g., 'fasta') or fall-back to the default methods (which do nothing).

biotools.IO.open(filename, mode='r')

Open a file for parsing or creation. Returns either a Reader or Writer object, depending on the open mode.

biotools.IO.Reader(self, filename, mode='r')

A class that wraps IOBase and restricts the ability to write.


Reads a single entry in the file and returns it.

biotools.IO.Reader.read(self, n=None)

If n is provided, the next (up to) n entries are parsed and returned. Otherwise, all remaining entries are parsed and returned.
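The "up to n, or everything that remains" behavior maps naturally onto itertools.islice over an entry generator. The class below is a hypothetical stand-in for illustration; the name EntryReader, the method read_one, and returning a list are assumptions, not the Reader implementation.

```python
from itertools import islice


class EntryReader:
    """Wrap an iterable of parsed entries and hand them out in batches."""

    def __init__(self, entries):
        self._entries = iter(entries)

    def read_one(self):
        """Return the next single entry (raises StopIteration when exhausted)."""
        return next(self._entries)

    def read(self, n=None):
        """Return up to n entries, or all remaining entries when n is None."""
        return list(islice(self._entries, n))
```

Because the wrapper keeps a single iterator, successive calls continue where the previous call stopped.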

biotools.IO.Writer(self, filename, mode='w')

A class that wraps IOBase and restricts the ability to read.

biotools.IO.Writer.write(self, sequence)

Writes sequence as the correct format to the file.


Usage: prok-geneseek [options] <database> <sequences ...>

  -h, --help            show this help message and exit
  -S START, --start=START
                        define a start codon [default: -S ATG]
  -E STOP, --stop=STOP  define a stop codon [default: -E TAG -E TAA -E TGA]
  -j THREADS, --threads=THREADS
                        number of threads [default: 16]
  -p PROCESSES, --processes=PROCESSES
                        number of parallel processes to run [default: 2]
  -e EVALUE, --evalue=EVALUE
                        maximum e-value [default: 1e-30]
  -I IDENTITY, --identity=IDENTITY
                        minimum percent identity [default: 0.45]
                        allowable relative error in hit length [default: 0.2]
  -O bases, --orflen=bases
                        minimum allowable length for ORFs [default: 300]
  -d DIRECTORY, --directory=DIRECTORY
                        set working directory [default: current]
  -P PLOTTER, --plotter=PLOTTER
                        plotting module [default: biotools.analysis.plot]
  -v, --verbose         print debug messages [default: False]
  --no-plots            suppress the drawing of plots [default: False]
  --no-predict          don't predict genes, instead treat the input files as
                        predicted genes [default: False]
  --no-cluster          don't cluster the sequences, instead treat the input
                        files as alignments [default: False]
  --no-rename           don't rename the fasta and clustal files [default:
  --no-reports          don't generate files for variance data [default:
  --no-calculation      don't calculate sequence variance [default: False]


Usage: grepseq [options] <pattern> <files ...>

  -h, --help            show this help message and exit
  -c, --count           Suppress normal output; instead print a count of
                        matching lines for each input file. With the -v,
                        --invert-match option (see below), count non-matching
  -H, --with-filename   Print the filename for each match.
  -i, --ignore-case     Ignore case distinctions in both the pattern and
                        input files.
  -m NUM, --max-count=NUM
                        Stop reading a file after NUM matching lines. When
                        the -c or --count option is also used, grepseq does
                        not output a count greater than NUM. When the -v or
                        --invert-match option is also used, grep stops after
                        outputting NUM non-matching lines.
  -N, --names-only      Search only sequence names. Cannot be used with -S.
  -S, --sequences-only  Search only sequences. Cannot be used with -N.
  -v, --invert-match    Invert the sense of matching, to select non-matching