polymera

Representing ambiguous sequences written with complement alphabets.


Keywords: sequence
License: MIT
Install: pip install polymera==0.1.2


Polymera

Work in progress!

Polymera is a Python package for representing ambiguous sequences. An ambiguous sequence has a number of possible letters (symbols, elements) at each position. Additionally, Polymera can model sequences written with complement alphabets. Each letter of a complement alphabet can form a pair with specific letters, their complements. A special type is the exact sequence, which has only one letter at each position.

These sequences can describe linear polymers, for example DNA, that can pair with a complement polymer.

Polymera is a genus of crane fly.

Details

The Polymer class consists of the Sequence and the Alphabet classes. The alphabet describes the relations between the letters, which can be complement relations or relations of other types. Polymera has built-in alphabets for nucleic acids (DNA) and proteins.

Representation in writing

The sequence can contain an arbitrary set of letters. For example, in the case of DNA, xenonucleotides and methylated nucleotides are represented by X and numbers: AGXCTGXGTGTA55GTAGT66.

A sequence with choice ambiguity is written as GCG|A,G|TC,GG. A segment separator character, here | (vertical bar), divides the sequence into segments, and the choices within a segment are separated by another character, here , (comma). The above can mean one of 4 (=1*2*2) strings:

GCGATC
GCGAGG
GCGGTC
GCGGGG

The choices can span multiple positions with multiletter choices (GCG|ATC,AGA|TC,GT|AGCA) and can contain deletions (indels), marked with - (hyphen): GTAGTG|AT,-T|TAA. Note that |AATCCGTCAA| is not equal to |AA|TCCGTC|AA|, because the segment boundaries specify the full subsequence that has to exist, as is, in the sequence.
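The expansion described above can be sketched in a few lines of plain Python. This is only an illustration of the notation, not Polymera's implementation: the function name expand and its parameters are hypothetical, and it assumes single-character letters.

```python
from itertools import product


def expand(notation, segment_sep="|", choice_sep=",", deletion="-"):
    """Return every exact string encoded by a choice-ambiguity notation.

    Hypothetical helper for illustration; not part of the Polymera API.
    """
    # Split into segments, then split each segment into its choices.
    segments = [seg.split(choice_sep) for seg in notation.split(segment_sep)]
    # Pick one choice per segment and join them; deletion markers add nothing.
    return ["".join(combo).replace(deletion, "") for combo in product(*segments)]


print(expand("GCG|A,G|TC,GG"))
# ['GCGATC', 'GCGAGG', 'GCGGTC', 'GCGGGG']
print(expand("GTAGTG|AT,-T|TAA"))
# ['GTAGTGATTAA', 'GTAGTGTTAA']
```

The number of strings is the product of the number of choices in each segment, which is where the 1*2*2 count comes from.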

Finally, a letter can be written with multiple characters, using a separator character between the letters (., period): A,6mA|T.G.C.T|5mC,C|G.C.5mC. This is useful if we want to represent similarities between some letters in a readable way. In the example above, the multi-character letters denote methylated variants of the standard letters: A = adenine, 6mA = N6-methyladenine, C = cytosine, 5mC = C5-methylcytosine. Another example is writing diphthongs, for example ae.
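To make the three levels of the notation concrete (segments, choices, letters), here is a minimal parsing sketch. The function parse is a hypothetical name used for illustration and is not taken from Polymera.

```python
def parse(notation, segment_sep="|", choice_sep=",", letter_sep="."):
    """Parse a notation into nested lists: segments -> choices -> letters.

    Hypothetical helper for illustration; not part of the Polymera API.
    """
    return [
        [choice.split(letter_sep) for choice in segment.split(choice_sep)]
        for segment in notation.split(segment_sep)
    ]


parsed = parse("A,6mA|T.G.C.T|5mC,C|G.C.5mC")
print(parsed[0])  # [['A'], ['6mA']]
print(parsed[1])  # [['T', 'G', 'C', 'T']]
print(parsed[3])  # [['G', 'C', '5mC']]
```

Because letters are split on the period, 6mA and 5mC are each kept as a single letter instead of three characters.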

Information content

Note that in a sequence, an ambiguous position can mean one of two things:

  1. Options: all letters noted in the position are suitable.
  2. Uncertainty: it's not exactly known what letter occupies the position.

This has implications for interpreting the Shannon information content of the sequence. For case (1) above (options), the Shannon information of a letter position is -log2(p), where p = 1 / n and n is the number of letters in the alphabet; the number of choices at the position does not enter the formula. The information of a 1-letter-long sequence with 2 choices (e.g. A,T) from a 4-letter alphabet is therefore 2 bits: -log2(1/4). For longer sequences, the information of a position is multiplied by the length of the sequence.

For case (2) (uncertainty), the probability (p) is calculated as the number of sequences represented divided by the number of possible sequences of the same length. Thus the information of A,T (which means either A or T, but it is not known which one) is only 1 bit: -log2(2/4). Consequently, the information of the uncertain position A,T,C,G (representing A or T or C or G) is zero, because -log2(4/4) = 0.
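The two interpretations can be checked numerically with a short sketch. The function names below are hypothetical and only restate the formulas from the text; they are not Polymera's API. The code uses log2(1/p) in place of -log2(p), which is the same quantity.

```python
import math


def information_option(length, alphabet_size):
    """Case (1), options: each position carries log2(n) bits."""
    return length * math.log2(alphabet_size)


def information_uncertainty(n_represented, length, alphabet_size):
    """Case (2), uncertainty: p = represented sequences / possible sequences."""
    n_possible = alphabet_size ** length
    return math.log2(n_possible / n_represented)


print(information_option(1, 4))          # 2.0  (A,T as options)
print(information_uncertainty(2, 1, 4))  # 1.0  (A,T as uncertainty)
print(information_uncertainty(4, 1, 4))  # 0.0  (A,T,C,G as uncertainty)
```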

Edit distance

The edit distance is the minimum number of operations required to transform one string into another. The Hamming distance is an edit distance where the only allowed operation is substitution. As with information, the value depends on how ambiguity is interpreted and on how distance is measured. In the simplest case, we ignore segments and compare each position in one sequence with the corresponding position in the other. In case (1) (options), the Hamming distance between two positions is zero if any of the choices match, and one otherwise. In case (2) (uncertainty), the distance between two positions is one minus the chance of a match, where the chance of a match is summed over the choices and divided by the number of choices. The total distance between two sequences is the sum of the distances at each position.
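A position-wise sketch of both distance variants follows. It represents each sequence as a list of sets of letters, one set per position. The function names are hypothetical, and the uncertainty variant assumes that the choices at each position are equally likely and independent between the two sequences; Polymera's own calculation may differ in detail.

```python
def hamming_options(seq_a, seq_b):
    """Case (1): a position costs 0 if the two choice sets share a letter."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(0 if a & b else 1 for a, b in zip(seq_a, seq_b))


def hamming_uncertainty(seq_a, seq_b):
    """Case (2): a position costs 1 minus the chance that the letters match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        # Assumed model: uniform, independent choices on both sides.
        p_match = len(a & b) / (len(a) * len(b))
        total += 1 - p_match
    return total


# T,C,G|CCC compared with T|GGG, written out position by position.
seq1 = [{"T", "C", "G"}, {"C"}, {"C"}, {"C"}]
seq2 = [{"T"}, {"G"}, {"G"}, {"G"}]

print(hamming_options(seq1, seq2))      # 3
print(hamming_uncertainty(seq1, seq2))  # approximately 3.667
```

In the first position the chance of a match is 1/3, so that position contributes 2/3 under the uncertainty interpretation and nothing under the options interpretation.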

Install

pip install polymera

Usage

Define a sequence:

import polymera

sequence = polymera.Sequence()
sequence.add_sequence_from_string("ATGAA,ATGCC|TATATTAGAAAAAA")
sequence.calculate_number_of_combinations()
# 2

Instantiate polymer:

polymer = polymera.Polymer(sequence, alphabet=polymera.bio.DNAAlphabet)
polymer.get_sequence_reverse_complement().to_string()
# TTTTTTCTAATATA|GGCAT,TTCAT
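For readers who want to see what the reverse complement of a segmented sequence involves, here is a standalone sketch that does not use Polymera. It assumes the standard DNA complement pairs (A with T, C with G), single-character letters, and the default separators; the function name is hypothetical.

```python
# Standard DNA complement pairs; an assumption of this sketch.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(notation, segment_sep="|", choice_sep=","):
    """Reverse the segment order and reverse-complement every choice."""
    segments = notation.split(segment_sep)
    return segment_sep.join(
        choice_sep.join(
            choice.translate(COMPLEMENT)[::-1]
            for choice in segment.split(choice_sep)
        )
        for segment in reversed(segments)
    )


print(reverse_complement("ATGAA,ATGCC|TATATTAGAAAAAA"))
# TTTTTTCTAATATA|TTCAT,GGCAT
```

This sketch keeps the choices in their original order, so the choices in the last segment are listed in a different order than in the Polymera output above. The two results denote the same set of sequences.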

Get an exact sequence:

exact_seq = sequence.get_exact_seq(randomize=True)
exact_seq.to_string()
# for example: ATGAATATATTAGAAAAAA

Polymera can calculate the information of a sequence, in bits:

sequence = polymera.Sequence()
sequence.add_sequence_from_string("T,A")
polymer = polymera.Polymer(
    sequence, alphabet=polymera.Alphabet(letters={"A", "T", "C", "G"})
)
polymer.get_information_content(method="option")
# 2
polymer.get_information_content(method="uncertainty")
# 1

Calculate Hamming distance:

seq1 = polymera.Sequence()
seq1.add_sequence_from_string("T,C,G|CCC")
seq2 = polymera.Sequence()
seq2.add_sequence_from_string("T|GGG")

polymera.hamming(seq1, seq2, comparison="options")
# 3
polymera.hamming(seq1, seq2, comparison="uncertainty")
# 3.666666666666667

Versioning

Polymera uses the semantic versioning scheme.

License = MIT

Polymera is free software, which means the users have the freedom to run, copy, distribute, study, change and improve the software.

Polymera was written at the Edinburgh Genome Foundry by Peter Vegh and is released under the MIT license.