MakeGMS
For a given genome and read size, the makegms
module creates a binary track, where 0 corresponds to the index of the repeating read in the genome, and 1 to the index of the unique read. A reed is considered repetitive if it is found more than once in the genome (including reverse complementary).
For example, let the genome be the string AAAAAATGCA
and the length of the read is 4
gms(AAAAAATGCA, 4) = 0001100
AAAAAATGCA -> AAAAAATGCA + TGCATTTTTT
AAAAAATGCATGCATTTTTT
AAAA x 3 -> 0
AAAA x 3 -> 0
AAAA x 3 -> 0
AAAT x 1 -> 1
AATG x 1 -> 1
ATGC x 2 -> 0
TGCA x 2 -> 0
Installation:
pip3 install --user git+https://github.com/latur/MakeGMS
Usage examples:
To run with the default parameters, you only need to specify the size of the read. In this case, the optimal track building algorithm will be selected, depending on the genome size:
>>> import makegms
>>> track = makegms.run('example.fa', read=15)
>>> ('').join(map(str, track))
'00111111111111000000000111111111111100111111...'
To indicate that you want to use QSort based algorithm exactly, pass the number of threads as an argument. It makes sense when threads > 1
. However, if you specify here a number greater than the number of processors you have, performance will decrease:
>>> track = makegms.run('example.fa', read=15, threads=2)
To use the algorithm on the filters, specify the quality of the hash function. The quality is related to the probability of a collision and the amount of RAM consumed:
>>> track = makegms.run('example.fa', read=15, quality=4)
Algorithm Description:
QSort-based method:
- Sorting reads. Used QuickSort
- Сomparison of current and next read (In the sort order)
- If the reads match, then increase the value of the repeat counter.
- If the reads do not match, save counter value for previous read and reset the counter for current read
Multithreading is provided by dividing all reads into disjoint blocks using the hash function:
block_number = Hash(read) % blocks_count
Bloom-based method:
- Creating two Bloom-filters of the same size:
F1
,F2
- The first round of all reads in the genome:
- For each read in the genome — add it to filter
F1
and if the read already exists in filterF1
then add it to filterF2
- The second round of all reads in the genome:
- Form a binary sequence: each read that is in filter
F2
corresponds to0
, otherwise1
Filter size is determined based on the quality parameter.
Quality parameter
To work, you need to allocate memory for two Bloom filters. The number of hash functions of the filter is determined in the optimal way to reduce the likelihood of a collision. Below is a table of memory requirements and the corresponding probability of a false definition of a unique read as repeated. GS
is Genome Size in bytes
Quality | Memory | Hashes | False Positive |
---|---|---|---|
3 | 5 × GS | 8 | 3.1423 * 10^-3 |
4 | 6 × GS | 11 | 4.5869 * 10^-4 |
5 | 7 × GS | 14 | 6.7134 * 10^-5 |
6 | 8 × GS | 17 | 9.8393 * 10^-6 |
7 | 9 × GS | 19 | 1.4400 * 10^-6 |
8 | 10 × GS | 22 | 2.1044 * 10^-7 |