FastContext
Description
FastContext is a tool for identifying adapters and other sequence patterns in next-generation sequencing (NGS) data. The algorithm parses FastQ files (in single-end or paired-end mode), searches each read or read pair for user-specified patterns, and then generates a human-readable representation of the search results, which we call the "read structure". FastContext also gathers statistics on the frequency of occurrence of each read structure.
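To make the idea of a read structure concrete, here is a minimal sketch of how such a string could be built from exact matches. It is an illustration only and not FastContext's actual matching algorithm: the function names `reverse_complement` and `read_structure` are hypothetical, only exact non-overlapping matches are considered, and K-mer reporting is left out.

```python
def reverse_complement(sequence):
    """Return the reverse complement of an ATGC sequence."""
    return sequence.translate(str.maketrans("ATGC", "TACG"))[::-1]


def read_structure(read, patterns, unrecognized="unknown"):
    """Build a text read structure from exact pattern matches (illustrative only)."""
    hits = []  # (start, end, label) of accepted, non-overlapping matches
    # Patterns are searched in the order given, forward strand first.
    for name, sequence in patterns.items():
        for strand, probe in (("F", sequence), ("R", reverse_complement(sequence))):
            start = read.find(probe)
            while start != -1:
                end = start + len(probe)
                # Accept the match only if it does not overlap an earlier hit.
                if all(end <= s or start >= e for s, e, _ in hits):
                    hits.append((start, end, f"{name}:{strand}"))
                start = read.find(probe, start + 1)
    hits.sort()
    parts, cursor = [], 0
    for start, end, label in hits:
        if start > cursor:  # gap before this match
            parts.append(unrecognized)
        parts.append(label)
        cursor = end
    if cursor < len(read):  # unmatched tail
        parts.append(unrecognized)
    return "--".join("{" + part + "}" for part in parts)


patterns = {"oligme": "CTGTCTCTTATACACATCT"}
print(read_structure("TTTT" + "CTGTCTCTTATACACATCT" + "GGGG", patterns))
# {unknown}--{oligme:F}--{unknown}
```

The real tool may resolve overlaps, strands and short unrecognized stretches differently, so treat the output format as the only part that mirrors the examples in this document.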
Installation
python3 -m pip install FastContext
Check installation:
FastContext --help
Usage
Arguments:
- -1, --r1
Required.
Format: String
Description: FastQ input R1 file. May be uncompressed, gzipped or bzipped.
Usage: -1 input.fastq.gz
- -p, --patterns
Required.
Description: Patterns to look for. The order of patterns is the order of search.
Format: Plain JavaScript object string (key-value). Names must contain 2-24 Latin and numeric symbols, and -_-; sequences must consist of more than one of the symbols ATGC.
Usage: -p '{"First": "CTCAGCGCTGAG", "Second": "AAAAAA", "Third": "GATC"}'
- -s, --summary
Required.
Description: Output HTML file. Contains a statistics summary in human-readable form.
Format: String
Usage: -s statistics.htm
- -2, --r2
Description: FastQ input R2 file. May be uncompressed, gzipped or bzipped. Omit this option in single-end mode.
Format: String
Usage: -2 input_R2.fastq.gz
- -j, --json
Description: Output JSON.GZ file (gzipped JSON). Contains extended statistics on pattern sequences and on each read or read pair: read structure and Levenshtein distances (see the -l option).
Format: String
Usage: -j statistics.json.gz
- -k, --kmer-size
Description: Maximum size of an unrecognized sequence to be written as a K-mer of a specific length.
Format: Non-negative integer
Default: 0
Usage: -k 9
- -u, --unrecognized
Description: Replacement label for long unrecognized sequences.
Format: 2-24 Latin and numeric symbols, and -_-
Default: unknown
Usage: -u genome
- -m, --max-reads
Description: Maximum number of reads to analyze (0 means no limit). Note that analyzing more reads than recommended may cause memory overflow.
Format: Non-negative integer
Default: 1000000
Usage: -m 1000
- -f, --rate-floor
Description: Minimum rate required for a read structure to be written into the summary statistics table.
Format: Float from 0 to 1
Default: 0.001
Usage: -f 0.1
- -@, --threads
Description: Number of threads.
Format: Non-negative integer less than 2 * cpu_count()
Default: cpu_count()
Usage: -@ 10
- -d, --dont-check-read-names
Description: Don't check read names. Use this if you have unusual (non-Illumina) paired read names. Makes sense only in paired-end mode.
Usage: -d
- -l, --levenshtein
Description: Calculate pattern Levenshtein distances for each position in the read. Results are written into the extended statistics file (JSON.GZ). Note that this greatly increases processing time.
Usage: -l
- -h, --help
Description: Show help message and exit.
Usage: -h
- -v, --version
Description: Show program's version number and exit.
Usage: -v
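The format rules for the -p option can be checked before launching a run. The sketch below is a hypothetical helper, not part of FastContext: `parse_patterns` and both regular expressions are assumptions, and it reads the allowed name symbols as Latin letters, digits, hyphen and underscore, which may differ from what the tool itself enforces.

```python
import json
import re

# Assumed reading of the documented rules (not taken from FastContext's source):
# names are 2-24 characters from letters, digits, "-" and "_";
# sequences are two or more characters from ATGC.
NAME_RE = re.compile(r"[A-Za-z0-9_-]{2,24}")
SEQUENCE_RE = re.compile(r"[ATGC]{2,}")


def parse_patterns(text):
    """Parse a -p style string and validate names and sequences."""
    patterns = json.loads(text)
    if not isinstance(patterns, dict):
        raise ValueError("patterns must be a key-value object")
    for name, sequence in patterns.items():
        if not NAME_RE.fullmatch(name):
            raise ValueError(f"invalid pattern name: {name!r}")
        if not isinstance(sequence, str) or not SEQUENCE_RE.fullmatch(sequence):
            raise ValueError(f"invalid sequence for {name!r}: {sequence!r}")
    # Key order is preserved, which matters because it is the order of search.
    return patterns


patterns = parse_patterns('{"First": "CTCAGCGCTGAG", "Second": "AAAAAA", "Third": "GATC"}')
print(list(patterns))  # ['First', 'Second', 'Third']
```

A string that passes this check can be handed to -p unchanged, for example through the `subprocess` module or directly on the command line.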
Examples
Summary statistics table
Contains counts, percentages and read structures. The K-mer length or pattern strand (F for forward, R for reverse) is displayed after the colon.
Example:
R1
Count | Percentage | Read Structure |
---|---|---|
5,197 | 48.807 | {unknown} |
3,297 | 30.963 | {unknown}--{oligme:F}--{oligb:F}--{701:F}--{unknown} |
114 | 1.070 | {unknown}--{oligb:F}--{701:F}--{unknown} |
71 | 0.666 | {unknown}--{oligme:F}--{unknown} |
69 | 0.648 | {unknown}--{oligme:F}--{unknown}--{701:F}--{unknown} |
60 | 0.563 | {unknown}--{oligme:F}--{oligb:F}--{701:F}--{kmer:14bp} |
R2
Count | Percentage | Read Structure |
---|---|---|
7,545 | 70.858 | {unknown} |
616 | 5.785 | {unknown}--{oligme:F}--{oliga:R}--{502:R}--{unknown} |
540 | 5.071 | {unknown}--{oligme:F}--{unknown} |
441 | 4.141 | {unknown}--{oligme:F}--{oliga:R}--{unknown} |
298 | 2.798 | {unknown}--{oliga:R}--{unknown} |
263 | 2.469 | {unknown}--{502:R}--{unknown} |
233 | 2.188 | {unknown}--{oligme:F}--{kmer:14bp}--{502:R}--{unknown} |
163 | 1.530 | {unknown}--{oliga:R}--{502:R}--{unknown} |
56 | 0.525 | {unknown}--{502:F}--{unknown} |
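The text read structures in these tables are regular enough to be split back into elements. The following sketch shows one way to do that. The function `parse_read_structure` is hypothetical, and the dictionary used for K-mer elements is an assumption, since this document only shows how pattern and unrecognized elements look in the JSON output.

```python
def parse_read_structure(text):
    """Split a text read structure into a list of element dictionaries."""
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"not a read structure: {text!r}")
    elements = []
    # Elements are wrapped in braces and joined with "--".
    for chunk in text[1:-1].split("}--{"):
        name, separator, detail = chunk.partition(":")
        if not separator:
            # No colon: an unrecognized stretch, labelled by the -u option.
            elements.append({"type": "unrecognized"})
        elif name == "kmer" and detail.endswith("bp"):
            # Assumed representation of a K-mer element such as {kmer:14bp}.
            elements.append({"type": "kmer", "length": int(detail[:-2])})
        else:
            elements.append({"type": "pattern", "name": name, "strand": detail})
    return elements


for element in parse_read_structure("{unknown}--{oligme:F}--{kmer:14bp}--{502:R}--{unknown}"):
    print(element)
```

A pattern that is itself named "kmer" would be ambiguous under this scheme, so the sketch is only suitable for quick inspection of summary tables.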
Extended statistics JSON.GZ file
Contains extended statistics: run options, performance, pattern analysis, the full summary without the rate floor, and the analysis of each read. The example is shortened.
{
"FastQ": {
"R1": "tests/standard_test_R1.fastq.gz",
"R2": "tests/standard_test_R2.fastq.gz"
},
"RunData": {
"Read Type": "Paired-end",
"Max Reads": 100,
"Rate Floor": 0.001
},
"Performance": {
"Reads Analyzed": 100,
"Threads": 4,
"Started": "2022-07-13T18:15:48.277660",
"Finished": "2022-07-13T18:15:48.964721"
},
"PatternsData": {
"PatternsList": {
"oligme": {
"F": "CTGTCTCTTATACACATCT",
"R": "AGATGTGTATAAGAGACAG",
"Length": 19
},
"s502": {
"F": "CTCTCTAT",
"R": "ATAGAGAG",
"Length": 8
}
},
"PatternsAnalysis": [
{
"Analysis": "reverse complement only",
"FirstPattern": "oligme",
"SecondPattern": "oligme",
"FirstLength": 19,
"SecondLength": 19,
"LevenshteinAbsolute": 11,
"LevenshteinSimilarity": 0.42105263157894735,
"Type": "good",
"Risk": "low"
},
{
"Analysis": "full",
"FirstPattern": "oligme",
"SecondPattern": "s502",
"FirstLength": 19,
"SecondLength": 8,
"LevenshteinAbsolute": 2,
"LevenshteinSimilarity": 0.75,
"Type": "nested",
"Risk": "medium"
}
],
"Other": {
"Unrecognized Sequence": "unknown",
"K-mer Max Size": 15
}
},
"Summary": {
"R1": {
"{unknown}--{oligme:F}--{oligb:F}--{s701:F}--{unknown}": {
"Count": 34,
"Percentage": 34.0,
"ReadStructure": [
{ "type": "unrecognized" },
{ "type": "pattern", "name": "oligme", "strand": "F" },
{ "type": "pattern", "name": "oligb", "strand": "F" },
{ "type": "pattern", "name": "s701", "strand": "F" },
{ "type": "unrecognized" }
]
},
"{unknown}--{oligme:F}--{unknown}--{s701:F}--{unknown}": [ "..." ],
"{unknown}--{s701:F}--{unknown}": [ "..." ]
},
"R2": [ "..." ]
},
"RawDataset": [
{
"Name": "M02435:112:000000000-DFC9M:1:1101:14970:1484",
"R1": {
"Sequence": "ACCTAGAAGAGCCAAAAGACTCT...AATCTCGTATGCCGTCT",
"PhredQual": [29,32,32,33,33,37,37,37,37,"...",38,38,38,13],
"Levenshtein": [
{
"name": "oligme",
"strand": "F",
"length": 19,
"values": [14,14,12,13,12,12,12,"...",NaN,NaN,NaN]
},
{
"name": "oligme",
"strand": "R",
"length": 19,
"values": [12,11,10,9,9,9,10,10,"...",NaN,NaN,NaN]
}
],
"ReadStructure": [
{ "type": "unrecognized" },
{ "type": "pattern", "name": "oligme", "strand": "F" },
{ "type": "pattern", "name": "oligb", "strand": "F" },
{ "type": "pattern", "name": "s701", "strand": "F" },
{ "type": "unrecognized" }
],
"TextReadStructure": "{unknown}--{oligme:F}--{oligb:F}--{s701:F}--{unknown}"
},
"R2": "..."
}
]
}
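A file like this can be inspected with the Python standard library alone. The sketch below assumes the key names shown in the example above. The helpers `load_report`, `top_structures` and `best_position` are hypothetical and are not part of FastContext. Note that the `NaN` entries in the Levenshtein values are not strict JSON, but Python's `json` module accepts them by default.

```python
import gzip
import json
import math


def load_report(path):
    """Read a gzipped JSON report, such as the file written with -j."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)  # NaN tokens become float("nan")


def top_structures(report, mate="R1", limit=5):
    """Return (structure, count, percentage) tuples, most frequent first."""
    summary = report["Summary"][mate]
    ranked = sorted(summary.items(), key=lambda item: item[1]["Count"], reverse=True)
    return [(text, entry["Count"], entry["Percentage"]) for text, entry in ranked[:limit]]


def best_position(levenshtein_entry):
    """Return (position, distance) of the smallest non-NaN Levenshtein value."""
    candidates = [
        (value, position)
        for position, value in enumerate(levenshtein_entry["values"])
        if not math.isnan(value)
    ]
    if not candidates:
        return None
    distance, position = min(candidates)
    return position, distance
```

With a real report, `top_structures(load_report("statistics.json.gz"), "R2")` would list the most frequent R2 structures, and `best_position` could be applied to each item of a read's "Levenshtein" list when the run used -l.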