TE-splitter

Extract terminal repeats from retrotransposons (LTRs) or DNA transposons (TIRs). Compose synthetic MITEs from complete DNA transposons.


Keywords
Transposon, LTR, TIR, MITE, TE, retrotransposon
License
MIT
Install
pip install TE-splitter==0.1.0

Documentation

TE-exterminate

Extract terminal repeats from retrotransposons (LTRs) or DNA transposons (TIRs). Optionally, compose synthetic MITEs from complete DNA transposons.

Algorithm overview

Exterminate attempts to identify terminal repeats in transposable elements by first aligning each element to itself using nucmer, and then applying a set of tuneable heuristics to select the alignment pair most likely to represent an LTR or TIR.

  1. Exclude all diagonal/self matches
  2. If LTR mode: Retain only alignment pairs on the same strand (direct repeats)
  3. If TIR mode: Retain only alignment pairs on opposite strands (inverted repeats)
  4. Retain pairs whose 5' match begins within x bases of the element start and whose 3' match ends within x bases of the element end (x is set with --maxdist)
  5. Exclude alignment pairs which overlap (potential SSRs)
  6. If multiple candidates remain, select the alignment pair with the largest internal segment (i.e. closest to the element ends)
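
The six steps above can be sketched in Python as follows. This is an illustrative outline only, not the tool's actual code: the `Hit` container, the function name `pick_terminal_repeat`, and the 1-based inclusive coordinate convention are all assumptions made for this example, and the alignment step itself (nucmer or blastn) is assumed to have already produced the hits.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # assumed convention: 1-based inclusive (start, end), start <= end


@dataclass
class Hit:
    """One self-alignment pair (hypothetical container, not pymummer's own type)."""
    ref: Span
    qry: Span
    same_strand: bool  # True if both matches lie on the same strand


def pick_terminal_repeat(hits: List[Hit], seq_len: int, mode: str = "LTR",
                         maxdist: int = 10) -> Optional[Tuple[Span, Span]]:
    """Return the (5' span, 3' span) of the best candidate pair, or None."""
    candidates = []
    for h in hits:
        # Step 1: exclude diagonal/self matches.
        if h.ref == h.qry:
            continue
        # Steps 2 and 3: LTR mode keeps same-strand pairs, TIR mode opposite-strand pairs.
        if (mode == "LTR") != h.same_strand:
            continue
        # Order the two spans so that `left` is the 5' match.
        left, right = sorted([h.ref, h.qry])
        # Step 4: both matches must lie within maxdist bases of the element ends.
        if left[0] - 1 > maxdist or seq_len - right[1] > maxdist:
            continue
        # Step 5: exclude overlapping pairs (potential SSRs).
        if left[1] >= right[0]:
            continue
        candidates.append((left, right))
    if not candidates:
        return None
    # Step 6: prefer the pair enclosing the largest internal segment.
    return max(candidates, key=lambda p: p[1][0] - p[0][1] - 1)
```

For a 1000 bp element, a same-strand pair at 3..100 and 905..998 would be preferred over one at 5..50 and 700..990, because it leaves the larger internal segment.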

Options and usage

Installing Exterminate

Requirements:

  • pymummer version >= 0.10.3 with wrapper for nucmer option --diagfactor.
  • MUMmer
  • BLAST+ (Optional)

Install from PyPI:

pip install exterminate

Clone and install from this repository:

git clone https://github.com/Adamtaranto/TE-exterminate.git && cd TE-exterminate && pip install -e .

Example usage

Split each element in retroelements.fasta into internal and external segments. Split segments are written to LTR_split_exterminate_output.fasta with the suffix "_I" for internal segments or "_LTR" for external segments. With the default settings, LTRs must be at least 10 bp long, share at least 80% identity, and occur within 10 bp of each end of the input element.

exterminate -i retroelements.fasta -p LTR_split --findmode LTR
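
To make the splitting step concrete, here is a minimal Python sketch of how an element could be cut once the two terminal repeats have been located. The function name `split_element` and the 1-based inclusive coordinates are assumptions for this example. How the tool itself labels the two LTR copies is not stated here, so the sketch names only the 5' copy with the "_LTR" suffix and returns the 3' copy separately; that choice is also an assumption.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # assumed convention: 1-based inclusive (start, end)


def split_element(name: str, seq: str, left: Span,
                  right: Span) -> Tuple[Dict[str, str], str]:
    """Cut `seq` into terminal and internal segments (illustrative only)."""
    five_prime = seq[left[0] - 1:left[1]]     # 5' terminal repeat
    internal = seq[left[1]:right[0] - 1]      # everything between the repeats
    three_prime = seq[right[0] - 1:right[1]]  # 3' terminal repeat
    # Suffixes follow the naming described above; labelling only the
    # 5' copy as "_LTR" is an assumption of this sketch.
    records = {f"{name}_LTR": five_prime, f"{name}_I": internal}
    return records, three_prime


records, three_prime = split_element("elem1", "AAAACCCCCCAAAA", (1, 4), (11, 14))
for seq_id, segment in records.items():
    print(f">{seq_id}\n{segment}")
```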

Standard options

Run exterminate --help to view the program's most commonly used options:

usage: exterminate [-h] -i INFILE [-p PREFIX] [-d OUTDIR]
                        [--findmode {LTR,TIR}]
                        [--splitmode {all,split,internal,external,None}]
                        [--makemites] [--keeptemp] [-v] [-m MAXDIST]
                        [--minid MINID] [--minterm MINTERM] [--minseed MINSEED]
                        [--diagfactor DIAGFACTOR] [--method {blastn,nucmer}]


Help:
  -h, --help                        Show this help message and exit


Input:
  -i INFILE, --infile INFILE        Multifasta containing complete elements. (required)  


Output:
  -p PREFIX, --prefix PREFIX        All output files begin with this string.  (Default:[infile basename])  
  -d OUTDIR, --outdir OUTDIR        Write output files to this directory. (Default: cwd)  
  --keeptemp                        If set, do not remove temp directory on completion.
  -v, --verbose                     If set, report progress.


Report settings:
  --findmode                        Type of terminal repeat to identify. (Default: LTR)  
                                      Options: {LTR,TIR}  
  --splitmode                       Options: {all,split,internal,external,None} (Default: split)  
                                      all = Report input sequence as well as internal and external segments.  
                                      split = Report internal and external segments after splitting.  
                                      internal = Report only internal segments.  
                                      external = Report only terminal repeat segments.  
                                      None = Only report synthetic MITEs (when --makemites is also set).  
  --makemites                       Attempt to construct synthetic MITE sequences from TIRs by concatenating 5' and 3' TIRs.  


Alignment settings:
  --method                          Select alignment tool. Note: blastn may perform better on very short high-identity terminal repeats,
                                    while nucmer is more robust to small indels.
                                    Options: {blastn,nucmer} (Default: nucmer)
  --minid MINID                     Minimum identity between terminal repeat pairs. As float. (Default: 80.0)  
  --minterm MINTERM                 Minimum length for a terminal repeat to be considered.  
                                      Equivalent to nucmer "--mincluster" (Default: 10)  
  -m MAXDIST, --maxdist MAXDIST     Terminal repeat candidates must be no more than this many bases from ends of input element. 
                                      Note: Increase this value if you suspect that your element is nested within some flanking sequence. (Default: 10)
  --minseed MINSEED                 Minimum length of a maximal exact match to be included in final match cluster. 
                                      Equivalent to nucmer "--minmatch". (Default: 5)
  --diagfactor DIAGFACTOR           Maximum diagonal difference factor for clustering of matches within nucmer, i.e. diagonal difference / match separation. (Default: 0.20)
                                      Note: Increase value for greater tolerance of indels between terminal repeats.
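
As an illustration of what --makemites describes, the sketch below builds a synthetic MITE by joining the 5' and 3' TIRs of an element and dropping the internal segment. The function name `make_synthetic_mite` and the 1-based inclusive coordinates are assumptions for this example; the tool's own implementation may differ in its details.

```python
from typing import Tuple

Span = Tuple[int, int]  # assumed convention: 1-based inclusive (start, end)


def make_synthetic_mite(seq: str, left: Span, right: Span) -> str:
    """Concatenate the 5' and 3' TIRs of `seq`, omitting the internal segment."""
    five_prime = seq[left[0] - 1:left[1]]
    three_prime = seq[right[0] - 1:right[1]]
    return five_prime + three_prime


# Toy element: the 3' TIR (AACGT) is the reverse complement of the 5' TIR (ACGTT).
element = "ACGTT" + "GGGGGGG" + "AACGT"
print(make_synthetic_mite(element, (1, 5), (13, 17)))
```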

License

Software provided under MIT license.