Motif Scraper
Pythonic tool to search for degenerate motif matches in FASTA sequence files.
Installation
Motif scraper is available via pip or GitHub download. We HIGHLY recommend installing in a Python virtual environment.
pip install motif_scraper
Or as a user install:
pip install --user motif_scraper
Or install from GitHub clone.
git clone https://github.com/RobersonLab/motif_scraper.git
cd motif_scraper
git checkout vN.N.N # Choose highest version tag instead of vN.N.N
pip install -e .
Testing local install
You can quickly verify the installation in a Linux-like environment that has wget. If necessary, first switch to the appropriate virtual environment. The following snippet downloads a test FASTA file containing only human contig KI270394.1 from the GRCh38 human genome build and searches it for a simple exact motif.
wget https://raw.githubusercontent.com/RobersonLab/motif_scraper/master/sample_data/KI270394.fa
motif_scraper --motif TTTGCA --outputFile test.csv KI270394.fa
The tool's log output should report that 3 sites were found. You can confirm them with:
cat test.csv
which should display the following:
Contig,Start,End,Strand,Sequence,Motif
KI270394.1,94,99,+,TTTGCA,TTTGCA
KI270394.1,436,441,+,TTTGCA,TTTGCA
KI270394.1,170,165,-,TTTGCA,TTTGCA
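For downstream work, the CSV output can be read with Python's standard csv module. The following is a minimal sketch, not part of motif_scraper: `load_sites` is a hypothetical helper, and the column names are taken from the sample output above. Note that the minus-strand row in the sample lists Start greater than End, so the site length is computed with `abs()`.

```python
import csv
import io

# Sample output copied from the test run above.
SAMPLE_CSV = """Contig,Start,End,Strand,Sequence,Motif
KI270394.1,94,99,+,TTTGCA,TTTGCA
KI270394.1,436,441,+,TTTGCA,TTTGCA
KI270394.1,170,165,-,TTTGCA,TTTGCA
"""


def load_sites(handle):
    """Read CSV rows into dicts, converting Start and End to integers.

    Hypothetical helper for illustration; not a motif_scraper function.
    """
    sites = []
    for row in csv.DictReader(handle):
        row["Start"] = int(row["Start"])
        row["End"] = int(row["End"])
        sites.append(row)
    return sites


sites = load_sites(io.StringIO(SAMPLE_CSV))
for site in sites:
    # Minus-strand rows in the sample have Start > End, hence abs().
    span = abs(site["End"] - site["Start"]) + 1
    print(site["Contig"], site["Strand"], span)
```

To read a real result file instead of the embedded sample, pass an open file handle, for example `open("test.csv", newline="")`.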
Usage
Find all sites for the CTCF motif NNDCCACYAGRKGGCASYR in GRCh38.
motif_scraper --motif NNDCCACYAGRKGGCASYR --outputFile ctcf_sites.csv --search_strand=both GRCh38.fa
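To illustrate what a degenerate motif such as NNDCCACYAGRKGGCASYR means, here is a minimal sketch that expands standard IUPAC nucleotide codes into a regular expression and scans a sequence for overlapping matches. It is not motif_scraper's actual implementation; `motif_to_regex` and `find_sites` are hypothetical names, and the 1-based inclusive coordinates are an assumption based on the sample output.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif into a regex string (hypothetical helper)."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_sites(motif, sequence):
    """Yield (start, end, matched_sequence) for plus-strand matches.

    Coordinates are assumed to be 1-based and inclusive. The lookahead
    lets overlapping matches be reported.
    """
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    for match in pattern.finditer(sequence.upper()):
        start = match.start() + 1
        yield start, start + len(motif) - 1, match.group(1)


print(motif_to_regex("NNDCCACYAGRKGGCASYR"))
print(list(find_sites("AYA", "ACATA")))
```

Each IUPAC code matches exactly one base, so the match length always equals the motif length.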
Find CTCF sites on chromosome 1 only.
motif_scraper -r chr1 --motif NNDCCACYAGRKGGCASYR --outputFile ctcf_sites.csv --search_strand=both GRCh38.fa
Find CTCF sites only from position 10,000 to position 10,000,000 on chromosome 1.
motif_scraper -r chr1:10000-10000000 --motif NNDCCACYAGRKGGCASYR --outputFile ctcf_sites.csv --search_strand=both GRCh38.fa
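The region strings used above take the form `contig` or `contig:start-end`. As a rough sketch of how such a string can be split into its parts, assuming plain digits for the coordinates and no colon in the contig name, consider the hypothetical `parse_region` below. It is illustrative only and may not match how motif_scraper parses the `-r` argument.

```python
import re

# contig, optionally followed by :start-end (digits only; an assumption).
REGION_RE = re.compile(r"^(?P<contig>[^:]+)(?::(?P<start>\d+)-(?P<end>\d+))?$")


def parse_region(region):
    """Split 'chr1' or 'chr1:10000-10000000' into (contig, start, end).

    Hypothetical helper. Returns None for start and end when only a
    contig name is given.
    """
    match = REGION_RE.match(region)
    if match is None:
        raise ValueError("Unrecognized region string: " + region)
    contig = match.group("contig")
    if match.group("start") is None:
        return contig, None, None
    start = int(match.group("start"))
    end = int(match.group("end"))
    if start > end:
        raise ValueError("Region start is greater than end: " + region)
    return contig, start, end


print(parse_region("chr1"))
print(parse_region("chr1:10000-10000000"))
```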
Find CTCF match sites only on the top strand, using 10 processors.
motif_scraper --cores 10 --motif NNDCCACYAGRKGGCASYR --outputFile ctcf_sites.csv --search_strand=+1 GRCh38.fa
Search an Ensembl download of all protein-coding transcript 3' UTRs for hsa-miR-10a sites on the minus strand.
motif_scraper --cores 10 --motif TACCCTGTAGATCCGAATTTGTG --outputFile mir10a_sites.csv --search_strand=-1 GRCh38_3pUTRs.fa
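One common way to think about a minus-strand search is that it is equivalent to searching the plus strand for the reverse complement of the motif. The sketch below shows a reverse complement that also handles IUPAC degenerate codes. `reverse_complement` is a hypothetical helper for illustration, and this is not necessarily how motif_scraper handles strands internally.

```python
# Complement table for the standard IUPAC nucleotide codes.
# For example R (A or G) pairs with Y (C or T), and K (G or T) with M (A or C).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Return the reverse complement of an IUPAC motif (hypothetical helper)."""
    return motif.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("TTTGCA"))
print(reverse_complement("TACCCTGTAGATCCGAATTTGTG"))
```

Applying the function twice returns the original motif, which is a quick sanity check on the complement table.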
Run the same minus-strand hsa-miR-10a search of the Ensembl 3' UTR download, but write results to one temporary file per contig / strand instead of buffering all sites in memory. The temporary files are combined and then removed at the end. The output has an md5 sum identical to that of the memory-buffered run, but this mode works on low-memory machines.
motif_scraper --file_buffer --cores 10 --motif TACCCTGTAGATCCGAATTTGTG --outputFile mir10a_sites.csv --search_strand=-1 GRCh38_3pUTRs.fa
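If you want to check for yourself that a file-buffered run and a memory-buffered run produce the same output, you can compare md5 sums of the two result files. Below is a minimal sketch using Python's hashlib; `md5_of_file` is a hypothetical helper, and any file names you pass to it are your own.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks.

    Hypothetical helper; chunked reads keep memory use low for large outputs.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two runs match when `md5_of_file(first_output) == md5_of_file(second_output)`, where both paths are output files you saved under different names.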
Get debugging messages to troubleshoot code problems.
motif_scraper --loglevel DEBUG --cores 10 --motif TACCCTGTAGATCCGAATTTGTG --outputFile mir10a_sites.csv --search_strand=-1 GRCh38_3pUTRs.fa
Search for all motifs contained in a file.
motif_scraper --motif_file many_motifs.txt --outputFile many_motif_sites.csv GRCh38.fa
Directly input a valid regular expression instead of a sequence motif.
motif_scraper --motif N{19}CTR{3} --valid_regex GRCh38.fa