NCBI Tool Kit
A tool kit for downloading and curating collections of genomes retrieved from the National Center for Biotechnology Information's public database, GenBank. NCBITK currently supports downloading bacterial genomes only.
- Automatically synchronize your local collection with the latest assembly versions.
- Give FASTAs useful names based on information available in the assembly summary file and the taxonomy dump file.
Requires rsync. Tested only with rsync version 3.1.2, protocol version 31.
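To see which rsync is installed before running NCBITK, a minimal Python sketch like the one below may help. The helpers `parse_rsync_version` and `installed_rsync_version` are hypothetical names and are not part of NCBITK. The sketch assumes that `rsync --version` prints a first line of the form `rsync  version 3.1.2  protocol version 31`.

```python
import re
import shutil
import subprocess


def parse_rsync_version(banner):
    """Return (version, protocol) parsed from `rsync --version` output, or None."""
    # Assumed banner format: "rsync  version 3.1.2  protocol version 31"
    match = re.search(
        r"version\s+v?(\d+(?:\.\d+)+)\s+protocol version\s+(\d+)", banner
    )
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def installed_rsync_version():
    """Return (version, protocol) for the rsync found on PATH, or None."""
    if shutil.which("rsync") is None:
        return None  # rsync is not installed or not on PATH
    result = subprocess.run(
        ["rsync", "--version"], capture_output=True, text=True, check=False
    )
    return parse_rsync_version(result.stdout)
```

Calling `installed_rsync_version()` returns `None` when rsync cannot be found, which is a reasonable signal to install it before continuing.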
Installation
Using pip:
pip install ncbitk
Or:
git clone https://github.com/andrewsanchez/NCBITK.git
cd NCBITK
python setup.py install
Regardless of which installation method you choose, I recommend using a virtual environment.
Basic Usage
Download all GenBank bacterial genomes:
ncbitk [directory] --update
If you have already run NCBITK, the same command also updates your local collection: it removes genomes that are no longer in the assembly summary and downloads the latest assembly versions.
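The update behaviour described above amounts to comparing two sets of genome identifiers. The following Python sketch illustrates the idea only; it is not NCBITK's actual implementation, `plan_update` is a hypothetical helper, and the accession IDs are made up.

```python
def plan_update(local_ids, remote_ids):
    """Compare local genome IDs with those listed in the assembly summary.

    Returns (to_download, to_remove) as sorted lists.
    """
    local, remote = set(local_ids), set(remote_ids)
    to_download = sorted(remote - local)  # listed in the summary, missing locally
    to_remove = sorted(local - remote)    # present locally, no longer in the summary
    return to_download, to_remove


# Hypothetical accession IDs, for illustration only.
local = ["GCA_000000001.1", "GCA_000000002.1"]
remote = ["GCA_000000002.2", "GCA_000000003.1", "GCA_000000001.1"]
to_download, to_remove = plan_update(local, remote)
print(to_download)  # ['GCA_000000002.2', 'GCA_000000003.1']
print(to_remove)    # ['GCA_000000002.1']
```

In this example the older assembly version `GCA_000000002.1` is removed and its successor `GCA_000000002.2` is downloaded, along with the genome that was missing entirely.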
Download only E. coli genomes:
ncbitk [directory] --species Escherichia_coli --update
Note that each name passed to the --species option must exactly match a species directory at ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/ (for example, Escherichia_coli).
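Because the match must be exact, a name such as "Escherichia coli" (with a space) will not work. The Python sketch below shows one way to check names before running the command and to suggest likely corrections. `check_species` is a hypothetical helper, not part of NCBITK, and the directory list is a small hard-coded sample rather than the real FTP listing.

```python
import difflib


def check_species(names, available_dirs):
    """Split requested names into exact matches and {name: suggestions}."""
    available = set(available_dirs)
    matched, unmatched = [], {}
    for name in names:
        if name in available:
            matched.append(name)
        else:
            # Suggest up to three similarly spelled directory names.
            unmatched[name] = difflib.get_close_matches(name, available_dirs, n=3)
    return matched, unmatched


# Small sample of directory names, assumed for illustration.
dirs = ["Escherichia_coli", "Escherichia_albertii", "Salmonella_enterica"]
matched, unmatched = check_species(["Escherichia_coli", "Escherichia coli"], dirs)
print(matched)    # names that can be passed to --species as-is
print(unmatched)  # names that need correcting, with suggestions
```

In practice the list of available directories would come from the FTP site itself; it is hard-coded here to keep the sketch self-contained.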
Get the status of your collection:
ncbitk [directory] --status
This will tell you how many genomes you have, what is missing from your collection, and how many deprecated genomes are present.