Access BioMart from the command line.


Keywords
BioMart
License
GPL-3.0
Install
pip install biomartian==0.0.1

Documentation

 _     _                            _   _
| |__ (_) ___  _ __ ___   __ _ _ __| |_(_) __ _ _ __
| '_ \| |/ _ \| '_ ` _ \ / _` | '__| __| |/ _` | '_ \
| |_) | | (_) | | | | | | (_| | |  | |_| | (_| | | | |
|_.__/|_|\___/|_| |_| |_|\__,_|_|   \__|_|\__,_|_| |_|
                   Query biomart from the command line

biomartian enables querying BioMart from the command line.

biomartian greatly simplifies extracting data from BioMart.

Instead of having to

  1. open an R session
  2. load biomaRt
  3. load a mart and dataset
  4. write the code required to extract the data you need
  5. merge the new data into a dataset

you can simply call a single simple biomartian command!

biomartian also aids BioMart discoverability since you can use standard unix tools like grep to search the BioMart results, instead of having to write code to look for and extract specific items from dataframes.

Lastly, biomartian caches all queries across sessions (in ~/.biomartian), so that subsequent queries are instantaneous.

Note that due to server trouble for the main biomart, biomartian now only works for ensembl biomart. It is the one that is used in 99% of all cases. I might reverse this in the future when the biomart server is up and stable again (if you create an issue on the tracker).

Changelog
# 0.0.19 (23.11.2015)
- Only use the ensembl host due to unstable server for the main biomart.
# 0.0.18 (03.09.2015)
Allow for four letter abbreviations for ensembl datasets, like `hsap` for `hsapiens_gene_ensembl`.

Examples

Find the name of mrna ids in BioMart for the common rat

biomartian -d rnor --list-attributes | grep -i mrna
refseq_mrna RefSeq mRNA [e.g. NM_001195597]
refseq_mrna_predicted   RefSeq mRNA predicted [e.g. XM_001125684]

Note that we did not need to write the name of the mart since ensembl is the default.

Get the refseq mrna id for all regular gene names and attach them to an input file

$ head simple.txt
"logFC" "AveExpr"
"Ipcef1"    -2.70987558746701   4.80047582653889
"Sema3b"    2.00143465979322    3.82969788437155
"Rab26" -2.40250648553797   5.57320249609294
"Arhgap25"  -1.84668909768998   3.66617832656769
"Ociad2"    -1.99052684394044   5.26213130909702
"Mmp17" -2.01026790614161   4.88012776225311
"C4a"   2.22003976804983    3.52842041243544
"Gna14" -2.42391191670209   1.56313048066253
"Kcna6" -1.74168813159872   6.54586068659631

$ biomartian -d rnorvegicus_gene_ensembl -c 0 -i external_gene_name -o refseq_mrna simple.txt
index   logFC   AveExpr refseq_mrna
Ipcef1  -2.70987558746701   4.80047582653889    NM_001170799
Sema3b  2.00143465979322    3.82969788437155    NM_001079942
Rab26   -2.40250648553797   5.57320249609294    NM_133580
Arhgap25    -1.84668909768998   3.66617832656769    NM_001109247
Ociad2  -1.99052684394044   5.26213130909702    NM_001271181
Mmp17   -2.01026790614161   4.88012776225311    NM_001105925
C4a 2.22003976804983    3.52842041243544    NM_031504
C4a 2.22003976804983    3.52842041243544    NA
Gna14   -2.42391191670209   1.56313048066253    NM_001013151
Kcna6   -1.74168813159872   6.54586068659631    NM_023954

Install

pip install biomartian

Usage

biomartian

Query biomart from the command line.
For help and examples, visit github.com/endrebak/biomartian

Usage:
    biomartian [--mart=MART] [--dataset=DATA] --mergecol=COL... --intype=IN... --outtype=OUT... [--noheader] FILE
    biomartian [--mart=MART] [--dataset=DATA] --intype=IN --outtype=OUT
    biomartian --list-marts
    biomartian [--mart=MART] --list-datasets
    biomartian [--mart=MART] [--dataset=DATASET] --list-attributes

Arguments:
    FILE                   file with COL(s) to join mart data on (- for STDIN)
    -i IN --intype=IN      the datatype in the column to merge on
    -o OUT --outtype=OUT   the datatype to get (joining on value COL)
    -c COL --mergecol=COL  name or number of the column to join on in FILE

Note:
    Required args --intype, --outtype and --mergecol must be equal in number.

Options:
    -h      --help          show this message
    -m MART --mart=MART     which mart to use [default: ENSEMBL_MART_ENSEMBL]
    -d DATA --dataset=DATA  which dataset to use [default: hsapiens_gene_ensembl]
    -n --noheader           the input data does not contain a header (must
                            use integers to denote COL)

Lists:
    --list-marts       show all available marts
    --list-datasets    show all available datasets for MART
    --list-attributes  show all kinds of data available for MART and DATASET

TODO

  • enable viewing dates of cached data
  • enable removing one single dataset from the cache
  • add one letter abbreviations for the list functions (-A is --list-attributes, etc.)
  • get more than two datasets at a time using get_bm (will still only query pairs of data under the hood for maximum cacheability and merge them locally.)

Issues

Please use the biomartian issues page for issues, suggestions, feature-requests and troubleshooting.

Requirements

Python, either version 2.7 or 3.x.

  • bioservices, pandas, docopt, joblib, ebs

biomartian is 100% Python, so all the required dependencies are installed with a simple pip install biomartian.

API

biomartian does not intentionally expose an API, but please see the module biomartian/bm_queries/bm_query.py to learn how to query biomart using Python. It is serviceable as an API, and some of my packages use it as one so it will most likely be extremely stable.

Thanks

Thomas Cokelaer for his bioservices package.

See his page for citation info.

Similar software

(Feel free to add to this list)

  • biomartpy (Python -> rpy2 -> R/BioConductor's biomaRt)