OmicIDX Builder
OmicIDX Builder includes supporting code to process and build the OmicIDX applications and data resources. It is not meant for end-users and requires a Google Cloud Project ($$) to use.
Related OmicIDX projects can be found on the OmicIDX Github Organization.
Roadmap
- [-] Bigquery tables
- [X] SRA
- [X] Biosample
- [ ] GEO
- [-] JSON dump files
- [X] SRA
- [X] Biosample
- [ ] GEO
- [-] REST API
- [X] SRA
- [X] Biosample
- [ ] GEO
- [-] GraphQL API
- [ ] SRA
- [ ] Biosample
- [ ] GEO
Installation for local usage
Installation
pip install poetry
poetry install omicidx_builder
Configuration
Needs environment variables:
- ES_HOST
- GCS_STAGING_URL
- GCS_EXPORT_URL
Pipeline
The data processing pipelines are run from the command-line. Notes are below.
SRA
omicidx_builder sra --help
export NCBI_MIRROR_DIR=NCBI_SRA_Mirroring_20200201_Full
omicidx_builder sra download $NCBI_MIRROR_DIR
cd NCBI_SRA_Mirroring_20190801_Full
omicidx_builder sra parse-entity study
omicidx_builder sra parse-entity sample
omicidx_builder sra parse-entity experiment
omicidx_builder sra parse-entity run
cd ..
omicidx_builder sra upload $NCBI_MIRROR_DIR
omicidx_builder sra load-sra-data-to-bigquery
omicidx_builder sra sra-to-bigquery
omicidx_builder sra sra-bigquery-for-elasticsearch
omicidx_builder sra gcs-dump
omicidx_builder sra gcs-to-elasticsearch
Biosample
omicidx_builder biosample --help
Here are the steps. This requires about 20GB of local storage.
omicidx_builder biosample download
omicidx_builder biosample parse biosample_set.xml.gz biosample.json
omicidx_builder biosample upload
omicidx_builder biosample load
omicidx_builder biosample etl-to-public
omicidx_builder biosample gcs-dump
omicidx_builder biosample gcs-to-elasticsearch
elasticsearch
import elasticsearch_dsl
import omicidx_builder.elasticsearch_utils as es
searcher = elasticsearch_dsl.Search(using = es.get_client())
from elasticsearch_dsl import Search
s = (searcher.index("sra_study")
.query("match", title="cancer")
.exclude("match", description="cancer"))
response = s.execute()
for hit in response:
print(hit.meta.score, hit.title)
for tag in response.aggregations.per_tag.buckets:
print(tag.key, tag.max_lines.value)
Development
running tests
poetry run pytest --cov=omicidx_builder tests
Running long-running tests:
LONG_TESTS=1 poetry run pytest --cov=omicidx_builder tests