omicidx-builder

Tooling to build and deploy omicidx data and resources


Keywords
bioinformatics, genomics, metadata, omicidx, sequencing, NCBI
License
MIT
Install
pip install omicidx-builder==0.2.6

Documentation

OmicIDX Builder

OmicIDX Builder includes supporting code to process and build the OmicIDX applications and data resources. It is not meant for end users, and it requires a Google Cloud project, which incurs costs, to use.

Related OmicIDX projects can be found in the OmicIDX GitHub organization.

Roadmap

  • [-] BigQuery tables
    • [X] SRA
    • [X] Biosample
    • [ ] GEO
  • [-] JSON dump files
    • [X] SRA
    • [X] Biosample
    • [ ] GEO
  • [-] REST API
    • [X] SRA
    • [X] Biosample
    • [ ] GEO
  • [-] GraphQL API
    • [ ] SRA
    • [ ] Biosample
    • [ ] GEO

Installation for local usage

Installation

pip install poetry
# run from the root of a cloned omicidx-builder repository
poetry install

Configuration

The following environment variables must be set:

  • ES_HOST
  • GCS_STAGING_URL
  • GCS_EXPORT_URL
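
A quick pre-flight check can catch a missing variable before a long pipeline run starts. The following is a minimal sketch, not part of omicidx_builder: the helper name `missing_config` is hypothetical, and the only facts taken from this README are the three variable names.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = ("ES_HOST", "GCS_STAGING_URL", "GCS_EXPORT_URL")


def missing_config(environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to the process environment; passing a dict
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_config()` with no arguments inspects the current shell environment; an empty list means all three variables are present.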

Pipeline

The data processing pipelines are run from the command line. The steps for each data source are listed below.

SRA

omicidx_builder sra --help
export NCBI_MIRROR_DIR=NCBI_SRA_Mirroring_20200201_Full
omicidx_builder sra download $NCBI_MIRROR_DIR
cd $NCBI_MIRROR_DIR
omicidx_builder sra parse-entity study
omicidx_builder sra parse-entity sample
omicidx_builder sra parse-entity experiment
omicidx_builder sra parse-entity run
cd ..
omicidx_builder sra upload $NCBI_MIRROR_DIR
omicidx_builder sra load-sra-data-to-bigquery
omicidx_builder sra sra-to-bigquery
omicidx_builder sra sra-bigquery-for-elasticsearch
omicidx_builder sra gcs-dump
omicidx_builder sra gcs-to-elasticsearch
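
The SRA steps above are order-dependent, so it can help to drive them from a small script that stops at the first failure. The sketch below is one way to do that, assuming the `omicidx_builder` command is on the PATH; the function names `sra_pipeline_commands` and `run_pipeline` are hypothetical and not part of the package. The subcommand names are copied from the listing above.

```python
import shlex
import subprocess

ENTITIES = ["study", "sample", "experiment", "run"]
POST_UPLOAD_STEPS = [
    "load-sra-data-to-bigquery",
    "sra-to-bigquery",
    "sra-bigquery-for-elasticsearch",
    "gcs-dump",
    "gcs-to-elasticsearch",
]


def sra_pipeline_commands(mirror_dir):
    """Return the SRA steps as (argv, cwd) pairs in execution order."""
    steps = [(["omicidx_builder", "sra", "download", mirror_dir], None)]
    # The parse-entity steps run from inside the mirror directory.
    steps += [
        (["omicidx_builder", "sra", "parse-entity", entity], mirror_dir)
        for entity in ENTITIES
    ]
    steps.append((["omicidx_builder", "sra", "upload", mirror_dir], None))
    steps += [(["omicidx_builder", "sra", step], None) for step in POST_UPLOAD_STEPS]
    return steps


def run_pipeline(mirror_dir, dry_run=True):
    """Print each step (dry run) or execute it, stopping on the first error."""
    for argv, cwd in sra_pipeline_commands(mirror_dir):
        if dry_run:
            print(f"[{cwd or '.'}] {shlex.join(argv)}")
        else:
            subprocess.run(argv, cwd=cwd, check=True)
```

With `dry_run=True` (the default) nothing is executed, which makes it safe to review the command sequence before spending time and cloud resources on a real run.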

Biosample

omicidx_builder biosample --help

The steps are listed below. They require about 20GB of local storage.

omicidx_builder biosample download
omicidx_builder biosample parse biosample_set.xml.gz biosample.json
omicidx_builder biosample upload
omicidx_builder biosample load
omicidx_builder biosample etl-to-public
omicidx_builder biosample gcs-dump
omicidx_builder biosample gcs-to-elasticsearch
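
Because the Biosample steps need roughly 20GB of local storage, checking free disk space before the download can save a failed run. This is a minimal sketch using only the Python standard library; the helper name `has_enough_space` and the exact byte threshold are assumptions for illustration, not part of omicidx_builder.

```python
import shutil

# About 20GB, per the storage note above (assumed here to mean 20 GiB).
REQUIRED_BYTES = 20 * 1024**3


def has_enough_space(path=".", required=REQUIRED_BYTES):
    """Return True if the filesystem containing `path` has `required` bytes free."""
    return shutil.disk_usage(path).free >= required
```

Run it against the directory where `biosample_set.xml.gz` and `biosample.json` will be written.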

Elasticsearch

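import elasticsearch_dsl
import omicidx_builder.elasticsearch_utils as es

# Build a Search object bound to the project's Elasticsearch client.
searcher = elasticsearch_dsl.Search(using=es.get_client())

# Studies whose title matches "cancer" but whose description does not.
s = (searcher.index("sra_study")
    .query("match", title="cancer")
    .exclude("match", description="cancer"))

response = s.execute()

# Print the relevance score and title of each hit.
for hit in response:
    print(hit.meta.score, hit.title)

# The search above defines no aggregations, so the loop below only works
# after adding a "per_tag" bucket with a "max_lines" metric via s.aggs.
# for tag in response.aggregations.per_tag.buckets:
#     print(tag.key, tag.max_lines.value)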

Development

Running tests

poetry run pytest --cov=omicidx_builder tests

Running long-running tests:

LONG_TESTS=1 poetry run pytest --cov=omicidx_builder tests
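
This README does not describe how the test suite reads `LONG_TESTS`. One common way to gate slow tests on such a variable is sketched below with the standard library's `unittest` (pytest collects `unittest.TestCase` classes as well). The helper `long_tests_enabled` and the test class are hypothetical and not taken from the project's tests.

```python
import os
import unittest


def long_tests_enabled(environ=None):
    """Return True when LONG_TESTS is set to "1" (assumed convention)."""
    env = os.environ if environ is None else environ
    return env.get("LONG_TESTS") == "1"


class ExampleTests(unittest.TestCase):
    def test_quick(self):
        # Always runs.
        self.assertEqual(1 + 1, 2)

    @unittest.skipUnless(long_tests_enabled(), "set LONG_TESTS=1 to run")
    def test_slow(self):
        # Placeholder for a test that downloads or parses large files.
        self.assertTrue(True)
```

With this pattern, the plain `pytest` invocation reports the slow test as skipped, and the `LONG_TESTS=1` invocation runs it.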