omicidx-builder

Tooling to build and deploy omicidx data and resources


Keywords
bioinformatics, genomics, metadata, omicidx, sequencing, NCBI
License
MIT
Install
pip install omicidx-builder==0.2.6

Documentation

OmicIDX Builder

This project includes a command-line utility and supporting code to process and build the OmicIDX applications and data resources.

The code for parsing and modeling the data is available in the main OmicIDX repo.

Roadmap

  • [-] Bigquery tables
    • [X] SRA
    • [X] Biosample
    • [ ] GEO
  • [-] JSON dump files
    • [X] SRA
    • [X] Biosample
    • [ ] GEO
  • [-] REST API
    • [ ] SRA
    • [ ] Biosample
    • [ ] GEO
  • [-] GraphQL API
    • [ ] SRA
    • [ ] Biosample
    • [ ] GEO

Pipeline

The data processing pipelines are run from the command line. Notes for each data source follow.

SRA

python -m omicidx_builder.cli sra
Usage: cli.py sra [OPTIONS] COMMAND [ARGS]...

  Use these commands to process SRA metadata

Options:
  --help  Show this message and exit.

Commands:
  download                        Downloads the files necessary to build
                                  the...
  gcs-dump                        Write json.gz format of sra entities to...
  load-sra-data-to-bigquery       Load gcs files to Bigquery
  parse-entity                    SRA XML to JSON Transforms an SRA XML...
  sra-bigquery-for-elasticsearch  ETL queries to create elasticsearch
                                  tables...
  sra-gcs-to-elasticsearch        ETL query to public schema for all SRA...
  sra-to-bigquery                 ETL query to public schema for all SRA...
  upload                          Upload SRA json to GCS

In order:

python -m omicidx_builder.cli sra download NCBI_SRA_Mirroring_20190801_Full
cd NCBI_SRA_Mirroring_20190801_Full
python -m omicidx_builder.cli sra parse-entity study
python -m omicidx_builder.cli sra parse-entity sample
python -m omicidx_builder.cli sra parse-entity experiment
python -m omicidx_builder.cli sra parse-entity run
cd ..
python -m omicidx_builder.cli sra upload NCBI_SRA_Mirroring_20190801_Full
python -m omicidx_builder.cli sra load-sra-data-to-bigquery
python -m omicidx_builder.cli sra sra-to-bigquery
python -m omicidx_builder.cli sra sra-bigquery-for-elasticsearch
python -m omicidx_builder.cli sra gcs-dump
python -m omicidx_builder.cli sra sra-gcs-to-elasticsearch
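
The SRA sequence above could be scripted rather than typed by hand. What follows is a minimal sketch, not part of omicidx-builder: `sra_commands` and `run_pipeline` are hypothetical helper names, and the subcommands and working directories are copied from the manual steps. It builds the same command lines and runs them in order, stopping at the first failure.

```python
import subprocess
import sys

# Entity types parsed in the manual steps, in the same order
ENTITIES = ["study", "sample", "experiment", "run"]
CLI = [sys.executable, "-m", "omicidx_builder.cli", "sra"]


def sra_commands(mirror_dir):
    """Return (argv, cwd) pairs mirroring the manual SRA steps, in order."""
    steps = [(CLI + ["download", mirror_dir], None)]
    # parse-entity is run from inside the mirror directory, as in the notes
    steps += [(CLI + ["parse-entity", e], mirror_dir) for e in ENTITIES]
    steps.append((CLI + ["upload", mirror_dir], None))
    for sub in ["load-sra-data-to-bigquery", "sra-to-bigquery",
                "sra-bigquery-for-elasticsearch", "gcs-dump",
                "sra-gcs-to-elasticsearch"]:
        steps.append((CLI + [sub], None))
    return steps


def run_pipeline(mirror_dir):
    """Run every step; check=True raises on the first non-zero exit code."""
    for argv, cwd in sra_commands(mirror_dir):
        print("running:", " ".join(argv[1:]))
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `run_pipeline("NCBI_SRA_Mirroring_20190801_Full")` would then reproduce the sequence for that mirror directory. Keeping the command list separate from the runner makes it possible to print or inspect the steps without executing anything.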

Biosample

Usage: cli.py biosample [OPTIONS] COMMAND [ARGS]...

  Use these commands to process biosample records.

Options:
  --help  Show this message and exit.

Commands:
  download              Download biosample xml file from NCBI
  etl-to-public         ETL process (copy) from etl schema to public
  gcs-dump              Write json.gz format of biosample to gcs
  gcs-to-elasticsearch
  load                  Load the gcs biosample.json file to bigquery
  parse                 Parse xml to json, output to stdout
  upload                Download biosample xml file from NCBI

In order:

  • download
  • parse
  • upload
  • load
  • etl-to-public
  • gcs-dump
  • gcs-to-elasticsearch
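
Because `parse` writes JSON to stdout, its output has to be redirected to a file before it can be uploaded. The sketch below shows one way to capture a command's stdout into a gzip file and read the records back. `run_to_gzip` and `iter_json_lines` are hypothetical helpers, not part of omicidx-builder, and the reader assumes one JSON record per line. The arguments `parse` expects are not shown in these notes, so check its `--help` output before building the command.

```python
import gzip
import json
import shutil
import subprocess


def run_to_gzip(argv, out_path):
    """Run argv and stream its stdout into a gzip-compressed file."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE) as proc, \
            gzip.open(out_path, "wb") as out:
        shutil.copyfileobj(proc.stdout, out)
    # Popen's context manager waits for the process, so returncode is set here
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, argv)


def iter_json_lines(path):
    """Yield one parsed record per non-empty line (assumed line-delimited JSON)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Streaming through `shutil.copyfileobj` avoids holding the whole output in memory, which matters for a source the size of Biosample.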

elasticsearch

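import elasticsearch_dsl
import omicidx_builder.elasticsearch_utils as es

# Search object bound to the client configured by omicidx_builder
searcher = elasticsearch_dsl.Search(using=es.get_client())

# Studies whose title matches "cancer" but whose description does not
s = (searcher.index("sra_study")
    .query("match", title="cancer")
    .exclude("match", description="cancer"))

response = s.execute()

for hit in response:
    print(hit.meta.score, hit.title)

# Aggregation results exist only if an aggregation was added to the search
# before execute(). None is defined above, so this loop is commented out.
# for tag in response.aggregations.per_tag.buckets:
#     print(tag.key, tag.max_lines.value)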
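
For readers who query Elasticsearch without elasticsearch_dsl, the search above corresponds to a bool query: match on the title, and exclude documents whose description matches. The dict below is a hand-written sketch of an equivalent request body, not a dump of what elasticsearch_dsl generates; the exact nesting the library emits may differ.

```python
import json

# Hand-written equivalent of:
#   .query("match", title="cancer").exclude("match", description="cancer")
body = {
    "query": {
        "bool": {
            # scored clause: title must match
            "must": [{"match": {"title": "cancer"}}],
            # unscored exclusion: description must not match
            "must_not": [{"match": {"description": "cancer"}}],
        }
    }
}

print(json.dumps(body, indent=2))
```

A body like this can be sent as a `_search` request against the `sra_study` index.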

Development

Running tests:

poetry run pytest --cov=omicidx_builder tests

Running long-running tests:

LONG_TESTS=1 poetry run pytest --cov=omicidx_builder tests
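
The `LONG_TESTS` variable suggests that slow tests are skipped unless it is set. How the project's test suite checks the variable is not shown here. The following is one common way to gate a test on an environment variable, using only the standard library (pytest also collects unittest-style tests). The helper, class, and test names are hypothetical.

```python
import os
import unittest


def long_tests_enabled(environ=os.environ):
    """True when LONG_TESTS is set to a non-empty value other than "0"."""
    return environ.get("LONG_TESTS", "0") not in ("", "0")


class DownloadTests(unittest.TestCase):
    @unittest.skipUnless(long_tests_enabled(), "set LONG_TESTS=1 to run")
    def test_full_download(self):
        # placeholder for a slow, network-bound test
        self.assertTrue(True)
```

With a gate like this, a plain `pytest` run reports the slow test as skipped, and the `LONG_TESTS=1` invocation includes it.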