id3c

Data logistics for the Seattle Flu Study and Flu@Home, enabling real-time genomic epidemiology studies


Install
pip install id3c==0.0.1

Documentation

ID3C: Infectious Disease Data Distribution Center

Data logistics for the Seattle Flu Study and Flu@Home, enabling real-time genomic epidemiology studies.

Navigation

Database

PostgreSQL 10 on AWS, soon to be Azure.

Eventually upgrading to PostgreSQL 11 when it becomes available on our cloud provider.

Initially aims to provide:

  • Access via SQL and REST APIs, with PostgREST and/or ES2018 web app to maybe come later

  • De-identified metadata for participants (age, sex, address token, etc.) and samples (tissue, date, location, etc.)

  • Sample diagnostic results (positive/negative for influenza, RSV, and more)

  • Sequencing read sets and genome assemblies stored in the cloud and referenced via URLs in database

  • Rich data types (key/value, JSON, geospatial, etc)

  • Strong data integrity and validation controls

  • Role-based authentication and restricted data fields using row and column-level access control

  • Encrypted at rest and TLS-only connections

  • Administration via standard command-line tools (and maybe later pgAdmin4)

Design

The database is designed as a distribution center which receives data from external providers, repackages and stores it in a data warehouse, and ships data back out of the warehouse via views, web APIs, and other means. Each of these three conceptual areas are organized into their own PostgreSQL schemas within a single database.

The "receiving" area contains tables to accept minimally- or un-controlled data from external providers. The general expectation is that most tables here are logs (in the journaling sense) and will be processed later in sequential order. For example, participant enrollment documents from our consent and questionnaire app partner, Audere, are stored here when received by our web API.

The "warehouse" area contains a hybrid relational + document model utilizing standard relational tables that each have a JSON column for additional details. Data enters the warehouse primarily through extract-transform-load (ETL) routines which process received data and copy it into suitable warehouse rows and documents. These ETL routines are run via bin/id3c etl subcommands, where they're defined in Python (though lean heavily on Pg itself).

The "shipping" area contains views of the warehouse designed with specific data consumers and purposes in mind, such as the incidence modeling team.

While the receiving and shipping areas are expected to be fairly fluid and reactive to new and changing external requirements, the warehouse area is expected to change at a somewhat slower pace informed by longer-term vision for it.

Guidelines

General principles to follow when developing the schema.

  • Do the ~simplest thing that meets our immediate needs. Aim for ease of modification in the future rather than trying to guess future needs in advance.

    It can be really hard to stick to this principle, but it turns out that the best way to make something flexible for future needs is to make it as simple as possible now so it can be modified later.

  • Columns should be maximally typed and constrained, unless there exists a concrete use case for something less.

  • Consider if a column should ever be unknown (null).

  • Consider if a column should have a default.

  • Consider what constraints make sense at both a column- and table-level. Would a CHECK constraint be useful to express domain logic?

  • Write a description (comment) for all schemas, tables, columns, etc.

  • Grant only the minimal privileges necessary to the read-only and read-write roles. For example, if the read-write role isn't expected to UPDATE existing records, then only grant it INSERT.

  • Consider expected data access patterns and create indexes to match.

Integration with other data systems

Although we're building our own data system, we want to design and create it with interoperability in mind. To this extent, our system should adopt or parallel practices and terminology from other systems when appropriate. For example:

  • Nouns (tables, columns, etc) in our system should consider adopting the equivalent terms used by FHIR R4 resources.

    This will aid with producing FHIR documents in the future and provides a consistent terminology on which to discuss concepts more broadly than our group. FHIR is a large specification and there is a lot to digest; it's easy to be daunted or confused by it, so please don't hesitate to ask questions.

  • Value vocabulary (specimen types, organism names, diagnostic tests, etc) should consider using and/or referencing the preferred terms from an appropriate ontology like SNOMED CT, LOINC, or GenEpiO.

Deploying

The database schema is deployed using Sqitch, a database change management tool that really shines. You can install it a large number of ways, so pick the one that makes most sense to you.

Development

For development, you'll need a PostgreSQL server and superuser credentials for it. The following commands assume the database server is running locally and your local user account maps directly to a database superuser.

Create a database named seattleflu using the standard Pg tools. (You can use another name if you want, maybe to have different dev instances, but you'll need to adjust the sqitch target you deploy to.)

createdb --encoding=UTF-8 seattleflu

Then use sqitch to deploy to it. (dev is a sqitch target configured in sqitch.conf which points to a local database named seattleflu.)

sqitch deploy dev

Now you can connect to it for interactive use with:

psql seattleflu

Testing and production

Our testing and production databases are configured as testing and production sqitch targets. When running sqitch against these targets, you'll need to provide a username via PGUSER and a password via an entry in ~/.pgpass.

Web API

Python 3 + Flask

  • Consumes and stores enrollment documents from the Audere backend systems

Config

  • Database connection details are set entirely using the standard libpq environment variables, such as PGHOST and PGDATABASE. You may provide these when starting the API server.

    User authentication is performed against the database for each request, so you do not (and should not) provide a username and password when starting the API server.

  • The maximum accepted Content-Length defaults to 20MB. You can override this by setting the environment variable FLASK_MAX_CONTENT_LENGTH.

  • The LOG_LEVEL environment variable controls the level of terminal output. Levels are strings: debug, info, warning, error.

Starting the server

The commands pipenv run python -m seattleflu.api or pipenv run flask run will run the application's development server. To provide database connection details while starting the development server, run the command PGDATABASE=DB_NAME pipenv run flask run, substituting DB_NAME with the name of your database.

For production, a standard api.wsgi file is provided which can be used by any web server with WSGI support.

Examples

User authentication must be provided when making POST requests to the API. For example, you can run the following curl command to send JSON data named enrollments.json to the /enrollment endpoint on a local development server:

curl http://localhost:5000/enrollment \
  --header "Content-Type: application/json" \
  --data-binary @enrollments.json \
  --user USERNAME

Substitute your own local database username for USERNAME. This will prompt you for a password; you can also specify it directly by using --user "USERNAME:PASSWORD", though be aware it will be saved to your shell history.

CLI

Python 3 + click

Interact with the database on the command-line in your shell to:

  • Mint identifiers and barcodes

  • Run ETL routines, e.g. enrollments, to process received data into the warehouse

  • Parse, diff, and upload sample manifests.

  • Preprocess clinical data and upload it into receiving.

  • Send Slack notifications from the Reportable Conditions Notifications Slack App

The bin/id3c command is the entry point. It must be run within the project environment, for example by using pipenv run bin/id3c.

The LOG_LEVEL environment variable controls the level of terminal output. Levels are strings: debug, info, warning, error.

Setup

Dependencies

Python dependencies are managed using Pipenv.

Install all the (locked, known-good) dependencies by running:

pipenv sync

Add new dependencies to Pipfile, run:

pipenv install <name>

and then commit the changes to Pipfile and Pipfile.lock.

Connection details

Details for connecting to the ID3C database are by convention controlled entirely by the standard libpq environment variables, service definitions, and password files.

For example, if you want to list the identifier sets available in the Seattle Flu Study testing database, you could create the following files:

~/.pg_service.conf

[seattleflu-testing]
host=testing.db.seattleflu.org
user=your_username
dbname=testing

~/.pgpass

testing.db.seattleflu.org:5432:*:your_username:your_password

Make sure the ~/.pgpass file is only readable by you since it contains your password:

chmod u=rw,og= ~/.pgpass

and then run:

PGSERVICE=seattleflu-testing pipenv run bin/id3c identifier set ls

These files will also allow you to connect using psql:

psql service=seattleflu-testing