Harvest-Vocab
Harvest-Vocab provides a set of abstract Django models for making it trivial to query hierarchical data. Most database systems do not have operators supporting logical conjunction across multiple rows. The IN
operator is disjunctive since only one of operands must be true to satisfy the truth. Harvest-Vocab provides support for constructing conjunctive queries (and the negation of that) as well as exclusive conjunctive queries. The query generation are exposed using Django manager methods.
In addition, hooks are for integrating with Harvest applications including an Avocado Translator which support defines Avocado operators that map to the manager methods for constructing these kinds of queries. This can be used with the Harvest-Vocab Client which integrates with the Cilantro, the official client of Harvest.
Install
pip install harvest-vocab
Problem & Example
A common vocabulary used in healthcare for billing purposes are ICD9 codes. These codes are hierarchical with each level being more specific than the previous. The standard way to store hierarchical data in a relational database is having a self-relationship to the parent (the Patient
model is also defined here for the example).
class Diagnosis(models.Model):
code = models.CharField(max_length=10)
description = models.CharField(max_length=50)
parent = models.ForeignKey('self', related_name='children', null=True)
class Patient(models.Model):
diagnoses = models.ManyToManyField(Diagnosis)
# other fields...
This is perfectly suitable for storing the data, but it falls over when performing queries. An example query could be "find all the patients who have a diagnosis of ICD9 367 (Disorders of refraction and accommodation)". Performing this query is simple.
Patient.objects.filter(diagnoses__code='367')
However, this will only find patients who have this exact diagnosis. The problem is that Disorders of refraction and accommodation is a very general diagnosis (it is more of a category) and it is has have two levels of codes underneath it. Any patient that has a more specific diagnosis under 367 (such as code 367.1 for Myopia, i.e. near-sightedness) will be left out of the results. This naive behavior is generally never desired and users expect the descedents of a code to be queried as well. When the depth of the hierarchy is unknown or arbitrary, this kind of query becomes difficult to write (using the Django ORM or raw SQL).
Solution & Setup
The solution harvest-vocab takes to enable querying arbitrary depth hierarchies is to utilize a flat index. That is, an item will have an association to each and every ancestor up to the root. Start by defining the model for the item, the through model for the self relationship and the index.
from vocab.models import AbstractItem, AbstractItemIndex
# Subclass the abstract item model
class Diagnosis(AbstractItem):
code = models.CharField(max_length=10)
description = models.CharField(max_length=50)
parent = models.ForeignKey('self', null=True)
# Define an index of item/parent fields
class DiagnosisIndex(AbstractItemIndex):
item = models.ForeignKey(Diagnosis, related_name='item_indexes')
parent = models.ForeignKey(Diagnosis, related_name='parent_indexes')
# Add many-to-many field to associated model
class Patient(models.Model):
diagnoses = models.ManyToManyField(Diagnosis, through='PatientDiagnosis')
# other fields...
# Through table between patient and diagnosis, the custom manager provides
# methods for working the custom operators
class PatientDiagnosis(models.Model):
diagnosis = models.ForeignKey(Diagnosis, null=True)
patient = models.ForeignKey(Patient, null=True)
objects = ItemThroughManager('diagnosis', 'patient')
After the tables are created in the database, we can build an index by calling:
DiagnosisIndex.objects.index()
So now, the same question stated above can be answered this way (using the same condition from above):
diagnosis = Diagnosis.objects.filter(code='367')
# Removes patient ids to be used as a subquery
subquery = PatientDiagnosis.objects.requires_any(diagnosis)
diagnoses = Diagnosis.objects.filter(id__in=subquery)
The above is a bit verbose and the plan is for abstraction to be a bit more transparent.
This utilizes the index and returns all diagnoses that match the condition explictly or any of the descedents of the diagnosis of interest.
Manager Methods
harvest-vocab defines five methods to support hierarchy-based queries and exposes them via the ItemThroughManager
class.
-
requires_any(values)
- Corresponds to theIN
clause (defined for completeness) -
excludes_any(values)
- Corresponds to theNOT IN
clause -
requires_all(values)
- Requires all values to match -
excludes_all(values)
- Requires all vlaues to not match -
only(values)
- Matches if the object only contains the specified values
Harvest Integration
Harvest-Vocab comes bundled with a custom Avocado translator which exposes custom operators corresponding to the above manager methods. The translator must be subclassed and the through_model
class attribute must be set:
from avocado.query.translators import registry
from vocab.translators import VocabularyTranslator
from myapp.models import PatientDiagnosis
class DiagnosisTranslator(VocabularyTranslator):
through_model = PatientDiagnosis
registry.register(DiagnosisTranslator)
To support the harvest-vocab-client, endpoints must be defined for the client components to query. Include the vocab.urls
in the ROOT_URLCONF
patterns:
from django.conf.urls import url, reverse, patterns
urlpatterns = patterns('',
# Other url patterns...
url(r'^vocab/', include('vocab.urls')),
)
In addition, define the VOCAB_FIELDS
setting which is a list/tuple of Avocado field IDs that are supported.
Implementation
The custom operators are implemented using SQL CASE
statements. An example output for an requires all query would look something like this. Although this query may look daunting, the important bits are only the summed CASE
statements combined with the WHERE
condition for those expressions, e.g. sc1
and sc5
. To handle the hierarchical nature of the data (in this case ICD9 codes, the index table is being used which enables matching against the item itself (via "item_id" = 1
) or a descendent of the item (via "parent_id" = 1
).
SELECT DISTINCT "core_person"."id",
FROM "core_person",
"core_subject"
WHERE "core_person"."id" = "core_subject"."person_id"
AND "core_subject"."id" IN (
SELECT "patient_id"
FROM (
SELECT "patient_id",
SUM(
CASE
WHEN "core_diagnosisindex"."item_id" = 1 THEN 1
WHEN "core_diagnosisindex"."parent_id" = 1 THEN 1
ELSE 0
END
) AS "sc1",
SUM(
CASE
WHEN "core_diagnosisindex"."item_id" = 5 THEN 1
WHEN "core_diagnosisindex"."parent_id" = 5 THEN 1
ELSE 0
END
) AS "sc5",
FROM "core_patientdiagnosis"
INNER JOIN "core_diagnosisindex" ON ("core_patientdiagnosis"."diagnosis_id" = "core_diagnosisindex"."item_id" )
GROUP BY "patient_id"
) AS T
WHERE "sc1" > 0
AND "sc5" > 0
)
The difference between operators are simply the whether the condition is negated and how many are required to match.
- requires all - An item must match at least once for all items
-
requires any - An item must match at least once for any items (equivalent to the
IN
clause) - excludes all - An item must not match for all items
-
excludes any - An item must not match for any items (equivalent to the
NOT IN
clause) - only - An item must match only once for all items and nothing else
Resources
Harvest-vocab come with two custom resource classes intended to override the default Serrano values resource for fields. The primary resource exposes a superset of Serrano's FieldValuesResource
to ensure compatibility. The superset exposes _links
and id
properties. The _links
object enables it to be crawled and used for descendending in the hierarchy if the children
entry is present (it is not below). Below is an example representation:
{
"_links": {
"parent": {
"href": "http://localhost:8000/api/fields/2209/values/"
},
"self": {
"href": "http://localhost:8000/api/fields/2209/values/190/"
}
},
"id": 190,
"label": "L-LOOP CORRECTED TRANSPOSITION {SL?}",
"value": 190
}