What is this?
Mario is a metadata processing pipeline that ingests data from various sources and writes it to Elasticsearch.
The mario command can be installed with:
$ make install
How to Use This
An Elasticsearch instance can be started for development purposes by running:
$ docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" \
    docker.elastic.co/elasticsearch/elasticsearch:6.8.4
Alternatively, if you intend to test this with a local instance of TIMDEX as well, use docker-compose to run both Elasticsearch and TIMDEX locally, following the instructions in the TIMDEX README.
Here are a few sample Mario commands that may be useful for local development:
- mario ingest -c json -s aspace fixtures/aspace_samples.xml: runs the ingest process with the ASpace sample files and prints each record as JSON.
- mario ingest -s dspace fixtures/dspace_samples.xml: ingests the DSpace sample files into a local Elasticsearch instance.
- mario ingest -s alma --auto fixtures/alma_samples.mrc: ingests the Alma sample files into a local Elasticsearch instance and promotes the index to the timdex-prod alias on completion.
- mario indexes: lists all indexes.
- mario promote -i [index name]: promotes the named index to the timdex-prod alias.
This project uses Go modules for dependency management. To upgrade all dependencies to the latest minor/patch versions, use:
$ make update
Tests can be run with:
$ make test
Adding a new source parser
To add a new source parser:
- (Probably) create a source record struct in pkg/generator.
- Add a source parser module in pkg/generator (a rough sketch follows this list).
- Add a tests file that tests ALL fields mapped from the source.
- Update pkg/ingester/ingester.go to add a Config.source that uses the new generator.
- Update the documentation to include the new generator parameter option (the "type") in the command options.
- (Probably) no updates to the CLI are needed.
- After all of that is completed, tested, and merged, create tasks to harvest the source metadata files and ingest them using our Airflow implementation.
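Since the steps above reference Go code in pkg/generator and pkg/ingester, a rough sketch may help orient a first-time contributor. Everything below is hypothetical (the exampleRecord struct, the ExampleParser function, and the XML element names are invented for illustration); use the existing parsers in pkg/generator as the authoritative templates, and remember that parsed records must ultimately be mapped onto the struct in pkg/record/record.go.

```go
// A minimal, hypothetical sketch of a new source parser in pkg/generator.
package generator

import (
	"encoding/xml"
	"io"
)

// exampleRecord models the fields of the new source's XML metadata.
// (Hypothetical; mirror the structure of the actual source format.)
type exampleRecord struct {
	XMLName xml.Name `xml:"record"`
	ID      string   `xml:"identifier"`
	Title   string   `xml:"title"`
}

// ExampleParser streams <record> elements from r and decodes each one.
// A real implementation would map each exampleRecord onto the shared
// record struct defined in pkg/record/record.go.
func ExampleParser(r io.Reader) ([]exampleRecord, error) {
	var records []exampleRecord
	decoder := xml.NewDecoder(r)
	for {
		token, err := decoder.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if start, ok := token.(xml.StartElement); ok && start.Name.Local == "record" {
			var rec exampleRecord
			if err := decoder.DecodeElement(&rec, &start); err != nil {
				return nil, err
			}
			records = append(records, rec)
		}
	}
	return records, nil
}
```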
Updating the data model
Updating the data model is somewhat complicated because many files need to be edited across multiple repositories and deployment steps should happen in a particular order so as not to break production services. Start by updating the data model here in Mario as follows:
- Update config/es_record_mappings.json to reflect added/updated/deleted fields.
- Update pkg/record/record.go to reflect added/updated/deleted fields (see the sketch after this list).
- Update ALL relevant source record definitions and source parser files in pkg/generator. If a field is edited or deleted, be sure to check every source file for usage. If a field is new, add it to all relevant sources (confirm the mapping with metadata folks first).
- Update the relevant tests in pkg/generator.
- Once the above steps are done, update the data model in TIMDEX following the instructions in the TIMDEX README and test locally with the docker-compose orchestrated environment to ensure all changes are properly indexed and consumable via the API.
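To make the first two steps concrete, here is a hedged sketch of what adding a field to the data model might look like. The Record type shown is a trimmed-down stand-in and the Languages field is a hypothetical addition; the real struct in pkg/record/record.go is larger, and its fields must be kept in sync with the mappings in config/es_record_mappings.json.

```go
// A trimmed-down, hypothetical stand-in for the real record struct in
// pkg/record/record.go.
package record

// Record is a sketch of the shared data model.
type Record struct {
	Identifier string `json:"identifier"`
	Title      string `json:"title"`

	// Hypothetical new field added to the data model; the matching
	// Elasticsearch mapping must be added to
	// config/es_record_mappings.json in the same change.
	Languages []string `json:"languages,omitempty"`
}
```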
We have several config files that are essential for mapping various metadata
field codes to their human-readable translations, and some of them may need to
be updated from time to time. Most of these config files are pulled from
authoritative sources, with the exception of marc_rules.json, which we created
and manage ourselves. Sources of the other config files are as follows:
- dspace_set_list.json: harvested from our DSpace repository using our OAI-PMH harvester app. The app includes a flag to convert the standard XML response to JSON, which makes it easier to parse.
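For illustration, a config file like dspace_set_list.json might be consumed along these lines. The LoadSetList function name and the generic map type are assumptions; the real value type depends on the harvester's actual JSON output.

```go
// A hedged sketch of loading a JSON config file at startup.
package config

import (
	"encoding/json"
	"os"
)

// LoadSetList reads a JSON config file (e.g. dspace_set_list.json) and
// decodes it into a generic map so field codes can be translated to
// human-readable names.
func LoadSetList(path string) (map[string]interface{}, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var sets map[string]interface{}
	if err := json.Unmarshal(data, &sets); err != nil {
		return nil, err
	}
	return sets, nil
}
```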
Architecture Decision Records
This repository contains Architecture Decision Records in the docs/architecture-decisions directory.
The adr-tools package can be used to create additional records from a standardized template.