Manage statuses for a large amount of data analysis resources, such as files, imports, etc.


License
MIT
Install
pip install hadrosaur==0.4.1

Documentation

hadrosaur — computed resource management

logo

Hadrosaur makes it easy to track the completion status, errors, and logs of large amounts of resources (files, metadata, analytics, database imports, etc.).

You simply define your resource as a decorated Python function that can create files and save metadata using an identifier in a certain namespace. Later on, you can quickly fetch the status and results of previously computed resources.

This library uses a combination of LevelDB and the file system to track the state of your tasks.

Quick usage tutorial

Install

pip install hadrosaur

Define a resource collection

Import the lib and initialize a project using a base directory. Files, metadata, and logs will all get stored under this directory.

from hadrosaur import Project

proj = Project('./base_directory')

Define a collection using a decorator around a function. This function's job is to generate a single resource for the collection given a unique ID and some arguments.

The collection should have a unique name, and its function must take these params:

  • ident — an identifier (unique across the collection) for each computed resource
  • args — a dictionary of optional arguments
  • ctx — a Context object which holds some extra data you may find useful during computation:
    • ctx.subdir - the path of a directory in which you can store files for this resource
    • ctx.logger - a special Python logging instance that will write to a rotating log file stored in the resource directory, with some nice default formatting
@proj.resource('collection_name')
def compute_resource(ident, args, ctx):
  ctx.logger.info("Starting up")
  # Run some things...
  # Maybe save stuff into ctx.subdir...
  time.sleep(1)
  # Return any JSON-serializable data for the resource, such as metadata, run results, filepaths, etc.
  return {'ts': time.time()}

Fetch a resource

Use the proj.fetch(collection_name, ident) method to compute and cache resources in a collection.

Keyword arguments:

  • args -- an optional dict of extra arguments for the resource compute function
  • recompute -- force the resource to be re-computed, even if it has already been computed

What happens when you fetch a resource:

  • If the resource has not yet been computed, the collection's compute function will be run.
  • If the resource was already computed in the past, then the saved results will get returned instantly (unless recompute=True has been set in the keyword arguments).
  • If an error is thrown in the function, logs will be saved and the status will be updated
>> proj.fetch('collection_name', 'uniq_ident123', optional_args)
<Resource>

The resource object has the following properties:

  • resource.result: any JSON-serializable data returned by the resource's compute function
  • resource.start_time: The unix epoch (in milliseconds) of when the resource started being computed
  • eresource.end_time: the unix epoch (in ms) of when the resource finished computing (or failed)
  • resource.status: whether the resource has been computed already ("completed"), is currently being computed ("pending"), has not yet been fetched at all ("unavailable"), or threw a Python error while running the function ("error")
  • resource.paths: A dictionary of all the filesystem paths associated with your resource, with the following keys:
    • 'base': The base directory that holds all data for the resource
    • 'error': A Python stacktrace of any error that occured while running the resource's function
    • 'log': A line-by-line log file produced by the resource's logger (ctx.logger)
    • 'status': Path to the current status ("unavailable", "completed", "pending", "error")
    • 'result': Path to a JSON file of serializable data returned by the resource's function
    • 'storage': Directory path of additional files written by the resource's function (ctx.subdir)

Fetch status and information

Fetch stats for a collection

To see status counts for a whole collection, use proj.stats('collection_name'):

> proj.stats('collection_name')
{
  'counts': {
      'total': 100,
      'pending': 75,
      'completed': 20,
      'error': 5,
      'unavailable': 0
  }
}

Use proj.stats() without an argument to fetch the stats for all collections.

To get a list of resource IDs for a given status, use proj.fetch_by_status:

> proj.fetch_by_status('collection_name', 'pending')
['1', '2', '3'..]

Fetch info about a single resource

Use proj.status('collection_name', 'resource_id') to see the status of a particular resource.

> proj.status('collection_name', 'resource_id')
"complete"

If an exception was raised during the execution of the function used to compute a resource, then use proj.fetch_error to see the error.

> proj.fetch_error('collection_name', 'resource_id')
"""Traceback (most recent call last):
  File "/home/j/code/hadrosaur/hadrosaur/main.py", line 211, in fetch
    result = func(ident, args, ctx)
  File "/home/j/code/hadrosaur/test/test_general.py", line 26, in throw_something
    raise RuntimeError('This is an error!')
RuntimeError: This is an error!"""

To see the run log (produced by ctx.logger during function execution), then use proj.fetch_log

> proj.fetch_log('collection_name', 'resource_id')
"""
2020-02-05 16:15:35 INFO     output here (test_general.py:25)
2020-02-05 16:15:35 INFO     more output here (test_general.py:25)
"""