Python interfaces for TA1 primitives
pip install primitive_interfaces==2018.1.5
A collection of standard Python interfaces for TA1 primitives. All primitives should extend one of the base classes available and optionally implement available mixins.
The DARPA Data Driven Discovery (D3M) Program is researching ways to get machines to build machine learning pipelines automatically. It is split into three layers: TA1 (primitives), TA2 (systems which combine primitives automatically into pipelines and execute them), and TA3 (end-user interfaces).
This package works with Python 3.6+.
You can install the latest stable version from PyPI:
$ pip install --process-dependency-links primitive_interfaces
To install the latest development version:
$ pip install --process-dependency-links git+https://gitlab.com/datadrivendiscovery/primitive-interfaces.git@devel
The --process-dependency-links argument is required for correct processing of dependencies.
See HISTORY.md for a summary of changes to this package.
The master branch contains the latest stable release of the package.
The devel branch is a staging branch for the next release.
Releases are tagged.
Standard TA1 primitive interfaces have been designed so that TA2 systems can call primitives automatically and combine them into pipelines.
Some design principles applied:
Interface classes, mixins, and methods are documented in detail through docstrings and typing annotations. Here we note some higher-level concepts which can help in understanding the basic ideas behind the interfaces and what they are trying to achieve: the big picture. This section is not normative.
A primitive should extend one of the base classes available and optionally mixins as well. Not all mixins apply to all primitives. That being said, you probably do not want to subclass PrimitiveBase directly, but instead one of the other base classes, to tell a caller more about what your primitive is doing. If your primitive belongs to a larger set of primitives that no existing non-PrimitiveBase base class suits well, consider suggesting that a new base class be created by opening an issue or making a merge request.
Base classes and mixins generally have four type arguments you have to provide: Inputs, Outputs, Params, and Hyperparams. One can see a primitive as parameterized by those four type arguments. You can access them at runtime through metadata:
FooBarPrimitive.metadata.query()['class_type_arguments']
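As a rough illustration of how a class can be parameterized by four type arguments and how those can be read back at runtime, here is a minimal sketch that uses only the standard typing module. PrimitiveSketch, FooBarSketch, and type_arguments are hypothetical stand-ins and not part of this package; the real interfaces expose this information through metadata as shown above. The sketch relies on typing.get_args and typing.get_origin, so it assumes Python 3.8 or newer.

```python
import typing

Inputs = typing.TypeVar('Inputs')
Outputs = typing.TypeVar('Outputs')
Params = typing.TypeVar('Params')
Hyperparams = typing.TypeVar('Hyperparams')


class PrimitiveSketch(typing.Generic[Inputs, Outputs, Params, Hyperparams]):
    """Hypothetical stand-in for a base class; not the package's PrimitiveBase."""

    @classmethod
    def type_arguments(cls) -> typing.Dict[str, type]:
        # Find the parameterized generic base the subclass was declared with.
        for base in getattr(cls, '__orig_bases__', ()):
            if typing.get_origin(base) is PrimitiveSketch:
                names = ('Inputs', 'Outputs', 'Params', 'Hyperparams')
                return dict(zip(names, typing.get_args(base)))
        raise TypeError('type arguments were not provided')


class FooBarSketch(PrimitiveSketch[list, list, dict, dict]):
    """Hypothetical primitive which fixes all four type arguments."""


type_args = FooBarSketch.type_arguments()
print(type_args['Inputs'])
```

The point of the sketch is only that the four type arguments are ordinary runtime data once a subclass provides them, which is what makes it possible for a caller to inspect them.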
Inputs should be set to the primary input type of a primitive. Primary, because you can define additional inputs your primitive might need, but we will go into these details later. Similarly for Outputs. The produce method then produces outputs from inputs. Other primitive methods help the primitive (and its produce method) achieve that, or help the runtime execute the primitive as a whole, or optimize its behavior.
Both Inputs and Outputs should be of a standard container data type.
We allow only a limited set of value types to be passed between primitives so that both TA2 and TA3 systems can implement introspection for those values if needed, or a user interface for them, etc. Moreover, this also allows us to ensure that they can be used efficiently with the Arrow/Plasma store.
Container values can then in turn contain values of an extended but still limited set of data types.
Those values being passed between primitives also hold metadata. Metadata is available on their metadata attribute. Metadata on values is stored in an instance of the DataMetadata class. This is a reason why we have our own versions of some standard container types: to have the metadata attribute.
All metadata is immutable, and updating a metadata object returns a new, updated copy. Metadata internally remembers the history of changes, but there is no API yet to access that. The idea is that you will be able to follow the whole history of changes to data in a pipeline through metadata. See the metadata API for more information on how to manipulate metadata.
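The update-returns-a-copy behavior can be pictured with a small self-contained sketch. MetadataSketch and its query and update methods are hypothetical and much simpler than the package's DataMetadata class (for instance, there are no selectors and no change history); the sketch only shows the immutability pattern.

```python
import copy
import types
import typing


class MetadataSketch:
    """Hypothetical immutable metadata container; not the package's DataMetadata."""

    def __init__(self, values: typing.Optional[typing.Mapping[str, typing.Any]] = None) -> None:
        # Keep a private deep copy behind a read-only top-level view.
        self._values = types.MappingProxyType(copy.deepcopy(dict(values or {})))

    def query(self) -> typing.Mapping[str, typing.Any]:
        return self._values

    def update(self, changes: typing.Mapping[str, typing.Any]) -> 'MetadataSketch':
        # Build a new object instead of modifying this one.
        merged = dict(self._values)
        merged.update(changes)
        return MetadataSketch(merged)


original = MetadataSketch({'schema': 'example'})
updated = original.update({'dimension': {'length': 3}})

print(original.query())
print(updated.query())
```

Because update never touches the object it is called on, a value that was handed to an earlier pipeline step keeps the metadata it had at that point.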
Primitives have a similar class, PrimitiveMetadata, which when created automatically analyses its primitive and populates parts of the metadata based on that. In this way, the author does not have to keep information in two places (metadata and code) but just in code, and the metadata is extracted from it when possible. Some metadata still has to be provided directly by the author of the primitive.
Currently most standard interface base classes have only one produce method, but the design allows for multiple: their names have to be prefixed with produce_, and they have to have similar arguments and the same semantics as all produce methods. The main motivation for this is that some primitives might be able to expose the same results in different ways. Having multiple produce methods allows the caller to pick which type of result they want.
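The naming convention for multiple produce methods can be sketched as follows. SummarySketch and list_produce_methods are hypothetical and do not use the package's base classes; in particular, the sketch returns plain lists and leaves out any other arguments that real produce methods take.

```python
import typing


class SummarySketch:
    """Hypothetical primitive-like class; not derived from the package's base classes."""

    def produce(self, *, inputs: typing.List[float]) -> typing.List[float]:
        # Primary result: inputs centered around their mean.
        mean = sum(inputs) / len(inputs)
        return [value - mean for value in inputs]

    def produce_mean(self, *, inputs: typing.List[float]) -> typing.List[float]:
        # Same arguments, a different view of the result.
        return [sum(inputs) / len(inputs)]


def list_produce_methods(primitive_class: type) -> typing.List[str]:
    # A caller can discover produce methods by the naming convention.
    return sorted(
        name for name in dir(primitive_class)
        if (name == 'produce' or name.startswith('produce_'))
        and callable(getattr(primitive_class, name))
    )


methods = list_produce_methods(SummarySketch)
centered = SummarySketch().produce(inputs=[1.0, 2.0, 3.0])
mean = SummarySketch().produce_mean(inputs=[1.0, 2.0, 3.0])
```

Since both methods take the same arguments, a caller that has already computed inputs can choose either one without further work.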
To keep primitives simple from the outside and to allow easier composition in pipelines, arguments are defined per primitive and not per method. The idea here is that once a caller satisfies (computes a value to be passed to) an argument, any method which requires that argument can be called on a primitive.
There are three types of arguments:
None or another default value)
Methods can accept additional pipeline and hyper-parameter arguments and not just those from the standard interfaces.
Because methods can accept additional arguments, and because structural types are in many cases not enough to know whether a value is good for a particular argument, there is a can_accept class method which primitives can override to give the caller feedback in advance, before the pipeline is run. The default implementation just checks structural typing information and passes the outputs' structural typing information on, but ideally this class method should check much more: shapes, dimensions, types of internal data structures, etc. For example, the fact that inputs are a numpy array does not help much in knowing which value to pass as inputs, because the numpy array structural type does not carry enough information. But metadata associated with values hopefully does.
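To make the idea of checking more than the structural type concrete, here is a simplified sketch. VectorScalerSketch, the keyword arguments of its can_accept, and the metadata keys (structural_type, shape, element_type) are all assumptions made for illustration; the package's actual can_accept signature and metadata schema may differ, and plain dicts stand in for metadata objects.

```python
import typing


class VectorScalerSketch:
    """Hypothetical primitive-like class with a simplified can_accept."""

    @classmethod
    def can_accept(
        cls, *, method_name: str, arguments: typing.Dict[str, typing.Dict[str, typing.Any]],
    ) -> typing.Optional[typing.Dict[str, typing.Any]]:
        # Each argument is described by a plain dict of metadata in this sketch.
        if method_name != 'produce' or 'inputs' not in arguments:
            return None

        inputs_metadata = arguments['inputs']
        # The structural type alone is not enough, so also check dimensions and element type.
        if inputs_metadata.get('structural_type') != 'ndarray':
            return None
        if len(inputs_metadata.get('shape', ())) != 1:
            return None
        if inputs_metadata.get('element_type') not in ('float', 'int'):
            return None

        # Describe the outputs so that the caller can check the next primitive in turn.
        return {'structural_type': 'ndarray', 'shape': inputs_metadata['shape'], 'element_type': 'float'}


accepted = VectorScalerSketch.can_accept(
    method_name='produce',
    arguments={'inputs': {'structural_type': 'ndarray', 'shape': (3,), 'element_type': 'int'}},
)
rejected = VectorScalerSketch.can_accept(
    method_name='produce',
    arguments={'inputs': {'structural_type': 'ndarray', 'shape': (3, 2), 'element_type': 'int'}},
)
```

Both calls describe inputs with the same structural type, yet only the one-dimensional array is accepted, which is the kind of distinction a caller cannot make from structural types alone.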
Produce methods and some other methods return results wrapped in CallResult. In this way primitives can expose information about an internal iterative or optimization process and allow the caller to decide how long to run.
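One way to picture a wrapped result is the sketch below, in which a caller keeps asking for more iterations until the primitive reports that it has finished or a budget runs out. CallResultSketch, its fields (value, has_finished, iterations_done), and the iterations argument are assumptions for illustration and not necessarily what the package's CallResult provides. SqrtSketch is a hypothetical iterative primitive based on Newton's method.

```python
import typing

T = typing.TypeVar('T')


class CallResultSketch(typing.Generic[T]):
    """Hypothetical result wrapper; field names are assumptions."""

    def __init__(self, value: T, has_finished: bool, iterations_done: int) -> None:
        self.value = value
        self.has_finished = has_finished
        self.iterations_done = iterations_done


class SqrtSketch:
    """Approximates a square root with Newton's method, a few iterations per call."""

    def __init__(self, target: float) -> None:
        self.target = target
        self.estimate = target if target > 1.0 else 1.0

    def produce(self, *, iterations: int) -> CallResultSketch[float]:
        done = 0
        finished = False
        for _ in range(iterations):
            updated = 0.5 * (self.estimate + self.target / self.estimate)
            done += 1
            # Stop once successive estimates no longer change meaningfully.
            finished = abs(updated - self.estimate) < 1e-12
            self.estimate = updated
            if finished:
                break
        return CallResultSketch(self.estimate, finished, done)


primitive = SqrtSketch(2.0)
total_iterations = 0
result = primitive.produce(iterations=2)
total_iterations += result.iterations_done
# The caller decides whether to keep going, here with a budget of 50 iterations.
while not result.has_finished and total_iterations < 50:
    result = primitive.produce(iterations=2)
    total_iterations += result.iterations_done
```

The primitive never decides on its own how long to run; it only reports progress, and the stopping policy stays with the caller.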
When calling a primitive, you can access its Hyperparams class like this:
hyperparams_class = FooBarPrimitive.metadata.query()['class_type_arguments']['Hyperparams']
You can now create an instance of the class by directly providing values for hyper-parameters, use available simple sampling, or just use default values:
hp1 = hyperparams_class({'threshold': 0.01})
hp2 = hyperparams_class.sample(random_state=42)
hp3 = hyperparams_class.defaults
You can then pass those instances as the hyperparams argument to the primitive's constructor.
The author of a primitive has to define what internal parameters the primitive has, if any, by extending the Params class. It is just a fancy dict, so you can both create an instance of it in the same way, and access its values:
class Params(params.Params):
    coefficients: numpy.ndarray
ps = Params({'coefficients': numpy.array([1, 2, 3])})
ps['coefficients']
The Hyperparams and Params classes have to be picklable and copyable so that instances of primitives can be serialized and restored as needed.
Primitives (and some other values) are uniquely identified by their ID and version. ID does not change through versions.
Primitives should not modify any input argument in place; they should always make a copy first before any modification.
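A minimal sketch of the copy-before-modify rule, using a hypothetical produce_sorted function on a plain list rather than one of the package's container types:

```python
import copy
import typing


def produce_sorted(inputs: typing.List[int]) -> typing.List[int]:
    # Hypothetical produce-like function: work on a copy, never on the caller's object.
    outputs = copy.copy(inputs)
    outputs.sort()
    return outputs


original_inputs = [3, 1, 2]
sorted_outputs = produce_sorted(original_inputs)
```

A shallow copy is enough here because the elements are immutable integers; inputs that hold mutable elements would need copy.deepcopy if those elements are modified as well.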
Examples of simple primitives using these interfaces can be found in this repository:
container.List, define and use Params and Hyperparams, and implement multiple methods needed by a supervised learner primitive
container.ndarray as inputs and outputs, how to set metadata for outputs, and how to extend can_accept to check if inputs have the right structure and types because container.ndarray does not expose those details through its structural type, but inputs' metadata provides it
random_seed, too.