Porcupine stands for Portable, Reusable & Customizable Pipeline. It is a tool aimed at data scientists and numerical analysts, so that they can express general data manipulation and analysis tasks,
- in a way that is agnostic to the source of the input data and to the destination of the end results,
- such that a pipeline can be re-executed in a different environment and on different data without recompiling, by just a shift in its configuration,
- while maintaining composability (any task can always be reused as a subtask of a greater task pipeline).
Porcupine provides three core abstractions: serials, tasks and resource trees.
A SerialsFor a b encompasses functions to write data of type a and read data of type b. Porcupine provides a few serials if your datatype already implements standard serialization interfaces, such as binary, and makes it easy to reuse custom serialization functions you might already have.
A SerialsFor is a profunctor. That means that once you know how to (de)serialize an A (i.e. if you have a SerialsFor A A), then you can obtain a SerialsFor B B with dimap, provided you know how to convert A to and from B. Handling only one-way serialization or deserialization is perfectly possible: that just means you will have a SerialsFor Void B or a SerialsFor A () and use only lmap or rmap. Any SerialsFor a b is also a monoid, meaning that you can, for instance, gather default serials, or serials from an external source, and add your custom serialization methods to them before using them in a task pipeline.
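To make the profunctor and monoid behaviour described above more concrete, here is a minimal sketch built only on base. It is a toy model, not Porcupine's actual SerialsFor: the Serials type, dimapS, writeWith, readWith, the Age type and the "txt" and "kv" format names are all hypothetical names introduced for illustration.

```haskell
-- Toy model for illustration; these are NOT Porcupine's actual definitions.
import Data.List (stripPrefix)
import Text.Read (readMaybe)

-- | Writers for values of type a and readers producing values of type b,
-- each keyed by a (hypothetical) format name.
data Serials a b = Serials
  { writers :: [(String, a -> String)]
  , readers :: [(String, String -> Maybe b)]
  }

-- | Profunctor-style mapping: pre-process what is written,
-- post-process what is read.
dimapS :: (a' -> a) -> (b -> b') -> Serials a b -> Serials a' b'
dimapS f g (Serials ws rs) =
  Serials [(fmt, w . f) | (fmt, w) <- ws]
          [(fmt, fmap g . r) | (fmt, r) <- rs]

-- | Combining serials concatenates the formats they support.
instance Semigroup (Serials a b) where
  Serials w1 r1 <> Serials w2 r2 = Serials (w1 ++ w2) (r1 ++ r2)

instance Monoid (Serials a b) where
  mempty = Serials [] []

-- | Serialize a value with the given format, if that format is known.
writeWith :: String -> Serials a b -> a -> Maybe String
writeWith fmt s x = fmap ($ x) (lookup fmt (writers s))

-- | Deserialize a value with the given format, if that format is known.
readWith :: String -> Serials a b -> String -> Maybe b
readWith fmt s raw = lookup fmt (readers s) >>= \r -> r raw

-- | A "known" serial for Int, based on Show and Read.
intTxt :: Serials Int Int
intTxt = Serials [("txt", show)] [("txt", readMaybe)]

newtype Age = Age Int deriving (Eq, Show)

-- | Reuse the Int serial for Age by converting to and from Int.
ageTxt :: Serials Age Age
ageTxt = dimapS (\(Age n) -> n) Age intTxt

-- | A custom serialization method for a second format.
ageKv :: Serials Age Age
ageKv = Serials
  [("kv", \(Age n) -> "age=" ++ show n)]
  [("kv", \raw -> Age <$> (stripPrefix "age=" raw >>= readMaybe))]

-- | Existing serials plus the custom one, gathered with (<>).
ageSerials :: Serials Age Age
ageSerials = ageTxt <> ageKv
```

Loaded in GHCi, writeWith "txt" ageSerials (Age 42) serializes through the reused Int serial, while readWith "kv" ageSerials "age=7" goes through the custom one. In the real library the same roles would presumably be played by dimap, lmap, rmap and the monoid operation on SerialsFor.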
The end goal of SerialsFor is that the user writing a task pipeline will not have to care about how the input data will be serialized, as long as the data they try to feed into the pipeline matches some known serialization function. Also, the introspectable nature of resource trees (more on that later) allows you to add serials to an existing pipeline before reusing it as part of your own pipeline. This makes Porcupine something of an "anti-ETL": rather than marshal and curate input data so that it matches the pipeline's expectations, you augment the pipeline so that it can deal with more data sources.
Every task in Porcupine exposes a resource tree. This is morally a hierarchy of VirtualFiles, which the end user of the task pipeline (the one who runs the executable) can bind to physical locations. However, this tree isn't created manually by the developer of the pipeline; it's completely hidden from them. The tree is made of atomic bits (constructed by the primitive tasks), which are composed when tasks are composed together to create the whole pipeline.
Once the user has their serials, they just need to create a VirtualFile (a thin layer over SerialsFor, which is also a profunctor). For instance, this is how you create a read-only resource that can only be mapped to a JSON file in the end:

    myInput :: VirtualFile Void MyType
      -- MyType must be an instance of FromJSON here
    myInput = dataSource
                ["Inputs", "MyInput"] -- A virtual path
                (somePureDeserial JSONSerial)

Then, using myInput in a task pipeline is just a matter of calling accessVirtualFile myInput, and the whole pipeline will expose a /Inputs/MyInput virtual file. accessVirtualFile just turns a VirtualFile a b into a PTask a b.
A PTask is an arrow. Atomic PTasks can access resources, as we saw, or perform computations, as any pure function can be lifted. Each PTask exposes its requirements (in the form of a resource tree) and a function that will actually execute the task when the pipeline runs. PTasks compose much like functions do, and they merge their requirements as they compose.
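The idea that tasks compose like functions while merging their requirements can be sketched with a toy arrow built only on base. This is an illustrative model under simplifying assumptions, not Porcupine's PTask: the Task type, readFrom, writeTo and the "/Outputs/Report" path are hypothetical, requirements are a flat list of paths rather than a tree, and no file is actually read or written.

```haskell
-- Toy model for illustration; this is NOT Porcupine's PTask.
import Prelude hiding (id, (.))
import Control.Category (Category (..), (>>>))
import Control.Arrow (Arrow (..))

-- | A task: the virtual paths it requires, plus the function that runs it.
data Task a b = Task
  { requirements :: [String]
  , runTask      :: a -> IO b
  }

instance Category Task where
  id = Task [] pure
  -- Composition chains the functions and merges the requirements.
  Task r2 g . Task r1 f = Task (r1 ++ r2) (\x -> f x >>= g)

instance Arrow Task where
  -- Any pure function can be lifted; it requires no resource.
  arr f = Task [] (\x -> pure (f x))
  first (Task r f) = Task r (\(x, y) -> do
    x' <- f x
    pure (x', y))

-- | Hypothetical stand-in for a resource read: it records the path it
-- needs and returns canned contents instead of touching the disk.
readFrom :: String -> String -> Task () String
readFrom path cannedContents = Task [path] (\() -> pure cannedContents)

-- | Hypothetical stand-in for a resource write: it records the path it
-- needs and discards its input.
writeTo :: String -> Task String ()
writeTo path = Task [path] (\_ -> pure ())

-- | A sub-pipeline, reusable as part of a bigger one.
summarize :: Task () Int
summarize = readFrom "/Inputs/MyInput" "hello" >>> arr length

-- | The whole pipeline, built by composing tasks.
pipeline :: Task () ()
pipeline = summarize >>> arr show >>> writeTo "/Outputs/Report"
```

The point of the sketch is that requirements pipeline can be inspected without running anything: it lists both paths, collected from the atomic tasks during composition. That is the property that lets a real pipeline expose its resource tree before execution.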
Once you have e.g. a mainTask :: PTask () () that corresponds to your whole pipeline, your application just needs to call:

    main :: IO ()
    main = runPipelineTask "myApp" cfg mainTask ()
      where
        cfg = FullConfig "pipeline-config.yaml" "./default-root-dir"

Running a Porcupine application
Once you have built your executable, what you will usually want is to expose its configuration. Just run it with:
$ my-exe write-config-template
Provided your main looks like the one we presented previously, that will generate a pipeline-config.yaml file in the current directory. In this file, you will see all the virtual files accessed by your pipeline and all the options^[Options are just VirtualFiles, but they are created with the getOptions primitive task, and their values can be embedded directly in the configuration file.] exposed by it. You can see that by default, the root (/) of the location tree is mapped to ./default-root-dir. If you leave it as it is, then every VirtualFile will be looked for under that directory, but you can alter that on a per-file basis.
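The mapping rule described above (a default root, with optional per-file overrides) can be sketched as a small lookup. This is a toy model under stated assumptions, not Porcupine's actual resolution logic or configuration format: the Config type, the resolve function and the /mnt/data/my-input.json path are hypothetical, and details such as file extensions are ignored.

```haskell
-- Toy model for illustration; NOT Porcupine's actual resolution logic.
import Data.List (intercalate)

-- | A virtual path such as ["Inputs", "MyInput"].
type VirtualPath = [String]

-- | A simplified configuration: where the root is mapped, plus
-- per-file overrides.
data Config = Config
  { rootDir   :: FilePath
  , overrides :: [(VirtualPath, FilePath)]
  }

-- | Use the override when there is one; otherwise look under the root.
resolve :: Config -> VirtualPath -> FilePath
resolve cfg vpath =
  case lookup vpath (overrides cfg) of
    Just physical -> physical
    Nothing       -> rootDir cfg ++ "/" ++ intercalate "/" vpath

-- | The default mapping: everything lives under the root directory.
defaultConfig :: Config
defaultConfig = Config "./default-root-dir" []

-- | The same mapping with one virtual file redirected elsewhere.
tweakedConfig :: Config
tweakedConfig = defaultConfig
  { overrides = [(["Inputs", "MyInput"], "/mnt/data/my-input.json")] }
```

With defaultConfig, resolve places ["Inputs", "MyInput"] under ./default-root-dir; with tweakedConfig, only that one file moves and every other virtual file keeps following the root.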
Once you're done tweaking the configuration, just call:
$ my-exe run
and the pipeline will run, logging its accesses along the way. The run is optional: it's the default subcommand. Any option you defined inside your pipeline is also exposed on the CLI, and shown by my-exe --help. Specifying it on the CLI overrides the value set in the YAML config file.
Philosophy of use
Porcupine's intent is to make it easy to cleanly separate the work between three people:
- The storage developer will be in charge of determining how the data gets read and written in the end. They will target the serials framework, and propose new datatypes (data frames, matrices, vectors, trees, etc.) and ways to write and read them to the various storage technologies.
- The scientist will determine how to carry out the data analyses, extract some sense out of the data, run simulations based on that data, etc. They don't have to know how the data is represented, just that it exists. They just reuse the serials written by the storage developer and target the tasks framework.
- The software architect's work will start once we need to bump things up a bit. Once we have iterated several times over our analyses and simulations and want to have things running at a bigger scale, it's time for the pipeline to move from the scientist's puny laptop and go bigger. This is the time to "patch" the pipeline and make it run in different contexts: in the cloud, behind a scheduler, or as jobs in a task queue reading its inputs from all kinds of databases. The software architect will target the resource tree framework (possibly without ever recompiling the pipeline, only by adjusting its configuration from the outside).
Of course, these people can be the same person, and you don't need to plan on running anything in the cloud to start benefiting from Porcupine. But we want to support workflows where these three are distinct people, each with their own set of skills.
Aside from the general usage presented previously, Porcupine offers several features to facilitate iterative development of pipelines and reusability of tasks.
[TO BE FILLED]
Options and embeddable data
Repeated tasks and VirtualFiles
Mapping S3 objects
Porcupine uses katip to do logging. It's quite a versatile tool, and we benefit from it. By default, logging uses a custom readable format. You can switch to more standard formats using:
$ my-exe --log-format json