An extensible, concise and lightweight DSL on Rake to automate data processing tasks


License
MIT
Install
gem install raka -v 0.3.10

Raka is a DSL(Domain Specific Language) on top of Rake for defining rules and running data processing workflows. Raka is specifically designed for data processing with improved pattern matching, scopes, language extensions and lots of conventions to prevent verbosity.

Installation

Raka is a library based on rake. Though rake is cross-platform, raka may not work on Windows since it relies on some shell facilities. Ruby is available for most *nix systems including macOS, so the only task is to install raka:

gem install raka

Quick Start

First, create a file named main.raka, then import and initialize the DSL:

require 'raka'

dsl = Raka.new(self,
  output_types: [:txt],
  input_types: [:txt]
)

Then the code below will define two simple rules:

txt._.first50 = shell* "cat $< | head -n 50 > $@"
txt.sort = [txt.input] | shell* "cat $(dep0) | sort -rn > $@"

For testing let's prepare an input file named input.txt:

seq 1000 > input.txt

Invoke:

raka first50__sort.txt

Raka will read data from input.txt, sort the numbers descendingly and copy the first 50 lines to first50__sort.txt.

The workflow here is as follows:

  1. Try to find first50__sort.txt: it does not exist.
  2. The rule with target txt._.first50 matches, with _ bound to sort.
  3. Look for the input file sort.txt: it does not exist.
  4. The rule with target txt.sort matches.
  5. This rule has no input but depends on the target txt.input.
  6. The file input.txt exists, so it is used.
  7. Run rule txt.sort to create sort.txt.
  8. Run rule txt._.first50 to create first50__sort.txt.
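For clarity, the effect of these steps can be reproduced with plain shell commands (a sketch of what the two rules execute, not raka itself):

```shell
seq 1000 > input.txt                            # prepare the input
cat input.txt | sort -rn > sort.txt             # what rule txt.sort runs
cat sort.txt | head -n 50 > first50__sort.txt   # what rule txt._.first50 runs
head -n 1 first50__sort.txt                     # prints 1000
```

The difference is that raka runs each step only when its output is missing, and resolves the chain of rules automatically.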

We may want to skip the sort step, and invoke:

raka first50__input.txt

Raka will read data from input.txt and copy the first 50 lines to first50__input.txt.

This illustrates some basic ideas but may not be particularly interesting. The following is a slightly more complex example that covers more features.

require 'raka'

dsl = Raka.new(self,
               output_types: %i[csv pdf],
               input_types: %i[csv],
               lang: ['lang/shell', 'lang/python'])

py_template = <<~PYTHON
  import os.path
  import pandas as pd

  def write_variety(input, output, variety):
    print(variety)
    folder = os.path.dirname(output)
    if len(folder) > 0:
      os.makedirs(folder, exist_ok=True)
    df = pd.read_csv(input)
    df[df['class'] == variety].to_csv(output)

  <code>
PYTHON
py.config script_template: py_template

groups = %i[virginica versicolor]

csv(groups.join('|')).iris =
  [csv.iris_all] | py* %(write_variety('$<', '$@', 'Iris-$(target_scope)'))

csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)

dsl.scope(*groups) do
  pdf.iris.plot['plot_(\S+)_(\S+)'] = py do |rask|
    <<-PYTHON
    import seaborn as sns
    from matplotlib import pyplot as plt

    df = pd.read_csv('#{rask.input}')
    ax = sns.displot(x=df['#{rask.captures.plot0}#{rask.captures.plot1}'])
    ax.set_axis_labels('#{rask.captures.plot0} #{rask.captures.plot1}', 'frequency')
    plt.savefig('#{rask.output}')
    PYTHON
  end
end

task figures: (groups.product(%w[sepal petal], %w[length width]).map do |info|
  "_out/#{info[0]}/plot_#{info[1]}_#{info[2]}__iris.pdf"
end)

In this example, we download the classic iris dataset (iris.csv), use python code to extract two varieties, virginica and versicolor, and generate frequency-histogram plots for both varieties.

To invoke the script, we run in terminal:

raka -j 8 -v figures

The option -j 8 indicates we want to parallelize the tasks with at most 8 concurrent processes where possible. The option -v lets raka print detailed information so we can view the generated python code.

The tool will then act as follows:

  1. Match figures with the last rule, which is a normal rake task.
  2. The prerequisites include 8 figures, none of which exists yet. Take _out/versicolor/plot_petal_length__iris.pdf as an example from now on.
  3. The rule pdf.iris.plot['plot_(\S+)_(\S+)']... is matched, where "petal" is bound to plot0 and "length" is bound to plot1.
  4. Neither of the two possible input files, _out/versicolor/iris.csv and _out/versicolor/iris.pdf, can be found. But the rule csv(groups.join('|')).iris = ... (i.e. csv('virginica|versicolor').iris) can be matched for the former, where the target scope is matched as versicolor.
  5. The only dependency csv.iris_all is resolved as _out/iris_all.csv. The path does not contain versicolor since the target scope applies only to the target.
  6. The rule csv.iris_all is matched without any dependencies.
  7. The shell protocol replaces the automatic variable $@ with _out/iris_all.csv to build a curl command and download the iris dataset from datahub.io.
  8. Raka now goes back to generate the output _out/versicolor/iris.csv by executing the code generated by the python protocol, which extracts rows where the class field equals "Iris-versicolor".
  9. Raka goes back to generate the output _out/versicolor/plot_petal_length__iris.pdf by executing the code generated by the python protocol, which draws a histogram depicting the distribution of petal length.
  10. Raka continues to generate plot files until all 8 figures exist.

As an example, the generated python code in step 9 is:

import sys
import os.path
import pandas as pd

def write_variety(input, output, variety):
  print(variety)
  folder = os.path.dirname(output)
  if len(folder) > 0:
    os.makedirs(folder, exist_ok=True)
  df = pd.read_csv(input)
  df[df['class'] == variety].to_csv(output)

import seaborn as sns
from matplotlib import pyplot as plt

df = pd.read_csv('_out/versicolor/iris.csv')
ax = sns.displot(x=df['petallength'])
ax.set_axis_labels('petal length', 'frequency')
plt.savefig('_out/versicolor/plot_petal_length__iris.pdf')

The rule-based system, the strategy of executing tasks only when necessary, and the capable host language make it fairly easy to adjust experiments during exploration. For example, suppose we also want to apply the experiments to the setosa class; we can just change the line

groups = %i[virginica versicolor]

to

groups = %i[virginica versicolor setosa]

The command raka -j 8 -v figures will generate 4 figures for the new class, without re-executing tasks for the other two classes.

Why Raka

Data processing tasks can involve plenty of steps, each with its own dependencies. Compared to bare Rake or the more classical Make, Raka offers the following advantages:

  1. Advanced pattern matching and template resolving to define general rules and maximize code reuse.
  2. Extensible and context-aware protocol architecture.
  3. Multilingual. Other programming languages can be easily embedded.
  4. Automatic dependency resolution and naming by convention.
  5. Scopes to ease comparative studies.
  6. Terser syntax.

... and more.

Compared to more complex, GUI-based solutions (often classified as scientific-workflow software) such as Kepler, Raka has the following advantages:

  1. Lightweight and easy to set up, especially on platforms with ruby preinstalled.
  2. Easy to deploy, version-control, backup or share workflows since the workflows are merely text files.
  3. Easy to reuse modules or create reusable modules, which are merely plain ruby code snippets (or in other languages with protocols).
  4. Expressive so a few lines of code can replace many manual operations.

Documentation

Conceptual Model

A raka rule consists of a target, dependencies, actions and post targets.

Syntax Definition

It is possible to use Raka with little knowledge of ruby / rake, though a minimal understanding is highly recommended. The formal syntax of a rule can be defined as follows (W3C EBNF form):

rule ::= target "=" (dependencies "|")* action ("|" post_target)*

target ::= ext "." ltoken ("." ltoken)*

dependencies ::= "[]" | "[" dependency ("," dependency)* "]"

dependency ::= rexpr | template

post_target ::= rexpr | template

rexpr ::= ext "." rtoken ("." rtoken)*

ltoken ::= word | word "[" pattern "]"
rtoken ::= word | word "(" template ")"

word ::= ("_" | letter) ( letter | digit | "_" )*

action ::= ("shell" | "r" | "psql" | "py" ) ("*" template | block ) | "run" block

The corresponding railroad diagrams (for rule, target, dependencies, dependency, post_target, rexpr, ltoken, rtoken, word and action) are omitted here.

The definition is concise, but several details are omitted for simplicity:

  1. BLOCK and HASH are ruby's block and hash objects.
  2. A template is just a ruby string with some placeholders (see the next section for details).
  3. A pattern is just a ruby string representing a regex (see the next section for details).
  4. The listed protocols are merely what we offer now; the set can be greatly extended.
  5. Nearly any concept in the syntax can be replaced by a suitable ruby variable.

Pattern matching and template resolving

When a rule like target = <specification> is defined, the left side represents a pattern and the right side contains specifications for extra dependencies, actions and some targets to create thereafter. When raking a target file, the left sides of the rules are examined one by one until a rule matches. The matching process, based on regexes, also supports named captures so that variables can be extracted for use on the right side.

The specifications on the right side of a rule can contain templates. The "holes" in the templates are filled with automatic variables and variables captured while matching the left side.

Pattern matching

To match a given file against a target, the extension is matched first. The substrings of the file name separated by "__" are mapped, in reverse order, to the tokens separated by ".". After that, each substring is matched against the corresponding token or the regex in []. For example, the rule

pdf.buildings.indicator['\S+'].top['top_(\d+)']

can match "top_50__node_num__buildings.pdf". The logical process is:

  1. The extension pdf matches.
  2. The substrings and the tokens are paired and they all match:
    • buildings ~ buildings
    • '\S+' ~ node_num
    • top_(\d+) ~ top_50
  3. Two levels of captures are made. First, 'node_num' is captured as indicator and 'top_50' is captured as top; second, '50' is captured as top0 since (\d+) is the first group wrapped in parentheses.

One can write the special token _ to match any token. Since raka uses prefix matching, something like token0[''] can also match any token and additionally capture it as token0. The end-of-string anchor $ can be used to force matching the whole token, e.g., token0['word$'] will not match word_bench.
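The matching process above can be sketched in a few lines of plain ruby (a simplified illustration, not Raka's actual implementation):

```ruby
# Match "top_50__node_num__buildings" against the token list of the rule
# pdf.buildings.indicator['\S+'].top['top_(\d+)'].
tokens = [['buildings', /buildings/], ['indicator', /\S+/], ['top', /top_(\d+)/]]
parts = 'top_50__node_num__buildings'.split('__').reverse
# parts is ["buildings", "node_num", "top_50"]

captures = {}
matched = parts.zip(tokens).all? do |part, (key, pattern)|
  m = pattern.match(part)  # prefix matching, as raka does
  next false unless m
  captures[key] = part     # first level: the whole substring
  m.captures.each_with_index { |c, i| captures["#{key}#{i}"] = c }  # second level
  true
end

matched                # => true
captures['indicator']  # => "node_num"
captures['top0']       # => "50"
```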

Template resolving

In some places of an rexpr, templates can be written instead of plain strings, so that they can represent different values at runtime. There are two types of variables that can be used in templates. The first is automatic variables, which are just like $@ in Make or task.name in Rake. We even preserve some Make conventions for easier migration. All automatic variables begin with $. The possible automatic variables are:

symbol description
$@, $(output) the output file
$<, $(input) the input file defined in the chained target
$^, $(deps) all dependencies concatenated by commas (including input)
$(dep0), $(dep1), ... the i-th dependency (input is $(dep0))
$(input_stem) stem of the input file
$(output_stem) stem of the output file
$(func) the token added to input to generate output, e.g., stat in csv.data.stat
$(ext) extension of the output file
$(scope) scope for current task, i.e. the common directory for output, input and dependencies
$(target_scope) the inline scope defined in target
$(target_scope0), $(target_scope1), ... the i-th captured value by inline scope defined in target
$(rule_scope0), $(rule_scope1), ... the i-th scope defined at rule level by nested calls of the dsl.scope function (i increases from the inside out)

The other type of variables are those captured during pattern matching, which can be referred to using %{var}. In the example of the pattern matching section, %{indicator} will be replaced by node_num, %{top} will be replaced by top_50 and %{top0} will be replaced by 50. In that case, a template like 'calculate top %{top0} of %{indicator} for $@' will be resolved as 'calculate top 50 of node_num for top_50__node_num__buildings.pdf'.
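The resolving step can be sketched as follows (resolve is a hypothetical helper, not Raka's API; the real implementation substitutes many more automatic variables):

```ruby
# Fill %{...} holes from captured variables and $-style holes from
# automatic variables.
def resolve(template, captures, autovars)
  s = template.gsub(/%\{(\w+)\}/) { captures[Regexp.last_match(1)] }
  s = s.gsub('$@', autovars['output'])
  s.gsub(/\$\((\w+)\)/) { autovars[Regexp.last_match(1)] }
end

captures = { 'indicator' => 'node_num', 'top' => 'top_50', 'top0' => '50' }
autovars = { 'output' => 'top_50__node_num__buildings.pdf' }

resolve('calculate top %{top0} of %{indicator} for $@', captures, autovars)
# => "calculate top 50 of node_num for top_50__node_num__buildings.pdf"
```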

Templates can appear in various places. For dependencies and post targets, tokens with parentheses can contain templates, like csv._('%{indicator}'). The symbol of a token with parentheses is of no use and is conventionally written as an underscore. It is also possible to write a template literal directly, e.g. '%{indicator}.csv'. Templates can also be applied in actions, but that depends on the implementations of the protocols.

Actions and protocols

Raka invokes actions when all inputs and dependencies are present. Generally, users define an action that generates the output. To maximize flexibility, users can feed code in an arbitrary programming language to the corresponding protocol. The protocol then transforms and executes the code. Raka natively supports the host (ruby) protocol and several foreign protocols including shell, python, psql, and r.

The host protocol is special and just executes the given ruby block. All other protocols accept either a templated code string given via the asterisk operator or a block producing a templated code string. The following illustrates examples for each protocol.

In the host protocol and the block versions of other protocols, a raka task (the rask variable) is provided, which offers the following properties:

property description
output the output file
input the input file defined in the chained target
deps the dependencies (input is deps[0])
func the token added to input to generate output, e.g., stat in csv.data.stat
ext extension of the output file
captures captured text during pattern matching, key-value
scope scope for current task, i.e. the common directory for output, input and dependencies
target_scope the inline scope defined in target
target_scope_captures captured values by inline scope defined in target
rule_scopes the scopes defined at rule level by nested calls of the dsl.scope function

require 'raka'
require 'csv'

dsl = Raka.new(
  self, output_types: %i[table view csv],
        lang: ['lang/psql', 'lang/shell', 'lang/python', 'lang/r']
)

csv.iris_all = shell* %(curl -L https://datahub.io/machine-learning/iris/r/iris.csv > $@)

# host(ruby) protocol
csv.rb_out = [csv.iris_all] | run do |rask|
  in_f = File.open(rask.deps[0])
  out_f = File.open(rask.output, 'w')
  options = { headers: true, return_headers: true, write_headers: true }
  CSV.filter(in_f, out_f, options) do |row|
    row['class'] == 'Iris-versicolor'
  end
end

# python protocol
csv.py_out = [csv.iris_all] | py* %(
  import pandas as pd
  df = pd.read_csv('$(dep0)')
  df[df['class'] == 'Iris-versicolor'].to_csv('$@')
)

# python protocol (block)
csv.py_out2 = [csv.iris_all] | py do |rask|
  <<-PYTHON
  import pandas as pd
  df = pd.read_csv('#{rask.deps[0]}')
  df[df['class'] == 'Iris-versicolor'].to_csv('#{rask.output}')
  PYTHON
end

# r protocol
csv.r_out = [csv.iris_all] | r* %(
  df <- read.csv("$(dep0)")
  write.csv(df[(df$class == "Iris-versicolor"),], file="$@")
)

# r protocol (block)
csv.r_out = [csv.iris_all] | r do |rask|
  <<-R
  df <- read.csv("#{rask.deps[0]}")
  write.csv(df[(df$class == "Iris-versicolor"),], file="#{rask.output}")
  R
end

# shell protocol
csv.shell_out = [csv.iris_all] | shell* %(
  cat <(head $(dep0)) <(grep "Iris-versicolor" $(dep0)) > $@
)

# shell protocol (block)
csv.shell_out2 = [csv.iris_all] | shell do |rask|
  "cat <(head -1 #{rask.deps[0]}) <(grep 'Iris-versicolor' #{rask.deps[0]}) > #{rask.output}"
end

# psql protocol
pg = OpenStruct.new(
  user: 'postgres',
  port: 5433,
  host: '127.0.0.1',
  db: 'postgres',
  password: 'postgres'
)
psql.config conn: pg, create: :mview

table.iris_all = [csv.iris_all] | psql(create: nil)* %(
  DROP TABLE IF EXISTS $(output_stem);
  CREATE TABLE $(output_stem) (
    sepallength float,
    sepalwidth float,
    petallength float,
    petalwidth float,
    class varchar
  );
  \\COPY $(output_stem) FROM '$(dep0)' CSV HEADER;
)

table.psql_out = [table.iris_all] | psql* %(
  SELECT * FROM $(dep0_stem) WHERE class='Iris-versicolor'
)

# psql protocol (block)
table.psql_out2 = [table.iris_all] | psql do |rask|
  <<-SQL
  SELECT * FROM #{dsl.stem(rask.deps[0])} WHERE class='Iris-versicolor'
  SQL
end

Initialization and options

These APIs are bound to an instance of the DSL; you can create the object at the top:

dsl = Raka.new(<env>, <options>)

The argument <env> should be the self of a running Rakefile. In most cases you can directly write:

dsl = Raka.new(self, <options>)

Two important fields of the options are output_types and input_types. For each item in output_types, you get a global function to bootstrap a rule. For example, with

dsl = Raka.new(self, { output_types: [:csv, :pdf] })

you can write rules like:

csv.data = ...
pdf.graph = ...

which will match

/data.csv and /graph.pdf

The input_types option determines the strategy used to find inputs. All possible input types will be tried when resolving an input file in a chained target. For example, raka will try to find both numbers.csv and numbers.table for a rule like table.numbers.mean = … if input_types = [:csv, :table].
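The strategy can be sketched like this (resolve_input is a hypothetical helper, not Raka's actual code):

```ruby
require 'tmpdir'

# Try each input type in order and use the first candidate file that exists.
def resolve_input(stem, input_types)
  input_types.map { |ext| "#{stem}.#{ext}" }.find { |f| File.exist?(f) }
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'numbers.table'), '')
  resolve_input(File.join(dir, 'numbers'), %i[csv table])
  # picks ".../numbers.table", since numbers.csv does not exist
end
```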

Scope

Scopes define constraints that help users create rules more precisely. A scope generally refers to a folder and can appear in several places.

Task scope is the scope when executing a task, a.k.a. scope. When a rule is matched for a desired output, a task is generated and its scope is the common folder of the output and all dependencies. For example, the rule csv.out = [csv.in] | ... can be matched given out/out.csv, and the task scope is resolved as out/. The task will thus search for out/in.csv as its dependency.
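In a simplified form, the resolution could look like this (task_scope is a hypothetical helper, not Raka's actual code):

```ruby
# The scope is derived from the directory of the matched output; chained
# inputs and dependencies are then searched inside that directory.
def task_scope(output)
  dir = File.dirname(output)
  dir == '.' ? '' : dir
end

scope = task_scope('out/out.csv')  # => "out"
File.join(scope, 'in.csv')         # => "out/in.csv"
```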

Rule scope is a scope that restricts the possible task scopes, given by Raka::scope. In the iris example above, dsl.scope(*groups) makes the rule scopes virginica and versicolor.

Target scope.

Rakefile Template

Write your own protocols

Compare to other tools

Raka borrows some ideas from Drake, but not many (currently mainly the name "protocol"). Briefly, we have different visions and perhaps different suitable scenarios.