Make in Python


Keywords
build-pipelines, reproducible-research, workflow-engine, workflow-management
License
GPL-3.0
Install
pip install buildpy==9.4.0

Documentation

BuildPy

Build Status

BuildPy is a workflow engine to manage data analysis pipelines. It has following features:

  • Integration for
    • BigQuery ("bq://project.dataset.table")
    • Google Cloud Storage ("gs://bucket/blob")
    • S3 ("s3://bucket/path/to/objec")
  • Parallel processing (similar to --jobs of GNU Make)
  • Checksum-based update scheme (similar to SCons)
  • Dynamic job declaration
  • Support for prioritized job declaration
  • Job scheduling based on load average (similar to --load-average of GNU Make)
  • DOT-format output of a dependency graph (similar to --prereqs of Rake)
  • Deferred error (similar to --keep-going of GNU Make)
  • Dry-run (similar to --dry-run of GNU Make)
  • Declaration of multiple targets for a single job
  • Versioned API (buildpy.v1, buildpy.v2, ...)

BuildPy requires Python version ≥ 3.6 and is available from PyPI and conda-forge:

# Pip
pip install buildpy

# Conda
conda install -c defaults -c conda-forge buildpy

The typical form of build.py is as follows:

python build.py --help
python build.py all --jobs="$(nproc)" --keep-going
import sys

import buildpy.vx

dsl = buildpy.vx.DSL(sys.argv)
file = dsl.file
phony = dsl.phony
sh = dsl.sh

all_jobs = []
test_jobs = []

all_jobs.append("test")
test_jobs.extend(["main.exe.log1", "main.exe.log2"])
@file(["main.exe.log1", "main.exe.log2"], ["main.exe"])
def _(j):
    # j.ts: list of targets
    # j.ds: list of dependencies
    sh(f"./{j.ds[0]} 1> {j.ts[0]} 2> {j.ts[1]}")

all_jobs.append("build")
test_jobs.append("main.exe")

@file("main.exe", ["main.c"])
def _(j):
    sh(f"gcc -o {j.ts[0]} {j.ds[0]}")

phony("all", all_jobs)
phony("test", test_jobs)

if __name__ == '__main__':
    dsl.run()

Please see ./build.py and buildpy/v*/tests/*.sh for more examples.

Usage

After importing the buildpy module, please make a DSL instance by dsl = buildpy.vx.DSL(sys.argv). The instance, dsl, provides methods to construct a dependency graph and to execute the declared jobs. dsl.file is used to declare the dependencies and the command to make target files. dsl.file is used as follows:

# Make `target` from `dep1` and `dep2` by `cat dep1 dep2 >| target`.
# You are able to pass a description of the job via the `desc` keyword argument.
@dsl.file("target", ["dep1", "dep2"], desc="Optional description argument")
def _(job):
    dsl.sh(f"cat {' '.join(job.ds)} >| {job.ts[0]}")

# You are able to declare a job to make multiple outputs via a single command invocation.
# In the following example, `target1` and `target2` are made by `diff dep1 dep2 1>| target1 2>| target2`.
@dsl.file(["target1", "target2"], ["dep1", "dep2"])
def _(job):
    dsl.sh(f"diff {' '.join(job.ds)} 1>| {job.ts[0]} 2>| {job.ts[1]}")

Like the task method of Rake or .PHONY rule of GNU Make, you are able to declare a job, which does not produce target files, by using dsl.phony. dsl.phony is used as follows:

# Make a phony target named `taregetA`, which depends on `dep1` and `dep2`.
# An invocation of `targetA` executes the decorated method, `_`, and prints `targetA invoked.`
@dsl.phony("targetA", ["dep1", "dep2"], desc="Optional description argument")
def _(job):
    print(job.ts[0] + " invoked.")

# Make a phony target named `taregetB`, which depends on `dep3` and `dep4`.
# An invocation of `targetB` executes no command.
dsl.phony("targetB", ["dep3", "dep4"])

The phony target named all is invoked if no target is specified on the command line. If you want to make libfinalproduct.so by default, please add the following line to your build.py:

dsl.phony("all", ["libfinalproduct.so"])

To execute the declared jobs, please add the following line to your build.py:

dsl.run()

News

v9.4.0

  • hash/ -> hash/ts

v9.3.0

  • buildpy_dir no longer contains os.getcwd().

v9.2.0

  • with_symlink now uses the relative path instead of the absolute path.

v9.1.0

  • Add DSL.lazy_call and DSL.lazy_val.
  • with_symlink now modifies the DAG.

v9.0.0

  • Add DSL.with_symlink.
  • Use SHA256 instead of SHA1.

v8.1.0

  • Support argparse.Namespace for j.ts, j.ds, and j.data.

v8.0.0

  • Support automatic target naming via @file(auto=True).

v7.1.0

  • Support specifying targets and dependencies as dicts.

v7.0.0

  • Support machine readable logging (--execution_log_dir <dir>).
  • Add DSL.check_existence_only.

v6.2.0

  • Make the termination of subprocesses optional.

v6.1.0

  • Set use_hash=True by default.

v6.0.0

  • Remove the complicated dynamic graph update capability, which has been rarely used.

v5.1.0

  • Support cut argument for DSL.phony and DSL.file

v5.0.0

  • Dynamic DAG (accessible through j.ty and j.dy).

  • Kill all subprocesses on a failure (if not --keep-going) or SIGINT.

  • Add _Job.data.

    @file(ts, ds, data=dict(params=dict(a=1, b=2)))
    def this(j):
        print(j.data["params"])
    

v4.3.0

  • Improve error messages.

v4.2.0

  • Print more informative error messages.

v4.1.0

  • Cache clients.

v4.0.0

  • Support BigQuery ("bq://project.dataset.table")

  • Support Google Cloud Storage ("gs://bucket/blob")

    import sys
    
    import buildpy.v4
    dsl = buildpy.v4.DSL()
    
    pyony = dsl.phony
    file = dsl.file
    sh = dsl.sh
    uriparse = dsl.uriparse
    
    @file(["bq://myproject.mydataset.mytable"], ["gs://mybucket/myblob.csv"])
    def _(j):
        project, dataset, table = uriparse(j.ts[0]).netloc.split(".", 2)
        sh(f"""
        bq load --autodetect {project}:{dataset}.{table} {j.ds[0]}
        """)
    
    phony("all", ["bq://myproject.mydataset.mytable"])
    
    if __name__ == "__main__":
        dsl.main(sys.argv)

v3.6.0

  • Add DSL.cd.

v3.5.0

  • Add DSL.serialize, a canonical serializer.

v3.4.0

  • Tweak cache directory naming convention

v3.3.1

  • Respect job.priority a bit more

v3.2.0

  • Support {file,phony}(priority=) (smaller is higher).

v3.1.0

  • Support DSL.rm("dir")

v3.0.0

  • Use the JSON format to store cache files

v2.6.0

  • buildpy.vx.logger no longer has handlers.

v2.5.0

  • Support parallel execution of serial jobs

Development

Release of a new version

Add "buildpy.v9" to setup.py.

cd buildpy
git mv v9 v10
cp -a vx v9
cd v9
grep -l buildpy.vx -R . | xargs -n1 sed -i'' -e 's/buildpy.vx/buildpy.v9/g'
find . -type f | grep -v done | xargs git add
cd ../..
# edit setup.py
python build.py sdist
python3 -m venv venv
venv/bin/python3 pip install -e .[dev]
venv/bin/python3 build.py -h

TODO

  • v6: fully support the --cut argument.