reprotools

A set of tools to evaluate the reproducibility of computations


License
MIT
Install
pip install reprotools==0.0.2

Documentation


Spot

Spot identifies the processes in a pipeline that produce different results when run under different execution conditions, for example on different operating systems.

Installation

Install the package with pip:

$ pip install spottool

Pre-requisites

  • Install and start Docker
  • Build Docker images for the pipeline in each condition (e.g. Debian 10 and CentOS 7)
  • Create a Boutiques descriptor for the pipeline in each condition
  • Capture provenance information with ReproZip in one of the conditions

The auto_spot command finds processes that create differences in results obtained in different conditions and reports them in a JSON file.
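
To make the notion of "differences in results" concrete, here is a minimal Python sketch that lists the files whose content differs between two result folders. It is an illustration only, not Spot's implementation: the function names and the choice of SHA-256 checksums are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its content."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def differing_files(dir_a, dir_b):
    """Return relative paths that are missing on one side or differ in content."""
    digests_a = file_digests(dir_a)
    digests_b = file_digests(dir_b)
    all_names = digests_a.keys() | digests_b.keys()
    # A file absent from one folder yields None there, so it is reported too.
    return sorted(
        name for name in all_names if digests_a.get(name) != digests_b.get(name)
    )
```

Calling differing_files on the result folders of two conditions shows which output files differ. Spot goes further and traces such files back to the processes that wrote them.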

Usage example

In this example, we run a Bash script that calls the grep command multiple times and creates different output files depending on the OS it runs on. We use Spot to compare the outputs obtained in CentOS 7 and Debian 10.

The example can be run in this Git repository as follows:

git clone https://github.com/big-data-lab-team/spot.git
cd spot
pip install .

docker build . -f spot/example/centos7/Dockerfile -t spot_centos_latest
docker build . -f spot/example/debian/Dockerfile -t spot_debian_latest

cd spot/example

auto_spot -d descriptor_centos7.json -i invocation_centos7.json -d2 descriptor_debian10.json -i2 invocation_debian10.json -s trace_test.sqlite3 -c conditions.txt -e exclude_items.txt -o commands.json .

In this command:

  • descriptor_<distro>.json is the Boutiques descriptor of the application executed in OS <distro>.
  • invocation_<distro>.json is the Boutiques invocation of the application executed in OS <distro>, containing the input files.
  • trace_test.sqlite3 is a ReproZip trace of the application, acquired in CentOS 7.
  • conditions.txt contains the result folder for each condition.
  • exclude_items.txt contains the list of items to be ignored while parsing the files and directories.
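
When the command is launched from a script, it can help to assemble the argument list programmatically. The sketch below uses only the flags shown in the example command. The helper name build_auto_spot_command and its parameter names are hypothetical, and the final "." is passed through unchanged because its meaning is not described here.

```python
import shlex


def build_auto_spot_command(descriptor_1, invocation_1, descriptor_2, invocation_2,
                            trace, conditions, exclude, output, trailing_arg="."):
    """Assemble the auto_spot argument list from the example command (hypothetical helper)."""
    return [
        "auto_spot",
        "-d", descriptor_1, "-i", invocation_1,     # first condition
        "-d2", descriptor_2, "-i2", invocation_2,   # second condition
        "-s", trace,                                # ReproZip trace
        "-c", conditions,                           # result folder per condition
        "-e", exclude,                              # items to ignore
        "-o", output,                               # JSON report
        trailing_arg,                               # copied as-is from the example
    ]


cmd = build_auto_spot_command(
    "descriptor_centos7.json", "invocation_centos7.json",
    "descriptor_debian10.json", "invocation_debian10.json",
    "trace_test.sqlite3", "conditions.txt", "exclude_items.txt", "commands.json",
)
print(shlex.join(cmd))
```

To run the command rather than print it, pass cmd to subprocess.run(cmd, check=True) from the spot/example directory.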

The command produces the following outputs:

  • commands_captured_c.json contains the list of processes with temporary files and files written by multiple processes.
  • commands.json contains the list of processes that create differences between the two conditions. Its total_commands_multi attribute lists the processes that write files also written by other processes, and its total_commands attribute lists the remaining processes.
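
The report can then be inspected from Python. The following sketch assumes that total_commands and total_commands_multi are top-level keys of commands.json and that each holds a list or an object; the exact schema is not documented here, so adjust the code to the file you actually obtain. Both function names are hypothetical.

```python
import json


def load_report(path="commands.json"):
    """Load the JSON report written by auto_spot."""
    with open(path) as handle:
        return json.load(handle)


def summarize_report(report):
    """Count the entries under the two attributes, assuming lists or objects."""
    return {
        key: len(report.get(key, []))
        for key in ("total_commands", "total_commands_multi")
    }
```

For example, summarize_report(load_report()) returns the number of entries found under each attribute, with 0 for an attribute that is absent.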

How to Contribute

  1. Clone the repo and create a new branch: $ git clone https://github.com/big-data-lab-team/spot.git && cd spot && git checkout -b name_for_new_branch
  2. Make your changes and test them.
  3. Submit a pull request with a comprehensive description of the changes.

License

MIT © /bin Lab