Spot
Spot identifies the processes in a pipeline that produce different results in different execution conditions.
Installation
Install the package with pip:
$ pip install spottool
Pre-requisites
- Install and start Docker
- Build Docker images for the pipeline in each condition (e.g. Debian 10 and CentOS 7)
- Create a Boutiques descriptor for the pipeline in each condition
- Collect provenance information with the ReproZip tool in one of the conditions
The auto_spot command finds the processes that create differences between results obtained in different conditions and reports them in a JSON file.
Usage example
In this example, we run a bash script that calls the grep command multiple times, creating different output files when run on different OSes. We use spot to compare the outputs obtained in CentOS 7 and Debian 10.
The example can be run in this Git repository as follows:
git clone https://github.com/big-data-lab-team/spot.git
cd spot
pip install .
docker build . -f spot/example/centos7/Dockerfile -t spot_centos_latest
docker build . -f spot/example/debian/Dockerfile -t spot_debian_latest
cd spot/example
auto_spot -d descriptor_centos7.json -i invocation_centos7.json -d2 descriptor_debian10.json -i2 invocation_debian10.json -s trace_test.sqlite3 -c conditions.txt -e exclude_items.txt -o commands.json .
In this command:
- descriptor_<distro>.json is the Boutiques descriptor of the application executed in OS <distro>.
- invocation_<distro>.json is the Boutiques invocation of the application executed in OS <distro>, containing the input files.
- trace_test.sqlite3 is a ReproZip trace of the application, acquired in CentOS 7.
- conditions.txt contains the result folder for each condition.
- exclude_items.txt contains the list of items to be ignored while parsing the files and directories.
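The example invocation can also be assembled programmatically, which helps when the same comparison is repeated for other distributions. The following is a minimal sketch, not part of spot itself: the function names build_auto_spot_command and missing_inputs are hypothetical, and the sketch only reuses the flags and file names shown in the command above. It builds the argument list and reports which input files are absent, without running anything.

```python
import shlex
from pathlib import Path


def build_auto_spot_command(distro1="centos7", distro2="debian10",
                            trace="trace_test.sqlite3",
                            conditions="conditions.txt",
                            exclude="exclude_items.txt",
                            output="commands.json",
                            workdir="."):
    """Return the auto_spot argument list from the usage example.

    Hypothetical helper: it only mirrors the flags shown in this README.
    """
    return [
        "auto_spot",
        "-d", f"descriptor_{distro1}.json",
        "-i", f"invocation_{distro1}.json",
        "-d2", f"descriptor_{distro2}.json",
        "-i2", f"invocation_{distro2}.json",
        "-s", trace,
        "-c", conditions,
        "-e", exclude,
        "-o", output,
        workdir,
    ]


def missing_inputs(command, base_dir="."):
    """List the input files named in the command that are absent from base_dir."""
    input_flags = {"-d", "-i", "-d2", "-i2", "-s", "-c", "-e"}
    base = Path(base_dir)
    # Pair every argument with its successor to pick out flag/value pairs.
    return [value for flag, value in zip(command, command[1:])
            if flag in input_flags and not (base / value).is_file()]


if __name__ == "__main__":
    cmd = build_auto_spot_command()
    print(shlex.join(cmd))
    absent = missing_inputs(cmd)
    if absent:
        print("Missing input files:", ", ".join(absent))
```

Once every input file is present, the returned list can be passed to subprocess.run from the spot/example directory.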
The command produces the following outputs:
- commands_captured_c.json contains the list of processes with temporary files and files written by multiple processes.
- commands.json contains the list of processes that create differences in two conditions. Attribute total_commands_multi contains processes that write files written by multiple processes and total_commands contains the other processes.
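To inspect the report, the two attributes can be read with any JSON library. The sketch below assumes only that commands.json is a JSON object with the attributes total_commands and total_commands_multi; the structure of their values is not described here, so they are returned unchanged. The function names load_report and split_report are hypothetical.

```python
import json


def load_report(path="commands.json"):
    """Load the JSON report written by auto_spot."""
    with open(path) as f:
        return json.load(f)


def split_report(report):
    """Return (total_commands, total_commands_multi) from a loaded report.

    An attribute that is absent from the report is returned as None.
    """
    return report.get("total_commands"), report.get("total_commands_multi")


if __name__ == "__main__":
    single_writer, multi_writer = split_report(load_report())
    print("total_commands:")
    print(json.dumps(single_writer, indent=2))
    print("total_commands_multi:")
    print(json.dumps(multi_writer, indent=2))
```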
How to Contribute
- Clone the repo and create a new branch: $ git clone https://github.com/big-data-lab-team/spot && cd spot && git checkout -b name_for_new_branch
- Make changes and test
- Submit a pull request with a comprehensive description of the changes
License
MIT © /bin Lab