Fenpei

Distribution of tasks.

License: BSD-3-Clause

This little tool helps in scheduling, tracking and aggregating calculations and their results. It forms the step that brings you from 'a directory with working code for a job' to 'running dozens of jobs and getting results easily'.

pip install fenpei

It is intended for running multiple intensive computations on a (Linux) cluster. At present, it assumes a shared file system across the cluster.

It takes a bit of work to integrate with your setup, but it is very flexible and should make your life easier once it is configured. Some features:

  • Jobs are created in Python files, which keeps job definitions short and extremely flexible.
  • It uses a command line interface (some shell experience required) to easily start, stop or monitor jobs.
  • Easy to use with existing code and reproducible, since it works by creating isolated job directories.
  • Can replace scheduling queue functionality and start jobs through ssh, or can work with existing systems (slurm and qsub included; others implementable).
  • Flexibility for caching, preparation and result extraction.
  • Uses multi-processing, and can easily use caching for better performance and symlinks to save space.

Note that:

  • You will have to write Python code for your specific job, as well as any analysis or visualization for the extracted data.
  • Except in status monitoring mode, it derives the state on each run; it doesn't keep a database that can get outdated or corrupted.

One way to run reproducible jobs with Fenpei (there are many others):

  • Make a script that runs your code from source to completion for one set of parameters.
  • Subclass the ShJobSingle job class and add all the files that you need in get_nosub_files (a rough sketch of such a subclass follows this list).
  • Replace all the parameters in the run script and other config files by {{ some_param_name }}. Add these files to get_sub_files.
  • Make a Python file (example below) for each analysis you want to run, and fill in all the some_param_name placeholders with the appropriate values.
  • From a shell, use python your_jobfile.py -s to see the status, then use other flags for more functionality (see below).
  • Implement is_complete and result in your job (and crash_reason if you want -t); other methods can be overridden too if you require special behaviour.
  • Add analysis code to your job file if you want to visualize the results.
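
For the subclassing steps above, a job class could look roughly like the sketch below. It is only an illustration: the module path of ShJobSingle, whether the get_*_files methods are classmethods, the exact signatures of is_complete, result and crash_reason, and the self.directory attribute all depend on your fenpei version, and run.sh, solver.py, config.ini and out.txt are placeholder names for your own files. A config.ini listed in get_sub_files could then contain lines like alpha = {{ alpha }}, which get substituted for each job.

from os.path import isfile, join

from fenpei.job_sh_single import ShJobSingle  # assumed import path; it may differ between fenpei versions


class ShefJob(ShJobSingle):
    """ The job class used in the example job file below; all file names are placeholders. """

    def get_nosub_files(self):
        # Files copied into each job directory unchanged (your code, static input data).
        # Note: the get_*_files methods may be classmethods in your fenpei version; check ShJobSingle.
        return super(ShefJob, self).get_nosub_files() + ['run.sh', 'solver.py']

    def get_sub_files(self):
        # Files in which {{ some_param_name }} placeholders are substituted for each job.
        return super(ShefJob, self).get_sub_files() + ['config.ini']

    def is_complete(self):
        # Decide from the files in the job directory whether the calculation finished.
        return isfile(join(self.directory, 'out.txt'))  # assumes the job exposes self.directory

    def result(self, *args, **kwargs):
        # Extract whatever you need from the output; collected per job for -x / compare_results.
        with open(join(self.directory, 'out.txt')) as fh:
            return {'energy': float(fh.readline())}

    def crash_reason(self, *args, **kwargs):
        # Optional: used by -t to explain why a job failed.
        return 'check {0:s}'.format(join(self.directory, 'log.txt'))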

Example file to generate jobs:

from os.path import basename, splitext

# create_jobs, SlurmQueue and the ShefJob class (your own ShJobSingle subclass) also need
# to be imported; the exact module paths depend on your fenpei version.

def generate_jobs():
    # Yield one job description per parameter combination.
    for alpha in [0.01, 0.10, 1.00]:
        for beta in range(0, 41):
            yield dict(name='a{0:.2f}_b{1:d}'.format(alpha, beta), subs=dict(
                alpha=alpha,
                beta=beta,
                gamma=5,
                delta='yes'
            ), use_symlink=True)

def analyze(queue):
    results = queue.compare_results(('alpha', 'beta', 'gamma',))
    # You now have the results for all jobs, indexed by the above three parameters.
    # Visualization is up to you, and will be run when the user adds -x.

if __name__ == '__main__':
    jobs = create_jobs(JobCls=ShefJob, generator=generate_jobs(), default_batch=splitext(basename(__file__))[0])
    queue = SlurmQueue(partition='example', jobs=jobs, summary_func=analyze)
    queue.run_argv()

This file registers many jobs for combinations of alpha and beta parameters. You can now use the command line:

usage: results.py [-h] [-v] [-f] [-e] [-a] [-d] [-l] [-p] [-c] [-w WEIGHT]
                  [-q LIMIT] [-k] [-r] [-g] [-s] [-m] [-x] [-t] [-j]
                  [--jobs JOBS] [--cmd ACTIONS]

distribute jobs over available nodes

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         more information (can be used multiple times, -vv)
  -f, --force           force certain mistake-sensitive steps instead of
                        failing with a warning
  -e, --restart         with this, start and cleanup ignore complete
                        (/running) jobs
  -a, --availability    list all available nodes and their load (cache reload)
  -d, --distribute      distribute the jobs over available nodes
  -l, --list            show a list of added jobs
  -p, --prepare         prepare all the jobs
  -c, --calc            start calculating one job, or see -z/-w/-q
  -w WEIGHT, --weight WEIGHT
                        -c will start jobs with total WEIGHT running
  -q LIMIT, --limit LIMIT
                        -c will add jobs until a total LIMIT running
  -k, --kill            terminate the calculation of all the running jobs
  -r, --remove          clean up all the job files
  -g, --fix             fix jobs, check cache etc (e.g. after update)
  -s, --status          show job status
  -m, --monitor         show job status every few seconds
  -x, --result          run analysis code to summarize results
  -t, --whyfail         print a list of failed jobs with the reason why they
                        failed
  -j, --serial          job commands (start, fix, etc) may NOT be run in
                        parallel (parallel is faster but order of jobs and
                        output is inconsistent)
  --jobs JOBS           specify by name the jobs to (re)start, separated by
                        whitespace
  --cmd ACTIONS         run a shell command in the directories of each job
                        that has a dir ($NAME/$BATCH/$STATUS if --s)

actions are executed (largely) in the order they are supplied; some actions
may call others where necessary
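
For example, assuming the job file above is saved as your_jobfile.py (these commands only combine options documented above; adapt the file and job names to your own setup):

python your_jobfile.py -s                                 # show the status of all jobs
python your_jobfile.py -p -c -q 12                        # prepare jobs and keep starting them until 12 are running
python your_jobfile.py -t                                 # list failed jobs with the reason (needs crash_reason)
python your_jobfile.py -x                                 # run the analyze() summary over the results
python your_jobfile.py --jobs "a0.01_b0 a0.01_b40" -e -c  # (re)start two specific jobs, ignoring their current state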

Pull requests, extra documentation and bug reports are welcome! It's licensed under the Revised BSD license (BSD-3-Clause), so you're free to use, modify and redistribute it.