easy-slurm

Easily manage and submit robust jobs to Slurm using Python and Bash.


Keywords
slurm, sbatch, slurm-cluster, slurm-job, slurm-job-scheduler, slurm-workload-manager
License
MIT
Install
pip install easy-slurm==0.2.1

Documentation

Easy Slurm

Features

  • Freezes source code and assets by copying to separate $JOB_DIR.
  • Auto-submits another job if current job times out.
  • Exposes hooks for custom bash code: setup/setup_resume, on_run/on_run_resume, and teardown.
  • Formats job names using parameters from config files.
  • Supports interactive jobs for easy debugging.
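
The freezing step in the first feature can be pictured roughly as follows. This is a minimal sketch using only the Python standard library, not easy-slurm's actual implementation; the function name `freeze_to_job_dir` and the archive names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def freeze_to_job_dir(src: str, assets: str, job_dir: str) -> List[str]:
    """Archive src and assets into job_dir so later edits do not affect the job.

    Hypothetical helper for illustration only; easy-slurm's real layout may differ.
    """
    job_path = Path(job_dir)
    job_path.mkdir(parents=True, exist_ok=True)
    archives = []
    for name, directory in (("src", src), ("assets", assets)):
        # make_archive appends ".tar.gz" itself and returns the archive path.
        archive = shutil.make_archive(str(job_path / name), "gztar", root_dir=directory)
        archives.append(archive)
    return archives
```

Because the job reads from the archived copies, continuing to edit the working tree after submission should not change what a queued or resubmitted job runs.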

Installation

pip install easy-slurm

Usage

To submit a job, fill in the parameters shown in the example below.

import easy_slurm

easy_slurm.submit_job(
    job_dir="$HOME/jobs/{date}-{job_name}",
    src="./src",
    assets="./assets",
    setup="""
        virtualenv "$SLURM_TMPDIR/env"
        source "$SLURM_TMPDIR/env/bin/activate"
        pip install -r "$SLURM_TMPDIR/src/requirements.txt"
    """,
    setup_resume="""
        # Runs only on subsequent runs. Call setup and do anything else needed.
        setup
    """,
    on_run="python main.py",
    on_run_resume="python main.py --resume",
    teardown="""
        # Do any cleanup tasks here.
    """,
    sbatch_options={
        "job-name": "example-simple",
        "account": "your-username",
        "time": "3:00:00",
        "nodes": "1",
    },
    resubmit_limit=64,  # Automatic resubmission limit.
)

All job files are kept in the job_dir directory. Provide directory paths for src and assets; these are archived and copied into job_dir. Also provide Bash code for the hooks, which run in the following order:

First run:     Subsequent runs:
setup          setup_resume
on_run         on_run_resume
teardown       teardown
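
The hook ordering can also be written as a small lookup. The sketch below is illustrative only; `hooks_for_run` and its `is_first_run` flag are hypothetical names and are not part of easy-slurm's API.

```python
from typing import List


def hooks_for_run(is_first_run: bool) -> List[str]:
    """Return the hook names in the order they run for a given submission."""
    if is_first_run:
        return ["setup", "on_run", "teardown"]
    # Resubmitted jobs use the *_resume variants, followed by the same teardown.
    return ["setup_resume", "on_run_resume", "teardown"]


print(hooks_for_run(True))
print(hooks_for_run(False))
```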

Full examples can be found in the examples directory, including a simple example that runs "training epochs" on a cluster.

Jobs can also be fully configured using YAML files. See examples/simple_yaml.

job_dir: "$HOME/jobs/{date}-{job_name}"
src: "./src"
assets: "./assets"
setup: |
  virtualenv "$SLURM_TMPDIR/env"
  source "$SLURM_TMPDIR/env/bin/activate"
  pip install -r "$SLURM_TMPDIR/src/requirements.txt"
setup_resume: |
  # Runs only on subsequent runs. Call setup and do anything else needed.
  setup
on_run: "python main.py"
on_run_resume: "python main.py --resume"
teardown: |
  # Do any cleanup tasks here.
sbatch_options:
  job-name: "example-simple"
  account: "your-username"
  time: "3:00:00"
  nodes: 1
resubmit_limit: 64  # Automatic resubmission limit.

Formatting

One useful feature is formatting paths using custom template strings:

easy_slurm.submit_job(
    job_dir="$HOME/jobs/{date:%Y-%m-%d}-{job_name}",
)
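
The `{date:%Y-%m-%d}` field uses Python's standard format-spec syntax, in which a datetime value accepts strftime codes after the colon. The sketch below shows that mechanism with plain `str.format`. It assumes easy-slurm's templates behave the same way, and the fixed date and job name are made-up values.

```python
from datetime import datetime

template = "$HOME/jobs/{date:%Y-%m-%d}-{job_name}"

# datetime objects interpret the format spec after ":" as strftime codes.
path = template.format(date=datetime(2024, 1, 31), job_name="example-simple")
print(path)  # $HOME/jobs/2024-01-31-example-simple
```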

The job names can be formatted using a config dictionary:

job_name = easy_slurm.format.format_with_config(
    "bs={hp.batch_size:04},lr={hp.lr:.1e}",
    config={"hp": {"batch_size": 32, "lr": 1e-2}},
)

easy_slurm.submit_job(
    job_dir="$HOME/jobs/{date:%Y-%m-%d}-{job_name}",
    sbatch_options={
        "job-name": job_name,  # equals "bs=0032,lr=1.0e-02"
        ...
    },
    ...
)

This helps in automatically creating descriptive, human-readable job names.
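
To see why the job name comes out as "bs=0032,lr=1.0e-02", the sketch below reproduces the same result with plain `str.format` by turning the nested dictionary into attribute-accessible objects. It is an illustration under that assumption, not the implementation of `format_with_config`; the helper `to_namespace` is hypothetical.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively convert nested dicts so fields like {hp.lr} resolve as attributes."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    return value


config = {"hp": {"batch_size": 32, "lr": 1e-2}}
fields = vars(to_namespace(config))  # top-level keys become format() keyword arguments

# ":04" zero-pads the integer to width 4; ":.1e" is scientific notation with one decimal.
job_name = "bs={hp.batch_size:04},lr={hp.lr:.1e}".format(**fields)
print(job_name)  # bs=0032,lr=1.0e-02
```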

See the documentation for more information and examples.