sjq

SJQ - Simple Job Queue - Single host, resource aware, batch job scheduler


License
Other
Install
pip install sjq==0.1.1

Documentation

Simple Job Queue (SJQ)

You specify:

  • how many processes to run (CPUs)
  • how much memory to use (and defaults per job)

The SJQ will open a UNIX socket to accept job requests. The protocol for these requests is a simple text-based protocol. By default, this socket will be located in $HOME/.sjq.sock. Jobs will run under the credentials of the running user. When there are no more jobs, SJQ will wait for $TIMEOUT seconds (default: 60). If there are still no jobs, then it will shut itself down.

Note: If you want a more complicated setup for more than one client user, you can specify an absolute path for the socket, and the SJQ will run the jobs whatever account has started the daemon. You can also run the daemon as root and it will run the jobs as the UID/GID of the submitting user. However, if you really want to do this, please consider a different scheduler, such as Open Grid Engine/SGE, PBS, or SLURM. SJQ was designed to make it easier to run multi-task pipelines on a single server that didn't already have a job scheduler installed. It should be used sparingly!

Scheduling priority

Jobs are executed on a first-come, first-serve basis. If a job can not execute due to CPU, memory, or job dependencies, then it will be skipped and SJQ will attempt to run the next available job.

Configuration

You can set various default values in the $HOME/.sjqrc file. The config values that are relevant to SJQ are:

sjq.socket=path-to-file           default: $HOME/.sjq.sock
sjq.logfile=path-to-logfile       default: none
sjq.daemon=[TF]                   default: F
sjq.pidfile=path-to-pidfile       default: $HOME/.sjq.pid
sjq.autoshutdown=[TF]             default: T
sjq.schedtime=value-in-seconds    default: 60 - the time between scheduling runs
sjq.waittime=value-in-seconds     default: 60 - the time to wait before autoshutdown
sjq.maxprocs=max-procs            default: total CPUs in system
sjq.maxmem=max-mem (ex: 8M, 2G)   default: None (memory use not restricted)

sjq.defaults.procs=num            default: 1
sjq.defaults.mem=mem-per-job      default: 2G

Protocol

To check the connection:

send: PING\r\n
recv: OK <job queue stats>\r\n

To close the connection:

send: EXIT\r\n
recv: OK BYE\r\n

To submit a job:

send: SUBMIT\r\n
send: OPTION VALUE\r\n
send: OPTION VALUE\r\n
      ...
send: SRC script-len\r\n
send  <script-bytes>
recv: OK jobid\r\n
recv: ERROR error-message(s)\r\n

Valid options:
MEM PROCS DEPENDS STDOUT STDERR ENV CWD NAME UID* GID* HOLD

  MEM   maximum memory required (100M, 2G, etc...) (G=1024^3, M=1024^2)
  PROCS number of CPUs this job requires
        (can not be larger than the number of managed CPUs)

        Note: CPU and MEM restrictions aren't enforced - only used for
              scheduling purposes

  DEPENDS is a colon delimited list of job-ids that must finish
          (successfully) for this job to start

  STDOUT, STDERR should be filenames - if not specified, they will be saved
                in the CWD as jobid.{stdout,stderr}

  ENV a colon-delimited list of the environment variables 
      that should be set prior to running the script

  CWD the current working directory for the job

  NAME a human-readable name for the job (not required)

  UID/GID the uid/gid to run this job under - this only works when SJQ is
          running as root and is *not* recommended (this is a large security
          hole, so you should only do this if you know what you are doing)

  HOLD - Place a user-hold on the job. The user can release this this hold
         with the "RELEASE" command. This job should be released within
         the autoshutdown timeout or the server could be shutdown before
         the job is released. (no value needed)

To kill a job:

send: KILL jobid\r\n
recv: OK\r\n

Note: this returns OK even if a jobid doesn't exist or a job has already finished

To release a job:

send: RELEASE jobid\r\n
recv: OK\r\n

To stop the server:

send: SHUTDOWN\r\n
recv: OK\r\n

Note: this implies "EXIT"

To list job status:

send: STATUS {JOBID}\r\n
recv: OK bytes\r\n
recv: <bytes of status message>
      JOBID\tJOBNAME\t[RQHSFAKUE]\tDEPENDS\r\n
      (one line for each job)

If "JOBID" is not given, then all jobs, including jobs that have finished (successfully or not) will be listed.

[RQHSFAKUE] - running, queued, holding, success, fail, abort (parent job failed), killed, user-hold, error (issue spawning job)