
KubeJobSub/AzureStorage/AzureBatch

This repository contains a Python package that serves three purposes: submitting jobs to a Kubernetes cluster, manipulating Azure File storage with a bash-like interface, and simplifying submission of jobs to the Azure Batch service.

KubeJobSub

Writing YAML files for submitting jobs to Kubernetes (or for anything else, for that matter) is the opposite of fun.

This tool makes that process much easier, although it is very specific to my own workflows.

Installation

Use pip to install:

pip install KubeJobSub

This should take care of everything for you. Installing in a virtualenv is recommended. The script is Python 3 only - attempting to run it with Python 2 will produce an error.
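To confirm the install worked, you can bring up the top-level help menu:

KubeJobSub -h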

Usage

KubeJobSub can do two things (more may come in the future) - submit jobs to a Kubernetes cluster, and get some info about the cluster you're using.

usage: KubeJobSub [-h] {submit,info} ...

KubeJobSub

positional arguments:
  {submit,info}  SubCommand Help
    submit       Submits a job to your kubernetes cluster. Configured to
                 assume you're using azure with a file mount.
    info         Tells you about your kubernetes cluster - number of nodes, and
                 specs/usage for each node.

The submit subcommand assumes that you're using an Azure file share and want to mount it as a volume. Support for other setups might get added eventually, depending on needs. It will write a YAML file for you, submit a job to Kubernetes based on that file, and then clean the file up. It also does some checks to make sure your job actually gets submitted - it ensures no other jobs have the same name as the one you've specified, and that you haven't requested more CPU/memory than any node in your cluster has.

For example, the following command would do some read trimming for you, and store the results in the mounted file share. This assumes the FASTQ files you want to trim are in the root of your file share:

KubeJobSub submit -j trim -c "bbduk.sh in=/mnt/azure/2014-SEQ-0276_S2_L001_R1_001.fastq.gz in2=/mnt/azure/2014-SEQ-0276_S2_L001_R2_001.fastq.gz out=/mnt/azure/trimmed_R1.fastq.gz out2=/mnt/azure/trimmed_R2.fastq.gz ref=adapters trimq=10 qtrim=w minlength=50" -i cathrine98/bbmap -share my_share_name -n 3 -m 4
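Behind the scenes, the command above is turned into a Kubernetes Job manifest and submitted for you. The snippet below is only a sketch of what that generated YAML might look like - the exact field names and defaults KubeJobSub emits are an assumption, but it shows the general shape (Job name, image, command, resource requests, and the Azure file share mounted via a secret):

apiVersion: batch/v1
kind: Job
metadata:
  name: trim
spec:
  template:
    spec:
      containers:
      - name: trim
        image: cathrine98/bbmap
        # The full bbduk.sh command is passed through a shell; truncated here for brevity.
        command: ["/bin/sh", "-c", "bbduk.sh in=/mnt/azure/2014-SEQ-0276_S2_L001_R1_001.fastq.gz ..."]
        resources:
          requests:
            cpu: 3        # from -n 3
            memory: 4Gi   # from -m 4
        volumeMounts:
        - name: azure
          mountPath: /mnt/azure
      volumes:
      - name: azure
        azureFile:
          secretName: azure-secret   # override with -secret
          shareName: my_share_name   # from -share
          readOnly: false
      restartPolicy: Never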

Full usage options for the submit subcommand are below.

usage: KubeJobSub submit [-h] -j JOB_NAME -c COMMAND -i IMAGE [-n NUM_CPU]
                            [-m MEMORY] [-v VOLUME] -share SHARE
                            [-secret SECRET] [-k]

optional arguments:
  -h, --help            show this help message and exit
  -j JOB_NAME, --job_name JOB_NAME
                        Name of job.
  -c COMMAND, --command COMMAND
                        The command you want to run. Put it in double quotes.
                        (")
  -i IMAGE, --image IMAGE
                        Docker image to create container from.
  -n NUM_CPU, --num_cpu NUM_CPU
                        Number of CPUs to request for your job. Must be an
                        integer greater than 0. Defaults to 1.
  -m MEMORY, --memory MEMORY
                        Amount of memory to request, in GB. Defaults to 2.
  -v VOLUME, --volume VOLUME
                        The mountpath for your azure file share. Defaults to
                        /mnt/azure
  -share SHARE, --share SHARE
                        Name of Azure file share that you want mounted to the
                        point specified by -v
  -secret SECRET, --secret SECRET
                        The name of the secret created by kubectl for azure
                        file mounting. Defaults to azure-secret. See
                        https://docs.microsoft.com/en-us/azure/aks/azure-
                        files-volume for more information on creating your
                        own.
  -k, --keep            A YAML file will be created to submit your job.
                        Deleted by default once job is submitted, but if this
                        flag is active it will be kept.

The info subcommand has no options - just run KubeJobSub info. You should see something like:

Number of nodes in cluster: 2
NodeName	CPU_Capacity	CPU_Usage	Memory_Capacity	Memory_Usage
aks-nodepool1-25823294-2	4	(28%)	8145492Ki	(26%)
aks-nodepool1-25823294-3	4	(41%)	8145492Ki	(49%)

AzureStorage

I've found Azure File shares to be a bit of a pain to manipulate with Azure's tools, so this tool provides a more bash-esque interface for manipulating/uploading/downloading files. There are lots of things you can't do yet, and probably lots of bugs.

Installation

AzureStorage is also part of the KubeJobSub package, so use pip to install:

pip install KubeJobSub

Usage

The following commands are currently available - each command has its own help menu that can be accessed with -h

usage: AzureStorage [-h] {set_credentials,ls,mkdir,upload,download,rm} ...

StorageWrapper: Using azure file shares is kind of a pain. This wraps a bunch
of Azure CLI file share commands into a more linux-esque environment.

positional arguments:
  {set_credentials,ls,mkdir,upload,download,rm}
                        SubCommand Help
    set_credentials     Sets the azure file share and account key as
                        environment variables.
    ls                  Lists files in a directory. Wildcard (*) can be used,
                        but only in final part of expression. (you can ls
                        foo/bar*.py, but not foo*/bar.py)
    mkdir               Makes a directory.
    upload              Uploads a file to azure file storage. Can use wildcard
                        to upload multiple files.
    download            Downloads one or more files from cloud to your
                        machine.
    rm                  Deletes a file. Can be run recursively to delete
                        entire directories with the -r flag.

optional arguments:
  -h, --help            show this help message and exit

Examples

Note that if your credentials haven't been set, you'll be asked to set them. They'll be remembered, and can be changed (in the event you want to use a different storage account or share) using the set_credentials subcommand.
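Set or update your stored credentials (the subcommand will likely prompt you for the share name and account key - run it with -h to see its exact options):

AzureStorage set_credentials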

List files in root dir:

AzureStorage ls

List all python files in directory scripts:

AzureStorage ls scripts/*.py

Make a directory called new-dir in root directory:

AzureStorage mkdir new-dir

Upload all .py files in your current directory to new-dir in Azure File Storage:

AzureStorage upload *.py -p new-dir

Upload a folder called folder and all of its subfolders to root in Azure:

AzureStorage upload -r folder

Remove all the .py files in new-dir:

AzureStorage rm new-dir/*.py

Remove a folder called example and all of its subfolders and files:

AzureStorage rm -r example

Download a file called file.txt from directory dir to your current working directory:

AzureStorage download dir/file.txt

Download folder folder and all of its subfolders from root of Azure to directory foo on your machine:

AzureStorage download -r folder foo

Azure Batch

Submitting things to Azure Batch is an awful lot of work - this script makes your job much easier.

You'll need three things: an Azure Batch account, an Azure Storage account, and a custom VM image with the programs you need pre-installed.

If you have these, all you need to do is provide a configuration file, which needs the following information:

  • BATCH_ACCOUNT_NAME: The name of your batch account.
  • BATCH_ACCOUNT_KEY: The key for your batch account
  • BATCH_ACCOUNT_URL: The URL for your batch account
  • STORAGE_ACCOUNT_NAME: Name of your Azure Storage account
  • STORAGE_ACCOUNT_KEY: Access key for your Azure Storage account
  • JOB_NAME: A name to give your job - this must be unique among currently running jobs within your Azure Batch account.
  • INPUT: Files that you want to have as input for your job. This is done in a unix mv fashion. If, for example, you wanted to put all .fastq.gz files in your current directory into a folder called raw_data for your batch job, you would put *.fastq.gz raw_data. You can also upload entire directories - the script will check if the path is a directory on your system, and if it is, files will be uploaded while preserving directory structure.
  • OUTPUT: Output files from your job that you want downloaded. To recursively download a directory, add a / to the end of your output path - for example, processed_sequence_data/ will download that directory and create a folder called processed_sequence_data on your machine. You may specify any number of OUTPUT lines in your configuration file.
  • VM_IMAGE: The URL for a custom VM image that you've created, with any programs and databases needed to run your commands pre-installed. This image must be in the same subscription as your Azure Batch account.

To get the VM client ID, secret, and tenant, follow this guide. You'll also need to check out the section called Use Integrated Authentication found here in order to allow API access.

  • VM_CLIENT_ID: Client ID for the VM, as found in the above guide.
  • VM_SECRET: Secret key for the VM, as found in the guide.
  • VM_TENANT: Tenant ID for the VM, as found in the guide.
  • COMMAND: The command you want to use to run your analysis.

If any of these parameters are missing, you will see an error message and the program will not proceed.

Optionally, you may also set the size of the VM you want with the VM_SIZE parameter. This defaults to a 16 core, 64 GB RAM machine that should be suitable for most tasks, but this can be scaled up/down to your liking. (See https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-general for a list of VM sizes).
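For example, to pin the pool to a particular size, add a line like the following to your configuration file (the size name here is only an illustration - any valid Azure VM size from the linked list will work):

VM_SIZE:=Standard_D16s_v3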

Each parameter name must be followed by :=, then the value for that parameter, with no spaces. An example is below.

BATCH_ACCOUNT_NAME:=mybatchaccount
BATCH_ACCOUNT_KEY:=keyforbatchaccount123jh4g123jh412h34ndafdas==
BATCH_ACCOUNT_URL:=https://mybatchaccount.region.batch.azure.com
STORAGE_ACCOUNT_NAME:=mystorageaccount
STORAGE_ACCOUNT_KEY:=keyforstorageaccount122837462387423akljsdhfaksdfa==
JOB_NAME:=myjob
INPUT:=*.fastq.gz input_sequences
OUTPUT:=processed_sequences/*
VM_IMAGE:=/subscriptions/....
VM_CLIENT_ID:=myvmclientid
VM_SECRET:=myvmsecretkey
VM_TENANT:=myvmtenantid

COMMAND:=do_things.py --input input --output processed_sequences

Actually Submitting Jobs

All you need to do to run your job is AzureBatch -c /path/to/config/file.

This will:

  • Upload files from your local machine to the Azure Cloud.
  • Start up a VM from the image specified in your configuration file and run your job.
  • Download any output files specified to your local machine in your current working directory.
  • Remove the VM and any files from Azure Storage.

The stdout.txt and stderr.txt files created by your job will also be downloaded to your local machine - useful for troubleshooting! If you don't want your output files downloaded and would prefer that they remain in Azure Storage, activate the -d flag. Your output files will then persist in blob storage in a container called yourjobname-output.

usage: AzureBatch [-h] -c CONFIGURATION_FILE [-d] [-k] [-o OUTPUT_DIR]
                  [-e EXIT_CODE_FILE]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIGURATION_FILE, --configuration_file CONFIGURATION_FILE
                        Path to your configuration file.
  -d, --download_output_files
                        By default, output files will be downloaded from blob
                        storage to local machine and the blob files deleted.
                        Activate this to not download files and keep them in
                        blob storage.
  -k, --keep_input_container
                        By default, input container is deleted. Activate this
                        to not delete the input container
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Directory where you want your output files stored.
                        Defaults to your current working directory.
  -e EXIT_CODE_FILE, --exit_code_file EXIT_CODE_FILE
                        If specified, will create a file that stores the exit
                        code for each task.
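As an example, the following invocation (the file and directory names are just placeholders) runs a job from a configuration file, puts downloaded results in a directory called results, and records each task's exit code:

AzureBatch -c batch_config.txt -o results -e exit_codes.txt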