Spark Management Consolidated
A small module that loads as a singleton class object to manage Spark-related functionality.
Installation
Directly via pip
On the command line, in a virtualenv:
pip install https://github.com/matz-e/sparkmanager/tarball/master
or for the current user:
pip install --user https://github.com/matz-e/sparkmanager/tarball/master
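For example, to install into a fresh virtual environment (the environment name below is just a placeholder):
python -m venv sm-env
. sm-env/bin/activate
pip install https://github.com/matz-e/sparkmanager/tarball/master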
Usage
The module itself acts as a mediator to Spark:
import sparkmanager as sm
# Create a new application
sm.create("My fancy name",
[("spark.executor.cores", 4), ("spark.executor.memory", "8g")])
data = sm.spark.range(5)
# Will show up in the UI with the name "broadcasting some data"
with sm.jobgroup("broadcasting some data"):
data = sm.broadcast(data.collect())
The Spark session can be accessed via sm.spark, the Spark context via sm.sc. Both attributes are instantiated once the create method is called, and unambiguous methods of either can be called directly on the SparkManager object:
# The following two calls are equivalent
c = sm.parallelize(range(5))
d = sm.sc.parallelize(range(5))
assert c.collect() == d.collect()
Cluster support scripts
Note
Scripts to run on the cluster are still somewhat experimental and should be used with caution!
Environment setup
To create a self-contained Spark environment, the script provided in examples/env.sh can be used. It is currently tuned to the requirements of the bbpviz cluster. A usage example:
SPARK_ROOT=/path/to/my/spark/installation SM_WORKDIR=/path/to/a/work/directory examples/env.sh
The working directory will contain:
- A Python virtual environment
- A basic Spark configuration pointing to directories within the working directory
- An environment script to establish the setup
To use the resulting working environment:
. /path/to/a/work/directory/env.sh
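Putting both steps together, a complete setup could look like this (all paths are placeholders):
export SPARK_ROOT=/path/to/my/spark/installation
export SM_WORKDIR=/path/to/a/work/directory
examples/env.sh
. $SM_WORKDIR/env.sh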
Spark deployment on allocations
Within a cluster allocation, the script sm_cluster can be used to start a Spark cluster. The script is installed automatically by pip. To use it, either pass a working directory containing an environment, or specify the working directory and an environment script separately:
sm_cluster startup $WORKDIR
sm_cluster startup $WORKDIR /path/to/some/env.sh
Similarly, to stop a cluster (not necessary with SLURM):
sm_cluster shutdown $WORKDIR
sm_cluster shutdown $WORKDIR /path/to/some/env.sh
Spark applications can then connect to the master, whose address can be found via:
cat $WORKDIR/spark_master
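As an illustration, an application might pick up that address as follows; this is only a sketch, assuming WORKDIR is exported in the environment, that the file contains a plain spark:// URL, and that the master can be passed through the standard spark.master configuration option:
import os
import sparkmanager as sm

# Read the master address written by sm_cluster (assumed to be a plain URL)
with open(os.path.join(os.environ["WORKDIR"], "spark_master")) as f:
    master = f.read().strip()

# Pass the master through the regular option list, as in the usage example above
sm.create("My clustered application", [("spark.master", master)])
print(sm.parallelize(range(100)).sum())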
TL;DR on BlueBrain 5
Set up a Spark environment in your current shell, and point WORKDIR to a shared directory. SPARK_HOME needs to be in your environment and point to your Spark installation. By default, only a file with the Spark master address and the cluster launch script will be copied to WORKDIR. Then submit a cluster with:
sbatch -A proj16 -t 24:00:00 -N4 --exclusive -C nvme $(which sm_cluster) startup $WORKDIR
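For reference, one possible end-to-end sequence (all paths are placeholders):
export SPARK_HOME=/path/to/my/spark/installation
export WORKDIR=/path/to/a/shared/work/directory
sbatch -A proj16 -t 24:00:00 -N4 --exclusive -C nvme $(which sm_cluster) startup $WORKDIR
# once the job is running, the master address can be read back
cat $WORKDIR/spark_master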