pluribus

A pure-Python, highly distributed MapReduce cluster.


License
Apache-2.0
Install
pip install pluribus==0.0.1

Documentation

Having just finished reading the original Google MapReduce paper, I obviously felt the need to try to implement such a system in Python.

My goal is to implement enough of the functionality described in the paper to be usable, though I strongly warn against ever using this code for anything real.

Since one of the goals (see Goals, below) is simplicity from an end-user standpoint, I am following some of Kenneth Reitz's advice and starting with a readme and documentation.

Examples

The canonical word-count example:

# myjob.py
from pluribus import job


@job.map_
def emit_words(key, value):
    # key: document name
    # value: document contents
    for word in value.split():
        yield word, 1


@job.reduce_
def sum_occurrences(key, values):
    # key: a word
    # values: a list of counts
    return sum(values)
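
To see what these two functions compute without starting a cluster, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It does not use pluribus at all and says nothing about how pluribus schedules work internally: `run_local` is a hypothetical helper written for illustration, the sample documents are made up, and the map and reduce functions are repeated without their decorators so the snippet stands alone.

```python
from collections import defaultdict


def emit_words(key, value):
    # Same logic as the map function above, minus the pluribus decorator.
    for word in value.split():
        yield word, 1


def sum_occurrences(key, values):
    # Same logic as the reduce function above, minus the pluribus decorator.
    return sum(values)


def run_local(documents, map_fn, reduce_fn):
    """Hypothetical helper: run map, shuffle, and reduce in one process."""
    intermediate = defaultdict(list)
    # Map phase: call map_fn once per input (key, value) pair.
    for name, contents in documents.items():
        for out_key, out_value in map_fn(name, contents):
            # Shuffle phase: group intermediate values by their key.
            intermediate[out_key].append(out_value)
    # Reduce phase: call reduce_fn once per distinct intermediate key.
    return dict((k, reduce_fn(k, vs)) for k, vs in intermediate.items())


# Made-up input: document name -> document contents.
documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog and the fox",
}
counts = run_local(documents, emit_words, sum_occurrences)
print(counts["the"])  # 3
```

In a real cluster the map calls and the reduce calls would be spread across workers, and the grouping step would move data between them; the in-memory dictionary here only stands in for that step.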

Assuming you're running everything on one host, you can ignore the network connection information.

Start a pluribus master:

$ pluribus master

Start a pluribus worker (or several hundred):

$ pluribus worker

On the master, or on another machine that can talk to the master, submit the job:

$ pluribus job myjob
# ... wait
<results>

Goals

Explicit goals are:

  • Simple to use, both as an administrator and end-user.
  • Well-documented.
  • Robust to worker failure.
  • Fast enough.
  • Use only the Python (2.7+) standard library (at least to run).

Explicit non-goals are:

  • Be a filesystem.
  • Robust to master failure.