A pure-Python, highly distributed MapReduce cluster.

pip install pluribus



Having just finished reading the original Google MapReduce paper, I obviously felt the need to try to implement such a system in Python.

My goal is to implement enough of the functionality described in the paper to be usable, though I strongly warn against ever using this code for anything real.

Since one of the goals (see Goals, below) is simplicity from an end-user standpoint, I am following some of Kenneth Reitz's advice and starting with a README and documentation.


The canonical word-count example:

from pluribus import job

def emit_words(key, value):
    # key: document name
    # value: document contents
    for word in value.split():
        yield word, 1

def sum_occurrences(key, values):
    # key: a word
    # values: a list of counts
    return sum(values)
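To make the flow concrete, here is a minimal in-process simulation of the map, shuffle, and reduce phases using only the standard library. This is just a sketch of the semantics (`run_locally` is a hypothetical helper, not part of pluribus; the real shuffle happens across workers):

```python
from collections import defaultdict

def emit_words(key, value):
    # key: document name; value: document contents
    for word in value.split():
        yield word, 1

def sum_occurrences(key, values):
    # key: a word; values: a list of counts
    return sum(values)

def run_locally(documents):
    # Map phase: run the mapper over every (name, contents) pair,
    # then shuffle: group the intermediate values by key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for word, count in emit_words(name, contents):
            groups[word].append(count)
    # Reduce phase: collapse each key's value list to one result.
    return {word: sum_occurrences(word, counts)
            for word, counts in groups.items()}

print(run_locally({"doc1": "to be or not to be"}))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```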

Assuming you're running everything on one host, you can ignore the network connection information.

Start a pluribus master:

$ pluribus master

Start a pluribus worker (or several hundred):

$ pluribus worker

On the master or on another machine that can talk to the master:

$ pluribus job myjob
# ... wait


Explicit goals are:

  • Simple to use, both as an administrator and end-user.
  • Well-documented.
  • Robust to worker failure.
  • Fast enough.
  • Use only the Python (2.7+) standard library (at least to run).

Explicit non-goals are:

  • Be a filesystem.
  • Be robust to master failure.