A set of higher-order map-reduce functions to use with parallel or xargs


License
MIT
Install
pip install shmr==1.4.5

Documentation

SHMR

A set of higher-order map-reduce functions

Installation

From PyPI: pip install shmr

Features

This library is designed to work with xargs or parallel so that processing large data in parallel is as simple as possible. Its main goal is to reduce the time spent writing code while still getting a reasonable speed-up from parallelization (i.e., it does not try to be as fast as possible, only faster than a sequential algorithm). It is better suited to research environments than to production environments, unlike existing parallel computing frameworks.

Its API is heavily influenced by the Spark API.

Below are some examples:

  1. Split one file (partition) into multiple files (partitions)
python -m shmr -i <file_path> partitions.coalesce --outfile <output_files> --num_partitions=128
  2. Apply a mapping function in parallel
ls <input_files> | xargs -n 1 -I {} -P <n_threads> python -m shmr \
    -i {} partition.map --fn <func> --outfile <output_file>

If you pass the -v flag, a progress bar shows how long it will take to process one partition.
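
To make the pattern behind the two commands above concrete, here is a minimal pure-Python sketch of the same idea: split one file into several partition files, then apply a function to every line of each partition concurrently. It does not use shmr's own API. The helper names (split_lines, map_partition, parallel_map), the part-NNNNN.txt naming, and the .out suffix are all hypothetical, and a thread pool stands in for the separate processes that xargs -P would start.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_lines(infile: Path, outdir: Path, num_partitions: int) -> list:
    """Split one file into `num_partitions` files of contiguous lines (hypothetical helper)."""
    lines = infile.read_text().splitlines(keepends=True)
    outdir.mkdir(parents=True, exist_ok=True)
    # Ceiling division so that every line lands in some partition.
    size = max(1, -(-len(lines) // num_partitions))
    paths = []
    for i in range(num_partitions):
        path = outdir / f"part-{i:05d}.txt"
        # Trailing partitions may be empty when there are few lines.
        path.write_text("".join(lines[i * size:(i + 1) * size]))
        paths.append(path)
    return paths


def map_partition(infile: Path, outfile: Path, fn) -> Path:
    """Apply `fn` to every line of one partition and write the results."""
    with infile.open() as src, outfile.open("w") as dst:
        for line in src:
            dst.write(fn(line.rstrip("\n")) + "\n")
    return outfile


def parallel_map(partitions, fn, max_workers: int = 4) -> list:
    """Map `fn` over all partitions concurrently, one task per partition."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(map_partition, p, p.with_suffix(".out"), fn)
            for p in partitions
        ]
        # Collect results in partition order; re-raises any worker error.
        return [f.result() for f in futures]
```

Calling split_lines(Path("data.txt"), Path("parts"), 128) followed by parallel_map(parts, str.upper) mirrors the coalesce-then-map flow. For CPU-bound functions, separate processes (as xargs -P provides) avoid the limits that Python threads run into.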