Reservoir

Generate k samples from any iterable or data stream.


License
Other
Install
pip install Reservoir==0.1.1

Documentation

Reservoir

A lightweight module for generating samples from interable data structures.

Random sampling is often applied to very large datasets and in particular to data streams. In this case, the random sample has to be generated in one pass over an initially unknown population. An elegant and efficient approach to generate random samples from data streams is the use of a reservoir of size k, where k is sample size. The reservoir-based sampling algorithms maintain the invariant that, at each step of the sampling process, the contents of the reservoir are a valid random sample for the set of items that have been processed up to that point. --- Weighted Random Sampling over Data Streams

Installation

pip install Reservoir

Usage

As a command line interface, just run the -h flag to see whats up.

$ res-sample -h

usage: r [-h] [-f FILE | -s] size

Sample k elements from a stream or file

positional arguments:
  size                  The number of elements you wish you sample

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE
  -s, --stream          For use with pipes and redirects

File and Streams

# sample 10 lines from 'input.txt'
$ res-sample -f input.txt 10

To use pipes and redirects use the -s | --stream flag.

# sample 10 lines from my twitter api
$ twitter_stream.py | res-sample  -s 10 > output.txt

Module

As a module the UniformSampler interface is quite simple.

from reservoir import UniformSampler
sampler = UniformSampler(max_samples=2)

# sampler.feed takes in an item at a time
sampler.feed("github")
sampler.feed(1)
sampler.feed([1, 2, 3]) 

sampler.saves
>>> ["github", [1, 2, 3]]

# sampler.stream_sample takes in an iterable and selects the k elements
sampler.stream_sample(some_generator())
sampler.stream_sample(dictionary.keys())
sampler.stream_sample(xrange(1,10))