vose

This is a Cython implemention of Michael Vose's Alias method. It can be used to perform weighted sampling with replacement of integers in O(1) time. It requires a construction phase that runs in O(n) time, with n being the number of integers with associated weights. As far as I know, it's faster than any other method available in Python. But I would love to be proven wrong!

I wrote this because I had a specific usecase where I needed to repeatidly sample integers with a weight associated to each integer. I stumbled on Keith Schwarz's Darts, Dice, and Coins: Sampling from a Discrete Distribution, which is very well written, and decided to use the Alias method. Alas, numpy doesn't seem to have it available, and neither does the random module from Python's standard library. There is, however, the vose_sampler package, but it is written in pure Python and isn't fast enough for my purposes. I therefore decided to write it in Cython and shamelessly adapted Keith Schmarz's Java implementation.

Installation

pip install vose

Usage

You first have to initialize a sampler with an array of weights. These weights are not required to sum up to 1.

>>> import numpy as np
>>> import vose

>>> weights = np.array([.1, .3, .4, .2])
>>> sampler = vose.Sampler(weights, seed=42)

You can then call the .sample() method to sample a random integer in range [0, n - 1], where n is the number of weights that were passed.

>>> sampler.sample()
2

You can set the k parameter in order to produce multiple samples.

>>> sampler.sample(k=10)
array([1, 1, 2, 1, 2, 1, 0, 1, 3, 3])

By default, a copy of the weights is made. You can disable this behavior in order to save a few microseconds, but this will modify the provided array.

>>> sampler = vose.Sampler(weights, seed=42, copy=False)

Note that vose.Sampler expects to be provided with a memoryview. In order to pass a list, you have to convert it yourself to a numpy array.

>>> weights = [.2, .3, .5]
>>> sampler = vose.Sampler(np.array(weights))

You can also use vose.Sampler from within your own Cython .pyx file:

import numpy as np

cimport vose
cimport numpy as np

cdef np.float [:] weights = np.array([.2, .3, .5], dtype=float)
cdef vose.Sampler sampler
sampler = vose.Sampler(weights)

cdef int sample = sampler.sample_1()
cdef np.int [:] samples = sampler.sample_k(10)

Note that the latter requires having to include the numpy headers in the extension definition of your setup.py:

from setuptools import Extension
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

extension = Extension(
    '*', ['your_file.pyx'],
    include_dirs=[np.get_include()],
    define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_7_API_VERSION')]
)

setup(ext_modules=cythonize([extension]))

Is it reliable?

It seems to be working correctly; at least according to the following chi-squared tests:

>>> import numpy as np
>>> from scipy import stats

>>> rng = np.random.RandomState(42)
>>> k = 1000

>>> for n in range(3, 20):
...     weights = rng.dirichlet(np.arange(1, n))
...     sampler = vose.Sampler(weights, seed=42)
...     samples = sampler.sample(k)
...     chi2 = stats.chisquare(f_obs=np.bincount(samples), f_exp=weights * k)
...     assert chi2.pvalue > .05

Is it fast?

Hell yeah. The following graph shows the average time taken to sample one integer for different amounts of weights:

As you can see, vose.Sampler takes less than a nanosecond to produce a random integer. Here is the construction time:

vose.Sampler is also fast enough to compete with numpy and random, even when including the construction time. The following table summarizes a comparison I made on my laptop, with n being the number of weights and k the number of integers to sample:

n	k	np.random.choice	random.choices	vose.Sampler	vose_sampler.VoseAlias
3	1	26ns	2ns	4ns	11ns
3	2	26ns	3ns	7ns	13ns
3	3	26ns	3ns	7ns	14ns
3	10	26ns	6ns	7ns	27ns
3	100	28ns	47ns	8ns	198ns
3	1000	46ns	461ns	19ns	1μs, 887ns
30	1	27ns	6ns	4ns	69ns
30	2	26ns	7ns	7ns	73ns
30	3	27ns	7ns	8ns	72ns
30	10	27ns	14ns	7ns	88ns
30	100	31ns	63ns	8ns	256ns
30	1000	67ns	580ns	19ns	1μs, 935ns
300	1	29ns	47ns	6ns	661ns
300	2	29ns	47ns	9ns	659ns
300	3	29ns	49ns	9ns	685ns
300	10	29ns	54ns	9ns	685ns
300	100	36ns	112ns	10ns	877ns
300	1000	96ns	717ns	20ns	2μs, 599ns
3000	1	52ns	416ns	18ns	6μs, 988ns
3000	2	50ns	420ns	21ns	7μs, 39ns
3000	3	51ns	439ns	21ns	7μs, 102ns
3000	10	51ns	420ns	21ns	7μs, 332ns
3000	100	59ns	496ns	23ns	7μs, 349ns
3000	1000	137ns	1μs, 213ns	35ns	10μs, 190ns

In summary, you probably don't need to be using vose.Sampler if you only need to sample once, regardless of the number of integers you wish to sample. You want to use vose.Sampler when you need to sample in a sequential manner, because at that point the construction time will be amortized. Indeed, this will bring you two orders of magnitude improved speed, when compared to calling np.random.choice or random.choices multiple times.

Development

git clone https://github.com/MaxHalford/vose
cd vose
python setup.py build_ext --inplace
pytest

Further work

The weights assigned to each integer cannot be modified. A search tree can be used as a data structure that supports modifications. This allows modifying weights in O(log(n)) time, but means sampling also happens in O(log(n)) time. More information here.
I'm not 100% the memory allocation of the memoryviews is optimal.
Initializing a vose.Sampler from another Cython .pyx file seems to require some Python interaction; there's probably a way to avoid this.

vose
Release 0.0.2

Release 0.0.2

0.0.1

0.0.2

Documentation

vose

Installation

Usage

Is it reliable?

Is it fast?

Development

Further work

Stats

Development practices

Releases

Contributors

vose Release 0.0.2

Release 0.0.2 Toggle Dropdown 0.0.1 0.0.2

Documentation

vose

Installation

Usage

Is it reliable?

Is it fast?

Development

Further work

Stats

Development practices

Releases

Contributors

vose
Release 0.0.2

Release 0.0.2

0.0.1

0.0.2