A set of probabilstic distinct value estimators (ie: calculate the cardinality of big sets)


Licenses
LGPL-3.0-only/GPL-3.0-or-later
Install
pip install countmemaybe==0.5

Documentation

Count me Maybe

Build Status

Hey I just met you, This is crazy, But here's a number, So count me, maybe?

countmemaybe is a set of probabilistic data structures that can estimate the cardinality of a set and also support unions. The base set of operations are:

  • Cardinality
  • Union
  • Cardinality of the Union
  • Cardinality of the Intersection
  • Jaccard Index

We also implement a probabilistic quantile data-structure which creates a synopsis of the data in order to satisfy queries regarding the quantiles of the stream.

Insertions are quite quick! For both hyperloglog and kminvalues there is an average insertion time of 4us on a 1.8GHz i7 Macbook Air.

Below are some benchmarks of error rates. The set used for estimation are 160000 random 16bit integers and the benchmark was generated using the test_dve.py script. For some as of yet unknown reason, KMV with a 64bit hash function performs worse than a 32bit hash function. This is counterintuitive and requires more investigation. Furthermore, exact mean relative error rates vary from run to run by ~1%

Benchmarks