An Elixir implementation of the Count-Min Sketch probabilistic data structure used for approximate counting of events in a stream.
The algorithm is based on the paper: An Improved Data Stream Summary: The Count-Min Sketch and its Applications by Graham Cormode and S. Muthukrishnan, though a simpler follow-up paper by the same authors can be found in Approximating Data with the Count-Min Data Structure
The package can be installed
by adding count_min_sketch
to your list of dependencies in mix.exs
:
def deps do
[
{:count_min_sketch, "~> 0.1.0"}
]
end
CountMinSketch is simple to use, though an important factor to take into account is the calculation of the expected error with a certain probability. An excerpt from [2] states:
As a result, for a sketch of size w × d with total count N , it follows that any estimate has error at most 2N/w, with probability at least 1 − (1/2)^d. So setting the parameters w and d large enough allows us to achieve very high accuracy while using relatively little space.
For example: Suppose we want an error of at most 0.1% (of the sum of all frequencies), with 99.9% certainty. Then we want 2/w = 1/1000, we set w = 2000, and (1/2)^d = 0.001, i.e. d = log 0.001/ log 0.5 ≤ 10.
# Creates a new CountMinSketch with 100 rows and 1000 columns
count_min_sketch = CountMinSketch.new(100, 1000)
# Adds a new element to the sketch
count_min_sketch = CountMinSketch.add(count_min_sketch, "foo")
>CountMinSketch.add("foo")
>CountMinSketch.add("foo")
>CountMinSketch.add("foo")
>CountMinSketch.add("foo")
>CountMinSketch.add("foo")
>CountMinSketch.add("bar")
CountMinSketch.get_count(count_min_sketch, "foo")
> 5