count_min_sketch

Minimal Count-Min Sketch implementation in Elixir


License
MIT

Documentation

CountMinSketch

An Elixir implementation of the Count-Min Sketch probabilistic data structure used for approximate counting of events in a stream.

The algorithm is based on the paper: An Improved Data Stream Summary: The Count-Min Sketch and its Applications by Graham Cormode and S. Muthukrishnan, though a simpler follow-up paper by the same authors can be found in Approximating Data with the Count-Min Data Structure

Installation

The package can be installed by adding count_min_sketch to your list of dependencies in mix.exs:

def deps do
  [
    {:count_min_sketch, "~> 0.1.0"}
  ]
end

Error rate calculation

CountMinSketch is simple to use, though an important factor to take into account is the calculation of the expected error with a certain probability. An excerpt from [2] states:

As a result, for a sketch of size w × d with total count N , it follows that any estimate has error at most 2N/w, with probability at least 1 − (1/2)^d. So setting the parameters w and d large enough allows us to achieve very high accuracy while using relatively little space.

For example: Suppose we want an error of at most 0.1% (of the sum of all frequencies), with 99.9% certainty. Then we want 2/w = 1/1000, we set w = 2000, and (1/2)^d = 0.001, i.e. d = log 0.001/ log 0.5 ≤ 10.

Usage

  # Creates a new CountMinSketch with 100 rows and 1000 columns 
  count_min_sketch = CountMinSketch.new(100, 1000)

  # Adds a new element to the sketch
  count_min_sketch = CountMinSketch.add(count_min_sketch, "foo")
    >CountMinSketch.add("foo")
    >CountMinSketch.add("foo")
    >CountMinSketch.add("foo")
    >CountMinSketch.add("foo")
    >CountMinSketch.add("foo")
    >CountMinSketch.add("bar")

  CountMinSketch.get_count(count_min_sketch, "foo")
  > 5