# hyperloglog

A redis-backed hyperloglog implementation in Clojure

Hyperlolog is a highly-accurate probabilistic cardinality estimation algorithm that uses constant storage, first proposed by Flajolet *et al* [1]. This clojure library is designed for the use case where multiple frontends update shared hyperloglog counters stored in Redis. Hyperloglog time-series are also supported, enabling cardinality estimates over arbitrary periods of time (e.g., number of distinct users over the past week).

## Obsolescence

According to this blog post, upcoming versions of Redis will have built-in support for hyperloglog counters. Once that feature is released, this library should be considered deprecated.

## Setup

Hyperloglog leverages the fantastic carmine library for connecting to Redis. Add the following dependencies to your `project.clj`

:

```
[com.taoensso/carmine "2.4.0"]
[hyperloglog "0.2.1"]
```

First we need to set up a carmine connection pool. For example, for a simple redis instance running on `redis.company.com`

, use:

```
(ns my-app (:require [taoensso.carmine :as car]))
(defmacro wcar* [& body] `(car/wcar {:pool {} :spec {:host "redis.company.com"} ~@body))
```

## Usage

### Simple shared counter

Require the core namespace:

`(ns my-app (:require [hyperloglog.core :as hll]))`

Populate the counter:

```
(wcar* (hll/add "5e0b4287-e88d-4d95-99c9-3bac4dded572"))
(wcar* (hll/add "2561d259-6924-4e9b-b9cf-cea7c9eb4cc2"))
...
(wcar* (hll/add "697d99d1-2cd9-4423-a4f8-6577dfd3f923"))
```

Get the estimated number of distinct items:

`(wcar* (hll/count-distinct))`

The `add`

and `count-distinct`

accept a number of options. See their respective docstrings for the full list.

### Time series

Require the time-series namespace:

`(ns my-app (:require [hyperloglog.time-series :as hll-time]))`

Populate the counter:

```
(wcar* (hll-time/add-now "5e0b4287-e88d-4d95-99c9-3bac4dded572"))
(wcar* (hll-time/add-now "2561d259-6924-4e9b-b9cf-cea7c9eb4cc2"))
...
(wcar* (hll-time/add-now "697d99d1-2cd9-4423-a4f8-6577dfd3f923"))
```

Get the estimated number of distinct items over recent time periods:

```
(wcar* (hll-time/count-lastest-distinct hll-time/day)) ; last day
(wcar* (hll-time/count-lastest-distinct (* 6 hll-time/hour))) ; last 6 hours
```

Furthermore an `hll/add-at`

function is available for adding items at a specific time. By default, the time-series are bucketed per hour but this is configurable (see the `hll/add-at`

docstring for details).

## Implementation details

### Accuracy

This library make the following implementation choices

- Uses a 64 bit hashing function for increased accuracy on higher cardinalities (> 1 billion) as suggested by Heule
*et al*[2], instead of the 32-bit hashing function with large range corrections in the original paper [1]. - No small range corrections. If
*m*is the number of observables, the hyperloglog aymptotic expected relative error is not typically attained for cardinalities less than*n = 5/2 m*.

The table below shows the various hyperloglog operating points depending on the choice of the number of observables.

Number of observables (m) | Minimum number of samples (5/2m) | Standard error (1.04/sqrt(m)) |

16 | 40 | 26% |

32 | 80 | 18.38% |

64 | 160 | 13.00% |

128 | 320 | 9.19% |

256 | 640 | 6.50% |

512 | 1280 | 4.60% |

1024 | 2560 | 3.25% |

2048 | 5120 | 2.30% |

4092 | 10230 | 1.63% |

- The number of observables scales linearly with the memory requirements. Note that the observables for a given hyperloglog is backed by a redis hash (storing
*m*fields and*m*value). - The minimum number of samples indicates the the number of samples beyond which the estimate is usually within its expected relative error (
*5/2m*). - The standard error is the square root of the variance divided by the number of samples. Its asymptotic value is
*1.04/sqrt(m)*. See [1] for full details.

This library defaults to *m=1024*, but this is of course configurable (see for example the `hyperloglog.core/add`

docstring`).

### Hashing

The hyperloglog algorithm relies on a good 64 bit hashing function. By default, this library uses the MurmurHash3 hashing function in its 128-bit variant and then keeps only the first 64 bits. Heule *et al* tested MurmurHash3 and all the common hash functions (i.e., MD5, SHA1, SHA256) and found no significant performance difference [2]. Under the hood we use the MurmurHash3 implementation from the byte-streams library. Therefore, the `hyperloglog.core/add`

function should handle anything that byte-streams does (e.g., `String`

, `byte[]`

, `CharSequence`

, `ByteBuffer`

). If you need to specify a custom 64-bit hashing function, then you can pass your own as follows.

```
(def hll-opts {:hashing-fn my-hashing-function})
(wcar* (hll/add my-custom-object hll-opts))
```

## References

- Philippe Flajolet, Eric Fusy, and Olivier Gandouet. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In
*Proc. AOFA*, 2007 - Stefan Heule, Marc Nunkesser, and Alex Hall. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm. In
*Proc EDBT*, 2013

## License

Copyright © 2013 John Whitbeck

Distributed under the Eclipse Public License, the same as Clojure.