clj-similar

Fast similar set lookup using MinHash and K-d trees


License
EPL-1.0

Documentation

clj-similar

Clojars Project

Experimental library for (fast) similar set lookup. Under the hood it uses a MinHash to compute locality sensitive hashes of a collection of sets, and loads them into a k-d tree. The constructed similar data structure can be used to retrieve nearest neighbors of a given target set. While the construction of the data structure can be expensive, lookups should be fast.

Note that it will always return some set that is considered nearest. Thresholds for a value that is "too dissimilar" is currently at your own discretion, the distance can be accessed with the meta data on the returned result(s).

Caveats

  • Collections of sets that have more than the maximum integer value of distinct values are currently unsupported.
  • The data structure is read-only, support for modifications is not currently planned (but pull requests welcome).

Usage

(require '[clj-similar.core :refer [similar nearest]])
(def coll [#{"a" "b" "c"} #{"d" "e" "c"} #{"f" "e" "a" "b"}])
;; Creates the data structure, optionally an error rate can be defined (default 0.05)
(def s (similar coll))
(def s (similar coll 0.01)) ;; different error rate

;; A single nearest neighbor
(nearest s #{"f" "e" "a" "b"})
;=> #{"f" "e" "a" "b"}

(nearest s #{"f" "e" "a" "b" "x"})
;=> #{"f" "e" "a" "b"}

(nearest s #{"f" "e" "a"})
;=> #{"f" "e" "a" "b"}

(nearest s #{"a"})
;=> #{"a" "b" "c"}

;; Two nearest neighbors
(nearest s #{"a" "b"} 2)
;=> (#{"a" "b" "c"} #{"f" "e" "a" "b"})

;; To access the distance metrics and computed point use the associated metadata
(meta (nearest s #{"a" "b"}))

;; The values of the sets can be any Clojure data structure, even other collections
(def coll [#{["a"] ["a" "b"]} #{["c" "d"] ["a" "c"]}])
(def s (similar coll))
(nearest s #{["a" "b"]})
;=> #{["a" "b"] ["a"]}

Dependencies

License

Copyright © 2016 Joël Kuiper

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.