SimSearch

Implementation of Bayesian Sets for fast similarity searches


License
GPL-2.0+

Install
pip install SimSearch

Documentation

SimSearch is an item-based retrieval engine that implements Bayesian Sets. Bayesian Sets is a new framework for information retrieval in which a query consists of a set of items that are examples of some concept. The result is a set of items that attempts to capture the concept exemplified by the query.

For example, given a query consisting of the two animated movies "Lilo & Stitch" and "Up", Bayesian Sets would return other similar animated movies such as "Toy Story". There is a nice blog post about item-based search with Bayesian Sets. Feel free to read through it.
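To make the idea concrete, here is a minimal sketch of Bayesian Sets scoring over binary features, following the log-linear form of the original Bayesian Sets formulation with Beta priors set to c times each feature's mean. It is not SimSearch's own code or API: the function names, the toy MOVIES data and the choice c=2.0 are assumptions made for illustration only.

```python
import math

# Toy collection (hypothetical data): item id -> set of binary features.
MOVIES = {
    "Lilo & Stitch": {"animation", "family", "disney"},
    "Up": {"animation", "family", "pixar"},
    "Toy Story": {"animation", "family", "pixar"},
    "Alien": {"horror", "scifi"},
    "Heat": {"crime", "thriller"},
}


def bayesian_sets_weights(items, query_ids, c=2.0):
    """Return (constant, per-feature weights) for a query of example items."""
    n_items = len(items)
    n_query = len(query_ids)

    # How many items in the whole collection have each feature.
    total = {}
    for feats in items.values():
        for f in feats:
            total[f] = total.get(f, 0) + 1

    # How many query items have each feature.
    in_query = {}
    for qid in query_ids:
        for f in items[qid]:
            in_query[f] = in_query.get(f, 0) + 1

    const, weights = 0.0, {}
    for f, cnt in total.items():
        if cnt == n_items:
            continue  # present everywhere: uninformative, and beta would be 0
        mean = cnt / n_items
        alpha, beta = c * mean, c * (1.0 - mean)  # Beta prior from the mean
        s = in_query.get(f, 0)
        alpha_q = alpha + s                # posterior after seeing the query
        beta_q = beta + n_query - s
        const += (math.log(alpha + beta) - math.log(alpha + beta + n_query)
                  + math.log(beta_q) - math.log(beta))
        weights[f] = (math.log(alpha_q) - math.log(alpha)
                      - math.log(beta_q) + math.log(beta))
    return const, weights


def rank_by_query(items, query_ids):
    """Score every non-query item and sort by descending log score."""
    const, weights = bayesian_sets_weights(items, query_ids)
    scored = [
        (item_id, const + sum(weights.get(f, 0.0) for f in feats))
        for item_id, feats in items.items()
        if item_id not in query_ids
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for item_id, score in rank_by_query(MOVIES, ["Lilo & Stitch", "Up"]):
    print(item_id, round(score, 3))
```

Because the score is a constant plus a sum of per-feature weights, ranking a large collection reduces to a sparse matrix-vector product, which is what makes this kind of search fast.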

This module also adds the novel ability to combine full-text queries with items. For example, a query can be a combination of items and full-text search keywords. In this case, the results match the keywords and are re-ranked by similarity to the queried items.
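The combined query can be pictured as a two-step process: filter by keywords, then re-rank by similarity to the example items. The sketch below is illustrative only and does not reflect SimSearch's actual interface. The function names and DOCS data are hypothetical, and a simple average Jaccard overlap stands in for the Bayesian Sets score to keep the example short.

```python
# Hypothetical toy documents: id -> text.
DOCS = {
    "lilo": "animated family film about an alien in hawaii",
    "up": "animated family adventure with a flying house",
    "toy": "animated family adventure about toys",
    "alien": "horror film about an alien on a spaceship",
}


def tokens(text):
    """Bag-of-words features: the set of lowercase whitespace tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Stand-in similarity (NOT Bayesian Sets): overlap of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def keyword_then_rerank(docs, keywords, query_ids):
    """Keep docs containing every keyword, then order by similarity to the query items."""
    feats = {doc_id: tokens(text) for doc_id, text in docs.items()}
    wanted = {k.lower() for k in keywords}
    query_feats = [feats[qid] for qid in query_ids]

    # Step 1: full-text filter (all keywords must be present).
    candidates = [
        doc_id for doc_id, f in feats.items()
        if wanted <= f and doc_id not in query_ids
    ]

    # Step 2: re-rank the matches by average similarity to the query items.
    def similarity(doc_id):
        return sum(jaccard(feats[doc_id], q) for q in query_feats) / len(query_feats)

    return sorted(candidates, key=similarity, reverse=True)


print(keyword_then_rerank(DOCS, ["alien"], ["up"]))
```

Here both matches contain the keyword "alien", but the animated family film is placed ahead of the horror film because it shares more features with the example item.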

It is important to note that Bayesian Sets does not care about the actual feature engineering. In fact, SimSearch only implements a simple bag-of-words model. However, other feature types are possible as long as they can be binarized. The index is a set of files in .xco and .yco formats (more in the tutorial) that represent the presence of a feature value in a given item. As long as you can create these files, SimSearch can read them and perform the matching.
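As an illustration of what "binarized" means here, the following sketch turns raw text into a sparse binary bag-of-words representation: each item maps to the list of feature indices that are present, and counts are discarded. It does not write the .xco or .yco files, whose layout is described in the tutorial. The function name and in-memory structures are assumptions for illustration.

```python
def binarize_bag_of_words(docs):
    """Map each document to the sorted indices of the words it contains.

    Returns (vocab, rows): vocab maps word -> feature index, and rows maps
    doc id -> sorted list of indices of features present in that document.
    """
    vocab = {}
    rows = {}
    for doc_id, text in docs.items():
        present = set()
        for token in text.lower().split():
            # Assign the next free index the first time a word is seen.
            index = vocab.setdefault(token, len(vocab))
            present.add(index)  # presence only: repeated words count once
        rows[doc_id] = sorted(present)
    return vocab, rows


vocab, rows = binarize_bag_of_words({
    "d1": "red red apple",
    "d2": "green apple",
})
print(vocab)
print(rows)
```

Any other feature type (tags, categories, thresholded numeric values) could be fed through the same kind of presence mapping, since only the presence or absence of each feature matters to the scoring.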

SimSearch has been tested on datasets with millions of documents and hundreds of thousands of features. Future plans include distributed search and real-time indexing. For more information, please follow the tutorial.