pytokenjoin

pyTokenJoin is a library containing efficient algorithms that solve the set similarity join problem with maximum weighted bipartite matching.


License: Apache-2.0
Install: pip install pytokenjoin==0.1.8

Documentation

pyTokenJoin

Overview

TokenJoin is an efficient method for solving the Fuzzy Set Similarity Join problem. It relies only on tokens and their defined utilities, avoiding pairwise comparisons between elements. The method has been submitted to the International Conference on Very Large Data Bases (VLDB). This repository contains the Python source code. More information about the original method can be found here.
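To make the problem concrete, here is a minimal brute-force sketch of the kind of computation TokenJoin is designed to avoid: scoring two records by comparing every pair of elements and taking a maximum weighted bipartite matching. It is not the library's code or API. The function names, the use of Jaccard as the element-level similarity, and the normalization `matching / (|R| + |S| - matching)` are all assumptions made for illustration.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed element-level similarity)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_matching_score(R, S, sim=jaccard):
    """Brute-force maximum weighted bipartite matching score.

    Only suitable for tiny records: it enumerates every way of assigning
    the elements of the smaller record to distinct elements of the larger.
    """
    small, large = (R, S) if len(R) <= len(S) else (S, R)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        score = sum(sim(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, score)
    return best


def fuzzy_set_similarity(R, S, sim=jaccard):
    """Assumed normalization: matching / (|R| + |S| - matching)."""
    m = max_matching_score(R, S, sim)
    denom = len(R) + len(S) - m
    return m / denom if denom else 1.0


# Each record is a list of elements; each element is a set of tokens.
R = [{"a", "b"}, {"c"}]
S = [{"a", "b"}, {"c", "d"}]
score = fuzzy_set_similarity(R, S)
```

Here the best matching pairs the two identical elements (similarity 1.0) and the two elements that share the token "c" (similarity 0.5), for a matching score of 1.5. A production implementation would replace the permutation search with a polynomial-time assignment algorithm.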

Installation

You can easily install pytokenjoin from PyPI using pip:

pip install pytokenjoin

More on PyPI.

Usage

There are two ways to use TokenJoin:

  • With a similarity threshold δ, e.g. δ = 0.7.
  • With a top-k request, e.g. k = 100.
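The difference between the two modes can be sketched with a naive self-join over a small collection. This is only an illustration of the semantics, assuming that a threshold join returns every pair scoring at least δ and that a top-k join returns the k highest-scoring pairs. The names `threshold_join` and `topk_join` are hypothetical and are not pytokenjoin functions.

```python
import heapq
from itertools import combinations


def jaccard(a, b):
    """Plain Jaccard similarity, used here as a stand-in similarity function."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def threshold_join(records, sim, delta):
    """Return every pair (i, j, score) with score >= delta, by brute force."""
    results = []
    for (i, r), (j, t) in combinations(enumerate(records), 2):
        score = sim(r, t)
        if score >= delta:
            results.append((i, j, score))
    return results


def topk_join(records, sim, k):
    """Return the k highest-scoring pairs as (i, j, score), best first."""
    scored = (
        (sim(r, t), i, j)
        for (i, r), (j, t) in combinations(enumerate(records), 2)
    )
    return [(i, j, s) for s, i, j in heapq.nlargest(k, scored)]


records = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}]
above_threshold = threshold_join(records, jaccard, delta=0.5)
best_pair = topk_join(records, jaccard, k=1)
```

Both functions compare all pairs, which is quadratic in the number of records. They show what each mode returns, not how TokenJoin computes it.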

Two similarity functions are supported: Jaccard and Edit Similarity.
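As a reference for the two measures, the sketch below uses common textbook definitions: Jaccard as intersection over union of two token sets, and edit similarity as one minus the Levenshtein distance divided by the length of the longer string. Whether pytokenjoin normalizes edit similarity in exactly this way is an assumption, and the function names are illustrative rather than part of the library.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Edit distance between two strings, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def edit_similarity(s, t):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


token_score = jaccard({"data", "base"}, {"data", "set"})
string_score = edit_similarity("kitten", "sitting")
```

Jaccard operates on sets of tokens, while edit similarity operates on the characters of two strings, so the choice depends on how the elements of each record are represented.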

More information on how to use the functions can be found in this Jupyter notebook.