pandas-linker

pandas-linker runs comparison windows over different sortings of a pandas DataFrame and links matching rows by assigning them a shared UUID. The library does not do any duplicate detection itself; it provides a harness for running your own comparison functions on your data.

This library is meant for datasets large enough that comparing every row with every other row is impractical. Instead, you choose a sort order for the DataFrame and compare each row only with the other rows inside a sliding window.
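To make the sliding-window idea concrete, here is a minimal sketch in plain pandas. It is not pandas-linker's implementation; the helper `window_pairs` is a hypothetical name introduced only for illustration, and the assumption is that a window of size `w` means each row is compared with the next `w - 1` rows in sorted order.

```python
from itertools import combinations

import pandas as pd


def window_pairs(df, sort_cols, window_size):
    """Yield pairs of index labels whose rows share a sliding window.

    Hypothetical helper for illustration; not part of pandas-linker.
    """
    # A stable sort keeps tied rows in their original relative order.
    labels = df.sort_values(sort_cols, kind="stable").index.tolist()
    for i, a in enumerate(labels):
        # Pair each row with the following (window_size - 1) rows only.
        for b in labels[i + 1 : i + window_size]:
            yield a, b


df = pd.DataFrame(
    {
        "name": ["Pete", "Mary", "Bart", "Mary"],
        "country": ["Spain", "USA", "US", "US"],
    }
)

pairs = list(window_pairs(df, ["name"], window_size=2))
all_pairs = list(combinations(df.index, 2))

print(len(pairs), "windowed comparisons vs", len(all_pairs), "exhaustive comparisons")
```

The number of windowed comparisons grows roughly linearly with the number of rows for a fixed window size, while exhaustive comparison grows quadratically. Running several passes with different sort columns gives rows that are far apart in one ordering a chance to meet in another.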

Install

pip install pandas-linker

Example

Let's say you have a DataFrame like this:

[ix]	name	country
0	Pete	Spain
1	Mary	USA
2	Bart	US
3	Mary	US

and you want to detect similar rows and mark them as such. Here's how to do that:

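import pandas as pd

from pandas_linker import get_linker


def compare_rows(a, b):
    '''Example function that decides if two rows represent the same entity.'''
    # Two rows match when one name is a substring of the other
    return a['name'] in b['name'] or b['name'] in a['name']

# df is a pandas.DataFrame with a unique index
df = pd.DataFrame({
    'name': ['Pete', 'Mary', 'Bart', 'Mary'],
    'country': ['Spain', 'USA', 'US', 'US'],
})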

with get_linker(df, field='uid') as linker:

    print('Comparing in 10 row window sorted by name')
    linker(sort_cols=['name'], window_size=10, cmp=compare_rows)

    print('Comparing in 15 row window sorted by country')
    linker(sort_cols=['country'], window_size=15, cmp=compare_rows)

After the linker has run, the DataFrame has a new uid column:

[ix]	name	country	uid
0	Pete	Spain	7509781940fc471cad5dc32944652d70
1	Mary	USA	8f8dccd91568472daf740e9160349d6c
2	Bart	US	12b55fbe80f64d378193acd727b0e051
3	Mary	US	8f8dccd91568472daf740e9160349d6c

Note that both "Mary" rows in the DataFrame have been identified as representing the same entity and were assigned the same UUID.
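One way rows could end up sharing a UUID is sketched below. This is an assumption about the general technique, not a description of pandas-linker's internals: every row starts with its own identifier, and whenever the comparison function reports a match inside the window, the two rows' groups are merged under one identifier. The window size of 2 and the use of `uuid.uuid4().hex` are illustrative choices.

```python
import uuid

import pandas as pd


def compare_rows(a, b):
    """Match when one name is a substring of the other."""
    return a["name"] in b["name"] or b["name"] in a["name"]


df = pd.DataFrame(
    {
        "name": ["Pete", "Mary", "Bart", "Mary"],
        "country": ["Spain", "USA", "US", "US"],
    }
)

# Every row starts out as its own entity with a fresh identifier.
df["uid"] = [uuid.uuid4().hex for _ in range(len(df))]

# One pass: sort by name and compare rows inside a small window.
window_size = 2  # illustrative value
labels = df.sort_values("name", kind="stable").index.tolist()
for i, a in enumerate(labels):
    for b in labels[i + 1 : i + window_size]:
        if compare_rows(df.loc[a], df.loc[b]):
            # Merge groups: every row carrying b's uid takes a's uid.
            df.loc[df["uid"] == df.at[b, "uid"], "uid"] = df.at[a, "uid"]

print(df)
```

Relabelling the whole group, rather than only row `b`, keeps links from earlier passes intact when a later pass with a different sort order connects two existing groups.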

pandas-linker
Release 0.0.2
