matchtools

A set of tools for data matching and manipulation


Keywords
match, matching, comparison, data, attributes, alignment, integration
License
MIT
Install
pip install matchtools==0.1.2

Matchtools

Streamline data matching and integration processes. Compare single values (or entire rows) containing different kinds of data and specify match tolerance for each.

Requirements

Installation

pip install matchtools

Documentation

Refer to the documentation for the API description; the cookbook gives more real-life examples.

Usage

The matching process is handled by the MatchBlock class.

When instantiated, a MatchBlock object tries to guess and extract the following data types from the supplied string, int or float, and saves each one as an instance attribute:

  • number – only if an int or float was supplied (for numbers extracted from a string, see str_number below)
  • date – as a datetime object, extracted from a string
  • coordinates – latitude and longitude, separated by a comma, extracted from a string
  • string – any run of characters extracted from a string
  • str_number – any number, or characters with a number attached, extracted from a string
  • str_custom – filled with a value from a dictionary during a string substitution operation
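
As a rough illustration of the string and str_number descriptions above, the sketch below splits a string into letter-only words and tokens that contain a digit. It is not how matchtools is implemented: the function name split_string_parts is hypothetical, and the date, coordinates and str_custom types are left out.

```python
import re


def split_string_parts(text):
    """Split text into letter-only words and tokens that contain a digit."""
    # Alphanumeric runs; spaces, underscores and hyphens act as separators.
    tokens = re.findall(r'[^\W_]+', text)
    words = [t for t in tokens if not any(c.isdigit() for c in t)]
    numbered = [t for t in tokens if any(c.isdigit() for c in t)]
    return {'string': ' '.join(words), 'str_number': ' '.join(numbered)}


print(split_string_parts('New York 35'))
# {'string': 'New York', 'str_number': '35'}
print(split_string_parts('35_new-york'))
# {'string': 'new york', 'str_number': '35'}
```

Separating the parts this way suggests why two differently formatted strings can still be compared type by type.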

When MatchBlock objects are compared, each data type in one object is compared to the corresponding type in the other. Whether a pair of values counts as a match depends on that type's tolerance parameter, which should be set beforehand:

  • number_tolerance – absolute difference between numbers compared to tolerance
  • date_tolerance – difference between two dates compared to tolerance set in days
  • coordinates_tolerance – distance between two points in km (by default) compared to tolerance
  • string_tolerance – difference between two strings using fuzzywuzzy's UWRatio method (by default) compared to tolerance
  • str_number_tolerance – same as string_tolerance but for numbers and chars with numbers
  • str_custom_tolerance – same as string_tolerance but for strings obtained through substitution

If all data types match, the two MatchBlock objects are considered a match.
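
The tolerance rules can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the library's code: all function names are hypothetical, the comparisons are assumed to be inclusive (difference <= tolerance), the haversine formula is assumed for the distance in km, and difflib's SequenceMatcher stands in for fuzzywuzzy's UWRatio.

```python
from datetime import date
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def numbers_match(a, b, tolerance):
    # Absolute difference between the numbers, compared to the tolerance.
    return abs(a - b) <= tolerance


def dates_match(a, b, tolerance_days):
    # Difference between the dates in whole days.
    return abs((a - b).days) <= tolerance_days


def haversine_km(p, q):
    # Great-circle distance between two (latitude, longitude) points.
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def coordinates_match(p, q, tolerance_km):
    return haversine_km(p, q) <= tolerance_km


def strings_match(a, b, tolerance):
    # Stand-in similarity score on a 0-100 scale (not UWRatio).
    similarity = 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return (100 - similarity) <= tolerance


# Two records match only if every per-type check passes.
checks = [
    numbers_match(5, 10, tolerance=10),
    dates_match(date(2015, 5, 1), date(2015, 5, 5), tolerance_days=5),
    coordinates_match((52.37403, 4.88969), (52.37403, 4.88969), tolerance_km=0),
    strings_match('Flight', 'flight', tolerance=0),
]
print(all(checks))  # True
```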

MatchBlock also has class methods for string transformations, which can be used on their own.

Examples

>>> from matchtools import MatchBlock
Comparison
  • Two strings containing string and string number data types:
>>> MatchBlock('New York 35') == MatchBlock('35_new-york')
True
  • Two strings containing string and date data types:
>>> MatchBlock('Madrid_2016-09-05') == MatchBlock('5_Sep_2016_madrid')
True
Data manipulation

Standardise names in a pandas DataFrame using MatchBlock class methods:

from matchtools import MatchBlock, move_element_to_back
import pandas as pd

input_data = ['nord IV N1', 'west 3 W001', 'e 02 E01', 'sud 1 S1']

df = pd.DataFrame(input_data, columns=['Name'])

def standardize(element):
    element = move_element_to_back(element, 1)
    element = MatchBlock.dict_sub(element)
    element = MatchBlock.roman_to_integers(element)
    element = MatchBlock.strip_zeros(element)
    return element

df['Standardized'] = df['Name'].apply(standardize)

print(df)
          Name Standardized
0   nord IV N1   north N1 4
1  west 3 W001    west W1 3
2     e 02 E01    east E1 2
3     sud 1 S1   south S1 1
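
To make the four steps concrete without installing the library, here is a standalone sketch that reproduces the table above using only the standard library. The helper names and the SUBSTITUTIONS dictionary are hypothetical and chosen to fit this example; the actual behaviour of move_element_to_back, dict_sub, roman_to_integers and strip_zeros may differ in detail.

```python
import re

# Hypothetical substitution table, chosen to fit the example data.
SUBSTITUTIONS = {'nord': 'north', 'sud': 'south', 'e': 'east'}
ROMAN_VALUES = {'I': 1, 'V': 5, 'X': 10}


def move_to_back(text, index):
    # Move the whitespace-separated token at `index` to the end.
    tokens = text.split()
    tokens.append(tokens.pop(index))
    return ' '.join(tokens)


def substitute(text):
    # Replace whole tokens found in the substitution table.
    return ' '.join(SUBSTITUTIONS.get(t, t) for t in text.split())


def roman_to_int(token):
    # Subtract a numeral when a larger one follows it (IV -> 4).
    total = 0
    for current, following in zip(token, token[1:] + ' '):
        value = ROMAN_VALUES[current]
        smaller = following in ROMAN_VALUES and ROMAN_VALUES[following] > value
        total += -value if smaller else value
    return total


def romans_to_integers(text):
    # Convert only tokens made up entirely of the numerals I, V and X.
    return ' '.join(
        str(roman_to_int(t)) if re.fullmatch(r'[IVX]+', t) else t
        for t in text.split()
    )


def strip_leading_zeros(text):
    # Drop zeros that start a run of digits: 'W001' -> 'W1', '02' -> '2'.
    return re.sub(r'(?<!\d)0+(?=\d)', '', text)


def standardize_sketch(name):
    name = move_to_back(name, 1)
    name = substitute(name)
    name = romans_to_integers(name)
    return strip_leading_zeros(name)


for name in ['nord IV N1', 'west 3 W001', 'e 02 E01', 'sud 1 S1']:
    print(name, '->', standardize_sketch(name))
```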
Data matching

Find the record in a dataset that matches an input record:

from matchtools import MatchBlock, match_find

MatchBlock.number_tolerance = 10
MatchBlock.date_tolerance = 5
MatchBlock.coordinates_tolerance = 0
MatchBlock.string_tolerance = 0
MatchBlock.str_number_tolerance = 0

record_1 = ['Flight 3', 5, '1 May 2015', '52.3740300, 4.8896900']

records = [['Flight 1', 0, '3 May 2015', '52.3740300, 4.8896900'],
           ['Flight 2', 5, '4 May 2016', '52.3740300, 4.8896900'],
           ['Flight 3', 10, '5 May 2015', '52.3740300, 4.8896900'],
           ['Flight 3', 15, '6 May 2015', '52.3740300, 4.8896900']]

print(match_find(record_1, records))
['Flight 3', 10, '5 May 2015', '52.3740300, 4.8896900']
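
The search can be pictured as a scan over the candidate records that returns the first one whose fields are all within tolerance. The sketch below is a simplified stand-in, not the match_find implementation: the function names are hypothetical, dates are parsed with a single fixed format, and strings and coordinates are compared by plain equality (which corresponds to a tolerance of 0). Whether match_find itself returns the first match or the best one is not stated here, so "first match" is an assumption.

```python
from datetime import datetime


def parse_date(text):
    # Only handles dates written like '1 May 2015'.
    return datetime.strptime(text, '%d %B %Y')


def fields_match(a, b, number_tolerance=10, date_tolerance_days=5):
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b) <= number_tolerance
    try:
        return abs((parse_date(a) - parse_date(b)).days) <= date_tolerance_days
    except (TypeError, ValueError):
        # Not a pair of dates: require an exact match (tolerance 0).
        return a == b


def find_first_match(record, candidates):
    # Return the first candidate whose every field matches, else None.
    for candidate in candidates:
        if len(candidate) == len(record) and all(
            fields_match(a, b) for a, b in zip(record, candidate)
        ):
            return candidate
    return None


record_1 = ['Flight 3', 5, '1 May 2015', '52.3740300, 4.8896900']
records = [['Flight 1', 0, '3 May 2015', '52.3740300, 4.8896900'],
           ['Flight 2', 5, '4 May 2016', '52.3740300, 4.8896900'],
           ['Flight 3', 10, '5 May 2015', '52.3740300, 4.8896900'],
           ['Flight 3', 15, '6 May 2015', '52.3740300, 4.8896900']]

print(find_first_match(record_1, records))
# ['Flight 3', 10, '5 May 2015', '52.3740300, 4.8896900']
```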