Matchtools
Streamline data matching and integration processes. Compare single values (or entire rows) containing different kinds of data and specify match tolerance for each.
Requirements
- Python 3.5 or higher
- datefinder
- fuzzywuzzy
- geopy
- roman
Installation
$ pip install matchtools
Documentation
Please refer to the documentation for the API description and to its cookbook for more real-life examples.
Usage
The matching process is handled by the MatchBlock class.
When instantiated, a MatchBlock object tries to guess and extract the following data types from the supplied string, int or float, and saves each one as an instance attribute:
- number – only if an int or float was supplied (for a number extracted from a string, see str_number below)
- date – a datetime object, extracted from a string
- coordinates – latitude and longitude, separated by a comma, extracted from a string
- string – any number of chars extracted from a string
- str_number – any number, or char with a number attached, extracted from a string
- str_custom – filled with a value from a dictionary during a string substitution operation
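To make the difference between the string and str_number types concrete, here is a minimal sketch using only the standard library. It is not MatchBlock's implementation: the helper name `split_tokens` and the tokenisation rule are assumptions made for illustration, and dates and coordinates are ignored entirely.

```python
import re


def split_tokens(text):
    """Hypothetical helper: split text into letter-only tokens and
    tokens that contain at least one digit."""
    # Treat every run of letters/digits as one token, so '_', '-' and
    # spaces all act as separators.
    tokens = re.findall(r'[A-Za-z0-9]+', text)
    strings = sorted(t.lower() for t in tokens if t.isalpha())
    str_numbers = sorted(t.lower() for t in tokens
                         if any(c.isdigit() for c in t))
    return strings, str_numbers


print(split_tokens('New York 35'))
print(split_tokens('35_new-york'))
```

Both calls produce the same pair of lists, which gives an intuition for why two differently formatted strings can still be compared type by type.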
When two MatchBlock objects are compared, each data type in one object is compared with the corresponding type in the other. To decide whether a pair of values is a match, each type uses its own tolerance parameter, which should be set beforehand:
- number_tolerance – absolute difference between numbers compared to tolerance
- date_tolerance – difference between two dates compared to tolerance set in days
- coordinates_tolerance – distance between two points in km (by default) compared to tolerance
- string_tolerance – difference between two strings using fuzzywuzzy's UWRatio method (by default) compared to tolerance
- str_number_tolerance – same as string_tolerance but for numbers and chars with numbers
- str_custom_tolerance – same as string_tolerance but for strings obtained through substitution
If all of the data types match, the two MatchBlock objects are considered a match.
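The tolerance rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, the comparisons are assumed to be inclusive (`<=`), and the haversine formula stands in for whatever distance calculation geopy performs.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt


def numbers_match(a, b, tolerance):
    # Absolute difference compared to the tolerance (inclusive is assumed).
    return abs(a - b) <= tolerance


def dates_match(a, b, tolerance_days):
    # Difference between two dates, measured in whole days.
    return abs((a - b).days) <= tolerance_days


def haversine_km(p, q):
    # Great-circle distance between two (lat, lon) points in km,
    # using a mean Earth radius of 6371 km.
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def coordinates_match(p, q, tolerance_km):
    return haversine_km(p, q) <= tolerance_km


# Two records match only if every data type matches.
checks = [
    numbers_match(5, 10, tolerance=10),
    dates_match(date(2015, 5, 1), date(2015, 5, 5), tolerance_days=5),
    coordinates_match((52.37403, 4.88969), (52.37403, 4.88969), tolerance_km=0),
]
print(all(checks))
```

Requiring every check to pass mirrors the rule that a single mismatching data type is enough to reject the pair.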
MatchBlock also has some useful class methods for string transformations which can be used alone.
Examples
>>> from matchtools import MatchBlock
Comparison
- Two strings containing string and string number data types:
>>> MatchBlock('New York 35') == MatchBlock('35_new-york')
True
- Two strings containing string and date data types:
>>> MatchBlock('Madrid_2016-09-05') == MatchBlock('5_Sep_2016_madrid')
True
Data manipulation
Standardise names in a pandas DataFrame using MatchBlock class methods:
from matchtools import MatchBlock, move_element_to_back
import pandas as pd
input_data = [('nord IV N1'), ('west 3 W001'), ('e 02 E01'), ('sud 1 S1')]
df = pd.DataFrame(input_data, columns = ['Name'])
def standardize(element):
    element = move_element_to_back(element, 1)
    element = MatchBlock.dict_sub(element)
    element = MatchBlock.roman_to_integers(element)
    element = MatchBlock.strip_zeros(element)
    return element
df['Standardized'] = df.apply(lambda row: standardize(row['Name']), axis=1)
print(df)
          Name  Standardized
0   nord IV N1    north N1 4
1  west 3 W001     west W1 3
2     e 02 E01     east E1 2
3     sud 1 S1    south S1 1
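The zero-stripping step in the output above (W001 becoming W1, 02 becoming 2) can be approximated with a single regular expression. This is only a sketch: `strip_leading_zeros` is a hypothetical stand-in, and how the real MatchBlock.strip_zeros treats edge cases such as decimals is not covered here.

```python
import re


def strip_leading_zeros(text):
    """Hypothetical stand-in: drop zeros that start a run of digits."""
    # Remove zeros that are not preceded by a digit or a dot and that are
    # followed by another digit, so '100', '0' and '0.05' stay unchanged.
    return re.sub(r'(?<![\d.])0+(?=\d)', '', text)


print(strip_leading_zeros('west 3 W001'))
print(strip_leading_zeros('e 02 E01'))
```

The lookbehind keeps zeros inside a number intact, and the lookahead keeps a lone 0 from being deleted.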
Data matching
Find the record in a dataset that matches an input record:
from matchtools import MatchBlock, match_find
MatchBlock.number_tolerance = 10
MatchBlock.date_tolerance = 5
MatchBlock.coordinates_tolerance = 0
MatchBlock.string_tolerance = 0
MatchBlock.str_number_tolerance = 0
record_1 = ['Flight 3', 5, '1 May 2015', '52.3740300, 4.8896900']
records = [['Flight 1', 0, '3 May 2015', '52.3740300, 4.8896900'],
           ['Flight 2', 5, '4 May 2016', '52.3740300, 4.8896900'],
           ['Flight 3', 10, '5 May 2015', '52.3740300, 4.8896900'],
           ['Flight 3', 15, '6 May 2015', '52.3740300, 4.8896900']]
print(match_find(record_1, records))
['Flight 3', 10, '5 May 2015', '52.3740300, 4.8896900']
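To see why the third row is the one returned, the search can be reproduced with a standard-library sketch. The assumptions are loud ones: the functions `rows_match` and `find_first_match` are hypothetical, the first matching row is assumed to win, names and coordinates are compared for exact equality (tolerance 0), and dates are parsed with a fixed format rather than datefinder.

```python
from datetime import datetime


def parse_date(text):
    # Fixed format for this example only, e.g. '1 May 2015'.
    return datetime.strptime(text, '%d %B %Y')


def rows_match(a, b, number_tolerance, date_tolerance):
    name_a, num_a, date_a, coords_a = a
    name_b, num_b, date_b, coords_b = b
    days_apart = abs((parse_date(date_a) - parse_date(date_b)).days)
    return (name_a == name_b                       # exact match, tolerance 0
            and abs(num_a - num_b) <= number_tolerance
            and days_apart <= date_tolerance
            and coords_a == coords_b)              # exact match, tolerance 0


def find_first_match(record, candidates, number_tolerance, date_tolerance):
    # Return the first candidate whose every field is within tolerance.
    for candidate in candidates:
        if rows_match(record, candidate, number_tolerance, date_tolerance):
            return candidate
    return None


record_1 = ['Flight 3', 5, '1 May 2015', '52.3740300, 4.8896900']
records = [['Flight 1', 0, '3 May 2015', '52.3740300, 4.8896900'],
           ['Flight 2', 5, '4 May 2016', '52.3740300, 4.8896900'],
           ['Flight 3', 10, '5 May 2015', '52.3740300, 4.8896900'],
           ['Flight 3', 15, '6 May 2015', '52.3740300, 4.8896900']]

result = find_first_match(record_1, records,
                          number_tolerance=10, date_tolerance=5)
print(result)
```

The first two rows fail on the flight name, while the third is within 10 of the number and within 5 days of the date, so it is returned before the fourth row is examined.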