DataGlitch: A Python Toolkit for Messy Data

DataGlitch is a Python package designed to address common data challenges in a pandas DataFrame, including handling mixed data types, non-ASCII values, and facilitating dataset exploration.

pip install DataGlitch

Usage

DataGlitch currently offers three functionalities:

dtype_detector: Find mixed data types in columns.
nonascii_handler: Detect and handle non-ASCII characters.
data_search: Search for the existence of specific columns or values.

dtype_detector

The dtype_detector uses regular expressions to detect different data types in a column through the find_numeric() function. This function takes a pandas Series with data type Object as an argument and returns three new variables: numeric, ambiguous and non_numeric. Each of these contain a subset of the original column.

numeric can contain any numeric value or any string value interpreted as numeric by the regular expression:

Integers: whole numbers, both positive and negative. Examples: -42, 123.
Floating-point numbers: decimal numbers, both positive and negative. Examples: -3.14, 0.5.
Numbers in scientific notation. Examples: 1.23e-4, -2E+10.
Fractions: fractions in the form of numerator/denominator. Examples: 1/2, -3/4.
Numbers with comma or dot as decimal separator. Examples: 1,000, 3.5.
Numbers with multiple decimal separators (comma or dot). Examples: 1.0.0, 2,000.5.6.

ambiguous can contain any value that is interpreted as numeric by the regular expression if it also has alphabetic or special characters at its front or end. Examples: 50cents, $10.

non_numeric can contain any value that is not classified as numeric or ambiguous. Examples: 2000-01-01, DataGlitch6000!.

# Import and apply
from DataGlitch.dtype_detector import find_numeric
numeric, ambiguous, non_numeric = find_numeric(df["col"])

Example operations:

# Drop non_numeric subset from dataframe
df = df[~df["col"].isin(non_numeric)]

# Replace values in subset with NAs
import numpy as np
df.loc[df["col"].isin(non_numeric), "col"] = np.nan

Other operations can occur directly on the dataframe. For instance, if wanting to correct comma-separated floats that have been identified in the column:

df["col"] = df["col"].replace(",", ".", regex=True)

nonascii_handler

The nonascii_handler uses the find_nonascii() function to locate rows/values with non-ascii characters in a DataFrame or Series with data type Object. For handling non-ascii, the user is offerred three options:

Drop all rows that contain non-ascii values by setting the drop parameter to True.
Replace values with non-ascii with np.nan to indicate missigness by setting the remove parameter to True.
Translate values with non-ascii with the unidecode library by setting the translate parameter to True.

If none of the above options are selected, the data is returned as is (default).

from DataGlitch.nonascii_handler import find_nonascii
df_ascii = find_nonascii(df, drop=False, remove=False, translate=True)

data_search

data_search performs fuzzy string matching through the rapidfuzz library. It looks for the existance of columns in a dataset or particular values within a column.

from data_search import column_search, value_search

Columns are identified through the column_search() function which takes a pandas DataFrame, the name of the column as a string, and a cut-off score which defaults to 80. The output contains any matches and their similarity score. For a less strict search, the cut-off score can be lowered.

column_search(df, "column_name", score_cutoff=80)

The value_search() function looks for the existance of a value in a pandas Series. Even if the value under investigation is integer/float, it should still be passed as string to the function. The output includes any matches and their similarity score.

value_search(df["col"], "value", score_cutoff=80)

DataGlitch
Release 0.0.2

Release 0.0.2

0.0.1

0.0.2

0.1

Documentation

DataGlitch: A Python Toolkit for Messy Data

Usage

dtype_detector

nonascii_handler

data_search

Stats

Development practices

Releases

Contributors

DataGlitch Release 0.0.2

Release 0.0.2 Toggle Dropdown 0.0.1 0.0.2 0.1

Documentation

DataGlitch: A Python Toolkit for Messy Data

Usage

dtype_detector

nonascii_handler

data_search

Stats

Development practices

Releases

Contributors

DataGlitch
Release 0.0.2

Release 0.0.2

0.0.1

0.0.2

0.1