dsbox-datacleaning 0.3.1.post1 on PyPI

Missing value imputer

This component is for missing value imputation. This module is designed to support:

multiple ways to impute data, including our self-defined methods.
missing pattern related analysis (to be exposed)

Now the functionality is limited to:

one label problem

Dependencies

To install:

pip install --process-dependency-links git+https://github.com/usc-isi-i2/dsbox-profiling.git@v1.0

Run Tests

python -m unittest discover

Usage:

see imputation pipelines for reference

methods

see methods for details.

One-hot encoder

The encoder takes pandas DataFrame as input, then one-hot encode columns which are considered categorical. For categorical_features = '95in10', it takes a column as category if:

its dtype is not float and
95% of its data fall in 10 values.
For the rest values (not top 10) with low frequency, put into one column [colname]_other_

Note:

Maximum number of values encoded: n_limit, Whether to convert other text columns to integers: text2int.
Apply set_params() function to change the two parameters' values.
For one-hot encoded columns, in the output there would always be a [colname]_other_ column for values not appear in fitted data and values with fewer occurrence (when there are more than n_limit distinct values).

Usage:

from dsbox.datapreprocessing.cleaner import Encoder, EncHyperparameter

train_x = pd.read_csv(train_dataset)
test_x = pd.read_csv(test_dataset)

hp = EncHyperparameter(text2int=True,n_limit=12,categorical_features='95in10')
enc = Encoder(hyperparams=hp)
enc.set_training_data(inputs=train_x)
enc.fit()

result = enc.produce(inputs=train_x)

p = enc.get_params()
enc2 = Encoder(hyperparams=hp)
enc2.set_params(params=p)
result2 = enc2.produce(inputs=test_x)

TODO:

~~Find better way to distinguish categorical columns - by statistics?~~ by Profiler
~~More functionality and more flexible implementation for user to config prefered setting.~~
~~Deal with ID-like columns: identify (also let user decide?) and delete ?~~ Will have encoders which users can specify columns to encode.

Unary encoder

The encoder takes pandas DataFrame and specified column name(s) as input, then unary encode the column(s).

Usage:

see encoder pipelines for reference

Discretizer

Take a column (pandas Series) as input, output a column with discretized values. For the discretize() function:

by: "width": discretize by equal width; "frequency": discretize by equal frequency; "kmeans": discretize by kmeans clustering; "gmm": discretize by Gaussian mixure models clustering. default by="width".
num_bins: number of bins. default num_bins=10.
labels: list of values for the discretized bins, currently only for binning methods where orders of values are kept (by width and by frequency). default labels= [0,1,2...].

Note, currently:

Missing cells remain missing in the output column.

Usage:

from dsbox.datapreprocessing.cleaner import discretizer

data = pd.read_csv('yourDataset.csv')
col = data["column_name"]
# 10 bins, discretize by equal width
result = discretizer.discretize(col)
# 5 bins, discretize by gmm
result = discretizer.discretize(col,num_bins=5,by='gmm')
# or you can replace original column in the dataset with discretized values
data["column_name"] = result