automated data cleaning tool


License
MIT
Install
pip install datacleanbot==0.91

datacleanbot

Automated data cleaning tool. The goal of datacleanbot is as follows: given an arbitrary parsed raw dataset representing a supervised learning problem, the tool automatically identifies potential issues and reports the results and recommendations to the end user in an effective way.

Install

$ pip install datacleanbot

QuickStart

Install OpenML (version 0.9.0):

OpenML is used to easily import datasets and share models and experiments.

$ pip install openml==0.9.0

On Windows, a C++ compiler must be installed first.

Acquire data from OpenML:

>>> import numpy as np
>>> import openml as oml
>>> data = oml.datasets.get_dataset(id)  # id: OpenML dataset id
>>> X, y, categorical_indicator, features = data.get_data(target=data.default_target_attribute, dataset_format='array')
>>> # Append the target y as the last column of the feature matrix X
>>> Xy = np.concatenate((X, y.reshape((y.shape[0], 1))), axis=1)
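To make the shape of Xy concrete, here is a minimal sketch that applies the same concatenation to toy arrays. The values of X and y are made up for illustration and stand in for what an OpenML dataset would return.

```python
import numpy as np

# Toy stand-ins for the arrays returned by data.get_data (values are made up)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 samples, 2 features
y = np.array([0, 1, 0])                             # 3 target labels

# Reshape y into a column vector and append it as the last column of X
Xy = np.concatenate((X, y.reshape((y.shape[0], 1))), axis=1)

print(Xy.shape)  # (3, 3): two feature columns plus the target column
```

The target ends up in the last column, so Xy[:, -1] recovers y and Xy[:, :-1] recovers X.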

Autoclean data with datacleanbot:

>>> import datacleanbot.dataclean as dc
>>> Xy = dc.autoclean(Xy, data.name, features)

Description

datacleanbot is equipped with the following capabilities:

  • Present an overview report of the given dataset
    • The most important features
    • Statistical information (e.g., mean, max, min)
    • Data types of features
  • Clean common data problems in the raw dataset
    • Duplicated records
    • Inconsistent column names
    • Missing values
    • Outliers

The two aspects datacleanbot meaningfully automates are marked in bold.
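For readers unfamiliar with the four data problems listed above, the following sketch shows what each one looks like when handled by hand with pandas. It is not datacleanbot's implementation. The toy DataFrame, the median imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset exhibiting all four problems
df = pd.DataFrame({
    " Age ": [25, 25, np.nan, 31, 120],
    "Income": [40.0, 40.0, 52.0, 48.0, 50.0],
})

# Inconsistent column names: strip whitespace and lowercase
df.columns = [c.strip().lower() for c in df.columns]

# Duplicated records: drop exact duplicate rows
df = df.drop_duplicates().reset_index(drop=True)

# Missing values: fill with the column median (one simple strategy among many)
df = df.fillna(df.median(numeric_only=True))

# Outliers: flag values outside 1.5 * IQR of the "age" column
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

print(df[outlier])  # rows flagged as outliers
```

Each step here requires a manual choice, such as which imputation strategy or outlier rule to use. Choices of this kind are what an automated cleaning tool aims to make or recommend.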

User's Guide

The user's guide can be found at datacleanbot.