Automated Data Cleaning Tool.
The main goal is to develop a Python tool, datacleanbot, such that:
Given a random parsed raw dataset representing a supervised learning problem, the tool can automatically identify potential issues and report the results and recommendations to the end user in an effective way.
Install datacleanbot:
$ pip install datacleanbot
Install OpenML (version 0.9.0):
OpenML is used to easily import datasets and share models and experiments.
$ pip install openml==0.9.0
On Windows, a C++ compiler must be installed.
Acquire data from OpenML:
>>> import numpy as np
>>> import openml as oml
>>> data = oml.datasets.get_dataset(id)  # id: openml dataset id
>>> X, y, categorical_indicator, features = data.get_data(target=data.default_target_attribute, dataset_format='array')
>>> # Append the target y as the last column of the feature matrix X
>>> Xy = np.concatenate((X, y.reshape((y.shape[0], 1))), axis=1)
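The concatenation step can be tried without downloading anything. The following is a minimal sketch that assumes small made-up arrays in place of the `X` and `y` returned by OpenML; the values are purely illustrative.

```python
import numpy as np

# Hypothetical stand-ins for the arrays returned by data.get_data(...)
X = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]])  # shape (n_samples, n_features)
y = np.array([0, 0, 1])                             # shape (n_samples,)

# Turn y into a column vector of shape (n_samples, 1) so it can sit next to X
y_column = y.reshape((y.shape[0], 1))

# Stack the features and the target side by side; the target ends up last
Xy = np.concatenate((X, y_column), axis=1)
print(Xy.shape)
```

The reshape is needed because `np.concatenate` along `axis=1` expects both inputs to be two-dimensional with the same number of rows.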
Autoclean data with datacleanbot:
>>> import datacleanbot.dataclean as dc
>>> Xy = dc.autoclean(Xy, data.name, features)
datacleanbot is equipped with the following capabilities:
- Present an overview report of the given dataset
  - The most important features
  - Statistical information (e.g., mean, max, min)
  - Data types of features
- Clean common data problems in the raw dataset
  - Duplicated records
  - Inconsistent column names
  - Missing values
The two aspects that datacleanbot meaningfully automates are marked in bold.
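To make the listed data problems concrete, here is a minimal NumPy sketch of how duplicated records, inconsistent column names, and missing values might be detected by hand. It is not datacleanbot's implementation; the helpers `normalize_name` and `summarize_issues`, the naming convention, and the sample data are all assumptions made for illustration.

```python
import re
import numpy as np

def normalize_name(name):
    """Hypothetical convention: lower-case, non-alphanumeric runs become '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")

def summarize_issues(Xy, names):
    """Return cleaned column names, missing counts per column, and duplicate row count."""
    Xy = np.asarray(Xy, dtype=float)
    clean_names = [normalize_name(n) for n in names]

    # Missing values are assumed to be encoded as NaN
    missing_per_column = np.isnan(Xy).sum(axis=0).tolist()
    missing = dict(zip(clean_names, missing_per_column))

    # A row counts as a duplicate if it is byte-identical to an earlier row
    seen = set()
    n_duplicates = 0
    for row in Xy:
        key = row.tobytes()
        if key in seen:
            n_duplicates += 1
        else:
            seen.add(key)
    return clean_names, missing, n_duplicates

# Made-up sample: one missing value and one repeated row
Xy = np.array([
    [5.1, 3.5, 0.0],
    [4.9, np.nan, 0.0],
    [5.1, 3.5, 0.0],  # repeats the first row
    [6.2, 2.9, 1.0],
])
names = ["Sepal Length", " sepal-width ", "CLASS"]

clean_names, missing, n_duplicates = summarize_issues(Xy, names)
print(clean_names)
print(missing)
print(n_duplicates)
```

A check like this only reports the problems; deciding how to resolve them (for example, which imputation strategy to use for missing values) is the part a tool such as datacleanbot aims to assist with.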
The user's guide can be found at datacleanbot.