discovery-transitioning-utils

Advanced data cleaning, data wrangling and feature extraction tools for ML engineers


Keywords
Wrangling ML Visualisation Dictionary Discovery Productize Classification Feature Engineering Cleansing
License
BSD-3-Clause
Install
pip install discovery-transitioning-utils==1.8.25

Documentation

Discovery transitioning Tools

This project aims to improve the time to market of data classification, data cleaning, data wrangling and feature extraction for ML engineers. It provides visual and observational output within a few minutes of data delivery, along with the foundation data preparation for ML feature engineering and modelling. The package is fully thread safe and comes with a singleton configuration module for persisting discovery settings and allowing repeatability.


1   Main features

  • Configuration module
  • Data Discovery module
  • Data Cleaning module
  • Feature building module
  • Visualisation module

2   Installation

2.1   package install

The best way to install this package is directly from the Python Package Index repository using pip

$ pip install discovery-transitioning-utils

If you want to upgrade your current version, use pip:

$ pip install --upgrade discovery-transitioning-utils

2.2   env setup

Other than the dependent Python packages indicated in requirements.txt, there is no special environment setup needed to use the package. The package should sit as an extension to your current data science and discovery packages.

That said, the configuration class uses a default YAML file for persisted configuration values and a root working directory. On initialisation of any configuration instance the package looks for the optional environment variables $DSTU_WORK_PATH and $DSTU_CONFIG_FILE. By default the data path is assumed to sit under the working path, but it can also be set directly using $DSTU_DATA_PATH:

$DSTU_CONFIG_FILE
is the path and name of the YAML configuration file. If the environment variable is not found then the default name ./config/config.yaml, relative to the process working directory, is used.
$DSTU_WORK_PATH
is the root working path where the configuration and data tree structure can be found. If the environment variable is not found, then the default path ./ (the process working directory) is used.
$DSTU_DATA_PATH
is the root data path where the data tree structure can be found. If the environment variable is not found, then the default path is used.

Alternatively you can set both the working directory and the data directory at runtime with the initialisation of the `Transition()` class.

To find the value of the environment variables in Python use os.environ['name'], or cut and paste this code into a Jupyter Notebook cell:

import os

for _key in ['DSTU_WORK_PATH', 'DSTU_CONFIG_FILE', 'DSTU_DATA_PATH']:
    _value = os.environ.get(_key, 'Not Found')
    print('{} = {}'.format(_key, _value))

3   Overview

Born out of the frustration of time constraints and the inability to show business value within business expectations, this project aims to provide a set of tools to quickly produce visual and observational results. It also aims to improve the communication outputs needed by ML delivery to talk to pre-sales, stakeholders, business SMEs, data SMEs, product coders and tooling engineers, while still remaining within familiar code paradigms.

The package looks to build a set of outputs, as part of standard data wrangling and ML exploration, that by their nature are familiar tools to the various reliant people and processes: for example, data dictionaries for SMEs, visual representations for clients and stakeholders, and configuration contracts for architects, tool builders and data ingestion.

3.1   ML Discovery

ML Discovery is the first and key part of an end-to-end process of discovery, productization and tooling. It defines the ‘intelligence’ and business differentiators of everything downstream.

To be effective in the ML discovery phase, the ability to micro-iterate within distinct layers enables the needed adaptive delivery and quicker returns on the ML use case.

The building and discovery of an ML model can be broken down into three Separation of Concerns (SoC) or Scope of Responsibility (SoR) for the ML engineer and ML model builder.

  • Data Preparation
  • Feature Engineering
  • Model selection and optimisation

with a fourth discipline of insight, interpretation and profiling as an outcome. These three SoCs can be perceived as eight distinct disciplines.

3.2   Eight stages of Machine Learning

  1. Data Loading (fit-for-purpose, quality, quantity, veracity, connectivity)
  2. Data Preparation (predictor selection, typing, cleaning, valuing, validating)
  3. Data Dictionary (observation, visualisation, knowledge, value scale)
  4. Feature Attribution (attribute mapping, quantitative attribute characterisation, predictor selection)
  5. Feature Engineering (feature modelling, dirty clustering, time series, qualitative feature characterisation)
  6. Feature Framing (hypothesis function, specialisation, custom model framing, model/feature selection)
  7. Modelling (selection, optimisation, testing, training)
  8. Training (learning, feedback loops, opacity testing, insight, profiling, stabilization)

Though conceptual, they do represent a set of needed disciplines and the complexity of the journey to quality output.

3.3   Layered approach to ML

The idea behind the conceptual eight stages of Machine Learning is to layer the preparation and reuse of the activities undertaken by the ML Data Engineer and ML Modeller, providing a platform for micro-iterations rather than constant repetition of repeatable tasks through the stack. It also facilitates contractual definitions between the different disciplines that allow loose coupling and automated regeneration of the different stages of the model build. Finally, it reduces cross-discipline commitments by creating a 'by-design' set of contracts targeted at, and written in, the language of the consumer.

The concept is to quickly run over a single aspect of the ML discovery and then present a stable base for the next layer to iterate against. This micro-iteration approach allows for quick-to-market adaptive delivery.

4   Getting Started

The example we will go through uses the following resources. Once you have created the directory structure as instructed, place the data file in the 0_raw folder and the master notebook in the root of your scripts. In addition, there is a link to a clean template that I use as a starting point for all my data cleaning.

Example Data File: https://github.com/Gigas64/discovery-transitioning-utils/blob/master/jupyter/resource/example01.csv.
Master Jupyter Notebook: https://github.com/Gigas64/discovery-transitioning-utils/blob/master/jupyter/resource/master_scriptlets.ipynb.
Clean Template Jupyter Notebook: https://github.com/Gigas64/discovery-transitioning-utils/blob/master/jupyter/resource/clean_template.ipynb.

4.1   Example Usage

The following example demonstrates the fewest possible steps to engineer a solution to the point of repeatable cleaned data and the start of feature extraction. The library is rich with methods that allow more fine-grained customisation of setup, properties and parametrisation, which are not covered here. The library is fully documented, so for further details please reference the module inline documentation.

The example is based on using Jupyter Notebook but the library works as well within an IDE such as PyCharm or command line Python.

As good practice, and to avoid repetition of common tasks, I would suggest creating a master notebook sitting in the scripts root that contains all your common imports and setup and calls in the sub notebooks using %run.
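
For example, once the common setup has run, a cell in the master notebook can call in one of the sub notebooks listed in the resources above; %run is a standard IPython magic rather than anything library specific:

# call a sub notebook (here the clean template) from the master notebook
%run ./clean_template.ipynb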

4.1.1   First Time Env Setup

When you create a new project, or set up your default master notebook, you import both the FilePropertyManager and the Transition classes:

...
from ds_discovery import FilePropertyManager
from ds_discovery import Transition

Within my master notebook, just as a fail-safe (it costs nothing), I also set up the environment variables:

import os

# set environment variables
os.environ['DSTU_DATA_ROOT'] = <your_data_path>
os.environ['DSTU_CONFIG_FILE'] = <your_path_and_config_file>

We now have all the appropriate imports and environment variables.

The next step is to create all the default directories to put things into and record them inside the configuration contract config.yaml. Again, this is a one-off step and, once recorded, doesn't need to be run again.

# create the instance of the File Properties Manager
fpm = FilePropertyManager()

# Set the default properties values
fpm.set_folder_defaults(create_dir=True)
fpm.set_pattern_defaults()

The set_folder_defaults() method can take other parameters to tailor where things might be found, but for this example we will use the default paths. This will take the DSTU_DATA_ROOT path as the default path and set all the attributes in the configuration contract. The method parameter create_dir=True tells the method to additionally create the directory structure if it doesn't exist.

The set_pattern_defaults() method sets the default patterns for the various filenames created during the transitioning process, thus creating a common naming convention. As an example, the cleaned set of predictors is stored in a file with the pattern clean_{}_v{}.p, where the first {} is the contract name and the second {} is the version number.
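
As a purely illustrative aside, applying that pattern by hand for a contract called 'example01' at, say, version 1 gives:

# illustration only: the filename pattern applied by hand
print('clean_{}_v{}.p'.format('example01', 1))   # clean_example01_v1.p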

4.1.2   Creating a Data Contract

First we create an instance of the Transition class, passing it the reference 'Contract' name:

tr = Transition('Example01')

The Contract name is just a reference name for the discovery activity and sets up an underlying configuration structure with that reference. Any activities performed using this transition instance will be recorded under this reference. As a rule of thumb, a contract would normally have a direct relationship with a single input source or file.

Now we have created the instance, we need to set its input source. In this example we are referencing a csv file called 'example01.csv', which should be placed in the 0_raw/ directory.

df = tr.set_contract_source('example01.csv', sep=',')

The set_contract_source() method takes a file name and then a list of kwargs relating to that file load. In this instance with csv files, as with json files, the call uses the pandas read_csv() and read_json() methods, passing the kwargs directly to the relevant pandas method. For more information on the kwargs see pandas.read_csv and pandas.read_json.

By default set_contract_source() has a parameter get_source=True. This indicates to the method to save the configuration and then attempt to load the file and return a pandas.DataFrame, so this method also loads the file.
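
If you would rather register the source without loading it straight away, the same parameter can be switched off; this is a sketch based on the behaviour described above:

# save the source configuration only; defer the actual load
tr.set_contract_source('example01.csv', sep=',', get_source=False)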

Once the source configuration is set, you can use tr.get_source() to load the DataFrame:

df = tr.get_source()

Note that this loads, or reloads, your raw DataFrame ready for cleaning and is your first micro-iteration platform.
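
A quick sanity check on the loaded frame is often worth doing at this point; this is plain pandas rather than anything library specific:

# confirm the raw data loaded as expected
print(df.shape)
df.head()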

4.1.3   Data Discovery

Now we have the DataFrame we want to start identifying our predictors and tidy their observations, or data-points.

One of the 'go-to' methods in the transitioning is the data_dictionary() method as it gives you an immediate and useful overview of quality and quantity, veracity and validity of the datasets and provides a set of observational references and samples of the underlying data.

Within the Transition class we use the discover property and its methods:

tr.discover.data_dictionary()

This returns a data dictionary in the form of a pandas DataFrame. This quick-reference view presents:

Attribute: the current header name
Type: the current type of that attribute
% Nulls: the percentage of null data points observed
Count: the total count of non-null data points
Unique: the number of unique data points
Value: (if set) a scalar value of the attribute
Observations: observational information about the data points, dependent on their type
Knowledge: (if set) additional SME knowledge of the attribute

In most cases there is enough information or inference in the data dictionary to be able to start the cleaning, but sometimes it is easier to do some auto-cleaning first so you can see the 'wood for the trees'.

Another handy method is filter_columns() on the clean property, which applies the common attribute filter found in all the cleaner methods to filter out, or include only, certain headers. When I am creating the data dictionary code snippet I tend to do:

df_filtered = tr.clean.filter_columns(df, dtype=[])
tr.discover.data_dictionary(df_filtered)

This still allows me to see all the fields, but the addition of dtype lets me see only partial selections, for example:

df_filtered = tr.clean.filter_columns(df, dtype=['int', 'float64'])
tr.discover.data_dictionary(df_filtered)

In this example the data dictionary will only include attributes that are any type of int, plus 64-bit floats only. It is worth familiarising yourself with the filter options as they can be a very powerful micro-iteration tool for working through large datasets.

The tr.clean.filter_columns(...) method can also perform a regular expression search by passing a regular expression string to the regex= parameter, allowing you to customise your filter. As most cleaners are based on this method call, the same applies to those cleaner methods.
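
For instance, a regex-based filter might look like the sketch below; the pattern is arbitrary and the assumption here is that matching headers are kept:

# assumption: headers matching the regular expression are kept
df_ids = tr.clean.filter_columns(df, regex='_id$')
tr.discover.data_dictionary(df_ids)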

4.1.4   Auto Cleaning

To do any kind of cleaning we use the methods on the transition property tr.clean. There are three auto-clean methods, shown here:

df = tr.clean.clean_header(df)
df = tr.clean.auto_remove_columns(df)
df = tr.clean.auto_to_category(df)

With all the cleaning methods the option exists to run them in place or not; in this example inplace is left as False.
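
For instance, running the same header clean in place would modify df directly and, as covered further down, return the cleaner configuration rather than the DataFrame:

# modifies df directly; the return value is the cleaner configuration, not the DataFrame
tr.clean.clean_header(df, inplace=True)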

This is all quick and dirty, but from here you can start to fine-tune the predictors with a number of cleaner methods:

remove_columns:
to_date_type:
to_category_type:
to_bool_type:
to_float_type:
to_int_type:
to_str_type:

An example of use might be:

# turn all predictors that are of type 'object' into category except for the headers listed.
df = tr.clean.to_category_type(df, dtype=['object'], headers=['active', 'agent_id', 'postcode'], drop=True)
# turn the 'age' predictor into an int.
df = tr.clean.to_int_type(df, headers=['age'])

To make these repeatable we can save the settings to the configuration file by using the inplace=True parameter. When run in place, not only is the passed df changed, but the method now returns the configuration structure instead of the df. You can then use the transition method tr.set_cleaner(), which takes a configuration structure.

So from the above code snippet, we might write:

tr.set_cleaner(tr.clean.to_int_type(df, headers=['age'], inplace=True))

This has now been recorded in the config.yaml as:

...
to_int:
  drop: false
  exclude: false
  fillna: -1
  headers:
  - age
...

From the configuration, a documented contract has been created that can provide a software engineer or architect with the blueprint of the ML activity.

4.1.5   Saving the Cleaned DataFrame

Finally, we save the cleaned df to file so it becomes our micro-iteration base for feature extraction:

tr.save_clean_file(df)

This places a pickle file in the 2_clean directory; the DataFrame can be recovered using:

tr = Transition('Example01')
df = tr.load_clean_file()

4.1.6   Creating Shared Output

Now we have cleaned data we can share it with business and data SMEs to get feedback, knowledge and value scaling. Though we have already been able to create a data dictionary, now that the data is cleaned and categorised we can provide a richer view by creating an Excel spreadsheet and visual helpers.

There are two generalised methods, create_data_dictionary() and create_visuals()

Simply pass the DataFrame to the transitioning methods

tr.create_data_dictionary(df)
tr.create_visuals(df)

The output files can be found in the 1_dictionary and 8_visual folders (assuming you are using the default folder names) in your data path folder structure. Within the Excel file are additional worksheets covering statistics and a category value breakdown. The visuals cover category and numeric views.

4.1.7   What Next

There are a few useful methods that are worth exploring in the discovery class; a rough stand-in sketch follows the list below.

to_sample_num: creates a new sample subset from a larger dataset. This allows you to discover very large datasets by sampling subsets where all the data does not need to be loaded
massive_data_sampler: for files that are massive and can't be loaded into memory, this loads from file taking sample subsets out of chunked data to a limit value
train_test_sampler: splits a dataset into sample and training datasets. The split is defined by the parameters set.
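
Their exact signatures are not covered here, so as a rough stand-in while you explore them, plain pandas sampling gives a similar effect; this is ordinary DataFrame.sample, not the library API:

# not the library API: ordinary pandas sampling as a stand-in for subset and train/test sampling
df_sample = df.sample(frac=0.1, random_state=42)   # quick discovery subset
train = df.sample(frac=0.8, random_state=42)       # rough train/test split
test = df.drop(train.index)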

There are also some ready-made feature extraction methods for tricky data types in the FeatureBuilders; a one-hot illustration follows below.

flatten_categorical: flattens a categorical as a sum of one-hot
date_matrix: returns a matrix of date-time elements broken down into columns, including decade and ordinal
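
As a library-independent illustration of what 'flattening a categorical as a sum of one-hot' produces, plain pandas one-hot encoding over one of the example columns looks like this (not the FeatureBuilders API; 'postcode' is simply one of the example dataset's headers):

import pandas as pd

# one-hot encode a single categorical attribute; each unique value becomes its own 0/1 column
one_hot = pd.get_dummies(df['postcode'], prefix='postcode')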

The library is being built out all the time, so keep it updated.

4.2   Python version

Python 2.6 and 2.7 are not supported. Although Python 3.x is supported, it is recommended to install discovery-transitioning-utils against the latest Python 3.6.x whenever possible. Python 3 is the default for Homebrew installations starting with version 0.9.4.

4.3   GitHub Project

Discovery-Transitioning-Utils: https://github.com/Gigas64/discovery-transitioning-utils.

4.4   Change log

See CHANGELOG.

4.5   Licence

BSD-3-Clause: LICENSE.

4.6   Authors

Gigas64 (@gigas64) created discovery-transitioning-utils.