Discovery Transitioning Tools
This project aims to improve the time to market of data classification, data cleaning, data wrangling and feature extraction for ML engineers. It looks to provide visual and observational output within a few minutes of data delivery, along with the foundation data preparation for ML feature engineering and modelling. The package is fully thread safe and comes with a singleton configuration module for persisting discovery settings and allowing repeatability.
1 Main features
- Configuration module
- Data Discovery module
- Data Cleaning module
- Feature building module
- Visualisation module
2 Installation
2.1 Package install
The best way to install this package is directly from the Python Package Index repository using pip:

```bash
$ pip install discovery-transitioning-utils
```

If you want to upgrade your current version, then use pip:

```bash
$ pip install --upgrade discovery-transitioning-utils
```
2.2 Env setup
Other than the dependent Python packages indicated in requirements.txt, there is no special environment setup needed to use the package. The package should sit as an extension to your current data science and discovery packages.
In saying this, the configuration class uses a default YAML file for persisted configuration values and a root working directory. On initialisation of any configuration instance the package looks for two optional environment variables, `$DSTU_WORK_PATH` and `$DSTU_CONFIG_FILE`. By default the data path is assumed to be under the working path, but it can also be set directly using `$DSTU_DATA_PATH`.

- `$DSTU_CONFIG_FILE` - the path and name of the YAML configuration file. If the environment variable is not found, then the default name `./config/config.yaml` is used, relative to the process working directory.
- `$DSTU_WORK_PATH` - the root working path where the configuration and data tree structure can be found. If the environment variable is not found, then the default path `./` is created from the process working directory.
- `$DSTU_DATA_PATH` - the root data path where the data tree structure can be found. If the environment variable is not found, then the default path is used.

Alternatively you can set both the working directory and the data directory at runtime with the initialisation of the `Transition()` class.
To find the value of the environment variables in Python use `os.environ['name']`, or cut and paste this code into a Jupyter Notebook cell:
```python
import os

for _key in ['DSTU_WORK_PATH', 'DSTU_CONFIG_FILE', 'DSTU_DATA_PATH']:
    _value = os.environ.get(_key, 'Not Found')
    print('{} = {}'.format(_key, _value))
```
3 Overview
Born out of the frustration of time constraints and the inability to show business value within business expectations, this project aims to provide a set of tools to quickly produce visual and observational results. It also aims to improve the communication outputs needed by ML delivery to talk to pre-sales, stakeholders, business SMEs, data SMEs, product coders and tooling engineers while still remaining within familiar code paradigms.
The package looks to build a set of outputs as part of standard data wrangling and ML exploration that, by their nature, are familiar tools to the various reliant people and processes: for example, data dictionaries for SMEs, visual representations for clients and stakeholders, and configuration contracts for architects, tool builders and data ingestion.
3.1 ML Discovery
ML Discovery is the first and key part of an end-to-end process of discovery, productization and tooling. It defines the ‘intelligence’ and business differentiators of everything downstream.
To become effective in the ML discovery phase, the ability to micro-iterate within distinct layers enables the needed adaptive delivery and quicker returns on the ML use case.
The building and discovery of an ML model can be broken down into three Separation of Concerns (SoC) or Scope of Responsibility (SoR) for the ML engineer and ML model builder.
- Data Preparation
- Feature Engineering
- Model selection and optimisation
with a fourth discipline of insight, interpretation and profiling as an outcome. These three SoCs can be perceived as eight distinct disciplines:
3.2 Eight stages of Machine Learning
- Data Loading (fit-for-purpose, quality, quantity, veracity, connectivity)
- Data Preparation (predictor selection, typing, cleaning, valuing, validating)
- Data Dictionary (observation, visualisation, knowledge, value scale)
- Feature Attribution (attribute mapping, quantitative attribute characterisation, predictor selection)
- Feature Engineering (feature modelling, dirty clustering, time series, qualitative feature characterisation)
- Feature Framing (hypothesis function, specialisation, custom model framing, model/feature selection)
- Modelling (selection, optimisation, testing, training)
- Training (learning, feedback loops, opacity testing, insight, profiling, stabilization)
Though conceptual, they do represent a set of needed disciplines and the complexity of the journey to quality output.
3.3 Layered approach to ML
The idea behind the conceptual eight stages of Machine Learning is to layer the preparation and reuse of the activities undertaken by the ML Data Engineer and ML Modeller, providing a platform for micro-iterations rather than a constant repetition of repeatable tasks through the stack. It also facilitates contractual definitions between the different disciplines that allow loose coupling and automated regeneration of the different stages of model build. Finally it reduces the cross-discipline commitments by creating a 'by-design' set of contracts targeted at, and written in, the language of the consumer.
The concept is to be able to quickly run over a single aspect of the ML discovery and then present a stable base for the next layer to iterate against. This micro-iteration approach allows for quick-to-market adaptive delivery.
4 Getting Started
The example we will go through uses the following resources. Once you have created the directory structure as instructed, place the data file in the `0_raw` folder and the master notebook in the root of your scripts. In addition there is a link to a clean template that I use as a starting point for all my data cleaning.
- Example Data File: https://github.com/Gigas64/discovery-transitioning-utils/blob/master/jupyter/resource/example01.csv
- Master Jupyter Notebook: https://github.com/Gigas64/discovery-transitioning-utils/blob/master/jupyter/resource/master_scriptlets.ipynb
- Clean Template Jupyter Notebook: https://github.com/Gigas64/discovery-transitioning-utils/blob/master/jupyter/resource/clean_template.ipynb
4.1 Example Usage
The following example demonstrates the fewest possible steps to engineer a solution to the point of repeatable cleaned data and the start of feature extraction. The library is rich with methods that allow more fine-grained customisation of setup, properties and parametrisation, which are not covered here. The library is fully documented, so for further details please reference the module inline documentation.
The example is based on using Jupyter Notebook but the library works as well within an IDE such as PyCharm or command line Python.
As good practice, and to avoid repetition of common tasks, I would suggest creating a master notebook in the scripts root containing all your common imports and setup, calling in the sub-notebooks using `%run`.
4.1.1 First Time Env Setup
When you create a new project, or set up your default master notebook, you import both the `FilePropertyManager` and `Transition` classes:

```python
...
from ds_discovery import FilePropertyManager
from ds_discovery import Transition
```
Within my master notebook, just as a fail-safe, as it costs nothing, I also set up the environment variables:

```python
import os

# set environment variables
os.environ['DSTU_DATA_ROOT'] = '<your_data_path>'
os.environ['DSTU_CONFIG_FILE'] = '<your_path_and_config_file>'
```
We now have all the appropriate imports and environment variables. The next step is to create all the default directories to put things into and record them inside the configuration contract `config.yaml`. Again this is a one-off step and once recorded doesn't need to be run again.
```python
# create the instance of the File Properties Manager
fpm = FilePropertyManager()

# set the default property values
fpm.set_folder_defaults(create_dir=True)
fpm.set_pattern_defaults()
```
The `set_folder_defaults()` method can take other parameters to tailor where things might be found, but for this example we will use the default paths. This will take the `DSTU_DATA_ROOT` path as the default path and set all the attributes in the configuration contract. The method parameter `create_dir=True` tells the method to additionally create the directory structure if it doesn't exist.
The `set_pattern_defaults()` method sets the default patterns for the various filenames created during the transitioning process, thus creating a common naming convention. As an example, the cleaned set of predictors is stored in a file with the pattern `clean_{}_v{}.p`, where the first `{}` is the contract name and the second `{}` the version number.
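As a quick illustration of how that pattern resolves (plain Python string formatting; the contract name and version shown are just examples):

```python
# the first placeholder is the contract name, the second the version number
pattern = 'clean_{}_v{}.p'
print(pattern.format('Example01', 1))   # -> clean_Example01_v1.p
```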
4.1.2 Creating a Data Contract
First we create the instance of the transitioning class, passing it the reference 'Contract' name.
```python
tr = Transition('Example01')
```
The Contract name is just a reference name for the discovery activity and sets up an underlying configuration structure with that reference. Any activities performed using this transition instance will be recorded under this reference. As a rule of thumb, a contract would normally have a direct relationship with a single input source or file.
Now we have created the instance we need to set its input source. In this example we are referencing a csv file called 'Example01.csv', which should be placed in the `0_raw/` directory.

```python
df = tr.set_contract_source('example01.csv', sep=',')
```
The `set_source_contract()` method takes a file name and then a list of kwargs relating to that file load. In this instance, with csv files and with json files, the call uses the pandas `read_csv()` and `read_json()` methods, passing the kwargs directly to that pandas method. For more information on the kwargs see `pandas.read_csv` and `pandas.read_json`.

By default `set_source_contract()` has a parameter `get_source=True`; this indicates to the method to save the configuration and then attempt to load the file and return a `pandas.DataFrame`. Thus this method also loads the file.
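For instance, any standard `pandas.read_csv` keyword arguments can be passed straight through; a sketch (the extra arguments and the column selection here are purely illustrative):

```python
# sketch only: the kwargs are forwarded to pandas.read_csv
# 'encoding' and 'usecols' are standard read_csv options; the values are illustrative
df = tr.set_contract_source('example01.csv', sep=',', encoding='latin1',
                            usecols=['age', 'postcode', 'agent_id'])
```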
Once the source configuration is set, from this point on you can use `tr.get_source()` to load the DataFrame:

```python
df = tr.get_source()
```

Note that this loads, or reloads, your raw DataFrame ready for cleaning and is your first micro-iteration platform.
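At this point you can sanity-check the raw DataFrame with plain pandas before any cleaning, for example:

```python
# plain pandas checks on the freshly loaded raw DataFrame
print(df.shape)    # rows x columns
print(df.dtypes)   # attribute types before any cleaning
```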
4.1.3 Data Discovery
Now we have the DataFrame, we want to start identifying our predictors and tidying their observations, or data points.
One of the 'go-to' methods in the transitioning is the `data_dictionary()` method, as it gives you an immediate and useful overview of the quality, quantity, veracity and validity of the datasets and provides a set of observational references and samples of the underlying data.
Within the Transition class we use the `discover` property and its methods:

```python
tr.discover.data_dictionary()
```
This returns a data dictionary in the form of a pandas DataFrame. This quick reference view presents:

- Attribute: the current header name
- Type: the current type of that attribute
- % Nulls: the percentage of null data points observed
- Count: the total count of data points that are not null
- Unique: the number of unique data points
- Value: (if set) a scalar value of the attribute
- Observations: observational information about the data points, dependent on their type
- Knowledge: (if set) additional SME knowledge of the attribute
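Because the data dictionary is itself a pandas DataFrame, it can be filtered and sorted like any other; a small sketch, assuming the null-percentage column is labelled '% Nulls' as in the list above (the exact label may differ):

```python
# sketch: surface the sparsest attributes first
# the column label '% Nulls' is an assumption taken from the field list above
data_dict = tr.discover.data_dictionary(df)
data_dict.sort_values('% Nulls', ascending=False).head(10)
```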
In most cases there is enough information or inference in the data dictionary to be able to start the cleaning, but sometimes it is easier to do some auto-cleaning to be able to see the 'wood for the trees'.
Another handy method is `filter_columns()` on the `clean` property, which applies the common attribute filter found in all the cleaner methods to filter or include only certain headers. When I am creating the data dictionary code snippet I tend to do:

```python
df_filtered = tr.clean.filter_columns(df, dtype=[])
tr.discover.data_dictionary(df_filtered)
```
This still allows me to see all the fields, but the addition of dtypes lets me see only partial selections, for example:

```python
df_filtered = tr.clean.filter_columns(df, dtype=['int', 'float64'])
tr.discover.data_dictionary(df_filtered)
```

In this example the data dictionary will only include attributes that are any type of int and only 64-bit floats. It is worth familiarising yourself with the filter options as this can be a very powerful micro-iteration tool to work through large datasets.
The method `tr.clean.filter_columns(...)` can also perform a regular expression search by passing a regular expression string to the parameter `regex=`, allowing you to customise your filter. As most cleaners are based on this method call, the same applies to those cleaner methods.
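As a sketch (the pattern itself is just an example):

```python
# select only headers whose name matches the regular expression
df_ids = tr.clean.filter_columns(df, regex='_id$')
tr.discover.data_dictionary(df_ids)
```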
4.1.4 Auto Cleaning
To do any kind of cleaning we use the methods in the transitioning property `tr.clean`. We have three auto-clean methods, shown here:

```python
df = tr.clean.clean_header(df)
df = tr.clean.auto_remove_columns(df)
df = tr.clean.auto_to_category(df)
```
With all the cleaning methods the option exists to work inplace or not; in this example we set inplace to be False.
This is all quick and dirty, but from here you can then start to fine-tune the predictors with a number of cleaner methods:

- remove_columns
- to_date_type
- to_category_type
- to_bool_type
- to_float_type
- to_int_type
- to_str_type
An example of use might be:

```python
# turn all predictors that are of type 'object' into category except for the headers listed
df = tr.clean.to_category_type(df, dtype=['object'], headers=['active', 'agent_id', 'postcode'], drop=True)

# turn the 'age' predictor into an int
df = tr.clean.to_int_type(df, headers=['age'])
```
So as to make these repeatable, we can save the settings to the configuration file by using the `inplace=True` parameter. When inplace, not only is the passed df changed but the method now returns the configuration structure instead of the df. You can then use the transitioning method `tr.set_cleaner()`, which takes a configuration structure.
So from the above code snippet, we might write:
```python
tr.set_cleaner(tr.clean.to_int_type(df, headers=['age'], inplace=True))
```
This has now been recorded in the config.yaml as:

```yaml
...
to_int:
  drop: false
  exclude: false
  fillna: -1
  headers:
  - age
...
```
From the configuration, a documented contract has been created that can provide a software engineer or architect with the blueprint of the ML activity.
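The same pattern can be used to record the other cleaner calls; for example, the category conversion from the earlier snippet (a sketch reusing the same arguments):

```python
# record the earlier category conversion in the configuration contract as well
tr.set_cleaner(tr.clean.to_category_type(df, dtype=['object'],
                                         headers=['active', 'agent_id', 'postcode'],
                                         drop=True, inplace=True))
```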
4.1.5 Saving the Cleaned DataFrame
Finally we need to save the cleaned df to file so it becomes our micro-iteration base for feature extraction.

```python
tr.save_clean_file(df)
```

This places a pickle file in the `2_clean` directory, and it can be recovered using:

```python
tr = Transition('Example01')
df = tr.load_clean_file()
```
4.1.6 Creating Shared Output
Now we have cleaned data we can share this with business and data SMEs so as to get feedback, knowledge and value scaling. Though we have been able to create a data dictionary, now that we have cleaned and categorised the data we are able to provide a richer view by creating an Excel spreadsheet and visual helpers.
There are two generalised methods, `create_data_dictionary()` and `create_visuals()`. Simply pass the DataFrame to the transitioning methods:

```python
tr.create_data_dictionary(df)
tr.create_visuals(df)
```
The output files can be found in the `1_dictionary` and `8_visual` folders, assuming you are using the default folder names, in your data path folder structure. Within the Excel file are additional worksheets covering statistics and also category value breakdown. The visuals cover category and numeric views.
4.1.7 What Next
There are a few useful methods that are worth exploring in the discovery class:

- to_sample_num: creates a new sample subset from a larger dataset. This allows you to discover very large datasets by sampling subsets where all the data does not need to be loaded
- massive_data_sampler: for files that are massive and can't be loaded into memory, this loads from file taking sample subsets out of chunked data to a limit value
- train_test_sampler: splits a dataset into sample and training datasets. The split is defined by the parameters set.
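For orientation only, the underlying idea of sampling a discovery subset can be sketched with plain pandas (this is not the library's own method):

```python
# pandas-only sketch: take a reproducible 10% sample of the DataFrame
# so that discovery iterations stay quick on large datasets
df_sample = df.sample(frac=0.1, random_state=42)
print(len(df), '->', len(df_sample))
```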
There are also some already-created feature extraction methods around some tricky data types in the FeatureBuilders:

- flatten_categorical: flattens a categorical as a sum of one-hot
- date_matrix: returns a matrix of date-time elements broken down into columns, including decade and ordinal
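For orientation, the one-hot flattening idea looks roughly like the standard pandas `get_dummies` behaviour (a pandas-only sketch, not the FeatureBuilders call itself):

```python
import pandas as pd

# pandas-only sketch of one-hot flattening: each category value becomes a 0/1 column
df_demo = pd.DataFrame({'postcode': ['AB1', 'CD2', 'AB1']})
print(pd.get_dummies(df_demo['postcode'], prefix='postcode'))
```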
The library is being built out all the time, so keep it updated.
4.2 Python version
Python 2.6 and 2.7 are not supported. Although Python 3.x is supported, it is recommended to install `discovery-transitioning-utils` against the latest Python 3.6.x whenever possible.
Python 3 is the default for Homebrew installations starting with version 0.9.4.
4.3 GitHub Project
Discovery-Transitioning-Utils: https://github.com/Gigas64/discovery-transitioning-utils.
4.4 Change log
See CHANGELOG.
4.5 Licence
BSD-3-Clause: LICENSE.