discovery-behavioral-utils

Advanced behavioral components data generation and simulation tools for ML and Data engineers


Keywords
Synthetic, data, simulator, generator_test
License
BSD-3-Clause
Install
pip install discovery-behavioral-utils==2.6.37

Documentation

Discovery Behavioral Tools

This project looks to help in the building of tools that require data that has behavioral characteristics.

1   Main features

  • Probability Weighting
  • Correlation and Association
  • Behavioral Analytics

2   Installation

2.1   package install

The best way to install this package is directly from the Python Package Index repository using pip:

$ pip install discovery-behavioral-utils

To upgrade your current version, use pip with the --upgrade option:

$ pip install --upgrade discovery-behavioral-utils

2.2   env setup

Other than the dependent Python packages indicated in requirements.txt, there are no special environment setup needs to use the package. The package should sit as an extension to your current data science and discovery packages.

3   Overview

3.1   Techniques and Methods

The Behavioral Synthetic Data Generator was developed as a solution to the current challenges of data accessibility and the early mobilization of machine learning discovery and model build. The tool takes on a challenging problem area that is often viewed with scepticism: generating data that is synthetic but still representative of its intended real-life counterpart. In short, the project needed to develop rich data sets to demonstrate the capabilities of its machine learning offerings so users could see and test what the synthetic data could do.

To achieve this, the project identified three constructs:

1. Probability Weighting - An algorithm based on breadth and depth weighting patterns, fulfilled through multivariate continuous distributions using monotonic splines and copulas. Developed with Aryan Pedawi, a Ph.D. research scientist specializing in Bayesian probability theory, the Probability Weighting algorithm is one of the key differentiators from other synthetic data models, allowing fine-grained and complex behavioral characteristics to be added to the distribution of data points within a data set.

2. Correlation and Association – Through advanced programming techniques and a deep knowledge of component modelling and code reuse, the project developed a finite set of data point generation tooling that implements method chaining and rules-based association against action techniques. This approach and its techniques provide the ability to capture machine learning and business intent and generate specialized output against those requirements.

3. Behavioral Analytics – In addition to the data point generators, the tooling provides data analytics and behavioral extraction, against existing data sets, that can be replayed to quickly create behavioral patterns within existing data sets, without compromising or disclosing sensitive, or protected information. This is particularly valuable with today’s concerns of data protection and disclosure mitigation strategies.

3.2   Value Proposition

Within the machine learning discipline, and as a broader challenge, the accessibility of data and its relevance to early engagement and customer success is an industry problem, with many variant solutions available on the market. Though competent in their delivery, their ability to flex and enrich across multiple examples of need, particularly the high demands of pattern and associative recognition in machine learning, is limited and viewed with cynicism within the machine learning community. The Behavioral Synthetic Data Generator improves the representation of data appropriate to ML modelling, test and train data sets, and disclosure mitigation. It does so through targeted and customized modelling that removes the personal DNA of the data and leaves representative data that retains its behavioral DNA, allowing true representation of the problem scope.

The ability to engage with the customer before organisational data sets are available or accessible is a vital part of an organisation's ability to prove value early and build customer success. The Behavioral Synthetic Data Generator is currently being used for stress, volume and boundary testing and for presentation enrichment modelling within the Accelerated Machine Learning initiative. In addition, it is being used to generate highly sophisticated, machine learning focused behavioral data that allows for early validation of customer success while data access remains restricted.

4   Using the Behavioral Synthetic Data Generator

4.1   Package Structure

Within the Discovery Transitioning Utils is a simulator package that contains the DataBuilder, DataBuilderPropertyManager and DataBuilderTools classes.

4.1.1   DataBuilder

  • is a Data Builder management instance that allows the building of datasets to be repeatable by saving a configuration of the build definition

4.1.2   DataBuilderPropertyManager

  • manages the configuration property values and saves the build templates to regenerate the synthetic data

4.1.3   DataBuilderTools

  • is a set of static methods that generate the different data types (int, float, string, category and date) and define the randomness and patterns of the values.

First we need to import the DataBuilder class and create a named instance, so that it can be identified among any other instances we might create. Normally the name would be representative of the dataset you are trying to create, such as customer, accounts or transactions.

from ds_behavioral import DataBuilder
builder = DataBuilder('SimpleExample')

4.2   Building a basic dataset

In this example we will first look at the tools that are available and then produce a pandas DataFrame on the fly.

builder.tool_dir
['associate_analysis',
 'associate_custom',
 'associate_dataset',
 'correlate_categories',
 'correlate_dates',
 'correlate_numbers',
 'get_category',
 'get_custom',
 'get_datetime',
 'get_distribution',
 'get_file_column',
 'get_intervals',
 'get_number',
 'get_profiles',
 'get_reference',
 'get_string_pattern',
 'unique_date_seq',
 'unique_identifiers',
 'unique_numbers',
 'unique_str_tokens']

Here we can see the methods are broken down into four categories: get, unique, correlate, associate.

We can also look at the contextual help for each of the methods by calling the tools property and using the help built-in.

help(builder.tools.get_number)
Help on function get_number in module ds_discovery.simulators.data_builder:

get_number(to_value: , from_value: = None, weight_pattern: list = None, precision: int = None, size: int = None,
           quantity: float = None, seed: int = None)
    returns a number in the range from_value to to_value. if only to_value given from_value is zero

    :param to_value: highest integer value, if from_value provided must be one above this value
    :param from_value: optional, (signed) integer to start from. Default is zero (0)
    :param weight_pattern: a weighting pattern or probability that does not have to add to 1
    :param precision: the precision of the returned number. if None then assumes int value else float
    :param size: the size of the sample
    :param quantity: a number between 0 and 1 representing data that isn't null
    :param seed: a seed value for the random function: default to None
    :return: a random number

From here we can now play with some of the get methods

# get an integer between 0 and 9
builder.tools.get_number(10, size=5)
$> [6, 5, 3, 2, 3]
# get a float between -1 and 1; notice that by passing a float it assumes the output to be a float
builder.tools.get_number(from_value=-1.0, to_value=1.0, precision=3, size=5)
$> [0.283, 0.296, -0.958, 0.185, 0.831]
# get a currency by setting the 'currency' parameter to a currency symbol.
# Note this returns a list of strings
builder.tools.get_number(from_value=1000.0, to_value=2000.0, size=5, currency='$', precision=2)
$> ['$1,286.00', '$1,858.00', '$1,038.00', '$1,944.00', '$1,250.00']
# get a timestamp between two dates
builder.tools.get_datetime(start='01/01/2017', until='31/12/2018')
$> [Timestamp('2018-02-11 02:23:32.733296768')]
# get a formatted date string between two dates
builder.tools.get_datetime(start='01/01/2017', until='31/12/2018', size=4, date_format='%d-%m-%Y')
$> ['06-06-2017', '05-11-2017', '28-09-2018', '04-11-2017']
# get categories from a selection
builder.tools.get_category(['Red', 'Blue', 'Green', 'Black', 'White'], size=4)
$> ['Green', 'Blue', 'Blue', 'White']
# get unique categories from a selection
builder.tools.get_category(['Red', 'Blue', 'Green', 'Black', 'White'], size=4, replace=False)
$> ['Blue', 'White', 'Green', 'Black']
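
The difference between the last two calls can be pictured with the standard library alone. The sketch below is not the package's implementation: `pick_categories` is a hypothetical helper, and it assumes that `replace=False` means each category is returned at most once.

```python
import random

def pick_categories(selection, size, replace=True, seed=None):
    """Hypothetical stand-in for the sampling behind get_category."""
    rng = random.Random(seed)
    if replace:
        # with replacement: the same category may appear more than once
        return rng.choices(selection, k=size)
    # without replacement: every returned category is distinct
    return rng.sample(selection, k=size)

colours = ['Red', 'Blue', 'Green', 'Black', 'White']
with_repeats = pick_categories(colours, size=4, seed=1)
unique_only = pick_categories(colours, size=4, replace=False, seed=1)
```

Sampling without replacement can only work while `size` does not exceed the number of categories, which is worth keeping in mind when asking for unique values.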

4.3   Building a DataFrame

With these, let's build a quick synthetic DataFrame. For ease of code we will assign 'builder.tools' to a shorter name.

import pandas as pd

tools = builder.tools
# the DataFrame has a unique id, a float value between 0.0 and 1.0 and a date formatted as a text string
df = pd.DataFrame()
df['id'] = tools.unique_numbers(start=10, until=100, size=10)
df['values'] = tools.get_number(to_value=1.0, size=10)
df['date'] = tools.get_datetime(start='12/05/2018', until='30/11/2018', date_format='%d-%m-%Y %H:%M:%S', size=10)

4.3.1   Data quantity

To show representative data we can adjust the quantity of the data we produce, that is, the proportion of values that are not null. Here we only get about 50% of the telephone numbers.

# using the get string pattern we can create part random and part static data elements. see the inline docs for help on customising choices
df['mobile'] = tools.get_string_pattern("(07ddd) ddd ddd", choice_only=False, size=10, quantity=0.5)
df

https://raw.githubusercontent.com/Gigas64/discovery-behavioral-utils/master/docs/img/output_26_0.png
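
As a rough picture of what the pattern and `quantity` arguments do, the following standard-library sketch expands a pattern and then blanks out a share of the results. Both helpers are hypothetical and rest on assumptions: that `d` in the pattern stands for a random digit, and that `quantity` is applied as an independent keep-probability per value.

```python
import random

def expand_pattern(pattern, rng):
    """Replace each 'd' with a random digit; every other character stays fixed.
    Assumption: 'd' is the digit placeholder, as the example above suggests."""
    return ''.join(str(rng.randint(0, 9)) if ch == 'd' else ch for ch in pattern)

def apply_quantity(values, quantity, rng):
    """Keep each value with probability `quantity`, otherwise replace it with None."""
    return [v if rng.random() < quantity else None for v in values]

rng = random.Random(42)
mobiles = [expand_pattern("(07ddd) ddd ddd", rng) for _ in range(10)]
sparse = apply_quantity(mobiles, quantity=0.5, rng=rng)
```

Because each value is kept or dropped independently, the share of non-null values is only approximately 50%, which is consistent with the "about 50%" described above.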

4.4   Weighted Patterns

Now we can take more control over how the random numbers are generated by using weighted patterns. Weighted patterns are similar to probabilities but don't need to add to 1 and don't need to be the same size as the selection. Let's see how this works through an example.

Let's generate an array of 100 and then see how many times each category is selected.

selection = ['M', 'F', 'U']
gender = tools.get_category(selection, weight_pattern=[5,4,1], size=100)
dist = [0]*3
for g in gender:
    dist[selection.index(g)] += 1

print(dist)
$> [51, 40, 9]
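import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(8,3))
sns.set(style="whitegrid")
# pass x and y by keyword; recent seaborn releases no longer accept them positionally
g = sns.barplot(x=selection, y=dist)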

https://raw.githubusercontent.com/Gigas64/discovery-behavioral-utils/master/docs/img/output_25_0.png
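
The reason the weights need not add to 1 is that only their ratios matter. The sketch below shows this with the standard library rather than the package itself; dividing each weight by the total gives the equivalent probabilities.

```python
import random
from collections import Counter

selection = ['M', 'F', 'U']
weights = [5, 4, 1]          # relative weights; they sum to 10, not 1

# dividing by the total gives the equivalent probabilities
total = sum(weights)
probabilities = [w / total for w in weights]

# random.choices accepts relative weights directly
rng = random.Random(0)
gender = rng.choices(selection, weights=weights, k=100)
dist = Counter(gender)
```

Here `probabilities` is `[0.5, 0.4, 0.1]`, so roughly half of the 100 picks should be 'M', in line with the counts printed above.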

It can also be used to create more complex distributions. In this example we want an age distribution that has peaks around 35-40 and 55-60 with a significant tail-off after 60, but we don't want to specify a probability for every age.

# break the pattern into every 5 years
pattern = [3,5,6,10,6,5,7,15,5,2,1,0.5,0.2,0.1]
# to_value is the first positional parameter, so name both bounds explicitly
age = tools.get_number(from_value=20, to_value=90, weight_pattern=pattern, size=1000)

fig = plt.figure(figsize=(10,4))
_ = sns.set(style="whitegrid")
_ = sns.kdeplot(age, shade=True)

https://raw.githubusercontent.com/Gigas64/discovery-behavioral-utils/master/docs/img/output_27_0.png
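
One way to picture how 14 weights can cover 70 possible ages is to treat each weight as belonging to an equal-width band of values. The sketch below is an assumption about the general idea, not the package's algorithm: `weighted_number` is a hypothetical helper that picks a band in proportion to its weight and then picks uniformly inside that band.

```python
import random

def weighted_number(from_value, to_value, weight_pattern, size, seed=None):
    """Draw integers in [from_value, to_value) using one weight per equal-width band."""
    rng = random.Random(seed)
    span = to_value - from_value
    bands = len(weight_pattern)
    result = []
    for _ in range(size):
        # pick a band in proportion to its weight
        idx = rng.choices(range(bands), weights=weight_pattern, k=1)[0]
        # integer band edges, then a uniform pick inside the band
        low = from_value + idx * span // bands
        high = from_value + (idx + 1) * span // bands
        result.append(rng.randrange(low, max(high, low + 1)))
    return result

pattern = [3, 5, 6, 10, 6, 5, 7, 15, 5, 2, 1, 0.5, 0.2, 0.1]
age = weighted_number(20, 90, pattern, size=1000, seed=7)
```

With this pattern each band is five years wide, so the weight of 15 lands on ages 55 to 59 and the weight of 0.1 on ages 85 to 89.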

4.4.1   Complex Weighting patterns

Weighting patterns can be multi-dimensional, controlling the distribution over time.

In this example we don't want any values below 50 in the first half, and then only values below 50 in the second.

split_pattern = [[0,1],[1,0]]
numbers = tools.get_number(100, weight_pattern=split_pattern, size=100)

fig = plt.figure(figsize=(8,4))
plt.style.use('seaborn-whitegrid')
plt.plot(list(range(100)), numbers);
_ = plt.axhline(y=50, linewidth=0.75, color='red')
_ = plt.axvline(x=50, linewidth=0.75, color='red')

https://raw.githubusercontent.com/Gigas64/discovery-behavioral-utils/master/docs/img/output_29_1.png

We can build even more complex patterns where we always get numbers around the middle, but additionally get high numbers in the first third and low numbers in the last third.

mid_pattern = [[0,0,1],1,[1,0,0]]
numbers = tools.get_number(100, weight_pattern=mid_pattern, size=100)
fig = plt.figure(figsize=(8,4))
_ = plt.plot(list(range(100)), numbers);
_ = plt.axhline(y=33, linewidth=0.75, color='red')
_ = plt.axhline(y=67, linewidth=0.75, color='red')
_ = plt.axvline(x=33, linewidth=0.75, color='red')
_ = plt.axvline(x=67, linewidth=0.75, color='red')

https://raw.githubusercontent.com/Gigas64/discovery-behavioral-utils/master/docs/img/output_31_0.png
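
The two nested patterns above can be read as follows, and this reading is an assumption inferred from the examples rather than from the package source: the outer list holds one entry per band of values, from low to high, and a nested list gives that band's weight in each successive slice of the sample. A scalar entry applies to the whole sample. The hypothetical helpers below sketch that interpretation with the standard library.

```python
import random

def weight_at(weight, i, size):
    """Weight of one value band for the i-th of `size` samples."""
    if isinstance(weight, (list, tuple)):
        # a nested list splits the sample into equal slices along its length
        return weight[i * len(weight) // size]
    return weight  # a scalar applies to the whole sample

def number_over_time(to_value, weight_pattern, size, seed=None):
    """Draw integers in [0, to_value) with band weights that change over the sample."""
    rng = random.Random(seed)
    bands = len(weight_pattern)
    result = []
    for i in range(size):
        weights = [weight_at(w, i, size) for w in weight_pattern]
        band = rng.choices(range(bands), weights=weights, k=1)[0]
        low = band * to_value // bands
        high = (band + 1) * to_value // bands
        result.append(rng.randrange(low, high))
    return result

split = number_over_time(100, [[0, 1], [1, 0]], size=100, seed=3)
mid = number_over_time(100, [[0, 0, 1], 1, [1, 0, 0]], size=100, seed=3)
```

Under this reading, `split` stays at 50 or above for the first 50 samples and below 50 afterwards, while `mid` always allows the middle band and adds the high band only in the first third and the low band only in the last third.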

4.4.2   Random Seed

In this example we use seeding to make the randomness of both the weighted pattern and the generated numbers repeatable. We can then look for a good set of seeds that generate different spike patterns we can predict.

import numpy as np

fig = plt.figure(figsize=(12,15))
right=False
for i in range(0,10):
    ax = plt.subplot2grid((5,2),(int(i/2), int(right)))
    result = tools.get_number(100, weight_pattern=np.sin(range(10)), size=100, seed=i+10)
    g = plt.plot(list(range(100)), result);
    t = plt.title("seed={}".format(i+10))
    right = not right
plt.tight_layout()
plt.show()

https://raw.githubusercontent.com/Gigas64/discovery-behavioral-utils/master/docs/img/output_33_0.png
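
The effect of a seed can be shown independently of the package. In this minimal sketch, `seeded_numbers` is a hypothetical helper that creates a private random generator per call, which is one common way to make a seeded call repeatable without disturbing other random state.

```python
import random

def seeded_numbers(to_value, size, seed=None):
    """Draw `size` integers in [0, to_value) from a generator private to this call."""
    rng = random.Random(seed)
    return [rng.randrange(to_value) for _ in range(size)]

first = seeded_numbers(100, size=5, seed=10)
second = seeded_numbers(100, size=5, seed=10)
other = seeded_numbers(100, size=5, seed=11)
```

The same seed reproduces the same list, so `first` and `second` are equal, while a different seed will normally give a different list.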

4.5   Dates

Dates are an important part of most datasets and need flexibility in all their multidimensional elements.

# creating a set of random dates and a set of unique dates
df = pd.DataFrame()
df['dates'] =  tools.get_datetime('01/01/2017', '21/01/2017', size=20, date_format='%d-%m-%Y')
df['seq'] = tools.unique_date_seq('01/01/2017', '21/01/2017', size=20, date_format='%d-%m-%Y')
print("{}/20 dates and {}/20 unique date sequence".format(df.dates.nunique(), df.seq.nunique()))
$> 11/20 dates and 20/20 unique date sequence
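
The contrast between the two columns comes down to drawing with or without repeats. The sketch below illustrates this with the standard library at day granularity. The helper names are hypothetical, and it assumes that both bounds are inclusive and that a unique sequence is returned in ascending order.

```python
import random
from datetime import date, timedelta

def random_dates(start, until, size, rng):
    """Dates drawn independently, so repeats are possible."""
    days = (until - start).days
    return [start + timedelta(days=rng.randint(0, days)) for _ in range(size)]

def unique_dates(start, until, size, rng):
    """Distinct dates in ascending order; needs at least `size` days in the range."""
    days = (until - start).days
    offsets = sorted(rng.sample(range(days + 1), k=size))
    return [start + timedelta(days=d) for d in offsets]

rng = random.Random(5)
start, until = date(2017, 1, 1), date(2017, 1, 21)
dates = random_dates(start, until, size=20, rng=rng)
seq = unique_dates(start, until, size=20, rng=rng)
print("{}/20 dates and {}/20 unique date sequence".format(len(set(dates)), len(set(seq))))
```

With only 21 candidate days and 20 independent draws, repeated dates in the first list are very likely, whereas the second list always contains 20 distinct dates.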

4.5.1   Date patterns

The get_datetime method has a number of different weighting patterns that can be applied:

  • across the date range
  • by year
  • by month
  • by weekday
  • by hour
  • by minutes

Or by a combination of any of them.

from ds_discovery.transition.discovery import Visualisation as visual
# Create a month pattern that has no data in every other month
pattern = [1,0]*6
selection = ['Rigs', 'Office']

df_rota = pd.DataFrame()
df_rota['rota'] = tools.get_category(selection, size=300)
df_rota['dates'] =  tools.get_datetime('01/01/2017', '01/01/2018', size=300, month_pattern=pattern)

df_rota = cleaner.to_date_type(df_rota, headers='dates')
df_rota = cleaner.to_category_type(df_rota, headers='rota')
visual.show_cat_time_index(df_rota, 'dates', 'rota')

https://raw.githubusercontent.com/Gigas64/discovery-behavioral-utils/master/docs/img/output_39_0.png

Quite often dates need to follow a specific pattern to represent real working times. In this example we only want dates that occur in the working week.

# create dates that are only during the working week
pattern = [1,1,1,1,1,0,0]
selection = ['Management', 'Staff']

df_seating = pd.DataFrame()
df_seating['position'] = tools.get_category(selection, weight_pattern=[7,3], size=100)
df_seating['dates'] =  tools.get_datetime('14/01/2019', '22/01/2019', size=100, weekday_pattern=pattern)

df_seating = cleaner.to_date_type(df_seating, headers='dates')
df_seating = cleaner.to_category_type(df_seating, headers='position')
visual.show_cat_time_index(df_seating, 'dates', 'position')

https://raw.githubusercontent.com/Gigas64/discovery-behavioral-utils/master/docs/img/output_36_0.png
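
A weekday pattern of this kind can be thought of as one weight per day of the week applied to every candidate date. The following sketch works at day granularity with the standard library. `dates_by_weekday` is a hypothetical helper, and it assumes the pattern starts on Monday, which matches Python's `date.weekday()` numbering.

```python
import random
from datetime import date, timedelta

def dates_by_weekday(start, until, size, weekday_pattern, seed=None):
    """Draw dates in [start, until], weighting each day by its weekday (Monday first)."""
    rng = random.Random(seed)
    days = [start + timedelta(days=n) for n in range((until - start).days + 1)]
    # date.weekday() returns 0 for Monday through 6 for Sunday
    weights = [weekday_pattern[d.weekday()] for d in days]
    return rng.choices(days, weights=weights, k=size)

working_week = [1, 1, 1, 1, 1, 0, 0]
dates = dates_by_weekday(date(2019, 1, 14), date(2019, 1, 22), 100, working_week, seed=1)
```

Since Saturday and Sunday carry a weight of zero, 19 and 20 January 2019 can never be selected, and the same approach extends to month or hour patterns by changing which attribute of the date indexes the pattern.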

4.5.2   What Next

These are only the starter building blocks that give the foundation for more complex rules and behavior. Have a play with:

  • correlate: creates data that correlates to another set of values, giving an offset value based on the original. This applies to dates, numbers and categories
  • associate: allows the construction of complex rule-based actions and behavior
  • builder instance: explore the ability to configure and save a template so you can repeat the build

The library is being built out all the time, so keep it updated.
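
To give a feel for what correlating by an offset might look like, here is a small standard-library sketch. It does not use or reproduce the package's correlate methods: `correlate_with_offset` and its `spread` parameter are hypothetical names introduced only for illustration.

```python
import random

def correlate_with_offset(values, offset, spread=0.0, seed=None):
    """Return one new value per input: the original plus an offset and optional jitter."""
    rng = random.Random(seed)
    return [v + offset + rng.uniform(-spread, spread) for v in values]

base = [100, 200, 300]
shifted = correlate_with_offset(base, offset=10)
noisy = correlate_with_offset(base, offset=10, spread=2.0, seed=4)
```

With no spread each output is exactly the original plus the offset. With a spread the outputs still track the originals but vary within the given tolerance, which is often closer to how related real-world values behave.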

4.6   Python version

Python 2.6 and 2.7 are not supported. Although Python 3.x is supported, it is recommended to install discovery-behavioral-utils against the latest Python 3.6.x whenever possible. Python 3 is the default for Homebrew installations starting with version 0.9.4.

4.7   GitHub Project

Discovery-Behavioral-Utils: https://github.com/Gigas64/discovery-behavioral-utils.

4.8   Change log

See CHANGELOG.

4.9   Licence

BSD-3-Clause: LICENSE.

4.10   Authors

Gigas64 (@gigas64) created discovery-behavioral-utils.