Discovery Behavioral Tools
This project helps in the building of tools that require data with behavioral characteristics.
Contents
1 Main features
- Probability Weighting
- Correlation and Association
- Behavioral Analytics
2 Installation
2.1 Package install
The best way to install this package is directly from the Python Package Index repository using pip:
$ pip install discovery-behavioral-utils
If you want to upgrade your current version, then use pip:
$ pip install --upgrade discovery-behavioral-utils
2.2 Environment setup
Other than the dependent Python packages listed in requirements.txt, there is no special environment setup needed to use the package. The package should sit as an extension to your current data science and discovery packages.
3 Overview
3.1 Techniques and Methods
The Behavioral Synthetic Data Generator was developed as a solution to the current challenges of data accessibility and the early mobilization of machine learning discovery and model build. The tool takes on what is a sceptically viewed and challenging problem area: the generation of data that is synthetic yet still representative of its intended real-life counterpart. In short, the project needed to develop rich data sets to demonstrate the capabilities of its machine learning offerings, so users could see and test what the synthetic data could do.
To achieve this, the project identified three constructs:
1. Probability Weighting - an algorithm based on breadth and depth weighting patterns fulfilled through multivariate continuous distributions using monotonic splines and copulas. Developed with Aryan Pedawi, a Ph.D. research scientist specializing in Bayesian probability theory, the Probability Weighting algorithm is one of the key differentiators from other synthetic data models, allowing fine-grained and complex behavioral characteristics to be added to the distribution of data points within a data set.
2. Correlation and Association - through advanced programming techniques and a deep knowledge of component modelling and code reuse, the project developed a finite set of data point generation tooling that implements method chaining and rules-based association-and-action techniques. This approach provides the ability to capture machine learning and business intent and generate specialized output against those requirements.
3. Behavioral Analytics - in addition to the data point generators, the tooling provides data analytics and behavioral extraction against existing data sets, which can be replayed to quickly recreate behavioral patterns without compromising or disclosing sensitive or protected information. This is particularly valuable given today's concerns around data protection and disclosure mitigation.
3.2 Value Proposition
Within the machine learning discipline, and as a broader industry challenge, the accessibility of data and its relevance to early engagement and customer success is a problem with many product variants on the market. Though competent in their delivery, their ability to flex and enrich across multiple examples of need, particularly the high demands of pattern and associative recognition pertaining to machine learning, is limited and cynically regarded within the machine learning community. The Behavioral Synthetic Data Generator improves the representation of data appropriate to ML modelling, test/train data sets and disclosure mitigation through targeted and customized modelling that removes the personal DNA of the data while retaining its behavioural DNA, allowing true representation of the problem scope.
The ability to engage with the customer before the availability of, or access to, organisational data sets is a vital part of an organisation's ability to prove value-add early and build customer success. The Behavioural Synthetic Data Generator is currently being used for stress, volume and boundary testing and presentation-enrichment modelling within the Accelerated Machine Learning initiative. In addition, it is being used to generate highly sophisticated machine-learning-focused behavioural data that allows for early validation of customer success while data access remains restricted.
4 Using the Behavioral Synthetic Data Generator
4.1 Package Structure
Within the Discovery Behavioral Utils is a simulator package that contains the DataBuilder, DataBuilderPropertyManager and DataBuilderTools classes.
4.1.1 DataBuilder
- a Data Builder management class that makes the building of datasets repeatable by saving a configuration of the build definition
4.1.2 DataBuilderPropertyManager
- manages the configuration property values and saves the build templates to regenerate the synthetic data
4.1.3 DataBuilderTools
- a set of static methods that generate the different data types int, float, string, category and date, and define the randomness and patterns of the values
Firstly we need to import the DataBuilder class and create a named instance to identify this instance from other instances we might create. Normally the name would be representative of the dataset you are trying to create, such as customer, accounts or transactions.
from ds_behavioral import DataBuilder
builder = DataBuilder('SimpleExample')
4.2 Building a basic dataset
With this example we will first look at the tools that are available and produce a Pandas DataFrame on the fly.
builder.tool_dir
['associate_analysis', 'associate_custom', 'associate_dataset', 'correlate_categories', 'correlate_dates', 'correlate_numbers', 'get_category', 'get_custom', 'get_datetime', 'get_distribution', 'get_file_column', 'get_intervals', 'get_number', 'get_profiles', 'get_reference', 'get_string_pattern', 'unique_date_seq', 'unique_identifiers', 'unique_numbers', 'unique_str_tokens']
Here we can see the methods are broken down into four categories: get, unique, correlate and associate.
We can also look at the contextual help for each of the methods by calling the tools property and using the built-in help:
help(builder.tools.get_number)
Help on function get_number in module ds_discovery.simulators.data_builder:

get_number(to_value, from_value=None, weight_pattern: list = None, precision: int = None, size: int = None, quantity: float = None, seed: int = None)
    returns a number in the range from_value to to_value. If only to_value is given, from_value is zero
    :param to_value: highest integer value, if from_value provided must be one above this value
    :param from_value: optional, (signed) integer to start from. Default is zero (0)
    :param weight_pattern: a weighting pattern or probability that does not have to add to 1
    :param precision: the precision of the returned number. if None then assumes int value else float
    :param size: the size of the sample
    :param quantity: a number between 0 and 1 representing data that isn't null
    :param seed: a seed value for the random function: default to None
    :return: a random number
From here we can now play with some of the get
methods
# get an integer between 0 and 9
builder.tools.get_number(10, size=5)
$> [6, 5, 3, 2, 3]
# get a float between -1 and 1; notice that by passing a float it assumes the output to be a float
builder.tools.get_number(from_value=-1.0, to_value=1.0, precision=3, size=5)
$> [0.283, 0.296, -0.958, 0.185, 0.831]
# get a currency by setting the 'currency' parameter to a currency symbol.
# Note this returns a list of strings
builder.tools.get_number(from_value=1000.0, to_value=2000.0, size=5, currency='$', precision=2)
$> ['$1,286.00', '$1,858.00', '$1,038.00', '$1,944.00', '$1,250.00']
# get a timestamp between two dates
builder.tools.get_datetime(start='01/01/2017', until='31/12/2018')
$> [Timestamp('2018-02-11 02:23:32.733296768')]
# get a formatted date string between two dates
builder.tools.get_datetime(start='01/01/2017', until='31/12/2018', size=4, date_format='%d-%m-%Y')
$> ['06-06-2017', '05-11-2017', '28-09-2018', '04-11-2017']
# get categories from a selection
builder.tools.get_category(['Red', 'Blue', 'Green', 'Black', 'White'], size=4)
$> ['Green', 'Blue', 'Blue', 'White']
# get unique categories from a selection
builder.tools.get_category(['Red', 'Blue', 'Green', 'Black', 'White'], size=4, replace=False)
$> ['Blue', 'White', 'Green', 'Black']
4.3 Building a DataFrame
With these, let's build a quick synthetic DataFrame. For ease of code we will alias the builder.tools call.
import pandas as pd

tools = builder.tools

# the DataFrame has a unique id, a float value between 0.0 and 1.0,
# and a date formatted as a text string
df = pd.DataFrame()
df['id'] = tools.unique_numbers(start=10, until=100, size=10)
df['values'] = tools.get_number(to_value=1.0, size=10)
df['date'] = tools.get_datetime(start='12/05/2018', until='30/11/2018', date_format='%d-%m-%Y %H:%M:%S', size=10)
4.3.1 Data quantity
To show representative data we can adjust the quantity of the data we produce. Here we only get about 50% of the telephone numbers.
# using the get string pattern we can create part random and part static data elements. see the inline docs for help on customising choices
df['mobile'] = tools.get_string_pattern("(07ddd) ddd ddd", choice_only=False, size=10, quantity=0.5)
df
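The quantity parameter can be pictured with a minimal pure-Python sketch. This is an illustration of the idea only, not the library's own implementation: keep roughly a given fraction of the values and null out the rest.

```python
import random

def apply_quantity(values, quantity, seed=None):
    # Keep roughly `quantity` (a fraction between 0 and 1) of the
    # values and replace the rest with None. A sketch of the concept,
    # not the package's internal code.
    rng = random.Random(seed)
    return [v if rng.random() < quantity else None for v in values]

sample = apply_quantity(list(range(10)), quantity=0.5, seed=1)
```

The seed makes the null placement repeatable, mirroring how the builder's own seed parameter works.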
4.4 Weighted Patterns
Now we can get more control over how the random numbers are generated by using weighted patterns. Weighted patterns are similar to probabilities but don't need to add to 1, and don't need to be the same size as the selection. Let's see how this works through an example.
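To see why the weights don't need to sum to 1, here is a minimal pure-Python sketch of the idea (not the library's implementation): the raw weights are simply normalised into relative probabilities before selection.

```python
import random

def weighted_choice(selection, weight_pattern, size=1, seed=None):
    # Normalise the raw weights so they sum to 1 -- the caller's
    # pattern does not need to be a true probability distribution.
    total = sum(weight_pattern)
    probabilities = [w / total for w in weight_pattern]
    rng = random.Random(seed)
    return rng.choices(selection, weights=probabilities, k=size)

# weights 5:4:1 normalise to probabilities 0.5, 0.4, 0.1
picks = weighted_choice(['M', 'F', 'U'], [5, 4, 1], size=100, seed=42)
```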
Let's generate an array of 100 values and then see how many times each category is selected.
selection = ['M', 'F', 'U']
gender = tools.get_category(selection, weight_pattern=[5,4,1], size=100)
dist = [0]*3
for g in gender:
    dist[selection.index(g)] += 1
print(dist)
$> [51, 40, 9]
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(8,3))
sns.set(style="whitegrid")
g = sns.barplot(x=selection, y=dist)
It can also be used to create more complex distributions. In this example we want an age distribution that has peaks around 35-40 and 55-60 with a significant tail-off after 60, but we don't want to specify a probability for every age.
# break the pattern into every 5 years
pattern = [3,5,6,10,6,5,7,15,5,2,1,0.5,0.2,0.1]
age = tools.get_number(20, 90, weight_pattern=pattern, size=1000)
fig = plt.figure(figsize=(10,4))
_ = sns.set(style="whitegrid")
_ = sns.kdeplot(age, shade=True)
4.4.1 Complex Weighting patterns
Weighting patterns can be multi-dimensional, controlling the distribution over time.
In this example we don't want any values below 50 in the first half, then only values below 50 in the second half.
split_pattern = [[0,1],[1,0]]
numbers = tools.get_number(100, weight_pattern=split_pattern, size=100)
fig = plt.figure(figsize=(8,4))
plt.style.use('seaborn-whitegrid')
plt.plot(list(range(100)), numbers);
_ = plt.axhline(y=50, linewidth=0.75, color='red')
_ = plt.axvline(x=50, linewidth=0.75, color='red')
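Conceptually, a multi-dimensional pattern works as if each sample's position in the sequence selects which sub-pattern is active. A rough pure-Python sketch of that idea (not the library's internals, and with hypothetical helper names):

```python
import random

def split_weighted_numbers(to_value, split_pattern, size, seed=None):
    # Each element of split_pattern is a weight pattern over the value
    # range; which element applies depends on how far through the
    # sequence we are.
    rng = random.Random(seed)
    results = []
    for i in range(size):
        phase = int(i / size * len(split_pattern))
        weights = split_pattern[phase]
        # divide the value range into equal buckets and pick one
        # according to the active weights
        bucket = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        width = to_value / len(weights)
        results.append(rng.uniform(bucket * width, (bucket + 1) * width))
    return results

# [[0,1],[1,0]]: only high values in the first half, only low in the second
numbers = split_weighted_numbers(100, [[0, 1], [1, 0]], size=100, seed=0)
```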
We can even build more complex numbering where we always get numbers around the middle, but the first third and last third additionally contain high and low numbers respectively.
mid_pattern = [[0,0,1],1,[1,0,0]]
numbers = tools.get_number(100, weight_pattern=mid_pattern, size=100)
fig = plt.figure(figsize=(8,4))
_ = plt.plot(list(range(100)), numbers);
_ = plt.axhline(y=33, linewidth=0.75, color='red')
_ = plt.axhline(y=67, linewidth=0.75, color='red')
_ = plt.axvline(x=33, linewidth=0.75, color='red')
_ = plt.axvline(x=67, linewidth=0.75, color='red')
4.4.2 Random Seed
In this example we use seeding to fix the randomness of both the weighted pattern and the numbers generated. We can then search for a set of seeds that generate different spike patterns we can reproduce on demand.
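The principle behind seeding can be shown in plain Python: a fixed seed makes every draw repeatable, so any interesting pattern found by trialling seeds can be regenerated exactly.

```python
import random

# The same seed always produces the same sequence of draws, so a
# pattern found once can be regenerated on demand.
first = random.Random(2018).choices(range(100), k=5)
second = random.Random(2018).choices(range(100), k=5)
assert first == second  # identical draws from identical seeds
```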
import numpy as np

fig = plt.figure(figsize=(12,15))
right = False
for i in range(0,10):
    ax = plt.subplot2grid((5,2), (int(i/2), int(right)))
    result = tools.get_number(100, weight_pattern=np.sin(range(10)), size=100, seed=i+10)
    g = plt.plot(list(range(100)), result)
    t = plt.title("seed={}".format(i+10))
    right = not right
plt.tight_layout()
plt.show()
4.5 Dates
Dates are an important part of most datasets and need flexibility in all their multi-dimensional elements.
# creating a set of random dates and a set of unique dates
df = pd.DataFrame()
df['dates'] = tools.get_datetime('01/01/2017', '21/01/2017', size=20, date_format='%d-%m-%Y')
df['seq'] = tools.unique_date_seq('01/01/2017', '21/01/2017', size=20, date_format='%d-%m-%Y')
print("{}/20 dates and {}/20 unique date sequence".format(df.dates.nunique(), df.seq.nunique()))
$> 11/20 dates and 20/20 unique date sequence
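The difference can be sketched in plain Python (a hypothetical helper, not the package's implementation): independent random draws over a short range will collide, while sampling offsets without replacement guarantees uniqueness.

```python
import random
from datetime import datetime, timedelta

def sketch_unique_date_seq(start, until, size, seed=None):
    # Sampling minute offsets *without replacement* guarantees that
    # every generated timestamp across the range is unique.
    rng = random.Random(seed)
    span_minutes = int((until - start).total_seconds() // 60)
    offsets = rng.sample(range(span_minutes), k=size)
    return [start + timedelta(minutes=m) for m in offsets]

seq = sketch_unique_date_seq(datetime(2017, 1, 1), datetime(2017, 1, 21),
                             size=20, seed=7)
```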
4.5.1 Date patterns
get_datetime has a number of different weighting patterns that can be applied: across the date range, by year, by month, by weekday, by hour or by minute, or by a combination of any of them.
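A weekday pattern can be pictured as one relative weight per day of the week, where a zero removes that day entirely. A minimal sketch of the concept (hypothetical helper, not the package API):

```python
import random
from datetime import datetime, timedelta

def sketch_weekday_dates(start, until, weekday_pattern, size, seed=None):
    # weekday_pattern holds one relative weight per weekday, Monday
    # through Sunday; a zero weight excludes that day entirely.
    rng = random.Random(seed)
    days = [start + timedelta(days=d) for d in range((until - start).days)]
    weights = [weekday_pattern[d.weekday()] for d in days]
    return rng.choices(days, weights=weights, k=size)

# only Monday-Friday can ever be selected
working = sketch_weekday_dates(datetime(2019, 1, 14), datetime(2019, 1, 22),
                               [1, 1, 1, 1, 1, 0, 0], size=100, seed=3)
```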
from ds_discovery.transition.discovery import Visualisation as visual
# Create a month pattern that has no data in every other month
pattern = [1,0]*6
selection = ['Rigs', 'Office']
df_rota = pd.DataFrame()
df_rota['rota'] = tools.get_category(selection, size=300)
df_rota['dates'] = tools.get_datetime('01/01/2017', '01/01/2018', size=300, month_pattern=pattern)
# `cleaner` is assumed here to be the data-cleaning helper from the
# companion discovery transition utilities (its import is not shown)
df_rota = cleaner.to_date_type(df_rota, headers='dates')
df_rota = cleaner.to_category_type(df_rota, headers='rota')
visual.show_cat_time_index(df_rota, 'dates', 'rota')
Quite often dates need a specific pattern to represent real working times; in this example we only want dates that occur in the working week.
# create dates that are only during the working week
pattern = [1,1,1,1,1,0,0]
selection = ['Management', 'Staff']
df_seating = pd.DataFrame()
df_seating['position'] = tools.get_category(selection, weight_pattern=[7,3], size=100)
df_seating['dates'] = tools.get_datetime('14/01/2019', '22/01/2019', size=100, weekday_pattern=pattern)
df_seating = cleaner.to_date_type(df_seating, headers='dates')
df_seating = cleaner.to_category_type(df_seating, headers='position')
visual.show_cat_time_index(df_seating, 'dates', 'position')
4.5.2 What Next
These are only the starter building blocks that give the foundation for more complex rules and behaviours. Have a play with:
- correlate: creates data that correlates to another set of values, giving an offset value based on the original. This applies to dates, numbers and categories
- associate: allows the construction of complex rule-based actions and behaviour
- builder instance: explore the ability to configure and save a template so you can repeat the build

The library is being built out all the time, so keep it updated.
4.6 Python version
Python 2.6 and 2.7 are not supported. Although Python 3.x is supported, it is recommended to install discovery-behavioral-utils against the latest Python 3.6.x whenever possible.
4.7 GitHub Project
Discovery-Behavioral-Utils: https://github.com/Gigas64/discovery-behavioral-utils.
4.8 Change log
See CHANGELOG.
4.9 Licence
BSD-3-Clause: LICENSE.