drifto

Automatic featurization and ML for event analytics


Keywords
data, machine, learning, feature, engineering, analytics, autoML
License
libpng-2.0
Install
pip install drifto==0.1.4


Drifto: Automatic Featurization 🤖 for User Event Data 👥

User event data (clickstream, transactions, product interactions, etc.) is one of the highest-volume and highest-veracity data sources that organizations collect, yet it remains notoriously hard to featurize event streams and turn them into data-driven insights or actionable models.

Drifto is an automated feature engineering and machine learning tool. Drifto automatically generates a large number of user-centric autofeatures over a specified time period. Drifto offers a nearly fully-automated point-and-shoot experience: just point Drifto towards your raw event tables! Drifto also provides a suite of machine learning models that automatically interoperate with your generated feature tables.

Drifto is built on DuckDB and Apache Arrow, so it scales to large datasets. Contact us at founders@driftoml.com if you are interested in scaling Drifto up to petabyte scale with a fully managed cloud deployment.

Drifto Can Automatically:

  • Join, merge, and wrangle disparate user event tables across all user touch points
  • Generate dozens, hundreds, or even thousands of ML-ready autofeatures
  • Train models on training features and run inference on production features
  • [soon] Schedule and manage your Drifto pipelines to keep tables and models updated
  • [soon] Track data lineage all the way from raw data to processed features to trained models
  • [soon] Combine with self-supervised deep neural autofeatures that allow for unprecedented levels of user-behavior understanding

Drifto's Top Workflows 🏆:

  • Customer Value Estimation
  • Churn Prediction
  • Anomaly Detection
  • [soon] Personalization
  • [soon] Demand Sensing

Quick Start

Install Drifto with pip install drifto.

Example

See the examples directory for our primary example. The sample data has two tables, one with website clickstream data (events.parquet) and one with checkout transactions (transactions.parquet). The example merges the two tables into one master event table with drifto.wrangle and then uses drifto.featurize to automatically compute a large number of features for each user for each week based on different aggregations of the 'action', 'page', and other columns. These features are used to predict whether a user will stop making purchases in the subsequent week. See the docs for a more detailed example walkthrough.

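import drifto
import pyarrow.parquet as pq

# Columns that identify the user and the event time in every table.
fields = ('user_id', 'timestamp',)

# Merge the clickstream and transaction tables into one event table.
T = drifto.wrangle(*fields,
    primary_table_path='events.parquet',
    cols=["action", "order_total", "attributes->'$.page'"],
    table_paths=[('purchase', 'transactions.parquet')])

# Compute weekly per-user features and the prediction label.
feature_table, inference_table, metadata = drifto.featurize('action',
    *fields, T, 'week', 'action', target_value='purchase',
    histogram_cols=["attributes->'$.page'"],
    filter_inactive=True)

# Save the training features to disk.
pq.write_table(feature_table, "features.parquet")

# Train a logistic model and export it to test.onnx.
model, metadata = drifto.train(feature_table, metadata, max_epochs=80,
    model='logistic', model_export_path='test.onnx', lr=8e-3,
    batch_size=512)

# Run the trained model on the inference table.
predicts = drifto.inference(model, inference_table, metadata)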
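
The prediction target in this example (whether a user stops making purchases in the following week) can be illustrated with a small standalone sketch. It is not Drifto's implementation: the function name churn_labels, the integer week indices, and the rule that the most recent week is left unlabeled (because its outcome is not yet observable) are all assumptions made for illustration.

```python
def churn_labels(purchase_weeks, last_week):
    """Label each (user, week) with whether the user stops purchasing next week.

    purchase_weeks: {user_id: set of integer week indices with >= 1 purchase}
    last_week: index of the most recent observed week (hypothetical parameter).
    Returns {(user_id, week): bool}; True means no purchase in week + 1.
    """
    labels = {}
    for user_id, weeks in purchase_weeks.items():
        for week in sorted(weeks):
            if week >= last_week:
                # The following week has not been observed yet, so no label
                # can be assigned; such rows would only be usable for inference.
                continue
            labels[(user_id, week)] = (week + 1) not in weeks
    return labels


# Hypothetical data: user "a" buys in weeks 0, 1 and 3; user "b" only in week 2.
purchase_weeks = {"a": {0, 1, 3}, "b": {2}}
labels = churn_labels(purchase_weeks, last_week=3)
```

Only weeks in which a user actually purchased receive a label here, which is one plausible reading of filtering inactive users; the library may define activity differently.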

events.parquet has the following schema:

user_id: int64
timestamp: timestamp[ms]
action: string
attributes: string

where the action column specifies the type of action taken and attributes is a JSON-encoded string holding action-specific data, such as the page visited for a page_visited action.

transactions.parquet has the following schema:

user_id: int64
timestamp: timestamp[ms]
order_total: int64

The merged event table produced by wrangle has the following schema:

timestamp: timestamp[ms]
user_id: string
action: string
order_total: int64
attributes_page: string

where order_total comes from the transactions table and attributes_page is extracted from the attributes JSON in events.parquet. It holds the page visited for a page_visited action and is null for other action types.
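
To make the shape of the merged table concrete, here is a minimal pure-Python sketch of the same idea. It is an illustration only, not how drifto.wrangle works internally. The helper merge_event_tables, the sample rows, and the choice to label transaction rows with the action 'purchase' (inferred from the table_paths argument in the example) are assumptions.

```python
import json

# Hypothetical in-memory rows standing in for events.parquet and
# transactions.parquet; timestamps are plain integers for brevity.
events = [
    {"user_id": 1, "timestamp": 1000, "action": "page_visited",
     "attributes": json.dumps({"page": "payments"})},
    {"user_id": 1, "timestamp": 2000, "action": "email_opened",
     "attributes": json.dumps({})},
]
transactions = [
    {"user_id": 1, "timestamp": 1500, "order_total": 40},
]


def merge_event_tables(events, transactions, transaction_action="purchase"):
    """Combine both tables into one list of rows with a shared schema."""
    merged = []
    for row in events:
        attrs = json.loads(row["attributes"]) if row["attributes"] else {}
        merged.append({
            "timestamp": row["timestamp"],
            "user_id": str(row["user_id"]),
            "action": row["action"],
            "order_total": None,                 # only transactions carry a total
            "attributes_page": attrs.get("page"),  # None when no page is present
        })
    for row in transactions:
        merged.append({
            "timestamp": row["timestamp"],
            "user_id": str(row["user_id"]),
            "action": transaction_action,        # assumed label for these rows
            "order_total": row["order_total"],
            "attributes_page": None,
        })
    # Order the unified stream by event time.
    merged.sort(key=lambda r: (r["timestamp"], r["user_id"]))
    return merged


merged = merge_event_tables(events, transactions)
```

Each output row has the five columns listed in the merged schema, with None playing the role of null.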

Finally, the feature table has the following schema:

user_id: string
time_period: timestamp[us]

action_page_visited_count: int64
action_page_visited_count_1: int64
action_email_opened_count: int64
action_email_opened_count_1: int64
...
attributes_page_payments_count: int64
attributes_page_payments_count_1: int64
...
action_count_distinct: int64
action_count_distinct_1: int64
action_mode: string
action_mode_1: string
...
attributes_page_count_distinct: int64
attributes_page_count_distinct_1: int64
...
order_total_sum: double
order_total_sum_1: double
order_total_avg: double
order_total_avg_1: double
order_total_std: double
order_total_std_1: double

label: bool

where the first group of features counts the occurrences of each action type for each user in each time period. Features with the suffix _1 hold the same value from the previous time period for the same user. The second group counts the occurrences of each visited page for page_visited events. The third and fourth groups are the standard features computed for every categorical column (here action and attributes_page). The last group shows the standard features computed for numerical columns (here order_total).
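
The count features and their _1 counterparts can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Drifto's implementation: the helpers week_start and count_features are hypothetical, weeks are assumed to start on Monday, and a missing previous period is filled with 0 (the library may use null instead).

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def week_start(ts):
    """Truncate a datetime to the Monday 00:00 that starts its week."""
    day = ts - timedelta(days=ts.weekday())
    return day.replace(hour=0, minute=0, second=0, microsecond=0)


def count_features(events):
    """Return {(user_id, period): {feature_name: value}} with lagged copies."""
    counts = defaultdict(Counter)
    for user_id, ts, action in events:
        counts[(user_id, week_start(ts))][f"action_{action}_count"] += 1

    # Every row gets the same columns, even for actions the user did not take.
    names = sorted({name for c in counts.values() for name in c})
    table = {}
    for (user_id, period), current in counts.items():
        # Look up the same user's counts one week earlier (empty if absent).
        previous = counts.get((user_id, period - timedelta(weeks=1)), Counter())
        row = {}
        for name in names:
            row[name] = current[name]
            row[f"{name}_1"] = previous[name]
        table[(user_id, period)] = row
    return table


# Hypothetical events: (user_id, timestamp, action).
events = [
    ("1", datetime(2023, 1, 2, 9), "page_visited"),   # week of 2023-01-02
    ("1", datetime(2023, 1, 3, 9), "page_visited"),
    ("1", datetime(2023, 1, 10, 9), "email_opened"),  # week of 2023-01-09
]
features = count_features(events)
```

For the week of 2023-01-09, action_page_visited_count is 0 while action_page_visited_count_1 is 2, because both page visits happened in the preceding week.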