A wrapper toolbox that provides a compatibility layer between TPOT / Auto-Sklearn and OpenML


Arbok

Arbok (AutoML wrapper toolbox for OpenML compatibility) provides wrappers for TPOT and Auto-Sklearn, acting as a compatibility layer between these tools and OpenML.

The wrappers extend scikit-learn's BaseSearchCV and expose the fitted attributes that OpenML needs, such as cv_results_, best_index_, best_params_, best_score_ and classes_.
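
For illustration, a fitted wrapper can be used like any other BaseSearchCV estimator. A minimal sketch, assuming some training data X and y and a deliberately short time budget so the run finishes quickly:

from arbok import AutoSklearnWrapper

clf = AutoSklearnWrapper(time_left_for_this_task=60, per_run_time_limit=30)
clf.fit(X, y)            # X, y: any classification dataset (assumed to be defined)

print(clf.best_params_)  # parameters of the best pipeline found
print(clf.best_score_)   # its internal validation score
print(clf.classes_)      # class labels seen during fit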

Installation

pip install arbok

Simple example

import openml
from arbok import AutoSklearnWrapper, TPOTWrapper


task = openml.tasks.get_task(31)
dataset = task.get_dataset()

# Get the AutoSklearn wrapper and pass parameters like you would to AutoSklearn
clf = AutoSklearnWrapper(
    time_left_for_this_task=3600, per_run_time_limit=360
)

# Or get the TPOT wrapper and pass parameters like you would to TPOT
clf = TPOTWrapper(
    generations=100, population_size=100, verbosity=2
)

# Execute the task
run = openml.runs.run_model_on_task(task, clf)
run.publish()

print('URL for run: %s/run/%d' % (openml.config.server, run.run_id))
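
Note that run.publish() requires an OpenML API key to be configured beforehand, for example:

import openml
openml.config.apikey = "YOUR_OPENML_API_KEY"  # placeholder; use your own key, or set it in the OpenML config file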

Preprocessing data

To make the wrappers more robust, we preprocess the data: we impute missing values and one-hot encode categorical features.

First, we get a mask that tells us, for each feature, whether it is categorical.

dataset = task.get_dataset()
_, categorical = dataset.get_data(return_categorical_indicator=True)
categorical = categorical[:-1]  # Remove last index (which is the class)

Next, we set up a pipeline for the preprocessing. We use a ConditionalImputer, an imputer that can apply different strategies to categorical (nominal) and numerical data.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from arbok import ConditionalImputer

preprocessor = make_pipeline(
    ConditionalImputer(
        categorical_features=categorical,
        strategy="mean",
        strategy_nominal="most_frequent"
    ),
    OneHotEncoder(
        categorical_features=categorical, handle_unknown="ignore", sparse=False
    )
)

And finally, we put everything together in one of the wrappers.

clf = AutoSklearnWrapper(
    preprocessor=preprocessor, time_left_for_this_task=3600, per_run_time_limit=360
)
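
The wrapped estimator, with the preprocessor attached, can then be run on the task exactly as in the simple example above:

run = openml.runs.run_model_on_task(task, clf)
run.publish()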

Limitations

  • Currently, only classifiers are implemented; regression is not yet possible.
  • For TPOT, the config_dict parameter cannot be set, because this causes problems with the API.

Benchmarking

Installing the arbok package also installs the arbench CLI tool. We can generate a JSON configuration file like this:

from arbok.bench import Benchmark
bench = Benchmark()
config_file = bench.create_config_file(
       
    # Wrapper parameters
    wrapper={"refit": True, "verbose": False, "retry_on_error": True},
    
    # TPOT parameters
    tpot={
        "max_time_mins": 6,              # Max total time in minutes
        "max_eval_time_mins": 1          # Max time per candidate in minutes
    },
    
    # Autosklearn parameters
    autosklearn={
        "time_left_for_this_task": 360,  # Max total time in seconds
        "per_run_time_limit": 60         # Max time per candidate in seconds
    }
)

And then, we can call arbench like this:

arbench --classifier autosklearn --task-id 31 --config config.json

Or we can call arbok as a Python module:

python -m arbok --classifier autosklearn --task-id 31 --config config.json

Running a benchmark on batch systems

To run a large-scale benchmark, we create a configuration file as above, then generate and submit jobs to a batch system as follows.

# We create a benchmark setup where we specify the headers, the interpreter we
# want to use, the directory to where we store the jobs (.sh-files), and we give
# it the config-file we created earlier.
bench = Benchmark(
    headers="#PBS -lnodes=1:cpu3\n#PBS -lwalltime=1:30:00",
    python_interpreter="python3",  # Path to interpreter
    root="/path/to/project/",
    jobs_dir="jobs",
    config_file="config.json",
    log_file="log.json"
)

# Create the config file like we did in the section above
config_file = bench.create_config_file(
       
    # Wrapper parameters
    wrapper={"refit": True, "verbose": False, "retry_on_error": True},
    
    # TPOT parameters
    tpot={
        "max_time_mins": 6,              # Max total time in minutes
        "max_eval_time_mins": 1          # Max time per candidate in minutes
    },
    
    # Autosklearn parameters
    autosklearn={
        "time_left_for_this_task": 360,  # Max total time in seconds
        "per_run_time_limit": 60         # Max time per candidate in seconds
    }
)

# Next, we load the tasks we want to benchmark on from OpenML.
# In this case, we load a list of task IDs from study 99.
tasks = openml.study.get_study(99).tasks

# Next, we create jobs for both tpot and autosklearn.
bench.create_jobs(tasks, classifiers=["tpot", "autosklearn"])

# And finally, we submit the jobs using qsub
bench.submit_jobs()

Preprocessing parameters

from arbok import ParamPreprocessor
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline

X = np.array([
    [1, 2, True, "foo", "one"],
    [1, 3, False, "bar", "two"],
    [np.nan, "bar", None, None, "three"],
    [1, 7, 0, "zip", "four"],
    [1, 9, 1, "foo", "five"],
    [1, 10, 0.1, "zip", "six"]
], dtype=object)

# Manually specify the types (and pass types=types below), or use types="detect" to detect them automatically
types = ["numeric", "mixed", "bool", "nominal", "nominal"]

pipeline = make_pipeline(ParamPreprocessor(types="detect"), VarianceThreshold())

pipeline.fit_transform(X)

Output:

[[-0.4472136  -0.4472136   1.41421356 -0.70710678 -0.4472136  -0.4472136
   2.23606798 -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
  -0.85226648  1.        ]
 [-0.4472136   2.23606798 -0.70710678 -0.70710678 -0.4472136  -0.4472136
  -0.4472136  -0.4472136  -0.4472136   2.23606798  0.4472136  -0.4472136
  -0.5831297  -1.        ]
 [ 2.23606798 -0.4472136  -0.70710678 -0.70710678 -0.4472136  -0.4472136
  -0.4472136  -0.4472136   2.23606798 -0.4472136  -2.23606798  2.23606798
  -1.39054004 -1.        ]
 [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136   2.23606798
  -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
   0.49341743 -1.        ]
 [-0.4472136  -0.4472136   1.41421356 -0.70710678  2.23606798 -0.4472136
  -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
   1.031691    1.        ]
 [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136  -0.4472136
  -0.4472136   2.23606798 -0.4472136  -0.4472136   0.4472136  -0.4472136
   1.30082778  1.        ]]