sklearn-special-ensembles

A library of specialized ensembles for sklearn-type base models.


Keywords
sklearn, ensemble, modeling, data, analysis, machine, learning, artificial-intelligence, data-science, ensemble-learning, machine-learning
License
MIT
Install
pip install sklearn-special-ensembles==1.1.3

Documentation

sklearn-special-ensembles 🪄

A library that creates robust, special-purpose ensembles from sklearn-like base models (including lightgbm, xgboost, and catboost).

Install from PyPI ✅

pip install sklearn-special-ensembles

Models and examples 🚀

NormalizedModel

NormalizedModel normalizes the target using another feature in the input DataFrame, either by subtracting it from the target, dividing the target by it, or multiplying the target by it. It trains one model on all available data and another on only those rows that are normalizable. During inference, non-normalizable rows are predicted by the general model, while normalizable rows are predicted by the normalized model.

from sklearn_special_ensembles.models.NormalizedModel import NormalizedModel
from sklearn_special_ensembles.tests.generate_dummy_dataframe import generate_dummy_dataframe
from lightgbm import LGBMRegressor

train_df, test_df = generate_dummy_dataframe(num_categorical_predictors=2, categories_by_column=[[1, 2], [3, 4]])

base_model = LGBMRegressor(verbose=-1)
normalized_model = NormalizedModel(base_estimator=base_model)

normalized_model.fit(train_df.drop(columns=["target"]), train_df["target"], normalizing_col="numerical_0", how="divide")
preds = normalized_model.predict(test_df.drop(columns=["target"]))
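
If it helps to see the underlying idea, here is a minimal hand-rolled sketch of what how="divide" amounts to, reusing train_df and test_df from above. It is an illustration of the concept, not the library's internals, and the exact handling of non-normalizable rows may differ:

from lightgbm import LGBMRegressor

# Hand-rolled illustration (not NormalizedModel's internals): with how="divide",
# the normalized model is effectively fit on target / normalizing_col, and its
# predictions are multiplied back by that column at inference time.
X, y = train_df.drop(columns=["target"]), train_df["target"]
ok = X["numerical_0"] != 0  # rows where the division is well defined

manual_model = LGBMRegressor(verbose=-1)
manual_model.fit(X[ok], y[ok] / X.loc[ok, "numerical_0"])

X_test = test_df.drop(columns=["target"])
manual_preds = manual_model.predict(X_test) * X_test["numerical_0"]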

SegmentEnsemble

SegmentEnsemble fits a separate estimator to each segment of the data, where a segment is the set of rows sharing a unique combination of values across the specified columns. The ensemble can also fit a general model on all available data and blend the segment-level predictions with the general predictions during inference.

from sklearn_special_ensembles.tests.generate_dummy_dataframe import generate_dummy_dataframe
from sklearn_special_ensembles.models.SegmentEnsemble import SegmentEnsemble
from lightgbm import LGBMRegressor

train_df, test_df = generate_dummy_dataframe(num_categorical_predictors=2, categories_by_column=[[1, 2], [3, 4]])

base_model = LGBMRegressor(verbose=-1)
segment_ensemble = SegmentEnsemble(base_estimator=base_model)

segment_ensemble.fit(train_df.drop(columns=["target"]), train_df["target"], segment_cols=["categorical_0", "categorical_1"])
preds = segment_ensemble.predict(test_df.drop(columns=["target"]), percent_general_model=0.1)
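
For intuition, percent_general_model presumably weights the general model against the segment-level models, roughly as in the sketch below (hedged; the exact blending logic lives inside the library):

import numpy as np

# Hedged illustration of the blend in predict(..., percent_general_model=w):
# final = w * general-model predictions + (1 - w) * segment-model predictions.
w = 0.1
general_preds = np.array([10.0, 12.0])  # hypothetical general-model outputs
segment_preds = np.array([11.0, 15.0])  # hypothetical segment-model outputs
blended = w * general_preds + (1 - w) * segment_preds  # -> [10.9, 14.7]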

OutlierEnsemble

OutlierEnsemble is designed for working with data that has a designated ID-like column and some IDs that have a fundamentally different relationship with the predictors than the rest of the IDs in the dataset. It fits a separate model to these outlier IDs to preserve the signal in the rest of the data.

from sklearn_special_ensembles.tests.generate_dummy_dataframe import generate_dummy_dataframe
from sklearn_special_ensembles.models.OutlierEnsemble import OutlierEnsemble
from lightgbm import LGBMRegressor

train_df, test_df = generate_dummy_dataframe(num_categorical_predictors=1, categories_by_column=[[1, 2, 3, 4]])

base_model = LGBMRegressor(verbose=-1)
outlier_ensemble = OutlierEnsemble(base_estimator=base_model)

outlier_ensemble.fit(train_df.drop(columns=["target"]), train_df["target"], id_col="categorical_0", outlier_ids=[1, 3])
preds = outlier_ensemble.predict(test_df.drop(columns=["target"]))
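
Conceptually, this amounts to routing each row to one of two models based on its ID. The sketch below reuses train_df and test_df from above; it is an illustration only, and whether the main model sees all rows or only the non-outlier rows is an implementation detail of the library:

import numpy as np
from lightgbm import LGBMRegressor

# Hand-rolled illustration (not OutlierEnsemble's internals): one model for the
# designated outlier IDs, another for the remaining rows, selected row by row.
X, y = train_df.drop(columns=["target"]), train_df["target"]
outlier_rows = X["categorical_0"].isin([1, 3])

main_model = LGBMRegressor(verbose=-1).fit(X[~outlier_rows], y[~outlier_rows])
outlier_model = LGBMRegressor(verbose=-1).fit(X[outlier_rows], y[outlier_rows])

X_test = test_df.drop(columns=["target"])
manual_preds = np.where(X_test["categorical_0"].isin([1, 3]),
                        outlier_model.predict(X_test),
                        main_model.predict(X_test))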

FeatureSubsetEnsemble

FeatureSubsetEnsemble trains separate base learners on distinct subsets of the available features. This adds diversity to the ensemble and is particularly helpful when working with noisy data and a large feature space.

from sklearn_special_ensembles.tests.generate_dummy_dataframe import generate_dummy_dataframe
from sklearn_special_ensembles.models.FeatureSubsetEnsemble import FeatureSubsetEnsemble
from lightgbm import LGBMRegressor

train_df, test_df = generate_dummy_dataframe(num_categorical_predictors=2, categories_by_column=[[1, 2, 3, 4], [5, 6]])

base_model = LGBMRegressor(verbose=-1)
feature_ensemble = FeatureSubsetEnsemble(base_estimator=base_model)

feature_ensemble.fit(
    train_df.drop(columns=["target"]),
    train_df["target"],
    train_col_groups=[["numerical_0", "numerical_1"],
                      ["categorical_0", "categorical_1"]]
)
preds = feature_ensemble.predict(test_df.drop(columns=["target"]))
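
As a rough picture of what this does, each column group gets its own learner and their predictions are combined; the sketch below simply averages them (hedged; the library may weight or combine them differently), reusing train_df and test_df from above:

import numpy as np
from lightgbm import LGBMRegressor

# Hand-rolled illustration (not FeatureSubsetEnsemble's internals): one learner
# per column group, with the per-group predictions averaged at inference time.
X, y = train_df.drop(columns=["target"]), train_df["target"]
col_groups = [["numerical_0", "numerical_1"], ["categorical_0", "categorical_1"]]

subset_models = [LGBMRegressor(verbose=-1).fit(X[cols], y) for cols in col_groups]

X_test = test_df.drop(columns=["target"])
manual_preds = np.mean(
    [model.predict(X_test[cols]) for model, cols in zip(subset_models, col_groups)],
    axis=0,
)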

FoldableEnsemble

FoldableEnsemble trains separate base estimators on separate folds of the data and combines their predictions during inference. The user can specify the indices of the folds used during training and the weights given to each estimator during inference.

import copy
from sklearn_special_ensembles.tests.generate_dummy_dataframe import generate_dummy_dataframe
from sklearn_special_ensembles.models.FoldableEnsemble import FoldableEnsemble
from lightgbm import LGBMRegressor

train_df, test_df = generate_dummy_dataframe(num_categorical_predictors=1, categories_by_column=[[1, 2, 3, 4]])

base_model = LGBMRegressor(verbose=-1)
n_splits = 4
foldable_ensemble = FoldableEnsemble(estimators=[copy.deepcopy(base_model) for _ in range(n_splits)])

foldable_ensemble.fit(train_df.drop(columns=["target"]), train_df["target"])
preds = foldable_ensemble.predict(test_df.drop(columns=["target"]))
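
The example above lets the ensemble handle folding with its defaults; the parameter names for passing explicit fold indices and estimator weights are not shown here. The sketch below illustrates the idea by hand with scikit-learn's KFold, reusing train_df and test_df from above (hedged; the library may instead train each estimator on the complementary split of each fold):

import numpy as np
from sklearn.model_selection import KFold
from lightgbm import LGBMRegressor

# Hand-rolled illustration (not FoldableEnsemble's internals): one estimator per
# disjoint fold, with a weighted average of their predictions at inference time.
X, y = train_df.drop(columns=["target"]), train_df["target"]
fold_models = [
    LGBMRegressor(verbose=-1).fit(X.iloc[fold_idx], y.iloc[fold_idx])
    for _, fold_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X)
]

X_test = test_df.drop(columns=["target"])
weights = [0.25, 0.25, 0.25, 0.25]  # hypothetical per-estimator weights
manual_preds = np.average([m.predict(X_test) for m in fold_models], axis=0, weights=weights)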

Upcoming 🔜

Don't hesitate to reach out if you find any bugs in this package or want to contribute! In the meantime, I'll keep writing more special-purpose ensembles as they prove useful in the competitions I participate in.