Accelerate machine learning experimentation


Keywords
machine, learning, data, science, data-analysis, data-science, data-visualization, machine-learning, python
License
MIT
Install
pip install mlmachine==0.1.5

mlmachine

"mlmachine is a Python library that organizes and accelerates notebook-based machine learning experiments."

Novel Functionality

Easy, Elegant EDA

mlmachine creates beautiful and informative EDA panels with ease:

# create EDA panel for all "category" features
for feature in mlmachine_titanic.data.mlm_dtypes["category"]:
    mlmachine_titanic.eda_cat_target_cat_feat(
        feature=feature,
        legend_labels=["Died","Survived"],
    )

Pandas-in / Pandas-out Pipelines

mlmachine makes Scikit-learn transformers Pandas-friendly.

For example, wrapping the mlmachine utility PandasTransformer() around OneHotEncoder() keeps the output a DataFrame.


KFold Target Encoding

mlmachine includes a utility called KFoldEncoder, which applies target encoding on categorical features and leverages out-of-fold encoding to prevent target leakage:

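from category_encoders import TargetEncoder
from sklearn.model_selection import KFold

# perform 5-fold target encoding with mlmachine's KFoldEncoder and
# TargetEncoder from the category_encoders library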
encoder = KFoldEncoder(
    target=mlmachine_titanic_train.target,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    encoder=TargetEncoder,
)
encoder.fit_transform(mlmachine_titanic_train.data[["Pclass"]])

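To make the out-of-fold idea concrete, here is a minimal sketch that does not use mlmachine or category_encoders. The function out_of_fold_target_encode and the toy data are hypothetical, and the encoding is a plain per-category target mean rather than TargetEncoder's smoothed estimate. Each row is encoded with statistics learned only from the other folds, so a row's own target value never contributes to its encoding:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(feature, target, cv):
    """Encode each row with target means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=feature.index, dtype=float)
    for fit_idx, apply_idx in cv.split(feature):
        # category means learned from the rows outside the current fold
        means = target.iloc[fit_idx].groupby(feature.iloc[fit_idx]).mean()
        # categories unseen outside the fold fall back to the outside-fold mean
        fallback = target.iloc[fit_idx].mean()
        encoded.iloc[apply_idx] = (
            feature.iloc[apply_idx].map(means).fillna(fallback).to_numpy()
        )
    return encoded


# toy stand-ins for the Titanic "Pclass" feature and "Survived" target
pclass = pd.Series([1, 3, 3, 2, 1, 3, 2, 1, 3, 3], name="Pclass")
survived = pd.Series([1, 0, 0, 1, 1, 0, 0, 1, 1, 0], name="Survived")
cv = KFold(n_splits=5, shuffle=True, random_state=0)
pclass_encoded = out_of_fold_target_encode(pclass, survived, cv)
print(pclass_encoded)
```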

Crowd-sourced Feature Importance & Exhaustive Feature Selection

mlmachine estimates feature importance by combining the results of several techniques:

  • Tree-based Feature Importance
  • Recursive Feature Elimination
  • Sequential Forward Selection
  • Sequential Backward Selection
  • F-value / p-value
  • Variance 
  • Target Correlation

A single call runs all of these techniques across multiple estimators and one or more scoring metrics:

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# instantiate custom models
rf2 = RandomForestClassifier(max_depth=2)
rf4 = RandomForestClassifier(max_depth=4)
rf6 = RandomForestClassifier(max_depth=6)

# estimator list - default XGBClassifier, default
# RandomForestClassifier and three custom models
estimators = [
    XGBClassifier,
    RandomForestClassifier,
    rf2,
    rf4,
    rf6,
]

# instantiate FeatureSelector object
fs = mlmachine_titanic_train.FeatureSelector(
    data=mlmachine_titanic_train.data,
    target=mlmachine_titanic_train.target,
    estimators=estimators,
)

# run feature importance techniques, use ROC AUC and
# accuracy score metrics and 0 CV folds (where applicable)
feature_selector_summary = fs.feature_selector_suite(
    sequential_scoring=["roc_auc","accuracy_score"],
    sequential_n_folds=0,
    save_to_csv=True,
)
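
As a rough illustration of how several techniques can be pooled into one ranking, the sketch below scores synthetic features with four of the listed techniques and averages their ranks. It uses only pandas and scikit-learn; the simple rank average is an assumption made for illustration and may differ from how feature_selector_suite aggregates its results:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

# synthetic data standing in for a real training set
X_arr, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# one column of raw scores per technique; higher means more important
scores = pd.DataFrame(index=X.columns)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
scores["tree_importance"] = forest.feature_importances_
scores["f_value"] = f_classif(X, y)[0]
scores["variance"] = X.var()
scores["target_correlation"] = X.corrwith(pd.Series(y)).abs()

# convert each technique's scores to ranks (1 = most important), then average
ranks = scores.rank(ascending=False)
ranks["average_rank"] = ranks.mean(axis=1)
ranking = ranks.sort_values("average_rank")
print(ranking)
```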

The features are then removed one at a time, from least important to most important, and each remaining subset is cross-validated in search of an optimal feature subset.

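The winnowing procedure can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions, not mlmachine's implementation: the ranking comes from a single random forest, the data is synthetic, and accuracy with 5-fold cross-validation is the only metric:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic data standing in for a real training set
X_arr, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# assumed ranking, most important first, taken from a single random forest
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
ranked = list(importances.sort_values(ascending=False).index)

# drop the least important feature one at a time, scoring each remaining subset
rows = []
for n_dropped in range(len(ranked)):
    subset = ranked[: len(ranked) - n_dropped]
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    score = cross_val_score(model, X[subset], y, cv=5, scoring="accuracy").mean()
    rows.append({"n_features": len(subset), "features": subset, "cv_accuracy": score})

results = pd.DataFrame(rows)
best = results.loc[results["cv_accuracy"].idxmax()]
print(results[["n_features", "cv_accuracy"]])
print("best subset:", best["features"])
```

The subset with the highest cross-validated score is kept; with several estimators or metrics, the same loop would be repeated per combination.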



Hyperparameter Tuning with Bayesian Optimization

mlmachine can perform Bayesian optimization on multiple estimators in one shot, and includes functionality for visualizing model performance and parameter selections:

# generate parameter selection panels for each parameter
mlmachine_titanic_train.model_param_plot(
    bayes_optim_summary=bayes_optim_summary,
    estimator_class="KNeighborsClassifier",
    estimator_parameter_space=estimator_parameter_space,
    n_iter=100,
)

Example Notebooks

All examples can be viewed here

Example Notebook 1 - Learn the basics of mlmachine, how to create EDA panels, and how to execute Pandas-friendly Scikit-learn transformations and pipelines.

Example Notebook 2 - Learn how to use mlmachine to assess a dataset's preprocessing needs. See examples of how to use novel functionality, such as GroupbyImputer(), KFoldEncoder() and DualTransformer().

Example Notebook 3 - Learn how to perform thorough feature importance estimation, followed by an exhaustive, cross-validation-driven feature selection process.

Example Notebook 4 - Learn how to execute hyperparameter tuning with Bayesian optimization for multiple models and multiple parameter spaces in one simple execution.

Articles on Medium

mlmachine - Clean ML Experiments, Elegant EDA & Pandas Pipelines - Published 4/3/2020

mlmachine - GroupbyImputer, KFoldEncoder, and Skew Correction - Published 4/13/2020

Installation

Python Requirements: 3.6, 3.7

mlmachine uses the latest, or nearly the latest, versions of all its dependencies, so installing it in a virtual environment is highly recommended.

pyenv

Create a new virtual environment:

$ pyenv virtualenv 3.7.5 mlmachine-env

Activate your new virtual environment:

$ pyenv activate mlmachine-env

Use pip to install mlmachine and all dependencies:

$ pip install mlmachine

anaconda

Create a new virtual environment:

$ conda create --name mlmachine-env python=3.7

Activate your new virtual environment:

$ conda activate mlmachine-env

Use pip to install mlmachine and all dependencies:

$ pip install mlmachine

Feedback

Any and all feedback is welcome. Please send me an email at petersontylerd@gmail.com

Acknowledgments

mlmachine stands on the shoulders of many great Python packages:

catboost | category_encoders | eif | hyperopt | imbalanced-learn | jupyter | lightgbm | matplotlib | numpy | pandas | prettierplot | scikit-learn | scipy | seaborn | shap | statsmodels | xgboost