mlmachine
"mlmachine is a Python library that organizes and accelerates notebook-based machine learning experiments."
Novel Functionality
Easy, Elegant EDA
mlmachine creates beautiful and informative EDA panels with ease:
# create EDA panel for all "category" features
for feature in mlmachine_titanic.data.mlm_dtypes["category"]:
    mlmachine_titanic.eda_cat_target_cat_feat(
        feature=feature,
        legend_labels=["Died", "Survived"],
    )
Pandas-in / Pandas-out Pipelines
mlmachine makes Scikit-learn transformers Pandas-friendly.
For example, wrapping the mlmachine utility PandasTransformer() around OneHotEncoder() is enough to keep the output as a DataFrame.
KFold Target Encoding
mlmachine includes a utility called KFoldEncoder, which applies target encoding on categorical features and leverages out-of-fold encoding to prevent target leakage:
from category_encoders import TargetEncoder
from sklearn.model_selection import KFold

# perform 5-fold target encoding with TargetEncoder from the category_encoders library
encoder = KFoldEncoder(
    target=mlmachine_titanic_train.target,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    encoder=TargetEncoder,
)
encoder.fit_transform(mlmachine_titanic_train.data[["Pclass"]])
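To show what out-of-fold encoding means in practice, here is a minimal sketch written with pandas and scikit-learn's KFold only. It is an illustration of the general technique, not KFoldEncoder's implementation; the function kfold_target_encode, the fallback for unseen categories, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def kfold_target_encode(feature, target, cv):
    """Encode each row with the target mean computed on the other folds only."""
    encoded = pd.Series(np.nan, index=feature.index, dtype=float)
    for fit_idx, encode_idx in cv.split(feature):
        # Category means come from rows outside the fold being encoded,
        # so a row's own target value never feeds its encoding
        fold_means = target.iloc[fit_idx].groupby(feature.iloc[fit_idx]).mean()
        # Categories absent from the fitting rows fall back to the overall mean
        fallback = target.iloc[fit_idx].mean()
        encoded.iloc[encode_idx] = (
            feature.iloc[encode_idx].map(fold_means).fillna(fallback).to_numpy()
        )
    return encoded


# Toy data, invented for illustration
pclass = pd.Series([1, 1, 2, 2, 3, 3, 1, 2, 3, 3], name="Pclass")
survived = pd.Series([1, 1, 1, 0, 0, 0, 0, 1, 0, 1], name="Survived")

cv = KFold(n_splits=5, shuffle=True, random_state=0)
pclass_encoded = kfold_target_encode(pclass, survived, cv)
print(pd.DataFrame({"Pclass": pclass, "Pclass_encoded": pclass_encoded}))
```

Because every row is encoded from statistics of the other folds, two rows in the same category can receive different encoded values, which is the price paid for avoiding leakage.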
Crowd-sourced Feature Importance & Exhaustive Feature Selection
mlmachine employs a robust approach to estimating feature importance by using a variety of techniques:
- Tree-based Feature Importance
- Recursive Feature Elimination
- Sequential Forward Selection
- Sequential Backward Selection
- F-value / p-value
- Variance
- Target Correlation
All of these techniques run in a single call, which can cover multiple estimators and one or more scoring metrics:
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# instantiate custom models
rf2 = RandomForestClassifier(max_depth=2)
rf4 = RandomForestClassifier(max_depth=4)
rf6 = RandomForestClassifier(max_depth=6)

# estimator list - default XGBClassifier, default
# RandomForestClassifier and three custom models
estimators = [
    XGBClassifier,
    RandomForestClassifier,
    rf2,
    rf4,
    rf6,
]

# instantiate FeatureSelector object
fs = mlmachine_titanic_train.FeatureSelector(
    data=mlmachine_titanic_train.data,
    target=mlmachine_titanic_train.target,
    estimators=estimators,
)

# run feature importance techniques, use ROC AUC and
# accuracy score metrics and 0 CV folds (where applicable)
feature_selector_summary = fs.feature_selector_suite(
    sequential_scoring=["roc_auc", "accuracy_score"],
    sequential_n_folds=0,
    save_to_csv=True,
)
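One plausible way to combine several importance techniques is to rank the features under each technique and average the ranks. The sketch below does this for four of the techniques listed above using scikit-learn and pandas directly. Whether mlmachine aggregates its results this way is an assumption, and the synthetic data and column names are invented for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif

# Synthetic data: 3 informative features and 3 noise features
X_arr, y_arr = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])
y = pd.Series(y_arr, name="target")

forest = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
forest.fit(X, y)
f_values, _ = f_classif(X, y)

# One column of raw scores per technique, one row per feature
scores = pd.DataFrame(
    {
        "tree_importance": forest.feature_importances_,
        "f_value": f_values,
        "variance": X.var().to_numpy(),
        "target_correlation": X.corrwith(y).abs().to_numpy(),
    },
    index=X.columns,
)

# Rank within each technique (1 = most important), then average across techniques
ranks = scores.rank(ascending=False)
summary = ranks.assign(average_rank=ranks.mean(axis=1)).sort_values("average_rank")
print(summary)
```

Averaging ranks rather than raw scores keeps techniques with very different scales, such as F-values and tree importances, from dominating one another.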
The features are then progressively removed, from least important to most important, and each remaining subset is scored through an exhaustive cross-validation procedure in search of an optimal feature subset.
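The winnowing procedure can be sketched as a simple loop: score the full feature set with cross-validation, drop the least important remaining feature, and repeat. This is a hedged illustration built on scikit-learn, not mlmachine's own routine. A single random forest's importances stand in for the aggregated ranking, and the synthetic data is invented.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: 3 informative features and 3 noise features
X_arr, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])
model = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)

# Order features from least to most important (stand-in for a combined ranking)
importances = pd.Series(model.fit(X, y).feature_importances_, index=X.columns)
drop_order = importances.sort_values().index.tolist()

records = []
remaining = list(X.columns)
for n_dropped, feature_to_drop in enumerate(drop_order):
    # Score the current subset before removing the next feature
    score = cross_val_score(model, X[remaining], y, cv=5, scoring="accuracy").mean()
    records.append(
        {"n_dropped": n_dropped, "n_features": len(remaining), "cv_accuracy": score}
    )
    remaining.remove(feature_to_drop)

results = pd.DataFrame(records)
best = results.loc[results["cv_accuracy"].idxmax()]
print(results)
print(f"Best subset keeps {int(best['n_features'])} features")
```

The subset with the highest cross-validated score is the candidate feature set; ties or near-ties can reasonably be broken in favor of fewer features.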
Hyperparameter Tuning with Bayesian Optimization
mlmachine can perform Bayesian optimization on multiple estimators in one shot, and includes functionality for visualizing model performance and parameter selections:
# generate parameter selection panels for each parameter
mlmachine_titanic_train.model_param_plot(
    bayes_optim_summary=bayes_optim_summary,
    estimator_class="KNeighborsClassifier",
    estimator_parameter_space=estimator_parameter_space,
    n_iter=100,
)
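For readers new to Bayesian optimization, the loop below sketches the general idea on a single KNeighborsClassifier parameter. It deliberately swaps in a Gaussian-process surrogate with an expected-improvement rule, built from scikit-learn and SciPy, rather than reproducing mlmachine's own tuning code. The synthetic data, search range, and iteration counts are assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
candidates = np.arange(1, 31)  # search space for n_neighbors


def cv_accuracy(n_neighbors):
    model = KNeighborsClassifier(n_neighbors=int(n_neighbors))
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()


# Start from a few randomly chosen values
rng = np.random.default_rng(0)
tried = [int(k) for k in rng.choice(candidates, size=3, replace=False)]
scores = [cv_accuracy(k) for k in tried]

for _ in range(7):
    # Surrogate: models the CV score as a function of the hyperparameter
    surrogate = GaussianProcessRegressor(
        kernel=Matern(nu=2.5), alpha=1e-4, normalize_y=True, random_state=0
    )
    surrogate.fit(np.array(tried, dtype=float).reshape(-1, 1), scores)

    untried = np.array([k for k in candidates if k not in tried])
    mean, std = surrogate.predict(untried.reshape(-1, 1).astype(float), return_std=True)

    # Expected improvement over the best score observed so far
    improvement = mean - max(scores)
    z = improvement / np.maximum(std, 1e-9)
    expected_improvement = improvement * norm.cdf(z) + std * norm.pdf(z)

    # Evaluate the most promising untried value next
    next_k = int(untried[np.argmax(expected_improvement)])
    tried.append(next_k)
    scores.append(cv_accuracy(next_k))

best_k = tried[int(np.argmax(scores))]
print(f"Best n_neighbors: {best_k} (CV accuracy {max(scores):.3f})")
```

Each iteration spends one model evaluation where the surrogate expects the largest gain, which is why this family of methods typically needs fewer evaluations than a grid search over the same range.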
Example Notebooks
All examples can be viewed here
Example Notebook 1 - Learn the basics of mlmachine, how to create EDA panels, and how to execute Pandas-friendly Scikit-learn transformations and pipelines.
Example Notebook 2 - Learn how to use mlmachine to assess a dataset's pre-processing needs. See examples of how to use novel functionality, such as GroupbyImputer(), KFoldEncoder() and DualTransformer().
Example Notebook 3 - Learn how to perform thorough feature importance estimation, followed by an exhaustive, cross-validation-driven feature selection process.
Example Notebook 4 - Learn how to execute hyperparameter tuning with Bayesian optimization for multiple models and multiple parameter spaces in one simple execution.
Articles on Medium
mlmachine - Clean ML Experiments, Elegant EDA & Pandas Pipelines - Published 4/3/2020
mlmachine - GroupbyImputer, KFoldEncoder, and Skew Correction - Published 4/13/2020
Installation
Python Requirements: 3.6, 3.7
mlmachine uses the latest, or nearly the latest, versions of all its dependencies, so installing it in a virtual environment is highly recommended.
pyenv
Create a new virtual environment:
$ pyenv virtualenv 3.7.5 mlmachine-env
Activate your new virtual environment:
$ pyenv activate mlmachine-env
Use pip to install mlmachine and all dependencies:
$ pip install mlmachine
anaconda
Create a new virtual environment:
$ conda create --name mlmachine-env python=3.7
Activate your new virtual environment:
$ conda activate mlmachine-env
Use pip to install mlmachine and all dependencies:
$ pip install mlmachine
Feedback
Any and all feedback is welcome. Please send me an email at petersontylerd@gmail.com
Acknowledgments
mlmachine stands on the shoulders of many great Python packages:
catboost | category_encoders | eif | hyperopt | imbalanced-learn | jupyter | lightgbm | matplotlib | numpy | pandas | prettierplot | scikit-learn | scipy | seaborn | shap | statsmodels | xgboost