SageMaker Scikit-Learn Extension
SageMaker Scikit-Learn Extension is a Python module for machine learning built on top of scikit-learn.
This project contains standalone scikit-learn estimators and additional tools to support SageMaker Autopilot. Many of the additional estimators are based on existing scikit-learn estimators.
User Installation
To install from pip:
# install from pip
pip install sagemaker-scikit-learn-extension
To use the I/O functionality in the sagemaker_sklearn_extension.externals module, you also need to install version 0.7 of the mlio package via conda. The mlio package is currently available only through conda.
To install mlio:
# install mlio
conda install -c mlio -c conda-forge mlio-py==0.7
For more information about mlio, see https://github.com/awslabs/ml-io.
You can also install from source by cloning this repository and running a pip install
command in the root directory of the repository:
# install from source
git clone https://github.com/aws/sagemaker-scikit-learn-extension.git
cd sagemaker-scikit-learn-extension
pip install -e .
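After installing, it can be handy to confirm which pieces are actually available in the current environment. The following is a minimal sketch that uses only the standard library. The helper names `is_installed` and `installation_report` are hypothetical, and the import name `mlio` for the mlio-py conda package is an assumption.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def installation_report(module_names):
    """Map each top-level module name to whether it can be located."""
    return {name: is_installed(name) for name in module_names}


# "sagemaker_sklearn_extension" is the package name used in this README;
# "mlio" is assumed to be the import name of the mlio-py conda package.
report = installation_report(["sagemaker_sklearn_extension", "mlio"])

print("Python version:", "{}.{}".format(*sys.version_info[:2]))
for name, found in report.items():
    print(name, "found" if found else "not found")
```

If `mlio` is reported as not found, the rest of the package may still be usable, but the I/O helpers in sagemaker_sklearn_extension.externals will not be.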
Supported Operating Systems
SageMaker scikit-learn extension supports Unix/Linux and Mac.
Supported Python Versions
SageMaker scikit-learn extension is tested on:
- Python 3.7
License
This library is licensed under the Apache 2.0 License.
Development
We welcome contributions from developers of all experience levels.
The SageMaker scikit-learn extension is meant to be a repository for scikit-learn estimators that don't meet scikit-learn's stringent inclusion criteria.
Setup
We recommend using conda for development and testing.
To download conda, go to the conda installation guide.
Running Tests
SageMaker scikit-learn extension contains an extensive suite of unit tests.
You can install the libraries needed to run the tests by running pip install --upgrade .[test]
or, for Zsh users: pip install --upgrade .\[test\]
tox uses pytest to run the unit tests in a Python 3.7 interpreter, and also runs flake8 and pylint for style checks.
conda is needed because of the dependency on mlio 0.7.
To run the tests with tox, run:
tox
Running on SageMaker
To use sagemaker-scikit-learn-extension on SageMaker, you can build the sagemaker-scikit-learn-extension-container.
Overview of Submodules
- sagemaker_sklearn_extension.decomposition
  - RobustPCA: dimension reduction for dense and sparse inputs
- sagemaker_sklearn_extension.externals
  - AutoMLTransformer: utility class encapsulating feature and target transformation functionality used in SageMaker Autopilot
  - Header: utility class to manage the header and target columns in tabular data
  - read_csv_data: reads comma separated data and returns a numpy array (uses mlio)
- sagemaker_sklearn_extension.feature_extraction.date_time
  - DateTimeVectorizer: convert datetime objects or strings into numeric features
- sagemaker_sklearn_extension.feature_extraction.sequences
  - TSFlattener: convert strings of sequences into numeric features
  - TSFreshFeatureExtractor: compute row-wise time series features from a numpy array (uses tsfresh)
- sagemaker_sklearn_extension.feature_extraction.text
  - MultiColumnTfidfVectorizer: convert collections of raw documents to a matrix of TF-IDF features
- sagemaker_sklearn_extension.impute
  - RobustImputer: imputer for missing values with customizable mask_function and multi-column constant imputation
  - RobustMissingIndicator: binary indicator for missing values with customizable mask_function
- sagemaker_sklearn_extension.preprocessing
  - BaseExtremeValuesTransformer: customizable transformer for columns that contain "extreme" values (columns that are heavy tailed)
  - LogExtremeValuesTransformer: stateful log transformer for columns that contain "extreme" values (columns that are heavy tailed)
  - NALabelEncoder: encoder for transforming labels to NA values
  - QuadraticFeatures: generate and add quadratic features to feature matrix
  - QuantileExtremeValuesTransformer: stateful quantiles transformer for columns that contain "extreme" values (columns that are heavy tailed)
  - ThresholdOneHotEncoder: encode categorical integer features as a one-hot numeric array, with optional restrictions on feature encoding
  - RemoveConstantColumnsTransformer: removes constant columns
  - RobustLabelEncoder: encode labels for seen and unseen labels
  - RobustStandardScaler: standardization for dense and sparse inputs
  - WOEEncoder: weight of evidence supervised encoder
  - SimilarityEncoder: encode categorical values based on their descriptive string
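To make one of the entries above concrete, here is a minimal pure-Python sketch of the general weight of evidence technique for a binary target. It is not the library's WOEEncoder implementation: the function names `fit_woe` and `transform_woe`, the additive smoothing constant `alpha`, and the choice to map unseen categories to 0.0 are all assumptions made for illustration.

```python
import math
from collections import Counter


def fit_woe(categories, targets, alpha=0.5):
    """Return {category: weight of evidence} for a binary (0/1) target.

    WOE(c) = ln( P(category = c | y = 1) / P(category = c | y = 0) ).
    `alpha` is an additive smoothing constant (an assumption of this sketch)
    that keeps the logarithm finite when a category has no positive or no
    negative examples.
    """
    pos = Counter(c for c, y in zip(categories, targets) if y == 1)
    neg = Counter(c for c, y in zip(categories, targets) if y == 0)
    levels = set(pos) | set(neg)
    total_pos = sum(pos.values()) + alpha * len(levels)
    total_neg = sum(neg.values()) + alpha * len(levels)
    return {
        c: math.log(((pos[c] + alpha) / total_pos) / ((neg[c] + alpha) / total_neg))
        for c in levels
    }


def transform_woe(categories, woe, default=0.0):
    """Replace each category with its fitted WOE; unseen categories get `default`."""
    return [woe.get(c, default) for c in categories]


# Toy data: "a" appears only with positives, "b" only with negatives.
colors = ["a", "a", "b", "b"]
labels = [1, 1, 0, 0]
woe = fit_woe(colors, labels)
encoded = transform_woe(["a", "b", "c"], woe)
print(encoded)
```

Categories associated mostly with the positive class receive positive values, those associated mostly with the negative class receive negative values, and a category that is equally common in both classes maps to zero.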