Ezcluster: Evaluating the optimal number of clusters for KMeans clustering using the gap statistic

K-Means clustering provides us with interesting ways of exploring our dataset, by trying to separate the data points into K different clusters. However, determining the optimal number of clusters can be a tricky task. Borrowing from the concepts outlined in this paper, ezcluster will make it easy to find an optimal K by using the gap statistic.

Installation through the Python Package index:

pip install ezcluster

Implementation

The optimal K is the smallest for which the quantity plotted in blue bars becomes positive.

Documentation

ezcluster.KMeans Initializes the class with an input dataframe, which is preprocessed in preparation for K-Means clustering

Input parameters:
- df (DataFrame): pandas dataframe
- categorical_cols (list of strings): columns to be one hot encoded
- id_col (string): name of id column

KMeans.optimal_k Searches for the optimal K within a supplied range

Input parameters:
- min_k (int): minimum k to start search
- max_k (int): maximum k to end search
- num_iters (int): number of times to run K-Means; the more the better because law of large numbers averages out the results, but takes longer to run

KMeans.plot Generates the gap_statistic and gaps_with_error plots, and saves them by default to the ezcluster_files/ directory unless otherwise specified.

KMeans.fit Returns a KMeans object initialized with the optimal number of clusters supplied to it.

KMeans.save_model Saves the ezcluster instance as a .pkl file as ezcluster_files/ezc.pkl by default unless otherwise specified.

KMeans.load_model Loads the previous saved ezcluster instance from file.

KMeans.write_csv Saves labeled dataframe to csv file, in ezcluster_files/ unless otherwise specified.

Example

# load the iris dataset
import pandas as pd
import ezcluster

# import your packages and load the iris dataset in a pandas dataframe.
iris = pd.read_csv('https://raw.githubusercontent.com/thisisandreeeee/ezcluster/master/iris.csv')
species = iris['species']
iris.drop('species', axis = 1, inplace = True)

# initialize the kmeans class with a pandas dataframe, and indicate the categorical or id columns
ezc = ezcluster.Kmeans(iris, categorical_cols = None, id_col = None)

# find the optimal number of clusters by indicating the range of K to try
num_of_clusters = ezc.optimal_k(min_k=1, max_k=10, num_iters=100)

# plot the gap statistic plots
ezc.plot()

# return a model with optimal number of clusters
model = ezc.fit(n_clusters = num_of_clusters)

# save instance
ezc.save_instance()

# save labeled dataset to csv
ezc.write_csv()

ezcluster
Release 0.0.3

Release 0.0.3

0.0.3

0.0.2

0.0.1

Documentation

Ezcluster: Evaluating the optimal number of clusters for KMeans clustering using the gap statistic

Implementation

Documentation

Example

Stats

Development practices

Releases

Contributors

ezcluster Release 0.0.3

Release 0.0.3 Toggle Dropdown 0.0.3 0.0.2 0.0.1

Documentation

Ezcluster: Evaluating the optimal number of clusters for KMeans clustering using the gap statistic

Implementation

Documentation

Example

Stats

Development practices

Releases

Contributors

ezcluster
Release 0.0.3

Release 0.0.3

0.0.3

0.0.2

0.0.1