K-means++ in Pandas
An implementation of the k-means++ clustering algorithm using Pandas.
IMPORTANT NOTE
This package should not be used in production. The implementation of k-means++ contained herein is much slower than that of scikit-learn. Use that instead.
The only reason why I wrote any of this is to teach myself Pandas.
Prerequisites
Installation
If you have pip, then just do
pip install kmeansplusplus
Otherwise,

Clone the repository:
git clone https://github.com/jackmaney/kmeanspluspluspandas.git

Enter the newly created folder containing the repo:
cd kmeanspluspluspandas

And run the installation manually:
python setup.py install
Usage
Here are the constructor arguments:
data_frame: A Pandas data frame representing the data that you wish to cluster. Rows represent observations, and columns represent variables.
k: The number of clusters that you want.
columns=None: A list of column names upon which you wish to cluster your data. If this argument isn't provided, then all of the columns are selected. Note: Columns upon which you want to cluster must be numeric and have no numpy.nan values.
max_iterations=None: The maximum number of times that you wish to iterate k-means. If no value is provided, then the iterations continue until stability is reached (i.e., the cluster assignments don't change between one iteration and the next).
appended_column_name=None: If this value is set with a string, then a column will be appended to your data with the given name that contains the cluster assignments (which are integers from 0 to k-1). If this argument is not set, then you still have access to the clusters via the clusters attribute.
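The requirement on columns (numeric, no numpy.nan values) can be checked before clustering. Below is a minimal sketch of such a check; the helper name check_cluster_columns is hypothetical and is not part of this package, and it assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_cluster_columns(data_frame, columns=None):
    """Hypothetical helper: return the columns to cluster on, or raise ValueError."""
    # Mirror the documented default: no list means every column is selected.
    selected = list(data_frame.columns) if columns is None else list(columns)
    for name in selected:
        if not is_numeric_dtype(data_frame[name]):
            raise ValueError("column %r is not numeric" % name)
        if data_frame[name].isna().any():
            raise ValueError("column %r contains NaN values" % name)
    return selected


df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan],     # numeric, but contains NaN
    "y": [0.5, 0.1, 0.9],        # usable for clustering
    "label": ["a", "b", "c"],    # not numeric
})
print(check_cluster_columns(df, columns=["y"]))
```

Passing columns=["x"] or leaving columns unset would raise ValueError for this frame, since x contains NaN and label is not numeric.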
Once you've constructed a KMeansPlusPlus object, just call the cluster method, and everything else should happen automagically. Take a look at the examples folder.
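For readers curious about what happens during clustering, here is a rough, self-contained sketch of the general k-means++ procedure: seed the centers with distance-weighted sampling, then iterate until the assignments stop changing. It is not this package's code. The function name kmeans_pp_sketch and the seed parameter are made up for illustration, and it assumes the data has at least k distinct rows and that max_iterations, if given, is at least 1.

```python
import numpy as np
import pandas as pd


def kmeans_pp_sketch(data_frame, k, max_iterations=None, seed=0):
    """Illustrative k-means++ (hypothetical, not the package's implementation)."""
    rng = np.random.default_rng(seed)
    points = data_frame.to_numpy(dtype=float)
    n = len(points)

    # Seeding: the first center is uniform at random; each later center is drawn
    # with probability proportional to squared distance from the nearest center.
    centers = [points[rng.integers(n)]]
    while len(centers) < k:
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[rng.choice(n, p=d2 / d2.sum())])
    centers = np.array(centers)

    labels = None
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        # Assign every row to its nearest center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # stability: assignments did not change
        labels = new_labels
        # Move each center to the mean of its rows (left alone if it has none).
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
        iteration += 1
    return pd.Series(labels, index=data_frame.index, name="cluster")


df = pd.DataFrame({
    "x": [0.0, 0.1, 0.2, 10.0, 10.1, 10.2],
    "y": [0.0, 0.1, 0.0, 5.0, 5.1, 5.0],
})
clusters = kmeans_pp_sketch(df, k=2)
print(clusters.tolist())
```

The returned labels are integers from 0 to k-1, matching the convention described for the cluster assignments above. Which blob receives which label depends on the random seeding.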
TODO:
Add features that take iterations of k-means++ clusters and compare them via, e.g., concordance matrices, Jaccard indices, etc.
Given a data frame, implement the so-called Elbow Method to take a stab at an optimal value for k.
Make this into a proper Python module that can be installed via pip.
Python 3 compatibility (probably via six).
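As a pointer for the comparison item in the TODO list, one common way to compute a Jaccard index between two clusterings of the same rows is to compare the sets of row pairs that each clustering puts together. The sketch below is only one possible reading of that item. The function names are hypothetical, nothing like this exists in the package yet, and it is written for Python 3.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Set of index pairs (i, j) with i < j whose rows share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def jaccard_index(labels_a, labels_b):
    """Jaccard index of the co-clustered pair sets of two clusterings."""
    pairs_a = co_clustered_pairs(labels_a)
    pairs_b = co_clustered_pairs(labels_b)
    union = pairs_a | pairs_b
    # Assumed convention: if neither clustering groups any pair, call them identical.
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


# Two runs over the same five rows; label values themselves need not match.
run_1 = [0, 0, 1, 1, 2]
run_2 = [1, 1, 0, 0, 0]
print(jaccard_index(run_1, run_2))  # 0.5
```

Because only co-membership is compared, the result does not depend on how each run happens to number its clusters.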