Python interface for the Ckmeans.1d.dp package
The R package Ckmeans.1d.dp by Song, Zhong, and Wang provides a C++ implementation of several dynamic programming algorithms for optimal k-means clustering in one dimension. This repository provides a Python interface to that library.
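To make the idea of optimal one-dimensional k-means by dynamic programming concrete, here is a minimal pure-Python sketch of the textbook quadratic-time recurrence. It is not the package's optimized C++ code and not this library's API; the function name `kmeans_1d_dp` and its return format are assumptions made purely for illustration.

```python
def kmeans_1d_dp(values, k):
    """Illustrative sketch: optimal 1-D k-means via an O(k * n^2) dynamic program.

    Returns (labels, total within-cluster sum of squares).
    Hypothetical helper, not part of ckmeans_1d_dp.
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # In one dimension, optimal clusters are contiguous runs of the sorted data.
    order = sorted(range(n), key=lambda i: values[i])
    xs = [values[i] for i in order]

    # Prefix sums give the cost of any contiguous cluster in O(1).
    p1 = [0.0] * (n + 1)
    p2 = [0.0] * (n + 1)
    for i, v in enumerate(xs):
        p1[i + 1] = p1[i] + v
        p2[i + 1] = p2[i] + v * v

    def cost(i, j):
        """Sum of squared deviations of xs[i:j] from its mean."""
        s = p1[j] - p1[i]
        return (p2[j] - p2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting xs[:j] into m clusters.
    # start[m][j]: index where the last of those m clusters begins.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    start = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    start[m][j] = i

    # Walk the stored split points backwards to recover labels
    # in the original (unsorted) order.
    labels = [0] * n
    j = n
    for m in range(k, 0, -1):
        i = start[m][j]
        for pos in range(i, j):
            labels[order[pos]] = m - 1
        j = i
    return labels, best[k][n]


labels, wss = kmeans_1d_dp([5.0, 0.0, 5.1, 0.1, 0.2], 2)
print(labels)  # [1, 0, 1, 0, 0]
```

Clusters are numbered in increasing order of their values, so label 0 is always the lowest group. The triple loop is only practical for small inputs, which is one reason to prefer a compiled implementation for real data.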
Installation
You can install this by
pip install ckmeans-1d-dp
or by
conda install ckmeans-1d-dp
Usage
There is only one function available:
from ckmeans_1d_dp import ckmeans
The docstring describes all the options in detail.
help(ckmeans)
A major advantage of this implementation is that it can broadcast over x, saving memory and potentially a lot of time. Clustering is performed along the last axis, and each row (each slice along the leading axes) is treated independently.
>>> import numpy as np
>>> from ckmeans_1d_dp import ckmeans
>>> x = np.sqrt(np.linspace(0, 2, 80)).reshape(2, 2, 20)
>>> ckmeans(x, k=2).cluster
array([[[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]],
      dtype=int32)
Related
- llimllib/ckmeans implements the main ckmeans algorithm directly in Python. This may be more appropriate if speed is not an issue and you wish to limit dependencies.
- rocketrip/ckmeans also wraps the original C++ implementation. It is based on an older release of the package so it is missing the latest improvements. The interface it provides is not vectorized, which I expect will make it slow when doing many repeated clusterings. Also, it uses Cython, which I would prefer to avoid.
- AldenMB/NTarp includes a function to solve the same problem in the specific case of k=2, using purely vectorized NumPy.
The purpose of this repository is to make it easy to use the latest version of ckmeans directly, using vectorized NumPy code.
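For the k=2 special case mentioned in the list above, the optimal split can be found with nothing but vectorized NumPy, which also shows what clustering along the last axis while broadcasting over the leading axes looks like. The following is a rough sketch under that assumption. It is not the code from NTarp or from this package, and the name `two_means_1d` is hypothetical.

```python
import numpy as np


def two_means_1d(x):
    """Illustrative sketch: optimal 2-means along the last axis of x.

    Hypothetical helper, not part of ckmeans_1d_dp or NTarp.
    Requires at least two points along the last axis.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]

    # Sort each row, remembering how to undo the sort afterwards.
    order = np.argsort(x, axis=-1)
    xs = np.take_along_axis(x, order, axis=-1)

    # Prefix sums of values and squared values along the last axis.
    s1 = np.cumsum(xs, axis=-1)
    s2 = np.cumsum(xs**2, axis=-1)

    # Candidate splits: the lower cluster holds 1 .. n-1 points.
    counts = np.arange(1, n)
    left_s1, left_s2 = s1[..., :-1], s2[..., :-1]
    right_s1 = s1[..., -1:] - left_s1
    right_s2 = s2[..., -1:] - left_s2

    # Within-cluster sum of squares = sum(x^2) - sum(x)^2 / count.
    wss = (left_s2 - left_s1**2 / counts) + (right_s2 - right_s1**2 / (n - counts))

    # Best split per row, expressed as the size of the lower cluster.
    split = np.argmin(wss, axis=-1) + 1
    sorted_labels = (np.arange(n) >= np.expand_dims(split, -1)).astype(np.int32)

    # Scatter the labels back to the original element order.
    labels = np.empty_like(sorted_labels)
    np.put_along_axis(labels, order, sorted_labels, axis=-1)
    return labels


x = np.array([[0.0, 0.1, 0.2, 5.0, 5.1],
              [10.0, 1.0, 1.2, 9.8, 10.1]])
print(two_means_1d(x))
```

Because every step operates on the last axis, the same function accepts input of any number of leading dimensions and returns labels of the same shape, with no Python-level loop over rows.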
Original Readme
Below is the Readme for the original R package.
Overview
The package provides a powerful set of tools for fast, optimal, and reproducible univariate clustering by dynamic programming. It is practical to cluster millions of sample points into a few clusters in seconds using a single core on a typical desktop computer. It solves four types of problem, including univariate
The main method
The Ckmeans.1d.dp algorithms cluster (weighted) univariate data given by a numeric vector
Excluding the time for sorting
When to use the package
As an alternative to popular heuristic clustering methods, this package provides functionality for (weighted) univariate clustering, segmentation, and peak calling with guaranteed optimality and efficiency.
An adaptive histogram based on optimal clusters is also recommended if an equal-bin-width histogram is inadequate to characterize clusters that vary in width.
To download and install the package
install.packages("Ckmeans.1d.dp")
Citing the package
Song M, Zhong H (2020). "Efficient weighted univariate clustering maps outstanding dysregulated genomic zones in human cancers." Bioinformatics, 36(20), 5027–5036. https://doi.org/10.1093/bioinformatics/btaa613
Wang H, Song M (2011). "Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming." The R Journal, 3(2), 29–33. https://doi.org/10.32614/RJ-2011-015