# DeBaCl: DEnsity-BAsed CLustering

DeBaCl is a Python library for **density-based clustering** with **level set trees**.

Level set trees are a statistically principled way to represent the topology of a probability density function. This representation is particularly useful for several core tasks in statistics:

- clustering, especially for data with multi-scale clustering behavior
- describing data topology
- exploratory data analysis
- data visualization
- anomaly detection

DeBaCl is a Python implementation of the Level Set Tree method, with an emphasis on computational speed, algorithmic simplicity, and extensibility.
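The idea behind a level set tree can be illustrated without DeBaCl at all. As a threshold rises through a density, connected components of the upper level set appear, split, and vanish; the tree records that history. Below is a toy sketch with a hand-built 1-D density (illustrative only — DeBaCl estimates the density from data and works in any dimension, and the `upper_level_components` helper is made up for this example):

```python
import numpy as np

# Toy 1-D "density" on a grid: two modes of different heights.
grid = np.linspace(0, 1, 200)
density = (np.exp(-((grid - 0.3) / 0.15) ** 2)
           + 0.6 * np.exp(-((grid - 0.7) / 0.15) ** 2))

def upper_level_components(density, level):
    """Contiguous runs of grid indices where the density exceeds `level`."""
    components, run = [], []
    for i, keep in enumerate(density > level):
        if keep:
            run.append(i)
        elif run:
            components.append(run)
            run = []
    if run:
        components.append(run)
    return components

# Low threshold: one component containing both modes.
# Above the saddle between them: the set splits into two clusters.
# Above the smaller mode's height: only one cluster survives.
print(len(upper_level_components(density, 0.1)))   # 1
print(len(upper_level_components(density, 0.4)))   # 2
print(len(upper_level_components(density, 0.8)))   # 1
```

The level set tree is exactly this bookkeeping done over all thresholds at once: each vertical branch is one component's lifetime.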

## License

DeBaCl is available under the 3-clause BSD license.

## Installation

DeBaCl is currently compatible with Python 2.7 only. Other versions may work, but *caveat emptor*: at this time, Python 2.7 is the only officially tested version. The package can be installed with pip, the Python package installer. From a terminal:

`$ pip install debacl`

It can also be installed by cloning this GitHub repo and adding the cloned directory to the Python path. On Linux, this looks something like:

```
$ git clone https://github.com/CoAxLab/DeBaCl/
$ export PYTHONPATH="$PYTHONPATH:$(pwd)/DeBaCl"
```

## Dependencies

All of the dependencies are Python packages that can be installed with either conda or pip. DeBaCl 1.0 no longer depends on igraph, which required tricky manual installation.

**Languages:**

- Python 2.7
- (coming soon: Python 3.4)

**Required packages:**

- numpy
- networkx
- prettytable

**Strongly recommended packages:**

- matplotlib
- scipy

**Optional packages:**

- scikit-learn

## Quickstart

#### Construct the level set tree

```
import debacl as dcl
from sklearn.datasets import make_moons
X = make_moons(n_samples=100, noise=0.1, random_state=19)[0]
tree = dcl.construct_tree(X, k=10, prune_threshold=10)
print tree
```

```
+----+-------------+-----------+------------+----------+------+--------+----------+
| id | start_level | end_level | start_mass | end_mass | size | parent | children |
+----+-------------+-----------+------------+----------+------+--------+----------+
| 0  | 0.000       | 0.196     | 0.000      | 0.220    | 100  | None   | [1, 2]   |
| 1  | 0.196       | 0.396     | 0.220      | 0.940    | 37   | 0      | []       |
| 2  | 0.196       | 0.488     | 0.220      | 1.000    | 41   | 0      | []       |
+----+-------------+-----------+------------+----------+------+--------+----------+
```
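Trees of this kind are typically built on a similarity graph: estimate each point's density (for instance from its k-nearest-neighbor radius), then remove vertices from low to high density and record when connected components split. The sketch below shows that idea on synthetic two-bump data; it is an illustration only, not DeBaCl's actual implementation, and the choices of `k = 5`, the `1 / radius` density surrogate, and the data itself are all made up for the example:

```python
import numpy as np
import networkx as nx

# Two well-separated Gaussian bumps in the plane.
rng = np.random.RandomState(19)
X = np.vstack([rng.normal(0.0, 0.1, (30, 2)),   # bump around (0, 0)
               rng.normal(1.0, 0.1, (30, 2))])  # bump around (1, 1)

k = 5
dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
knn = np.argsort(dists, axis=1)[:, 1:k + 1]   # indices of k nearest neighbors
radius = np.sort(dists, axis=1)[:, k]         # distance to the k-th neighbor
density = 1.0 / radius                        # crude k-NN density surrogate

# Build the k-nearest-neighbor graph.
G = nx.Graph()
G.add_nodes_from(range(len(X)))
G.add_edges_from((i, int(j)) for i in range(len(X)) for j in knn[i])

# Delete the lowest-density half of the vertices: the surviving high-density
# cores of the two bumps fall into separate connected components.
low_density = np.argsort(density)[:len(X) // 2]
G.remove_nodes_from(low_density.tolist())
print(nx.number_connected_components(G))
```

Sweeping the deletion threshold over all density values, rather than picking one cutoff, yields the full tree shown in the table above.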

#### Plot the level set tree

Clusters are represented by the vertical line segments in the dendrogram. In this example, the vertical axis is on the *density* scale: the lower endpoint of a cluster's branch is at its *start_level*, the upper endpoint is at its *end_level* (see the table above), and the length of the branch is the cluster's *persistence*.

```
fig = tree.plot(form='density')[0]
fig.show()
```

#### Query the level set tree for cluster labels

```
import matplotlib.pyplot as plt
labels = tree.get_clusters(method='leaf') # each leaf node is a cluster
clusters = X[labels[:, 0], :]
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c='black', s=40, alpha=0.4)
ax.scatter(clusters[:, 0], clusters[:, 1], c=labels[:, 1], s=80, alpha=0.9,
           cmap=plt.cm.winter)
ax.set_xlabel('x0')
ax.set_ylabel('x1', rotation=0)
fig.show()
```
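As the snippet above suggests, the returned `labels` array holds one row per clustered point: column 0 is the point's row index in `X`, column 1 its cluster label, and low-density background points are simply absent. If a full-length label vector is more convenient (e.g., for scikit-learn metrics), it can be built with `-1` marking background points. A small sketch with a hand-made stand-in array (not real DeBaCl output):

```python
import numpy as np

# Stand-in for the get_clusters output format: rows of
# (row index into X, cluster label); background points omitted.
labels = np.array([[0, 0], [1, 0], [3, 1], [4, 1]])
n_samples = 6

# Full-length label vector, -1 for unclustered background points.
full_labels = np.full(n_samples, -1, dtype=int)
full_labels[labels[:, 0]] = labels[:, 1]
print(full_labels)  # [ 0  0 -1  1  1 -1]
```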

## Documentation

- API documentation
- Example Jupyter notebooks (in progress)

## Running unit tests

From the top level of the repo:

`$ nosetests -s -v debacl/test`

## References

Chaudhuri, K., & Dasgupta, S. (2010). Rates of Convergence for the Cluster Tree. In Advances in Neural Information Processing Systems 23 (pp. 343–351). Vancouver, BC.

Kent, B. P., Rinaldo, A., Yeh, F.-C., & Verstynen, T. (2014). Mapping Topographic Structure in White Matter Pathways with Level Set Trees. PLoS ONE.

Kent, B. P., Rinaldo, A., & Verstynen, T. (2013). DeBaCl: A Python Package for Interactive DEnsity-BAsed CLustering. arXiv preprint:1307.8136.

Kent, B. P. (2013). Level Set Trees for Applied Statistics.