tsclustering
A clustering tool for timeseries data with temporal distortions.
$ pip install tsclustering
Handling Data with Temporal Distortions
A KMeans implementation using Dynamic Time Warping (DTW) and interpolated averaging. The package efficiently handles arrays of varied length.
Efficient Dynamic Time Warping Implementation with Early Abandon
An early-abandon condition avoids unnecessary computation when searching for the best-fitting centroid.
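As a rough illustration of the idea, the sketch below computes DTW row by row and stops as soon as no warping path can beat the best distance found so far. The names `dtw_early_abandon`, `nearest_centroid`, and `best_so_far` are hypothetical and the squared-difference cost is an assumption; this is not the package's internal code.

```python
import numpy as np


def dtw_early_abandon(a, b, best_so_far=np.inf):
    """DTW cost between 1-D sequences a and b (squared-difference cost, assumed).

    Returns np.inf once every cell in a row of the accumulated-cost matrix
    exceeds best_so_far: every warping path crosses each row, and costs only
    grow along a path, so the final cost cannot beat best_so_far.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    prev = np.full(m + 1, np.inf)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = np.full(m + 1, np.inf)
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        if curr[1:].min() > best_so_far:
            return np.inf  # abandon: this candidate cannot win
        prev = curr
    return prev[m]


def nearest_centroid(series, centroids):
    """Return (index, distance) of the closest centroid to series."""
    best_idx, best_dist = -1, np.inf
    for idx, centroid in enumerate(centroids):
        # Pass the current best distance as the abandon threshold.
        d = dtw_early_abandon(series, centroid, best_dist)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist
```

The benefit grows with the number of centroids: once one good match is found, poor candidates tend to be rejected after only a few rows.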
Task-level Hardware Parallelism
Uses Python's multiprocessing module to spread the search across cores.
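One way task-level parallelism of this kind could be arranged is to treat each independent local search as a task and hand the tasks to a process pool. The sketch below assumes that split; `local_search` is a hypothetical stand-in that returns a made-up inertia, and `best_of_n` is likewise an illustrative name, not a function from the package.

```python
import multiprocessing as mp
import random


def local_search(seed):
    """Hypothetical stand-in for one independent k-means run.

    Returns (inertia, seed). A real run would also return the fitted
    centroids and labels; the inertia here is just a seeded random number.
    """
    rng = random.Random(seed)
    return rng.random(), seed


def best_of_n(n_init, cores=1):
    """Run n_init independent searches and keep the one with minimum inertia."""
    seeds = list(range(n_init))
    if cores > 1:
        # Each search is an independent task, so a simple pool.map suffices.
        with mp.Pool(processes=cores) as pool:
            results = pool.map(local_search, seeds)
    else:
        results = [local_search(s) for s in seeds]
    # Tuples compare on their first element, i.e. the inertia.
    return min(results)
```

When calling `best_of_n(n_init=5, cores=2)` from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.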
Interpolated Averaging
To avoid the complexity of other barycenter averaging techniques, we use interpolated averaging to efficiently compute the barycenters of varied-length arrays. The process is as follows:
- The mean length of the group, $\mu$, is found.
- Each timeseries is interpolated to create a vector $\vec{ts_{l}} \in \mathbb{R}^{\mu}$, for $l = 1, \dots, L$, where $L$ is the number of timeseries being averaged.
- The average vector is found as the barycenter:
$barycenter = \frac{1}{L} \sum_{l=1}^{L}\vec{ts_{l}}$
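The three steps above can be sketched with NumPy's linear interpolation. This is a minimal illustration, not the package's code: the function name `interpolated_barycenter`, the rounding of the mean length to an integer, and the choice of linear interpolation on a normalised [0, 1] axis are all assumptions.

```python
import numpy as np


def interpolated_barycenter(series):
    """Average varied-length 1-D series after resampling to the mean length."""
    # Step 1: mean length of the group, rounded to a whole number of samples
    # (the rounding rule is an assumption).
    mu = max(1, int(round(np.mean([len(s) for s in series]))))
    new_x = np.linspace(0.0, 1.0, num=mu)
    resampled = []
    for s in series:
        s = np.asarray(s, dtype=float)
        # Step 2: place the original samples on [0, 1] and interpolate
        # linearly onto the mu target positions.
        old_x = np.linspace(0.0, 1.0, num=len(s))
        resampled.append(np.interp(new_x, old_x, s))
    # Step 3: element-wise mean of the L resampled vectors.
    return np.mean(resampled, axis=0)


# Two ramps of different length resample onto the same 4-point ramp.
center = interpolated_barycenter([[0, 2, 4], [0, 1, 2, 3, 4]])
```

Because every series is resampled independently, the cost is linear in the total number of samples, which is where the saving over alignment-based barycenter methods comes from.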
Dependencies
- Numpy
- SciPy
- Numba
IMPLEMENTATION
from tsclustering import KMeans
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import pickle
# Loading example data
with open('./data/sample_data/X.pickle', 'rb') as f:
    X = pickle.load(f)
with open('./data/sample_data/y.pickle', 'rb') as f:
    y = pickle.load(f)
# Encode the class labels as integers
y = LabelEncoder().fit_transform(y)
# Plotting data
for x in X:
    plt.plot(x, color='black')
# Instantiation
kmeans = KMeans(k_clusters=3,
                max_iter=100,
                n_init=5,
                window=0.9,
                centroids=[]
                )
KMeans
k_clusters: int
User-defined number of clusters to search for.
max_iter: int
The maximum number of iterations allowed for a local search if it has not converged.
n_init: int
Number of local searches to perform. Returns the solution with the minimum inertia.
window: float[0,1]
Constrains the warping window of DTW. Window is
centroids: list[np.array[np.float64]]
Predefined centroids to begin the search from.
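To give a feel for what a warping-window constraint does, the sketch below restricts DTW to a band around the diagonal (a Sakoe-Chiba style band). How the package maps its `window` fraction to a band is not stated here, so the mapping used below (radius = `window` times the longer length, widened to the length difference) is purely an assumption, as is the function name `dtw_windowed`.

```python
import math

import numpy as np


def dtw_windowed(a, b, window=1.0):
    """DTW restricted to cells with |i - j| <= radius (squared-difference cost)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # Assumed mapping from the fractional window to a radius in samples.
    # The radius is at least |n - m| so the end cell stays reachable.
    radius = max(int(math.ceil(window * max(n, m))), abs(n - m))
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - radius)
        hi = min(m, i + radius)
        for j in range(lo, hi + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]


a = [0, 0, 1, 0, 0]
b = [0, 1, 0, 0, 0]
tight = dtw_windowed(a, b, window=0.0)  # diagonal only, no warping allowed
loose = dtw_windowed(a, b, window=1.0)  # unconstrained warping
```

Under this mapping a smaller window means fewer cells are evaluated and less warping is tolerated, so the distance can only stay the same or grow as the window shrinks.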
# Fitting to data
kmeans.fit(X, cores=1)
KMeans().fit
X: np.array[np.array]
Array of time-series
cores: int
Number of cores to use in parallel search. Defaults to 1 core.
import matplotlib.pyplot as plt
import numpy as np
# Access the labels using the kmeans.clusters attribute
colors = ['red', 'green', 'blue']
for k in range(kmeans.k_clusters):
    # Select the series assigned to cluster k
    cluster = np.array(X, dtype=object)[np.where(np.array(kmeans.clusters) == k)[0]]
    for arr in cluster:
        plt.plot(arr, color=colors[k])
# Computing Inertia
kmeans.get_inertia()
93.27448236700336
from sklearn.metrics import rand_score, adjusted_rand_score
print('Rand Index:', f'{rand_score(kmeans.clusters, y):.2f}')
print('Adjusted RI:', f'{adjusted_rand_score(kmeans.clusters, y):.2f}')
Rand Index: 1.00
Adjusted RI: 1.00
# Soft clustering returns the distance from each instance to each centroid
kmeans.soft_cluster()
array([[3.66707504, 3.43053223, 3.5902464 ],
[3.26707093, 3.60751793, 3.32326565],
[3.49872418, 3.53796656, 3.60681567],
[3.4164345 , 3.3215374 , 3.31998848],
[3.33290798, 3.69574074, 3.53531107],
[3.5292556 , 3.27362416, 3.69472868],
[3.72468091, 3.65222014, 3.70735547],
[3.73722331, 3.62481453, 3.62434249],
[3.54864015, 3.66082986, 3.31089306],
[3.75099114, 4.32067397, 3.81107028]])
# Match an incoming time series array to nearest centroid
print('Clustered Labels:', [kmeans.clusters[0], kmeans.clusters[80]])
print('Predicted Labels:', kmeans.predict([X[0], X[80]]))
Clustered Labels: [2, 0]
Predicted Labels: [2, 0]
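Conceptually, soft clustering and prediction are two views of the same distance table: one row per series, one column per centroid, with the predicted label being the column of the row minimum. The sketch below illustrates that relationship with a plain DTW; `dtw`, `soft_distances`, and `predict` are hypothetical helpers written for this example, not the package's methods.

```python
import numpy as np


def dtw(a, b):
    """Unconstrained DTW with a squared-difference cost (assumed)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]


def soft_distances(X, centroids):
    """Distance from every series (rows) to every centroid (columns)."""
    return np.array([[dtw(x, c) for c in centroids] for x in X])


def predict(X, centroids):
    """Label each series with the index of its nearest centroid."""
    return soft_distances(X, centroids).argmin(axis=1).tolist()


centroids = [[0, 0, 0], [5, 5, 5, 5]]
series = [[0, 0, 0, 0, 0], [5, 5]]
table = soft_distances(series, centroids)
labels = predict(series, centroids)
```

Keeping the full table around is useful when the margin between the closest and second-closest centroid matters, for example to flag ambiguous assignments.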
Future Development
1. Multivariate time series clustering