We propose and validate a mixture of state space models to perform unsupervised clustering of short trajectories[^1]. Within the state space framework, we let expensive-to-gather biomarkers correspond to hidden states and readily obtainable cognitive metrics correspond to measurements. Upon training with expectation maximization, we find that our clusters stratify persons according to clinical outcome. Furthermore, we can effectively predict on held-out trajectories using cognitive metrics alone. Our approach accommodates missing data through model marginalization and generalizes across research and clinical cohorts.
We consider a training dataset consisting of short trajectories. Where trajectories differ in length, we append `np.nan`'s to shorter trajectories.
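Assuming each trajectory is stored as a `(T_i, d)` NumPy array (the function and variable names below are illustrative, not from the codebase), the padding step might look like:

```python
import numpy as np

def pad_trajectories(trajs):
    """Stack variable-length trajectories into one (n, T_max, d) array,
    appending np.nan rows to the shorter ones."""
    t_max = max(t.shape[0] for t in trajs)
    d = trajs[0].shape[1]
    out = np.full((len(trajs), t_max, d), np.nan)
    for i, t in enumerate(trajs):
        out[i, : t.shape[0], :] = t
    return out

# two toy trajectories of lengths 3 and 2 with d = 2 features
x = pad_trajectories([np.ones((3, 2)), np.zeros((2, 2))])
print(x.shape)            # (2, 3, 2)
print(np.isnan(x[1, 2]))  # [ True  True]: the padded time step
```

The `np.nan` padding then flows through to training, where the model marginalizes over the missing entries.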
We adopt a mixture of state space models for the data: each individual trajectory $(z^i_{1:T}, x^i_{1:T})$ belongs to cluster $c$ with probability $\pi_c$, and within cluster $c$ the hidden states and measurements evolve according to a linear Gaussian state space model,

$$
z^i_{t+1} \mid z^i_t \sim \mathcal{N}(A_c z^i_t, \Gamma_c), \qquad
x^i_t \mid z^i_t \sim \mathcal{N}(H_c z^i_t, \Lambda_c).
$$

In our main framework, inspired by the work of Chiappa and Barber[^2], we additionally assume that the cluster-specific state initialisation is Gaussian, i.e. $z^i_1 \sim \mathcal{N}(m_c, S_c)$.
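Under these assumptions, a single trajectory could be simulated as in the following sketch; the parameter values and helper name are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_trajectory(pi, params, T=5):
    """Draw one trajectory (z_{1:T}, x_{1:T}) from the mixture:
    choose a cluster c with probability pi_c, then run that cluster's
    linear Gaussian state space model forward for T steps."""
    c = rng.choice(len(pi), p=pi)
    m, S, A, Gamma, H, Lam = params[c]
    z = rng.multivariate_normal(m, S)              # z_1 ~ N(m_c, S_c)
    zs, xs = [], []
    for _ in range(T):
        x = rng.multivariate_normal(H @ z, Lam)    # x_t | z_t ~ N(H_c z_t, Lam_c)
        zs.append(z)
        xs.append(x)
        z = rng.multivariate_normal(A @ z, Gamma)  # z_{t+1} | z_t ~ N(A_c z_t, Gamma_c)
    return c, np.array(zs), np.array(xs)

# a single illustrative cluster: 2-d hidden states, 1-d measurements
params = [(np.zeros(2), np.eye(2),                     # m_c, S_c
           0.9 * np.eye(2), 0.1 * np.eye(2),           # A_c, Gamma_c
           np.array([[1.0, 0.0]]), 0.1 * np.eye(1))]   # H_c, Lam_c
c, z, x = sample_trajectory([1.0], params, T=5)
print(z.shape, x.shape)  # (5, 2) (5, 1)
```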
In particular, we assume that the variables we are modeling are continuous and changing over time. When we train a model like the above, we take a dataset and a fixed number of clusters and fit the model with expectation maximization[^3], alternating:
- [E] Expectation step: given the current model, we assign each data instance $(z^i_{1:T}, x^i_{1:T})$ to the cluster to which it is most likely to belong under the current model
- [M] Maximization step: given the current cluster assignments, we compute the sample-level cluster assignment probabilities (the $\pi_c$) and the optimal cluster-specific parameters
Optimization completes after a fixed (large) number of steps or when no data instance changes its cluster assignment at a given iteration.
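The E/M loop above can be sketched as follows. For brevity, the cluster-specific model here is a plain Gaussian over the data rather than a full state space model, and the initialisation is a simple farthest-point heuristic; only the hard-assignment loop structure mirrors the procedure described above:

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Multivariate normal log-density evaluated row-wise on x."""
    d = x.shape[1]
    diff = x - mean
    sol = np.linalg.solve(cov, diff.T).T
    return -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1]
                   + (diff * sol).sum(axis=1))

def hard_em(x, n_clusters, n_iter=100, min_size=3):
    """Hard-assignment EM: alternate per-cluster fitting (M) with
    reassignment of each instance to its most likely cluster (E)."""
    n, d = x.shape
    # deterministic farthest-point initialisation of cluster seeds
    seeds = [0]
    while len(seeds) < n_clusters:
        dist = np.min(((x[:, None, :] - x[seeds]) ** 2).sum(-1), axis=1)
        seeds.append(int(dist.argmax()))
    labels = ((x[:, None, :] - x[seeds]) ** 2).sum(-1).argmin(axis=1)
    for _ in range(n_iter):
        # [M] given assignments, refit pi_c and cluster-specific parameters
        pis, means, covs = [], [], []
        for c in range(n_clusters):
            xc = x[labels == c]
            if len(xc) < min_size:  # terminate if any cluster gets too small
                raise RuntimeError(f"cluster {c} has fewer than {min_size} members")
            pis.append(len(xc) / n)
            means.append(xc.mean(axis=0))
            covs.append(np.cov(xc.T) + 1e-6 * np.eye(d))
        # [E] assign each instance to its most likely cluster
        ll = np.column_stack([np.log(pis[c]) + gauss_logpdf(x, means[c], covs[c])
                              for c in range(n_clusters)])
        new = ll.argmax(axis=1)
        if np.array_equal(new, labels):  # no assignment changed: converged
            break
        labels = new
    return labels

# two well-separated toy clusters of 20 points each
x = np.vstack([np.random.default_rng(1).normal(0, 1, (20, 2)),
               np.random.default_rng(2).normal(8, 1, (20, 2))])
labels = hard_em(x, n_clusters=2)
```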
A typical workflow is described at: https://github.com/burkh4rt/Unsupervised-Trajectory-Clustering-Starter
Some efforts have been made to automatically handle edge cases. For a given training run, if any cluster becomes too small (fewer than 3 members), training terminates. To learn a model, we make the assumptions about our training data described above. While our approach appears robust to some types of model misspecification, we have encountered training issues with the following:
- Extreme outliers. An extreme outlier tends to form its own singleton cluster, which is problematic. In many cases this is due to a typo or failed data cleaning (i.e., an upstream problem). Generating a histogram of each feature is one way to recognise this problem.
- Discrete / static features. Including discrete data violates our Gaussian assumptions. If we learn a cluster in which every trajectory takes the same value for one of the states or observations at a given time step, we are prone to estimating a singular covariance matrix for that cluster, which yields numerical instabilities. Adding a small amount of noise to discrete features may mitigate this instability to some extent.
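A minimal pre-flight check along these lines (thresholds and names are illustrative, not part of the method) can flag suspect entries and jitter constant columns before training:

```python
import numpy as np

def preflight(x, z_thresh=6.0, jitter=1e-3, seed=0):
    """Screen an (n, T, d) training array before clustering: flag extreme
    outliers and jitter (near-)constant feature columns."""
    rng = np.random.default_rng(seed)
    flat = x.reshape(-1, x.shape[-1])
    med = np.nanmedian(flat, axis=0)
    mad = np.nanmedian(np.abs(flat - med), axis=0)
    # flag entries far from the feature median in robust (MAD) units;
    # such values are often typos or failed upstream data cleaning
    outliers = np.abs(flat - med) > z_thresh * np.where(mad > 0, mad, 1.0)
    # add a small amount of noise to (near-)constant columns so that
    # estimated within-cluster covariances do not become singular
    constant = np.nanstd(flat, axis=0) < 1e-8
    flat = flat + constant * rng.normal(0.0, jitter, size=flat.shape)
    return flat.reshape(x.shape), outliers.reshape(x.shape)

x = np.ones((4, 3, 2))   # feature 1 is constant everywhere
x[0, 0, 0] = 1e6         # plant one extreme outlier in feature 0
x_clean, flagged = preflight(x)
print(int(flagged.sum()))  # 1
```

Flagged entries are best investigated upstream rather than silently dropped.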
Another assumption that is easy to violate is our stationarity assumption for the measurement model, i.e. that the relationship between hidden states and measurements remains fixed over time.
[^1]: M. Burkhart, L. Lee, D. Vaghari, A. Toh, E. Chong, C. Chen, P. Tiňo, and Z. Kourtzi, "Unsupervised multimodal modeling of cognitive and brain health trajectories for early dementia prediction," Sci. Rep. 14 (2024).

[^2]: S. Chiappa and D. Barber, "Dirichlet Mixtures of Bayesian Linear Gaussian State-Space Models: a Variational Approach," Tech. Rep. 161, Max Planck Institute for Biological Cybernetics, 2007.

[^3]: A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Roy. Stat. Soc. Ser. B (Stat. Methodol.) 39 (1977).