mixdir

Cluster High Dimensional Categorical Datasets


Keywords
categorical-data, clustering, questionnaires, r-package, variational-inference
License
GPL-3.0

Documentation

mixdir

The goal of mixdir is to cluster high-dimensional categorical datasets.

It can

  • handle missing data
  • infer a reasonable number of latent classes (try mixdir(select_latent=TRUE))
  • cluster datasets with more than 70,000 observations and 60 features
  • propagate uncertainty and produce a soft clustering

Installation

devtools::install_github("const-ae/mixdir")

Example

Clustering the mushroom data set.

# Loading the library and the data
library(mixdir)
set.seed(1)

data("mushroom")
# High-dimensional dataset: 8124 mushrooms and 23 different features
mushroom[1:10, 1:5]
#>    bruises cap-color cap-shape cap-surface    edible
#> 1  bruises     brown    convex      smooth poisonous
#> 2  bruises    yellow    convex      smooth    edible
#> 3  bruises     white      bell      smooth    edible
#> 4  bruises     white    convex       scaly poisonous
#> 5       no      gray    convex      smooth    edible
#> 6  bruises    yellow    convex       scaly    edible
#> 7  bruises     white      bell      smooth    edible
#> 8  bruises     white      bell       scaly    edible
#> 9  bruises     white    convex       scaly poisonous
#> 10 bruises    yellow      bell      smooth    edible

Calling the clustering function mixdir on a subset of the data:

# Clustering into 3 latent classes
result <- mixdir(mushroom[1:1000, 1:5], n_latent=3)

Analyzing the result

# Latent class of the first 10 mushrooms
head(result$pred_class, n=10)
#>  [1] 3 1 1 3 2 1 1 1 3 1

# Soft clustering (class probabilities) for the first 10 mushrooms
head(result$class_prob, n=10)
#>               [,1]         [,2]         [,3]
#>  [1,] 2.781141e-14 2.915757e-09 1.000000e+00
#>  [2,] 1.000000e+00 1.286473e-09 4.022864e-08
#>  [3,] 1.000000e+00 8.483274e-10 3.020595e-08
#>  [4,] 1.734041e-07 1.173963e-11 9.999998e-01
#>  [5,] 4.078317e-14 1.000000e+00 4.873989e-14
#>  [6,] 9.999999e-01 1.450405e-11 1.053419e-07
#>  [7,] 1.000000e+00 8.483274e-10 3.020595e-08
#>  [8,] 9.999999e-01 9.564273e-12 7.909668e-08
#>  [9,] 1.734041e-07 1.173963e-11 9.999998e-01
#> [10,] 1.000000e+00 4.799899e-16 8.694462e-15
pheatmap::pheatmap(result$class_prob, cluster_cols=FALSE,
                  labels_col = paste("Class", 1:3))
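
The soft clustering and the hard class labels shown above are related in a simple way: each mushroom is assigned to the class with the highest probability. A minimal base R sketch of that step, using the first five rows of the printed class_prob output copied by hand (it does not call the package, and whether mixdir computes pred_class exactly this way internally is an assumption):

```r
# First five rows of result$class_prob as printed above (copied by hand)
class_prob <- rbind(
  c(2.781141e-14, 2.915757e-09, 1.000000e+00),
  c(1.000000e+00, 1.286473e-09, 4.022864e-08),
  c(1.000000e+00, 8.483274e-10, 3.020595e-08),
  c(1.734041e-07, 1.173963e-11, 9.999998e-01),
  c(4.078317e-14, 1.000000e+00, 4.873989e-14)
)

# Hard assignment: index of the most probable class in each row
hard_class <- apply(class_prob, 1, which.max)

# How confident the assignment is: the largest probability in each row
confidence <- apply(class_prob, 1, max)

hard_class
#> [1] 3 1 1 3 2
```

The result agrees with the first five entries of result$pred_class. Rows whose largest probability is well below 1 would be the ambiguous observations worth inspecting by hand.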

# Structure of latent class 1
# (bruises, cap color either yellow or white, edible etc.)
purrr::map(result$category_prob, 1)
#> $bruises
#>      bruises           no 
#> 0.9996447887 0.0003552113 
#> 
#> $`cap-color`
#>        brown         gray          red        white       yellow 
#> 0.0003548468 0.0003592167 0.0003548906 0.4077977750 0.5911332709 
#> 
#> $`cap-shape`
#>         bell       convex         flat       sunken 
#> 0.3925723193 0.4765681327 0.1305046002 0.0003549479 
#> 
#> $`cap-surface`
#>    fibrous      scaly     smooth 
#> 0.05700435 0.48705770 0.45593795 
#> 
#> $edible
#>       edible    poisonous 
#> 0.9996447805 0.0003552195

# The most predictive features for each class
find_representative_answers(result$lambda, result$category_prob, top_n=3)
#>       column    answer class probability
#> 19 cap-color    yellow     1   0.9988010
#> 22 cap-shape      bell     1   0.9981940
#> 1    bruises   bruises     1   0.7089829
#> 48    edible poisonous     3   0.9961025
#> 15 cap-color       red     3   0.7498975
#> 9  cap-color     brown     3   0.6468097
#> 5    bruises        no     2   0.9980757
#> 11 cap-color      gray     2   0.9957144
#> 32 cap-shape    sunken     2   0.9873503
# For example: if all I know about a mushroom is that it has a
# yellow cap, then I am more than 99% certain that it belongs to class 1
predict_class(c(`cap-color`="yellow"), result$lambda, result$category_prob)
#> [1] 0.9988009540 0.0005996516 0.0005993944
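
A prediction from a single known answer can be understood as an application of Bayes' rule: the class weights act as the prior, and the probability of the observed answer in each class acts as the likelihood. The sketch below illustrates this with hypothetical numbers. The names lambda_toy and p_yellow_given_class and their values are made up for illustration, and the exact computation inside predict_class may differ.

```r
# Hypothetical values for illustration only (not taken from the fitted model)
lambda_toy <- c(0.5, 0.3, 0.2)               # prior weight of each latent class
p_yellow_given_class <- c(0.59, 0.001, 0.001) # P(cap-color = yellow | class)

# Bayes' rule: posterior is proportional to prior times likelihood
unnormalised <- lambda_toy * p_yellow_given_class
posterior <- unnormalised / sum(unnormalised)

# Posterior probability of each class given a yellow cap
round(posterior, 4)
```

Because yellow caps are rare in classes 2 and 3 in this toy setup, almost all of the posterior mass ends up in class 1, mirroring the pattern in the output above.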

# Convergence
plot(result$convergence, main=paste0("ELBO: ", formatC(result$ELBO, digits = 3)))

Underlying Model

The package implements a variational inference algorithm to fit a Bayesian latent class model (LCM).
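
To make the idea of a latent class model concrete, the following base R sketch simulates data from one: every observation first draws a hidden class, and its answer is then drawn from that class's own answer probabilities. All parameter values and names (lambda_sim, cap_color_prob) are hypothetical, only one feature is simulated, and the priors that make the model Bayesian are left out, so this is an illustration of the model family rather than of the package's implementation.

```r
set.seed(42)
n <- 200

# Hypothetical mixing weights of two latent classes
lambda_sim <- c(0.6, 0.4)

# Hypothetical answer probabilities for one feature; one row per class
cap_color_prob <- rbind(c(yellow = 0.7, white = 0.3),
                        c(yellow = 0.1, white = 0.9))

# Step 1: draw the latent class of every observation
z <- sample(seq_along(lambda_sim), n, replace = TRUE, prob = lambda_sim)

# Step 2: draw the observed answer from the distribution of its class
x <- vapply(z, function(k) {
  sample(colnames(cap_color_prob), 1, prob = cap_color_prob[k, ])
}, character(1))

# Cross-tabulate hidden classes against observed answers
table(class = z, answer = x)
```

Clustering with a latent class model is the reverse problem: only x is observed, and the class weights, the per-class answer probabilities, and the class memberships have to be inferred.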


Disclaimer

This package is still under development, and its interface may change substantially.