jackstraw

Statistical Inference for Unsupervised Learning


Keywords
clustering, k-means, machine-learning, pca, r, statistics, unsupervised
License
GPL-2.0

Documentation

jackstraw: Statistical Inference for Unsupervised Learning

This R package performs association tests between the observed data and their systematic patterns of variation. Systematic variation can be modeled by latent variables, which likely arise from biological processes, experimental conditions, and environmental factors. We are often interested in estimating these patterns using principal component analysis (PCA), factor analysis (FA), K-means clustering, partitioning around medoids (PAM), and related methods. The jackstraw methods learn the over-fitting characteristics inherent in unsupervised learning, where the same observed data are used first to estimate the systematic patterns and then to be tested against them.

Using a variety of unsupervised learning techniques, the jackstraw provides a resampling strategy and testing scheme to estimate the statistical significance of association between the observed data and their systematic patterns of variation. For example, the cell cycle in microarray data may be estimated by principal components (PCs); then, we can use the jackstraw for PCA to identify genes that are significantly associated with these PCs. Similarly, when cell identities in single cell RNA-seq data are identified by K-means clustering, the jackstraw for clustering can evaluate the reliability of the computationally determined cell identities.
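
The resampling strategy can be illustrated with a minimal base-R sketch of the jackstraw for PCA, following the procedure described in Chung and Storey (2015). This is not the package's implementation: the simulated data, the helper names top_pcs and f_stats, and the values of r, s, and B are all assumptions chosen for illustration. In each iteration, a small number s of rows is permuted to create synthetic null variables, the PCs are re-estimated from the modified data, and the association statistics of those null rows form the empirical null distribution.

```r
# Base-R sketch of the jackstraw idea for PCA (illustrative, not the package code)
set.seed(1)

m <- 200; n <- 20                      # m variables (rows), n observations (columns)
latent <- rnorm(n)                     # one simulated latent variable
Y <- matrix(rnorm(m * n), nrow = m, ncol = n)
Y[1:20, ] <- Y[1:20, ] + outer(rep(2, 20), latent)   # only rows 1-20 carry signal

# Top r PCs (right singular vectors) of the row-centered data
top_pcs <- function(dat, r) {
  centered <- dat - rowMeans(dat)
  svd(centered)$v[, seq_len(r), drop = FALSE]
}

# F-statistic per row: intercept-only model versus intercept + PCs
f_stats <- function(dat, pcs) {
  n <- ncol(dat); r <- ncol(pcs)
  X <- cbind(1, pcs)
  H <- X %*% solve(crossprod(X), t(X))           # projection (hat) matrix
  rss1 <- rowSums((dat - dat %*% H)^2)           # residuals of the full model
  rss0 <- rowSums((dat - rowMeans(dat))^2)       # residuals of the null model
  ((rss0 - rss1) / r) / (rss1 / (n - r - 1))
}

r <- 1; s <- 10; B <- 50
obs <- f_stats(Y, top_pcs(Y, r))                 # observed statistics

# Each iteration permutes s rows, re-estimates the PCs, and keeps the
# statistics of the permuted (synthetic null) rows only
null_stats <- as.vector(replicate(B, {
  idx <- sample(m, s)
  Y_star <- Y
  Y_star[idx, ] <- t(apply(Y[idx, , drop = FALSE], 1, sample))
  f_stats(Y_star[idx, , drop = FALSE], top_pcs(Y_star, r))
}))

# Empirical p-value: fraction of null statistics at least as large as observed
pval <- vapply(obs, function(f) mean(null_stats >= f), numeric(1))
```

Because the null statistics are computed against PCs re-estimated from data that include the permuted rows, they carry the same over-fitting as the observed statistics, which is what makes the comparison fair.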

The jackstraw tests enable us to identify the variables (or observations) that are driving systematic variation, in an unsupervised manner. Using jackstraw_pca, we can find statistically significant variables with regard to the top r principal components. The package also supports the augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA) and randomized Singular Value Decomposition (RSVD) through jackstraw_irlba and jackstraw_rpca. More generally, one can directly specify an estimation method for latent variables in jackstraw_subspace. Similarly, logistic factor analysis (LFA) and ALStructure estimate population structure from genetic data (single-nucleotide polymorphisms; SNPs); jackstraw_lfa and jackstraw_alstructure provide the corresponding association tests between SNPs and population structure.
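
As a usage sketch, jackstraw_pca might be called as follows on simulated data. The data, the choice r = 1, and the values of s and B are assumptions for illustration, and the argument and return names (r, s, B, verbose, p.value) reflect the package documentation as best understood here; check ?jackstraw_pca for your installed version. The call is guarded so the snippet still runs when the package is not installed.

```r
# Simulated data: m variables (rows) by n observations (columns)
set.seed(1)
m <- 200; n <- 20
latent <- rnorm(n)
dat <- matrix(rnorm(m * n), nrow = m, ncol = n)
dat[1:20, ] <- dat[1:20, ] + outer(rep(2, 20), latent)  # rows 1-20 carry signal

if (requireNamespace("jackstraw", quietly = TRUE)) {
  # Test association between each row and the top r = 1 principal component;
  # s synthetic null rows per iteration, B iterations (values chosen arbitrarily)
  out <- jackstraw::jackstraw_pca(dat, r = 1, s = 20, B = 50, verbose = FALSE)

  # One p-value per row of dat
  print(head(out$p.value))
  print(sum(out$p.value < 0.05))
}
```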

Instead of continuous latent variables, one may be interested in estimating discrete clusters from high-dimensional data. jackstraw_kmeans can identify the data points that are statistically significant members of clusters, by testing association between data and cluster centers. This can help select data points that are reliable members of clusters and further improve the cluster membership. Related algorithms, such as Partitioning Around Medoids (PAM, or k-medoids) and Mini Batch K-means, are explicitly supported by jackstraw_pam and jackstraw_MiniBatchKmeans. More generally, jackstraw_cluster can be adapted for other clustering algorithms.
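
A hedged usage sketch for cluster membership testing follows. The simulated two-cluster data and the values of s and B are assumptions for illustration, and the argument and return names (kmeans.dat, s, B, p.F) are taken from the package documentation as best understood here; consult ?jackstraw_kmeans for your installed version. The data points are row-centered before clustering on the assumption that jackstraw_kmeans centers rows by default.

```r
# Simulated data: m data points (rows) by n features (columns), two clusters
# that differ in their pattern across features
set.seed(1)
m <- 60; n <- 10
pattern <- rep(c(1, -1), each = n / 2)
dat <- matrix(rnorm(m * n, sd = 0.5), nrow = m, ncol = n)
dat[1:30, ]  <- sweep(dat[1:30, ], 2, pattern, "+")
dat[31:60, ] <- sweep(dat[31:60, ], 2, pattern, "-")
dat <- dat - rowMeans(dat)             # row-center each data point

# Ordinary K-means from base R (stats package)
km <- kmeans(dat, centers = 2, nstart = 10)

if (requireNamespace("jackstraw", quietly = TRUE)) {
  # Test whether each data point is a significant member of its assigned cluster
  out <- jackstraw::jackstraw_kmeans(dat, kmeans.dat = km, s = 6, B = 100,
                                     verbose = FALSE)

  # Cross-tabulate cluster assignment against significance at the 0.05 level
  print(table(cluster = km$cluster, significant = out$p.F < 0.05))
}
```

Data points with large p-values are candidates for removal or reassignment when refining the cluster membership.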

There are a few additional functions to support statistical inference for unsupervised learning, such as finding the number of PCs or clusters and estimating posterior inclusion probabilities (PIPs) from the jackstraw p-values.

References

Chung, N.C. (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 https://academic.oup.com/bioinformatics/article/36/10/3107/5788523

Chung, N.C. and Storey, J.D. (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545-554 https://academic.oup.com/bioinformatics/article/31/4/545/2748186

Short Tutorials

Association Test with Principal Components with a Gentle Introduction to Latent Variable Models

Statistical Test of Cluster Memberships with a Toy Data Set (mtcars)

Unsupervised Evaluation of Cell Identities in Single Cell Genomics using the 10X Genomics Data

Stable Version on CRAN

The stable version, jackstraw v1.3.8, is on CRAN. It lacks the functionality that requires lfa, gcatest, and alstructure. If you are interested in that functionality, see below for installing the development version from the GitHub repository.

To install the stable version from CRAN:

install.packages("jackstraw")

Development Version on GitHub

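This package is in active development. To install the development version, first install the updated lfa, gcatest, and alstructure packages from these GitHub repositories:

# devtools provides install_github(); install it first if it is missing
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
library(devtools)
install_github("StoreyLab/lfa")
install_github("StoreyLab/alstructure")
install_github("alexviiia/gcatest")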

Eventually, the Bioconductor versions of lfa and gcatest will have these updates; sorry for the temporary inconvenience.

Then, install jackstraw from GitHub:

install.packages("devtools")
library("devtools")
install_github("ncchung/jackstraw")

Troubleshooting

Bioconductor dependencies (e.g., lfa, gcatest, qvalue) may fail to install automatically, which results in a warning.

To solve this problem, install the Bioconductor dependencies manually:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c('qvalue'))

Note that to use the development version, you should install the corresponding packages from their GitHub repositories, as described above.

Implementations and Extensions

Here are some implementations of the jackstraw in different contexts and application domains.

A Python implementation of the jackstraw is available:

jackstraw (Python) by Iain Carmichael

Extension of Jackstraw Inference for AJIVE Data Integration:

Jackstraw significance testing for JIVE in Python

The jackstraw is used in Seurat, an R toolkit for single cell genomics:

Guided Clustering Tutorial

Determine statistical significance of PCA scores

Seurat Wizard (GUI Web App)