jackstraw: Statistical Inference for Unsupervised Learning
This R package performs association tests between the observed data and their systematic patterns of variation. Systematic variation can be modeled by latent variables, which are likely to arise from biological processes, experimental conditions, and environmental factors. We are often interested in estimating these patterns using principal component analysis (PCA), factor analysis (FA), K-means clustering, partitioning around medoids (PAM), and related methods. The jackstraw methods learn the over-fitting characteristics inherent in unsupervised learning, where the same observed data are used both to estimate the systematic patterns and to test associations with them.
Using a variety of unsupervised learning techniques, the jackstraw provides a resampling strategy and testing scheme to estimate the statistical significance of association between the observed data and their systematic patterns of variation. For example, the cell cycle in microarray data may be estimated by principal components (PCs); we can then use the jackstraw for PCA to identify genes that are significantly associated with these PCs. Similarly, cell identities in single cell RNA-seq data are often identified by K-means clustering; the jackstraw for clustering can then evaluate the reliability of these computationally determined cell identities.
The jackstraw tests enable us to identify the variables (or observations) that are driving systematic variation, in an unsupervised manner. Using jackstraw_pca, we can find statistically significant variables with regard to the top r principal components. The package also supports the augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA) and randomized singular value decomposition (RSVD) through jackstraw_irlba and jackstraw_rpca. More generally, one can directly specify an estimation method for latent variables in jackstraw_subspace. Similarly, logistic factor analysis (LFA) and ALStructure estimate population structure from genetic data (single-nucleotide polymorphisms; SNPs); jackstraw_lfa and jackstraw_alstructure provide corresponding association tests between SNPs and population structure.
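As a minimal sketch of the PCA test described above, the following R snippet runs jackstraw_pca on simulated data. The data, the seed, and the values of s (number of synthetic null variables per iteration) and B (number of resampling iterations) are illustrative assumptions rather than recommendations; rows are treated as variables and columns as observations.

```r
library(jackstraw)

set.seed(1)
m <- 200  # number of variables (rows)
n <- 20   # number of observations (columns)

# Simulated data (an assumption for illustration):
# one latent variable drives the first 20 variables only.
lv <- rnorm(n)
dat <- matrix(rnorm(m * n), nrow = m, ncol = n)
dat[1:20, ] <- dat[1:20, ] + 2 * matrix(lv, nrow = 20, ncol = n, byrow = TRUE)

# Test association between each variable and the top r = 1 principal component.
js <- jackstraw_pca(dat, r = 1, s = 10, B = 100, verbose = FALSE)

# One p-value per row of dat; list the 20 rows with the smallest p-values.
length(js$p.value)
head(order(js$p.value), 20)
```

In real analyses, r would typically be chosen from the data rather than fixed in advance, and larger values of B give finer-grained p-values at a higher computational cost.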
Instead of continuous latent variables, one may be interested in estimating discrete clusters from high-dimensional data. jackstraw_kmeans can identify the data points that are statistically significant members of clusters, by testing association between data and cluster centers. This can help select data points that are reliable members of clusters and further improve the cluster membership. Related algorithms, such as Partitioning Around Medoids (PAM, or k-medoids) and Mini Batch K-means, are explicitly supported by jackstraw_pam and jackstraw_MiniBatchKmeans. More generally, jackstraw_cluster can be adapted for other clustering algorithms.
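The cluster membership test can be sketched with the built-in mtcars data set, which the short tutorials also use. The choice of two clusters, the row centering, and the values of s and B are assumptions made for illustration only.

```r
library(jackstraw)

set.seed(1)
# Rows of mtcars (cars) are the data points to be clustered.
car <- as.matrix(mtcars)
car <- t(scale(t(car), center = TRUE, scale = FALSE))  # center each row

# Apply K-means first; its output is passed to the jackstraw test.
km <- kmeans(car, centers = 2, nstart = 10, iter.max = 100)

# Test association between each car and its assigned cluster center.
js_km <- jackstraw_kmeans(car, kmeans.dat = km, s = 3, B = 100, verbose = FALSE)

# Cars with the weakest evidence of cluster membership (largest p-values).
head(sort(setNames(js_km$p.F, rownames(car)), decreasing = TRUE))
```

Data points with large p-values are candidates for closer inspection or removal before downstream analysis.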
There are a few additional functions to support statistical inference for unsupervised learning, such as finding the number of PCs or clusters and estimating posterior inclusion probabilities (PIPs) from the jackstraw p-values.
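A short sketch of two such helpers follows. It assumes that permutationPA (a permutation-based estimate of the number of significant PCs) and pip (PIPs from p-values) are the relevant functions; check the package documentation for their current arguments. The simulated data and p-values are hypothetical stand-ins for real input.

```r
library(jackstraw)

set.seed(1)

# Simulated data (an assumption): one latent variable drives the first 20 rows.
dat <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)
dat[1:20, ] <- dat[1:20, ] + 2 * matrix(rnorm(20), nrow = 20, ncol = 20, byrow = TRUE)

# Estimate the number of significant PCs by permutation.
pa <- permutationPA(dat, B = 50, threshold = 0.05, verbose = FALSE)
pa$r

# Hypothetical p-values standing in for jackstraw output:
# 800 null (uniform) and 200 non-null (concentrated near zero).
pvalue <- c(runif(800), rbeta(200, 0.5, 25))

# Posterior inclusion probabilities; the proportion of null
# p-values (pi0) is estimated from the data when not supplied.
prob <- pip(pvalue, verbose = FALSE)
sum(prob > 0.9)
```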
References
Chung, N.C. (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 https://academic.oup.com/bioinformatics/article/36/10/3107/5788523
Chung, N.C. and Storey, J.D. (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545–554 https://academic.oup.com/bioinformatics/article/31/4/545/2748186
Short Tutorials
Association Test with Principal Components with a Gentle Introduction to Latent Variable Models
Statistical Test of Cluster Memberships with a Toy Data Set (mtcars)
Unsupervised Evaluation of Cell Identities in Single Cell Genomics using the 10X Genomics Data
Stable Version on CRAN
The stable version jackstraw v1.3.8 is on CRAN. It lacks functionality that requires lfa, gcatest, and alstructure. If you are interested in that functionality, see below for installing the development version from this GitHub repo.
To use a stable version from CRAN:
install.packages("jackstraw")
Development Version on GitHub
This package is in active development. To install the development version, first install updates for lfa, gcatest, and alstructure from these GitHub repositories:
# devtools must be installed before it can be loaded
install.packages("devtools")
library(devtools)
install_github("StoreyLab/lfa")
install_github("StoreyLab/alstructure")
install_github("alexviiia/gcatest")
Eventually, the Bioconductor versions of lfa and gcatest will have these updates; sorry for the temporary inconvenience.
Then, install jackstraw from GitHub:
install.packages("devtools")
library("devtools")
install_github("ncchung/jackstraw")
Troubleshooting
Bioconductor dependencies (e.g., lfa, gcatest, qvalue) may fail to install automatically, which results in a warning.
To solve this problem, please install the Bioconductor dependencies manually.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c('qvalue'))
Note that the development version requires the corresponding packages from their GitHub repositories, as listed above.
Implementations and Extensions
Here are some implementations of the jackstraw in different contexts and application domains.
Implementation of the jackstraw in Python is available:
jackstraw (Python) by Iain Carmichael
Jackstraw Inference for AJIVE Data Integration:
Extension of Jackstraw significance testing for JIVE in Python