galileo-k

Fast correlations


Keywords
data-science, numpy, pandas
License
MIT
Install
pip install galileo-k==0.0.1

Documentation

galileo

This package contains several functions for exploratory data analysis, with a focus on association mining between pairs of variables. The methods are optimized for pandas DataFrames and are inspired by the corrcoef function provided by numpy.

Because these functions rely on native matrix-level operations provided by numpy, many are orders of magnitude faster than naive loop-based alternatives. In the benchmarks below, speedups range from roughly 6× for Fisher's exact tests to several hundred times for correlations. This makes the functions useful for constructing large association networks and for feature extraction, both of which matter in areas such as biomarker discovery.

The currently available functions are listed below by variable comparison type. Benchmarks against an equivalent loop-based method are also provided.

Requirements: Python 3, numpy, pandas, scipy, statsmodels.

Functions

Continuous vs. continuous

mat_corrs(A, B, method="pearson")

Computes pairwise Pearson or Spearman correlations between columns of A and B, provided that there are no missing values in either matrix.
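The matrix-level idea behind this kind of computation can be sketched with plain numpy. The following is an illustration only, not the package's implementation: the names pearson_matrix and rank_columns are hypothetical, and the ranking helper assumes there are no tied values within a column.

```python
import numpy as np


def pearson_matrix(A, B):
    """Pearson correlations between every column of A and every column of B.

    A has shape (n_samples, p) and B has shape (n_samples, q); neither may
    contain missing values. Returns an array of shape (p, q).
    Hypothetical helper, not part of the package.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    n = A.shape[0]

    # Standardize each column to zero mean and unit (population) variance.
    A_z = (A - A.mean(axis=0)) / A.std(axis=0)
    B_z = (B - B.mean(axis=0)) / B.std(axis=0)

    # A single matrix product yields all p * q correlations at once.
    return A_z.T @ B_z / n


def rank_columns(X):
    """Rank each column (1 = smallest). Assumes no ties within a column."""
    X = np.asarray(X)
    return X.argsort(axis=0).argsort(axis=0) + 1.0


rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
B = rng.normal(size=(50, 4))

pearson = pearson_matrix(A, B)
# Spearman correlation is Pearson correlation computed on the ranks.
spearman = pearson_matrix(rank_columns(A), rank_columns(B))
```

Both results have shape (3, 4), with one row per column of A and one column per column of B.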

mat_corrs_nan(A, B, method="pearson")

Computes pairwise Pearson or Spearman correlations between A and each column of B, where A is a series and B is a dataframe. Either may contain missing values.
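One way to handle missing values without looping over columns is to mask them out of every sum. The sketch below shows that approach for Pearson correlations with plain numpy arrays. It is not the package's implementation: the function name is hypothetical, and it assumes that, for each column of B, only the rows where both values are present should be used.

```python
import numpy as np


def pearson_series_vs_columns_nan(a, B):
    """Pearson correlation of vector a with each column of B, ignoring NaNs.

    Hypothetical helper, not part of the package. For column j, only rows
    where both a and B[:, j] are observed contribute.
    """
    a = np.asarray(a, dtype=float)
    B = np.asarray(B, dtype=float)

    # valid[i, j] is True where both a[i] and B[i, j] are observed.
    valid = ~np.isnan(a)[:, None] & ~np.isnan(B)
    n = valid.sum(axis=0)

    # Per-column means over the valid rows only.
    a_mean = np.where(valid, a[:, None], 0.0).sum(axis=0) / n
    B_mean = np.where(valid, B, 0.0).sum(axis=0) / n

    # Centered values, with invalid entries set to zero so they drop out.
    a_c = np.where(valid, a[:, None] - a_mean, 0.0)
    B_c = np.where(valid, B - B_mean, 0.0)

    cov = (a_c * B_c).sum(axis=0)
    return cov / np.sqrt((a_c ** 2).sum(axis=0) * (B_c ** 2).sum(axis=0))


rng = np.random.default_rng(1)
a = rng.normal(size=60)
B = rng.normal(size=(60, 5))
a[[3, 17]] = np.nan                      # a few missing values in the series
B[rng.random(B.shape) < 0.1] = np.nan    # roughly 10% missing values in B

r = pearson_series_vs_columns_nan(a, B)
```

The result r holds one correlation per column of B.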

mat_corrs_naive(A, B, method="pearson")

Same functionality as mat_corrs, but uses a double loop for direct computation of statistics.

Continuous vs. categorical

mat_mwus(A, B, use_continuity=True)

Computes pairwise Mann-Whitney U tests between columns of A (continuous samples) and B (binary samples). Assumes that neither A nor B contains any missing values.
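The U statistic lends itself to a matrix formulation because the rank sum of each group is a dot product between a rank vector and a binary indicator. The sketch below computes only the U statistics, not p-values or the continuity correction, and is not the package's implementation. The function name is hypothetical, and the simple ranking used here assumes no ties within a column of A.

```python
import numpy as np


def mwu_statistics(A, B):
    """U statistic of the B == 1 group for every (column of A, column of B).

    A has shape (n_samples, p) and is continuous; B has shape (n_samples, q)
    and is binary. Returns an array of shape (p, q).
    Hypothetical helper, not part of the package.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)

    # Rank each column of A once (1 = smallest); assumes no tied values.
    ranks = A.argsort(axis=0).argsort(axis=0) + 1.0

    # Size of the B == 1 group for each column of B.
    n1 = B.sum(axis=0)

    # Rank sums of the B == 1 group for all pairs in one matrix product.
    rank_sums = ranks.T @ B

    # U = R1 - n1 * (n1 + 1) / 2, broadcast across the columns of B.
    return rank_sums - n1 * (n1 + 1) / 2


rng = np.random.default_rng(2)
A = rng.normal(size=(40, 4))
B = rng.integers(0, 2, size=(40, 3))

U = mwu_statistics(A, B)
```

Without ties, each entry of U equals the number of (group 1, group 0) pairs in which the group 1 value is larger, which gives a direct way to check the result.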

mat_mwus_naive(A, B, use_continuity=True)

Same functionality as mat_mwus, but uses a double loop for direct computation of statistics.

Categorical vs. categorical

mat_fishers(A, B)

Computes pairwise Fisher's exact tests between columns of A and B, provided that both are boolean-castable matrices and do not contain any missing values.
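Fisher's exact test operates on a 2×2 contingency table, and the tables for all column pairs can be built with four matrix products. The sketch below covers only that counting step and is not the package's implementation; the function name is hypothetical. A p-value for each table could then be obtained from the hypergeometric distribution, for example with scipy.stats.fisher_exact.

```python
import numpy as np


def contingency_counts(A, B):
    """2x2 contingency counts for every (column of A, column of B).

    A has shape (n_samples, p) and B has shape (n_samples, q); both are
    cast to boolean. Returns four arrays of shape (p, q).
    Hypothetical helper, not part of the package.
    """
    A = np.asarray(A).astype(bool).astype(int)
    B = np.asarray(B).astype(bool).astype(int)

    both = A.T @ B                 # A true,  B true
    a_only = A.T @ (1 - B)         # A true,  B false
    b_only = (1 - A).T @ B         # A false, B true
    neither = (1 - A).T @ (1 - B)  # A false, B false
    return both, a_only, b_only, neither


rng = np.random.default_rng(4)
A = rng.integers(0, 2, size=(30, 3))
B = rng.integers(0, 2, size=(30, 2))

both, a_only, b_only, neither = contingency_counts(A, B)
```

For every pair, the four counts sum to the number of samples (30 here).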

mat_fishers_nan(A, B)

Computes pairwise Fisher's exact tests between columns of A and B. Both must be boolean-castable matrices, and either may contain missing values.

mat_fishers_naive(A, B)

Same functionality as mat_fishers, but uses a double loop for direct computation of statistics.

Utilities

generate_test(n_samples, A_n_cols, B_n_cols, A_type="continuous", B_type="continuous", nans=False)

Generates randomly-initialized matrix pairs for testing and benchmarking.

Benchmarks

These benchmarks were run with 1,000 samples per variable (i.e., each input matrix had 1,000 rows). The number of variables in A was set to 100, and the number of variables in B was varied as shown below. The number of pairwise comparisons calculated (the product of the column counts of A and B) is also indicated.

Benchmark scripts can be found in /test/benchmarks.ipynb.
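The kind of comparison these benchmarks make can be illustrated with a small timing sketch. This is not the benchmark notebook: it uses numpy's corrcoef for both the looped and the matrix-level variant instead of the package's functions, and the matrix sizes are reduced so that it runs quickly. Timings depend on the machine, so no expected output is given.

```python
import time

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(200, 20))
B = rng.normal(size=(200, 30))
p, q = A.shape[1], B.shape[1]

# Naive approach: one corrcoef call per pair of columns.
start = time.perf_counter()
looped = np.empty((p, q))
for i in range(p):
    for j in range(q):
        looped[i, j] = np.corrcoef(A[:, i], B[:, j])[0, 1]
loop_seconds = time.perf_counter() - start

# Matrix-level approach: one call on all columns, keeping the A-vs-B block.
start = time.perf_counter()
vectorized = np.corrcoef(A, B, rowvar=False)[:p, p:]
matrix_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f} s, matrix: {matrix_seconds:.4f} s")
```

Both variants produce the same 20 × 30 matrix of correlations, so only the runtime differs.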

Pearson correlations

| Column count of B | Total comparisons | Runtime, mat_corrs_naive (s) | Runtime, mat_corrs (s) | Speedup factor |
|---|---|---|---|---|
| 10 | 1,000 | 3.27 | 0.020 | ×163 |
| 100 | 10,000 | 30.55 | 0.066 | ×461 |
| 1,000 | 100,000 | 303.30 | 0.53 | ×574 |

Spearman correlations

| Column count of B | Total comparisons | Runtime, mat_corrs_naive (s) | Runtime, mat_corrs (s) | Speedup factor |
|---|---|---|---|---|
| 10 | 1,000 | 4.47 | 0.026 | ×171 |
| 100 | 10,000 | 44.55 | 0.081 | ×553 |
| 1,000 | 100,000 | 493.73 | 0.70 | ×704 |

Mann-Whitney U tests

| Column count of B | Total comparisons | Runtime, mat_mwus_naive (s) | Runtime, mat_mwus (s) | Speedup factor |
|---|---|---|---|---|
| 10 | 1,000 | 6.94 | 0.18 | ×38 |
| 100 | 10,000 | 60.68 | 1.09 | ×56 |
| 1,000 | 100,000 | 615.59 | 8.15 | ×76 |

Fisher's exact tests

| Column count of B | Total comparisons | Runtime, mat_fishers_naive (s) | Runtime, mat_fishers (s) | Speedup factor |
|---|---|---|---|---|
| 10 | 1,000 | 2.63 | 0.41 | ×6 |
| 100 | 10,000 | 25.19 | 3.78 | ×7 |
| 1,000 | 100,000 | 254.19 | 37.57 | ×7 |