galileo
This package contains several functions for exploratory data analysis, with a focus on association mining between variable pairs. The methods are optimized for pandas DataFrames and are inspired by the corrcoef function provided by numpy.
Because these functions rely on native matrix-level operations provided by numpy, many are orders of magnitude faster than naive looping-based alternatives. This makes them useful for constructing large association networks or for feature extraction, which have important uses in areas such as biomarker discovery.
The functions currently available are listed below by variable comparison type. Benchmarks against an equivalent looping-based method are also provided.
Requirements: Python 3, numpy, pandas, scipy, statsmodels.
Functions
Continuous vs. continuous
mat_corrs(A, B, method="pearson")
Computes pairwise Pearson or Spearman correlations between columns of A and B, provided that there are no missing values in either matrix.
mat_corrs_nan(A, B, method="pearson")
Computes pairwise Pearson or Spearman correlations between A and the columns of B, where A is a Series and B is a DataFrame that may contain missing values.
mat_corrs_naive(A, B, method="pearson")
Same functionality as mat_corrs, but uses a double loop for direct computation of statistics.
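To illustrate the matrix-level idea behind the correlation functions, here is a minimal sketch written with plain numpy and pandas. It is not the package's actual implementation: the helper names pairwise_pearson, pairwise_spearman, and series_vs_frame_pearson_nan are hypothetical, and the missing-value handling (pairwise-complete observations) is an assumption about what a NaN-tolerant variant might do.

```python
import numpy as np
import pandas as pd


def pairwise_pearson(A: pd.DataFrame, B: pd.DataFrame) -> pd.DataFrame:
    """Pearson r between every column of A and every column of B.

    Hypothetical helper; assumes equal row counts and no missing values.
    """
    a = A.to_numpy(dtype=float)
    b = B.to_numpy(dtype=float)
    # Center each column, then scale it to unit Euclidean norm.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a = a / np.linalg.norm(a, axis=0)
    b = b / np.linalg.norm(b, axis=0)
    # A single matrix product yields all pairwise correlations.
    return pd.DataFrame(a.T @ b, index=A.columns, columns=B.columns)


def pairwise_spearman(A: pd.DataFrame, B: pd.DataFrame) -> pd.DataFrame:
    """Spearman rho, computed as Pearson r on column-wise average ranks."""
    return pairwise_pearson(A.rank(), B.rank())


def series_vs_frame_pearson_nan(a: pd.Series, B: pd.DataFrame) -> pd.Series:
    """Pearson r between a Series and each column of B, skipping NaN pairs.

    Hypothetical helper; each column uses only rows where both values exist.
    """
    x = a.to_numpy(dtype=float)[:, None]        # shape (n, 1)
    Y = B.to_numpy(dtype=float)                 # shape (n, q)
    valid = ~np.isnan(x) & ~np.isnan(Y)         # shape (n, q)
    n = valid.sum(axis=0)                       # usable rows per column
    x_mean = np.where(valid, x, 0.0).sum(axis=0) / n
    y_mean = np.where(valid, Y, 0.0).sum(axis=0) / n
    # Centered values, with invalid entries zeroed so they drop out of sums.
    xc = np.where(valid, x - x_mean, 0.0)
    yc = np.where(valid, Y - y_mean, 0.0)
    r = (xc * yc).sum(axis=0) / np.sqrt((xc**2).sum(axis=0) * (yc**2).sum(axis=0))
    return pd.Series(r, index=B.columns)
```

The double-loop variant would instead compute one coefficient per column pair, which repeats the centering and scaling work for every pair.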
Continuous vs. categorical
mat_mwus(A, B, use_continuity=True)
Computes pairwise Mann-Whitney U tests between columns of A (continuous samples) and B (binary samples). Assumes that neither A nor B contains missing values.
mat_mwus_naive(A, B, use_continuity=True)
Same functionality as mat_mwus, but uses a double loop for direct computation of statistics.
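As a rough illustration of how Mann-Whitney U statistics can be obtained for all column pairs at once, the sketch below ranks each column of A once and then sums the ranks of the B == 1 group with a single matrix product. The helper name pairwise_mwu_statistic is hypothetical, the code covers only the U statistic (no p-values or continuity correction), and it should not be read as the package's own implementation.

```python
import numpy as np
import pandas as pd


def pairwise_mwu_statistic(A: pd.DataFrame, B: pd.DataFrame) -> pd.DataFrame:
    """U statistic of the B == 1 group for every (A column, B column) pair.

    Hypothetical helper; assumes B holds only 0/1 values and that neither
    input contains missing values.
    """
    ranks = A.rank().to_numpy()             # average ranks per column, (n, p)
    groups = B.to_numpy(dtype=float)        # 0/1 group indicators, (n, q)
    n1 = groups.sum(axis=0)                 # size of the B == 1 group, (q,)
    rank_sums = ranks.T @ groups            # rank sum of group 1, (p, q)
    # U1 = R1 - n1 * (n1 + 1) / 2, broadcast over the columns of B.
    u1 = rank_sums - n1 * (n1 + 1) / 2
    return pd.DataFrame(u1, index=A.columns, columns=B.columns)
```

Ranking is done once per column of A rather than once per pair, which is where most of the saving over a double loop comes from in this sketch.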
Categorical vs. categorical
mat_fishers(A, B)
Computes pairwise Fisher's exact tests between columns of A and B, provided that both are boolean-castable matrices and do not contain any missing values.
mat_fishers_nan(A, B)
Computes pairwise Fisher's exact tests between columns of A and B, where both are boolean-castable matrices that may contain missing values.
mat_fishers_naive(A, B)
Same functionality as mat_fishers, but uses a double loop for direct computation of statistics.
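For the categorical case, the four cells of every 2×2 contingency table can be counted with matrix products before any test is run. The following is a minimal sketch of that counting step only; the helper name pairwise_contingency is hypothetical and the code is an assumption about one way to do it, not a description of how mat_fishers works internally.

```python
import numpy as np
import pandas as pd


def pairwise_contingency(A: pd.DataFrame, B: pd.DataFrame):
    """Count 2x2 contingency cells for every (A column, B column) pair.

    Hypothetical helper; assumes boolean-castable inputs without missing
    values. Each returned array has shape (A columns, B columns).
    """
    a = A.to_numpy(dtype=bool).astype(np.int64)
    b = B.to_numpy(dtype=bool).astype(np.int64)
    both = a.T @ b                    # A true,  B true
    a_only = a.T @ (1 - b)            # A true,  B false
    b_only = (1 - a).T @ b            # A false, B true
    neither = (1 - a).T @ (1 - b)     # A false, B false
    return both, a_only, b_only, neither
```

Each pair's table, [[both, a_only], [b_only, neither]], could then be passed to scipy.stats.fisher_exact to obtain an odds ratio and p-value.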
Utilities
generate_test(n_samples, A_n_cols, B_n_cols, A_type="continuous", B_type="continuous", nans=False)
Generates randomly-initialized matrix pairs for testing and benchmarking.
Benchmarks
These benchmarks were run with 1,000 samples per variable (i.e., each input matrix had 1,000 rows). The number of variables in A was set to 100, and the number of variables in B was varied as shown below. The number of pairwise comparisons calculated (the product of the column counts of A and B) is also indicated.
Benchmark scripts can be found in /test/benchmarks.ipynb.
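For readers who want to reproduce this kind of comparison without the notebook, a timing harness might look like the sketch below. It uses stand-in functions written directly with numpy (corr_loop and corr_matrix are hypothetical names, not the package's functions), so the timings it produces will not match the tables that follow.

```python
import time

import numpy as np


def corr_loop(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pearson r for every column pair, one np.corrcoef call per pair."""
    out = np.empty((a.shape[1], b.shape[1]))
    for i in range(a.shape[1]):
        for j in range(b.shape[1]):
            out[i, j] = np.corrcoef(a[:, i], b[:, j])[0, 1]
    return out


def corr_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pearson r for every column pair via one matrix product."""
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    return a.T @ b / a.shape[0]


def run_benchmark(n_samples: int = 1000, a_cols: int = 100, b_cols: int = 10):
    """Time both stand-ins on random data and report their disagreement."""
    rng = np.random.default_rng(0)
    a = rng.normal(size=(n_samples, a_cols))
    b = rng.normal(size=(n_samples, b_cols))

    start = time.perf_counter()
    looped = corr_loop(a, b)
    loop_seconds = time.perf_counter() - start

    start = time.perf_counter()
    vectorized = corr_matrix(a, b)
    matrix_seconds = time.perf_counter() - start

    max_abs_diff = float(np.abs(looped - vectorized).max())
    return loop_seconds, matrix_seconds, max_abs_diff


if __name__ == "__main__":
    loop_s, matrix_s, diff = run_benchmark()
    print(f"loop: {loop_s:.3f} s, matrix: {matrix_s:.3f} s, max diff: {diff:.2e}")
```

Checking the maximum absolute difference alongside the timings guards against benchmarking two functions that do not actually compute the same thing.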
Pearson correlations
Column count of B | Total comparisons | Runtime, mat_corrs_naive (seconds) | Runtime, mat_corrs (seconds) | Speedup factor |
---|---|---|---|---|
10 | 1,000 | 3.27 | 0.020 | ×163 |
100 | 10,000 | 30.55 | 0.066 | ×461 |
1,000 | 100,000 | 303.30 | 0.53 | ×574 |
Spearman correlations
Column count of B | Total comparisons | Runtime, mat_corrs_naive (seconds) | Runtime, mat_corrs (seconds) | Speedup factor |
---|---|---|---|---|
10 | 1,000 | 4.47 | 0.026 | ×171 |
100 | 10,000 | 44.55 | 0.081 | ×553 |
1,000 | 100,000 | 493.73 | 0.70 | ×704 |
Mann-Whitney U tests
Column count of B | Total comparisons | Runtime, mat_mwus_naive (seconds) | Runtime, mat_mwus (seconds) | Speedup factor |
---|---|---|---|---|
10 | 1,000 | 6.94 | 0.18 | ×38 |
100 | 10,000 | 60.68 | 1.09 | ×56 |
1,000 | 100,000 | 615.59 | 8.15 | ×76 |
Fisher's exact tests
Column count of B | Total comparisons | Runtime, mat_fishers_naive (seconds) | Runtime, mat_fishers (seconds) | Speedup factor |
---|---|---|---|---|
10 | 1,000 | 2.63 | 0.41 | ×6 |
100 | 10,000 | 25.19 | 3.78 | ×7 |
1,000 | 100,000 | 254.19 | 37.57 | ×7 |