galileo

This package contains several functions for exploratory data analysis, with a focus on association mining between pairs of variables. The methods are optimized for pandas DataFrames and are inspired by NumPy's corrcoef function.

Because these functions rely on native matrix-level operations provided by numpy, many are orders of magnitude faster than naive looping-based alternatives. This makes them useful for constructing large association networks or for feature extraction, which have important uses in areas such as biomarker discovery.

The current functions available are listed below by variable comparison type. Benchmarks are also provided with comparisons to an equivalent looping-based method.

Requirements: Python 3, numpy, pandas, scipy, statsmodels.

Functions

Continuous vs. continuous

mat_corrs(A, B, method="pearson")

Computes pairwise Pearson or Spearman correlations between columns of A and B, provided that there are no missing values in either matrix.
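As a rough illustration of the matrix-level idea, all pairwise Pearson correlations can be obtained from a single matrix product once every column has been centred and scaled to unit length. This is a minimal sketch under that assumption, not galileo's own code; the name pairwise_pearson_sketch is hypothetical.

```python
import numpy as np

def pairwise_pearson_sketch(A, B):
    """Pearson correlation of every column of A with every column of B.

    Hypothetical helper for illustration only. A is (n, p) and B is (n, q),
    both without missing values. Returns a (p, q) array.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Centre each column, then scale it to unit Euclidean length.
    A_c = A - A.mean(axis=0)
    B_c = B - B.mean(axis=0)
    A_u = A_c / np.linalg.norm(A_c, axis=0)
    B_u = B_c / np.linalg.norm(B_c, axis=0)
    # One matrix product yields all p * q correlations at once.
    return A_u.T @ B_u
```

Spearman correlations can be obtained the same way by replacing each column with its ranks before calling the function.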

mat_corrs_nan(A, B, method="pearson")

Computes pairwise Pearson or Spearman correlations between a series A and each column of a dataframe B, where B may contain missing values.
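One way to handle missing values without looping is to build a mask of jointly observed rows for each column and compute the sums that define Pearson's r over those rows only. The sketch below assumes this masking approach and plain NumPy inputs; the function name is hypothetical and the package may do this differently.

```python
import numpy as np

def series_vs_columns_pearson_sketch(a, B):
    """Pearson correlation of vector a with each column of B, skipping NaNs.

    Hypothetical helper for illustration only. a is (n,), B is (n, q).
    Returns a (q,) array.
    """
    a = np.asarray(a, dtype=float)
    B = np.asarray(B, dtype=float)
    # A row counts for column j only if both a and B[:, j] are observed there.
    valid = ~np.isnan(a)[:, None] & ~np.isnan(B)
    a_v = np.where(valid, a[:, None], 0.0)
    b_v = np.where(valid, B, 0.0)
    n = valid.sum(axis=0)
    # Per-column sums restricted to the valid rows.
    cov = (a_v * b_v).sum(axis=0) - a_v.sum(axis=0) * b_v.sum(axis=0) / n
    var_a = (a_v ** 2).sum(axis=0) - a_v.sum(axis=0) ** 2 / n
    var_b = (b_v ** 2).sum(axis=0) - b_v.sum(axis=0) ** 2 / n
    return cov / np.sqrt(var_a * var_b)
```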

mat_corrs_naive(A, B, method="pearson")

Same functionality as mat_corrs, but uses a double loop for direct computation of statistics.

Continuous vs. categorical

mat_mwus(A, B, use_continuity=True)

Computes pairwise Mann-Whitney U tests between columns of A (continuous samples) and B (binary samples). Assumes that neither A nor B contains any missing values.
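The U statistic lends itself to the same matrix treatment: rank each column of A once, then a single product with the binary matrix B gives the rank sum of the flagged group for every pair. The sketch below is illustrative only. It assumes the continuous values contain no ties, computes the statistics but not the p-values, and uses a hypothetical function name.

```python
import numpy as np

def pairwise_mwu_u_sketch(A, B):
    """U statistic of the B == 1 group for every (A column, B column) pair.

    Hypothetical helper for illustration only. A is (n, p) continuous with
    no tied values; B is (n, q) binary. Returns a (p, q) array.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B).astype(bool).astype(float)
    # Rank each column of A (1 = smallest); correct only when there are no ties.
    ranks = A.argsort(axis=0).argsort(axis=0) + 1.0
    # Rank sum of the flagged samples for every pair, in one product.
    rank_sums = ranks.T @ B
    # Size of the flagged group in each column of B.
    n1 = B.sum(axis=0)
    return rank_sums - n1 * (n1 + 1) / 2
```

With no ties, each entry equals the number of (flagged, unflagged) sample pairs in which the flagged value is the larger one.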

mat_mwus_naive(A, B, use_continuity=True)

Same functionality as mat_mwus, but uses a double loop for direct computation of statistics.

Categorical vs. categorical

mat_fishers(A, B)

Computes pairwise Fisher's exact tests between columns of A and B, provided that both are boolean-castable matrices and do not contain any missing values.
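Fisher's exact test needs a 2x2 contingency table for each pair of columns, and the four cell counts for all pairs can be produced with four matrix products. The following is a minimal sketch of that counting step only, with a hypothetical function name; it does not compute p-values and is not taken from the package.

```python
import numpy as np

def pairwise_contingency_sketch(A, B):
    """Cell counts of the 2x2 table for every (A column, B column) pair.

    Hypothetical helper for illustration only. A is (n, p), B is (n, q),
    both boolean-castable and free of missing values.
    Each returned array has shape (p, q).
    """
    A1 = np.asarray(A).astype(bool).astype(float)
    B1 = np.asarray(B).astype(bool).astype(float)
    # Indicator matrices for the False entries.
    A0, B0 = 1.0 - A1, 1.0 - B1
    return {
        "both": A1.T @ B1,      # A true,  B true
        "a_only": A1.T @ B0,    # A true,  B false
        "b_only": A0.T @ B1,    # A false, B true
        "neither": A0.T @ B0,   # A false, B false
    }
```

For a single pair (i, j), the table [[both, a_only], [b_only, neither]] can then be passed to scipy.stats.fisher_exact.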

mat_fishers_nan(A, B)

Computes pairwise Fisher's exact tests between columns of A and B, where both are boolean-castable matrices that may contain missing values.
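When missing values are present, the contingency counts for a pair should include only rows observed in both columns. One possible approach, sketched here with a hypothetical function name and 0/1/NaN float inputs as an assumption, is to build separate indicator matrices for the 1 entries and the 0 entries so that NaN rows fall into neither.

```python
import numpy as np

def pairwise_contingency_nan_sketch(A, B):
    """2x2 cell counts per column pair, ignoring rows with a NaN in either column.

    Hypothetical helper for illustration only. A is (n, p), B is (n, q),
    holding 0.0, 1.0 or NaN. Each returned array has shape (p, q).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # NaN compares unequal to both 0 and 1, so it is absent from every indicator.
    A1, A0 = (A == 1).astype(float), (A == 0).astype(float)
    B1, B0 = (B == 1).astype(float), (B == 0).astype(float)
    return {
        "both": A1.T @ B1,
        "a_only": A1.T @ B0,
        "b_only": A0.T @ B1,
        "neither": A0.T @ B0,
    }
```

Explicit comparisons are used instead of a boolean cast because casting NaN to bool yields True.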

mat_fishers_naive(A, B)

Same functionality as mat_fishers, but uses a double loop for direct computation of statistics.

Utilities

generate_test(n_samples, A_n_cols, B_n_cols, A_type="continuous", B_type="continuous", nans=False)

Generates randomly-initialized matrix pairs for testing and benchmarking.
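For readers who want to experiment without the package, a stand-in generator along these lines is easy to write. Everything in the sketch below is an assumption made for illustration: the name make_test_pair, the "binary" type label, the 10% missing rate, the seed argument, and the choice to return NumPy arrays. The accepted type values and the return type of generate_test are not stated here.

```python
import numpy as np

def make_test_pair(n_samples, a_cols, b_cols, a_type="continuous",
                   b_type="continuous", nans=False, seed=0):
    """Return two random matrices sharing n_samples rows (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    def make(n_cols, kind):
        if kind == "continuous":
            M = rng.normal(size=(n_samples, n_cols))
        elif kind == "binary":  # assumed label, not taken from the package
            M = rng.integers(0, 2, size=(n_samples, n_cols)).astype(float)
        else:
            raise ValueError(f"unknown type: {kind}")
        if nans:
            # Blank out roughly 10% of entries (arbitrary rate for this sketch).
            M[rng.random(M.shape) < 0.1] = np.nan
        return M

    return make(a_cols, a_type), make(b_cols, b_type)
```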

Benchmarks

These benchmarks were run with 1,000 samples per variable (i.e., each input matrix had 1,000 rows). The number of variables in A was set to 100, and the number of variables in B was varied as shown below. The number of pairwise comparisons calculated (the product of the column counts of A and B) is also indicated.

Benchmark scripts can be found in /test/benchmarks.ipynb.
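A comparison of this kind can be set up in a few lines. The harness below is a sketch, not the notebook's code: it uses stand-in loop and matrix implementations of Pearson correlation (both hypothetical names) on the smallest configuration described above, 100 columns in A and 10 in B, and checks that the two agree. Timings depend on the machine, so none are quoted.

```python
import time
import numpy as np

def corr_loop(A, B):
    """Double loop over column pairs, one np.corrcoef call per pair."""
    out = np.empty((A.shape[1], B.shape[1]))
    for i in range(A.shape[1]):
        for j in range(B.shape[1]):
            out[i, j] = np.corrcoef(A[:, i], B[:, j])[0, 1]
    return out

def corr_matrix(A, B):
    """Same statistics from a single matrix product."""
    A_c = A - A.mean(axis=0)
    B_c = B - B.mean(axis=0)
    A_u = A_c / np.linalg.norm(A_c, axis=0)
    B_u = B_c / np.linalg.norm(B_c, axis=0)
    return A_u.T @ B_u

def time_call(fn, *args):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

rng = np.random.default_rng(0)
n_samples, a_cols, b_cols = 1000, 100, 10
A = rng.normal(size=(n_samples, a_cols))
B = rng.normal(size=(n_samples, b_cols))
n_comparisons = a_cols * b_cols  # product of the two column counts

loop_result, loop_seconds = time_call(corr_loop, A, B)
matrix_result, matrix_seconds = time_call(corr_matrix, A, B)
print(f"{n_comparisons} comparisons: "
      f"loop {loop_seconds:.3f}s, matrix {matrix_seconds:.3f}s")
```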

Pearson correlations

Column count of B	Total comparisons	Runtime, `mat_corrs_naive`, seconds	Runtime, `mat_corrs`, seconds	Speedup factor
10	1,000	3.27	0.020	×163
100	10,000	30.55	0.066	×461
1,000	100,000	303.30	0.53	×574

Spearman correlations

Column count of B	Total comparisons	Runtime, `mat_corrs_naive`, seconds	Runtime, `mat_corrs`, seconds	Speedup factor
10	1,000	4.47	0.026	×171
100	10,000	44.55	0.081	×553
1,000	100,000	493.73	0.70	×704

Mann-Whitney U tests

Column count of B	Total comparisons	Runtime, `mat_mwus_naive`, seconds	Runtime, `mat_mwus`, seconds	Speedup factor
10	1,000	6.94	0.18	×38
100	10,000	60.68	1.09	×56
1,000	100,000	615.59	8.15	×76

Fisher's exact tests

Column count of B	Total comparisons	Runtime, `mat_fishers_naive`, seconds	Runtime, `mat_fishers`, seconds	Speedup factor
10	1,000	2.63	0.41	×6
100	10,000	25.19	3.78	×7
1,000	100,000	254.19	37.57	×7

galileo-k
Release 0.0.1
