WhyPy

A Python repository for causal inference.

Currently available approaches in this repository are based on Additive Noise Models (ANMs).

Install:

pip install whypy

Content:

A short introduction to the theory of causal inference
A quick start example showing how to run causal inference with this repository
Additive Noise Models in WhyPy
Various templates for observations, regression models and scalers

Models:

The WhyPy Toolbox distinguishes four possible models:

The data producing process is steady state + The model is bivariate (one independent variable)
The data producing process is steady state + The model is multivariate (n independent variables)
The data producing process is transient (t₀: offset, s: stride) + The model is bivariate (one independent variable)
The data producing process is transient (t₀: offset, s: stride) + The model is multivariate (n independent variables)

Causal Inference (Short Introduction)

The most elementary question of causality is whether "X causes Y or vice versa". An often-discussed example is whether smoking (X) causes cancer (Y). Even here the question about causal relationships becomes more complex: besides the possibility that X causes Y (X → Y), other causal relationships are possible. One is that a third variable Z confounds both X and Y (X ← Z → Y). In the confounding case, looking only at X and Y might show a correlation due to the confounder even though the two are not causally related. [1], [2]
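The confounding case can be made concrete with a small simulation. This is only an illustrative sketch on synthetic data: the noise levels, the sample size and the helper name pearson are chosen freely here and are not part of WhyPy.

```python
import random
from statistics import fmean


def pearson(a, b):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = fmean(a), fmean(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(42)
z = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # confounder Z
x = [zi + rng.gauss(0.0, 0.3) for zi in z]      # X depends only on Z
y = [zi + rng.gauss(0.0, 0.3) for zi in z]      # Y depends only on Z

r_xy = pearson(x, y)
# X and Y are strongly correlated although neither causes the other
print(f"correlation(X, Y) = {r_xy:.2f}")
```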

Causal inference is the task of learning causal relationships from purely observational data. This task is a fundamental problem in science. A variety of causal inference methods are available that are claimed to solve this task under certain assumptions, for example no confounding, no feedback loops or no selection bias. Be aware that results given by causal inference are only valid under the method's assumptions. To draw causal conclusions, these methods generally exploit the complexity of the underlying models of the observational data. [2], [3]

The family of causal inference methods used here is Additive Noise Models (ANMs). In ANMs the influence of noise is restricted to be additive (Y ∼ f(X) + N_Y). Methods in this class are based either on independence of residuals or on maximum likelihood. The procedure in the WhyPy Toolbox is the following:

Input:

Observations: X, Y
Regression Model: M
Scaler (optional): n_γ(⋅)

Normalization (optional):

Calculate X^⋆ = n_x(X)
Calculate Y^⋆ = n_y(Y)

Bootstrap (optional):

Get Bootstrap Sample of Observations: X^⋆, Y^⋆

Time Shift (if model is transient):

a) Shift X^⋆ = X^⋆[0:-i:s], Y^⋆ = Y^⋆[i::s]
b) Shift Y^⋆ = Y^⋆[0:-i:s], X^⋆ = X^⋆[i::s]

Holdout (optional):

Split X^⋆ → X^⋆_regress, X^⋆_test
Split Y^⋆ → Y^⋆_regress, Y^⋆_test

Fit Regression Model:

a) Fit M_{X^⋆_regress → Y^⋆_regress}
b) Fit M_{Y^⋆_regress → X^⋆_regress}

Predict based on Regression Model:

a) Regress Ŷ^⋆_test = M_{X^⋆_regress → Y^⋆_regress}(X^⋆_test)
b) Regress X̂^⋆_test = M_{Y^⋆_regress → X^⋆_regress}(Y^⋆_test)

Get Residuals:

a) Calculate ε̂_{X^⋆_test → Y^⋆_test} = Ŷ^⋆_test - Y^⋆_test
b) Calculate ε̂_{Y^⋆_test → X^⋆_test} = X̂^⋆_test - X^⋆_test

Evaluation Test:

a) Test ε̂_{X^⋆_test → Y^⋆_test} vs. X^⋆
b) Test ε̂_{Y^⋆_test → X^⋆_test} vs. Y^⋆

Interpretation:

Please refer to the given literature
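
The holdout, fit, predict and residual steps above can be sketched in plain Python. This is a minimal illustration rather than WhyPy's implementation: a least-squares line stands in for the regression model M, the data are synthetic, and the helper names fit_line and holdout_residuals are hypothetical. The final evaluation test is left to the toolbox.

```python
import random
from statistics import fmean


def fit_line(x, y):
    """Least-squares fit of y ~ slope * x + intercept (stand-in for M)."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_residuals(x_regress, y_regress, x_test, y_test):
    """Fit on the regress split, return prediction minus observation on the test split."""
    slope, intercept = fit_line(x_regress, y_regress)
    return [slope * xi + intercept - yi for xi, yi in zip(x_test, y_test)]


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(200)]
y = [2.0 * xi + rng.uniform(-0.5, 0.5) for xi in x]  # synthetic ground truth: X -> Y

split = int(0.8 * len(x))  # holdout: first 80 % for regression, the rest for test

# a) residuals of the model X -> Y, b) residuals of the model Y -> X
res_xy = holdout_residuals(x[:split], y[:split], x[split:], y[split:])
res_yx = holdout_residuals(y[:split], x[:split], y[split:], x[split:])

print(len(res_xy), len(res_yx))
```

In an ANM analysis the two residual series would next be tested for dependence on their respective inputs; the direction whose residuals look independent of the input is the more plausible causal direction.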

Quick Start

import whypy

1. Load predefined templates of observations, regression model and scaler:

obs = whypy.load.observations(modelclass=2, no_obs=500, seed=1)
regmod = whypy.load.model_lingam(term='spline')
scaler = whypy.load.scaler_standard()

2. Initialize a bivariate steadystate ANM-Model:

mymodel = whypy.steadystate.bivariate.Model(obs=obs, combinations='all', regmod=regmod, scaler=scaler)

3. Run Causal Inference

mymodel.run(testtype='LikelihoodVariance',
            scale=True,
            bootstrap=100,
            holdout=True,
            plot_inference=True,
            plot_results=True,
            )

Causal Model

Init Instance

Import Whypy Toolbox

import whypy

The data producing process is steady state + The model is bivariate (one independent variable)

whypy.steadystate.bivariate.Model(obs, combinations, regmod, obs_name, scaler)

The data producing process is steady state + The model is multivariate (n independent variables)

whypy.steadystate.mvariate.Model(obs, combinations, regmod, obs_name, scaler)

The data producing process is transient (t₀: offset, s: stride) + The model is bivariate (one independent variable)

whypy.transient.bivariate.Model(obs, combinations, regmod, obs_name, scaler, t0, stride)

The data producing process is transient (t₀: offset, s: stride) + The model is multivariate (n independent variables)

whypy.transient.mvariate.Model(obs, combs, regmod, obs_name, scaler, t0, stride)

Instance-Parameters

To run causal inference a model instance must be initialized with the following attributes:

obs:

Type: Numpy Array of shape(m, n)
- m: number of observations
- n: number of variables
Description: All variables to be tested in different combinations.

combs:

Type: 'all' (default) or nested list
Logic: The first number in each combination identifies the dependent variable; the following numbers identify the independent variables:
- Combination 1: [[dependent_variable_1, independent_variable_2, independent_variable_3, ...],
- Combination 2: [dependent_variable_2, independent_variable_1, independent_variable_3, ...],
- Combination j: ... ,
- Combination k: [...]]
Description: Combinations of dependent and independent variables to be tested.
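
As an illustration of the nested-list format, the sketch below builds one combination per variable, with that variable as the dependent one and all others as independent ones. Whether this matches what 'all' generates internally, and whether variables are numbered from 0 or from 1, is not stated here; the zero-based column indices and the helper name one_vs_rest are assumptions.

```python
def one_vs_rest(n_variables):
    """Return [[dependent, independent, ...], ...] with each variable as dependent once."""
    return [
        [dep] + [ind for ind in range(n_variables) if ind != dep]
        for dep in range(n_variables)
    ]


combs = one_vs_rest(3)
print(combs)  # [[0, 1, 2], [1, 0, 2], [2, 0, 1]]
```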

regmod:

Type: Model Object or List of Model Objects
- Condition: Models must provide "fit" and "predict" methods
- If list of models is given, list must have same length as number k of combinations
Description: Models to regress independent and dependent variables.
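
The condition above describes an interface rather than a specific class. A toy object that satisfies it could look like the sketch below; the class MeanRegressor is hypothetical, and the argument layout (rows of feature values, as in scikit-learn) is an assumption. In practice a pyGAM or scikit-learn estimator would take its place.

```python
from statistics import fmean


class MeanRegressor:
    """Toy model that always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = fmean(y)
        return self  # returning self allows call chaining

    def predict(self, X):
        return [self.mean_ for _ in X]


regmod = MeanRegressor().fit([[0.0], [1.0], [2.0]], [1.0, 2.0, 6.0])
print(regmod.predict([[5.0], [7.0]]))  # [3.0, 3.0]
```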

obs_name (optional):

Type: List with name strings of shape(n)
- n: number of variables
Description: Variable Naming, default is X1, X2, ... Xn

scaler (optional):

Type: Model Object or List of Model Objects
- Condition: Models must provide "fit", "transform" and "inverse_transform" methods
- If list of models is given, list must have same length as number k of combinations
Description: Models to scale observations before regression.
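
The three required methods can be illustrated with a toy min-max scaler for a single variable. This is a sketch of the interface only: the class name MinMaxScaler1D is hypothetical, it works on plain lists instead of NumPy arrays, and it assumes the fitted values are not all equal.

```python
class MinMaxScaler1D:
    """Toy min-max scaler for a single variable given as a list of floats."""

    def fit(self, values):
        self.min_, self.max_ = min(values), max(values)
        return self

    def transform(self, values):
        span = self.max_ - self.min_  # assumes max > min
        return [(v - self.min_) / span for v in values]

    def inverse_transform(self, scaled):
        span = self.max_ - self.min_
        return [s * span + self.min_ for s in scaled]


scaler = MinMaxScaler1D().fit([2.0, 4.0, 10.0])
scaled = scaler.transform([2.0, 4.0, 10.0])
print(scaled)                            # [0.0, 0.25, 1.0]
print(scaler.inverse_transform(scaled))  # [2.0, 4.0, 10.0]
```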

t0 (required in transient models):

Type: Integer
Description: Offset Y[t₀::] ∼ f(X[:-t₀:])

stride (required in transient models):

Type: Integer
Description: Y[::stride] ∼ f(X[::stride])
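
Offset and stride combine as in the time-shift step of the procedure (X[0:-t₀:s] paired with Y[t₀::s]). A minimal sketch on plain Python lists, which slice the same way as one-dimensional NumPy arrays; the toy series are invented for illustration.

```python
t0, stride = 2, 3

x = list(range(10))                     # x[t] = t
y = [0, 0] + [10 * v for v in x[:-t0]]  # y[t] = 10 * x[t - t0] for t >= t0

x_shifted = x[0:-t0:stride]  # drop the last t0 samples, then take every stride-th
y_shifted = y[t0::stride]    # drop the first t0 samples, then take every stride-th

print(x_shifted)  # [0, 3, 6]
print(y_shifted)  # [0, 30, 60]
```

After the shift each y value is paired with the x value that precedes it by t₀ steps. Note that in Python the slice x[0:-0:stride] is empty, so this slicing form only yields data for t₀ > 0.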

Instance-Methods

model.run(): Do Causal Inference

model.run(testtype='LikelihoodVariance', scale=True, bootstrap=False, holdout=False, plot_inference=True, plot_results=True, **kwargs)

testtype:

Type: 'LikelihoodVariance' (default), 'LikelihoodEntropy' (to be done), 'KolmogorovSmirnoff', 'MannWhitney', 'HSIC' (to be done)
Description: Choose a test metric to be performed.

scale:

Type: True (default) or False
Description: If True scale observations before regression.

bootstrap:

Type: True or False (default)
Description: Whether to bootstrap over the observations or not (see also bootstrap_ratio and bootstrap_seed)

holdout:

Type: True or False (default)
Description: Whether to split observations between regression and test or not (see also holdout_ratio and holdout_seed)

plot_inference:

Type: True (default) or False
Description: Plot various visualisations of the inference (Pairgrid of observations, 2D Regression, Histograms)

plot_results:

Type: True (default) or False
Description: Plot DataFrames of Normality Tests, Goodness of Fit, Independence Test and BoxPlot of test results.

bootstrap_ratio:

Type: Float between 0.0 and 1.0; default is 1.0
Description: Ratio of the original number of observations m to be used for bootstrapping.

bootstrap_seed:

Type: None (default) or int
Description: Seed the generator for bootstrapping.
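
How bootstrap_ratio and bootstrap_seed could interact is sketched below. Drawing with replacement is the usual bootstrap convention and is assumed here; WhyPy's exact sampling scheme is not described in this document, and the helper name bootstrap_indices is hypothetical.

```python
import random


def bootstrap_indices(m, ratio=1.0, seed=None):
    """Draw round(m * ratio) row indices from range(m) with replacement."""
    rng = random.Random(seed)
    return rng.choices(range(m), k=round(m * ratio))


idx = bootstrap_indices(m=500, ratio=0.5, seed=1)
print(len(idx))  # 250
```

Passing the same seed reproduces the same sample, which makes repeated runs comparable.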

holdout_ratio:

Type: Float between 0.0 and 1.0; default is 0.2
Description: Ratio of the original observations number m to be used to holdout for test.

holdout_seed:

Type: None (default) or int
Description: Seed the generator for holdout.
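
A holdout split driven by holdout_ratio and holdout_seed might look like the following sketch. The shuffling strategy and the helper name holdout_split are assumptions for illustration, not WhyPy's implementation.

```python
import random


def holdout_split(m, holdout_ratio=0.2, seed=None):
    """Shuffle range(m) and return (regress_indices, test_indices)."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)
    n_test = round(m * holdout_ratio)
    return indices[n_test:], indices[:n_test]


regress_idx, test_idx = holdout_split(m=100, holdout_ratio=0.2, seed=1)
print(len(regress_idx), len(test_idx))  # 80 20
```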

modelpts:

Type: Integer; default is 50
Description: Number of points used to visualize the regression model.

gridsearch:

Type: True or False (default)
Description: Whether or not a gridsearch should be performed to find the regmod's hyperparameters. If gridsearch is True and the model is not pygam, a param_grid parameter must be passed.

param_grid:

Type: dict()
Description: Defines the hyperparameters to be tested in gridsearch. Must fit to the given regmod. Not needed if model is pygam.
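
A param_grid maps hyperparameter names to the candidate values to try, and a grid search evaluates every combination. The sketch below only enumerates those combinations. The keys C and epsilon are hyperparameters of scikit-learn's SVR and serve purely as an example; which names a given regmod accepts depends on that model.

```python
from itertools import product

param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1]}

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))  # 6
print(candidates[0])    # {'C': 0.1, 'epsilon': 0.01}
```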

model.plot_inference(): Equal to Method "run" Parameter plot_inference

model.plot_inference()

model.plot_results(): Equal to Method "run" Parameter plot_results

model.plot_results()

model.get_combs(): Returns the Nested List of Combinations used in model.run()

model.get_combs()

model.get_regmod(): Returns the List of Regression Models used in model.run()

model.get_regmod()

model.get_scaler(): Returns the List of Scalers used in model.run()

model.get_scaler()

model.get_obs_name(): Returns the List of Observation Names assigned in model.run()

model.get_obs_name()

Instance-Attributes

model.results: DataFrame containing all results.

model.results

model.results['Fitted Combination']:

Type: String
Description: One String listing all Observation Names tested in the given Combination

model.results['Bivariate Comparison']:

Type: String
Description: One String describing a Bivariate Case out of the above combination.

model.results['tdep']:

Type: Int
Description: Dependent Variable in the Bivariate Case.

model.results['tindeps']:

Type: List
Description: List of all independent Variables in the Combination.

model.results['tindep']:

Type: Int
Description: Independent Variable in the Bivariate Case.

model.results['Normality Indep. Variable SW_pvalue [List]'], ['... [Median]'], ['... [SD]']:

Type: [List] -> dumped json | [Median] -> mean of all results given in list (float)| [SD] -> standard deviation of all results given in list (float)
Description: Normality Test on Independent Variable based on scipy.stats.shapiro()

model.results['Normality Indep. Variable Pearson_pvalue [List]'], ['... [Median]'], ['... [SD]']:

Type: see above
Description: Normality Test on Independent Variable based on scipy.stats.normaltest()

model.results['Normality Indep. Variable Combined_pvalue [List]'], ['... [Median]'], ['... [SD]']:

Type: see above
Description: Normality Test on Independent Variable based on scipy.stats.combine_pvalues()

model.results['Normality Depen. Variable SW_pvalue [List]'], ['... [Median]'], ['... [SD]']:

Type: see above
Description: Normality Test on Dependent Variable based on scipy.stats.shapiro()

model.results['Normality Depen. Variable Pearson_pvalue [List]'], ['... [Median]'], ['... [SD]']:

Type: see above
Description: Normality Test on Dependent Variable based on scipy.stats.normaltest()

model.results['Normality Depen. Variable Combined_pvalue [List]'], ['... [Median]'], ['... [SD]']:

Type: see above
Description: Normality Test on Dependent Variable based on scipy.stats.combine_pvalues()

model.results['Normality Residuals SW_pvalue [List]'], ['... [Median]'], ['... [SD]']:

Type: see above
Description: Normality Test on Residuals based on scipy.stats.shapiro()

model.results['Normality Residuals Pearson_pvalue [List]'], ['... [Median]'], ['... [SD]']:

Type: see above
Description: Normality Test on Residuals based on scipy.stats.normaltest()

model.results['Normality Residuals Combined_pvalue [List]'], ['... [Median]'], ['... [SD]']:

Type: see above
Description: Normality Test on Residuals based on scipy.stats.combine_pvalues()

model.results['Dependence: Indep. Variable - Residuals LikelihoodVariance [List]'], ['... [Median]'], ['... [SD]']:

Type: see above
Description: Test dependence between Independent Variable and Residuals based on the selected testtype

model.results['Dependence: Depen. Variable - Prediction (GoF) LikelihoodVariance [List]'], ['... [Median]'], ['... [SD]']:

Type: see above
Description: Test dependence between Dependent Variable and predicted Dependent Variable (Goodness of Fit) based on the selected testtype
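
Since the [List] columns hold JSON-encoded lists, a single cell can be decoded with the standard json module and summarized. The cell value below is invented for illustration, and whether WhyPy reports the sample or the population standard deviation is not stated here; the sketch uses the sample version.

```python
import json
from statistics import median, stdev

cell = "[0.1, 0.4, 0.7]"  # hypothetical content of one '... [List]' cell

values = json.loads(cell)
print(median(values))           # 0.4
print(round(stdev(values), 6))  # 0.3
```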

model.obs: see above

model.obs

model.combs: see above, if 'all' is passed see also model.get_combs()

model.combs

model.regmod: see above, if single Object is passed see also model.get_regmod()

model.regmod

model.obs_name (optional): see above, if None is passed see also model.get_obs_name()

model.obs_name

model.scaler (optional): see above, if single Object is passed see also model.get_scaler()

model.scaler

model.t0 (required in transient models): see above

model.t0

model.stride (required in transient models): see above

model.stride

Templates

There are various Regression Models, Scalers and Observational datasets available to be loaded:

Observations

whypy.load.observations(): Load Observational Datasets

whypy.load.observations(modelclass, no_obs=100, seed=None)

modelclass:

Type: Integer, should be between 1 and 10
Description: Each modelclass is defined by No. of Variables, Class of Functions and Class of Noise Distribution. Load Observations to get short summary of description.

no_obs:

Type: Integer > 0; default is 100
Description: Number of observations m assigned to each variable.

seed:

Type: None (default) or int
Description: Seed the generator for Noise Distribution.

Returns:

Displays a short summary of the loaded dataset and the underlying causal graph.

obs:

Type: Numpy Array of shape(m, n)
Description: see above

Regression Model

whypy.load.model_lingam(): Load a Linear GAM Regression Model.

whypy.load.model_lingam(term='spline')

term:

Type: 'linear', 'spline' (default) or 'factor'
Description: see PyGAM Documentation

Returns:

Displays a short summary of the loaded regression model.

regmod:

Type: Single Instance of Regression Model
Description: see above

whypy.load.model_svr(): Load a Support Vector Regression Model.

whypy.load.model_svr(term='poly4')

term:

Type: 'linear', 'poly2' or 'poly4' (default)
Description: see sklearn Documentation

Returns:

Displays a short summary of the loaded regression model.

regmod:

Type: Single Instance of Regression Model
Description: see above

whypy.load.model_polynomial_lr(): Load a Linear Regression Model based on Polynomial Features.

whypy.load.model_polynomial_lr(degree=2)

degree:

Type: Integer > 0, Degree of polynomial feature space
Description: Model is a Pipeline containing a Function Transformer mapping observations to polynomial feature space of given degree (without interactions) and a RidgeCV Regression Model.
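
"Without interactions" means that each variable is raised to the powers 1 to degree on its own, with no cross terms such as x1 * x2. The feature mapping alone can be sketched as follows; the column order and the helper name polynomial_features are assumptions, and the ridge regression step is omitted.

```python
def polynomial_features(row, degree):
    """Map [x1, ..., xn] to [x1, ..., xn, x1**2, ..., xn**2, ...] up to the given degree."""
    return [value ** power for power in range(1, degree + 1) for value in row]


print(polynomial_features([2.0, 3.0], degree=2))  # [2.0, 3.0, 4.0, 9.0]
```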

Returns:

Displays a short summary of the loaded regression model.

regmod:

Type: Single Instance of Regression Model
Description: see above

Scaler

whypy.load.scaler_minmax(): Load a MinMaxScaler Model, scaling to feature_range=(0, 1).

whypy.load.scaler_minmax()

Returns:

Displays a short summary of the loaded scaler model.

scaler:

Type: Single Instance of Scaler Model
Description: see above

whypy.load.scaler_standard(): Load a StandardScaler Model.

whypy.load.scaler_standard()

Returns:

Displays a short summary of the loaded scaler model.

scaler:

Type: Single Instance of Scaler Model
Description: see above

[1] Pearl, J. (2009). Causality (2nd ed.). Cambridge University Press.
[2] Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., & Schölkopf, B. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Journal of Machine Learning Research.
[3] Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
