WhyPy

Toolbox for Causal Inference with Additive Noise Models


License
MIT
Install
pip install WhyPy==0.1.0


A Python repository for causal inference.

Currently available approaches in this repository are based on Additive Noise Models (ANMs).

Install:

pip install whypy

Content:

  1. A short introduction to the theory of causal inference
  2. A quick start example of how to run causal inference with this repository
  3. Additive Noise Models in WhyPy
    1. Model Instances (Bivariate-MultiVariate | SteadyState-Transient)
    2. Instance Parameters
    3. Instance Methods
    4. Instance Attributes
  4. Various Templates for:
    1. Observations
    2. Regression Models
    3. Scaler

Models:

Within the WhyPy Toolbox, four models are distinguished:

  1. The data-producing process is steady state + the model is bivariate (one independent variable)

    BiVariate-SteadyState

  2. The data-producing process is steady state + the model is multivariate (n independent variables)

    MultiVariate-SteadyState

  3. The data-producing process is transient (t0: offset, s: stride) + the model is bivariate (one independent variable)

    BiVariate-Transient

  4. The data-producing process is transient (t0: offset, s: stride) + the model is multivariate (n independent variables)

    MultiVariate-Transient

Causal Inference (Short Introduction)

The most elementary question of causality is whether "X causes Y or vice versa". An often-discussed example is whether smoking (X) causes cancer (Y). Even here, the question of causal relationships becomes more complex. Besides the possibility that X causes Y (X → Y), there are other possible causal relationships. One is that a third variable Z confounds both X and Y (X ← Z → Y). In the confounded case, looking only at X and Y might show a correlation due to the confounder even though the two are not causally related. [1], [2]

[Figure: Cause-Effect-Confounded]
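The confounded case can be illustrated with a small simulation. This is a minimal sketch using only the Python standard library; the variable names, noise levels and sample size are illustrative assumptions and are not part of WhyPy.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(42)
n = 5000

# Z is the confounder: it drives both X and Y (X <- Z -> Y).
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

# X and Y are strongly correlated although neither causes the other.
corr_xy = pearson(x, y)

# Once the influence of Z is subtracted, the correlation vanishes.
corr_without_z = pearson(
    [xi - zi for xi, zi in zip(x, z)],
    [yi - zi for yi, zi in zip(y, z)],
)

print(f"corr(X, Y)            = {corr_xy:.2f}")
print(f"corr(X - Z, Y - Z)    = {corr_without_z:.2f}")
```

In real observational data the confounder is usually not observed, so it cannot be subtracted like this. That is why the "no confounding" assumption matters.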

Causal inference is the task of learning causal relationships from purely observational data. This task is a fundamental problem in science. A variety of causal inference methods are available that are claimed to solve this task under certain assumptions, for example no confounding, no feedback loops, or no selection bias. Be aware that the results of causal inference are only valid under the method's assumptions. To draw causal conclusions, these methods generally exploit the complexity of the underlying models of the observational data. [2], [3]

The family of causal inference methods used here is Additive Noise Models (ANMs). In ANMs the influence of noise is restricted to be additive (Y ∼ f(X) + N_Y). Methods in this class are based either on independence of residuals or on maximum likelihood. The procedure in the WhyPy Toolbox is the following:


  1. Input:

    Observations: X, Y

    Regression Model: M

    Scaler (optional): n_γ(⋅)

  2. Normalization (optional):

    Calculate X⋆ = n_x(X)

    Calculate Y⋆ = n_y(Y)

  3. Bootstrap (optional):

    Get Bootstrap Sample of Observations: X⋆, Y⋆

  4. Time Shift (if model is transient):

    a) Shift X⋆ = X⋆[0:-i:s], Y⋆ = Y⋆[i::s]

    b) Shift Y⋆ = Y⋆[0:-i:s], X⋆ = X⋆[i::s]

  5. Holdout (optional):

    Split X⋆ → X⋆_regress, X⋆_test

    Split Y⋆ → Y⋆_regress, Y⋆_test

  6. Fit Regression Model:

    a) Fit M_{X⋆_regress → Y⋆_regress}

    b) Fit M_{Y⋆_regress → X⋆_regress}

  7. Predict based on Regression Model:

    a) Regress Ŷ⋆_test = M_{X⋆_regress → Y⋆_regress}(X⋆_test)

    b) Regress X̂⋆_test = M_{Y⋆_regress → X⋆_regress}(Y⋆_test)

  8. Get Residuals:

    a) Calculate ε_{X⋆_test → Y⋆_test} = Ŷ⋆_test - Y⋆_test

    b) Calculate ε_{Y⋆_test → X⋆_test} = X̂⋆_test - X⋆_test

  9. Evaluation Test:

    a) Test ε_{X⋆_test → Y⋆_test} vs. X⋆

    b) Test ε_{Y⋆_test → X⋆_test} vs. Y⋆

  10. Interpretation:

    Please refer to the given literature
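The steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions, not WhyPy's implementation. A least-squares line stands in for the regression model M. The dependence score (correlation between squared residuals and squared inputs) is a crude stand-in for the evaluation test and is not one of WhyPy's test types. The data are a linear ANM with uniform noise, chosen because the asymmetry between the two directions is easy to see. Bootstrap and time shift are omitted.

```python
import random

def standardize(values):
    """Normalization step: scale to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def fit_line(inputs, targets):
    """Fit step: least-squares line, returned as (slope, intercept)."""
    n = len(inputs)
    mean_in, mean_out = sum(inputs) / n, sum(targets) / n
    num = sum((a - mean_in) * (b - mean_out) for a, b in zip(inputs, targets))
    den = sum((a - mean_in) ** 2 for a in inputs)
    slope = num / den
    return slope, mean_out - slope * mean_in

def dependence_score(cause, effect, split):
    """Holdout, fit, predict, residuals, and a simple dependence score."""
    slope, intercept = fit_line(cause[:split], effect[:split])
    test_in, test_out = cause[split:], effect[split:]
    residuals = [slope * c + intercept - e for c, e in zip(test_in, test_out)]
    return abs(pearson([r * r for r in residuals], [c * c for c in test_in]))

rng = random.Random(0)
n = 5000
x = [rng.uniform(-1, 1) for _ in range(n)]
y = [xi + rng.uniform(-1, 1) for xi in x]  # ground truth: X -> Y

x_star, y_star = standardize(x), standardize(y)
split = int(0.8 * n)  # 80 % for regression, 20 % held out for the test

score_x_to_y = dependence_score(x_star, y_star, split)  # direction a)
score_y_to_x = dependence_score(y_star, x_star, split)  # direction b)

print(f"dependence score, X -> Y: {score_x_to_y:.3f}")
print(f"dependence score, Y -> X: {score_y_to_x:.3f}")
```

In the causal direction the residuals are (nearly) independent of the input, so the score stays close to zero. In the reverse direction the residuals still depend on the input, and the score is clearly larger. With Gaussian noise and a linear function this asymmetry disappears, which is one reason the interpretation depends on the method's assumptions.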


Further reading:

[1] Pearl, J. (2009). Causality. Second Edition.
[2] Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., & Schölkopf, B. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Journal of Machine Learning Research.
[3] Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference - Foundations and Learning Algorithms. MIT Press.


Quick Start

import whypy

1. Load predefined templates of observations, regression model and scaler:

obs = whypy.load.observations(modelclass=2, no_obs=500, seed=1)
regmod = whypy.load.model_lingam(term='spline')
scaler = whypy.load.scaler_standard()

[Figures: Output_loading_01, Output_loading_02]

2. Initialize a bivariate steady-state ANM model:

mymodel = whypy.steadystate.bivariate.Model(obs=obs, combinations='all', regmod=regmod, scaler=scaler)

3. Run causal inference:

mymodel.run(testtype='LikelihoodVariance',
            scale=True,
            bootstrap=100,
            holdout=True,
            plot_inference=True,
            plot_results=True,
            )

[Figures: Output_run_01 through Output_run_12]


Causal Model

Init Instance

Import the WhyPy Toolbox:

import whypy

  1. The data-producing process is steady state + the model is bivariate (one independent variable)
whypy.steadystate.bivariate.Model(obs, combinations, regmod, obs_name, scaler)

  2. The data-producing process is steady state + the model is multivariate (n independent variables)
whypy.steadystate.mvariate.Model(obs, combinations, regmod, obs_name, scaler)

  3. The data-producing process is transient (t0: offset, s: stride) + the model is bivariate (one independent variable)
whypy.transient.bivariate.Model(obs, combinations, regmod, obs_name, scaler, t0, stride)

  4. The data-producing process is transient (t0: offset, s: stride) + the model is multivariate (n independent variables)
whypy.transient.mvariate.Model(obs, combs, regmod, obs_name, scaler, t0, stride)




Instance-Parameters

To run causal inference, a model instance must be initialized with the following parameters:

obs:

  • Type: Numpy Array of shape(m, n)
    • m: number of observations
    • n: number of variables
  • Description: All variables to be tested in different combinations.

combs:

  • Type: 'all' default or nested list
  • Logic: The first number is the index of the dependent variable; the following numbers are the indices of the independent variables:
    • Combination 1: [[dependent_variable_1, independent_variable_2, independent_variable_3, ...],
    • Combination 2: [dependent_variable_2, independent_variable_1, independent_variable_3, ...],
    • Combination j: ... ,
    • Combination k: [...]]
  • Description: Combinations of dependent and independent variables to be tested.
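A nested list in this layout can be built programmatically. The sketch below is hypothetical: the helper names are invented for illustration, zero-based column indices are an assumption, and it is not confirmed that 'all' expands to exactly these lists.

```python
from itertools import permutations

def all_bivariate_combs(n_variables):
    """Every ordered pair [dependent, independent] (hypothetical helper)."""
    return [[dep, indep] for dep, indep in permutations(range(n_variables), 2)]

def all_multivariate_combs(n_variables):
    """Each variable once as dependent, all others as independent
    (hypothetical helper)."""
    return [
        [dep] + [i for i in range(n_variables) if i != dep]
        for dep in range(n_variables)
    ]

print(all_bivariate_combs(3))
print(all_multivariate_combs(3))
```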

regmod:

  • Type: Model Object or List of Model Objects
    • Condition: Models must be callable with "fit" and "predict"
    • If list of models is given, list must have same length as number k of combinations
  • Description: Models to regress independent and dependent variables.
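The "fit" and "predict" condition is a duck-typed interface. The toy class below shows its shape in plain Python. The scikit-learn-style signatures (2D input, `fit` returning `self`) and the class name are assumptions. For use with WhyPy, a model would probably have to accept and return NumPy arrays rather than lists.

```python
class LineRegressor:
    """Toy single-feature regressor exposing fit() and predict()."""

    def fit(self, X, y):
        # X is a 2D sequence of shape (m, 1); y is a sequence of length m.
        xs = [row[0] for row in X]
        m = len(xs)
        mean_x, mean_y = sum(xs) / m, sum(y) / m
        num = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, y))
        den = sum((a - mean_x) ** 2 for a in xs)
        self.slope_ = num / den
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, X):
        return [self.slope_ * row[0] + self.intercept_ for row in X]

model = LineRegressor().fit([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
prediction = model.predict([[4.0]])
print(prediction)
```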

obs_name (optional):

  • Type: List with name strings of shape(n)
    • n: number of variables
  • Description: Variable Naming, default is X1, X2, ... Xn

scaler (optional):

  • Type: Model Object or List of Model Objects
    • Condition: Models must be callable with "fit", "transform" and "inverse_transform"
    • If list of models is given, list must have same length as number k of combinations
  • Description: Models to scale observations before regression.
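A scaler needs the three methods named above. The following plain-Python sketch standardizes each column and can undo the scaling. The class name and the list-based data layout are illustrative assumptions. A real scaler passed to WhyPy would most likely work on NumPy arrays.

```python
class ColumnStandardScaler:
    """Toy scaler exposing fit(), transform() and inverse_transform()."""

    def fit(self, X):
        m = len(X)
        columns = list(zip(*X))
        self.mean_ = [sum(col) / m for col in columns]
        self.scale_ = []
        for col, mean in zip(columns, self.mean_):
            std = (sum((v - mean) ** 2 for v in col) / m) ** 0.5
            # Guard against constant columns to avoid division by zero.
            self.scale_.append(std if std > 0 else 1.0)
        return self

    def transform(self, X):
        return [
            [(v - mean) / std for v, mean, std in zip(row, self.mean_, self.scale_)]
            for row in X
        ]

    def inverse_transform(self, X):
        return [
            [v * std + mean for v, mean, std in zip(row, self.mean_, self.scale_)]
            for row in X
        ]

data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaler = ColumnStandardScaler().fit(data)
scaled = scaler.transform(data)
restored = scaler.inverse_transform(scaled)
```

The round trip through `transform` and `inverse_transform` returns the original observations.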

t0 (required in transient models):

  • Type: Integer
  • Description: Offset Y[t0::] ∼ f(X[:-t0:])

stride (required in transient models):

  • Type: Integer
  • Description: Y[::stride] ∼ f(X[::stride])
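The slicing notation for t0 and stride can be tried on plain Python lists. This is a minimal sketch with made-up toy data. Combining both parameters in one slice follows the time-shift step of the procedure and is an assumption about how the implementation applies them.

```python
t0 = 2      # offset: Y reacts to X two samples later
stride = 3  # keep every third sample

x = list(range(10))
# Toy process: y[t] = 10 * x[t - t0]; the first t0 values are undefined.
y = [None] * t0 + [10 * x[t - t0] for t in range(t0, len(x))]

x_shifted = x[:-t0:stride]  # drop the last t0 samples, then subsample
y_shifted = y[t0::stride]   # drop the first t0 samples, then subsample

# After the shift, each pair (x_shifted[k], y_shifted[k]) belongs together.
for x_value, y_value in zip(x_shifted, y_shifted):
    print(x_value, y_value)
```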




Instance-Methods

model.run(): Do Causal Inference

model.run(testtype='LikelihoodVariance', scale=True, bootstrap=False, holdout=False, plot_inference=True, plot_results=True, **kwargs)

testtype:

  • Type: 'LikelihoodVariance' (default), 'LikelihoodEntropy' (to be done), 'KolmogorovSmirnoff', 'MannWhitney', 'HSIC' (to be done)
  • Description: Choose a test metric to be performed.

scale:

  • Type: True (default) or False
  • Description: If True scale observations before regression.

bootstrap:

  • Type: True or False (default); the Quick Start example passes an integer (bootstrap=100)
  • Description: Whether to bootstrap over the observations or not (see also bootstrap_ratio and bootstrap_seed)

holdout:

  • Type: True or False (default)
  • Description: Whether to split observations between regression and test or not (see also holdout_ratio and holdout_seed)

plot_inference:

  • Type: True (default) or False
  • Description: Plot various visualisations of the inference (Pairgrid of observations, 2D Regression, Histograms)

plot_results:

  • Type: True (default) or False
  • Description: Plot DataFrames of Normality Tests, Goodness of Fit, Independence Test and BoxPlot of test results.

bootstrap_ratio:

  • Type: Float between 0.0 and 1.0; default 1.0
  • Description: Ratio of the original number of observations m to be used for bootstrapping.

bootstrap_seed:

  • Type: None (default) or int
  • Description: Seed for the generator used in bootstrapping.

holdout_ratio:

  • Type: Float between 0.0 and 1.0; default 0.2
  • Description: Ratio of the original number of observations m to be held out for testing.

holdout_seed:

  • Type: None (default) or int
  • Description: Seed the generator for holdout.
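The roles of the ratio and seed parameters can be sketched with the standard library. The helper names below are hypothetical, and how WhyPy actually draws its samples is not documented here. The sketch shows one common way to implement a seeded bootstrap (sampling with replacement) and a seeded holdout split (disjoint index sets).

```python
import random

def bootstrap_indices(m, ratio=1.0, seed=None):
    """Draw round(ratio * m) row indices with replacement (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.randrange(m) for _ in range(int(round(ratio * m)))]

def holdout_split(m, ratio=0.2, seed=None):
    """Split row indices into disjoint regression and test parts
    (hypothetical helper)."""
    rng = random.Random(seed)
    indices = list(range(m))
    rng.shuffle(indices)
    n_test = int(round(ratio * m))
    return indices[n_test:], indices[:n_test]

boot = bootstrap_indices(10, ratio=0.5, seed=1)
regress_idx, test_idx = holdout_split(10, ratio=0.2, seed=1)

print("bootstrap sample:", boot)
print("regression rows: ", sorted(regress_idx))
print("test rows:       ", sorted(test_idx))
```

Passing the same seed reproduces the same sample, which makes repeated runs comparable.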

modelpts:

  • Type: Integer; default 50
  • Description: Number of points used to visualize the regression model.

gridsearch:

  • Type: True or False (default)
  • Description: Whether or not a grid search should be performed to find the regmod's hyperparameters. If gridsearch is True and the model is not pygam, a param_grid parameter must be passed.

param_grid:

  • Type: dict()
  • Description: Defines the hyperparameters to be tested in gridsearch. Must fit to the given regmod. Not needed if model is pygam.
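A param_grid is a dictionary that maps hyperparameter names to lists of candidate values. The sketch below expands such a grid into the individual settings a grid search would try. The keys `C` and `epsilon` are hyperparameters of scikit-learn's SVR and are used here only as an example; whether they match the regmod in use has to be checked against that model. The helper `expand_grid` is hypothetical.

```python
from itertools import product

# Example grid; the keys must be valid hyperparameters of the chosen regmod.
param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1]}

def expand_grid(grid):
    """List every combination of the candidate values (hypothetical helper)."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

candidates = expand_grid(param_grid)
for candidate in candidates:
    print(candidate)
```

The number of candidates is the product of the list lengths, so the grid size grows quickly with every added hyperparameter.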

model.plot_inference(): Equal to Method "run" Parameter plot_inference

model.plot_inference()

model.plot_results(): Equal to Method "run" Parameter plot_results

model.plot_results()

model.get_combs(): Returns the Nested List of Combinations used in model.run()

model.get_combs()

model.get_regmod(): Returns the List of Regression Models used in model.run()

model.get_regmod()

model.get_scaler(): Returns the List of Scalers used in model.run()

model.get_scaler()

model.get_obs_name(): Returns the List of Observation Names assigned in model.run()

model.get_obs_name()




Instance-Attributes

model.results: DataFrame containing all results.

model.results

model.results['Fitted Combination']:

  • Type: String
  • Description: One String listing all Observation Names tested in the given Combination

model.results['Bivariate Comparison']:

  • Type: String
  • Description: One String describing a Bivariate Case out of the above combination.

model.results['tdep']:

  • Type: Int
  • Description: Dependent Variable in the Bivariate Case.

model.results['tindeps']:

  • Type: List
  • Description: List of all independent Variables in the Combination.

model.results['tindep']:

  • Type: Int
  • Description: Independent Variable in the Bivariate Case.

model.results['Normality Indep. Variable SW_pvalue [List]'], ['... [Median]'], ['... [SD]']:

  • Type: [List] -> dumped json | [Median] -> mean of all results given in list (float)| [SD] -> standard deviation of all results given in list (float)
  • Description: Normality Test on Independent Variable based on scipy.stats.shapiro()
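Since the [List] columns hold JSON-dumped lists, a cell can be decoded with the standard library and summarized by hand. The cell content below is made up for illustration. Which standard deviation variant (population or sample) is stored in the [SD] columns is not stated, so the choice of `pstdev` is an assumption.

```python
import json
import statistics

# Hypothetical content of one '[List]' cell, e.g. p-values from four runs.
cell = "[0.12, 0.50, 0.33, 0.41]"

values = json.loads(cell)
median = statistics.median(values)
spread = statistics.pstdev(values)  # assumption: population standard deviation

print(values)
print(f"median = {median:.3f}, sd = {spread:.3f}")
```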

model.results['Normality Indep. Variable Pearson_pvalue [List]'], ['... [Median]'], ['... [SD]']:

model.results['Normality Indep. Variable Combined_pvalue [List]'], ['... [Median]'], ['... [SD]']:

model.results['Normality Depen. Variable SW_pvalue [List]'], ['... [Median]'], ['... [SD]']:

model.results['Normality Depen. Variable Pearson_pvalue [List]'], ['... [Median]'], ['... [SD]']:

model.results['Normality Depen. Variable Combined_pvalue [List]'], ['... [Median]'], ['... [SD]']:

model.results['Normality Residuals SW_pvalue [List]'], ['... [Median]'], ['... [SD]']:

model.results['Normality Residuals Pearson_pvalue [List]'], ['... [Median]'], ['... [SD]']:

model.results['Normality Residuals Combined_pvalue [List]'], ['... [Median]'], ['... [SD]']:

model.results['Dependence: Indep. Variable - Residuals LikelihoodVariance [List]'], ['... [Median]'], ['... [SD]']:

  • Type: see above
  • Description: Test dependence between Independent Variable and Residuals based on selected testtype

model.results['Dependence: Depen. Variable - Prediction (GoF) LikelihoodVariance [List]'], ['... [Median]'], ['... [SD]']:

  • Type: see above
  • Description: Test dependence between Dependent Variable and predicted Dependent Variable (Goodness of Fit) based on selected testtype

model.obs: see above

model.obs

model.combs: see above, if 'all' is passed see also model.get_combs()

model.combs

model.regmod: see above, if single Object is passed see also model.get_regmod()

model.regmod

model.obs_name (optional): see above, if None is passed see also model.get_obs_name()

model.obs_name

model.scaler (optional): see above, if single Object is passed see also model.get_scaler()

model.scaler

model.t0 (required in transient models): see above

model.t0

model.stride (required in transient models): see above

model.stride


Templates

There are various Regression Models, Scalers and Observational datasets available to be loaded:

Observations

whypy.load.observations(): Load Observational Datasets

whypy.load.observations(modelclass, no_obs=100, seed=None)

modelclass:

  • Type: Integer, should be between 1 and 10
  • Description: Each modelclass is defined by No. of Variables, Class of Functions and Class of Noise Distribution. Load Observations to get short summary of description.

no_obs:

  • Type: Integer > 0; default 100
  • Description: Number of observations m assigned to each variable.

seed:

  • Type: None (default) or int
  • Description: Seed the generator for Noise Distribution.

Returns:

Displays a short summary of the loaded dataset and the underlying causal graph.

obs:

  • Type: Numpy Array of shape(m, n)
  • Description: see above




Regression Model

whypy.load.model_lingam(): Load a Linear GAM Regression Model.

whypy.load.model_lingam(term='spline')

term:

Returns:

Displays a short summary of the loaded regression model.

regmod:

  • Type: Single Instance of Regression Model
  • Description: see above

whypy.load.model_svr(): Load a Support Vector Regression Model.

whypy.load.model_svr(term='poly4')

term:

Returns:

Displays a short summary of the loaded regression model.

regmod:

  • Type: Single Instance of Regression Model
  • Description: see above

whypy.load.model_polynomial_lr(): Load a Linear Regression Model based on Polynomial Features.

whypy.load.model_polynomial_lr(degree=2)

degree:

  • Type: Integer > 0, Degree of polynomial feature space
  • Description: Model is a Pipeline containing a Function Transformer mapping observations to polynomial feature space of given degree (without interactions) and a RidgeCV Regression Model.
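The mapping to a polynomial feature space without interactions can be sketched as follows. The function name and the order of the output columns are assumptions; only the idea (powers of each variable, no cross terms such as a·b) is taken from the description above.

```python
def polynomial_features(rows, degree):
    """Append the powers 1..degree of every column, without cross terms."""
    return [
        [value ** power for power in range(1, degree + 1) for value in row]
        for row in rows
    ]

# Two variables, degree 2: [a, b] -> [a, b, a^2, b^2]
features = polynomial_features([[2, 3], [1, -1]], degree=2)
print(features)
```

A full polynomial expansion of degree 2 would also contain the product a·b; leaving it out keeps the number of features at n times the degree.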

Returns:

Displays a short summary of the loaded regression model.

regmod:

  • Type: Single Instance of Regression Model
  • Description: see above




Scaler

whypy.load.scaler_minmax(): Load a MinMaxScaler Model, scaling to feature_range=(0, 1).

whypy.load.scaler_minmax()

Returns:

Displays a short summary of the loaded scaler model.

scaler:

  • Type: Single Instance of Scaler Model
  • Description: see above

whypy.load.scaler_standard(): Load a StandardScaler Model.

whypy.load.scaler_standard()

Returns:

Displays a short summary of the loaded scaler model.

scaler:

  • Type: Single Instance of Scaler Model
  • Description: see above
