A Python repository for causal inference.
Currently available approaches in this repository are based on Additive Noise Models (ANMs).
Install:
pip install whypy
Content:
- A short introduction to the theory of causal inference
- A quick-start example showing how to run causal inference with this repository
- Additive Noise Models in WhyPy
- Various templates for observations, regression models and scalers
Models:
Within the WhyPy Toolbox, four model classes are distinguished:
- The data-producing process is steady state + the model is bivariate (one independent variable)
- The data-producing process is steady state + the model is multivariate (n independent variables)
- The data-producing process is transient (t0: offset, s: stride) + the model is bivariate (one independent variable)
- The data-producing process is transient (t0: offset, s: stride) + the model is multivariate (n independent variables)
The most elementary question of causality asks whether "X causes Y or vice versa". An often-discussed example is whether smoking (X) causes cancer (Y). Already at this point the question about causal relationships becomes more complex: besides the possibility that X causes Y (X → Y), there are other possible causal relationships. One is that a third variable Z confounds both X and Y (X ← Z → Y). In the confounding case, looking only at X and Y might show a correlation due to the confounder even though X and Y are not causally related. [1], [2]
Causal inference is the task of learning causal relationships from purely observational data, a fundamental problem in science. A variety of causal inference methods have been claimed to solve this task under certain assumptions, for example no confounding, no feedback loops and no selection bias. Be aware that results given by causal inference are only valid under the method's assumptions. In general, these methods exploit the complexity of the underlying models of the observational data to draw causal conclusions. [2], [3]
The family of causal inference methods used here are Additive Noise Models (ANMs). In ANMs the influence of noise is restricted to be additive (Y = f(X) + N_Y). Methods in this class are based either on independence of residuals or on maximum likelihood. The procedure in the WhyPy Toolbox is the following:
- Input:
  Observations: X, Y
  Regression model: M
  Scaler (optional): n(·)
- Normalization (optional):
  Calculate X′ = n_x(X)
  Calculate Y′ = n_y(Y)
- Bootstrap (optional):
  Draw a bootstrap sample of the observations: X′, Y′
- Time shift (if the model is transient):
  a) Shift X′ = X′[0:-t0:s], Y′ = Y′[t0::s]
  b) Shift Y′ = Y′[0:-t0:s], X′ = X′[t0::s]
- Holdout (optional):
  Split X′ → X′_regress, X′_test
  Split Y′ → Y′_regress, Y′_test
- Fit regression model:
  a) Fit M(X′_regress → Y′_regress)
  b) Fit M(Y′_regress → X′_regress)
- Predict based on the regression model:
  a) Regress Ŷ′_test = M(X′_regress → Y′_regress)(X′_test)
  b) Regress X̂′_test = M(Y′_regress → X′_regress)(Y′_test)
- Get residuals:
  a) Calculate Δ(X′_test → Y′_test) = Ŷ′_test − Y′_test
  b) Calculate Δ(Y′_test → X′_test) = X̂′_test − X′_test
- Evaluation test:
  a) Test Δ(X′_test → Y′_test) vs. X′
  b) Test Δ(Y′_test → X′_test) vs. Y′
- Interpretation:
  Please refer to the given literature.
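The procedure above can be sketched with plain numpy and sklearn. This is a minimal illustration, not WhyPy's implementation: the regression model, the absence of bootstrap/holdout, and the log-variance score (a 'LikelihoodVariance'-style criterion) are all simplifying assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def anm_score(cause, effect, degree=3):
    """Score one causal direction: normalize, fit effect = f(cause) + residual,
    and return log(var(cause)) + log(var(residual)). Lower is better under a
    likelihood-based ANM criterion (an illustrative proxy, not WhyPy's code)."""
    x = StandardScaler().fit_transform(cause.reshape(-1, 1))  # normalization step
    y = StandardScaler().fit_transform(effect.reshape(-1, 1))
    feats = PolynomialFeatures(degree).fit_transform(x)       # simple nonlinear regressor
    resid = y - LinearRegression().fit(feats, y).predict(feats)
    return np.log(x.var()) + np.log(resid.var())

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 500)
Y = X ** 3 + 0.05 * rng.normal(size=500)  # ground truth: X causes Y

score_xy = anm_score(X, Y)  # a) fit X -> Y
score_yx = anm_score(Y, X)  # b) fit Y -> X
direction = "X -> Y" if score_xy < score_yx else "Y -> X"
```

For this nonlinear example the residual of the causal fit is much smaller than that of the anti-causal fit, so the score prefers the true direction.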
Further reading:
import whypy
1. Load predefined templates of observations, regression model and scaler:
obs = whypy.load.observations(modelclass=2, no_obs=500, seed=1)
regmod = whypy.load.model_lingam(term='spline')
scaler = whypy.load.scaler_standard()
2. Initialize a bivariate steadystate ANM-Model:
mymodel = whypy.steadystate.bivariate.Model(obs=obs, combinations='all', regmod=regmod, scaler=scaler)
3. Run Causal Inference
mymodel.run(testtype='LikelihoodVariance',
scale=True,
bootstrap=100,
holdout=True,
plot_inference=True,
plot_results=True,
)
Init Instance
Import Whypy Toolbox
import whypy
- The data-producing process is steady state + the model is bivariate (one independent variable)
whypy.steadystate.bivariate.Model(obs, combinations, regmod, obs_name, scaler)
- The data-producing process is steady state + the model is multivariate (n independent variables)
whypy.steadystate.mvariate.Model(obs, combinations, regmod, obs_name, scaler)
- The data-producing process is transient (t0: offset, s: stride) + the model is bivariate (one independent variable)
whypy.transient.bivariate.Model(obs, combinations, regmod, obs_name, scaler, t0, stride)
- The data-producing process is transient (t0: offset, s: stride) + the model is multivariate (n independent variables)
whypy.transient.mvariate.Model(obs, combinations, regmod, obs_name, scaler, t0, stride)
Instance-Parameters
To run causal inference, a model instance must be initialized with the following attributes:
obs:
- Type: Numpy array of shape (m, n)
  - m: number of observations
  - n: number of variables
- Description: All variables to be tested in different combinations.
combinations:
- Type: 'all' (default) or nested list
- Logic: the first number is the index of the dependent variable, the following numbers are the indices of the independent variables:
  [[dependent_variable_1, independent_variable_2, independent_variable_3, ...],
  [dependent_variable_2, independent_variable_1, independent_variable_3, ...],
  ...,
  [...]]
- Description: Combinations of dependent and independent variables to be tested.
regmod:
- Type: model object or list of model objects
- Condition: models must be callable with "fit" and "predict"; if a list of models is given, it must have the same length as the number k of combinations
- Description: Models to regress dependent on independent variables.
obs_name:
- Type: list of name strings of shape (n)
  - n: number of variables
- Description: Variable naming, default is X1, X2, ..., Xn.
scaler:
- Type: model object or list of model objects
- Condition: models must be callable with "fit", "transform" and "inverse_transform"; if a list of models is given, it must have the same length as the number k of combinations
- Description: Models to scale observations before regression.
t0 (transient models only):
- Type: integer
- Description: Offset, Y[t0::] ∼ f(X[:-t0:])
stride (transient models only):
- Type: integer
- Description: Stride, Y[::stride] ∼ f(X[::stride])
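The offset and stride slicing can be illustrated with plain numpy (the array values below are made up purely for illustration):

```python
import numpy as np

t0, stride = 2, 2        # offset and stride, as in the transient models
X = np.arange(10)        # toy observations of the (possible) cause
Y = np.arange(10) + 100  # toy observations of the (possible) effect

# Offset: Y[t0::] is modelled as a function of X[:-t0:],
# i.e. the effect lags the cause by t0 samples.
X_shift, Y_shift = X[:-t0:], Y[t0::]

# Stride: keep only every stride-th sample of the aligned series.
X_s, Y_s = X_shift[::stride], Y_shift[::stride]
```

After the offset both arrays have 8 aligned samples; the stride then thins them to every second sample.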
Instance-Methods
model.run(): Do Causal Inference
model.run(testtype='LikelihoodVariance', scale=True, bootstrap=False, holdout=False, plot_inference=True, plot_results=True, **kwargs)
testtype:
- Type: 'LikelihoodVariance' (default), 'LikelihoodEntropy' (to be done), 'KolmogorovSmirnoff', 'MannWhitney', 'HSIC' (to be done)
- Description: Test metric to be performed.
scale:
- Type: True (default) or False
- Description: If True, scale observations before regression.
bootstrap:
- Type: True or False (default)
- Description: Whether to bootstrap over the observations (see also bootstrap_ratio and bootstrap_seed).
holdout:
- Type: True or False (default)
- Description: Whether to split observations between regression and test (see also holdout_ratio and holdout_seed).
plot_inference:
- Type: True (default) or False
- Description: Plot various visualisations of the inference (pairgrid of observations, 2D regression, histograms).
plot_results:
- Type: True (default) or False
- Description: Plot DataFrames of normality tests, goodness of fit, independence test and a box plot of test results.
bootstrap_ratio (kwarg):
- Type: float between 0.0 and 1.0; 1.0 (default)
- Description: Ratio of the original number of observations m to be used for bootstrapping.
bootstrap_seed (kwarg):
- Type: None (default) or int
- Description: Seed of the generator for bootstrapping.
holdout_ratio (kwarg):
- Type: float between 0.0 and 1.0; 0.2 (default)
- Description: Ratio of the original number of observations m to be held out for testing.
holdout_seed (kwarg):
- Type: None (default) or int
- Description: Seed of the generator for the holdout split.
- Type: integer; 50 (default)
- Description: Number of points used to visualize the regression model.
gridsearch (kwarg):
- Type: True or False (default)
- Description: Whether a grid search should be performed to find the regmod's hyperparameters. If gridsearch is True and the model is not a pygam model, a param_grid parameter must be passed.
param_grid (kwarg):
- Type: dict
- Description: Defines the hyperparameters to be tested in the grid search. Must fit the given regmod. Not needed if the model is pygam.
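A param_grid is a plain dict mapping hyperparameter names to candidate values, in the shape sklearn's GridSearchCV accepts. The sketch below uses an SVR with C/epsilon candidates as an example; the exact keys depend on the regmod you actually pass, so treat these names as assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Keys must match parameters of the chosen regression model;
# the SVR parameters below are just an illustrative choice.
param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1]}

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)

# Exhaustively try every C/epsilon combination with 3-fold cross-validation.
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=3).fit(X, y)
best = search.best_params_
```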
model.plot_inference(): Produces the same plots as the plot_inference parameter of run()
model.plot_inference()
model.plot_results(): Produces the same output as the plot_results parameter of run()
model.plot_results()
model.get_combs(): Returns the Nested List of Combinations used in model.run()
model.get_combs()
model.get_regmod(): Returns the List of Regression Models used in model.run()
model.get_regmod()
model.get_scaler(): Returns the List of Scalers used in model.run()
model.get_scaler()
model.get_obs_name(): Returns the List of Observation Names assigned in model.run()
model.get_obs_name()
Instance-Attributes
model.results: DataFrame containing all results.
model.results
model.results['Fitted Combination']:
- Type: String
- Description: One String listing all Observation Names tested in the given Combination
model.results['Bivariate Comparison']:
- Type: String
- Description: One String describing a Bivariate Case out of the above combination.
model.results['tdep']:
- Type: Int
- Description: Dependent Variable in the Bivariate Case.
model.results['tindeps']:
- Type: List
- Description: List of all independent Variables in the Combination.
model.results['tindep']:
- Type: Int
- Description: Independent Variable in the Bivariate Case.
model.results['Normality Indep. Variable SW_pvalue [List]'], ['... [Median]'], ['... [SD]']:
- Type: [List] → dumped JSON; [Median] → mean of all results given in the list (float); [SD] → standard deviation of all results given in the list (float)
- Description: Normality test on the independent variable based on scipy.stats.shapiro()
model.results['Normality Indep. Variable Pearson_pvalue [List]'], ['... [Median]'], ['... [SD]']:
- Type: see above
- Description: Normality Test on Independent Variable based on scipy.stats.normaltest()
model.results['Normality Indep. Variable Combined_pvalue [List]'], ['... [Median]'], ['... [SD]']:
- Type: see above
- Description: Normality Test on Independent Variable based on scipy.stats.combine_pvalues()
model.results['Normality Depen. Variable SW_pvalue [List]'], ['... [Median]'], ['... [SD]']:
- Type: see above
- Description: Normality Test on Dependent Variable based on scipy.stats.shapiro()
model.results['Normality Depen. Variable Pearson_pvalue [List]'], ['... [Median]'], ['... [SD]']:
- Type: see above
- Description: Normality Test on Dependent Variable based on scipy.stats.normaltest()
model.results['Normality Depen. Variable Combined_pvalue [List]'], ['... [Median]'], ['... [SD]']:
- Type: see above
- Description: Normality Test on Dependent Variable based on scipy.stats.combine_pvalues()
model.results['Normality Residuals SW_pvalue [List]'], ['... [Median]'], ['... [SD]']:
- Type: see above
- Description: Normality test on the residuals based on scipy.stats.shapiro()
model.results['Normality Residuals Pearson_pvalue [List]'], ['... [Median]'], ['... [SD]']:
- Type: see above
- Description: Normality test on the residuals based on scipy.stats.normaltest()
model.results['Normality Residuals Combined_pvalue [List]'], ['... [Median]'], ['... [SD]']:
- Type: see above
- Description: Normality test on the residuals based on scipy.stats.combine_pvalues()
model.results['Dependence: Indep. Variable - Residuals LikelihoodVariance [List]'], ['... [Median]'], ['... [SD]']:
- Type: see above
- Description: Dependence test between the independent variable and the residuals, based on the selected testtype
model.results['Dependence: Depen. Variable - Prediction (GoF) LikelihoodVariance [List]'], ['... [Median]'], ['... [SD]']:
- Type: see above
- Description: Dependence test between the dependent variable and the predicted dependent variable (goodness of fit), based on the selected testtype
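Since the [List] columns hold dumped JSON, they can be parsed back with the standard library. The column name and p-values below are invented purely to illustrate the [List]/[Median]/[SD] triple; they are not real output.

```python
import json
import pandas as pd

# Toy results frame mimicking one [List] column of model.results.
results = pd.DataFrame({
    "Normality Residuals SW_pvalue [List]": [json.dumps([0.31, 0.28, 0.35])],
})

# Parse the JSON-encoded list back into Python floats.
pvals = json.loads(results["Normality Residuals SW_pvalue [List]"].iloc[0])

# Recompute the aggregate columns as described above:
center = sum(pvals) / len(pvals)  # the "[Median]" column (a mean, per the docs)
sd = (sum((p - center) ** 2 for p in pvals) / len(pvals)) ** 0.5  # the "[SD]" column
```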
model.obs: see above
model.obs
model.combs: see above, if 'all' is passed see also model.get_combs()
model.combs
model.regmod: see above, if single Object is passed see also model.get_regmod()
model.regmod
model.obs_name (optional): see above, if None is passed see also model.get_obs_name()
model.obs_name
model.scaler (optional): see above, if single Object is passed see also model.get_scaler()
model.scaler
model.t0 (required in transient models): see above
model.t0
model.stride (required in transient models): see above
model.stride
Templates
There are various Regression Models, Scalers and Observational datasets available to be loaded:
Observations
whypy.load.observations(): Load Observational Datasets
whypy.load.observations(modelclass, no_obs=100, seed=None)
modelclass:
- Type: integer between 1 and 10
- Description: Each modelclass is defined by its number of variables, class of functions and class of noise distributions. Loading the observations displays a short summary.
no_obs:
- Type: integer > 0; 100 (default)
- Description: Number of observations m assigned to each variable.
seed:
- Type: None (default) or int
- Description: Seed of the generator for the noise distribution.
Returns:
Displays a short summary of the loaded dataset and the underlying causal graph.
obs:
- Type: Numpy Array of shape(m, n)
- Description: see above
Regression Model
whypy.load.model_lingam(): Load a Linear GAM Regression Model.
whypy.load.model_lingam(term='spline')
term:
- Type: 'linear', 'spline' (default) or 'factor'
- Description: see PyGAM Documentation
Returns:
Displays a short summary of the loaded regression model.
regmod:
- Type: Single Instance of Regression Model
- Description: see above
whypy.load.model_svr(): Load a Support Vector Regression Model.
whypy.load.model_svr(term='poly4')
term:
- Type: 'linear', 'poly2' or 'poly4' (default)
- Description: see sklearn Documentation
Returns:
Displays a short summary of the loaded regression model.
regmod:
- Type: Single Instance of Regression Model
- Description: see above
whypy.load.model_polynomial_lr(): Load a Linear Regression Model based on Polynomial Features.
whypy.load.model_polynomial_lr(degree=2)
degree:
- Type: Integer > 0, Degree of polynomial feature space
- Description: Model is a Pipeline containing a Function Transformer mapping observations to polynomial feature space of given degree (without interactions) and a RidgeCV Regression Model.
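A comparable pipeline can be built from standard sklearn parts. This is a sketch of the described structure, not WhyPy's actual code: PolynomialFeatures stands in for the function transformer (for one-dimensional input it produces no interaction terms), followed by RidgeCV.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def polynomial_ridge(degree=2):
    """Polynomial feature map of the given degree followed by RidgeCV
    (an illustrative stand-in for whypy.load.model_polynomial_lr)."""
    return make_pipeline(PolynomialFeatures(degree), RidgeCV())

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (300, 1))
y = 1.5 * X[:, 0] ** 2 - X[:, 0] + 0.1 * rng.normal(size=300)

regmod = polynomial_ridge(degree=2)
regmod.fit(X, y)          # the "fit"/"predict" interface regmods must provide
r2 = regmod.score(X, y)
```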
Returns:
Displays a short summary of the loaded regression model.
regmod:
- Type: Single Instance of Regression Model
- Description: see above
Scaler
whypy.load.scaler_minmax(): Load a MinMaxScaler Model, scaling to feature_range=(0, 1).
whypy.load.scaler_minmax()
Returns:
Displays a short summary of the loaded scaler model.
scaler:
- Type: Single Instance of Scaler Model
- Description: see above
whypy.load.scaler_standard(): Load a StandardScaler Model.
whypy.load.scaler_standard()
Returns:
Displays a short summary of the loaded scaler model.
scaler:
- Type: Single Instance of Scaler Model
- Description: see above