GenericML

Generic Machine Learning Inference



GenericML: Generic Machine Learning Inference

License: GPL v3

To cite GenericML in publications, please use:

Welz, M., Alfons, A., Demirer, M. and Chernozhukov, V. (2021). GenericML: Generic Machine Learning Inference. R package version 0.1.0. URL: https://CRAN.R-project.org/package=GenericML.

Summary

R implementation of Generic Machine Learning Inference on heterogeneous treatment effects in randomized experiments as proposed in Chernozhukov, Demirer, Duflo and Fernández-Val (2020). This package's workhorse is the mlr3 framework of Lang et al. (2019), which enables the specification of a wide variety of machine learners. The main functionality, GenericML(), runs Algorithm 1 in Chernozhukov, Demirer, Duflo and Fernández-Val (2020) for a suite of user-specified machine learners. All steps in the algorithm are customizable via setup functions. Methods for printing and plotting are available for objects returned by GenericML(). Parallel computing is supported.

Installation

From CRAN

The package GenericML is available on CRAN (the Comprehensive R Archive Network), so the latest release can be installed from the R command line via

install.packages("GenericML")

Building from source

To install the latest (possibly unstable) development version directly from GitHub, run the following from the R command line:

install.packages("devtools")
devtools::install_github("mwelz/GenericML")

If you already have the package devtools installed, you can skip the first line.
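
The example further down calls mvtnorm::rmvnorm() and uses glmnet, ranger, and SVM learners. A minimal sketch for checking up front which packages are still missing is shown here. The helper name missing_packages() is hypothetical, and the assumption that the SVM learner needs the package e1071 should be verified against your mlr3 setup.

```r
# Hypothetical helper (not part of GenericML): return the subset of `pkgs`
# whose namespaces cannot be loaded, i.e. packages that are not installed.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# Packages assumed to be used by the example in this README.
# "e1071" is an assumption about what backs the 'svm' learner.
needed <- c("GenericML", "mvtnorm", "glmnet", "ranger", "e1071")

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
} else {
  message("All packages for the example are available.")
}
```

Any package reported as missing can then be installed with install.packages().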

Community guidelines

Report issues and request features

If you experience any bugs or issues or if you have any suggestions for additional features, please submit an issue via the Issues tab of this repository. Please have a look at existing issues first to see if your problem or feature request has already been discussed.

Contribute to the package

If you want to contribute to the package, you can fork this repository and create a pull request after implementing the desired functionality.

Ask for help

If you need help using the package, or if you are interested in collaborations related to this project, please get in touch with the package maintainer.

Example

We generate n = 5000 samples from a simple linear data generating process and emulate a randomized experiment. There is no treatment effect heterogeneity because the treatment effect is constant at two. Hence, Generic ML should not indicate the existence of treatment effect heterogeneity.
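
As a quick plausibility check that does not involve the package at all, the claim of a constant effect can be examined with ordinary least squares. The sketch below is a base-R illustration under stated assumptions: it draws the covariates with rnorm() instead of mvtnorm::rmvnorm() (the same distribution, but not the same random draws as in the example), and it uses a plain F-test on treatment-covariate interactions, which is not the Generic ML procedure.

```r
set.seed(1)
n <- 5000
p <- 5

# random treatment assignment and independent standard normal covariates
D <- rbinom(n, 1, 0.5)
Z <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("z", 1:p)))

# linear baseline outcome plus a constant treatment effect of two
Y0 <- as.numeric(Z %*% c(2, -3, 0, 1, 2)) + rnorm(n)
Y  <- Y0 + 2 * D

dat <- data.frame(Y = Y, D = D, Z)

# model with a homogeneous effect vs. model with effect varying linearly in Z
fit_restricted <- lm(Y ~ D + z1 + z2 + z3 + z4 + z5, data = dat)
fit_full       <- lm(Y ~ D * (z1 + z2 + z3 + z4 + z5), data = dat)

# estimated average treatment effect (should be close to 2)
ate_hat <- unname(coef(fit_restricted)["D"])

# joint F-test of all D:z interactions; a large p-value is consistent
# with no (linear) treatment effect heterogeneity
p_interactions <- anova(fit_restricted, fit_full)[["Pr(>F)"]][2]

print(ate_hat)
print(p_interactions)
```

Such a check only detects heterogeneity that is linear in the covariates; the example that follows uses flexible machine learners instead.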

### 1. Data Generation (linear, no treatment effect heterogeneity) ----
library(GenericML)

set.seed(1)
num.obs  <- 5000
num.vars <- 5

# ATE parameter
ATE <- 2

# random treatment assignment
D <- rbinom(num.obs, 1, 0.5)

# covariates
Z <- mvtnorm::rmvnorm(num.obs, mean = rep(0, num.vars), sigma = diag(num.vars))
colnames(Z) <- paste0("z", 1:num.vars)

# counterfactual outcomes
Y0 <- as.numeric(Z %*% c(2, -3, 0, 1, 2)) + rnorm(num.obs)
Y1 <- ATE + Y0

# observed outcome
Y  <- ifelse(D == 1, Y1, Y0)


### 2. Prepare the arguments for GenericML() ----

# quantile cutoffs for the GATES grouping of the estimated CATEs
quantile_cutoffs <- c(0.2, 0.4, 0.6, 0.8) # 20%, 40%, 60%, 80% quantiles

# specify the learner of the propensity score (non-penalized logistic regression here). Propensity scores can also be supplied directly.
learner_propensity_score <- "mlr3::lrn('glmnet', lambda = 0, alpha = 1)"

# specify the considered learners of the BCA and the CATE (here: lasso, random forest, and SVM)
learners_GenericML <- c("lasso", "mlr3::lrn('ranger', num.trees = 100)", "mlr3::lrn('svm')")

# specify the data that shall be used for the CLAN
# here, we use all variables of Z and uniformly distributed random noise
Z_CLAN <- cbind(Z, random = runif(num.obs))

# specify the number of splits
num_splits <- 100

# specify whether a Horvitz-Thompson (HT) transformation shall be used when estimating BLP and GATES
HT <- FALSE

# A list controlling the variables that shall be used in the matrix X1 for the BLP and GATES regressions. 
X1_BLP   <- setup_X1()
X1_GATES <- setup_X1()

# consider differences between group K (most affected) and groups 1 and 2, respectively.
diff_GATES <- setup_diff(subtract_from = "most",
                         subtracted = c(1,2))
diff_CLAN  <- setup_diff(subtract_from = "most",
                         subtracted = c(1,2))

# specify the significance level
significance_level <- 0.05

# specify the minimum variation of the predictions; if the variation falls below this threshold, Gaussian noise with variance var(Y)/20 is added.
min_variation <- 1e-05

# specify which estimator of the error covariance matrix shall be used in BLP and GATES (standard OLS covariance matrix estimator here)
vcov_BLP   <- setup_vcov()
vcov_GATES <- setup_vcov()

# specify whether or not it should be assumed that the group variances of the most and least affected groups are equal in CLAN.
equal_variances_CLAN <- FALSE

# specify the proportion of samples that shall be selected in the auxiliary set
prop_aux <- 0.5

# specify whether or not the splits and auxiliary results of the learners shall be stored
store_splits   <- TRUE
store_learners <- TRUE

# parallelization options (currently only supported on Unix systems)
parallel  <- TRUE
num_cores <- 4      # 4 cores
seed      <- 123456
# Note that the number of cores influences the random number stream. Thus, different choices of `num_cores` may lead to different results.



### 3. Run the GenericML() function with these arguments ----
# runtime: ~40 seconds with R version 4.1.0 on a Dell Latitude 5300 (i5-8265U CPU @ 1.60GHz × 8, 32GB RAM), running on Ubuntu 21.10. Returns a GenericML object.
genML <- GenericML(Z = Z, D = D, Y = Y,
                   learner_propensity_score = learner_propensity_score,
                   learners_GenericML = learners_GenericML,
                   num_splits = num_splits,
                   Z_CLAN = Z_CLAN,
                   HT = HT,
                   X1_BLP = X1_BLP,
                   X1_GATES = X1_GATES,
                   vcov_BLP = vcov_BLP,
                   vcov_GATES = vcov_GATES,
                   quantile_cutoffs = quantile_cutoffs,
                   diff_GATES = diff_GATES,
                   diff_CLAN = diff_CLAN,
                   equal_variances_CLAN = equal_variances_CLAN,
                   prop_aux = prop_aux,
                   significance_level = significance_level,
                   min_variation = min_variation,
                   parallel = parallel,
                   num_cores = num_cores,
                   seed = seed,
                   store_splits = store_splits,
                   store_learners = store_learners)

### 4. Analyze the output ----
## print
genML

## the line below returns the medians of the estimated \Lambda and \bar{\Lambda}
genML$best$overview

# Get best learner for BLP
genML$best$BLP

# Get best learner for GATES and CLAN (this is the same learner)
genML$best$GATES
genML$best$CLAN


# VEIN of BLP
get_BLP(genML, plot = FALSE)
plot(genML, type = "BLP") # plot.GenericML() method
# No indication of treatment effect heterogeneity: beta.2 not significant

# VEIN of GATES
get_GATES(genML, plot = FALSE)
plot(genML, type = "GATES")
# No indication of heterogeneity

# VEIN of CLAN for variable 'z1'
get_CLAN(genML, variable = "z1", plot = FALSE)
plot(genML, type = "CLAN", CLAN_variable = "z1")
# No indication of heterogeneity
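
To make the roles of quantile_cutoffs and setup_diff(subtract_from = "most", subtracted = c(1,2)) more concrete, the following base-R sketch forms K = 5 groups from hypothetical CATE predictions and computes the differences "group K minus group 1" and "group K minus group 2". It is an illustration only: the vector cate_proxy is simulated, raw group means stand in for the regression-based GATES estimates, and the code does not reproduce the package's internals.

```r
set.seed(42)

# hypothetical CATE predictions (simulated; not output of GenericML)
cate_proxy <- rnorm(1000)

# same cutoffs as in the example: 20%, 40%, 60%, 80% quantiles
quantile_cutoffs <- c(0.2, 0.4, 0.6, 0.8)

# turn the cutoffs into interval breaks and assign each prediction to a group
breaks <- c(-Inf, quantile(cate_proxy, probs = quantile_cutoffs), Inf)
group  <- cut(cate_proxy, breaks = breaks, labels = paste0("G", 1:5))

# group sizes (G1 = least affected, G5 = most affected)
print(table(group))

# average prediction within each group
group_means <- vapply(split(cate_proxy, group), mean, numeric(1))

# differences of the most affected group K with groups 1 and 2
K <- length(group_means)
diffs <- c("G5-G1" = group_means[[K]] - group_means[[1]],
           "G5-G2" = group_means[[K]] - group_means[[2]])
print(diffs)
```

Because the groups are ordered by the predictions themselves, both differences are positive by construction here; in GenericML(), the question is whether the corresponding differences in estimated treatment effects are statistically distinguishable from zero.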

Authors

Max Welz (welz@ese.eur.nl), Andreas Alfons (alfons@ese.eur.nl), Mert Demirer (mdemirer@mit.edu), and Victor Chernozhukov (vchern@mit.edu).