27 Using machine learning to estimate causal effects in observational studies

Published

September 21, 2017

27.1 A simulation for an OLS model

In an observational study, estimating a causal effect correctly requires two assumptions: that treatment is exogenous, and that we have specified the correct functional form.

library(MASS)
library(ggplot2)
library(dplyr)
library(tmle)
library(glmnet)
set.seed(366)

nobs <- 2000
xw <- .8
xz <- .5
zw <- .6
nrow <- 3
ncol <- 3
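# Variable order is x, z, w. Off-diagonal entries are the squared values
# xz^2, xw^2, zw^2 (0.25, 0.64, 0.36).
covarMat <- matrix(c(1^2,  xz^2, xw^2,
                     xz^2, 1^2,  zw^2,
                     xw^2, zw^2, 1^2), nrow = nrow, ncol = ncol)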

mu <- rep(0,3)
rawvars <- mvrnorm(n=nobs, mu=mu, Sigma=covarMat)
df <- as_tibble(rawvars, .name_repair = "minimal")
names(df) <- c('x','z','w')
df <- df %>%
    # A small constant inside each log() keeps log(v^2) from diverging to
    # large negative values when v is close to 0 (e.g. log(w^2) alone ranges
    # down to about -15 here), which otherwise pushes the propensity score
    # P(A=1|W) toward 0/1 for those units and violates positivity.
    mutate(log.x=log(x^2 + 0.01), log.z=log(z^2 + 0.01), log.w=log(w^2 + 0.01), z.sqr=z^2, w.sqr=w^2) %>%
    mutate(g.var= log.w  + rnorm(nobs)) %>%
    mutate(A = rbinom(nobs, 1, 1/(1+exp((g.var))))) %>%
    mutate(y0=rnorm(nobs) + log.x) %>%
    mutate(tau.true = 2  + rnorm(nobs), y1=y0+tau.true, treat=A, y = treat*y1 + (1-treat)*y0)
lm1 <- lm(y ~ A + log.w + log.x , data=df)
summary(lm1)


Call:
lm(formula = y ~ A + log.w + log.x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5207 -0.8338 -0.0089  0.8386  4.1534 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.01311    0.04774   0.275    0.784    
A            1.95487    0.06703  29.166   <2e-16 ***
log.w        0.02835    0.01945   1.458    0.145    
log.x        1.00090    0.01748  57.264   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.253 on 1996 degrees of freedom
Multiple R-squared:  0.6746,    Adjusted R-squared:  0.6741 
F-statistic:  1380 on 3 and 1996 DF,  p-value: < 2.2e-16

lm2 <- lm(y ~ A , data=df)
summary(lm2)


Call:
lm(formula = y ~ A, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.1667 -1.4416  0.1372  1.4760  6.1625 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.75733    0.07539  -10.04   <2e-16 ***
A            1.47007    0.09571   15.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.077 on 1998 degrees of freedom
Multiple R-squared:  0.1056,    Adjusted R-squared:  0.1052 
F-statistic: 235.9 on 1 and 1998 DF,  p-value: < 2.2e-16

lm3 <- lm(y ~ A + w, data=df)
summary(lm3)


Call:
lm(formula = y ~ A + w, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.1657 -1.4531  0.1338  1.4780  6.0957 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.75631    0.07541 -10.030   <2e-16 ***
A            1.46876    0.09573  15.343   <2e-16 ***
w           -0.03874    0.04464  -0.868    0.386    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.077 on 1997 degrees of freedom
Multiple R-squared:  0.1059,    Adjusted R-squared:  0.105 
F-statistic: 118.3 on 2 and 1997 DF,  p-value: < 2.2e-16

lm4 <- lm(y ~ A + w + x, data=df)
summary(lm4)


Call:
lm(formula = y ~ A + w + x, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-6.167 -1.445  0.147  1.473  6.205 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.75360    0.07539  -9.996   <2e-16 ***
A            1.46387    0.09573  15.292   <2e-16 ***
w           -0.09981    0.05764  -1.732   0.0835 .  
x            0.10281    0.06141   1.674   0.0943 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.076 on 1996 degrees of freedom
Multiple R-squared:  0.1072,    Adjusted R-squared:  0.1059 
F-statistic: 79.88 on 3 and 1996 DF,  p-value: < 2.2e-16

In this example, treatment assignment is driven by the log-transformed w (log.w), and the outcome is driven by the log-transformed x (log.x) and the treatment. What we observe, however, is w and x. This happens all the time in observational studies. In fact, it is close to an ideal situation: we observe the variables that determine the outcome and are unsure only about the functional form. Even so, the example shows that unless a linear model includes the exact terms that enter the DGP (here log.x and log.w), its estimate of the treatment effect is biased.
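Read directly off the simulation code above, the data-generating process can be written out as follows. The symbols $\ell_x$, $\ell_w$, $\varepsilon_g$, $\varepsilon_0$ and $\varepsilon_\tau$ are notation introduced here for illustration; they stand for log.x, log.w and the three independent standard normal draws in the code.

$$
\begin{aligned}
(x, z, w) &\sim N(0, \Sigma), \\
\ell_x &= \log(x^2 + 0.01), \qquad \ell_w = \log(w^2 + 0.01), \\
A &\sim \mathrm{Bernoulli}\!\left(\frac{1}{1 + \exp(\ell_w + \varepsilon_g)}\right), \\
Y(0) &= \ell_x + \varepsilon_0, \\
Y(1) &= Y(0) + \tau, \qquad \tau = 2 + \varepsilon_\tau, \\
Y &= A\,Y(1) + (1 - A)\,Y(0).
\end{aligned}
$$

Since $E[\tau] = 2$, the average treatment effect is 2. Neither $x$ nor $w$ enters in levels anywhere.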

Model 1 is the only model with a reasonable estimate of the treatment effect (1.95, against a true value of 2). Model 2 suffers from endogeneity: A is correlated with the omitted variable log.x (through log.w, since x and w are correlated). Model 3 adds w, and Model 4 adds w and x, but in levels rather than logs, so both remain biased (about 1.47 in each case).
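The direction of the bias in Model 2 can be worked out from the simulation code. Writing $\ell_x$ for log.x (notation introduced here), the observed outcome is $Y = \ell_x + \varepsilon_0 + A\,\tau$, where the noise terms are drawn independently of $A$ and $E[\tau] = 2$. The coefficient on A in a regression of y on A alone is the difference in group means, so it converges to

$$
E[Y \mid A = 1] - E[Y \mid A = 0] \;=\; 2 + \bigl(E[\ell_x \mid A = 1] - E[\ell_x \mid A = 0]\bigr).
$$

Treatment is more likely when log.w is small, and log.w is positively related to log.x, so treated units tend to have lower $\ell_x$. The term in parentheses is therefore negative, which is consistent with the estimate of 1.47 falling below 2.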

The lesson is that functional form matters. In practice, however, we rarely know the true functional form. What can we do?

# Algorithm set trimmed to 7 fast learners for render speed.
# A production analysis would also include SL.randomForest, SL.gbm, SL.gam.
Q.SL.library <- c("SL.glmnet","SL.glm","SL.glm.interaction", "SL.rpart","SL.bayesglm","SL.step","SL.mean")
g.SL.library <- c("SL.glmnet","SL.glm","SL.glm.interaction", "SL.rpart","SL.bayesglm","SL.step","SL.mean")

# tmle1 uses x and w; tmle2 additionally includes z.  Note z is not in the true
# treatment or outcome DGP -- it is only correlated with x and w -- so adding it
# should not change the estimate much.
tmle1 <- tmle(Y = df$y, A = df$treat, W = df[,c('x','w')], g.SL.library = g.SL.library , Q.SL.library = Q.SL.library)
tmle1

 Marginal mean under treatment (EY1)
   Parameter Estimate:  0.70474
   Estimated Variance:  0.0041837
              p-value:  <2e-16
    95% Conf Interval:  (0.57797, 0.83151)

 Marginal mean under comparator (EY0)
   Parameter Estimate:  -1.259
   Estimated Variance:  0.0051339
              p-value:  <2e-16
    95% Conf Interval:  (-1.3994, -1.1185)

 Additive Effect
   Parameter Estimate:  1.9637
   Estimated Variance:  0.0051747
              p-value:  <2e-16
    95% Conf Interval:  (1.8227, 2.1047)

 Additive Effect among the Treated
   Parameter Estimate:  1.936
   Estimated Variance:  0.0068662
              p-value:  <2e-16
    95% Conf Interval:  (1.7736, 2.0984)

 Additive Effect among the Controls
   Parameter Estimate:  2.0294
   Estimated Variance:  0.0056766
              p-value:  <2e-16
    95% Conf Interval:  (1.8817, 2.1771)

tmle2 <- tmle(Y = df$y, A = df$treat, W = df[,c('x','w', 'z')], g.SL.library = g.SL.library , Q.SL.library = Q.SL.library)
tmle2

 Marginal mean under treatment (EY1)
   Parameter Estimate:  0.70402
   Estimated Variance:  0.0041865
              p-value:  <2e-16
    95% Conf Interval:  (0.5772, 0.83083)

 Marginal mean under comparator (EY0)
   Parameter Estimate:  -1.2499
   Estimated Variance:  0.0051507
              p-value:  <2e-16
    95% Conf Interval:  (-1.3906, -1.1093)

 Additive Effect
   Parameter Estimate:  1.954
   Estimated Variance:  0.0051513
              p-value:  <2e-16
    95% Conf Interval:  (1.8133, 2.0946)

 Additive Effect among the Treated
   Parameter Estimate:  1.9354
   Estimated Variance:  0.0068412
              p-value:  <2e-16
    95% Conf Interval:  (1.7733, 2.0975)

 Additive Effect among the Controls
   Parameter Estimate:  2.028
   Estimated Variance:  0.0057375
              p-value:  <2e-16
    95% Conf Interval:  (1.8795, 2.1764)

We use [Mark van der Laan’s TMLE method](http://biostats.bepress.com/ucbbiostat/paper275/). It uses [SuperLearner](http://biostats.bepress.com/ucbbiostat/paper222/), an ensemble of multiple machine learning algorithms, as the initial estimator, so it does not need to assume the functional form of the DGP. Even if we do not observe the exact variables that determine the DGP of the outcome, as long as we observe some functions (even nonlinear functions) of them, we can still get reasonable estimates of the treatment effect.
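For readers who want the mechanics, here is a textbook-style sketch of the targeting step for the additive effect. The notation is an assumption made for this sketch: $\bar{Q}(A, W)$ is the initial SuperLearner estimate of $E[Y \mid A, W]$ and $g(W)$ is the estimated propensity score $P(A = 1 \mid W)$. The tmle package may differ in details such as bounding $g$ and rescaling a continuous outcome to $[0, 1]$ before the logistic update.

$$
\begin{aligned}
H(A, W) &= \frac{A}{g(W)} - \frac{1 - A}{1 - g(W)}, \\
\operatorname{logit} \bar{Q}^{*}(A, W) &= \operatorname{logit} \bar{Q}(A, W) + \hat{\varepsilon}\, H(A, W), \\
\hat{\psi} &= \frac{1}{n} \sum_{i=1}^{n} \bigl[\bar{Q}^{*}(1, W_i) - \bar{Q}^{*}(0, W_i)\bigr].
\end{aligned}
$$

Here $\hat{\varepsilon}$ is estimated by regressing $Y$ on $H(A, W)$ with $\operatorname{logit} \bar{Q}(A, W)$ as an offset. The appearance of $g(W)$ and $1 - g(W)$ in the denominators is why propensity scores near 0 or 1 are a problem.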

In this example, we used several popular machine learning algorithms to model both the treatment assignment process and the outcome process. The first TMLE model uses x and w (not the logged x and w that are in the true DGP); the second adds the variable z.

The TMLE estimates of the additive effect (1.96 with x and w, 1.95 with z added) are far less biased than the linear models in x and w (about 1.47). TMLE may not beat the linear model with logged x and w, but in empirical studies we often cannot assume we observe the variables in the DGP, only some proxies for them. I’ll do more simulations to see whether TMLE does perform better when we are unsure about the functional form. We should expect that to be the case.
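As a first step toward repeated simulations, the sketch below reruns only the OLS part of the experiment over many draws, to check that the gap between the log and level specifications is not an accident of one seed. The helper names simulate_once and estimate_once, the seed, and the choice of 200 replications are assumptions made for this sketch; the TMLE fits are left out to keep it fast.

```r
library(MASS)  # for mvrnorm

# Draw one data set from the same DGP as the simulation in this section.
simulate_once <- function(nobs = 2000) {
  # Same covariance matrix as covarMat; variable order is x, z, w.
  sigma <- matrix(c(1,     0.5^2, 0.8^2,
                    0.5^2, 1,     0.6^2,
                    0.8^2, 0.6^2, 1), nrow = 3)
  raw <- mvrnorm(n = nobs, mu = rep(0, 3), Sigma = sigma)
  x <- raw[, 1]
  w <- raw[, 3]
  log.x <- log(x^2 + 0.01)
  log.w <- log(w^2 + 0.01)
  # Treatment is less likely when log.w (plus noise) is large.
  A <- rbinom(nobs, 1, 1 / (1 + exp(log.w + rnorm(nobs))))
  y0 <- rnorm(nobs) + log.x
  y <- y0 + A * (2 + rnorm(nobs))  # individual effect has mean 2
  data.frame(y, A, x, w, log.x, log.w)
}

# Fit the correctly specified model and the levels model on one draw,
# returning the coefficient on A from each.
estimate_once <- function() {
  d <- simulate_once()
  c(correct = unname(coef(lm(y ~ A + log.w + log.x, data = d))["A"]),
    levels  = unname(coef(lm(y ~ A + w + x, data = d))["A"]))
}

set.seed(1)
est <- replicate(200, estimate_once())  # 2 x 200 matrix

# Average estimate per specification; the true effect is 2.
rowMeans(est)
```

Adding a tmle() call inside estimate_once would extend the comparison to TMLE, at the cost of a much longer run time per replication.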

The simple tmle package example used here is for a binary treatment. TMLE itself is not limited to binary treatments: there are TMLE estimators for continuous, multivalued, longitudinal, stochastic, and survival settings (e.g. the ltmle and lmtp packages), though which estimand and package you use differs by setting.

It’s about time we brought machine learning techniques into the study of causal effects in observational data.
