27  Using machine learning for causal effect in observational study

Published

September 21, 2017

27.1 A simulation for an OLS model

In an observational study, we need to assume we have the functional form to get causal effect estimated correctly, in addtion to the assumption of treatment being exogenous.

library(MASS)
library(ggplot2)
library(dplyr)
library(tmle)
library(glmnet)
set.seed(366)

nobs <- 2000
xw <- .8
xz <- .5
zw <- .6
nrow <- 3
ncol <- 3
covarMat = matrix( c(1^2, xz^2, xw^2, xz^2, 1^2, zw^2,  xw^2, zw^2, 1^2 ) , nrow=ncol , ncol=ncol )

mu <- rep(0,3)
rawvars <- mvrnorm(n=nobs, mu=mu, Sigma=covarMat)
df <- as_tibble(rawvars, .name_repair = "minimal")
names(df) <- c('x','z','w')
df <- df %>%
    mutate(log.x=log(x^2), log.z=log(z^2), log.w=log(w^2), z.sqr=z^2, w.sqr=w^2) %>%
    mutate(g.var= log.w  + rnorm(nobs)) %>%
    mutate(A = rbinom(nobs, 1, 1/(1+exp((g.var))))) %>%
    mutate(y0=rnorm(nobs) + log.x) %>%
    mutate(tau.true = 2  + rnorm(nobs), y1=y0+tau.true, treat=A, y = treat*y1 + (1-treat)*y0)
lm1 <- lm(y ~ A + log.w + log.x , data=df)
summary(lm1)

Call:
lm(formula = y ~ A + log.w + log.x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5506 -0.8372 -0.0154  0.8502  4.1624 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.01267    0.04778   0.265    0.791    
A            1.93171    0.06580  29.355   <2e-16 ***
log.w        0.01105    0.01428   0.774    0.439    
log.x        1.00162    0.01353  74.030   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.255 on 1996 degrees of freedom
Multiple R-squared:  0.758, Adjusted R-squared:  0.7576 
F-statistic:  2084 on 3 and 1996 DF,  p-value: < 2.2e-16
lm2 <- lm(y ~ A , data=df)
summary(lm2)

Call:
lm(formula = y ~ A, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.9102  -1.3822   0.3241   1.6933   6.4046 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.92139    0.08992  -10.25   <2e-16 ***
A            1.38964    0.11366   12.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.459 on 1998 degrees of freedom
Multiple R-squared:  0.06961,   Adjusted R-squared:  0.06915 
F-statistic: 149.5 on 1 and 1998 DF,  p-value: < 2.2e-16
lm3 <- lm(y ~ A + w, data=df)
summary(lm3)

Call:
lm(formula = y ~ A + w, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.9112  -1.3795   0.3139   1.6863   6.3468 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.92056    0.08995 -10.234   <2e-16 ***
A            1.38860    0.11368  12.215   <2e-16 ***
w           -0.03351    0.05286  -0.634    0.526    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.46 on 1997 degrees of freedom
Multiple R-squared:  0.0698,    Adjusted R-squared:  0.06887 
F-statistic: 74.93 on 2 and 1997 DF,  p-value: < 2.2e-16
lm4 <- lm(y ~ A + w + x, data=df)
summary(lm4)

Call:
lm(formula = y ~ A + w + x, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.9109  -1.4047   0.3177   1.6926   6.4528 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.91759    0.08995 -10.201   <2e-16 ***
A            1.38335    0.11372  12.164   <2e-16 ***
w           -0.09260    0.06827  -1.356    0.175    
x            0.09947    0.07275   1.367    0.172    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.459 on 1996 degrees of freedom
Multiple R-squared:  0.07067,   Adjusted R-squared:  0.06927 
F-statistic:  50.6 on 3 and 1996 DF,  p-value: < 2.2e-16

In this example, treatment assignment process is determined by logged w, and outcome is dtermined by logged x and treatment. However, what we observe is w and x. In observational studies, this happens all the time. In fact, this is an ideal situation, that we observe variables that are determinants of outcome, although we are not sure about the functional form that determines the outcome. However, this example shows that unless we have observed exactly the factors themselves (in this case logged x, w, which determines the DGP), we have biased estimates of the true treatment effect.

Model 1 is the only model with reasonable estimate of treatment effect (which is 2 in this case). Model 2 is a model with endogeneity: A is correlated with the missing variabel logged x. Model 3 and 4 we have x and w, but not logged, therefore still biased.

The lesson here is the functional form does matter. However, we have no way of knowing the functional form. What can we do here?

# Algorithm set trimmed to 7 fast learners for render speed.
# A production analysis would also include SL.randomForest, SL.gbm, SL.gam.
Q.SL.library <- c("SL.glmnet","SL.glm","SL.glm.interaction", "SL.rpart","SL.bayesglm","SL.step","SL.mean")
g.SL.library <- c("SL.glmnet","SL.glm","SL.glm.interaction", "SL.rpart","SL.bayesglm","SL.step","SL.mean")

# this one is good since both Q and g are correct (including z in it)
tmle1 <- tmle(Y = df$y, A = df$treat, W = df[,c('x','w')], g.SL.library = g.SL.library , Q.SL.library = Q.SL.library)
tmle1
 Marginal mean under treatment (EY1)
   Parameter Estimate:  0.70474
   Estimated Variance:  0.0041837
              p-value:  <2e-16
    95% Conf Interval:  (0.57797, 0.83151)

 Marginal mean under comparator (EY0)
   Parameter Estimate:  -1.259
   Estimated Variance:  0.0051339
              p-value:  <2e-16
    95% Conf Interval:  (-1.3994, -1.1185)

 Additive Effect
   Parameter Estimate:  1.9637
   Estimated Variance:  0.0051747
              p-value:  <2e-16
    95% Conf Interval:  (1.8227, 2.1047)

 Additive Effect among the Treated
   Parameter Estimate:  1.936
   Estimated Variance:  0.0068662
              p-value:  <2e-16
    95% Conf Interval:  (1.7736, 2.0984)

 Additive Effect among the Controls
   Parameter Estimate:  2.0294
   Estimated Variance:  0.0056766
              p-value:  <2e-16
    95% Conf Interval:  (1.8817, 2.1771)
tmle2 <- tmle(Y = df$y, A = df$treat, W = df[,c('x','w', 'z')], g.SL.library = g.SL.library , Q.SL.library = Q.SL.library)
tmle2
 Marginal mean under treatment (EY1)
   Parameter Estimate:  0.70402
   Estimated Variance:  0.0041865
              p-value:  <2e-16
    95% Conf Interval:  (0.5772, 0.83083)

 Marginal mean under comparator (EY0)
   Parameter Estimate:  -1.2499
   Estimated Variance:  0.0051507
              p-value:  <2e-16
    95% Conf Interval:  (-1.3906, -1.1093)

 Additive Effect
   Parameter Estimate:  1.954
   Estimated Variance:  0.0051513
              p-value:  <2e-16
    95% Conf Interval:  (1.8133, 2.0946)

 Additive Effect among the Treated
   Parameter Estimate:  1.9354
   Estimated Variance:  0.0068412
              p-value:  <2e-16
    95% Conf Interval:  (1.7733, 2.0975)

 Additive Effect among the Controls
   Parameter Estimate:  2.028
   Estimated Variance:  0.0057375
              p-value:  <2e-16
    95% Conf Interval:  (1.8795, 2.1764)

We use [Mark van der Laan’s TMLE method] (http://biostats.bepress.com/ucbbiostat/paper275/). It uses [SuperLearner] (http://biostats.bepress.com/ucbbiostat/paper222/) as the initial estimator. It’s an ensemble of mulitple machine learning algorithms. Therefore it does not need to assume the functional form of the DGP. Even if we don’t have the variables that determines the DGP of outcome, if we observe some functions (even nonlinear functions) of these variables, we can still get reasonable estimates of the treatment effect.

In this example, we used multiple popular machine learning algorithms in modeling both treatment assingment process and the outcome process. The first TMLE model is with x and w (note not the logged x and w which are in the true DGP), the second one with an additional variable z.

It seems that TMLE results are less biased than the linear models with x and w. It may not be better than the linear model with logged x and w, but in empirical studies, we often cannot assume we have the variables in the DGP, but only some proxy of the variables in the DGP. I’ll do more simulations to see whether TMLE does perform better in the situation that we are not sure about the functional form. We should expect that is the case.

So far TMLE can only be used when treatment is binary variable.

It’s about time we embrace machine learning techniques into studies of caual effect in observational studies.