17 Causal Forest in panel data

Published

October 23, 2017

17.1 Introduction

In this simulation exercise, we use Causal Forest (Now is implemented in Generalized Random Forest) (https://github.com/swager/grf) to calculated conditional average treatment effect (or heterogeneous treatment effect). We assume three different data generating processes. The first one is a linear interaction between a variable of interest and the treatment dummy. The second one assumes a nonlinear function (a step function) of a variable of interest, say \(X\), and the treatment dummy \(W\). The third one is also nonlinear, assuming we have variable \(X\), but the real DGP is the interaction of log of \(X\) and \(W\).

17.2 Linear effect

Suppose we have a panel of firms over time, treatment assignment is random. We split the data into training and test. Then we generate 10 random variables, the first two are highly correlated. Then we collapse V2 into firm means, and use that as the unit effect. That is, we have a DGP as:

\[y_{it} = \alpha_i + V1_{it} + W_{it} + V1_{it}*W_{it} + \epsilon_{it} \]

Here \(\alpha_i\) is not observed, and it is correlated with \(V1\). If we don’t control for it, we are going to have a biased estimator.

We run a fixed effect model, and a random effect model and an OLS model. We expect fixed effect model will do well, but not the other two. This is because the other two models do not control for \(\alpha_i\) directly.

# This is a simulation exercise to see how causal forest performs
# First we run a linear model, with panel setting
# Then two nonlinear situations, with panel setting.

library(lfe, quietly = TRUE)
library(lme4)
require(snowfall)
library(MASS)
library(grf)
library(tidyverse)
set.seed(666)
rm(list = ls())
# t is no. of periods. p is no. of variables, n is no. of firms.
t <- 10
p <- 10
n=400
# generate p random variables
data1 = as.data.frame(matrix(rnorm(n*t*p), n*t, p))
names(data1) <- paste0("X", 1:p)

firm=seq(1,n)
time=seq(1,t)
data <- expand.grid(firm=firm,time=time)


# W is the treatment assignment
data$W <- rbinom(n*t, 1, 0.5)
data$train <- rbinom(n*t, 1, 0.5)


# generate two correlated variables
covarMat = matrix( c(1, .95^2, .95^2, 1 ) , nrow=2 , ncol=2 )
data.2  = as.data.frame(mvrnorm(n=n*t , mu=rep(0,2), Sigma=covarMat ))
names(data.2) <- c("V1", "V2")

data <- bind_cols(data,data.2) %>%
    tibble()

data <- bind_cols(data,data1)

# generate unit effect by firm means of V2, which is correlated with V1
data <- data %>%
    group_by(firm) %>%
    mutate(unit.effect=mean(V2)) %>%
    ungroup() %>%
    arrange(firm, time)

y<-c()
unit<-c()
X<-c()
k <- 0
for(i in 1:n){
    for(j in 1:t){
        k<-k+1
        unit[k]<-i
        y[k]<-1+ data$V1[k] + data$W[k] + data$V1[k]*data$W[k] + data$unit.effect[k]+rnorm(1,mean=0,sd=1)
    }
}
data$unit=unit
data$y=y



data.train <- data %>%
    filter(train==1)
data.test <- data %>%
    filter(train==0)

# run fixed effect, random effect, and ols to see whether they are
# biased.
# in general we should expect fe performs better, since we are
# omitting unit.effect in the other two models
fe <- felm(y ~ V1 * W | unit, data=data.train)
re <-lmer(y~V1*W+(1 | unit), data=data.train)
ols <- lm(y ~ V1 * W , data=data.train)
summary(fe)


Call:
   felm(formula = y ~ V1 * W | unit, data = data.train) 

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0440 -0.6252 -0.0083  0.6221  3.6174 

Coefficients:
     Estimate Std. Error t value Pr(>|t|)    
V1    0.98354    0.03748   26.24   <2e-16 ***
W     1.02186    0.05093   20.06   <2e-16 ***
V1:W  1.04302    0.05237   19.92   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.018 on 1612 degrees of freedom
Multiple R-squared(full model): 0.7952   Adjusted R-squared: 0.7443 
Multiple R-squared(proj model): 0.7269   Adjusted R-squared: 0.659 
F-statistic(full model):15.61 on 401 and 1612 DF, p-value: < 2.2e-16 
F-statistic(proj model):  1430 on 3 and 1612 DF, p-value: < 2.2e-16

summary(re)

Linear mixed model fit by REML ['lmerMod']
Formula: y ~ V1 * W + (1 | unit)
   Data: data.train

REML criterion at convergence: 5938.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.4332 -0.6615 -0.0049  0.6640  3.8329 

Random effects:
 Groups   Name        Variance Std.Dev.
 unit     (Intercept) 0.07166  0.2677  
 Residual             1.04602  1.0228  
Number of obs: 2014, groups:  unit, 399

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.00010    0.03539   28.26
V1           1.02961    0.03486   29.53
W            1.00333    0.04684   21.42
V1:W         1.05963    0.04814   22.01

Correlation of Fixed Effects:
     (Intr) V1     W     
V1    0.013              
W    -0.656 -0.008       
V1:W -0.007 -0.727  0.025

summary(ols)


Call:
lm(formula = y ~ V1 * W, data = data.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8384 -0.7101  0.0081  0.7334  4.0053 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.00363    0.03319   30.24   <2e-16 ***
V1           1.04612    0.03516   29.75   <2e-16 ***
W            0.99657    0.04712   21.15   <2e-16 ***
V1:W         1.06458    0.04840   22.00   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.057 on 2010 degrees of freedom
Multiple R-squared:  0.7249,    Adjusted R-squared:  0.7245 
F-statistic:  1765 on 3 and 2010 DF,  p-value: < 2.2e-16

Now we run Causal Forest on this data. Two clarifications about what is and is not at stake here. First, treatment \(W\) is randomized (rbinom(., 0.5), independent of the firm effect), so the omitted unit effect \(\alpha_i\) does not confound the treatment effect — it only enters the nuisance functions (and hence precision), it does not create treatment-effect bias. The panel issue here is therefore mainly dependence and nuisance fit, not confounding. Second, the firm structure should be handled in two ways: we include the firm dummies in \(X\) so the forest can absorb \(\alpha_i\) in the nuisance models, and we pass clusters = firm so that grf’s honest sample splitting and its variance estimates keep each firm’s repeated observations together rather than treating them as independent rows. Including the dummies alone does neither of those.

# ensure consistent factor levels across train/test
all_firms <- levels(factor(data$firm))
data.train$firm <- factor(data.train$firm, levels = all_firms)
data.test$firm <- factor(data.test$firm, levels = all_firms)

# a is a data set of firm dummies
# Caution: this one-hot-encodes all ~400 firms into individual dummy columns
# of X. Random forests split poorly on such a high-cardinality sparse block
# of binary columns (each dummy is only informative for a tiny fraction of
# observations). A fixed-effects forest (demean y, W, and the continuous
# covariates within firm first, then run causal_forest on the demeaned
# variables) or a method that handles unit effects natively would typically
# perform better than including firm as raw dummies.
a <- as.data.frame(model.matrix(y~firm, data.train))
X <- as.matrix(bind_cols(data.train[,c(5,7:16)],a))
Y <- data.train$y
W <- data.train$W
# causal forest is run on y against V1, to X10 and firm dummies.
# clusters = firm makes honest splitting and the variance estimates respect
# the panel structure (a firm's repeated rows are kept together, not treated
# as independent observations).
tau.forest = causal_forest(X, Y, W, clusters = as.integer(data.train$firm))

average_treatment_effect(tau.forest, target.sample = "all")

  estimate    std.err 
0.93560912 0.08227266

average_treatment_effect(tau.forest, target.sample = "treated")

  estimate    std.err 
0.93168066 0.08158925

a <- as.data.frame(model.matrix(y~firm, data.test))
X.test <- as.matrix(bind_cols(data.test[,c(5,7:16)],a))
Y.test <- data.test$y
W.test <- data.test$W

tau.hat = predict(tau.forest, X.test, estimate.variance = TRUE)
sigma.hat = sqrt(tau.hat$variance.estimates)

data.test$pred <- tau.hat$predictions
ggplot(data.test, aes(x=V1, y=pred)) + geom_point() + geom_abline(intercept = 1, slope = 1)

This graph is the predicted value of CATE on the test data, vs. the value of \(V1\).

Now we graph the predicted CATE against the true CATE. The other nice feature of Causal Forest is that it returns variance for CATE for each observation. We also graph the error bar here.

graph.data  <- as.data.frame(tau.hat )
graph.data  <- graph.data  %>%
    mutate(cate=1+data.test$V1) %>%
    mutate(lower=predictions - sigma.hat , upper=predictions + sigma.hat )

ggplot(graph.data , aes(x=cate, y=predictions)) + geom_point()  + geom_abline(slope = 1)

ggplot() + geom_errorbar(graph.data , mapping=aes(x=cate,  ymin=lower,ymax=upper), alpha = 0.2) + geom_point(graph.data , mapping=aes(x=cate,  y=predictions), alpha = 0.2) + geom_abline(slope = 1)

17.3 Nonlinear effect case 1

In the first situation, we assume DGP:

\[y_{it} = 1 + \alpha_i + max(V1_{it},0)*W_{it} + \epsilon_{it} \]

# if we have a nonlinear effect
# suppose we have the same X, but y is generated nonlinearly.
# in this example, y is a nonlinear function of V1 and W
set.seed(666)
rm(list = ls())
# t is no. of periods. p is no. of variables, n is no. of firms.
t <- 10
p <- 10
n=400
# generate p random variables
data1 = as.data.frame(matrix(rnorm(n*t*p), n*t, p))
names(data1) <- paste0("X", 1:p)

firm=seq(1,n)
time=seq(1,t)
data <- expand.grid(firm=firm,time=time)


# W is the treatment assignment
data$W <- rbinom(n*t, 1, 0.5)
data$train <- rbinom(n*t, 1, 0.5)


# generate two correlated variables
covarMat = matrix( c(1, .95^2, .95^2, 1 ) , nrow=2 , ncol=2 )
data.2  = as.data.frame(mvrnorm(n=n*t , mu=rep(0,2), Sigma=covarMat ))
names(data.2) <- c("V1", "V2")

data <- bind_cols(data,data.2) %>%
    tibble()

data <- bind_cols(data,data1)

# generate unit effect by firm means of V2, which is correlated with V1
data <- data %>%
    group_by(firm) %>%
    mutate(unit.effect=mean(V2)) %>%
    ungroup() %>%
    arrange(firm, time)

y<-c()
unit<-c()
X<-c()
k <- 0
for(i in 1:n){
    for(j in 1:t){
        k<-k+1
        unit[k]<-i
        y[k]<-1+ pmax(data$V1[k],0)*data$W[k]  + data$unit.effect[k]+rnorm(1,mean=0,sd=1)
    }
}
data$y=y
data$unit <- unit

data.train <- data %>%
    filter(train==1)
data.test <- data %>%
    filter(train==0)

# run fixed effect, random effect, and ols to see whether they are
# biased.
# they all perform poorly
fe <- felm(y ~ V1 * W | unit, data=data.train)
re <-lmer(y~V1*W+(1 | unit), data=data.train)
ols <- lm(y ~ V1 * W , data=data.train)
summary(fe)


Call:
   felm(formula = y ~ V1 * W | unit, data = data.train) 

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9461 -0.6225 -0.0178  0.6245  3.3537 

Coefficients:
     Estimate Std. Error t value Pr(>|t|)    
V1   -0.02076    0.03815  -0.544    0.586    
W     0.42527    0.05184   8.203 4.73e-16 ***
V1:W  0.53367    0.05331  10.012  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.036 on 1612 degrees of freedom
Multiple R-squared(full model): 0.3744   Adjusted R-squared: 0.2187 
Multiple R-squared(proj model): 0.1393   Adjusted R-squared: -0.07481 
F-statistic(full model):2.405 on 401 and 1612 DF, p-value: < 2.2e-16 
F-statistic(proj model): 86.96 on 3 and 1612 DF, p-value: < 2.2e-16

summary(re)

Linear mixed model fit by REML ['lmerMod']
Formula: y ~ V1 * W + (1 | unit)
   Data: data.train

REML criterion at convergence: 6017.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.5905 -0.6596 -0.0031  0.6717  3.4143 

Random effects:
 Groups   Name        Variance Std.Dev.
 unit     (Intercept) 0.07977  0.2824  
 Residual             1.08359  1.0410  
Number of obs: 2014, groups:  unit, 399

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.99976    0.03624  27.587
V1           0.02797    0.03553   0.787
W            0.40652    0.04774   8.515
V1:W         0.54484    0.04907  11.104

Correlation of Fixed Effects:
     (Intr) V1     W     
V1    0.013              
W    -0.653 -0.008       
V1:W -0.007 -0.727  0.025

summary(ols)


Call:
lm(formula = y ~ V1 * W, data = data.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0965 -0.7415  0.0095  0.7512  3.8413 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.00363    0.03386  29.643   <2e-16 ***
V1           0.04612    0.03587   1.286    0.199    
W            0.39928    0.04807   8.307   <2e-16 ***
V1:W         0.54938    0.04937  11.127   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.078 on 2010 degrees of freedom
Multiple R-squared:  0.1556,    Adjusted R-squared:  0.1543 
F-statistic: 123.5 on 3 and 2010 DF,  p-value: < 2.2e-16

Apparently none of these model will return anything meaningful.

Now we use Causal Forest. The first graph is the predicted CATE against \(V1\); the second graph is the predicted CATE against true CATE. The third one adds in error bars.

all_firms <- levels(factor(data$firm))
data.train$firm <- factor(data.train$firm, levels = all_firms)
data.test$firm <- factor(data.test$firm, levels = all_firms)

# a is a data set of firm dummies
a <- as.data.frame(model.matrix(y~firm,data.train))
X <- as.matrix(bind_cols(data.train[,c(5,7:16)],a))
Y <- data.train$y
W <- data.train$W

#tau.forest = causal_forest(X, Y, W, num.trees = 4000)
a <- as.data.frame(model.matrix(y~firm,data.test))
X.test <- as.matrix(bind_cols(data.test[,c(5,7:16)],a))
Y.test <- data.test$y
W.test <- data.test$W

tau.forest3 = causal_forest(X, Y, W, clusters = as.integer(data.train$firm))

average_treatment_effect(tau.forest3, target.sample = "all")

  estimate    std.err 
0.37320336 0.05326215

average_treatment_effect(tau.forest3, target.sample = "treated")

  estimate    std.err 
0.37223789 0.05285912

#tau.forest3 = causal_forest(X, Y, W, num.trees = 4000)
tau.hat3 = predict(tau.forest3, X.test, estimate.variance = TRUE)
sigma.hat3 = sqrt(tau.hat3$variance.estimates)

data.test$pred3 <- tau.hat3$predictions

seg1 <- data.frame(x1 = -3, x2 = 0, y1 = 0, y2 = 0)
seg2 <- data.frame(x1 = 0, x2 = 2, y1 = 0, y2 = 2)
ggplot(data.test, aes(x=V1, y=pred3)) + geom_point() +
geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2, colour = "segment"), data = seg1)+
geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2, colour = "segment"), data = seg2)

graph.data3 <- as.data.frame(tau.hat3)
graph.data3 <- graph.data3 %>%
    mutate(cate=pmax(data.test$V1,0)) %>%
    mutate(lower=predictions - sigma.hat3, upper=predictions + sigma.hat3)

ggplot(graph.data3, aes(x=cate, y=predictions)) + geom_point()  + geom_abline(slope = 1)

ggplot() + geom_errorbar(graph.data3, mapping=aes(x=cate,  ymin=lower,ymax=upper), alpha = 0.2) + geom_point(graph.data3, mapping=aes(x=cate,  y=predictions), alpha = 0.2) + geom_abline(slope = 1)

17.4 Nonlinear effect case 2

In the second situation, we assume DGP:

\[y_{it} = 1 + \alpha_i + log(V1_{it}^2)*W_{it} + \epsilon_{it} \]

But we only observe \(V1\), not the log of the squared \(V1\).

# if we have a nonlinear effect
# suppose we have the same X, but y is generated nonlinearly.
# another example, suppose we observe log(V1), but the true effect is V1
set.seed(666)
rm(list = ls())
# t is no. of periods. p is no. of variables, n is no. of firms.
t <- 10
p <- 10
n=400
# generate p random variables
data1 = as.data.frame(matrix(rnorm(n*t*p), n*t, p))
names(data1) <- paste0("X", 1:p)

firm=seq(1,n)
time=seq(1,t)
data <- expand.grid(firm=firm,time=time)


# W is the treatment assignment
data$W <- rbinom(n*t, 1, 0.5)
data$train <- rbinom(n*t, 1, 0.5)


# generate two correlated variables
covarMat = matrix( c(1, .95^2, .95^2, 1 ) , nrow=2 , ncol=2 )
data.2  = as.data.frame(mvrnorm(n=n*t , mu=rep(0,2), Sigma=covarMat ))
names(data.2) <- c("V1", "V2")

data <- bind_cols(data,data.2) %>%
    tibble()

data <- bind_cols(data,data1)

# generate unit effect by firm means of V2, which is correlated with V1
data <- data %>%
    group_by(firm) %>%
    mutate(unit.effect=mean(V2)) %>%
    ungroup() %>%
    arrange(firm, time)

y<-c()
unit<-c()
X<-c()
k <- 0
for(i in 1:n){
    for(j in 1:t){
        k<-k+1
        unit[k]<-i
        y[k]<-1+ log(data$V1[k]^2)*data$W[k] + pmin(data$X3[k], 0) + data$unit.effect[k]+rnorm(1,mean=0,sd=1)
    }
}
data$y=y
data$unit <- unit

data.train <- data %>%
    filter(train==1)
data.test <- data %>%
    filter(train==0)

# run fixed effect, random effect, and ols to see whether they are
# biased.
# they all perform poorly
fe <- felm(y ~ V1 * W | unit, data=data.train)
re <-lmer(y~V1*W+(1 | unit), data=data.train)
ols <- lm(y ~ V1 * W , data=data.train)
summary(fe)


Call:
   felm(formula = y ~ V1 * W | unit, data = data.train) 

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0383  -0.9478   0.0907   1.0865   5.3884 

Coefficients:
     Estimate Std. Error t value Pr(>|t|)    
V1   -0.01979    0.07088  -0.279    0.780    
W    -1.17812    0.09632 -12.231   <2e-16 ***
V1:W  0.01450    0.09904   0.146    0.884    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.925 on 1612 degrees of freedom
Multiple R-squared(full model): 0.2876   Adjusted R-squared: 0.1104 
Multiple R-squared(proj model): 0.08495   Adjusted R-squared: -0.1427 
F-statistic(full model):1.623 on 401 and 1612 DF, p-value: 6.032e-11 
F-statistic(proj model): 49.88 on 3 and 1612 DF, p-value: < 2.2e-16

summary(re)

Linear mixed model fit by REML ['lmerMod']
Formula: y ~ V1 * W + (1 | unit)
   Data: data.train

REML criterion at convergence: 8421.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.1184 -0.4977  0.0647  0.6098  2.8959 

Random effects:
 Groups   Name        Variance Std.Dev.
 unit     (Intercept) 0.09312  0.3052  
 Residual             3.72467  1.9299  
Number of obs: 2014, groups:  unit, 399

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.57512    0.06301   9.128
V1           0.06152    0.06490   0.948
W           -1.18310    0.08706 -13.590
V1:W        -0.01892    0.08945  -0.212

Correlation of Fixed Effects:
     (Intr) V1     W     
V1    0.015              
W    -0.685 -0.010       
V1:W -0.009 -0.727  0.028

summary(ols)


Call:
lm(formula = y ~ V1 * W, data = data.train)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.1215  -0.9805   0.1390   1.1999   5.7125 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.57618    0.06136   9.390   <2e-16 ***
V1           0.07086    0.06502   1.090    0.276    
W           -1.18331    0.08712 -13.582   <2e-16 ***
V1:W        -0.02005    0.08949  -0.224    0.823    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.954 on 2010 degrees of freedom
Multiple R-squared:  0.08508,   Adjusted R-squared:  0.08372 
F-statistic: 62.31 on 3 and 2010 DF,  p-value: < 2.2e-16

Again, these models do not work.

Now we run Causal Forest. Again, first graph is predicted CATE against \(V1\), the second one is the predicted CATE against true CATE, the third one adds error bars.

all_firms <- levels(factor(data$firm))
data.train$firm <- factor(data.train$firm, levels = all_firms)
data.test$firm <- factor(data.test$firm, levels = all_firms)

# a is a data set of firm dummies
a <- as.data.frame(model.matrix(y~firm,data.train))
X <- as.matrix(bind_cols(data.train[,c(5,7:16)],a))
Y <- data.train$y
W <- data.train$W

#tau.forest = causal_forest(X, Y, W, num.trees = 4000)
a <- as.data.frame(model.matrix(y~firm,data.test))
X.test <- as.matrix(bind_cols(data.test[,c(5,7:16)],a))
Y.test <- data.test$y
W.test <- data.test$W

tau.forest4 = causal_forest(X, Y, W, clusters = as.integer(data.train$firm))

average_treatment_effect(tau.forest4, target.sample = "all")

   estimate     std.err 
-1.20491164  0.08132368

average_treatment_effect(tau.forest4, target.sample = "treated")

   estimate     std.err 
-1.20241393  0.07981497

#tau.forest3 = causal_forest(X, Y, W, num.trees = 4000)
tau.hat4 = predict(tau.forest4, X.test, estimate.variance = TRUE)
sigma.hat4 = sqrt(tau.hat4$variance.estimates)

data.test$pred4 <- tau.hat4$predictions

fun.1 <- function(x) log(x^2)
ggplot(data.test, aes(x=V1, y=pred4)) + geom_point() +
stat_function(fun = fun.1, colour="red") + xlim(-3,3)

graph.data4 <- as.data.frame(tau.hat4)
graph.data4 <- graph.data4 %>%
    mutate(cate=log(data.test$V1^2)) %>%
    mutate(lower=predictions - sigma.hat4, upper=predictions + sigma.hat4)

ggplot(graph.data4, aes(x=cate, y=predictions)) + geom_point()  + geom_abline(slope = 1)

ggplot() + geom_errorbar(graph.data4, mapping=aes(x=cate,  ymin=lower,ymax=upper), alpha = 0.2) + geom_point(graph.data4, mapping=aes(x=cate,  y=predictions), alpha = 0.2) + geom_abline(slope = 1)

17.5 Summary

We see in these examples, Causal Forest performs fairly well, while other linear models will not work in the nonlinear DGP situations.

Systematic treatment: R · Julia.