6 Estimation: RA, IPW, AIPW and IPWRA

Once a causal effect is identified, meaning it can be written as a function of the observed-data distribution, we can move to estimation: computing that identified quantity from a sample. This chapter covers four estimators: regression adjustment (RA), inverse probability weighting (IPW), augmented inverse probability weighting (AIPW) and inverse probability weighted regression adjustment (IPWRA).

6.1 Experimental Data

With experimental data, randomisation makes the identifying assumptions (unconfoundedness and overlap) hold by design, so we can compare the treated and control groups directly. We can use a difference-in-means estimator (under randomisation ATE = ATT = ATU, so no subscript is needed): \[ \hat{\tau} = \bar Y_1 - \bar Y_0 \]

In an experimental setting, we can use a t test for the difference in means. We can also regress $Y$ on $W$ and $X$; the coefficient on $W$ estimates the same effect, and including $X$ usually improves precision.
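As a rough illustration of the two routes above, the following sketch simulates a randomised experiment and compares the difference in means with the regression coefficients. The data-generating process, sample size and variable names are assumptions made up for this example, not part of the chapter's data.

```r
# Simulated randomised experiment (all numbers are illustrative assumptions).
set.seed(1)
n <- 2000
x <- rnorm(n)                        # pre-treatment covariate
w <- rbinom(n, 1, 0.5)               # randomised treatment, independent of x
y <- 1 + 2 * w + 0.5 * x + rnorm(n)  # true effect is 2

# Difference in means and the matching Welch t test.
tau_hat <- mean(y[w == 1]) - mean(y[w == 0])
tt <- t.test(y[w == 1], y[w == 0])

# Regressing Y on W alone reproduces the difference in means exactly;
# adding X keeps the same target and typically tightens the standard error.
fit_w  <- lm(y ~ w)
fit_wx <- lm(y ~ w + x)
c(diff_means = tau_hat,
  ols_w  = unname(coef(fit_w)["w"]),
  ols_wx = unname(coef(fit_wx)["w"]))
```

The coefficient from `lm(y ~ w)` is numerically identical to the difference in means, which is why the two approaches are interchangeable for the point estimate.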

6.2 Adjustment Formula

From causal to estimation: \[ \small \begin{align} E[Y(1)-Y(0)] &= E[Y(1)] - E[Y(0)] \\ &=E_X[E[Y(1)|X]] - E_X[E[Y(0)|X]] \\ &=E_X[E[Y(1)|W=1, X]] - E_X[E[Y(0)|W=0, X]] \\ &=E_X[E[Y|W=1, X]] - E_X[E[Y|W=0, X]] \end{align} \] where the first step uses linearity of expectation, the second the law of iterated expectations, the third unconfoundedness (with overlap ensuring both conditional expectations are defined at every $X$), and the fourth consistency.
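The last line of the adjustment formula can be computed directly when $X$ is discrete. The sketch below uses a simulated binary confounder (the probabilities and coefficients are assumptions chosen for illustration) and contrasts the naive comparison with the stratified, adjusted one.

```r
set.seed(2)
n <- 5000
x <- rbinom(n, 1, 0.4)                       # binary confounder
w <- rbinom(n, 1, ifelse(x == 1, 0.8, 0.2))  # treatment depends on x (overlap holds)
y <- 1 + 2 * w + 3 * x + rnorm(n)            # true ATE is 2

# Confounded comparison of raw group means.
naive <- mean(y[w == 1]) - mean(y[w == 0])

# Inner expectations E[Y | W = w, X = x]: one cell mean per (w, x).
cell <- tapply(y, list(w = w, x = x), mean)
# Outer expectation over X uses the marginal distribution of x.
p_x <- as.numeric(prop.table(table(x)))

adjusted <- sum((cell["1", ] - cell["0", ]) * p_x)
c(naive = naive, adjusted = adjusted)
```

Because treated units are mostly of the high-outcome type `x = 1`, the naive difference overstates the effect, while averaging the within-stratum differences over the distribution of `x` recovers a value close to 2.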

6.3 Regression Adjustment

How do we get $E[Y|W=w, X=x]$? We can use regression adjustment.

We define a conditional mean function:

\[ \begin{align} \mu_0(X) &\equiv E[Y|W=0, X] \\ &= E[Y(0)|W=0, X] \\ &= E[Y(0)|W=1, X] \end{align} \]

The RA estimators are:

\[ \hat \tau_{ATT} = \frac{1}{n_1}\sum_{i:W_i=1}\{Y_i-\hat\mu_0(X_i)\} \]

\[ \hat \tau_{ATE} = \frac{1}{n}\sum_{i=1}^n\{\hat\mu_1(X_i)-\hat\mu_0(X_i)\} \]

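where $\hat \mu_0(X_i)$ is the predicted value at $X_i$ from a regression of $Y$ on $X$ fitted on the control group, and $\hat \mu_1(X_i)$ is the analogous prediction from a regression fitted on the treated group.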

Target: \[ \begin{align} \tau_{ATT} &= E[Y|W=1] - E[Y(0)|W=1] \\ &= E[Y|W=1] - E[\mu_0(X) | W=1] \\ & = E[Y|W=1] - (\alpha_0 + E[X|W=1] \beta_0) \end{align} \]

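The last line imposes a linear form $\mu_0(X) = \alpha_0 + X\beta_0$. Other functional forms are possible; the point is that $\mu_0(X)$ has to be modelled somehow. With a parametric form, we need to check that the treated covariate values do not lie far outside the range of the control sample, because predictions there rely on extrapolation.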
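A minimal sketch of the two RA estimators, assuming linear conditional means and a simulated data set (the variable names and the data-generating process are illustrative assumptions):

```r
set.seed(3)
n <- 4000
x <- rnorm(n)
w <- rbinom(n, 1, plogis(0.5 * x))  # treatment more likely at high x
y0 <- 1 + x + rnorm(n)
y1 <- y0 + 2 + x                    # unit-level effect 2 + x, so ATE = 2
y <- ifelse(w == 1, y1, y0)
d <- data.frame(y, w, x)

# Fit mu_0 on the controls and mu_1 on the treated, then predict for everyone.
fit0 <- lm(y ~ x, data = d[d$w == 0, ])
fit1 <- lm(y ~ x, data = d[d$w == 1, ])
mu0_hat <- predict(fit0, newdata = d)
mu1_hat <- predict(fit1, newdata = d)

# ATE: average imputed contrast over the whole sample.
tau_ate <- mean(mu1_hat - mu0_hat)
# ATT: observed treated outcomes minus their imputed untreated outcomes.
tau_att <- mean(d$y[d$w == 1] - mu0_hat[d$w == 1])
c(ATE = tau_ate, ATT = tau_att)
```

In this simulation the effect increases with `x` and treated units tend to have higher `x`, so the ATT estimate exceeds the ATE estimate.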

6.4 Linear RA

Regression adjustment is basically an imputation estimator. We observe $E[Y|W=1,X]$, but $E[Y(0)|W=1,X]$ must be modelled, based on unconfoundedness and a functional form (say, linear). We estimate $\beta_0$ on the control sample, then predict the untreated outcome for each treated unit at its value of $X$.

Linear RA is easy to implement: regress $Y$ on $W$, $X$ and their interaction. For the coefficient on $W$ to be the average effect, $X$ must be de-meaned before it is interacted with $W$.

library(hdm)    # provides the pension data
library(dplyr)  # %>% and mutate()
data(pension)
#  for ATE, de-mean X by deducting the mean of X in the whole sample:
#  data <- pension %>% mutate(inc_dm=inc-mean(inc), age_dm=age-mean(age))
#  for ATT (used here), de-mean X by deducting the mean of X in the treated group:
data <- pension %>%
  # Use the in-mutate columns inc/age rather than pension$inc/pension$age, so the
  # treated-group means are recomputed if this runs on a resampled copy of the data.
  mutate(inc_dm=inc-mean(inc[p401==1]), age_dm=age-mean(age[p401==1]))
# Fully interacted regression: the coefficient on p401 is the ATT.
lm_ra <- lm(net_tfa ~ p401*(inc_dm + age_dm),data=data)
summary(lm_ra)


Call:
lm(formula = net_tfa ~ p401 * (inc_dm + age_dm), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-508262  -17176   -3497    9223 1444086 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.410e+04  8.312e+02  28.998  < 2e-16 ***
p401        1.416e+04  1.397e+03  10.134  < 2e-16 ***
inc_dm      7.723e-01  3.009e-02  25.670  < 2e-16 ***
age_dm      7.977e+02  6.350e+01  12.561  < 2e-16 ***
p401:inc_dm 3.300e-01  5.133e-02   6.429 1.34e-10 ***
p401:age_dm 8.153e+02  1.333e+02   6.118 9.81e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 57210 on 9909 degrees of freedom
Multiple R-squared:  0.1894,    Adjusted R-squared:  0.1889 
F-statistic: 462.9 on 5 and 9909 DF,  p-value: < 2.2e-16

#coeftest(lm_ra, vcov=vcovHC(lm_ra, type='HC2'))[2,]

After demeaning the covariates, we include all interactions of demeaned covariates with treatment. Because we de-meaned at the treated-group means, the coefficient on $W$ is the average treatment effect on the treated (ATT); de-meaning at the whole-sample means instead would give the ATE.

6.4.1 Questions on RA

1. Why do we have to interact $W$ and $X$? What if we don’t?
2. What if we don’t de-mean $X$?

Answers:

1. It has been shown (Słoczyński 2022) that with heterogeneous treatment effects and unequal shares of treated and control units, the regression needs the interaction terms. Without them, the coefficient on $W$ is in general a biased estimate of both $\tau_{ATT}$ and $\tau_{ATE}$. The intuition is that the difference between treated and control outcomes is not constant across $X$, so the model must let it vary with $X$.
2. If we don’t de-mean $X$, the coefficient on $W$ is the treatment effect at $X=0$, not the average treatment effect. We can still recover the ATT or ATE by computing the average marginal effect of $W$ over the relevant sample.
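Both answers can be checked numerically. The sketch below uses simulated data (an assumption for illustration, not the pension data) to show that the coefficient on $W$ after centring at the treated mean equals the average marginal effect from the uncentred fit, and that dropping the interaction gives a different number.

```r
set.seed(4)
n <- 4000
x <- rnorm(n, mean = 1)
w <- rbinom(n, 1, plogis(x - 1.5))
y <- 1 + x + w * (2 + x) + rnorm(n)  # unit-level effect 2 + x

x_att <- x - mean(x[w == 1])         # centred at the treated-group mean
fit_dm  <- lm(y ~ w * x_att)         # coefficient on w is the ATT
fit_raw <- lm(y ~ w * x)             # coefficient on w is the effect at x = 0

# Average marginal effect of w over the treated, from the uncentred fit.
b <- coef(fit_raw)
ame_att <- unname(b["w"] + b["w:x"] * mean(x[w == 1]))

# Without the interaction, OLS imposes one constant effect.
fit_noint <- lm(y ~ w + x)

c(demeaned = unname(coef(fit_dm)["w"]),
  marginal = ame_att,
  raw_w = unname(b["w"]),
  no_interaction = unname(coef(fit_noint)["w"]))
```

The `demeaned` and `marginal` entries agree up to floating-point error because centring is only a reparametrisation of the same fitted model; `raw_w` is the effect at `x = 0` and is not an average effect.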

6.5 IPW

Under unconfoundedness and overlap,

\[ \small \begin{align} E[Y(w)] &= E[E[Y|W=w,X]] \\ &= \sum_x (\sum_y y P(y|w,x)) P(x) \\ &= \sum_x \sum_y y P(y|w,x) P(x) {\frac{P(w|x)}{P(w|x)}} \\ &= \sum_x \sum_y y P(y,w,x) {\frac{1}{P(w|x)}} \\ &= \sum_x E[\mathbb{1}(W=w,X=x)Y] {\frac{1}{P(w|x)}} \\ &= E[\frac{\mathbb{1}(W=w)Y} {P(w|X)}] \end{align} \]

The IPW equation means that the expected value of the potential outcome is the weighted average of the observed outcome, where the weight is the inverse of the propensity score $P(w|x)$. $\mathbb{1}(W=w)$ is an indicator function, which is 1 if $W=w$ and 0 otherwise.

\[ \begin{align} \tau_{ATE} &= E[Y(1)] - E[Y(0)] \\ &= E[\frac{W Y} {\pi(X)}] - E[\frac{(1-W) Y} {1-\pi(X)}] \\ \end{align} \]

The sample estimators, where $\pi(X) \equiv P(W=1|X)$ is the propensity score and $\hat\pi$ its estimate, are:

\[ \begin{align} \hat \tau_{ATE, IPW} &= \frac{1}{N} \sum_i \frac{W_i Y_i} {\hat\pi(X_i)} - \frac{1}{N} \sum_i \frac{(1-W_i) Y_i} {1-\hat \pi(X_i)} \end{align} \]

\[ \begin{align} \hat \tau_{ATT, IPW} &= \bar Y_1 - \frac{1}{N_1} \sum_i \frac{\hat\pi(X_i)}{1-\hat \pi(X_i)} (1-W_i) Y_i \end{align} \]

For the ATT, control units are weighted by the odds $\hat\pi(X_i)/(1-\hat\pi(X_i))$ rather than $1/(1-\hat\pi(X_i))$: this reweights the controls to the covariate distribution of the treated (one can check $E[\frac{\pi(X)}{1-\pi(X)}(1-W)Y] = P(W=1)\, E[Y(0)|W=1]$).
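The two sample estimators translate almost line by line into code. This is a sketch on simulated data with a logit propensity model that is correct by construction; all names and parameter values are illustrative assumptions.

```r
set.seed(5)
n <- 5000
x <- rnorm(n)
w <- rbinom(n, 1, plogis(0.5 * x))
y <- 1 + x + w * (2 + x) + rnorm(n)  # ATE = 2, ATT = 2 + E[x | w = 1]

# Estimated propensity score pi_hat(X_i) from a logit.
ps <- fitted(glm(w ~ x, family = binomial()))

# ATE: inverse-probability-weighted means of treated and control outcomes.
tau_ate <- mean(w * y / ps) - mean((1 - w) * y / (1 - ps))

# ATT: treated mean minus odds-weighted control outcomes, divided by N_1.
tau_att <- mean(y[w == 1]) - sum(ps / (1 - ps) * (1 - w) * y) / sum(w)
c(ATE = tau_ate, ATT = tau_att)
```

In practice, estimated propensities close to 0 or 1 produce very large weights, so it is worth inspecting `range(ps)` before trusting either estimate.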

6.5.1 IPW Intuition

IPW removes confounding by $X$ by creating a pseudo-population in which $X$ is independent of $W$: each observation is weighted by the inverse of the probability of the treatment it actually received. Once $X$ is independent of $W$, $X$ is no longer a confounder, and we can estimate the effect of $W$ on $Y$ by comparing the weighted averages of $Y$ for $W=1$ and $W=0$.

Suppose there are two types of people: one with a high probability of being treated, the other with a low probability. Without adjusting for the propensity score, the treated group is dominated by the first type and the control group by the second, so the raw difference between the groups reflects their different composition as well as the treatment effect. Weighting by the inverse of the propensity score creates a pseudo-population in which the two groups have the same composition, so the remaining difference is due to the treatment.
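The two-type story can be made concrete with expected counts. The numbers below are hypothetical: 1000 people of each type, with treatment probabilities 0.8 and 0.2.

```r
counts <- data.frame(
  type = c("A", "A", "B", "B"),
  w    = c(1, 0, 1, 0),
  n    = c(800, 200, 200, 800),  # expected counts with 1000 people per type
  ps   = c(0.8, 0.8, 0.2, 0.2)   # P(W = 1 | type)
)

# IPW weight: 1/ps for treated units, 1/(1 - ps) for controls.
counts$weight <- ifelse(counts$w == 1, 1 / counts$ps, 1 / (1 - counts$ps))
# Size of each cell in the weighted pseudo-population.
counts$pseudo_n <- counts$n * counts$weight
counts
```

Every cell is reweighted to a pseudo-count of 1000, so in the pseudo-population both the treated and the control arm are half type A and half type B, and type no longer predicts treatment.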

6.6 Doubly Robust Estimators

We can model both the outcome and the treatment assignment, and use both models to estimate the treatment effect. The estimator is called doubly robust because it is consistent if either the outcome regression or the propensity score model is correctly specified: we need only one of the two to be right, not both. It is not guaranteed to be consistent when both are misspecified, so the protection covers misspecification of one model at a time.

6.6.1 AIPW

\[ \psi_1^{AIPW} = E[\frac{W(Y-\mu_1(X))}{\pi(X)} + \mu_1(X)] \]

\[ \psi_0^{AIPW} = E[\frac{(1-W)(Y-\mu_0(X))}{1-\pi(X)} + \mu_0(X)] \]

Here $\mu_w(X)=E[Y|W=w,X]$ is the outcome-model nuisance function, while the scalar $\psi_w^{AIPW}$ equals $E[Y(w)]$ as long as either the outcome model or the propensity score is correctly specified; the ATE is $\psi_1^{AIPW}-\psi_0^{AIPW}$.
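To see where double robustness comes from, it can help to write out the bias of $\psi_1^{AIPW}$ when working models $\tilde\mu_1$ and $\tilde\pi$ replace the true $\mu_1$ and $\pi$ (the tilde notation is introduced here for illustration only). A sketch, assuming unconfoundedness and consistency so that $E[WY|X]=\pi(X)\mu_1(X)$ and $E[Y(1)]=E[\mu_1(X)]$:

\[ \begin{align} E\left[\frac{W(Y-\tilde\mu_1(X))}{\tilde\pi(X)} + \tilde\mu_1(X)\right] - E[Y(1)] &= E\left[\frac{\pi(X)}{\tilde\pi(X)}\,(\mu_1(X)-\tilde\mu_1(X)) - (\mu_1(X) - \tilde\mu_1(X))\right] \\ &= E\left[\frac{\pi(X)-\tilde\pi(X)}{\tilde\pi(X)}\,(\mu_1(X)-\tilde\mu_1(X))\right] \end{align} \]

The bias is the expectation of a product of the two model errors, so it vanishes if either $\tilde\mu_1=\mu_1$ or $\tilde\pi=\pi$. The same argument applies to $\psi_0^{AIPW}$ with $1-\pi$ in place of $\pi$.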

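library(haven)  # read_dta()
library(dplyr)  # %>%, mutate(), summarise()
data <- read_dta(file="https://www.stata-press.com/data/r18/cattaneo2.dta")
# Outcome model: fully interacted, so mu_1 and mu_0 have separate coefficients
mu <- lm(bweight ~ mbsmoke*(prenatal1 + mmarried + mage + fbaby), data=data)
# Propensity score model (named ps_fit to avoid masking R's built-in constant pi)
ps_fit <- glm( mbsmoke ~ mmarried + mage + fbaby + medu, data=data, family=binomial(link="logit"))
# Predicted outcomes with everyone treated (mm1) and everyone untreated (mm0)
mm1 <- predict(mu, newdata=data %>% mutate(mbsmoke=1))
mm0 <- predict(mu, newdata=data %>% mutate(mbsmoke=0))
pi1 <- predict(ps_fit, newdata=data, type="response")
data <- data %>%
  mutate(w=mbsmoke, y=bweight, m1=mm1, m0=mm0, pi1=pi1)
# AIPW scores psi1 and psi0; the mean of their difference is the ATE estimate
data2 <- data %>%
  mutate(psi1=(w*(y-m1)/pi1 + m1), psi0=((1-w)*(y-m0)/(1-pi1) + m0)) %>%
  mutate(tau=psi1-psi0)
data2 %>% summarise(mean(tau))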

# A tibble: 1 × 1
  `mean(tau)`
        <dbl>
1       -234.

6.6.2 IPWRA

IPWRA combines RA and IPW: it is regression adjustment in which the outcome regressions are weighted by the inverse probability weights.

Implementation:

1. Estimate the propensity score and get the predicted probability of treatment for each observation.
2. Estimate the outcome model with the IPW weights: two separate equations, one for the treated (weights $1/\hat \pi(X)$) and one for the controls (weights $1/(1-\hat \pi(X))$). Equivalently, fit one weighted regression with full treatment–covariate interaction: the interacted design matrix is block-diagonal across the two groups, so the weighted normal equations decouple and the two routes give identical fits.
3. Get predicted values with everyone set to treated, and with everyone set to control. These are the two predicted potential outcomes.
4. Take the difference for each observation. Its average is the ATE.
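The claimed equivalence between the two-equation route and the single interacted regression can be verified numerically. The sketch below does so on simulated data; the data-generating process and variable names are assumptions for illustration.

```r
set.seed(6)
n <- 3000
x <- rnorm(n)
w <- rbinom(n, 1, plogis(0.5 * x))
y <- 1 + x + w * (2 + x) + rnorm(n)  # true ATE is 2
d <- data.frame(y, w, x)

# Propensity score and the matching IPW weights.
d$ps  <- fitted(glm(w ~ x, family = binomial(), data = d))
d$ipw <- ifelse(d$w == 1, 1 / d$ps, 1 / (1 - d$ps))

# Route A: separate weighted regressions for each treatment arm.
fit1 <- lm(y ~ x, data = d[d$w == 1, ], weights = ipw)
fit0 <- lm(y ~ x, data = d[d$w == 0, ], weights = ipw)
tau_sep <- mean(predict(fit1, newdata = d) - predict(fit0, newdata = d))

# Route B: one weighted regression with full treatment-covariate interaction.
fit_int <- lm(y ~ w * x, data = d, weights = ipw)
tau_int <- mean(predict(fit_int, newdata = transform(d, w = 1)) -
                predict(fit_int, newdata = transform(d, w = 0)))

c(separate = tau_sep, interacted = tau_int)
```

The two point estimates coincide up to floating-point error. The reported standard errors from the two fits need not match, since the pooled regression estimates a single residual variance.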

library(haven)  # read_dta()
library(dplyr)  # %>%, mutate(), filter(), summarise()
data <- read_dta(file="https://www.stata-press.com/data/r18/cattaneo2.dta")
# Propensity score model (named ps_fit to avoid masking R's built-in constant pi)
ps_fit <- glm( mbsmoke ~ mmarried + mage + fbaby + medu, data=data, family=binomial(link="logit"))
pi1 <- predict(ps_fit, newdata=data, type="response")
data <- data %>%
  mutate(pi1=pi1)
# Weighted outcome regressions, fitted separately by treatment arm
data1 <- data %>% filter(mbsmoke==1)
data0 <- data %>% filter(mbsmoke==0)
mu1 <- lm(bweight ~ prenatal1 + mmarried + mage + fbaby, data=data1, weights=1/data1$pi1)
mu0 <- lm(bweight ~ prenatal1 + mmarried + mage + fbaby, data=data0, weights=1/(1-data0$pi1))
# Predict both potential outcomes for every observation
# (mbsmoke is not a regressor in mu1/mu0, so newdata needs no modification)
mm1 <- predict(mu1, newdata=data)
mm0 <- predict(mu0, newdata=data)
# Average the difference to get the ATE estimate
data2 <- data %>%
  mutate(m1=mm1, m0=mm0, tau=m1-m0)
data2 %>% summarise(mean(tau))

# A tibble: 1 × 1
  `mean(tau)`
        <dbl>
1       -233.
