using CairoMakie
using GraphMakie, Graphs
using Distributions, Random, LinearAlgebra
using RDRobust
using Panelest
using Vcov
using DataFrames
using StatsModels, DataFramesMeta
import StatsAPI: coeftable
using Printf
using RegressionTables16 IV & Regression Discontinuity
If we have an unobserved confounder, unconfoundedness fails. Even when we have random experiment, we could have noncompliance. In that case, subjects are randomly assigned to treatment or control, but some of them do not comply with the assignment. In other words, there is selection bias in the sense that the treated group is self-selected. This selection could be correlated with the outcome.
16.1 Instrumental Variables
16.1.1 Uncontrolled Confounder
16.1.2 Why Does IV Work?
16.1.3 IV Under Potential Outcomes
- W has two potential outcomes W(1) and W(0), as a function of Z.
- Y has two potential outcomes Y(1) and Y(0), as a function of W.
Intention to Treat (ITT) is the average treatment effect of the treatment assignment Z. It is the difference between the average potential outcome under treatment and the average potential outcome under control. It is the average treatment effect of the treatment assignment Z, regardless of whether the subject complies with the assignment.
There are four possible groups of subjects in the case of binary treatment and binary instrument. For example treatment assignment Z and treated status W could be:
- Z=0, W=0: never-taker
- Z=0, W=1: complier
- Z=1, W=0: defier
- Z=1, W=1: always-taker
| W(1)=0 | W(1)=1 | |
|---|---|---|
| W(0)=0 | N | C |
| W(0)=1 | D | A |
These four groups (stratum) have different causal mechanisms. For example, the compliers are the only group that is affected by the treatment assignment Z in the same direction. The defiers are the only group that is affected by the treatment assignment Z in the opposite direction. The always-takers and never-takers are not affected by the treatment assignment Z.
Unfortunately we cannot identify the compliance group by looking at the data.
| Z | W | G |
|---|---|---|
| 0 | 0 | C, N |
| 0 | 1 | A, D |
| 1 | 0 | N, D |
| 1 | 1 | C, A |
16.1.4 Assumptions
- Exclusion restriction: \(Y(w, 1) = Y(w,0)\)
- Exogeneity of instrument: \(Z \perp [W(0),W(1), Y(0), Y(1)]\)
- Monotonicity (no defiers): \(W(0) \leq W(1)\)
16.1.5 LATE Identification
Since we excluded defiers, and the other two groups (always-takers and never-takers) are not affected by the treatment assignment Z. The only effect we can identify is the treatment effect on the compliers. This is called the Local Average Treatment Effect (LATE).
\[ \small \begin{aligned} & E[Y(Z=1) - Y(Z=0) | G = C] \\ &= \frac{E[Y(Z=1) - Y(Z=0)]}{P(W(1)=1, W(0)=0)} \\ &= \frac{E[Y|Z=1] - E[Y|Z=0]}{1-P(W=0|Z=1)-P(W=1|Z=0)} \text{ (rule out groups A and N) } \\ &= \frac{E[Y|Z=1] - E[Y|Z=0]}{P(W=1|Z=1)-P(W=1|Z=0)} \\ &= \frac{E[Y|Z=1] - E[Y|Z=0]}{E[W|Z=1]-E[W|Z=0]} \end{aligned} \]
16.1.6 Parametric Models
If we assume linearity, then this becomes the ratio of two OLS coefficients, sometimes called the Wald estimator.
\[ \begin{aligned} \tau_{ATE} &= \frac{E[Y|Z=1] - E[Y|Z=0]}{E[W|Z=1]-E[W|Z=0]} \\ &= \frac{Cov(Y,Z)}{Cov(W,Z)} \end{aligned} \]
Why do covariates help? Because we hope even if we don’t have exogeneity or exclusion assumption held, we would have those assumptions held within each level of \(X\).
16.1.7 Example
Random.seed!(66)
nobs = 10000
# Here we have three variables w, m, z.
# m is the omitted variable, w and m are correlated.
# z is the instrument, which is correlated with w, but not m.
# u is independent of everything else.
covarMat = [
1.0 0.36 0.64;
0.36 1.0 0.0;
0.64 0.0 1.0
]
mvn = MvNormal(zeros(3), covarMat)
mat = rand(mvn, nobs)'
data = DataFrame(w = mat[:, 1], m = mat[:, 2], z = mat[:, 3])
data.u = rand(Normal(0, 1), nobs)
# DGP
data.y = data.w .+ data.m .+ data.u
lm_biased = feols(data, @formula(y ~ w))
lm_full = feols(data, @formula(y ~ w + m))
# Manual 2SLS explicitly (educational)
# Stage 1: Predict endogenous variable w using instrument z
stage1 = feols(data, @formula(w ~ z))
data.w_hat = data.w .- stage1.residuals # Obtain fitted values
# Stage 2: Regress outcome y on predicted w_hat
tsls_manual = feols(data, @formula(y ~ w_hat));# Biased OLS
println("── Biased OLS (w only) ──────────────────────")
show_regression_html(lm_biased)
# Full OLS is good
println("\n── Full OLS (w + m) ─────────────────────────")
show_regression_html(lm_full)
# Manual 2SLS is good
# Note: the coefficients from manual 2SLS are correct, but standard
# errors are slightly incorrect because they use residuals from w_hat,
# not the true w.
println("\n── Manual 2SLS ──────────────────────────────")
show_regression_html(tsls_manual)── Biased OLS (w only) ──────────────────────
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.012 | 0.010 | -1.164 | 0.244 |
| w | 1.357 | 0.010 | 135.823 | < 2e-16 *** |
| --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 |
||||
── Full OLS (w + m) ─────────────────────────
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.009 | 0.010 | -0.915 | 0.360 |
| w | 0.998 | 0.011 | 93.176 | < 2e-16 *** |
| m | 0.998 | 0.011 | 93.439 | < 2e-16 *** |
| --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 |
||||
── Manual 2SLS ──────────────────────────────
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.006 | 0.010 | -0.577 | 0.564 |
| w_hat | 1.000 | 0.016 | 64.426 | < 2e-16 *** |
| --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 |
||||
16.1.8 Control Function Approach
The manual 2SLS above implicitly implements the control function (CF) approach. Understanding why makes the extension to nonlinear models (next chapter) transparent.
Algebraic motivation. Because w is endogenous, decompose the structural error:
\[\varepsilon = \rho v + \eta, \qquad v \equiv w - E[w \mid z], \qquad E(z\eta)=0,\; E(zv)=0.\]
If we knew \(v\), controlling for it directly would eliminate the confounding channel. We don’t know \(v\), but the first-stage residual \(\hat{v} = w - \hat{w}\) is a consistent estimate. Substituting into the structural equation:
\[y = \alpha + \beta w + \rho\,\hat{v} + \text{error}.\]
OLS on this augmented second stage gives a consistent \(\hat\beta\) — identical to 2SLS by the Frisch–Waugh–Lovell (FWL) theorem. The coefficient \(\hat\rho\) on \(\hat{v}\) doubles as a Durbin–Wu–Hausman (DWH) endogeneity test: under \(H_0\) (w is exogenous), \(\rho = 0\).
import StatsBase: coef, stderror, coefnames
# Control function: add first-stage residual v̂ to the second stage OLS
data.v_hat = stage1.residuals
lm_cf = feols(data, @formula(y ~ w + v_hat))
# Compare point estimates from manual 2SLS and CF — should be identical by FWL
coef_tsls = coef(tsls_manual)[findfirst(==("w_hat"), coefnames(tsls_manual))]
coef_cf = coef(lm_cf)[findfirst(==("w"), coefnames(lm_cf))]
@printf("Manual 2SLS β̂_w = %.8f\n", coef_tsls)
@printf("Control funct β̂_w = %.8f\n", coef_cf)
# DWH test: t-statistic on v̂
rho_hat = coef(lm_cf)[findfirst(==("v_hat"), coefnames(lm_cf))]
rho_se = stderror(lm_cf)[findfirst(==("v_hat"), coefnames(lm_cf))]
@printf("\nDWH endogeneity test (ρ̂ on v̂): coef = %.4f se = %.4f t = %.3f\n",
rho_hat, rho_se, rho_hat / rho_se)Manual 2SLS β̂_w = 0.99987086
Control funct β̂_w = 0.99987086
DWH endogeneity test (ρ̂ on v̂): coef = 0.6100 se = 0.0203 t = 30.077
The two \(\hat\beta_w\) values agree to machine precision, confirming FWL. The large \(|t|\) on \(\hat{v}\) rejects \(H_0 : \rho = 0\), confirming that w is endogenous.
Why FWL fails in nonlinear models. FWL relies on the residual-maker matrix being idempotent and the second stage being linear — both fail once we replace OLS with Poisson or Probit. In a nonlinear second stage, \(\hat{v}\) must enter inside the link function (e.g. \(\exp(\mathbf{x}\boldsymbol\beta + \rho\hat{v})\)), which requires a structural assumption beyond instrument exogeneity. This is the subject of the next chapter.
16.2 Regression Discontinuity Design
RDD (regression discontinuity design) is a quasi-experimental design that is used to estimate causal effects of interventions when assignment to the intervention is determined by whether a subject’s value on an observed covariate exceeds a threshold. The idea is that the assignment is as good as random, so we can estimate the causal effect of the intervention by comparing the outcomes of subjects who are just above and just below the threshold.
16.2.2 Identification of SRDD
\[ \tau_{RD} = \lim_{x \to c^+} E[Y|X=x] - \lim_{x \to c^-} E[Y|X=x] \]
In words, the treatment effect is the difference in the outcome from the right side and from the left side. This is the same as the difference in the outcome at the threshold, as if the treatment is assigned randomly.
16.2.3 Estimation of SRDD
GPA = rand(Uniform(0, 4), 1000)
future_success = 10 .+ 1.5 .* GPA .+ 2 .* (GPA .<= 3) .+ rand(Normal(0, 1), 1000)
plot_data_gpa = rdplot(future_success, GPA, c=3.0)
fig2 = Figure()
ax2 = Axis(fig2[1, 1], xlabel="GPA", ylabel="Future Success")
scatter!(ax2, plot_data_gpa.vars_bins.rdplot_mean_x, plot_data_gpa.vars_bins.rdplot_mean_y,
color=:gray, markersize=6)
poly_gpa_l = filter(r -> r.rdplot_x < 3.0, plot_data_gpa.vars_poly)
lines!(ax2, poly_gpa_l.rdplot_x, poly_gpa_l.rdplot_y, color=:blue, linewidth=2)
poly_gpa_r = filter(r -> r.rdplot_x >= 3.0, plot_data_gpa.vars_poly)
lines!(ax2, poly_gpa_r.rdplot_x, poly_gpa_r.rdplot_y, color=:blue, linewidth=2)
vlines!(ax2, [3.0], color=:black, linestyle=:dash)
fig2# Estimate the sharp RDD model
rdd_gpa = rdrobust(future_success, GPA, c=3.0)
@printf "Tau (bias-corrected) = %.4g\n" rdd_gpa.Estimate.tau_us[1]
@printf "P-value (robust) = %.4f\n" rdd_gpa.pv.PValue[3]Tau (bias-corrected) = -1.657
P-value (robust) = 0.0000
16.2.4 Nonparametric Estimation
rdd_house = rdrobust(senate.vote, senate.margin, c=0.0)
@printf "Tau (bias-corrected) = %.4g\n" rdd_house.Estimate.tau_us[1]
@printf "P-value (robust) = %.4f\n" rdd_house.pv.PValue[3]Tau (bias-corrected) = 7.414
P-value (robust) = 0.0000
16.2.5 Fuzzy RDD
In fuzzy RDD, the treatment is assigned according to a threshold \(c\), but there exist non-compliance. This is similar to IV case.
\[ \tau_{FRD} = \frac{\lim_{x \to c^+} E[Y|X=x] - \lim_{x \to c^-} E[Y|X=x]}{\lim_{x \to c^+} E[W|X=x] - \lim_{x \to c^-} E[W|X=x]} \]
For example, if food stamp eligibility is given to all households below a certain income, but not all households receive the food stamps. In other words, income does not solely determine the assignment. Cutoff point increases the probability of treatment but doesn’t completely determine treatment.
FRDD is IV.
16.2.6 FRDD Example
Fetter (2013)’s main question of interest is how much of the increase in the home ownership rate in the midcentury US was due to mortgage subsidies given out by the government.
We’re using the running variable quarter of birth (qob), which has been centered on the quarter of birth you’d need to be to be eligible for a mortgage subsidy for fighting in the Korean War (qob_minus_kw). This determines whether you were a veteran of either the Korean War or World War II (vet_wwko).
using CSV
# Load the real mortgages dataset (Fetter 2013) from causaldata R package
vet = CSV.read("data/mortgages.csv", DataFrame)
# Create an "above-cutoff" variable as the instrument
vet.above = vet.qob_minus_kw .> 0
# Impose a bandwidth of 12 quarters on either side
vet = vet[abs.(vet.qob_minus_kw) .< 12, :]
# Manual 2SLS explicitly (educational)
# We have Fixed Effects for state (bpl) and quarter of birth (qob).
# Endogenous variable: vet_wwko and qob_minus_kw * vet_wwko
# Instruments: above and qob_minus_kw * above
vet.vet_inter = vet.qob_minus_kw .* vet.vet_wwko
vet.above_inter = vet.qob_minus_kw .* vet.above
# Stage 1a: Predict vet_wwko
stage1a = feols(vet, @formula(vet_wwko ~ nonwhite + qob_minus_kw + above + above_inter + fe(bpl) + fe(qob)))
vet.vet_wwko_hat = vet.vet_wwko .- stage1a.residuals
# Stage 1b: Predict vet_inter
stage1b = feols(vet, @formula(vet_inter ~ nonwhite + qob_minus_kw + above + above_inter + fe(bpl) + fe(qob)))
vet.vet_inter_hat = vet.vet_inter .- stage1b.residuals
# Stage 2: Regress home_ownership on fitted endogenous variables
# Note: we use RobustVcov for heteroskedasticity-robust SEs
m_iv = feols(vet, @formula(home_ownership ~ nonwhite + qob_minus_kw + vet_wwko_hat + vet_inter_hat + fe(bpl) + fe(qob)), vcov=Vcov.robust())
show_regression_html(m_iv)| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| nonwhite | -0.190 | 0.007 | -27.951 | < 2e-16 *** |
| qob_minus_kw | -0.007 | 0.002 | -4.070 | 4.707e-05 *** |
| vet_wwko_hat | 0.170 | 0.046 | 3.733 | 1.894e-04 *** |
| vet_inter_hat | -0.003 | 0.003 | -1.092 | 0.275 |
| --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 |
||||
# Fuzzy RDD using rdrobust (local polynomial, MSE-optimal bandwidth)
# Note: rdrobust uses a data-driven bandwidth — no global FE needed
m_rdd = rdrobust(vet.home_ownership,
vet.qob_minus_kw,
fuzzy = vet.vet_wwko,
c = 0.0)
@printf "Manual 2SLS tau = %.4g\n" m_iv.beta[3]
@printf "RDRobust (FRDD) tau = %.4g\n" m_rdd.Estimate.tau_us[1]Manual 2SLS tau = 0.1702
RDRobust (FRDD) tau = 1.879
These two results are very different, which is one of RDD’s problems. It can be sensitive to model selection. For example, in the case of 2SLS, it’s a linear model. In the case of rdrobust, it’s a local regression, which can be sensitive to bandwidth selection.