16  IV & Regression Discontinuity

using CairoMakie
using GraphMakie, Graphs
using Distributions, Random, LinearAlgebra
using RDRobust
using Panelest
using Vcov
using DataFrames
using StatsModels, DataFramesMeta
import StatsAPI: coeftable
using Printf
using RegressionTables

If we have an unobserved confounder, unconfoundedness fails. Even when we have random experiment, we could have noncompliance. In that case, subjects are randomly assigned to treatment or control, but some of them do not comply with the assignment. In other words, there is selection bias in the sense that the treated group is self-selected. This selection could be correlated with the outcome.

16.1 Instrumental Variables

16.1.1 Uncontrolled Confounder

16.1.2 Why Does IV Work?

16.1.3 IV Under Potential Outcomes

  • W has two potential outcomes W(1) and W(0), as a function of Z.
  • Y has two potential outcomes Y(1) and Y(0), as a function of W.

Intention to Treat (ITT) is the average treatment effect of the treatment assignment Z. It is the difference between the average potential outcome under treatment and the average potential outcome under control. It is the average treatment effect of the treatment assignment Z, regardless of whether the subject complies with the assignment.

There are four possible groups of subjects in the case of binary treatment and binary instrument. For example treatment assignment Z and treated status W could be:

  • Z=0, W=0: never-taker
  • Z=0, W=1: complier
  • Z=1, W=0: defier
  • Z=1, W=1: always-taker
compliance groups
W(1)=0 W(1)=1
W(0)=0 N C
W(0)=1 D A

These four groups (stratum) have different causal mechanisms. For example, the compliers are the only group that is affected by the treatment assignment Z in the same direction. The defiers are the only group that is affected by the treatment assignment Z in the opposite direction. The always-takers and never-takers are not affected by the treatment assignment Z.

Unfortunately we cannot identify the compliance group by looking at the data.

Z W G
0 0 C, N
0 1 A, D
1 0 N, D
1 1 C, A

16.1.4 Assumptions

  • Exclusion restriction: \(Y(w, 1) = Y(w,0)\)
  • Exogeneity of instrument: \(Z \perp [W(0),W(1), Y(0), Y(1)]\)
  • Monotonicity (no defiers): \(W(0) \leq W(1)\)

16.1.5 LATE Identification

Since we excluded defiers, and the other two groups (always-takers and never-takers) are not affected by the treatment assignment Z. The only effect we can identify is the treatment effect on the compliers. This is called the Local Average Treatment Effect (LATE).

\[ \small \begin{aligned} & E[Y(Z=1) - Y(Z=0) | G = C] \\ &= \frac{E[Y(Z=1) - Y(Z=0)]}{P(W(1)=1, W(0)=0)} \\ &= \frac{E[Y|Z=1] - E[Y|Z=0]}{1-P(W=0|Z=1)-P(W=1|Z=0)} \text{ (rule out groups A and N) } \\ &= \frac{E[Y|Z=1] - E[Y|Z=0]}{P(W=1|Z=1)-P(W=1|Z=0)} \\ &= \frac{E[Y|Z=1] - E[Y|Z=0]}{E[W|Z=1]-E[W|Z=0]} \end{aligned} \]

16.1.6 Parametric Models

If we assume linearity, then this becomes the ratio of two OLS coefficients, sometimes called the Wald estimator.

\[ \begin{aligned} \tau_{ATE} &= \frac{E[Y|Z=1] - E[Y|Z=0]}{E[W|Z=1]-E[W|Z=0]} \\ &= \frac{Cov(Y,Z)}{Cov(W,Z)} \end{aligned} \]

Why do covariates help? Because we hope even if we don’t have exogeneity or exclusion assumption held, we would have those assumptions held within each level of \(X\).

16.1.7 Example

Random.seed!(66)
nobs = 10000

# Here we have three variables w, m, z.
# m is the omitted variable, w and m are correlated.
# z is the instrument, which is correlated with w, but not m.
# u is independent of everything else.
covarMat = [
    1.0   0.36  0.64;
    0.36  1.0   0.0;
    0.64  0.0   1.0
]

mvn = MvNormal(zeros(3), covarMat)
mat = rand(mvn, nobs)'

data = DataFrame(w = mat[:, 1], m = mat[:, 2], z = mat[:, 3])
data.u = rand(Normal(0, 1), nobs)

# DGP
data.y = data.w .+ data.m .+ data.u

lm_biased = feols(data, @formula(y ~ w))
lm_full = feols(data, @formula(y ~ w + m))

# Manual 2SLS explicitly (educational)
# Stage 1: Predict endogenous variable w using instrument z
stage1 = feols(data, @formula(w ~ z))
data.w_hat = data.w .- stage1.residuals # Obtain fitted values

# Stage 2: Regress outcome y on predicted w_hat
tsls_manual = feols(data, @formula(y ~ w_hat));
# Biased OLS
println("── Biased OLS (w only) ──────────────────────")
show_regression_html(lm_biased)

# Full OLS is good
println("\n── Full OLS (w + m) ─────────────────────────")
show_regression_html(lm_full)

# Manual 2SLS is good
# Note: the coefficients from manual 2SLS are correct, but standard
# errors are slightly incorrect because they use residuals from w_hat,
# not the true w.
println("\n── Manual 2SLS ──────────────────────────────")
show_regression_html(tsls_manual)
── Biased OLS (w only) ──────────────────────
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.012 0.010 -1.164 0.244
w 1.357 0.010 135.823 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

── Full OLS (w + m) ─────────────────────────
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.009 0.010 -0.915 0.360
w 0.998 0.011 93.176 < 2e-16 ***
m 0.998 0.011 93.439 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

── Manual 2SLS ──────────────────────────────
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.006 0.010 -0.577 0.564
w_hat 1.000 0.016 64.426 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

16.1.8 Control Function Approach

The manual 2SLS above implicitly implements the control function (CF) approach. Understanding why makes the extension to nonlinear models (next chapter) transparent.

Algebraic motivation. Because w is endogenous, decompose the structural error:

\[\varepsilon = \rho v + \eta, \qquad v \equiv w - E[w \mid z], \qquad E(z\eta)=0,\; E(zv)=0.\]

If we knew \(v\), controlling for it directly would eliminate the confounding channel. We don’t know \(v\), but the first-stage residual \(\hat{v} = w - \hat{w}\) is a consistent estimate. Substituting into the structural equation:

\[y = \alpha + \beta w + \rho\,\hat{v} + \text{error}.\]

OLS on this augmented second stage gives a consistent \(\hat\beta\) — identical to 2SLS by the Frisch–Waugh–Lovell (FWL) theorem. The coefficient \(\hat\rho\) on \(\hat{v}\) doubles as a Durbin–Wu–Hausman (DWH) endogeneity test: under \(H_0\) (w is exogenous), \(\rho = 0\).

import StatsBase: coef, stderror, coefnames

# Control function: add first-stage residual v̂ to the second stage OLS
data.v_hat = stage1.residuals

lm_cf = feols(data, @formula(y ~ w + v_hat))

# Compare point estimates from manual 2SLS and CF — should be identical by FWL
coef_tsls = coef(tsls_manual)[findfirst(==("w_hat"), coefnames(tsls_manual))]
coef_cf   = coef(lm_cf)[findfirst(==("w"),     coefnames(lm_cf))]

@printf("Manual 2SLS   β̂_w = %.8f\n", coef_tsls)
@printf("Control funct β̂_w = %.8f\n", coef_cf)

# DWH test: t-statistic on v̂
rho_hat = coef(lm_cf)[findfirst(==("v_hat"), coefnames(lm_cf))]
rho_se  = stderror(lm_cf)[findfirst(==("v_hat"), coefnames(lm_cf))]
@printf("\nDWH endogeneity test (ρ̂ on v̂): coef = %.4f  se = %.4f  t = %.3f\n",
        rho_hat, rho_se, rho_hat / rho_se)
Manual 2SLS   β̂_w = 0.99987086
Control funct β̂_w = 0.99987086

DWH endogeneity test (ρ̂ on v̂): coef = 0.6100  se = 0.0203  t = 30.077

The two \(\hat\beta_w\) values agree to machine precision, confirming FWL. The large \(|t|\) on \(\hat{v}\) rejects \(H_0 : \rho = 0\), confirming that w is endogenous.

Note

Why FWL fails in nonlinear models. FWL relies on the residual-maker matrix being idempotent and the second stage being linear — both fail once we replace OLS with Poisson or Probit. In a nonlinear second stage, \(\hat{v}\) must enter inside the link function (e.g. \(\exp(\mathbf{x}\boldsymbol\beta + \rho\hat{v})\)), which requires a structural assumption beyond instrument exogeneity. This is the subject of the next chapter.

16.2 Regression Discontinuity Design

RDD (regression discontinuity design) is a quasi-experimental design that is used to estimate causal effects of interventions when assignment to the intervention is determined by whether a subject’s value on an observed covariate exceeds a threshold. The idea is that the assignment is as good as random, so we can estimate the causal effect of the intervention by comparing the outcomes of subjects who are just above and just below the threshold.

16.2.1 Sharp RDD

We have a continuous variable \(X\), called the running variable, which determines the binary treatment \(W\). The treatment is assigned according to a threshold \(c\). The outcome variable \(Y\) is a function of \(X\) and \(W\).

\[ W=\mathbb{1}(X>c)\]

Lee (2008) studies the effect of incumbency advantage in elections. His identification strategy is based on the discontinuity generated by the rule that the party with a majority vote share wins. The forcing variable \(X_i\) is the difference in vote share between the Democratic and Republican parties in one election, with the threshold \(c = 0\). The outcome variable \(Y_i\) is vote share at the second election.

using CSV
senate = CSV.read("data/rdrobust_senate.csv", DataFrame)

# Calculate RD plot data
plot_data = rdplot(senate.vote, senate.margin, c=0.0)

fig = Figure()
ax = Axis(fig[1, 1], xlabel="Margin", ylabel="Vote")
scatter!(ax, plot_data.vars_bins.rdplot_mean_x, plot_data.vars_bins.rdplot_mean_y, 
         color=:gray, markersize=6)

# Plot polynomial fit for left side
poly_l = filter(r -> r.rdplot_x < 0.0, plot_data.vars_poly)
lines!(ax, poly_l.rdplot_x, poly_l.rdplot_y, color=:blue, linewidth=2)

# Plot polynomial fit for right side
poly_r = filter(r -> r.rdplot_x >= 0.0, plot_data.vars_poly)
lines!(ax, poly_r.rdplot_x, poly_r.rdplot_y, color=:blue, linewidth=2)

vlines!(ax, [0.0], color=:black, linestyle=:dash)
fig

16.2.2 Identification of SRDD

\[ \tau_{RD} = \lim_{x \to c^+} E[Y|X=x] - \lim_{x \to c^-} E[Y|X=x] \]

In words, the treatment effect is the difference in the outcome from the right side and from the left side. This is the same as the difference in the outcome at the threshold, as if the treatment is assigned randomly.

16.2.3 Estimation of SRDD

GPA = rand(Uniform(0, 4), 1000)
future_success = 10 .+ 1.5 .* GPA .+ 2 .* (GPA .<= 3) .+ rand(Normal(0, 1), 1000)

plot_data_gpa = rdplot(future_success, GPA, c=3.0)

fig2 = Figure()
ax2 = Axis(fig2[1, 1], xlabel="GPA", ylabel="Future Success")
scatter!(ax2, plot_data_gpa.vars_bins.rdplot_mean_x, plot_data_gpa.vars_bins.rdplot_mean_y, 
         color=:gray, markersize=6)

poly_gpa_l = filter(r -> r.rdplot_x < 3.0, plot_data_gpa.vars_poly)
lines!(ax2, poly_gpa_l.rdplot_x, poly_gpa_l.rdplot_y, color=:blue, linewidth=2)

poly_gpa_r = filter(r -> r.rdplot_x >= 3.0, plot_data_gpa.vars_poly)
lines!(ax2, poly_gpa_r.rdplot_x, poly_gpa_r.rdplot_y, color=:blue, linewidth=2)

vlines!(ax2, [3.0], color=:black, linestyle=:dash)
fig2

# Estimate the sharp RDD model
rdd_gpa = rdrobust(future_success, GPA, c=3.0)
@printf "Tau (bias-corrected) = %.4g\n" rdd_gpa.Estimate.tau_us[1]
@printf "P-value (robust)     = %.4f\n" rdd_gpa.pv.PValue[3]
Tau (bias-corrected) = -1.657
P-value (robust)     = 0.0000

16.2.4 Nonparametric Estimation

rdd_house = rdrobust(senate.vote, senate.margin, c=0.0)
@printf "Tau (bias-corrected) = %.4g\n" rdd_house.Estimate.tau_us[1]
@printf "P-value (robust)     = %.4f\n" rdd_house.pv.PValue[3]
Tau (bias-corrected) = 7.414
P-value (robust)     = 0.0000

16.2.5 Fuzzy RDD

In fuzzy RDD, the treatment is assigned according to a threshold \(c\), but there exist non-compliance. This is similar to IV case.

\[ \tau_{FRD} = \frac{\lim_{x \to c^+} E[Y|X=x] - \lim_{x \to c^-} E[Y|X=x]}{\lim_{x \to c^+} E[W|X=x] - \lim_{x \to c^-} E[W|X=x]} \]

For example, if food stamp eligibility is given to all households below a certain income, but not all households receive the food stamps. In other words, income does not solely determine the assignment. Cutoff point increases the probability of treatment but doesn’t completely determine treatment.

FRDD is IV.

16.2.6 FRDD Example

Fetter (2013)’s main question of interest is how much of the increase in the home ownership rate in the midcentury US was due to mortgage subsidies given out by the government.

We’re using the running variable quarter of birth (qob), which has been centered on the quarter of birth you’d need to be to be eligible for a mortgage subsidy for fighting in the Korean War (qob_minus_kw). This determines whether you were a veteran of either the Korean War or World War II (vet_wwko).

using CSV

# Load the real mortgages dataset (Fetter 2013) from causaldata R package
vet = CSV.read("data/mortgages.csv", DataFrame)

# Create an "above-cutoff" variable as the instrument
vet.above = vet.qob_minus_kw .> 0

# Impose a bandwidth of 12 quarters on either side
vet = vet[abs.(vet.qob_minus_kw) .< 12, :]

# Manual 2SLS explicitly (educational)
# We have Fixed Effects for state (bpl) and quarter of birth (qob).
# Endogenous variable: vet_wwko and qob_minus_kw * vet_wwko
# Instruments: above and qob_minus_kw * above
vet.vet_inter = vet.qob_minus_kw .* vet.vet_wwko
vet.above_inter = vet.qob_minus_kw .* vet.above

# Stage 1a: Predict vet_wwko
stage1a = feols(vet, @formula(vet_wwko ~ nonwhite + qob_minus_kw + above + above_inter + fe(bpl) + fe(qob)))
vet.vet_wwko_hat = vet.vet_wwko .- stage1a.residuals

# Stage 1b: Predict vet_inter
stage1b = feols(vet, @formula(vet_inter ~ nonwhite + qob_minus_kw + above + above_inter + fe(bpl) + fe(qob)))
vet.vet_inter_hat = vet.vet_inter .- stage1b.residuals

# Stage 2: Regress home_ownership on fitted endogenous variables
# Note: we use RobustVcov for heteroskedasticity-robust SEs
m_iv = feols(vet, @formula(home_ownership ~ nonwhite + qob_minus_kw + vet_wwko_hat + vet_inter_hat + fe(bpl) + fe(qob)), vcov=Vcov.robust())
show_regression_html(m_iv)
Estimate Std. Error t value Pr(>|t|)
nonwhite -0.190 0.007 -27.951 < 2e-16 ***
qob_minus_kw -0.007 0.002 -4.070 4.707e-05 ***
vet_wwko_hat 0.170 0.046 3.733 1.894e-04 ***
vet_inter_hat -0.003 0.003 -1.092 0.275
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Fuzzy RDD using rdrobust (local polynomial, MSE-optimal bandwidth)
# Note: rdrobust uses a data-driven bandwidth — no global FE needed
m_rdd = rdrobust(vet.home_ownership,
                 vet.qob_minus_kw,
                 fuzzy = vet.vet_wwko,
                 c = 0.0)
@printf "Manual 2SLS     tau = %.4g\n" m_iv.beta[3]
@printf "RDRobust (FRDD) tau = %.4g\n" m_rdd.Estimate.tau_us[1]
Manual 2SLS     tau = 0.1702
RDRobust (FRDD) tau = 1.879

These two results are very different, which is one of RDD’s problems. It can be sensitive to model selection. For example, in the case of 2SLS, it’s a linear model. In the case of rdrobust, it’s a local regression, which can be sensitive to bandwidth selection.