15 IV in Poisson with Fixed Effects

Poisson regression handles count and non-negative outcomes by specifying the conditional mean as an exponential function. Because it is a quasi-MLE, it requires only the mean to be correct — no distributional assumption on the count. When an explanatory variable is endogenous, we need an IV strategy adapted to the nonlinear mean. This chapter covers two cases: a continuous endogenous variable (where the control function works exactly) and a binary endogenous variable (where three CF/IV approaches are available with different trade-offs).

15.1 The Control Function Framework

15.1.1 Linear second stage — CF equals 2SLS

In the previous chapter we saw that adding the first-stage residual \(\hat{v}_2\) to an OLS second stage recovers the same coefficient as 2SLS. The algebra is:

First stage: \(y_2 = \mathbf{z}\boldsymbol{\pi} + v_2\), estimated by OLS. Residual \(\hat{v}_2 = y_2 - \hat{y}_2\) satisfies \(E(\mathbf{z}'\hat{v}_2) = 0\).
Structural error decomposition: write \(\varepsilon_1 = \rho v_2 + \eta_1\), where \(E(\mathbf{z}'\eta_1) = 0\) and \(E(v_2\,\eta_1) = 0\) by construction.
Augmented second stage: substitute to get \(y_1 = \mathbf{z}_1\boldsymbol{\delta}_1 + \alpha_1 y_2 + \rho\,\hat{v}_2 + \eta_1.\)

OLS on this equation is the control function estimator. The Frisch–Waugh–Lovell (FWL) theorem guarantees it is numerically identical to 2SLS — no extra assumption beyond \(E(\mathbf{z}'\varepsilon_1) = 0\) is required.

15.1.2 Nonlinear second stage — CF diverges from 2SLS

When the conditional mean is \(\exp(\mathbf{x}\boldsymbol{\beta})\) (Poisson), a direct IV/GMM approach solves the moment conditions \(E[\mathbf{z}'(y_1 - \exp(\mathbf{x}\boldsymbol{\beta}))] = 0\) without modifying the exponential mean. The control function approach instead imposes the stronger conditional mean assumption (Wooldridge 2002, 2014):

\[\boxed{E[y_1 \mid \mathbf{z}_1,\, y_2,\, v_2] = \exp\!\bigl(\mathbf{z}_1\boldsymbol{\delta}_1 + \alpha_1 y_2 + \rho v_2\bigr).}\]

This says the unobservable enters the conditional mean linearly inside \(\exp(\cdot)\). Under this assumption, substituting \(\hat{v}_2\) for \(v_2\) and running Poisson QMLE gives a consistent estimate of \(\alpha_1\).

Three practical consequences:

Property	Linear (OLS/2SLS)	Nonlinear (Poisson CF)
Extra assumption beyond IV exogeneity	None — FWL applies	CF mean assumption required
CF vs. 2SLS/GMM numerically	Identical	Different — FWL fails
SE from second stage alone	Correct	Understated — bootstrap needed
Test of endogeneity	\(t\)-stat on \(\hat{v}_2\) (DWH)	Same — \(t\)-stat on \(\hat{v}_2\)

Intuition for the CF assumption. The decomposition \(\varepsilon_1 = \rho v_2 + \eta_1\) is exactly the same as in the linear case. What changes is where the decomposition gets applied: in the linear model it enters additively outside any transformation; in Poisson it must enter inside \(\exp(\cdot)\) to be absorbed by the quasi-MLE score. If the true \(\rho \neq 0\), omitting \(\hat{v}_2\) from the exponential mean leaves a multiplicative endogeneity term \(e^{\rho v_2}\) correlated with \(y_2\), biasing \(\hat\alpha_1\) upward or downward depending on the sign of \(\rho\).

15.2 Part 1 — Continuous endogenous variable

15.2.1 Data generating process

We simulate a panel where time (continuous, e.g. minutes on site) is endogenous: it is correlated with an unobserved factor e that also affects visits. The excluded instrument is phone (binary, e.g. phone-access indicator). Fixed effects are ad (20 groups) and female.

set.seed(42)
n_ad <- 20; obs_per <- 250; n <- n_ad * obs_per

fe_ad  <- rnorm(n_ad) * 0.5
fe_grp <- rep(1:n_ad, each = obs_per)
female <- as.integer(runif(n) < 0.5)
phone  <- as.integer(runif(n) < 0.4)
frfam  <- runif(n)
e      <- rnorm(n)

time <- 1.5 * phone + 0.5 * frfam + fe_ad[fe_grp] + e

true_alpha <- 0.8
lambda <- exp(0.5 + true_alpha * time + 0.4 * frfam +
              fe_ad[fe_grp] + 0.3 * female + 0.5 * e)
visits <- rpois(n, lambda)

df1 <- data.frame(
  visits = visits, time = time, phone = phone,
  frfam = frfam, female = female, ad = fe_grp
)

cat(sprintf("n = %d   groups = %d   visits mean = %.2f\n", n, n_ad, mean(visits)))

n = 5000   groups = 20   visits mean = 26.43

15.2.2 Naive Poisson — endogeneity bias

Ignoring the endogeneity of time gives a biased estimate of \(\alpha_1\):

naive1 <- fepois(visits ~ time + frfam | ad + female, data = df1)
etable(naive1, title = "Naive Poisson — time is endogenous")

                            naive1
Dependent Var.:             visits
                                  
time             1.139*** (0.0025)
frfam           0.2202*** (0.0092)
Fixed-Effects:  ------------------
ad                             Yes
female                         Yes
_______________ __________________
S.E. type                      IID
Observations                 5,000
Squared Cor.               0.95297
Pseudo R2                  0.91842
BIC                       32,823.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

15.2.3 Control function approach

The two-step CF procedure:

First stage — project time on phone, frfam, and fixed effects via feols. The residual \(\hat{v}_2 = \text{time} - \hat{\text{time}}\) satisfies \(E(\mathbf{z}'\hat{v}_2)=0\) by construction. A linear first stage is valid even when \(y_2\) is continuous (Part 2 shows binary).
Second stage — Poisson QMLE with \(\hat{v}_2\) added as a regressor inside the exponential mean. The CF assumption \(E[y_1 \mid \mathbf{z}_1, y_2, v_2] = \exp(\mathbf{z}_1\boldsymbol{\delta}_1 + \alpha_1 y_2 + \rho v_2)\) holds exactly here because the DGP specifies \(0.5\,e\) linearly inside \(\exp(\cdot)\).

Note that unlike 2SLS in a linear model, \(\hat{v}_2\) does not replace \(y_2\) — both enter together.

# Step 1: linear projection
fs1       <- feols(time ~ phone + frfam | ad + female, data = df1)
df1$v2hat <- residuals(fs1)

# Step 2: Poisson CF
m1_cf <- fepois(visits ~ time + v2hat + frfam | ad + female, data = df1)
etable(m1_cf, title = "CF: linear projection first stage")

                             m1_cf
Dependent Var.:             visits
                                  
time            0.7801*** (0.0041)
v2hat           0.5150*** (0.0050)
frfam           0.3775*** (0.0093)
Fixed-Effects:  ------------------
ad                             Yes
female                         Yes
_______________ __________________
S.E. type                      IID
Observations                 5,000
Squared Cor.               0.99614
Pseudo R2                  0.94394
BIC                       22,627.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cat(sprintf("time coefficient: %.4f  (true %.1f)\n", coef(m1_cf)["time"], true_alpha))

time coefficient: 0.7801  (true 0.8)

The coefficient on time recovers the true \(\alpha_1 = 0.8\). The v2hat coefficient tests exogeneity — its \(t\)-statistic is the Hausman-type test.

15.2.4 Bootstrap standard errors

The second-stage standard errors are incorrect because first-stage estimation error propagates. A cluster bootstrap (by ad group) corrects this. parLapply with a PSOCK cluster parallelises the 500 replications safely inside Quarto’s rendering environment:

bootstrap_cf_continuous <- function(df, B = 500) {
  clusters <- unique(df$ad)
  K        <- length(clusters)

  boot_one <- function(b) {
    library(fixest)
    set.seed(b)
    drawn <- sample(clusters, K, replace = TRUE)
    df_b  <- do.call(rbind, lapply(seq_along(drawn), function(i) {
      sub         <- df[df$ad == drawn[i], ]
      sub$ad_boot <- i
      sub
    }))
    fs         <- feols(time ~ phone + frfam | ad_boot + female, data = df_b)
    df_b$v2hat <- residuals(fs)
    ss         <- fepois(visits ~ time + v2hat + frfam | ad_boot + female, data = df_b)
    coef(ss)["time"]
  }

  n_cores <- max(1L, detectCores() - 1L)
  cl      <- makePSOCKcluster(n_cores)
  clusterExport(cl, c("df", "clusters", "K"), envir = environment())
  results <- parLapply(cl, seq_len(B), boot_one)
  stopCluster(cl)
  unlist(results)
}

boot1    <- bootstrap_cf_continuous(df1, 500)
coef1    <- coef(m1_cf)["time"]
naive_se <- se(m1_cf)["time"]

cat(sprintf("time coefficient:   %.4f  (true %.1f)\n", coef1, true_alpha))

time coefficient:   0.7801  (true 0.8)

cat(sprintf("Naive 2nd-stage SE: %.4f\n", naive_se))

Naive 2nd-stage SE: 0.0041

cat(sprintf("Bootstrap SE:       %.4f\n", sd(boot1)))

Bootstrap SE:       0.0079

cat(sprintf("Bootstrap 95%% CI: [%.4f, %.4f]\n",
            quantile(boot1, 0.025), quantile(boot1, 0.975)))

Bootstrap 95% CI: [0.7637, 0.7930]

The bootstrapped SE is larger than the naive second-stage SE, confirming that ignoring first-stage uncertainty understates uncertainty.

15.3 Part 2 — Binary endogenous variable

When \(y_2 \in \{0,1\}\), the same linear projection first stage is still valid, but two additional approaches exploiting the binary structure become available.

15.3.1 Data generating process

We use a clean probit DGP for the binary endogenous variable with a fresh structural error e2 independent of Part 1. The endogeneity coefficient ρ = 0.2 is kept small so the CF assumption holds approximately well and estimates are close to the true value.

set.seed(123)
e2      <- rnorm(n)

# Binary endogenous variable: clean probit latent index
time_hi <- as.integer(1.5 * phone + 0.4 * frfam + e2 >= 0)

# Poisson outcome — true coefficient 0.8, endogeneity via ρ*e2 (ρ=0.2)
rho2    <- 0.2
lambda2 <- exp(0.5 + true_alpha * time_hi + 0.4 * frfam +
               fe_ad[fe_grp] + 0.3 * female + rho2 * e2)
visits2 <- rpois(n, lambda2)

df2 <- data.frame(
  visits = visits2, time_hi = time_hi, phone = phone,
  frfam = frfam, female = female, ad = fe_grp, e2 = e2
)

cat(sprintf("time_hi mean:      %.3f\n", mean(time_hi)))

time_hi mean:      0.729

cat(sprintf("Cor(time_hi, e2):  %.3f  (endogeneity)\n", cor(time_hi, e2)))

Cor(time_hi, e2):  0.622  (endogeneity)

cat(sprintf("visits mean:       %.2f\n", mean(visits2)))

visits mean:       6.18

15.3.2 Three approaches: theory recap

	First stage	What enters 2nd stage	1st-stage assumption	2nd-stage assumption
A	`feols` (OLS)	Residuals \(\hat{v}_2\)	Exogeneity of \(\mathbf{z}\) only	CF mean \(\exp(\cdots + \rho v_2)\) correct
B	`feglm` (probit)	Gen. residuals \(\hat{r}_2\)	Probit correctly specified	CF mean \(\exp(\cdots + \rho r_2)\) correct
C	Any model	Fitted values \(\hat{p}_2\)	Exogeneity of \(\mathbf{z}\) only	Structural mean \(\exp(\cdots)\) correct

All three require the second-stage Poisson conditional mean to be correctly specified. A and C share the same (weaker) first-stage assumption; B additionally requires correct probit specification.

15.3.3 Approach A — Linear projection CF

fs_a      <- feols(time_hi ~ phone + frfam | ad + female, data = df2)
df2$v2hat <- residuals(fs_a)

m_a <- fepois(visits ~ time_hi + v2hat + frfam | ad + female, data = df2)
etable(m_a, title = "Approach A: OLS residual as control function")

                               m_a
Dependent Var.:             visits
                                  
time_hi         0.8613*** (0.0321)
v2hat           0.2674*** (0.0318)
frfam           0.4168*** (0.0199)
Fixed-Effects:  ------------------
ad                             Yes
female                         Yes
_______________ __________________
S.E. type                      IID
Observations                 5,000
Squared Cor.               0.73423
Pseudo R2                  0.41908
BIC                       22,269.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

15.3.4 Approach B — Probit generalized residuals

fs_b <- feglm(time_hi ~ phone + frfam | ad + female,
              data = df2, family = binomial(probit))

# Generalized residual: r̂ = y·φ(η)/Φ(η) − (1−y)·φ(η)/(1−Φ(η))
eta      <- predict(fs_b, type = "link")
df2$gr2  <- ifelse(df2$time_hi == 1,
                   dnorm(eta) / pnorm(eta),
                   -dnorm(eta) / (1 - pnorm(eta)))

m_b <- fepois(visits ~ time_hi + gr2 + frfam | ad + female, data = df2)
etable(m_b, title = "Approach B: probit generalized residual")

                               m_b
Dependent Var.:             visits
                                  
time_hi         0.8544*** (0.0329)
gr2             0.1664*** (0.0200)
frfam           0.4162*** (0.0199)
Fixed-Effects:  ------------------
ad                             Yes
female                         Yes
_______________ __________________
S.E. type                      IID
Observations                 5,000
Squared Cor.               0.73424
Pseudo R2                  0.41905
BIC                       22,270.4
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

15.3.5 Approach C — Fitted values as instrument (IV/GMM Poisson)

Any function of the exogenous \(\mathbf{z}\) is a valid instrument. Probit fitted probabilities are used here as the instrument for time_hi. We implement IV/GMM Poisson via gmm::gmm(), which solves the moment conditions \(E[\mathbf{z}'(y_1 - \exp(\mathbf{x}\boldsymbol{\beta}))] = 0\) directly:

# Probit without FE for fitted probabilities (instrument)
fs_c     <- glm(time_hi ~ phone + frfam, family = binomial(probit), data = df2)
df2$phat <- fitted(fs_c)

# Moment conditions: Z'(y - exp(X*beta)) = 0
# X = [time_hi, frfam, 1],  Z = [phat, frfam, 1]
gmm_moments <- function(beta, x) {
  y      <- x[, "visits"]
  th     <- x[, "time_hi"]
  fr     <- x[, "frfam"]
  ph     <- x[, "phat"]
  eta    <- pmin(pmax(beta[1] * th + beta[2] * fr + beta[3], -30), 30)
  resid  <- y - exp(eta)
  cbind(ph * resid, fr * resid, resid)
}

gmm_data <- as.matrix(df2[, c("visits", "time_hi", "frfam", "phat")])
m_c      <- gmm(gmm_moments, x = gmm_data,
                t0 = c(0, 0, log(mean(df2$visits))))
coef_c   <- coef(m_c)
se_c     <- sqrt(diag(vcov(m_c)))

cat(sprintf("\nApproach C — IV/GMM Poisson (no FE, for illustration)\n"))


Approach C — IV/GMM Poisson (no FE, for illustration)

cat(sprintf("  time_hi:  coef = %7.4f   SE = %7.4f   z = %6.3f\n",
            coef_c[1], se_c[1], coef_c[1] / se_c[1]))

  time_hi:  coef =  0.6279   SE =  0.0832   z =  7.551

cat(sprintf("  frfam:    coef = %7.4f   SE = %7.4f   z = %6.3f\n",
            coef_c[2], se_c[2], coef_c[2] / se_c[2]))

  frfam:    coef =  0.4659   SE =  0.0383   z = 12.159

Approach C with FE requires demeaning the regressor and instrument matrices via within-group centering before solving the Poisson moment conditions. Because Poisson is nonlinear this is not straightforward. Approach A via feols + fepois is the practical choice when many fixed effects are present.

15.3.6 A vs C — not equivalent in Poisson

In a linear model, using \(\hat{y}_2\) as an instrument (2SLS) is numerically identical to adding \(\hat{v}_2 = y_2 - \hat{y}_2\) as a regressor — the Frisch-Waugh-Lovell theorem. In Poisson this equivalence breaks, because Approach A puts \(\hat{v}_2\) inside \(\exp(\cdot)\) while Approach C uses \(\hat{y}_2\) in the moment conditions without modifying the exponential mean.

coef_a <- coef(m_a)["time_hi"]
coef_b <- coef(m_b)["time_hi"]

cat(sprintf("Approach A  (CF, OLS residual, with FE):   α̂₁ = %.4f\n", coef_a))

Approach A  (CF, OLS residual, with FE):   α̂₁ = 0.8613

cat(sprintf("Approach B  (CF, probit GR, with FE):      α̂₁ = %.4f\n", coef_b))

Approach B  (CF, probit GR, with FE):      α̂₁ = 0.8544

cat(sprintf("Approach C  (IV/GMM Poisson, no FE):       α̂₁ = %.4f\n", coef_c[1]))

Approach C  (IV/GMM Poisson, no FE):       α̂₁ = 0.6279

cat(sprintf("True α₁ = %.1f\n", true_alpha))

True α₁ = 0.8

The estimates differ despite using the same exclusion restriction, confirming that 2SLS ≠ CF for Poisson.

15.3.7 Bootstrap for Approach A (binary EEV)

bootstrap_cf_binary <- function(df, B = 500) {
  clusters <- unique(df$ad)
  K        <- length(clusters)

  boot_one <- function(b) {
    library(fixest)
    set.seed(b)
    drawn <- sample(clusters, K, replace = TRUE)
    df_b  <- do.call(rbind, lapply(seq_along(drawn), function(i) {
      sub         <- df[df$ad == drawn[i], ]
      sub$ad_boot <- i
      sub
    }))
    fs         <- feols(time_hi ~ phone + frfam | ad_boot + female, data = df_b)
    df_b$v2hat <- residuals(fs)
    ss         <- fepois(visits ~ time_hi + v2hat + frfam | ad_boot + female, data = df_b)
    coef(ss)["time_hi"]
  }

  n_cores <- max(1L, detectCores() - 1L)
  cl      <- makePSOCKcluster(n_cores)
  clusterExport(cl, c("df", "clusters", "K"), envir = environment())
  results <- parLapply(cl, seq_len(B), boot_one)
  stopCluster(cl)
  unlist(results)
}

boot2     <- bootstrap_cf_binary(df2, 500)
naive_se2 <- se(m_a)["time_hi"]

cat(sprintf("Approach A bootstrap (binary EEV, 500 reps, cluster by ad)\n"))

Approach A bootstrap (binary EEV, 500 reps, cluster by ad)

cat(sprintf("  Point estimate:     %.4f  (true %.1f)\n", coef_a, true_alpha))

  Point estimate:     0.8613  (true 0.8)

cat(sprintf("  Naive 2nd-stage SE: %.4f\n", naive_se2))

  Naive 2nd-stage SE: 0.0321

cat(sprintf("  Bootstrap SE:       %.4f\n", sd(boot2)))

  Bootstrap SE:       0.0331

cat(sprintf("  Bootstrap 95%% CI:  [%.4f, %.4f]\n",
            quantile(boot2, 0.025), quantile(boot2, 0.975)))

  Bootstrap 95% CI:  [0.8025, 0.9308]

15.3.8 Comparison across all methods

naive_p2    <- fepois(visits ~ time_hi + frfam | ad + female, data = df2)
coef_naive2 <- coef(naive_p2)["time_hi"]

methods <- c(
  "True value",
  "Naive Poisson (binary y₂)",
  "A: Linear CF (with FE)",
  "B: Probit GR CF (with FE)",
  "C: IV/GMM Poisson (no FE)"
)
ests <- c(true_alpha, coef_naive2, coef_a, coef_b, coef_c[1])

cat("\n── Binary EEV: coefficient on time_hi ─────────────────────\n")


── Binary EEV: coefficient on time_hi ─────────────────────

for (i in seq_along(methods)) {
  cat(sprintf("  %-38s  α̂₁ = %6.4f\n", methods[i], ests[i]))
}

  True value                              α̂₁ = 0.8000
  Naive Poisson (binary y₂)             α̂₁ = 1.0832
  A: Linear CF (with FE)                  α̂₁ = 0.8613
  B: Probit GR CF (with FE)               α̂₁ = 0.8544
  C: IV/GMM Poisson (no FE)               α̂₁ = 0.6279

cat("\nNote: Approaches A and B (with FE) recover the true value closely;\n")


Note: Approaches A and B (with FE) recover the true value closely;

cat("small upward bias from ρ=0.2 and the binary EEV is expected.\n")

small upward bias from ρ=0.2 and the binary EEV is expected.

cat("Approach C (no FE) diverges further because omitting fixed effects\n")

Approach C (no FE) diverges further because omitting fixed effects

cat("introduces additional bias on top of endogeneity. With FE absorbed,\n")

introduces additional bias on top of endogeneity. With FE absorbed,

cat("Approach A is the practical choice for count models with many groups.\n")

Approach A is the practical choice for count models with many groups.