25 Poisson model with exposure

Published

April 20, 2024

25.1 Poisson with exposure or Poisson rate

Poisson models for count data often arise when units are observed for different lengths of time. A hospital admission count for a patient followed for 30 days is not directly comparable to a count for a patient followed for 365 days. The natural quantity of interest is the rate — admissions per unit time — not the raw count.

There are three related but distinct specifications, each making a different assumption:

Model 1 — Poisson with offset (rate model):

\[\log\!\bigl(E[Y]/t\bigr) = \beta_0 + \beta_1 X\]

Equivalently, $\log E[Y] = \beta_0 + \beta_1 X + \log t$, where $\log t$ is included as a fixed offset (coefficient constrained to 1). This is the standard exposure model: the expected count is proportional to exposure time.

Model 2 — Poisson on the ratio $Y/t$:

\[\log\!\bigl(E[Y/t]\bigr) = \beta_0 + \beta_1 X\]

This looks similar, but it is not equivalent to the offset model, and strictly speaking $Y/t$ is not a Poisson outcome at all: a Poisson likelihood is defined for non-negative integer counts, and the continuous ratio $Y/t$ is not one. Because $t$ is a fixed constant for each unit, $E[Y/t] = E[Y]/t$ exactly (linearity of expectation), so the conditional mean is still log-linear in $X$; that is why the point estimates are often similar. But the Poisson variance assumption fails — $\mathrm{Var}(Y/t) = \lambda/t$ scales with $1/t$, not with the mean — so fitting glm(rate ~ x, family = poisson) is a quasi-likelihood mean model rather than a correct Poisson likelihood. If you use it, treat it as a workaround: weight by exposure (weights = t) so the implied variance is right, and/or use robust standard errors. The offset model (Model 1) avoids the issue entirely by keeping the integer count as the outcome.

Model 3 — Unconstrained log-exposure:

\[\log E[Y] = \beta_0 + \beta_1 X + \gamma \log t\]

Here $\gamma$ is estimated from the data. If $\hat\gamma \approx 1$, the proportionality assumption is supported. If $\hat\gamma < 1$, counts grow sub-linearly with exposure — a testable, substantive claim.

25.2 Simulation

library(tidyverse)
set.seed(42)

n <- 800

# Simulate a health study: count of adverse events
# Covariate: age group (0 = younger, 1 = older)
age_old <- rbinom(n, 1, 0.45)

# Exposure: days under observation (varies 7–365)
exposure <- sample(seq(7, 365, by = 7), n, replace = TRUE)

# True model: rate = exp(-3 + 1.2 * age_old)
# Count ~ Poisson(rate * exposure)
true_rate <- exp(-3 + 1.2 * age_old)
count <- rpois(n, lambda = true_rate * exposure)

df <- data.frame(count, age_old, exposure, log_exposure = log(exposure))
summary(df[, c("count", "exposure")])

     count          exposure    
 Min.   : 0.00   Min.   :  7.0  
 1st Qu.: 7.00   1st Qu.:112.0  
 Median :13.00   Median :196.0  
 Mean   :18.95   Mean   :192.9  
 3rd Qu.:27.00   3rd Qu.:281.8  
 Max.   :71.00   Max.   :364.0

The data has a wide range of exposure times, so ignoring them would badly confound the comparison between age groups.

25.3 Fitting the three models

# Model 1: offset — coefficient on log(exposure) constrained to 1
m1 <- glm(count ~ age_old + offset(log(exposure)),
          family = poisson, data = df)

# Model 2: Poisson on the rate Y/t (different response)
df$rate <- df$count / df$exposure
m2 <- glm(rate ~ age_old,
          family = poisson, data = df)

# Model 3: log(exposure) as a free covariate
m3 <- glm(count ~ age_old + log_exposure,
          family = poisson, data = df)

# Collect coefficients
results <- data.frame(
  Model = c("1. Offset (constrained γ=1)",
            "2. Rate Y/t as response",
            "3. Unconstrained log(t)"),
  Intercept = c(coef(m1)[1], coef(m2)[1], coef(m3)[1]),
  age_old   = c(coef(m1)[2], coef(m2)[2], coef(m3)[2]),
  gamma_log_t = c(1, NA, coef(m3)[3])
)

knitr::kable(results, digits = 3,
  caption = "Coefficient estimates: true intercept = −3, true age effect = 1.2, true γ = 1")

Coefficient estimates: true intercept = −3, true age effect = 1.2, true γ = 1
Model	Intercept	age_old	gamma_log_t
1. Offset (constrained γ=1)	-3.009	1.212	1.000
2. Rate Y/t as response	-2.980	1.176	NA
3. Unconstrained log(t)	-2.801	1.212	0.962

Reading the table: Model 1 and Model 3 give nearly identical age-group coefficients because the true $\gamma = 1$ (counts are indeed proportional to exposure time). Model 2 differs slightly in the intercept because the response variable changes. The estimated $\hat\gamma$ from Model 3 should be close to 1.

25.4 When $\gamma \ne 1$

To see why the unconstrained model matters, repeat the simulation with a sub-linear count process ($\gamma = 0.6$: an initial burst of events that doesn’t grow proportionally with long follow-up):

count2 <- rpois(n, lambda = true_rate * exposure^0.6)
df2 <- data.frame(count2, age_old, exposure, log_exposure = log(exposure))

m1b <- glm(count2 ~ age_old + offset(log(exposure)),
           family = poisson, data = df2)
m3b <- glm(count2 ~ age_old + log_exposure,
           family = poisson, data = df2)

cat("Offset model age coefficient:      ", round(coef(m1b)[2], 3), "\n")

Offset model age coefficient:       1.171

cat("Unconstrained model age coefficient:", round(coef(m3b)[2], 3), "\n")

Unconstrained model age coefficient: 1.168

cat("Estimated gamma (log-exposure):    ", round(coef(m3b)[3], 3),
    " (true = 0.6)\n")

Estimated gamma (log-exposure):     0.566  (true = 0.6)

When $\gamma \ne 1$, forcing the offset constraint biases the treatment coefficient. The unconstrained model recovers both the true age effect and the true $\gamma$.

25.5 Stata equivalent

* Model 1: offset
poisson count age_old, exposure(exposure)

* Model 2: rate as response
gen rate = count / exposure
* Stata will warn that the dependent variable is non-integer (poisson expects
* integer counts by default) -- this is expected and can be ignored here,
* since Poisson QMLE is valid for continuous non-negative outcomes too.
poisson rate age_old

* Model 3: unconstrained log(t)
gen log_exposure = log(exposure)
poisson count age_old log_exposure

25.6 Which model to use?

Situation	Recommended model
Counts grow proportionally with time	Model 1 (offset)
Uncertain about proportionality	Model 3 (unconstrained), test $\hat\gamma = 1$
Rate is the natural estimand	Model 1 (offset) — it estimates a rate directly; Model 2 only as a weighted/robust quasi-model
Long follow-up with declining event rate	Model 3 or negative binomial with offset

In most health and social-science applications, Model 1 (Poisson with offset) is the default. Model 3 is the diagnostic check: fit it, look at $\hat\gamma$, and return to Model 1 if the constraint is supported by the data.

Systematic treatment: R · Julia.

--- title: "Poisson model with exposure" date: "2024-04-20" --- ## Poisson with exposure or Poisson rate Poisson models for count data often arise when units are observed for different lengths of time. A hospital admission count for a patient followed for 30 days is not directly comparable to a count for a patient followed for 365 days. The natural quantity of interest is the *rate* — admissions per unit time — not the raw count. There are three related but distinct specifications, each making a different assumption: **Model 1 — Poisson with offset (rate model):** $$\log\!\bigl(E[Y]/t\bigr) = \beta_0 + \beta_1 X$$ Equivalently, $\log E[Y] = \beta_0 + \beta_1 X + \log t$, where $\log t$ is included as a *fixed* offset (coefficient constrained to 1). This is the standard exposure model: the expected count is proportional to exposure time. **Model 2 — Poisson on the ratio $Y/t$:** $$\log\!\bigl(E[Y/t]\bigr) = \beta_0 + \beta_1 X$$ This looks similar, but it is not equivalent to the offset model, and strictly speaking $Y/t$ is **not a Poisson outcome at all**: a Poisson likelihood is defined for non-negative integer counts, and the continuous ratio $Y/t$ is not one. Because $t$ is a fixed constant for each unit, $E[Y/t] = E[Y]/t$ exactly (linearity of expectation), so the *conditional mean* is still log-linear in $X$; that is why the point estimates are often similar. But the Poisson variance assumption fails — $\mathrm{Var}(Y/t) = \lambda/t$ scales with $1/t$, not with the mean — so fitting `glm(rate ~ x, family = poisson)` is a quasi-likelihood *mean model* rather than a correct Poisson likelihood. If you use it, treat it as a workaround: weight by exposure (`weights = t`) so the implied variance is right, and/or use robust standard errors. The offset model (Model 1) avoids the issue entirely by keeping the integer count as the outcome. **Model 3 — Unconstrained log-exposure:** $$\log E[Y] = \beta_0 + \beta_1 X + \gamma \log t$$ Here $\gamma$ is estimated from the data. If $\hat\gamma \approx 1$, the proportionality assumption is supported. If $\hat\gamma < 1$, counts grow sub-linearly with exposure — a testable, substantive claim. ## Simulation ```{r} #| label: sim-data #| cache: true #| warning: false #| message: false library(tidyverse) set.seed(42) n <- 800 # Simulate a health study: count of adverse events # Covariate: age group (0 = younger, 1 = older) age_old <- rbinom(n, 1, 0.45) # Exposure: days under observation (varies 7–365) exposure <- sample(seq(7, 365, by = 7), n, replace = TRUE) # True model: rate = exp(-3 + 1.2 * age_old) # Count ~ Poisson(rate * exposure) true_rate <- exp(-3 + 1.2 * age_old) count <- rpois(n, lambda = true_rate * exposure) df <- data.frame(count, age_old, exposure, log_exposure = log(exposure)) summary(df[, c("count", "exposure")]) ``` The data has a wide range of exposure times, so ignoring them would badly confound the comparison between age groups. ## Fitting the three models ```{r} #| label: fit-models #| cache: true #| warning: false # Model 1: offset — coefficient on log(exposure) constrained to 1 m1 <- glm(count ~ age_old + offset(log(exposure)), family = poisson, data = df) # Model 2: Poisson on the rate Y/t (different response) df$rate <- df$count / df$exposure m2 <- glm(rate ~ age_old, family = poisson, data = df) # Model 3: log(exposure) as a free covariate m3 <- glm(count ~ age_old + log_exposure, family = poisson, data = df) # Collect coefficients results <- data.frame( Model = c("1. Offset (constrained γ=1)", "2. Rate Y/t as response", "3. Unconstrained log(t)"), Intercept = c(coef(m1)[1], coef(m2)[1], coef(m3)[1]), age_old = c(coef(m1)[2], coef(m2)[2], coef(m3)[2]), gamma_log_t = c(1, NA, coef(m3)[3]) ) knitr::kable(results, digits = 3, caption = "Coefficient estimates: true intercept = −3, true age effect = 1.2, true γ = 1") ``` **Reading the table:** Model 1 and Model 3 give nearly identical age-group coefficients because the true $\gamma = 1$ (counts are indeed proportional to exposure time). Model 2 differs slightly in the intercept because the response variable changes. The estimated $\hat\gamma$ from Model 3 should be close to 1. ## When $\gamma \ne 1$ To see why the unconstrained model matters, repeat the simulation with a sub-linear count process ($\gamma = 0.6$: an initial burst of events that doesn't grow proportionally with long follow-up): ```{r} #| label: nonlinear-exposure #| cache: true #| warning: false count2 <- rpois(n, lambda = true_rate * exposure^0.6) df2 <- data.frame(count2, age_old, exposure, log_exposure = log(exposure)) m1b <- glm(count2 ~ age_old + offset(log(exposure)), family = poisson, data = df2) m3b <- glm(count2 ~ age_old + log_exposure, family = poisson, data = df2) cat("Offset model age coefficient: ", round(coef(m1b)[2], 3), "\n") cat("Unconstrained model age coefficient:", round(coef(m3b)[2], 3), "\n") cat("Estimated gamma (log-exposure): ", round(coef(m3b)[3], 3), " (true = 0.6)\n") ``` When $\gamma \ne 1$, forcing the offset constraint biases the treatment coefficient. The unconstrained model recovers both the true age effect and the true $\gamma$. ## Stata equivalent ```stata * Model 1: offset poisson count age_old, exposure(exposure) * Model 2: rate as response gen rate = count / exposure * Stata will warn that the dependent variable is non-integer (poisson expects * integer counts by default) -- this is expected and can be ignored here, * since Poisson QMLE is valid for continuous non-negative outcomes too. poisson rate age_old * Model 3: unconstrained log(t) gen log_exposure = log(exposure) poisson count age_old log_exposure ``` ## Which model to use? | Situation | Recommended model | |---|---| | Counts grow proportionally with time | Model 1 (offset) | | Uncertain about proportionality | Model 3 (unconstrained), test $\hat\gamma = 1$ | | Rate is the natural estimand | Model 1 (offset) — it estimates a rate directly; Model 2 only as a weighted/robust quasi-model | | Long follow-up with declining event rate | Model 3 or negative binomial with offset | In most health and social-science applications, Model 1 (Poisson with offset) is the default. Model 3 is the diagnostic check: fit it, look at $\hat\gamma$, and return to Model 1 if the constraint is supported by the data. ---  *Systematic treatment: [R](https://xiangao.github.io/causal_econometrics_guide/poisson-iv.html) · [Julia](https://xiangao.github.io/causal_econometrics_julia/poisson-iv.html).*

25.1 Poisson with exposure or Poisson rate

25.2 Simulation

25.3 Fitting the three models

25.4 When \(\gamma \ne 1\)

25.5 Stata equivalent

25.6 Which model to use?