25  Poisson model with exposure

Published

April 20, 2024

25.1 Poisson with exposure or Poisson rate

Poisson models for count data often arise when units are observed for different lengths of time. A hospital admission count for a patient followed for 30 days is not directly comparable to a count for a patient followed for 365 days. The natural quantity of interest is the rate — admissions per unit time — not the raw count.

There are three related but distinct specifications, each making a different assumption:

Model 1 — Poisson with offset (rate model):

\[\log\!\bigl(E[Y]/t\bigr) = \beta_0 + \beta_1 X\]

Equivalently, \(\log E[Y] = \beta_0 + \beta_1 X + \log t\), where \(\log t\) is included as a fixed offset (coefficient constrained to 1). This is the standard exposure model: the expected count is proportional to exposure time.

Model 2 — Poisson on the ratio \(Y/t\):

\[\log\!\bigl(E[Y/t]\bigr) = \beta_0 + \beta_1 X\]

This looks similar, but \(E[Y/t] \ne E[Y]/t\) in a nonlinear (Poisson) model by Jensen’s inequality. In practice the two can give similar point estimates, but the standard errors differ because the variance structure changes when you divide by \(t\).

Model 3 — Unconstrained log-exposure:

\[\log E[Y] = \beta_0 + \beta_1 X + \gamma \log t\]

Here \(\gamma\) is estimated from the data. If \(\hat\gamma \approx 1\), the proportionality assumption is supported. If \(\hat\gamma < 1\), counts grow sub-linearly with exposure — a testable, substantive claim.

25.2 Simulation

library(tidyverse)
set.seed(42)

n <- 800

# Simulate a health study: count of adverse events
# Covariate: age group (0 = younger, 1 = older)
age_old <- rbinom(n, 1, 0.45)

# Exposure: days under observation (varies 7–365)
exposure <- sample(seq(7, 365, by = 7), n, replace = TRUE)

# True model: rate = exp(-3 + 1.2 * age_old)
# Count ~ Poisson(rate * exposure)
true_rate <- exp(-3 + 1.2 * age_old)
count <- rpois(n, lambda = true_rate * exposure)

df <- data.frame(count, age_old, exposure, log_exposure = log(exposure))
summary(df[, c("count", "exposure")])
     count          exposure    
 Min.   : 0.00   Min.   :  7.0  
 1st Qu.: 7.00   1st Qu.:112.0  
 Median :13.00   Median :196.0  
 Mean   :18.95   Mean   :192.9  
 3rd Qu.:27.00   3rd Qu.:281.8  
 Max.   :71.00   Max.   :364.0  

The data has a wide range of exposure times, so ignoring them would badly confound the comparison between age groups.

25.3 Fitting the three models

# Model 1: offset — coefficient on log(exposure) constrained to 1
m1 <- glm(count ~ age_old + offset(log(exposure)),
          family = poisson, data = df)

# Model 2: Poisson on the rate Y/t (different response)
df$rate <- df$count / df$exposure
m2 <- glm(rate ~ age_old,
          family = poisson, data = df)

# Model 3: log(exposure) as a free covariate
m3 <- glm(count ~ age_old + log_exposure,
          family = poisson, data = df)

# Collect coefficients
results <- data.frame(
  Model = c("1. Offset (constrained γ=1)",
            "2. Rate Y/t as response",
            "3. Unconstrained log(t)"),
  Intercept = c(coef(m1)[1], coef(m2)[1], coef(m3)[1]),
  age_old   = c(coef(m1)[2], coef(m2)[2], coef(m3)[2]),
  gamma_log_t = c(1, NA, coef(m3)[3])
)

knitr::kable(results, digits = 3,
  caption = "Coefficient estimates: true intercept = −3, true age effect = 1.2, true γ = 1")
Coefficient estimates: true intercept = −3, true age effect = 1.2, true γ = 1
Model Intercept age_old gamma_log_t
1. Offset (constrained γ=1) -3.009 1.212 1.000
2. Rate Y/t as response -2.980 1.176 NA
3. Unconstrained log(t) -2.801 1.212 0.962

Reading the table: Model 1 and Model 3 give nearly identical age-group coefficients because the true \(\gamma = 1\) (counts are indeed proportional to exposure time). Model 2 differs slightly in the intercept because the response variable changes. The estimated \(\hat\gamma\) from Model 3 should be close to 1.

25.4 When \(\gamma \ne 1\)

To see why the unconstrained model matters, repeat the simulation with a sub-linear count process (\(\gamma = 0.6\): an initial burst of events that doesn’t grow proportionally with long follow-up):

count2 <- rpois(n, lambda = true_rate * exposure^0.6)
df2 <- data.frame(count2, age_old, exposure, log_exposure = log(exposure))

m1b <- glm(count2 ~ age_old + offset(log(exposure)),
           family = poisson, data = df2)
m3b <- glm(count2 ~ age_old + log_exposure,
           family = poisson, data = df2)

cat("Offset model age coefficient:      ", round(coef(m1b)[2], 3), "\n")
Offset model age coefficient:       1.171 
cat("Unconstrained model age coefficient:", round(coef(m3b)[2], 3), "\n")
Unconstrained model age coefficient: 1.168 
cat("Estimated gamma (log-exposure):    ", round(coef(m3b)[3], 3),
    " (true = 0.6)\n")
Estimated gamma (log-exposure):     0.566  (true = 0.6)

When \(\gamma \ne 1\), forcing the offset constraint biases the treatment coefficient. The unconstrained model recovers both the true age effect and the true \(\gamma\).

25.5 Stata equivalent

* Model 1: offset
poisson count age_old, exposure(exposure)

* Model 2: rate as response
gen rate = count / exposure
poisson rate age_old

* Model 3: unconstrained log(t)
gen log_exposure = log(exposure)
poisson count age_old log_exposure

25.6 Which model to use?

Situation Recommended model
Counts grow proportionally with time Model 1 (offset)
Uncertain about proportionality Model 3 (unconstrained), test \(\hat\gamma = 1\)
Rate is the natural estimand Model 1 or Model 2; both are valid
Long follow-up with declining event rate Model 3 or negative binomial with offset

In most health and social-science applications, Model 1 (Poisson with offset) is the default. Model 3 is the diagnostic check: fit it, look at \(\hat\gamma\), and return to Model 1 if the constraint is supported by the data.