23 Which count data model to use

Published

October 10, 2017

23.1 A comparison of various count data models with extra zeros

In empirical studies, data sets with a lot of zeros are often hard to model. There are various models to deal with it: zero-inflated Poisson model, Negative Binomial (NB)model, hurdle model, etc.

Here we are following a zero-inflated model’s thinking: model the data with two processes. One is a Bernoulli process, the other one is a count data process (Poisson or NB).

We’d like to see, in this simulation exercise, how different models perform with changes of sample size and percentage of zeros (we expect the less zero, the better a plain Poisson model would perform). Therefore we vary sample size $n$ and an indicator of how much percentage of zeros in the data $\theta$.

For the count data process ($y_c$): \[ log(y_c) = 2 x + u \]

For the Bernoulli process ($y_b$):

\[ z_1 = 4 z + \theta \] \[ logit(y_b) = z_1 \] \[ p_y = \frac{e^z_1}{1+e^{z_1}} \]

Combining these two processes:

\[ y = y_c \ \text{if} \ p_y=1\] \[ y = y_b \ \text{if} \ p_y=0\]

23.1.1 Zero-inflated Poisson models

A zero-inflated Poisson needs specifying both the binary process and the count process correctly. Often than not, we don’t have a model for the binary process. Many people simply use the same explanatory variables for both processes. We simulate both situations. Case 1: suppose we observe $z$, and case 2: suppose we don’t observe $z$. In the graph below, they are labeled zip1 and zip2.

23.1.2 Poisson model

A plain Poisson model returns a consistent estimator for the coefficients, with or without Poisson-distributed data. We expect Poisson model’s performance improve with sample size. Note that the standard errors from a Poisson model needs adjustment, which we do not discuss in this post.

23.1.3 NB model

NB model is used widely to handle “overdispersion” problem. That is, the variance far exceeds the mean, therefore the Poisson model is considered inappropriate. NB model addresses that by allowing an extra parameter. However, many people also use it to model “extra zero” situation, we’ll see in our simulation it may not be better than a plain Poisson model.

23.1.4 Log-linear model

What about an OLS model with $log(y+1)$?

23.1.5 hurdle model

A hurdle model models the zero’s and other values separately; that is, the zero’s are from a binomial process only, the other positive values are from a truncated count data process. We assume here, in the simulation, we don’t observe $z$. Therefore, $x$ is determining both binary and count processes. In the graph below, it’s labeled hurdle.

library(MASS)
library(pscl)
library(parallel)
set.seed(666)

gen.sim <- function(df) {
  nobs <- df["nobs"]; th <- df["th"]
  z    <- rnorm(nobs, 0, 1)
  x    <- rnorm(nobs, 0, 1)
  u    <- rnorm(nobs, 0, 1)
  y.count  <- floor(exp(2 * x + u))
  z1       <- 4 * z + th
  prob     <- plogis(z1)
  y.logit  <- rbinom(nobs, size = 1, prob = prob)
  y        <- ifelse(y.logit == 1, y.count, y.logit)

  m1 <- zeroinfl(y ~ x | z)
  m4 <- zeroinfl(y ~ x | x)
  m2 <- glm(y ~ x, family = "poisson")
  m3 <- lm(log(y + 1) ~ x)
  m5 <- glm.nb(y ~ x)
  m6 <- hurdle(y ~ x)

  c(zip1       = coef(m1, "count")["x"] - 2,
    poisson     = coef(m2)["x"] - 2,
    log.linear  = exp(coef(m3)["x"]) - 2,
    zip2        = coef(m4, "count")["x"] - 2,
    nb          = coef(m5)["x"] - 2,
    hurdle      = coef(m6, "count")["x"] - 2)
}

data.grid <- expand.grid(
  nobs = ceiling(exp(seq(4, 9, 1)) / 100) * 100,
  nsim = seq(1, 100, 1),
  th   = seq(-4, 4, 2)
)

cl      <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, { library(MASS); library(pscl) })
clusterExport(cl, "gen.sim")
results <- parApply(cl, data.grid, 1, gen.sim)
stopCluster(cl)

forshiny <- cbind(data.grid, t(results))
write.csv(forshiny, "results.csv", row.names = FALSE)

Count data models can be used even if data is not “counts”; for example, some non-negative non-integer numbers. In fact, Poisson model is consistent even if data is not Poisson-distributed, if the model specification is correct on modeling the log of expected counts. We simulate both scenarios: Case 1, data is generated from a Poisson process. Case 2, data is generated from a Normal distribution, but we use count data models to model it. The above code is for case 2.

We simulate 100 times with $\theta$ ranging from -4 to 4, lower number means higher percentage of zeros; number of observations from $e^4$ to $e^9$ (roughly from 50 to 8000).

Since there are many simulations, we used “snowfall” library to speed things up.

For raw code, please visit case1: poisson and case2: normal. The code above updates the original snowfall-based version to use base R parallel.

Summary of simulation findings: With large samples, plain Poisson outperforms NB and log-linear even with extra zeros. Zero-inflated Poisson with correct specification of the zero process is best, but that relies on knowing what drives the zeros. Hurdle and ZIP without correct zero-model specification are similar and not much worse than Poisson. Log-linear (log(y+1)) is consistently the worst.

23.2 Quick worked example

All five models on a single small dataset (n = 1000, ~40% zeros):

library(MASS)
library(pscl)
set.seed(42)

n  <- 1000
x  <- rnorm(n)
z  <- rnorm(n)                          # zero-inflation driver
mu <- exp(1.5 + 0.8 * x)               # count mean
y_c <- rpois(n, lambda = mu)
y_b <- rbinom(n, 1, plogis(1 + 2 * z)) # 1 = non-zero
y   <- y_c * y_b                        # zero-inflated count

cat("Zero fraction:", round(mean(y == 0), 2), "\n")

Zero fraction: 0.39

cat("Mean (non-zero):", round(mean(y[y > 0]), 2), "\n\n")

Mean (non-zero): 6.79

# 1. Poisson
m_pois <- glm(y ~ x, family = poisson)

# 2. Negative binomial
m_nb <- glm.nb(y ~ x)

# 3. Zero-inflated Poisson (ZIP) — correct zero model
m_zip <- zeroinfl(y ~ x | z)

# 4. ZIP — wrong zero model (only x)
m_zip2 <- zeroinfl(y ~ x | x)

# 5. Hurdle
m_hurdle <- hurdle(y ~ x)

# Compare x coefficient (true = 0.8)
coefs <- c(
  Poisson    = coef(m_pois)["x"],
  NB         = coef(m_nb)["x"],
  `ZIP (correct z)` = coef(m_zip, "count")["x"],
  `ZIP (wrong)` = coef(m_zip2, "count")["x"],
  Hurdle     = coef(m_hurdle, "count")["x"]
)

knitr::kable(
  data.frame(Model = names(coefs), `x coefficient` = round(coefs, 3),
             Bias = round(coefs - 0.8, 3)),
  caption = "True x coefficient = 0.8"
)

True x coefficient = 0.8
	Model	x.coefficient	Bias
Poisson.x	Poisson.x	0.850	0.050
NB.x	NB.x	0.824	0.024
ZIP (correct z).x	ZIP (correct z).x	0.803	0.003
ZIP (wrong).x	ZIP (wrong).x	0.802	0.002
Hurdle.x	Hurdle.x	0.803	0.003

ZIP with the correct zero-inflation model (using z) recovers the true coefficient most accurately. Without z, all models show upward bias because the zeros are incorrectly attributed to the count process.

23.3 Mixed count models with glmmTMB

When data has both overdispersion and random effects (clustered counts, panel data, repeated measures), glmmTMB handles the full range in one package:

library(glmmTMB)

# Zero-inflated negative binomial with random intercept per subject
m_glmmtmb <- glmmTMB(
  count ~ treatment + time + (1 | subject),   # count model
  ziformula = ~ treatment,                     # zero-inflation model
  family = nbinom2,
  data = your_data
)

summary(m_glmmtmb)

# Conditional vs marginal predictions
predict(m_glmmtmb, type = "conditional")  # excluding zero-inflation
predict(m_glmmtmb, type = "response")     # including zero-inflation

glmmTMB supports: Poisson, NB1, NB2, zero-inflated variants of all of the above, hurdle models, beta-binomial, tweedie, and more. The ziformula argument takes a one-sided formula for the zero-inflation component — use ~ 1 for a constant zero-inflation rate, or covariates to model what drives the structural zeros.

--- title: "Which count data model to use" date: "2017-10-10" --- ## A comparison of various count data models with extra zeros In empirical studies, data sets with a lot of zeros are often hard to model. There are various models to deal with it: zero-inflated Poisson model, Negative Binomial (NB)model, hurdle model, etc. Here we are following a zero-inflated model's thinking: model the data with two processes. One is a Bernoulli process, the other one is a count data process (Poisson or NB). We'd like to see, in this simulation exercise, how different models perform with changes of sample size and percentage of zeros (we expect the less zero, the better a plain Poisson model would perform). Therefore we vary sample size $n$ and an indicator of how much percentage of zeros in the data $\theta$. For the count data process ($y_c$): $$ log(y_c) = 2 x + u $$ For the Bernoulli process ($y_b$): $$ z_1 = 4 z + \theta $$ $$ logit(y_b) = z_1 $$ $$ p_y = \frac{e^z_1}{1+e^{z_1}} $$ Combining these two processes: $$ y = y_c \ \text{if} \ p_y=1$$ $$ y = y_b \ \text{if} \ p_y=0$$ ### Zero-inflated Poisson models A zero-inflated Poisson needs specifying both the binary process and the count process correctly. Often than not, we don't have a model for the binary process. Many people simply use the same explanatory variables for both processes. We simulate both situations. Case 1: suppose we observe $z$, and case 2: suppose we don't observe $z$. In the graph below, they are labeled zip1 and zip2. ### Poisson model A plain Poisson model returns a consistent estimator for the coefficients, with or without Poisson-distributed data. We expect Poisson model's performance improve with sample size. Note that the standard errors from a Poisson model needs adjustment, which we do not discuss in this post. ### NB model NB model is used widely to handle "overdispersion" problem. That is, the variance far exceeds the mean, therefore the Poisson model is considered inappropriate. NB model addresses that by allowing an extra parameter. However, many people also use it to model "extra zero" situation, we'll see in our simulation it may not be better than a plain Poisson model. ### Log-linear model What about an OLS model with $log(y+1)$? ### hurdle model A hurdle model models the zero's and other values separately; that is, the zero's are from a binomial process only, the other positive values are from a truncated count data process. We assume here, in the simulation, we don't observe $z$. Therefore, $x$ is determining both binary and count processes. In the graph below, it's labeled hurdle. ```{r} #| label: count-sim #| eval: false #| echo: true library(MASS) library(pscl) library(parallel) set.seed(666) gen.sim <- function(df) { nobs <- df["nobs"]; th <- df["th"] z <- rnorm(nobs, 0, 1) x <- rnorm(nobs, 0, 1) u <- rnorm(nobs, 0, 1) y.count <- floor(exp(2 * x + u)) z1 <- 4 * z + th prob <- plogis(z1) y.logit <- rbinom(nobs, size = 1, prob = prob) y <- ifelse(y.logit == 1, y.count, y.logit) m1 <- zeroinfl(y ~ x | z) m4 <- zeroinfl(y ~ x | x) m2 <- glm(y ~ x, family = "poisson") m3 <- lm(log(y + 1) ~ x) m5 <- glm.nb(y ~ x) m6 <- hurdle(y ~ x) c(zip1 = coef(m1, "count")["x"] - 2, poisson = coef(m2)["x"] - 2, log.linear = exp(coef(m3)["x"]) - 2, zip2 = coef(m4, "count")["x"] - 2, nb = coef(m5)["x"] - 2, hurdle = coef(m6, "count")["x"] - 2) } data.grid <- expand.grid( nobs = ceiling(exp(seq(4, 9, 1)) / 100) * 100, nsim = seq(1, 100, 1), th = seq(-4, 4, 2) ) cl <- makeCluster(detectCores() - 1) clusterEvalQ(cl, { library(MASS); library(pscl) }) clusterExport(cl, "gen.sim") results <- parApply(cl, data.grid, 1, gen.sim) stopCluster(cl) forshiny <- cbind(data.grid, t(results)) write.csv(forshiny, "results.csv", row.names = FALSE) ``` Count data models can be used even if data is not "counts"; for example, some non-negative non-integer numbers. In fact, Poisson model is consistent even if data is not Poisson-distributed, if the model specification is correct on modeling the log of expected counts. We simulate both scenarios: Case 1, data is generated from a Poisson process. Case 2, data is generated from a Normal distribution, but we use count data models to model it. The above code is for case 2. We simulate 100 times with $\theta$ ranging from -4 to 4, lower number means higher percentage of zeros; number of observations from $e^4$ to $e^9$ (roughly from 50 to 8000). Since there are many simulations, we used "snowfall" library to speed things up. For raw code, please visit [case1: poisson](https://github.com/xiangao/poisson) and [case2: normal](https://github.com/xiangao/poisson2). The code above updates the original `snowfall`-based version to use base R `parallel`. **Summary of simulation findings:** With large samples, plain Poisson outperforms NB and log-linear even with extra zeros. Zero-inflated Poisson with correct specification of the zero process is best, but that relies on knowing what drives the zeros. Hurdle and ZIP without correct zero-model specification are similar and not much worse than Poisson. Log-linear (`log(y+1)`) is consistently the worst. ## Quick worked example All five models on a single small dataset (n = 1000, ~40% zeros): ```{r} #| label: quick-example #| cache: true #| warning: false #| message: false library(MASS) library(pscl) set.seed(42) n <- 1000 x <- rnorm(n) z <- rnorm(n) # zero-inflation driver mu <- exp(1.5 + 0.8 * x) # count mean y_c <- rpois(n, lambda = mu) y_b <- rbinom(n, 1, plogis(1 + 2 * z)) # 1 = non-zero y <- y_c * y_b # zero-inflated count cat("Zero fraction:", round(mean(y == 0), 2), "\n") cat("Mean (non-zero):", round(mean(y[y > 0]), 2), "\n\n") # 1. Poisson m_pois <- glm(y ~ x, family = poisson) # 2. Negative binomial m_nb <- glm.nb(y ~ x) # 3. Zero-inflated Poisson (ZIP) — correct zero model m_zip <- zeroinfl(y ~ x | z) # 4. ZIP — wrong zero model (only x) m_zip2 <- zeroinfl(y ~ x | x) # 5. Hurdle m_hurdle <- hurdle(y ~ x) # Compare x coefficient (true = 0.8) coefs <- c( Poisson = coef(m_pois)["x"], NB = coef(m_nb)["x"], `ZIP (correct z)` = coef(m_zip, "count")["x"], `ZIP (wrong)` = coef(m_zip2, "count")["x"], Hurdle = coef(m_hurdle, "count")["x"] ) knitr::kable( data.frame(Model = names(coefs), `x coefficient` = round(coefs, 3), Bias = round(coefs - 0.8, 3)), caption = "True x coefficient = 0.8" ) ``` ZIP with the correct zero-inflation model (using `z`) recovers the true coefficient most accurately. Without `z`, all models show upward bias because the zeros are incorrectly attributed to the count process. ## Mixed count models with glmmTMB When data has both overdispersion *and* random effects (clustered counts, panel data, repeated measures), `glmmTMB` handles the full range in one package: ```{r} #| eval: false #| echo: true library(glmmTMB) # Zero-inflated negative binomial with random intercept per subject m_glmmtmb <- glmmTMB( count ~ treatment + time + (1 | subject), # count model ziformula = ~ treatment, # zero-inflation model family = nbinom2, data = your_data ) summary(m_glmmtmb) # Conditional vs marginal predictions predict(m_glmmtmb, type = "conditional") # excluding zero-inflation predict(m_glmmtmb, type = "response") # including zero-inflation ``` `glmmTMB` supports: Poisson, NB1, NB2, zero-inflated variants of all of the above, hurdle models, beta-binomial, tweedie, and more. The `ziformula` argument takes a one-sided formula for the zero-inflation component — use `~ 1` for a constant zero-inflation rate, or covariates to model what drives the structural zeros.