8 Count data models

8.1 Poisson Model

If the interested variable is a count data variable, it is natural to model it as a Poisson process. The number of occurrence follows a Poisson process:

\[ {\rm Pr} [Y=y]=\frac{e^{-\mu} \mu^y}{y! } \]

One of Poisson’s properties is that it has equal variance as its mean:

\[ {\rm E} [Y]={\rm Var} [Y]=\mu \]

In a count data model such as Poisson regression model, we are modeling log of the expected counts:

\[ \log ({\rm E}( Y) )=\bf X_i' \beta \]

The log-likelihood function is \[ \log L(\beta)= \sum_{i=1}^N {- \exp (X_i' \beta) + y_i X_i' \beta - \ln y_i!} \]

To maximize it, the first order condition is

\[ \sum_{i=1}^N (y_i - \exp (X_i' \beta) )X_i'=0 \]

We can use Newton-Raphson or other optimization algorithms to find the solution for the first order condition.

8.2 QMLE Poisson

8.2.0.1 Model

Note that in the first-order condition above, if $X_i'$ includes a constant, then $(y_i - \exp (X_i' \beta) )$ sums to zero. If ${\rm E} [y_i|X_i] = \exp (X_i' \beta)$, then the summation on the left-hand side has expectation of zero. Hence the only specification needed to apply it is the conditional expectation of $Y$ given $X$. Even if the data is not Poisson-distributed, the estimator is still consistent. Therefore, this estimator is called the quasi-ML (QML) Poisson estimator.

Although the QML Poisson estimator is consistent under relatively weak condition, the regular variance estimator for the coefficients are not valid anymore.

If a stronger assumption is made that the data follows a Poisson distribution, then the error term has a variance which equals to the mean of $Y$. The estimator $\hat \beta_P$ (ML estimator) follows a normal distribution asymptotically with a variance matrix

\[ {\rm Var}_{ML} [\hat \beta_P] = (\sum_{i=1}^N \mu_i X_i X_i')^{-1} \]

On the other hand, the variance matrix for the QML Poisson estimator $\hat \beta_{QML}$ is

\[ {\rm Var}_{QML} [\hat \beta_{QML}] = (\sum_{i=1}^N \mu_i X_i X_i')^{-1} (\sum_{i=1}^N \omega_i X_i X_i') (\sum_{i=1}^N \mu_i X_i X_i')^{-1} \] where $_i={} [y_i|X_i] $ is the conditional variance of $y_i$.

8.2.1 Implementation

Poisson estimation is implemented in almost every statistical package. However, some of them may not work if you have a continuous dependent variable.

Stata’s implementation of the Poisson model: “poisson” and “xtpoisson” do take a continuous dependent variable. Standard errors reported without any “vce” specification assume the Poisson variance equals the mean and are likely too small under overdispersion. For a cross-sectional Poisson treated as QMLE, the robust (sandwich) standard errors – the “robust” option – are exactly the adjustment needed for a misspecified variance, so they are valid even when the equidispersion assumption fails; the conditional mean only needs to be correctly specified. In the panel fixed-effects case, serial correlation within a unit is a separate issue that requires clustered standard errors. Older versions of “xtpoisson” did not offer a cluster-robust option, which is why the user-written program xtpqml was written to supply the cluster-robust standard errors of Wooldridge (1999) after xtpoisson.

8.3 Fixed-effect Poisson model

We specify a panel data count model as: \[ E[y_{it} | \alpha_i, X_{it}]= \alpha_i \exp(X'_{it} \beta) = \exp(\gamma_i + X'_{it} \beta). \]

It turns out that for fixed-effect Poisson model, just like in linear regression case, there is NO incidental parameter problem. Poisson MLE is the same for conditional and unconditional likelihood.(See Cameron and Trivedi, 2005, Page 805).

Therefore, the coefficient estimates are the same for conditional or unconditional fixed-effect Poisson model. That is, in Stata, “xtpoisson, fe” will return the same results as “xi: poisson i.group”, is the same as “xtpqml, fe”. The only difference is that “xtpqml, fe” returns “correct” standard errors for QMLE Poisson. You can also calculate standard errors with the other commands, using clustered standard errors, or bootstrapped standard errors.

Conditional fixed effect negative binomial model is also possible to be consistent for some specifications. But Poisson fixed effect is more popular since it is QMLE.

8.4 Negative Binomial model

The Poisson model has a restrictive property that the conditional variance equals the conditional mean. Often we’ll see “overdispersion”, that is, the variance exceeds the mean, in real data. The negative binomial model is a generalization of poisson model which allows for overdispersion by introducing an unobserved heterogeneity term. In a negative binomial model,

\[ {\rm E} [Y | X, \tau]=\mu \tau = e^{X' \beta} \cdot \tau \]

This extra $\tau$ term follows a $Gamma(\theta, \theta)$ distribution, with $E(\tau)=1$ and $Var (\tau)=1/\theta$.

Conditional on $X$ and $\tau$, the distribution of $Y$ is still poisson: \[ {\rm Pr} [Y=y | X, \tau]=\frac{e^{-\mu \tau} (\mu \tau)^y}{y! } \]

Conditional on $X$ only, the distribution of $Y$ is negative binomial: \[ {\rm Pr} [Y=y | X]=\frac{\theta^{\theta} \mu^y \Gamma(\theta + y)}{ \Gamma(y+1) \Gamma(\theta) (\mu + \theta)^{\theta+y}} \]

This distribution still has conditional mean of $\mu$, but variance is $\mu (1+(1/\theta) \mu)$.

When $\alpha=1/\theta$ approaches zero, the negative binomial model converges to poisson model.

8.5 Models for truncated counts

In some situations, all we observe are non-zero counts, because of the way data was collected. For example, we only observe people’s visits to a park, only if they visit. To model how many times they visit, based on this kind of data set, we need to use zero-truncated count models.

\[ {\rm Pr} [Y=y | y>0, X]=\frac{{\rm Pr} (y | X) }{{\rm Pr} (y>0 | X)}= \frac{{\rm Pr} (y | X)}{1- \exp (- \mu)} \]

where the last equality holds for the Poisson case, since ${\rm Pr} (y=0|X)=e^{-\mu}$; for the negative binomial, ${\rm Pr} (y=0|X)= (\theta/(\theta+\mu))^{\theta}$, so the truncation factor is $1- (\theta/(\theta+\mu))^{\theta}$ instead. Thus we are modeling the counts given that we have a positive outcome. Both zero-truncated poisson (“ztp” in Stata) and zero-truncated negative binomial (“ztnb” in Stata) are estimated by Maximum-likelihood method.

We need to pay extra attention to the “over-dispersion” problem when dealing with zero-truncated data. When data is not truncated, coefficients estimated by poisson model is consistent, even if data is “overdispersed”. However, if we have zero-truncated sample, then “over-dispersion” results in inconsistent coefficient estimation (Grogger and Carson, 1991). Therefore, “over-dispersion” needs to be tested before using truncated poisson model or truncated negative binomial model.

8.6 The hurdle regression model

Although negative binomial model relaxes the assumption of equal mean and variance, some researchers prefer modeling the process of generating zeros differently from the process of generating other values. Suppose zero counts are generated by a binary process:

\[ {\rm Pr} [y=0 | X]=\frac{\exp(X \gamma)}{1 + \exp(X \gamma)} = \pi \]

Positive counts are generated by a truncated count process.

In this model, zero is a “hurdle” to get past before reaching positive counts. The hurdle model is estimated by two separate equations. It is easy to estimate. However, the marginal effect is tricky to calculate, since an independent variable, if appearing in both equations, will have effect through both equations.

\[ {\rm E} (y | X) = [\pi \times 0 ] + {(1-\pi) \times {\rm E} (y | y>0, X) } = (1- \pi) \times {\rm E} (y | y>0 , X) \]

In stata, there are commands such as “hplogit”, which is a hurdle model with first equation logit, and second equation truncated poisson. It is the same as two separate estimations; namely “logit” first, then “ztp” on positive counts.

The difference between a hurdle model and a Heckman selection model is that in a Heckman model the two equations are jointly estimated and the outcome equation must account for selection bias. An exclusion restriction – a variable that affects selection but not the outcome – is strongly recommended for the selection equation. Strictly speaking, under joint normality the model is identified by the nonlinearity of the inverse Mills ratio even without an excluded variable, but identification then rests entirely on functional form and is fragile, so an exclusion restriction is recommended in practice. On the other hand, a hurdle model assumes the two processes are separate; there is no selection.

8.7 Zero-inflated count models

A zero-inflated count model also models two different processes. One is a binary process: generating zero or non-zeros. The second one is a count model. Note there is a different between zero-inflated and hurdle model: In hurdle model, the second process is a truncated count model (positive counts). In a zero-inflated model, a zero can come from either the binary process, or from the count process.

A second difference is that the estimation of zero-inflated model is a joint estimation of the two processes.

\[ y_i \sim \left\{ \begin{array}{c} 0 \ \mbox{with probability} \ \psi \\ g(y_i | X_i) \mbox{with probability} \ 1- \psi \end{array} \right. \]

Then

\[ P(Y_i=y_i | X_i, Z_i) = \left\{ \begin{array}{c} \psi(\gamma' Z_i) + (1- \psi(\gamma' Z_i) ) g(0 | X_i) \ \mbox{if} \ y_i=0 \\ (1- \psi(\gamma' Z_i) ) g(y_i | X_i) \ \mbox{if} \ y_i>0 \end{array} \right. \]

This says that every non-zero observation follows a count process $g(y_i|X_i)$. Every zero observation has two possible sources: from a binary process (with probability $\psi(\gamma' Z_i)$) and from a count process (with probability $1- \psi(\gamma' Z_i)$ into the count process and then with probability of $g(0|X_i)$ to be a zero). The count process can be a Poisson process (“zip” in Stata) or a Negative Binomial process (“zinb” in Stata).

8.8 An example of extra-zero count data model comparison

A companion simulation study (Ao, 2014; see also the more detailed write-up in this guide’s companion blog_book, chapter “Which count data model to use”) compares plain Poisson, Negative Binomial, zero-inflated Poisson (ZIP, both with a correctly and an incorrectly specified zero-process), and hurdle models across sample sizes and zero-inflation levels. The main findings: with large samples, plain Poisson QMLE is consistent even with unmodeled excess zeros, as long as the conditional mean of the count process is correctly specified and independent of the zero-generating process – the excess zeros only inflate the standard errors (via overdispersion), not the point estimates, so robust/sandwich standard errors should be used rather than switching models. Zero-inflated Poisson with a correctly specified zero process performs best when that specification is available; hurdle and misspecified-zero-process ZIP perform similarly to plain Poisson; log-linear regression on $\log(y+1)$ is consistently the worst performer, since it targets a different (log-scale) conditional expectation than the models above.

8.9 Interpreting Coefficients and Marginal Effects in Nonlinear Models

This section is broader than count data: it covers how to read coefficients and compute marginal effects in nonlinear models generally, including the binary discrete-choice models of the Discrete and Limited Dependent Variables chapter as well as the count models above.

8.9.1 Binary Response Models

In a linear model \[ y_i= X_i' \beta + \epsilon, \] \[E(y_i) = X_i' \beta \]

When $y_i$ is binary (meaning it takes two values, 1 or 0), We can also do an OLS regression on it, but a more popular way is to use a non-linear model. The most popular choices for modeling binary response are logit model and probit model. Both of them use the same idea: use a link function to map the binary variable into a continuous variable which is a linear function of the predictors.

Suppose $p_i$ is the probability that $y_i=1$. Then \[p_i=E(y_i) = g^{-1}(X_i' \beta), \] where $g^{-1}()$ is a function that maps from the linear predictor to $y_i$, which is often called index function or inverse link function.

This says that the expectation of $y_i$ (equivalently, $p_i$) is not a linear function of $x_i$’s, but by using the link function, we are still able to model a transformed $p_i$ as a linear function of $x_i$’s: \[g(p_i) = X_i' \beta. \]

$g^{-1}()$ can be a normal CDF then we have a probit model: \[ g^{-1}(z) = \Phi (z) = \int_{- \infty}^z \frac{1}{\sqrt{2 \pi}} {\rm exp} [-(t^2/2)] dt, \]

or it can be a logit:

\[ g^{-1}(z) = \Lambda (z) = e^z/(1+e^z). \]

8.9.2 Marginal Effects

Often times we are interested in the marginal effects of the predictors. In a linear model, it’s easy: the coefficient on $x_1$ means when $x_1$ increase by one unit, how much will $y$ increase.

In a binary response model, \[\frac{\partial p}{\partial x} = \frac{\partial g^{-1}(X_i' \beta)}{\partial x}\]

In the case of probit:

\[\frac{\partial p}{\partial x} = \frac{\partial \Phi (X_i' \beta)}{\partial x} = \phi(X_i' \beta) \beta, \]

where $\phi()$ is the density function of normal distribution.

In the case of logit:

\[\frac{\partial p}{\partial x} = \frac{\partial \Lambda (X_i' \beta)}{\partial x} = \gamma(X_i' \beta) \beta, \] where $\gamma()$ is the density function for logistic distribution: \[\gamma(X_i' \beta) = \Lambda(X_i' \beta) (1- \Lambda(X_i' \beta)) = p(1-p).\]

Therefore for logit, a quick approximation of the marginal effect is to plug in the sample proportion $\bar p$ and coefficient estimate $\hat \beta$:

\[ \hat{ \frac{\partial p}{\partial x}} \approx \bar p (1- \bar p) \hat \beta \]

This evaluates the effect at the mean probability; it is a rough rule of thumb, not the average marginal effect $\frac{1}{n}\sum_i \hat p_i(1-\hat p_i)\hat\beta$, which is what software reports.

The logit model can also be formulated as \[ \log (\frac{p}{1-p})=\bf X_i' \beta \] which says the logarithm of the odds (the ratio of the two probabilities) is equal to $\bf X_i' \beta$. Therefore, \[ p =\frac{\exp( X_i' \beta)}{1+\exp(X_i' \beta)}=\Lambda( X_i' \beta) \]

Often times we have discrete valued predictors, such as gender, or other dummy variables. In that case, it makes more sense to ask what’s the effect of gender being 1 vs. 0 (female vs. male). What this question is can be formulated as:

\[ E(p | x_1=1) - E(p | x_1=0). \]

In empirical analysis, it’s often calculated by calculating the predicted probability by setting $x_1$ to 1 and by setting $x_1$ to 0. Then calculate the difference.

8.9.3 Count Data Models

In a count data model such as Poisson regression model or Negative Binomial model, we are modeling log of the expected counts:

\[ \log ({\rm E}( y) )=\bf X_i' \beta \] Therefore the marginal effect of $x_1$, for example, on ${\rm E} (y)$ is

\[ { \frac{\partial \log {\rm E} (y) }{\partial x}} = \hat \beta , \]

which means $\beta$ is the marginal effect of $x$ on $\log$ of the expected counts. Suppose the expected counts at $x$ is $\mu_{x}$, then

\[ { \frac{(\partial \mu_x) / \mu_x }{\partial x}} = \hat \beta , \]

which says that $\beta$ represents the marginal effect of percentage change in $\mu_x$ with respect to $x$.

Another way to formulate this is to use the Incidence Rate Ratio Interpretation. Suppose the expected counts at $x$ is $\mu_{x}$, and expected counts at $x+1$ is $\mu_{x+1}$.

\[ \log (\mu_x) = x' \beta. \]

Therefore,

\[ \mu_{x} = \exp (x' \beta), \]

and

\[ \frac{\mu_{x+1}}{\mu_{x}} = \exp ( \beta). \]

This says that the exponentiated $\beta$ is the incidence rate ratio of the expected counts, if $x$ increases by $1$.

To calculate the marginal effect of $x$ on $y$ (or ${\rm E} (y)$), we need to calculate

\[\frac{ \partial \mu_x} {\partial x} = \hat \beta \mu_x = \hat \beta \exp(X_i' \hat \beta) , \] if we evaluate $\mu_x$ at predicted value.

8.9.4 How to calculate marginal effects in Stata

Stata’s margins command is a powerful command to get the predicted value or marginal effects after an estimation command. Check Stata’s manual for more details.

# Count data models ## Poisson Model If the interested variable is a count data variable, it is natural to model it as a Poisson process. The number of occurrence follows a Poisson process: $$ {\rm Pr} [Y=y]=\frac{e^{-\mu} \mu^y}{y! } $$ One of Poisson's properties is that it has equal variance as its mean: $$ {\rm E} [Y]={\rm Var} [Y]=\mu $$ In a count data model such as Poisson regression model, we are modeling log of the expected counts: $$ \log ({\rm E}( Y) )=\bf X_i' \beta $$ The log-likelihood function is $$ \log L(\beta)= \sum_{i=1}^N {- \exp (X_i' \beta) + y_i X_i' \beta - \ln y_i!} $$ To maximize it, the first order condition is $$ \sum_{i=1}^N (y_i - \exp (X_i' \beta) )X_i'=0 $$ We can use Newton-Raphson or other optimization algorithms to find the solution for the first order condition. ## QMLE Poisson #### Model Note that in the first-order condition above, if $X_i'$ includes a constant, then $(y_i - \exp (X_i' \beta) )$ sums to zero. If ${\rm E} [y_i|X_i] = \exp (X_i' \beta)$, then the summation on the left-hand side has expectation of zero. Hence the only specification needed to apply it is the conditional expectation of $Y$ given $X$. Even if the data is not Poisson-distributed, the estimator is still consistent. Therefore, this estimator is called the quasi-ML (QML) Poisson estimator. Although the QML Poisson estimator is consistent under relatively weak condition, the regular variance estimator for the coefficients are not valid anymore. If a stronger assumption is made that the data follows a Poisson distribution, then the error term has a variance which equals to the mean of $Y$. The estimator $\hat \beta_P$ (ML estimator) follows a normal distribution asymptotically with a variance matrix $$ {\rm Var}_{ML} [\hat \beta_P] = (\sum_{i=1}^N \mu_i X_i X_i')^{-1} $$ On the other hand, the variance matrix for the QML Poisson estimator $\hat \beta_{QML}$ is $$ {\rm Var}_{QML} [\hat \beta_{QML}] = (\sum_{i=1}^N \mu_i X_i X_i')^{-1} (\sum_{i=1}^N \omega_i X_i X_i') (\sum_{i=1}^N \mu_i X_i X_i')^{-1} $$ where $\omega_i={\rm Var} [y_i|X_i] $ is the conditional variance of $y_i$. ### Implementation Poisson estimation is implemented in almost every statistical package. However, some of them may not work if you have a continuous dependent variable. Stata's implementation of the Poisson model: "poisson" and "xtpoisson" do take a continuous dependent variable. Standard errors reported without any "vce" specification assume the Poisson variance equals the mean and are likely too small under overdispersion. For a cross-sectional Poisson treated as QMLE, the robust (sandwich) standard errors -- the "robust" option -- *are* exactly the adjustment needed for a misspecified variance, so they are valid even when the equidispersion assumption fails; the conditional mean only needs to be correctly specified. In the panel fixed-effects case, serial correlation within a unit is a separate issue that requires *clustered* standard errors. Older versions of "xtpoisson" did not offer a cluster-robust option, which is why the user-written program xtpqml was written to supply the cluster-robust standard errors of Wooldridge (1999) after xtpoisson. ## Fixed-effect Poisson model We specify a panel data count model as: $$ E[y_{it} | \alpha_i, X_{it}]= \alpha_i \exp(X'_{it} \beta) = \exp(\gamma_i + X'_{it} \beta). $$ It turns out that for fixed-effect Poisson model, just like in linear regression case, there is NO incidental parameter problem. Poisson MLE is the same for conditional and unconditional likelihood.(See Cameron and Trivedi, 2005, Page 805). Therefore, the coefficient estimates are the same for conditional or unconditional fixed-effect Poisson model. That is, in Stata, "xtpoisson, fe" will return the same results as "xi: poisson i.group", is the same as "xtpqml, fe". The only difference is that "xtpqml, fe" returns "correct" standard errors for QMLE Poisson. You can also calculate standard errors with the other commands, using clustered standard errors, or bootstrapped standard errors. Conditional fixed effect negative binomial model is also possible to be consistent for some specifications. But Poisson fixed effect is more popular since it is QMLE. ## Negative Binomial model The Poisson model has a restrictive property that the conditional variance equals the conditional mean. Often we'll see "overdispersion", that is, the variance exceeds the mean, in real data. The negative binomial model is a generalization of poisson model which allows for overdispersion by introducing an unobserved heterogeneity term. In a negative binomial model, $$ {\rm E} [Y | X, \tau]=\mu \tau = e^{X' \beta} \cdot \tau $$ This extra $\tau$ term follows a $Gamma(\theta, \theta)$ distribution, with $E(\tau)=1$ and $Var (\tau)=1/\theta$. Conditional on $X$ and $\tau$, the distribution of $Y$ is still poisson: $$ {\rm Pr} [Y=y | X, \tau]=\frac{e^{-\mu \tau} (\mu \tau)^y}{y! } $$ Conditional on $X$ only, the distribution of $Y$ is negative binomial: $$ {\rm Pr} [Y=y | X]=\frac{\theta^{\theta} \mu^y \Gamma(\theta + y)}{ \Gamma(y+1) \Gamma(\theta) (\mu + \theta)^{\theta+y}} $$ This distribution still has conditional mean of $\mu$, but variance is $\mu (1+(1/\theta) \mu)$. When $\alpha=1/\theta$ approaches zero, the negative binomial model converges to poisson model. ## Models for truncated counts In some situations, all we observe are non-zero counts, because of the way data was collected. For example, we only observe people's visits to a park, only if they visit. To model how many times they visit, based on this kind of data set, we need to use zero-truncated count models. $$ {\rm Pr} [Y=y | y>0, X]=\frac{{\rm Pr} (y | X) }{{\rm Pr} (y>0 | X)}= \frac{{\rm Pr} (y | X)}{1- \exp (- \mu)} $$ where the last equality holds for the Poisson case, since ${\rm Pr} (y=0|X)=e^{-\mu}$; for the negative binomial, ${\rm Pr} (y=0|X)= (\theta/(\theta+\mu))^{\theta}$, so the truncation factor is $1- (\theta/(\theta+\mu))^{\theta}$ instead. Thus we are modeling the counts given that we have a positive outcome. Both zero-truncated poisson ("ztp" in Stata) and zero-truncated negative binomial ("ztnb" in Stata) are estimated by Maximum-likelihood method. We need to pay extra attention to the "over-dispersion" problem when dealing with zero-truncated data. When data is not truncated, coefficients estimated by poisson model is consistent, even if data is "overdispersed". However, if we have zero-truncated sample, then "over-dispersion" results in inconsistent coefficient estimation (Grogger and Carson, 1991). Therefore, "over-dispersion" needs to be tested before using truncated poisson model or truncated negative binomial model. ## The hurdle regression model Although negative binomial model relaxes the assumption of equal mean and variance, some researchers prefer modeling the process of generating zeros differently from the process of generating other values. Suppose zero counts are generated by a binary process: $$ {\rm Pr} [y=0 | X]=\frac{\exp(X \gamma)}{1 + \exp(X \gamma)} = \pi $$ Positive counts are generated by a truncated count process. In this model, zero is a "hurdle" to get past before reaching positive counts. The hurdle model is estimated by two separate equations. It is easy to estimate. However, the marginal effect is tricky to calculate, since an independent variable, if appearing in both equations, will have effect through both equations. $$ {\rm E} (y | X) = [\pi \times 0 ] + {(1-\pi) \times {\rm E} (y | y>0, X) } = (1- \pi) \times {\rm E} (y | y>0 , X) $$ In stata, there are commands such as "hplogit", which is a hurdle model with first equation logit, and second equation truncated poisson. It is the same as two separate estimations; namely "logit" first, then "ztp" on positive counts. The difference between a hurdle model and a Heckman selection model is that in a Heckman model the two equations are jointly estimated and the outcome equation must account for selection bias. An exclusion restriction -- a variable that affects selection but not the outcome -- is strongly recommended for the selection equation. Strictly speaking, under joint normality the model is identified by the nonlinearity of the inverse Mills ratio even without an excluded variable, but identification then rests entirely on functional form and is fragile, so an exclusion restriction is recommended in practice. On the other hand, a hurdle model assumes the two processes are separate; there is no selection. ## Zero-inflated count models A zero-inflated count model also models two different processes. One is a binary process: generating zero or non-zeros. The second one is a count model. Note there is a different between zero-inflated and hurdle model: In hurdle model, the second process is a truncated count model (positive counts). In a zero-inflated model, a zero can come from either the binary process, or from the count process. A second difference is that the estimation of zero-inflated model is a joint estimation of the two processes. $$ y_i \sim \left\{ \begin{array}{c} 0 \ \mbox{with probability} \ \psi \\ g(y_i | X_i) \mbox{with probability} \ 1- \psi \end{array} \right. $$ Then $$ P(Y_i=y_i | X_i, Z_i) = \left\{ \begin{array}{c} \psi(\gamma' Z_i) + (1- \psi(\gamma' Z_i) ) g(0 | X_i) \ \mbox{if} \ y_i=0 \\ (1- \psi(\gamma' Z_i) ) g(y_i | X_i) \ \mbox{if} \ y_i>0 \end{array} \right. $$ This says that every non-zero observation follows a count process $g(y_i|X_i)$. Every zero observation has two possible sources: from a binary process (with probability $\psi(\gamma' Z_i)$) and from a count process (with probability $1- \psi(\gamma' Z_i)$ into the count process and then with probability of $g(0|X_i)$ to be a zero). The count process can be a Poisson process ("zip" in Stata) or a Negative Binomial process ("zinb" in Stata). ## An example of extra-zero count data model comparison A companion simulation study ([Ao, 2014](http://hbs-rcs.github.io/blog/2014/09/17/poisson-models/); see also the more detailed write-up in this guide's companion `blog_book`, chapter "Which count data model to use") compares plain Poisson, Negative Binomial, zero-inflated Poisson (ZIP, both with a correctly and an incorrectly specified zero-process), and hurdle models across sample sizes and zero-inflation levels. The main findings: with large samples, plain Poisson QMLE is consistent even with unmodeled excess zeros, as long as the conditional mean of the count process is correctly specified and independent of the zero-generating process -- the excess zeros only inflate the standard errors (via overdispersion), not the point estimates, so robust/sandwich standard errors should be used rather than switching models. Zero-inflated Poisson with a correctly specified zero process performs best when that specification is available; hurdle and misspecified-zero-process ZIP perform similarly to plain Poisson; log-linear regression on $\log(y+1)$ is consistently the worst performer, since it targets a different (log-scale) conditional expectation than the models above. ## Interpreting Coefficients and Marginal Effects in Nonlinear Models This section is broader than count data: it covers how to read coefficients and compute marginal effects in nonlinear models generally, including the binary discrete-choice models of the [Discrete and Limited Dependent Variables](discrete.qmd) chapter as well as the count models above. ### Binary Response Models In a linear model $$ y_i= X_i' \beta + \epsilon, $$ $$E(y_i) = X_i' \beta $$ When $y_i$ is binary (meaning it takes two values, 1 or 0), We can also do an OLS regression on it, but a more popular way is to use a non-linear model. The most popular choices for modeling binary response are logit model and probit model. Both of them use the same idea: use a link function to map the binary variable into a continuous variable which is a linear function of the predictors. Suppose $p_i$ is the probability that $y_i=1$. Then $$p_i=E(y_i) = g^{-1}(X_i' \beta), $$ where $g^{-1}()$ is a function that maps from the linear predictor to $y_i$, which is often called index function or inverse link function. This says that the expectation of $y_i$ (equivalently, $p_i$) is not a linear function of $x_i$'s, but by using the link function, we are still able to model a transformed $p_i$ as a linear function of $x_i$'s: $$g(p_i) = X_i' \beta. $$ $g^{-1}()$ can be a normal CDF then we have a probit model: $$ g^{-1}(z) = \Phi (z) = \int_{- \infty}^z \frac{1}{\sqrt{2 \pi}} {\rm exp} [-(t^2/2)] dt, $$ or it can be a logit: $$ g^{-1}(z) = \Lambda (z) = e^z/(1+e^z). $$ ### Marginal Effects Often times we are interested in the marginal effects of the predictors. In a linear model, it's easy: the coefficient on $x_1$ means when $x_1$ increase by one unit, how much will $y$ increase. In a binary response model, $$\frac{\partial p}{\partial x} = \frac{\partial g^{-1}(X_i' \beta)}{\partial x}$$ In the case of probit: $$\frac{\partial p}{\partial x} = \frac{\partial \Phi (X_i' \beta)}{\partial x} = \phi(X_i' \beta) \beta, $$ where $\phi()$ is the density function of normal distribution. In the case of logit: $$\frac{\partial p}{\partial x} = \frac{\partial \Lambda (X_i' \beta)}{\partial x} = \gamma(X_i' \beta) \beta, $$ where $\gamma()$ is the density function for logistic distribution: $$\gamma(X_i' \beta) = \Lambda(X_i' \beta) (1- \Lambda(X_i' \beta)) = p(1-p).$$ Therefore for logit, a quick approximation of the marginal effect is to plug in the sample proportion $\bar p$ and coefficient estimate $\hat \beta$: $$ \hat{ \frac{\partial p}{\partial x}} \approx \bar p (1- \bar p) \hat \beta $$ This evaluates the effect at the mean probability; it is a rough rule of thumb, not the average marginal effect $\frac{1}{n}\sum_i \hat p_i(1-\hat p_i)\hat\beta$, which is what software reports. The logit model can also be formulated as $$ \log (\frac{p}{1-p})=\bf X_i' \beta $$ which says the logarithm of the odds (the ratio of the two probabilities) is equal to $\bf X_i' \beta$. Therefore, $$ p =\frac{\exp( X_i' \beta)}{1+\exp(X_i' \beta)}=\Lambda( X_i' \beta) $$ Often times we have discrete valued predictors, such as gender, or other dummy variables. In that case, it makes more sense to ask what's the effect of gender being 1 vs. 0 (female vs. male). What this question is can be formulated as: $$ E(p | x_1=1) - E(p | x_1=0). $$ In empirical analysis, it's often calculated by calculating the predicted probability by setting $x_1$ to 1 and by setting $x_1$ to 0. Then calculate the difference. ### Count Data Models In a count data model such as Poisson regression model or Negative Binomial model, we are modeling log of the expected counts: $$ \log ({\rm E}( y) )=\bf X_i' \beta $$ Therefore the marginal effect of $x_1$, for example, on ${\rm E} (y)$ is $$ { \frac{\partial \log {\rm E} (y) }{\partial x}} = \hat \beta , $$ which means $\beta$ is the marginal effect of $x$ on $\log$ of the expected counts. Suppose the expected counts at $x$ is $\mu_{x}$, then $$ { \frac{(\partial \mu_x) / \mu_x }{\partial x}} = \hat \beta , $$ which says that $\beta$ represents the marginal effect of percentage change in $\mu_x$ with respect to $x$. Another way to formulate this is to use the Incidence Rate Ratio Interpretation. Suppose the expected counts at $x$ is $\mu_{x}$, and expected counts at $x+1$ is $\mu_{x+1}$. $$ \log (\mu_x) = x' \beta. $$ Therefore, $$ \mu_{x} = \exp (x' \beta), $$ and $$ \frac{\mu_{x+1}}{\mu_{x}} = \exp ( \beta). $$ This says that the exponentiated $\beta$ is the incidence rate ratio of the expected counts, if $x$ increases by $1$. To calculate the marginal effect of $x$ on $y$ (or ${\rm E} (y)$), we need to calculate $$\frac{ \partial \mu_x} {\partial x} = \hat \beta \mu_x = \hat \beta \exp(X_i' \hat \beta) , $$ if we evaluate $\mu_x$ at predicted value. ### How to calculate marginal effects in Stata Stata's margins command is a powerful command to get the predicted value or marginal effects after an estimation command. Check Stata's manual for more details.