8 Count data models
8.1 Poisson Model
If the interested variable is a count data variable, it is natural to model it as a Poisson process. The number of occurrence follows a Poisson process:
\[ {\rm Pr} [Y=y]=\frac{e^{-\mu} \mu^y}{y! } \]
One of Poisson’s properties is that it has equal variance as its mean:
\[ {\rm E} [Y]={\rm Var} [Y]=\mu \]
In a count data model such as Poisson regression model, we are modeling log of the expected counts:
\[ \log ({\rm E}( Y) )=\bf X_i' \beta \]
The log-likelihood function is \[ \log L(\beta)= \sum_{i=1}^N {- \exp (X_i' \beta) + y_i X_i' \beta - \ln y_i!} \]
To maximize it, the first order condition is
\[ \sum_{i=1}^N (y_i - \exp (X_i' \beta) )X_i'=0 \]
We can use Newton-Raphson or other optimization algorithms to find the solution for the first order condition.
8.2 QMLE Poisson
8.2.0.1 Model
Note that in equation the equation above, if \(X_i'\) includes a constant, then \((y_i - \exp (X_i' \beta) )\) sums to zero. If \({\rm E} [y_i|X_i] = \exp (X_i' \beta)\), then the summation on the left-hand side has expectation of zero. Hence the only specification needed to apply equation the equation above is the conditional expectation of \(Y\) given \(X\). Even the data is not Poisson-distributed, the estimator by equation the equation above is still consistent. Therefore, this estimator is called quasi_ML (QML) Poisson estimator.
Although the QML Poisson estimator is consistent under relatively weak condition, the regular variance estimator for the coefficients are not valid anymore.
If a stronger assumption is made that the data follows a Poisson distribution, then the error term has a variance which equals to the mean of \(Y\). The estimator \(\hat \beta_P\) (ML estimator) follows a normal distribution asymptotically with a variance matrix
\[ {\rm Var}_{ML} [\hat \beta_P] = (\sum_{i=1}^N \mu_i X_i X_i')^{-1} \]
On the other hand, the variance matrix for the QML Poisson estimator \(\hat \beta_{QML}\) is
\[ {\rm Var}_{QML} [\hat \beta_P] = (\sum_{i=1}^N \mu_i X_i X_i')^{-1} (\sum_{i=1}^N \omega_i X_i X_i')^{-1} (\sum_{i=1}^N \mu_i X_i X_i')^{-1} \] where $_i={} [y_i|X_i] $ is the conditional variance of \(y_i\).
8.2.1 Implementation
Poisson estimation is implemented in almost every statistical package. However, some of them may not work if you have a continuous dependent variable.
Stata’s implementation of Poisson model: “poisson” and “xtpoisson” do take continuous dependent variable. Note that standard errors reported without any specification in “vce” will be likely downward biased. Therefore, always use “robust” option at least. Bootstrapped standard errors are also preferred. However, if you intend to use it as QMLE-Poisson, standard errors need to be adjusted. Those two procedures do not adjust for standard errors. A user-written program called xtpqml calls for xtpoisson and it calculates robust standard error which is suggested by Wooldridge (1999).
8.3 Fixed-effect Poisson model
We specify a panel data count model as: \[ E[y_{it} | \alpha_i, X_{it}]= \alpha_i \exp(X'_{it} \beta) = \exp(\gamma_i + X'_{it} \beta). \]
It turns out that for fixed-effect Poisson model, just like in linear regression case, there is NO incidental parameter problem. Poisson MLE is the same for conditional and unconditional likelihood.(See Cameron and Trivedi, 2005, Page 805).
Therefore, the coefficient estimates are the same for conditional or unconditional fixed-effect Poisson model. That is, in Stata, “xtpoisson, fe” will return the same results as “xi: poisson i.group”, is the same as “xtpqml, fe”. The only difference is that “xtpqml, fe” returns “correct” standard errors for QMLE Poisson. You can also calculate standard errors with the other commands, using clustered standard errors, or bootstrapped standard errors.
Conditional fixed effect negative binomial model is also possible to be consistent for some specifications. But Poisson fixed effect is more popular since it is QMLE.
8.4 Negative Binomial model
The Poisson model has a restrictive property that the conditional variance equals the conditional mean. Often we’ll see “overdispersion”, that is, the variance exceeds the mean, in real data. The negative binomial model is a generalization of poisson model which allows for overdispersion by introducing an unobserved heterogeneity term. In a negative binomial model,
\[ {\rm E} [Y]=\mu \tau = e^{X' \beta} \cdot \tau \]
This extra \(\tau\) term follows a \(Gamma(\theta, \theta)\) distribution, with \(E(\tau)=1\) and \(Var (\tau)=1/\theta\).
Conditional on \(X\) and \(\tau\), the distribution of \(Y\) is still poisson: \[ {\rm Pr} [Y=y | X, \tau]=\frac{e^{-\mu \tau} (\mu \tau)^y}{y! } \]
Conditional on \(X\) only, the distribution of \(Y\) is negative binomial: \[ {\rm Pr} [Y=y | X]=\frac{\theta^{\theta} \mu^y \Gamma(\theta + y)}{ \Gamma(y+1) \Gamma(\theta) (\mu + \theta)^{\theta+y}} \]
This distribution still has conditional mean of \(\mu\), but variance is \(\mu (1+(1/\theta) \mu)\).
When \(\alpha=1/\theta\) approaches zero, the negative binomial model converges to poisson model.
8.5 Models for truncated counts
In some situations, all we observe are non-zero counts, because of the way data was collected. For example, we only observe people’s visits to a park, only if they visit. To model how many times they visit, based on this kind of data set, we need to use zero-truncated count models.
\[ {\rm Pr} [Y=y | y>0, X]=\frac{{\rm Pr} (y | X) }{{\rm Pr} (y>0 | X)}= \frac{{\rm Pr} (y | X)}{1- \exp (- \mu)} \]
Thus we are modeling the counts given that we have a positive outcome. Both zero-truncated poisson (“ztp” in Stata) and zero-truncated negative binomial (“ztnb” in Stata) are estimated by Maximum-likelihood method.
We need to pay extra attention to the “over-dispersion” problem when dealing with zero-truncated data. When data is not truncated, coefficients estimated by poisson model is consistent, even if data is “overdispersed”. However, if we have zero-truncated sample, then “over-dispersion” results in inconsistent coefficient estimation (Grogger and Carson, 1991). Therefore, “over-dispersion” needs to be tested before using truncated poisson model or truncated negative binomial model.
8.6 The hurdle regression model
Although negative binomial model relaxes the assumption of equal mean and variance, some researchers prefer modeling the process of generating zeros differently from the process of generating other values. Suppose zero counts are generated by a binary process:
\[ {\rm Pr} [y=0 | X]=\frac{\exp(X \gamma)}{1 + \exp(X \gamma)} = \pi \]
Positive counts are generated by a truncated count process.
In this model, zero is a “hurdle” to get past before reaching positive counts. The hurdle model is estimated by two separate equations. It is easy to estimate. However, the marginal effect is tricky to calculate, since and independent variable, if appearing in both equations, will have effect through both equations.
\[ {\rm E} (y | X) = [\pi \times 0 ] + {(1-\pi) \times {\rm E} (y | y>0, X) } = (1- \pi) \times {\rm E} (y | y>0 , X) \]
In stata, there are commands such as “hplogit”, which is a hurdle model with first equation logit, and second equation truncated poisson. It is the same as two separate estimations; namely “logit” first, then “ztp” on positive counts.
The difference between hurdle model and a Heckman selection model is that in a Heckman model, two equations need to be jointly estimated. There must be exclusion condition to make the selection equation identifiable. That is, the second stage equation need to take account of the selection bias; therefore, some instrument for selection is needed for the first stage equation. On the other hand, a hurdle model assumes that the two processes are separate; there is no selection.
8.7 Zero-inflated count models
A zero-inflated count model also models two different processes. One is a binary process: generating zero or non-zeros. The second one is a count model. Note there is a different between zero-inflated and hurdle model: In hurdle model, the second process is a truncated count model (positive counts). In a zero-inflated model, a zero can come from either the binary process, or from the count process.
A second difference is that the estimation of zero-inflated model is a joint estimation of the two processes.
\[ y_i \sim \left\{ \begin{array}{c} 0 \ \mbox{with probability} \ \psi \\ g(y_i | X_i) \mbox{with probability} \ 1- \psi \end{array} \right. \]
Then
\[ P(Y_i=y_i | X_i, Z_i) = \left\{ \begin{array}{c} \psi(\gamma' Z_i) + (1- \psi(\gamma' Z_i) ) g(0 | X_i) \ \mbox{if} \ y_i=0 \\ (1- \psi(\gamma' Z_i) ) g(y_i | X_i) \ \mbox{if} \ y_i>0 \end{array} \right. \]
This says that every non-zero observation follows a count process \(g(y_i|X_i)\). Every zero observation has two possible sources: from a binary process (with probability \(\psi(\gamma' Z_i)\)) and from a count process (with probability \(1- \psi(\gamma' Z_i)\) into the count process and then with probability of \(g(0|X_i)\) to be a zero). The count process can be a Poisson process (“zip” in Stata) or a Negative Binomial process (“zinb” in Stata).
8.8 An example of extra-zero count data model comparison
http://hbs-rcs.github.io/blog/2014/09/17/poisson-models/
8.9 How to interpret coefficients and calculate marginal effects in Discrete Choice Models
8.9.1 Binary Response Models
In a linear model \[ y_i= X_i' \beta + \epsilon, \] \[E(y_i) = X_i' \beta \]
When \(y_i\) is binary (meaning it takes two values, 1 or 0), We can also do an OLS regression on it, but a more popular way is to use a non-linear model. The most popular choices for modeling binary response are logit model and probit model. Both of them use the same idea: use a link function to map the binary variable into a continuous variable which is a linear function of the predictors.
Suppose \(p_t\) is the probability that \(y_i=1\). Then \[p_i=E(y_i) = g^{-1}(X_i' \beta), \] where \(g^{-1}()\) is a function that maps from the linear predictor to \(y_i\), which is often called index function or inverse link function.
This says that the expectation of \(y_i\) (equivalently, \(p_i\)) is not a linear function of \(x_i\)’s, but by using the link function, we are still able to model a transformed \(p_i\) as a linear function of \(x_i\)’s: \[g(p_i) = X_i' \beta. \]
\(g^{-1}()\) can be a normal CDF then we have a probit model: \[ g^{-1}(z) = \Phi (z) = \int_{- \infty}^z \frac{1}{\sqrt{2 \pi}} {\rm exp} [-(t^2/2)] dt, \]
or it can be a logit:
\[ g^{-1}(z) = \Lambda (z) = e^z/(1+e^z). \]
8.9.2 Marginal Effects
Often times we are interested in the marginal effects of the predictors. In a linear model, it’s easy: the coefficient on \(x_1\) means when \(x_1\) increase by one unit, how much will \(y\) increase.
In a binary response model, \[\frac{\partial p}{\partial x} = \frac{\partial g^{-1}(X_i' \beta)}{\partial x}\]
In the case of probit:
\[\frac{\partial p}{\partial x} = \frac{\partial \Phi (X_i' \beta)}{\partial x} = \phi(X_i' \beta) \beta, \]
where \(\phi()\) is the density function of normal distribution.
In the case of logit:
\[\frac{\partial p}{\partial x} = \frac{\partial \Lambda (X_i' \beta)}{\partial x} = \gamma(X_i' \beta) \beta, \] where \(\gamma()\) is the density function for logistic distribution: \[\gamma(X_i' \beta) = \Lambda(X_i' \beta) (1- \Lambda(X_i' \beta)) = p(1-p).\]
Therefore for logit, it’s easier to calculate the marginal effect from coefficient estimate \(\beta\), all you have to do is to plug in sample proportion mean \(\bar p\) and coefficient estimate \(\hat \beta\):
\[ \hat{ \frac{\partial p}{\partial x}} = \bar p (1- \bar p) \hat \beta \]
The logit model can also be formulated as \[ \log (\frac{p}{1-p})=\bf X_i' \beta \] which says the logarithm of the odds (the ratio of the two probabilities) is equal to \(\bf X_i' \beta\). Therefore, \[ p =\frac{\exp( X_i' \beta)}{1+\exp({\bf X}_t \beta)}=\Lambda( X_i' \beta) \]
Often times we have discrete valued predictors, such as gender, or other dummy variables. In that case, it makes more sense to ask what’s the effect of gender being 1 vs. 0 (female vs. male). What this question is can be formulated as:
\[ E(p | x_1=1) - E(p | x_1=0). \]
In empirical analysis, it’s often calculated by calculating the predicted probability by setting \(x_1\) to 1 and by setting \(x_1\) to 0. Then calculate the difference.
8.9.3 Count Data Models
In a count data model such as Poisson regression model or Negative Binomial model, we are modeling log of the expected counts:
\[ \log ({\rm E}( y) )=\bf X_i' \beta \] Therefore the marginal effect of \(x_1\), for example, on \({\rm E} (y)\) is
\[ { \frac{\partial \log {\rm E} (y) }{\partial x}} = \hat \beta , \]
which means \(\beta\) is the marginal effect of \(x\) on \(\log\) of the expected counts. Suppose the expected counts at \(x\) is \(\mu_{x}\), then
\[ { \frac{(\partial \mu_x) / \mu_x }{\partial x}} = \hat \beta , \]
which says that \(\beta\) represents the marginal effect of percentage change in \(\mu_x\) with respect to \(x\).
Another way to formulate this is to use the Incidence Rate Ratio Interpretation. Suppose the expected counts at \(x\) is \(\mu_{x}\), and expected counts at \(x+1\) is \(\mu_{x+1}\).
\[ \log (\mu_x) = x' \beta. \]
Therefore,
\[ \mu_{x} = \exp (x' \beta), \]
and
\[ \frac{\mu_{x+1}}{\mu_{x}} = \exp ( \beta). \]
This says that the exponentiated \(\beta\) is the incidence rate ratio of the expected counts, if \(x\) increases by \(1\).
To calculate the marginal effect of \(x\) on \(y\) (or \({\rm E} (y)\)), we need to calculate
\[\frac{ \partial \mu_x} {\partial x} = \hat \beta \mu_x = \hat \beta \exp(X_i' \hat \beta) , \] if we evaluate \(\mu_x\) at predicted value.
8.9.4 How to calculate marginal effects in Stata
Stata’s margins command is a powerful command to get the predicted value or marginal effects after an estimation command. Check Stata’s manual for more details.