7 Discrete and Limited Dependent Variables

7.1 Binary Response Models

7.1.1 Probit and Logit

Let $P_t$ denote the probability that $y_t=1$ conditional on the information set $\Omega_t$, which consists of exogenous and predetermined variables. A binary response model is a model of this conditional probability. Since $y_t$ takes only the values 0 and 1, its conditional expectation is $1 \cdot P_t + 0 \cdot (1-P_t)$, so $P_t$ is also the expectation of $y_t$ conditional on $\Omega_t$: \[ P_t \equiv \mbox{Pr}(y_t=1 | \Omega_t)=\mbox{E}(y_t | \Omega_t). \]

Let ${\bf X}_t$ denote a row vector of regressors that belong to $\Omega_t$. A linear probability model would set $P_t = {\bf X}_t \beta$ directly, but nothing keeps the fitted values inside $[0,1]$, since the linear index ${\bf X}_t \beta$ is unbounded. Logit and probit avoid this problem by modeling $P_t = F({\bf X}_t \beta)$, where $F$ maps the unbounded index into $[0,1]$.

Formally, we ensure that $0 \le P_t \le 1$ by specifying that \[ P_t \equiv \mbox{Pr}(y_t=1 | \Omega_t)=F({\bf X}_t \beta). \] Here $F(x)$ is a transformation function with the properties of a CDF: it is nondecreasing, with $F(-\infty)=0$ and $F(\infty)=1$.

Two popular choices of $F(x)$ are the standard normal CDF $\Phi(x)$, which gives the probit model, and the logistic function $\Lambda(x)$, which gives the logit model.

The less familiar logistic function is \[ \Lambda(x)=\frac{e^x}{1+e^x} \]

The logit model is most easily derived by assuming that \[ \log \left( \frac{P_t}{1-P_t} \right)={\bf X}_t \beta, \] which says that the logarithm of the odds (the ratio of the two probabilities) is equal to ${\bf X}_t \beta$. Solving for $P_t$ gives \[ P_t =\frac{\exp({\bf X}_t \beta)}{1+\exp({\bf X}_t \beta)}=\Lambda({\bf X}_t \beta). \]
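As a small numerical illustration of the two transformation functions, the sketch below evaluates $\Lambda$ and $\Phi$ at the same index value and checks that the log-odds recover the index in the logit case. The function names and the index value `0.7` are illustrative assumptions, not part of the original notes.

```python
import math

def logistic_cdf(x):
    """Lambda(x) = e^x / (1 + e^x), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def normal_cdf(x):
    """Standard normal CDF Phi(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_odds(p):
    """Logarithm of the odds, log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

index = 0.7                          # hypothetical value of X_t beta
p_logit = logistic_cdf(index)        # logit probability that y_t = 1
p_probit = normal_cdf(index)         # probit probability that y_t = 1
recovered_index = log_odds(p_logit)  # inverts Lambda, so it returns the index
```

Both probabilities lie strictly between 0 and 1 for any finite index, which is the property the linear probability model lacks.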

7.1.2 MLE for binary data

The contribution of observation $t$ to the likelihood is the probability of the outcome actually observed: the probability that $y_t=1$ if $y_t=1$, or the probability that $y_t=0$ if $y_t=0$. The logarithm of the appropriate probability is then the contribution to the loglikelihood made by observation $t$. Therefore, if ${\bf y}$ is an $n$-vector with typical element $y_t$, the loglikelihood function for ${\bf y}$ can be written as \[ \ell({\bf y}, \beta)=\sum_{t=1}^n \left( y_t \log F({\bf X}_t \beta)+(1-y_t) \log(1- F({\bf X}_t \beta)) \right). \]
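The loglikelihood above translates almost line by line into code. The following is a minimal sketch, assuming the regressors are stored as a list of rows and that `F` is passed in as a function; the data and the names `binary_loglik`, `X`, and `y` are made up for illustration.

```python
import math

def logistic_cdf(x):
    """Lambda(x) = e^x / (1 + e^x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def binary_loglik(beta, X, y, F):
    """Sum over t of y_t*log F(X_t beta) + (1 - y_t)*log(1 - F(X_t beta))."""
    total = 0.0
    for x_t, y_t in zip(X, y):
        index = sum(b * x for b, x in zip(beta, x_t))  # X_t beta
        p = F(index)
        total += y_t * math.log(p) + (1 - y_t) * math.log(1.0 - p)
    return total

# Hypothetical data: a constant and one regressor.
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 1, 0, 1]

# At beta = 0 every probability is 1/2, so the loglikelihood is n*log(1/2).
ll_zero = binary_loglik([0.0, 0.0], X, y, logistic_cdf)
```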

For the logit and probit models, this function is globally concave with respect to $\beta$. This implies that the first-order conditions, or likelihood equations, uniquely define the ML estimator $\hat \beta$. These likelihood equations can be written as \[ \sum_{t=1}^n \frac{(y_t-F({\bf X}_t \beta))f({\bf X}_t \beta)x_{ti}}{F({\bf X}_t \beta)(1- F({\bf X}_t \beta))}=0, \ i=1, \dots, k, \] where $f=F'$ is the density (e.g. $\phi$ for probit). For logit, $f=F(1-F)$, so the factor $f$ in the numerator cancels the denominator and the likelihood equations reduce to $\sum_t (y_t-F({\bf X}_t \beta))x_{ti}=0$.

Newton’s Method can be used to find $\hat \beta$.
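For the logit model, Newton's method is easy to sketch because the score is $\sum_t (y_t-\Lambda({\bf X}_t\beta))x_{ti}$ and the negative Hessian is $\sum_t \Lambda(1-\Lambda){\bf X}_t^\top {\bf X}_t$. The code below is one possible implementation under those formulas, using only the standard library; the small linear solver, the starting value $\beta=0$, the tolerance, and the data are all illustrative assumptions.

```python
import math

def logistic_cdf(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def logit_score_and_info(beta, X, y):
    """Return the score vector and the negative Hessian for the logit model."""
    k = len(beta)
    score = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for x_t, y_t in zip(X, y):
        p = logistic_cdf(sum(b * x for b, x in zip(beta, x_t)))
        w = p * (1.0 - p)
        for i in range(k):
            score[i] += (y_t - p) * x_t[i]
            for j in range(k):
                info[i][j] += w * x_t[i] * x_t[j]
    return score, info

def logit_newton(X, y, tol=1e-10, max_iter=50):
    """Iterate beta <- beta + info^{-1} score until the step is negligible."""
    beta = [0.0] * len(X[0])
    for _ in range(max_iter):
        score, info = logit_score_and_info(beta, X, y)
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Hypothetical data (constant plus one regressor), not perfectly separated.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]
beta_hat = logit_newton(X, y)
score_at_hat, _ = logit_score_and_info(beta_hat, X, y)
```

At `beta_hat` the score should be numerically zero, which is exactly the statement of the likelihood equations. If the data were perfectly separated, no finite maximizer would exist and the iteration would not settle down.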

7.2 Models for More than Two Discrete Responses

7.2.1 The Ordered Probit

The ordered probit model can be derived easily from a latent variable model, \[ y_t^0={\bf X}_t \beta+u_t, \ u_t \sim NID(0,1), \] in which $y_t^0$ is an unobserved (latent) variable.

Suppose the observed variable $y_t$ takes three values, determined by where $y_t^0$ falls relative to two thresholds $\gamma_1 < \gamma_2$:

\[ \begin{cases} y_t = 0 & \mbox{if} \ y_t^0 < \gamma_1; \\ y_t = 1 & \mbox{if} \ \gamma_1 \leq y_t^0 < \gamma_2; \\ y_t = 2 & \mbox{if} \ y_t^0 \geq \gamma_2. \end{cases} \]

Therefore, \[ \begin{aligned} \mbox{Pr}(y_t=0) &= \mbox{Pr}(y_t^0 < \gamma_1)=\mbox{Pr}({\bf X}_t \beta + u_t < \gamma_1) \\ &= \mbox{Pr}(u_t < \gamma_1 - {\bf X}_t \beta)=\Phi ( \gamma_1 - {\bf X}_t \beta) \end{aligned} \]

Similarly, \[ \mbox{Pr}(y_t=2) = 1 - \Phi ( \gamma_2 - {\bf X}_t \beta) \] \[ \mbox{Pr}(y_t=1) = \Phi ( \gamma_2- {\bf X}_t \beta ) - \Phi ( \gamma_1- {\bf X}_t \beta ) \]

These probabilities depend solely on the value of the index function and on the two threshold parameters.
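The three probabilities can be computed directly from the index and the thresholds. The sketch below does so and checks that they sum to one; the function name and the numerical values of the index and thresholds are assumptions made for illustration.

```python
import math

def normal_cdf(x):
    """Standard normal CDF Phi(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_probs(index, gamma1, gamma2):
    """Return (Pr(y=0), Pr(y=1), Pr(y=2)) for a given value of X_t beta."""
    p0 = normal_cdf(gamma1 - index)
    p1 = normal_cdf(gamma2 - index) - normal_cdf(gamma1 - index)
    p2 = 1.0 - normal_cdf(gamma2 - index)
    return p0, p1, p2

# Hypothetical index and thresholds, with gamma1 < gamma2.
probs = ordered_probit_probs(index=0.3, gamma1=-0.5, gamma2=1.0)
total = sum(probs)
```

Raising the index shifts probability mass from the lowest category toward the highest, while the thresholds fix where the cuts fall.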

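The loglikelihood function is \[ \ell (\beta, \gamma_1, \gamma_2) = \sum_{y_t=0} \log(\Phi (\gamma_1 - {\bf X}_t \beta)) + \sum_{y_t=2} \log(\Phi ( {\bf X}_t \beta - \gamma_2)) + \sum_{y_t=1} \log(\Phi ( \gamma_2 - {\bf X}_t \beta ) -\Phi ( \gamma_1 - {\bf X}_t \beta )), \] where the second sum uses the symmetry of the normal distribution, $1 - \Phi ( \gamma_2 - {\bf X}_t \beta) = \Phi ( {\bf X}_t \beta - \gamma_2)$.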
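A direct transcription of this loglikelihood is sketched below, with one branch per observed category. The data, the parameter values, and the function name are hypothetical. The example uses a single regressor and no constant term, on the assumption that a constant could not be separated from the two thresholds.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_loglik(beta, gamma1, gamma2, X, y):
    """Sum the log of the probability of each observed category."""
    total = 0.0
    for x_t, y_t in zip(X, y):
        index = sum(b * x for b, x in zip(beta, x_t))  # X_t beta
        if y_t == 0:
            p = normal_cdf(gamma1 - index)
        elif y_t == 2:
            p = normal_cdf(index - gamma2)  # equals 1 - Phi(gamma2 - index)
        else:
            p = normal_cdf(gamma2 - index) - normal_cdf(gamma1 - index)
        total += math.log(p)
    return total

# Hypothetical data: one regressor, three ordered outcomes.
X = [[-1.0], [0.0], [0.5], [2.0]]
y = [0, 1, 1, 2]
ll = ordered_probit_loglik([1.0], -0.5, 1.0, X, y)
```

In practice this function would be passed to a numerical optimizer over $(\beta, \gamma_1, \gamma_2)$ subject to $\gamma_1 < \gamma_2$; that step is omitted here.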

7.2.2 The Multinomial Logit

When responses are unordered, two popular choices are multinomial logit and conditional logit. They are easy to conflate because they share the same softmax form, but they differ in what varies across the $J+1$ alternatives $l=0,\dots,J$: in multinomial logit the regressors are individual-specific and only the coefficients vary by alternative; in conditional logit it’s the reverse.

Multinomial logit. Here the regressor vector $\bf X_t$ (individual-specific characteristics, e.g. age, income) is the same for every alternative, and it is the coefficient vector that is alternative-specific:

\[ \mbox{Pr}(y_t=l)=\frac{\exp({\bf X}_t \beta^l)}{\sum_{j=0}^J\exp({\bf X}_t \beta^j)} \quad \mbox{for} \ l=0, \dots, J, \]

Here $\bf X_t$ is a row vector of length $k$ of individual-specific covariates, and $\beta^l$ is a $k$-vector of alternative-specific parameters. For identification one category is taken as the base with $\beta^0=0$ (so its numerator is $1$).
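The need for a base category can be seen numerically: adding the same vector to every $\beta^l$ multiplies the numerator and the denominator by the same factor and leaves all probabilities unchanged. The sketch below illustrates this; the covariates, coefficients, and function name are made up for illustration.

```python
import math

def multinomial_logit_probs(x_t, betas):
    """betas[l] is the coefficient vector of alternative l; betas[0] is the base."""
    indices = [sum(b * x for b, x in zip(beta_l, x_t)) for beta_l in betas]
    m = max(indices)                       # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in indices]
    total = sum(weights)
    return [w / total for w in weights]

x_t = [1.0, 0.5]                                 # hypothetical individual-specific covariates
betas = [[0.0, 0.0], [0.2, -1.0], [-0.4, 0.8]]   # beta^0 = 0 is the normalization
probs = multinomial_logit_probs(x_t, betas)

# Shifting every beta^l by the same vector gives the same probabilities,
# so only differences from the base category are identified.
shifted = [[b + 3.0 for b in beta_l] for beta_l in betas]
probs_shifted = multinomial_logit_probs(x_t, shifted)
```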

Conditional logit. Here it is the regressors that vary by alternative (e.g. the price or characteristics of choice $j$ itself), while the coefficient vector is common across alternatives:

\[ \mbox{Pr}(y_t=l)=\frac{\exp({\bf W}_{tl} \beta)}{\sum_{j=0}^J\exp({\bf W}_{tj} \beta)} \quad \mbox{for} \ l=0, \dots, J, \]

Here ${\bf W}_{tj}$ is a row vector of length $k$ of alternative-$j$-specific variables, and $\beta$ is a single $k$-vector of parameters shared across all alternatives $j=0,\dots,J$.
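The conditional logit probabilities can be sketched the same way, now with one regressor row per alternative and a single coefficient vector. The example also shows why only alternative-varying regressors matter here: adding the same amount to a regressor for every alternative leaves the probabilities unchanged. The variable meanings (a price and an indicator), the values, and the function name are illustrative assumptions.

```python
import math

def conditional_logit_probs(W_t, beta):
    """W_t[j] is the regressor row of alternative j; beta is shared by all j."""
    indices = [sum(b * w for b, w in zip(beta, w_tj)) for w_tj in W_t]
    m = max(indices)                       # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in indices]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical data for one individual and three alternatives:
# first column a price, second column an indicator for some feature.
W_t = [[2.0, 1.0], [3.5, 0.0], [1.0, 1.0]]
beta = [-0.5, 0.3]
probs = conditional_logit_probs(W_t, beta)

# Adding 10 to the price of every alternative changes nothing, because
# the common term cancels between the numerator and the denominator.
W_shifted = [[w[0] + 10.0, w[1]] for w in W_t]
probs_shifted = conditional_logit_probs(W_shifted, beta)
```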

Some treatments write both choice-varying regressors ${\bf W}_{tl}$ and choice-specific coefficients $\beta^l$ in the same formula. That general form nests the two models above, but it blurs the distinction between them: the normalization $\beta^0=0$ is needed only for coefficients on regressors that do not vary across alternatives, because for those regressors only the differences $\beta^l-\beta^0$ affect $\Pr(y_t=l)$.
