12 Missing data

Consider a data set like the following, where ? marks a missing value:

y	x	z	w	q
1	4	2	?	?
4	?	1	2	?
2	?	?	?	?
?	2	1	?	?
?	?	?	?	?

12.1 Different cases under different assumptions

Missing Completely at Random (MCAR): the occurrence of missing data is not related to the missing value, the values of any other variables, or the pattern of missingness in other variables. This is usually too good to be true.
Missing at Random (MAR): the occurrence of missing values for a variable is random once we condition on the value or missingness of observable variables. In other words, the missingness can be modeled with data we observe.
Missing Not at Random (MNAR): the occurrence of missing values is systematically related to the unobserved value itself, even after conditioning on observed data. This is the hardest case: there is no model-free fix, but it is not hopeless. It can be addressed with selection models, pattern-mixture models, or sensitivity analysis, all of which rely on untestable assumptions about the missingness mechanism.

The MAR case is the one we are interested in.

If you simply drop the incomplete observations: under MCAR you lose efficiency, and under MAR you suffer both bias and a loss of efficiency.

If you model the missingness: under MCAR you gain efficiency, and under MAR you correct the bias and gain efficiency.
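
The claims about dropping data can be checked with a small simulation. This is a minimal sketch under assumed settings: the data-generating process, the 40% MCAR rate, and the logistic MAR rule are all made up for illustration.

```python
import numpy as np

def complete_case_means(n=200_000, seed=0):
    """Return (full-data mean, MCAR complete-case mean, MAR complete-case mean) of y."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                    # always observed
    y = 1.0 + 2.0 * z + rng.normal(size=n)    # y depends on z; its true mean is 1

    # MCAR: every y has the same 40% chance of being missing.
    miss_mcar = rng.random(n) < 0.4
    # MAR: missingness depends only on the observed z (more missing when z is large).
    p_mar = 1.0 / (1.0 + np.exp(-2.0 * z))
    miss_mar = rng.random(n) < p_mar

    return y.mean(), y[~miss_mcar].mean(), y[~miss_mar].mean()

true_mean, mcar_mean, mar_mean = complete_case_means()
print(true_mean, mcar_mean, mar_mean)
```

Under MCAR the complete-case mean stays close to the full-data mean (only precision is lost). Under this MAR rule the complete-case mean is pulled well below it, because large-z (and hence large-y) observations are dropped more often.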

12.2 Old methods

Listwise deletion
Mean imputation

Problems with mean imputation: it understates the variability of the imputed variable, and it does not recover associations between variables.

So standard errors are in general underestimated.
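
Both problems of mean imputation show up in a quick simulation. A minimal sketch, assuming simulated data with corr(x, z) = 0.8 and half of x missing completely at random (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)
x = 0.8 * z + 0.6 * rng.normal(size=n)     # var(x) = 1, corr(x, z) = 0.8

missing = rng.random(n) < 0.5              # MCAR: about half of x is missing
x_imputed = np.where(missing, x[~missing].mean(), x)

var_true = x.var()
var_imputed = x_imputed.var()              # shrinks: imputed points add no spread
corr_true = np.corrcoef(x, z)[0, 1]
corr_imputed = np.corrcoef(x_imputed, z)[0, 1]   # attenuated toward zero

print(var_true, var_imputed, corr_true, corr_imputed)
```

The imputed variable has roughly half the variance of the original, and its correlation with z is clearly weaker, even though the missingness here is MCAR.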

Regression-based imputation

Use a model such as $x=\alpha_0+\alpha_1 z + \alpha_2 y$. But the imputed values carry estimation error that is not accounted for in the main model (similar to the generated-regressor problem).

Interpolation of panel data

That is, use the observation from the last period, or a linear interpolation between observed periods, for the same unit.
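
Both variants are easy to sketch for the time series of one unit. The helper names `locf` and `linear_interpolate` are hypothetical, and the sketch assumes NaN marks a missing period and that the series has at least one observed value.

```python
import numpy as np

def locf(values):
    """Carry the last observed value forward; leading NaNs stay NaN."""
    out = np.asarray(values, dtype=float).copy()
    for t in range(1, len(out)):
        if np.isnan(out[t]):
            out[t] = out[t - 1]
    return out

def linear_interpolate(values):
    """Linearly interpolate interior NaNs between observed neighbours."""
    values = np.asarray(values, dtype=float)
    t = np.arange(len(values))
    observed = ~np.isnan(values)
    out = values.copy()
    # Only fill gaps that have an observed value on both sides.
    interior = (t > t[observed].min()) & (t < t[observed].max()) & ~observed
    out[interior] = np.interp(t[interior], t[observed], values[observed])
    return out

series = [1.0, np.nan, np.nan, 4.0, np.nan]
print(locf(series))                # [1. 1. 1. 4. 4.]
print(linear_interpolate(series))  # [ 1.  2.  3.  4. nan]
```

In a panel, these functions would be applied separately to each unit's series.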

12.3 Modern methods

Account for uncertainty in the imputed variable.
Use a model to predict the missing observations.
Instead of picking one predicted value, pick many.
The uncertainty is represented by the VCV (variance-covariance) matrix of the coefficients used to predict the missing values.

12.3.1 Basic ideas

Imputation model: $x=\alpha_0+\alpha_1 z + \alpha_2 y$

Main model: $y = \beta_0 + \beta_1 x + \beta_2 z$

Draw $M$ values of $\alpha$ from its asymptotic distribution, the multivariate normal $N(\hat \alpha, \hat \Sigma)$, using the estimate $\hat \alpha$ and its VCV $\hat \Sigma$ as the mean and VCV of the distribution.
Predict the missing values with each draw of $\alpha$, creating $M$ data sets.
Estimate the main model on each of the imputed data sets to get $M$ new estimates $\tilde \beta_m$, and combine them by averaging, $\tilde \beta = \frac{1}{M}\sum_{m=1}^M \tilde \beta_m$. Then calculate the variance of $\tilde \beta$:

$V_{\beta}=W + (1+1/M) B$

where $W=\frac{1}{M} \sum_{m=1}^M s_m^2$ is the within-imputation variance ($s_m$ is the standard error of $\tilde \beta_m$ in imputed data set $m$) and $B=\frac{1}{M-1} \sum_{m=1}^M (\tilde \beta_m - \tilde \beta)^2$ is the between-imputation variance. The standard error is $\sqrt{V_{\beta}}$. These formulas (Rubin’s rules) are written here for a single scalar coefficient $\beta$; for the full coefficient vector, $W$ is the average of the within-imputation covariance matrices, $B$ is the between-imputation covariance matrix $\frac{1}{M-1}\sum_m (\tilde{\boldsymbol\beta}_m - \tilde{\boldsymbol\beta})(\tilde{\boldsymbol\beta}_m - \tilde{\boldsymbol\beta})'$, and $V_{\boldsymbol\beta}=W+(1+1/M)B$ is the total covariance matrix.
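
The whole procedure, from drawing $\alpha$ to pooling with Rubin's rules, can be sketched for the scalar coefficient $\beta_1$. This is an illustration under assumptions, not a reference implementation: the function names are hypothetical, OLS with homoskedastic errors is assumed for both models, and the residual noise added to each imputed value is an extra step that the list above does not spell out.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, their covariance matrix, and the residual variance."""
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return coef, sigma2 * XtX_inv, sigma2

def pool_rubin(estimates, variances):
    """Pool M scalar estimates and their squared standard errors."""
    estimates = np.asarray(estimates, dtype=float)
    m = len(estimates)
    w = np.mean(variances)                  # within-imputation variance W
    b = estimates.var(ddof=1)               # between-imputation variance B
    return estimates.mean(), np.sqrt(w + (1 + 1 / m) * b)

def multiply_impute(y, x, z, m=20, seed=0):
    """Pooled estimate and standard error of beta_1 when x contains NaNs."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(x)
    ones = np.ones_like(y)
    # Imputation model x = a0 + a1 z + a2 y, fitted on the complete cases.
    A = np.column_stack([ones, z, y])
    alpha_hat, alpha_vcv, s2 = ols(A[~miss], x[~miss])
    betas, variances = [], []
    for _ in range(m):
        # Draw alpha from N(alpha_hat, Sigma_hat).
        alpha = rng.multivariate_normal(alpha_hat, alpha_vcv)
        # Predict the missing x (plus residual noise, an added assumption).
        x_m = x.copy()
        x_m[miss] = A[miss] @ alpha + rng.normal(0.0, np.sqrt(s2), miss.sum())
        # Estimate the main model y = b0 + b1 x + b2 z on this data set.
        coef, vcv, _ = ols(np.column_stack([ones, x_m, z]), y)
        betas.append(coef[1])
        variances.append(vcv[1, 1])
    return pool_rubin(betas, variances)
```

As a check on the pooling step alone: with estimates 1.0, 1.2, 0.8 and squared standard errors of 0.01 each, $W=0.01$, $B=0.04$, and $V_\beta = 0.01 + (4/3)(0.04) \approx 0.0633$, so the pooled standard error is about 0.25, well above the 0.1 reported within each data set.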

12.3.2 MI through Chained Equations (MICE) (by van Buuren)

1. Discard observations in which all variables are missing.
2. Fill in the missing data with random draws from the observed values.
3. Move through the columns and perform single-variable imputation using some method.
4. Replace the original replacements with the fitted replacements. Repeat steps 3 and 4 a large number of times, or until a convergence criterion is met.
5. Do steps 1 to 4 $m$ times to create $m$ imputed data sets.
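
The loop above can be sketched with linear regression as the single-variable method. This is a simplified illustration, not the algorithm of any particular package: the function names are hypothetical, a fixed number of sweeps stands in for a convergence check, the regression coefficients are not redrawn, and every column is assumed to have at least a few observed values.

```python
import numpy as np

def mice_once(data, n_iter=10, rng=None):
    """One imputed copy of `data` (2-D float array, NaN = missing)."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    data = data[~np.isnan(data).all(axis=1)]       # step 1: drop all-missing rows
    miss = np.isnan(data)
    filled = data.copy()
    for j in range(data.shape[1]):                 # step 2: random starting values
        n_miss = miss[:, j].sum()
        if n_miss:
            filled[miss[:, j], j] = rng.choice(data[~miss[:, j], j], size=n_miss)
    for _ in range(n_iter):                        # step 4: repeat the sweep
        for j in range(data.shape[1]):             # step 3: one column at a time
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            others = np.delete(filled, j, axis=1)
            X = np.column_stack([np.ones(len(filled)), others])
            coef, *_ = np.linalg.lstsq(X[obs], filled[obs, j], rcond=None)
            resid = filled[obs, j] - X[obs] @ coef
            sd = resid.std(ddof=X.shape[1])
            # Replace the current fill-ins with prediction plus residual noise.
            filled[~obs, j] = X[~obs] @ coef + rng.normal(0.0, sd, (~obs).sum())
    return filled

def mice(data, m=5, n_iter=10, seed=0):
    """Step 5: repeat the whole procedure m times."""
    rng = np.random.default_rng(seed)
    return [mice_once(data, n_iter, rng) for _ in range(m)]
```

Each of the returned data sets would then be analyzed separately and the results pooled with Rubin's rules.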

There are many ways to implement the single-variable imputation step in MICE.

Regression (linear, logit, multinomial): use the fitted value $\hat w$, or sample from $f(\hat w)$.
The default is Predictive Mean Matching (PMM):
1. Create a predicted value for every case.
2. Pick a small set of donor cases that have the closest predicted values (the number of donors is software-specific; e.g., three or five).
3. Randomly choose one of those donors’ observed values to impute.
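
The three PMM steps can be sketched as follows. The function name `pmm_impute` is hypothetical, and the sketch makes simplifying assumptions: a linear prediction model fitted by least squares, point estimates of the coefficients (actual implementations typically also draw the coefficients), and five donors.

```python
import numpy as np

def pmm_impute(y, X, n_donors=5, seed=0):
    """Impute NaNs in y by predictive mean matching on a linear model."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), X])
    obs = ~np.isnan(y)
    coef, *_ = np.linalg.lstsq(X1[obs], y[obs], rcond=None)
    pred = X1 @ coef                        # 1. predicted value for every case
    donor_pred, donor_y = pred[obs], y[obs]
    out = y.copy()
    for i in np.flatnonzero(~obs):
        # 2. donors: observed cases whose predictions are closest to case i's
        nearest = np.argsort(np.abs(donor_pred - pred[i]))[:n_donors]
        # 3. impute the observed value of one randomly chosen donor
        out[i] = donor_y[rng.choice(nearest)]
    return out
```

Because every imputed value is copied from an observed case, PMM never produces values outside the observed range, which is one reason it is a popular default.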

12.3.3 Bayesian Data Augmentation

MICE is itself a Markov chain method.
Build a missing data model into a Bayesian model, treating missing values as another parameter to estimate by drawing out of its posterior distribution.

$f(\beta, y_{miss}|y_{obs}) \propto f(y_{obs} | \beta, y_{miss}) f(\beta, y_{miss})$

We integrate out the missing values by sampling from the joint posterior distribution, then averaging the distributions of $\beta$ over the space of missing data points.
This is fine as long as the data are MAR and the likelihood of missingness is not related to $\beta$ (ignorability).
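
A toy data-augmentation sampler makes the idea concrete. This is a minimal sketch under strong assumptions that are not in the notes: a linear model $y = X\beta + e$ with known error variance 1, a flat prior on $\beta$, and missing values only in $y$. The function name is hypothetical.

```python
import numpy as np

def data_augmentation(y, X, n_draws=2000, burn_in=200, seed=0):
    """Gibbs draws of beta for y = X beta + e, e ~ N(0, 1), with NaNs in y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(XtX_inv)
    y_full = np.where(miss, np.nanmean(y), y)    # crude starting values
    draws = []
    for it in range(n_draws + burn_in):
        # beta | y_obs, y_miss: normal centred on OLS for the completed data
        beta = XtX_inv @ X.T @ y_full + chol @ rng.normal(size=X.shape[1])
        # y_miss | beta, y_obs: draw the missing outcomes from the model
        y_full[miss] = X[miss] @ beta + rng.normal(size=miss.sum())
        if it >= burn_in:
            draws.append(beta)
    return np.array(draws)
```

Averaging the retained draws of $\beta$ integrates over the sampled missing values. In this simple setting the posterior mean should agree closely with OLS on the observed rows, since the missing outcomes carry no extra information about $\beta$ under MAR.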

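12.3.4 Full information maximum likelihood (FIML)

If there are no missing data, the likelihood function is

$L(\mu, \Sigma)=\prod_{i=1}^{N} f(y_i | \mu, \Sigma)$

where $N$ is the number of observations. If there are missing values, each observation contributes only through the variables observed for it:

$L(\mu, \Sigma)=\prod_{i=1}^{N} f(y_i | \mu_i, \Sigma_i)$

where $y_i$ now holds only the observed values of observation $i$, and $\mu_i$ and $\Sigma_i$ are the corresponding elements of $\mu$ and $\Sigma$.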
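
The second likelihood can be evaluated directly. A minimal sketch, assuming $f$ is the multivariate normal density and NaN marks a missing value; the function name `fiml_loglik` is hypothetical. FIML estimates would be obtained by maximizing this function over $\mu$ and $\Sigma$, which is not shown here.

```python
import numpy as np

def fiml_loglik(data, mu, sigma):
    """Observed-data log-likelihood under a multivariate normal (NaN = missing)."""
    data = np.asarray(data, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    total = 0.0
    for row in data:
        obs = ~np.isnan(row)
        k = obs.sum()
        if k == 0:
            continue                         # a fully missing row contributes nothing
        dev = row[obs] - mu[obs]             # observed y_i minus mu_i
        sigma_i = sigma[np.ix_(obs, obs)]    # Sigma_i: rows/columns of observed variables
        _, logdet = np.linalg.slogdet(sigma_i)
        quad = dev @ np.linalg.solve(sigma_i, dev)
        total += -0.5 * (k * np.log(2 * np.pi) + logdet + quad)
    return total

data = [[0.0, 0.0], [0.0, np.nan], [np.nan, np.nan]]
print(fiml_loglik(data, mu=[0.0, 0.0], sigma=np.eye(2)))  # -1.5 * log(2 * pi)
```

In the example, the complete row contributes a bivariate density, the partially observed row contributes a univariate density, and the empty row drops out.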

12.3.5 mi in Stata

log using mi_10_model2.log, replace

clear
set more off

* set matsize is obsolete since Stata 16 (matrix sizes are now dynamic)
* and can be safely omitted on modern Stata; kept here only because the
* original script targeted an older Stata version.
set matsize 4000

use "CROSSED_ContestUserActivity FEB 2014 temp.dta", clear
*sample 10

* Declare the data as mi data, stored in flong style.
mi set flong
*local dummies "x1 x2"
local continuous "CulturalDist_KS5D_ctr tight_ctr targ_tight_ctr user_country_opennessValue_ctr targetcntropennes2010_ctr"

* Register the variables to impute, then create 10 imputed data sets by
* chained equations, using a linear regression for each variable.
mi register imputed `continuous'
mi impute chained (regress) `continuous', add(10) rseed(1)

* Fit the main model on each imputed data set and pool the results.
mi estimate, cmdok: heckprob has_won c.CulturalDist_KS5D_ctr##c.tight_ctr ///
    targ_tight_ctr gender_code expert_cat2 expert_cat3 expert_cat4 submissions_count, ///
    select(has_submitted = c.CulturalDist_KS5D_ctr##c.tight_ctr log_cash_000 ///
        gender_code expert_cat2 expert_cat3 expert_cat4 submissions_count ///
        ave_numConcurrentContest) ///
    vce(cluster user_id) difficult nonrtol

This is sample code. In Stata, you first run “mi set”, then “mi register” to register the variables you need to impute. Then use “mi impute chained” if you have multiple variables to impute. By default, each variable's prediction model uses the other variables given in the imputation command (the other imputed variables, plus any complete predictors listed after “=”) unless you specify otherwise. Finally, run “mi estimate”.
