12  Missing data

Consider a data set like the following, where ? marks a missing value:

y x z w q

1 4 2 ? ?

4 ? 1 2 ?

2 ? ?

? 2 1

? ? ?

12.1 Different cases under different assumptions

  1. Missing Completely at Random (MCAR): the occurrence of missing data is unrelated to the missing value itself, to the values of any other variables, and to the pattern of missingness in other variables. This situation is usually too good to be true.

  2. Missing at Random (MAR): the occurrence of missing values for a variable is random conditional on the values or missingness of observed variables. In other words, the missingness can be modeled.

  3. Missing Not at Random (MNAR): the occurrence of missing values is systematically related to the missing values themselves or to unknown or unmeasured factors. Without further assumptions, this situation is hopeless.

The MAR case is the one we are interested in.

If we simply drop the incomplete observations: under MCAR, you lose efficiency; under MAR, you suffer both bias and a loss of efficiency.

If we model the missingness: under MCAR, you gain efficiency; under MAR, you correct the bias and gain efficiency.
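A small simulation can make the contrast concrete. The sketch below is illustrative only: the variable names, sample size, and missingness probabilities are assumptions, not part of these notes. It generates x with a true mean of 1, deletes values under an MCAR rule and under a MAR rule that depends on the observed z, and compares the complete-case means.

```stata
* Illustrative simulation; all names and parameter values are assumptions
clear
set seed 12345
set obs 5000
generate z = rnormal()
* The true mean of x is 1
generate x = 1 + z + rnormal()

* MCAR: every x has the same 30% chance of being missing
generate x_mcar = x if runiform() > 0.3

* MAR: x is more likely to be missing when the observed z is large
generate x_mar = x if runiform() > invlogit(2*z)

* Complete-case means: compare each with the full-data mean of x
summarize x x_mcar x_mar
```

Under the MCAR rule the complete-case mean should stay close to 1, only with fewer observations. Under the MAR rule it should fall below 1, because cases with large z (and hence large x) are dropped more often.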

12.2 Old methods

  1. Listwise deletion

  2. Mean imputation.

Problems with mean imputation:

  • It understates the variability of the imputed variable.

  • It does not recover associations between variables.

So standard errors are in general underestimated.
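Both problems can be seen in a quick simulated check. This is a minimal sketch under assumed names and parameter values (x, z, a 40% MCAR missingness rate), not a recommended procedure.

```stata
* Illustrative simulation; all names and parameter values are assumptions
clear
set seed 2024
set obs 1000
generate z = rnormal()
generate x = z + rnormal()
* Delete about 40% of x completely at random
generate x_obs = x if runiform() > 0.4

* Mean imputation: replace each missing value with the observed mean
summarize x_obs, meanonly
generate x_mean = cond(missing(x_obs), r(mean), x_obs)

* The standard deviation of x_mean is smaller than that of x
summarize x x_mean

* The correlation with z is attenuated after mean imputation
correlate x z
correlate x_mean z
```

Because every imputed case sits exactly at the mean, it adds nothing to the sum of squared deviations and carries no information about z.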

  3. Regression-based imputation

Use a model such as \(x=\alpha_0+\alpha_1 z + \alpha_2 y\) and fill in the missing values with its predictions. But the imputed values carry estimation error that is not accounted for in the later analysis (similar to the generated-regressor problem).

  4. Interpolation of panel data

That is, use the observation from the previous period, or a linear interpolation between periods, for the same unit.
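Both variants take only a few lines in Stata. The tiny panel below (variables id, t, x) is made up for illustration.

```stata
* Hypothetical panel: two units observed over three periods
clear
input id t x
1 1 10
1 2 .
1 3 14
2 1 5
2 2 .
2 3 .
end

* Last observation carried forward within each unit
bysort id (t): generate x_locf = x
by id: replace x_locf = x_locf[_n-1] if missing(x_locf)

* Linear interpolation over t within each unit
by id: ipolate x t, generate(x_lin)

list, sepby(id)
```

For unit 1, carrying forward gives 10 in period 2 while linear interpolation gives 12. For unit 2, carrying forward fills both gaps with 5, but ipolate leaves them missing because it does not extrapolate beyond the last observed point unless the epolate option is added.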

12.3 Modern methods

  • Account for the uncertainty in the imputed variable.

  • Use a model to predict the missing observations.

  • Instead of picking one imputed value, draw many.

  • The uncertainty is represented by the VCV matrix of the coefficients used to predict the missing values.

12.3.1 Basic ideas

Imputation model: \(x=\alpha_0+\alpha_1 z + \alpha_2 y\)

Main model: \(y = \beta_0 + \beta_1 x + \beta_2 z\)

  1. Draw \(m\) values of \(\alpha\) from its asymptotic distribution, the multivariate normal \(N(\hat \alpha, \hat \Sigma)\), where the estimate \(\hat \alpha\) and its estimated VCV \(\hat \Sigma\) serve as the mean and VCV of the distribution.

  2. Use each draw to predict the missing values, creating \(m\) completed data sets.

  3. Estimate the main model on each imputed data set to get \(m\) estimates \(\tilde \beta_j\), \(j=1,\dots,m\), and combine them as \(\tilde \beta = \frac{1}{m} \sum_{j=1}^m \tilde \beta_j\). Then calculate the variance of \(\tilde \beta\), whose square root is the standard error:

\(V_{\beta}=W + (1+1/m) B\)

where \(W=\frac{1}{m} \sum_{j=1}^m s_j^2\) is the within-imputation variation (\(s_j\) is the standard error of \(\tilde \beta_j\)) and \(B=\frac{1}{m-1} \sum_{j=1}^m (\tilde \beta_j - \tilde \beta)^2\) is the between-imputation variation.
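As a worked example of the pooling formulas, suppose three imputed data sets give estimates 1.0, 1.2, and 1.1 with squared standard errors 0.04, 0.05, and 0.06. These numbers are made up for illustration. The arithmetic can be checked with Stata scalars:

```stata
* Hypothetical estimates and squared standard errors from m = 3 imputations
scalar m    = 3
scalar bbar = (1.0 + 1.2 + 1.1) / m
scalar W    = (0.04 + 0.05 + 0.06) / m
scalar B    = ((1.0 - bbar)^2 + (1.2 - bbar)^2 + (1.1 - bbar)^2) / (m - 1)
scalar V    = W + (1 + 1/m) * B

display "pooled estimate = " bbar
display "within = " W "   between = " B
display "total variance = " V "   standard error = " sqrt(V)
```

By hand, the pooled estimate is 1.1, W is 0.05, B is 0.01, and the total variance is 0.05 + (4/3)(0.01), or about 0.0633, for a standard error of about 0.252. That is larger than the within-imputation standard error of about 0.224, because it also reflects the disagreement between imputations.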

12.3.2 MI through Chained Equations (MICE) (by van Buuren)

  1. Discard observations in which every variable is missing.

  2. Fill in the missing data with random draws from the observed values.

  3. Move through the columns and perform single-variable imputation for each, using some method.

  4. Replace the initial fill-ins with the fitted imputations. Repeat step 3 a large number of times, or until a convergence criterion is met.

  5. Repeat steps 1-4 \(m\) times to create \(m\) imputed data sets.

There are many ways to implement the single-variable imputation step in MICE.

  • Regression (linear, logit, multinomial): obtain \(\hat w\), or sample from \(f(\hat w)\).

  • The default is Predictive Mean Matching (PMM):

    1. Compute the predicted values.

    2. For each missing case, pick the three observed cases that have the closest predicted values in terms of Euclidean distance.

    3. Randomly choose one of the three and use its observed value as the imputation.
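In Stata, chained equations with PMM can be sketched as follows. The simulated data and all variable names are assumptions for illustration; knn(3) requests the three nearest donors described above, and burnin(10) sets the number of cycles through the variables before each imputed data set is saved.

```stata
* Illustrative simulation; all names and parameter values are assumptions
clear
set seed 1
set obs 500
generate z = rnormal()
generate y = 1 + z + rnormal()
generate x = 0.5*z + 0.5*y + rnormal()
generate w = x + z + rnormal()
* Delete about 20% of x and of w completely at random
replace x = . if runiform() < 0.2
replace w = . if runiform() < 0.2

mi set wide
mi register imputed x w
mi register regular y z

* Chained equations: impute x and w in turn by PMM with 3 nearest donors,
* using the complete variables y and z as additional predictors
mi impute chained (pmm, knn(3)) x w = y z, add(5) burnin(10) rseed(1)

mi describe
```

Because PMM imputes values borrowed from observed cases, the imputations always stay within the range of the observed data.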

12.3.3 Bayesian Data Augmentation

  • MICE can itself be viewed as a Markov chain.

  • Build a missing data model into a Bayesian model, treating the missing values as additional parameters to estimate by drawing from their posterior distribution.

\(f(\beta, y_{miss}|y_{obs}) \propto f(y_{obs} | \beta, y_{miss}) f(\beta, y_{miss})\)

  • We integrate out the missing values by sampling from the joint posterior distribution, then averaging the distribution of \(\beta\) over the sampled missing data points.

  • This is fine as long as the data are MAR and the missingness mechanism is not related to \(\beta\) (ignorability).
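One common way to carry out this sampling is to alternate between imputing the missing values and updating the parameters. The iteration index \(t\), the number of draws \(T\), and the step labels below are notation assumed for illustration and do not come from these notes.

```latex
\begin{align*}
\text{I-step:}\quad & y_{miss}^{(t)} \sim f(y_{miss} \mid y_{obs}, \beta^{(t-1)}) \\
\text{P-step:}\quad & \beta^{(t)} \sim f(\beta \mid y_{obs}, y_{miss}^{(t)}) \\
\text{Marginal:}\quad & f(\beta \mid y_{obs}) = \int f(\beta, y_{miss} \mid y_{obs})\, dy_{miss}
  \approx \frac{1}{T} \sum_{t=1}^{T} f(\beta \mid y_{obs}, y_{miss}^{(t)})
\end{align*}
```

After a burn-in period, the draws of \(y_{miss}^{(t)}\) can be saved at intervals as imputed data sets, and the draws of \(\beta^{(t)}\) approximate the posterior with the missing values integrated out.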

12.3.4 Full Information Maximum Likelihood (FIML)

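If there are no missing data, the likelihood function is

\(L(\mu, \Sigma)=\prod_{i=1}^{n} f(y_i | \mu, \Sigma)\)

If there are missing values, each observation contributes the density of only its observed components:

\(L(\mu, \Sigma)=\prod_{i=1}^{n} f(y_i | \mu_i, \Sigma_i)\)

where \(y_i\) now holds the observed components for observation \(i\), and \(\mu_i\) and \(\Sigma_i\) are the corresponding subvector of \(\mu\) and submatrix of \(\Sigma\).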
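In Stata, one way to get FIML estimates for a linear model is the sem command with the method(mlmv) option. The sketch below uses simulated data, and all names and parameter values are assumptions for illustration.

```stata
* Illustrative simulation; all names and parameter values are assumptions
clear
set seed 7
set obs 500
generate z = rnormal()
generate x = 0.5*z + rnormal()
generate y = 1 + 2*x + z + rnormal()
* MAR: the chance that x is missing depends only on the observed z
replace x = . if runiform() < invlogit(z - 1)

* Complete-case regression drops every row with a missing x
regress y x z

* FIML: method(mlmv) uses all rows, including those with x missing
sem (y <- x z), method(mlmv)
```

Comparing the two outputs shows how many observations each approach uses. Note that method(mlmv) assumes joint normality of the variables in the model and that the data are MAR.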

12.3.5 mi in Stata

* Start a log of the session
log using mi_10_model2.log, replace

clear
set more off

set matsize 4000

use "CROSSED_ContestUserActivity FEB 2014 temp.dta", clear
*sample 10

* Declare the data as mi data, stored in the flong style
mi set flong
*local dummies "x1 x2"
local continuous "CulturalDist_KS5D_ctr tight_ctr targ_tight_ctr user_country_opennessValue_ctr targetcntropennes2010_ctr"

* Register the variables to impute, then create 10 imputations by chained equations
mi register imputed `continuous'
mi impute chained (regress) `continuous', add(10) rseed(1)

* Fit the main model on each imputed data set and pool the results
mi estimate, cmdok: heckprob has_won c.CulturalDist_KS5D_ctr##c.tight_ctr targ_tight_ctr gender_code expert_cat2 expert_cat3 expert_cat4 submissions_count, select(has_submitted = c.CulturalDist_KS5D_ctr##c.tight_ctr log_cash_000 gender_code expert_cat2 expert_cat3 expert_cat4 submissions_count ave_numConcurrentContest) vce(cluster user_id) difficult nonrtol

log close

This is sample code. In Stata, you first run “mi set”, then “mi register” to register the variables you need to impute, then “mi impute chained” if you have multiple variables to impute. By default, the prediction model for each imputed variable includes all the other variables being imputed, plus any complete predictors listed after “=”, unless you specify otherwise. Finally, run “mi estimate” to fit the main model on each imputed data set and pool the results.