UCLA Office of Advanced Research Computing, Statistical Methods and Data Analytics
Introduction
Missing data are common in applied research. Sometimes, missing data are handled in an ad hoc way, but the choice of method can affect the validity of the analysis.
In this seminar, we focus on multiple imputation, one of the most commonly used approaches for handling missing data.
Note
The goal is not to advocate for one universal method. Depending on the data structure and analytic model, other approaches, such as direct maximum likelihood, may be more appropriate.
Multiple imputation in practice
Multiple imputation requires careful attention to:
the structure of the data
the amount and pattern of missingness
the assumptions needed for imputation
the analytic model that will be estimated after imputation
The purpose of this seminar is to understand the issues that arise when using multiple imputation in practice.
Goals of statistical analysis with missing data
When analyzing data with missing values, we generally want to:
Minimize bias
Maximize use of available information
Obtain appropriate estimates of uncertainty
Note
Multiple imputation is designed to help with all three goals, but only when the imputation model and assumptions are appropriate.
Exploring missing-data mechanisms
A missing-data mechanism describes the process believed to have generated the missing values.
Missing-data mechanisms are usually discussed in three broad categories:
Missing completely at random: MCAR
Missing at random: MAR
Missing not at random: MNAR
Note
There are precise technical definitions for these terms. The explanations here are simplified for applied data analysis. See Figure 1.
Missing completely at random: MCAR
A variable is missing completely at random if missingness is not predicted by:
other variables in the data set
the unobserved value of the variable itself
In other words, the probability that a value is missing is unrelated to the observed or unobserved data.
Note
MCAR is a strong assumption and is often unlikely in real data.
Example of MCAR
One situation where MCAR may be reasonable is planned missingness.
For example, in a health survey, a random subset of participants may be selected for a more extensive physical examination.
Only those selected participants have complete information for the additional measurements.
Because the subset was randomly selected, missingness is unrelated to participant characteristics or to the missing values themselves.
A note about MCAR
MCAR can still allow missingness on one variable to be related to missingness on another variable.
For example:
var1 is missing whenever var2 is missing
a husband and wife are both missing information on height
This can still be MCAR if the missingness is not related to observed variables or to the true unobserved values.
Missing at random: MAR
A variable is missing at random if missingness can be predicted by other observed variables in the data set, but not by the missing value itself after controlling for those observed variables.
For example, in a survey, men may be more likely than women to skip a particular question.
In that case, gender predicts missingness on another variable.
Under MAR, the probability of missingness may depend on observed variables, but not on the missing value itself after conditioning on those observed variables.
MAR is closely related to the idea of an ignorable missing-data mechanism, which is needed for likelihood-based methods and multiple imputation.
Note
Ignorability is an important assumption for multiple imputation and direct maximum likelihood approaches.
Missing not at random: MNAR
Data are missing not at random if the value of the unobserved variable itself predicts missingness.
A classic example is income.
People with very high incomes may be more likely to decline to answer questions about their income than people with more moderate incomes.
Statistical models have been developed for MNAR mechanisms, but those models are beyond the scope of this seminar.
Why the mechanism matters
The missing-data mechanism matters because different mechanisms require different treatments.
If data are MCAR, complete-case analysis will not usually bias parameter estimates, such as regression coefficients.
However, complete-case analysis can still reduce sample size and increase standard errors.
If data are MAR or MNAR, analyzing only complete cases can lead to biased parameter estimates.
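The difference between the three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only and does not use the seminar data: the variables x, y, y_mcar, y_mar, and y_mnar, the sample size, and the missingness probabilities are all assumptions chosen for the example.

```stata
* Simulated illustration only; all names and parameter values are hypothetical
clear
set seed 12345
set obs 10000
generate x = rnormal()               // fully observed covariate
generate y = 0.5*x + rnormal()       // outcome before any values are deleted

* MCAR: every value has the same 30% chance of being missing
generate y_mcar = y if runiform() > 0.30

* MAR: the chance of missingness depends only on the observed x
generate y_mar = y if runiform() > invlogit(-1 + 1.5*x)

* MNAR: the chance of missingness depends on y itself
generate y_mnar = y if runiform() > invlogit(-1 + 1.5*y)

* Compare the complete-data mean of y with the observed-case means
summarize y y_mcar y_mar y_mnar
```

Under this setup, the observed mean of y_mcar should stay close to the complete-data mean of y, while the observed means of y_mar and y_mnar should fall below it, because larger values are more likely to be deleted. In the MAR case the shortfall can be corrected by conditioning on x; in the MNAR case it cannot.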
Modern missing-data methods, such as:
multiple imputation
direct maximum likelihood
generally assume that the data are at least MAR.
In this seminar, we focus on multiple imputation under the MCAR/MAR framework. Methods that are valid under MAR can also be used when data are MCAR.
Missing-data mechanisms: diagram
Figure 1: Graphical representations of (a) missing completely at random (MCAR), (b) missing at random (MAR), and (c) missing not at random (MNAR) in a univariate missing-data pattern. \(X\) represents variables that are completely observed, \(Y\) represents a variable that is partly missing, \(Z\) represents the component of the causes of missingness unrelated to \(X\) and \(Y\), and \(R\) represents the missingness indicator. Source: Schafer and Graham (2002).
Further reading
For more information on missing-data mechanisms, see:
Allison, 2002
Enders, 2010
Little & Rubin, 2002
Rubin, 1976
Schafer & Graham, 2002
Seminar data
This seminar uses the complete hsb2.dta data set for demonstration.
A modified version, hsb2_mar.dta, contains missing values and will be used later for the missing-data examples.
The complete-data regression results provide a reference point for comparing complete-case analysis and multiple imputation.
Note
The complete-data results are used for demonstration only. A formal evaluation of multiple imputation would require a simulation study with repeated samples and a known data-generating process.
Full-data regression model
Below is a regression model predicting read using the complete data set (hsb2), which contains test scores as well as demographic and school information for 200 high school students. The hsb2_mar data set was created from these data.
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear
regress read write i.female math ib3.prog
. use https://stats.idre.ucla.edu/stat/stata/notes/hsb2
(highschool and beyond (200 cases))
. regress read write i.female math ib3.prog
Source | SS df MS Number of obs = 200
-------------+---------------------------------- F(5, 194) = 41.53
Model | 10814.6553 5 2162.93105 Prob > F = 0.0000
Residual | 10104.7647 194 52.0864161 R-squared = 0.5170
-------------+---------------------------------- Adj R-squared = 0.5045
Total | 20919.42 199 105.122714 Root MSE = 7.2171
------------------------------------------------------------------------------
read | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
write | .3747415 .0746281 5.02 0.000 .2275549 .521928
|
female |
female | -2.69884 1.095408 -2.46 0.015 -4.859277 -.5384027
math | .4418632 .0749972 5.89 0.000 .2939487 .5897778
|
prog |
general | .2320562 1.512195 0.15 0.878 -2.750396 3.214509
academic | 1.879263 1.423068 1.32 0.188 -.9274069 4.685933
|
_cons | 9.623172 3.409797 2.82 0.005 2.898141 16.3482
------------------------------------------------------------------------------
Common techniques for missing data
Before focusing on multiple imputation, we will briefly review several common approaches for handling missing data:
Complete-case analysis, also called listwise deletion
Available-case analysis, also called pairwise deletion
Mean imputation
Single imputation
Stochastic imputation
Each method is easy to use, but each has important limitations.
Missingness in hsb2_mar
Below we look at some of the descriptive statistics of the data set hsb2_mar.dta.
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
sum
Although the data set contains 200 cases, several variables have fewer than 200 observed values.
The amount of missing information varies by variable:
read: 9 missing observations, or 4.5%
female and prog: 18 missing observations, or 9%
This does not seem like a lot of missing data, so we might be inclined to analyze only the cases that are fully observed, a strategy referred to as complete-case analysis.
Complete-case analysis in Stata
One simple strategy is to analyze only the observations with complete data on all variables in the model.
In Stata, the default behavior of regress is complete-case analysis, also called listwise deletion.
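A minimal sketch of the complete-case regression is shown here. It assumes the same model specification as the full-data regression, and that female and prog are stored as numeric variables in hsb2_mar, as they are in hsb2.

```stata
* Load the incomplete data used in this seminar
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

* regress drops any observation with a missing value on the outcome
* or on any predictor (listwise deletion); no extra option is needed
regress read write i.female math ib3.prog

* e(N) stores the number of observations actually used
display "Cases used in the regression: " e(N)
```

Checking e(N) after every model fit is a quick way to see how many cases listwise deletion has removed.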
Comparison of regression estimates using the complete data and complete-case analysis.
Parameter     | Full data β | Full data SE | Full data p-value | Complete-case β | Complete-case SE | Complete-case p-value
--------------+-------------+--------------+-------------------+-----------------+------------------+----------------------
Intercept     | 9.62        | 3.410        | 0.0053            | 13.03           | 4.124            | 0.002
Write         | 0.37        | 0.075        | <.0001            | 0.44            | 0.093            | <.0001
Female        | -2.70       | 1.095        | 0.0146            | -2.71           | 1.365            | 0.0496
Math          | 0.44        | 0.075        | <.0001            | 0.32            | 0.095            | 0.001
PROG academic | 1.88        | 1.423        | 0.1882            | 1.81            | 1.655            | 0.2759
PROG general  | 0.23        | 1.512        | 0.8782            | 0.52            | 1.881            | 0.7836
Only 130 cases were used in the complete-case regression.
This means that 70 of 200 cases were excluded because they had missing values on at least one variable in the model.
The smaller sample size can reduce statistical power and increase standard errors.
In this example, the standard errors are larger in the complete-case analysis, and the p-value for female rises from 0.015 to 0.0496, leaving the estimate only borderline significant at the 0.05 level.
Complete-case analysis can also produce biased estimates unless the missing-data mechanism is MCAR.
Again, the complete-data results are used for demonstration only. They help illustrate how estimates can change after missing data are introduced, but they are not a formal benchmark for evaluating missing-data methods.
Available-case analysis
Available-case analysis, also called pairwise deletion, uses all available nonmissing data for each statistic being estimated.
For example, each variance or covariance is estimated using all cases that have nonmissing values for the variables involved in that calculation.
Available-case analysis may use more data than complete-case analysis, so the loss of power can be less severe.
However, it has important drawbacks:
the sample size is not consistent across estimates
different correlations or covariances may be based on different subsets of cases
parameter estimates can differ from both complete-case and full-data analyses
unless the data are MCAR, estimates may be biased
Available-case analysis in Stata
The command below creates dummy variables for the categories of prog.
tab prog, gen(progcat)
We then examine summaries and correlations among the variables used in the regression model.
sum female write read math progcat1 progcat2, sep(6)
corr female write read math progcat1 progcat2
pwcorr female write read math progcat1 progcat2, obs
The key comparison is between corr and pwcorr, obs. The pwcorr command uses all available nonmissing observations for each pair of variables, so the sample size can differ across correlations.
Unconditional mean imputation
Unconditional mean imputation replaces missing values for a variable with the mean of the observed values for that same variable.
This method is simple and easy to implement, but it has serious limitations.
The main problem is that it places imputed values at the center of the distribution, which artificially reduces variability.
As a result, mean imputation:
reduces the variance of the imputed variable
changes correlations between the imputed variable and other variables
treats imputed values as if they were observed with certainty
can produce biased estimates unless the missing-data mechanism is MCAR, and distorts variances and correlations even then
Mean imputation in Stata
The code below replaces missing values with the observed mean for each variable.
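A minimal sketch of unconditional mean imputation follows. It is restricted to three continuous test scores, and the new variable names ending in _mean are hypothetical names introduced only for this illustration.

```stata
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

* Replace each missing value with the observed mean of the same variable.
* The *_mean variables are hypothetical names used only for this sketch.
foreach var of varlist read write math {
    quietly summarize `var'
    generate `var'_mean = cond(missing(`var'), r(mean), `var')
}

* Compare each original variable with its mean-imputed version
summarize read read_mean write write_mean math math_mean
```

In the summary, each mean-imputed variable keeps the same mean as the original but has a smaller standard deviation, which is the loss of variability described above.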
Single imputation
Single imputation replaces each missing value with one estimated value. A common form is regression imputation, where missing values are replaced by predicted scores from a regression model.
This uses more information than mean imputation because the imputed value is conditional on other variables. However, the predicted values fall directly on the regression line, so the method does not preserve the natural variability in the data. See Figure 2.
As a result, deterministic imputation can:
underestimate variability
inflate associations among variables
bias correlations and \(R^2\) upward
treat imputed values as if they were observed with certainty
Warning
Because uncertainty in the imputed values is ignored, standard errors and statistical tests can be misleading.
Figure 2: Deterministic regression imputation places imputed values directly on the regression line. Source: adapted from Enders (2010), p. 46.
Stochastic imputation
Stochastic imputation modifies regression imputation by adding random error to the predicted values.
Instead of imputing only the predicted score,
\[
\hat{Y}
\]
we impute:
\[
\hat{Y} + e
\]
where \(e\) is randomly drawn from a distribution with mean 0 and variance equal to the residual variance from the regression model.
This restores some of the lost variability, but it still uses only one imputed value per missing observation.
Stochastic imputation can improve on deterministic regression imputation, but standard errors may still be too small because uncertainty about the imputation process is not fully incorporated.
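The contrast between deterministic and stochastic regression imputation can be sketched with simulated data. Everything in this example is an assumption made for illustration: the variables x, y, yhat, y_det, and y_sto, the sample size, and the 25% missingness rate are not part of the seminar data.

```stata
* Simulated illustration only; all names and values are hypothetical
clear
set seed 2024
set obs 500
generate x = rnormal(50, 10)
generate y = 10 + 0.6*x + rnormal(0, 8)
replace y = . if runiform() < 0.25       // delete about 25% of y completely at random

* Fit the imputation regression on the observed cases
regress y x
predict double yhat, xb                  // predicted score for every observation

* Deterministic imputation: fill in the predicted score itself
generate double y_det = cond(missing(y), yhat, y)

* Stochastic imputation: predicted score plus a random residual;
* e(rmse) is the estimated residual standard deviation from regress
generate double y_sto = cond(missing(y), yhat + rnormal(0, e(rmse)), y)

summarize y y_det y_sto
```

The standard deviation of y_det is typically smaller than that of y_sto, because the deterministic values sit exactly on the regression line while the stochastic values scatter around it.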
Traditional methods such as complete-case analysis, mean imputation, deterministic imputation, and stochastic imputation are easy to apply, but they can distort estimates or uncertainty.
A key limitation is that single-imputation methods fill in one value and then treat it as if it were observed.
Multiple imputation addresses this by creating several plausible completed data sets and combining results across them.
Figure 3: Stochastic regression imputation adds random residual variation to the predicted values, so imputed values no longer fall exactly on the regression line. Source: adapted from Enders (2010), p. 48.
Multiple imputation
Multiple imputation is an extension of stochastic imputation.
Instead of filling in one value for each missing observation, multiple imputation creates several plausible completed data sets.
Each imputed value includes a random component, reflecting uncertainty about the missing value.
Multiple imputation has three basic phases:
Imputation phase
Missing values are filled in with estimated values. This process is repeated \(m\) times, creating \(m\) completed data sets.
Analysis phase
Each completed data set is analyzed using the statistical model of interest.
Pooling phase
The estimates and standard errors from the \(m\) analyses are combined for inference.
Note
The goal is not to recover the one “true” missing value. The goal is to preserve the relationships among variables and account for uncertainty due to missing data.
Imputation and analytic models
The imputation model should be compatible, or congenial, with the planned analytic model.
At a minimum, the imputation model should include all variables used in the final analysis model.
It should also include important features of the analytic model, such as:
transformations
interactions
nonlinear terms
recoded or categorized variables
The reason is that multiple imputation aims to preserve the relationships among variables.
Note
A compatible, or congenial, imputation model helps preserve the variance-covariance structure needed for valid inference.
Further reading
For more on imputation-model compatibility, see:
von Hippel, 2009
von Hippel, 2013
White et al., 2010
Multiple imputation workflow
Figure 4: Multiple imputation workflow: impute missing values, analyze each completed data set, and pool the results for inference.
Preparing to conduct MI
The first step is to examine the number and proportion of missing values among the variables of interest.
The user-written Stata command mdesc can be used to summarize missingness.
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
mdesc female write read math prog
If mdesc is not installed, search for it and install it:
search mdesc
ssc install mdesc
The mdesc output reports the number and percentage of missing observations for each variable.
In this example, the variables with the highest proportion of missing information are:
prog: 9.0% missing
female: 9.0% missing
Variables with higher missingness can have a greater impact on the stability and convergence of the imputation model.
MI data storage styles
Stata stores multiply imputed data using an MI style. The style determines how the original data and imputed data sets are arranged in memory.
This figure illustrates three common MI styles:
mlong: stores the original data and imputed data sets in long form, including only needed observations for imputations
flong: stores the original data and all imputed data sets in long form
wide: stores imputations as additional variables in wide form
Figure 5: Illustration of Stata MI storage styles: mlong, flong, and wide. Red cells represent missing values in the original data, and green cells represent imputed values.
Declaring the data as MI data
For this seminar, we use the mlong style.
This is done with mi set.
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
mi set mlong
The mlong style tells Stata how the multiply imputed data will be stored after imputation.
The MI storage style can be changed later using:
mi convert
For more information about available MI storage styles, use:
help mi styles
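To see how the styles differ in practice, the sketch below imputes a small built-in data set and converts it from mlong to wide. It uses Stata's auto data rather than the seminar data, and the imputation model (a linear regression of rep78 on mpg and weight) is an assumption chosen only to produce some imputations quickly.

```stata
* Illustration with Stata's built-in auto data (not the seminar data);
* rep78 has a few missing values
sysuse auto, clear
mi set mlong
mi register imputed rep78
mi impute regress rep78 mpg weight, add(5) rseed(1234)

mi describe                 // reports style mlong and the number of imputations
mi convert wide, clear      // same imputations, now stored as extra variables
mi describe                 // reports style wide
```

Converting between styles changes only how the imputations are stored, not the imputed values, so results from mi estimate should not depend on the style.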
Data structure after mi set mlong
After running mi set mlong, Stata adds variables used to track the imputed data structure:
Figure 6: Stata adds MI system variables after mi set mlong
_mi_miss: marks observations in the original data that have missing values
_mi_m: indicates the imputation number; _mi_m = 0 is the original data
_mi_id: identifies observations and links them across imputed data sets
Note
These variables are created by Stata to manage the multiple-imputation data structure. You usually do not edit them directly.
Examining missing-data patterns
After declaring the data as MI data, Stata’s mi misstable commands can be used to examine missingness.
mi misstable summarize female write read math prog
mi misstable patterns female write read math prog
If an auxiliary variable differs between missing and observed cases, it may help explain missingness and support the MAR assumption.
In these data, the only significant difference was found when examining missingness on math with socst: the mean socst score is significantly lower among respondents who are missing on math. This suggests that socst is a potential correlate of missingness (Enders, 2010), so including it in the imputation model may help satisfy the MAR assumption for multiple imputation.
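One way to run this kind of check is a two-sample t-test on a missingness indicator. The sketch below is a plausible version of that check rather than the seminar's exact code; miss_math is a hypothetical indicator variable created for the example.

```stata
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

* miss_math is a hypothetical indicator: 1 if math is missing, 0 otherwise
generate miss_math = missing(math)

* Compare the mean of the candidate auxiliary variable across the two groups
ttest socst, by(miss_math)
```

The same pattern can be repeated for each incomplete variable and each candidate auxiliary variable. With many such tests, some significant differences are expected by chance, so the results are a screening aid rather than proof of MAR.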
MI using multivariate normal distribution
One imputation option in Stata is multivariate normal imputation.
This approach assumes that the variables in the imputation model follow a joint multivariate normal distribution.
The algorithm uses data augmentation, a Markov chain Monte Carlo method, to draw missing values from their conditional distribution given the observed data.
Because MVN imputation draws values from a multivariate normal distribution, imputed values can be decimal or negative.
This is not necessarily a problem for estimation, but categorical variables need special handling.
For nominal categorical variables, we create indicator variables before imputation.
For example, instead of imputing the original categorical variable prog, we use program-category indicators such as progcat1, progcat2, and progcat3.
Note
MVN imputation can often perform reasonably well even when normality is not perfect, especially with sufficient sample size. Problems are more likely when the sample size is small and the fraction of missing information is high.
Imputation phase
After the data are declared as MI data, Stata needs to know which variables will be imputed.
This is done with the mi register imputed command.
mi register imputed female write read math progcat1 progcat2 science
Then we specify the imputation model and the number of imputed data sets using mi impute mvn.
mi impute mvn requests multivariate normal imputation.
Variables before = have missing values and are imputed.
Variables after = are predictors only.
add(10) creates 10 imputed data sets.
rseed(53421) makes the random imputation process reproducible.
. mi impute mvn female write read math progcat1 progcat2 science = socst, add(10) rseed(53421)
Performing EM optimization:
observed log likelihood = -1601.2096 at iteration 12
Performing MCMC data augmentation ...
Multivariate imputation Imputations = 10
Multivariate normal regression added = 10
Imputed: m=1 through m=10 updated = 0
Prior: uniform Iterations = 1000
burn-in = 100
between = 100
------------------------------------------------------------------
| Observations per m
|----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-------------------+-----------------------------------+----------
female | 182 18 18 | 200
write | 183 17 17 | 200
read | 191 9 9 | 200
math | 185 15 15 | 200
progcat1 | 182 18 18 | 200
progcat2 | 182 18 18 | 200
science | 184 16 16 | 200
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
of the number of filled-in observations.)
Note
Even though science is an auxiliary variable, it has missing values, so it must be included on the left side of the equals sign and imputed.
Analysis and pooling phase
After creating the imputed data sets, we use mi estimate to run the analysis model.
In this example, the analytic model is a linear regression predicting read.
mi estimate: regress read write female math progcat1 progcat2
The mi estimate: prefix tells Stata to:
run the regression model in each imputed data set
combine the estimates across imputations
report one set of coefficients, standard errors, confidence intervals, and p-values
. mi estimate: regress read write female math progcat1 progcat2
Multiple-imputation estimates Imputations = 10
Linear regression Number of obs = 200
Average RVI = 0.1503
Largest FMI = 0.2468
Complete DF = 194
DF adjustment: Small sample DF: min = 77.11
avg = 114.71
max = 173.44
Model F test: Equal FMI F( 5, 174.4) = 35.62
Within VCE type: OLS Prob > F = 0.0000
------------------------------------------------------------------------------
read | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
write | .3893681 .0817014 4.77 0.000 .2278278 .5509084
female | -2.747417 1.143906 -2.40 0.017 -5.005185 -.4896491
math | .4019581 .086767 4.63 0.000 .229499 .5744173
progcat1 | .5163472 1.68493 0.31 0.760 -2.827101 3.859795
progcat2 | 2.812402 1.602017 1.76 0.083 -.3775466 6.002351
_cons | 10.35629 3.68667 2.81 0.006 3.052563 17.66001
------------------------------------------------------------------------------
Comparing regression results
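Parameter     | Full data β | Full data SE | Full data p-value | Complete-case β | Complete-case SE | Complete-case p-value | MVN imputation β | MVN imputation SE | MVN imputation p-value
--------------+-------------+--------------+-------------------+-----------------+------------------+-----------------------+------------------+-------------------+-----------------------
Intercept     | 9.62        | 3.410        | 0.0053            | 13.03           | 4.124            | 0.002                 | 10.36            | 3.687             | 0.006
Write         | 0.37        | 0.075        | <.0001            | 0.44            | 0.093            | <.0001                | 0.39             | 0.082             | <.0001
Female        | -2.70       | 1.095        | 0.0146            | -2.71           | 1.365            | 0.0496                | -2.75            | 1.144             | 0.017
Math          | 0.44        | 0.075        | <.0001            | 0.32            | 0.095            | 0.001                 | 0.40             | 0.087             | <.0001
PROG academic | 1.88        | 1.423        | 0.1882            | 1.81            | 1.655            | 0.2759                | 2.81             | 1.602             | 0.083
PROG general  | 0.23        | 1.512        | 0.8782            | 0.52            | 1.881            | 0.7836                | 0.52             | 1.685             | 0.760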
The MVN imputation estimates are generally closer to the complete-data demonstration results than the complete-case estimates.
The standard errors from multiple imputation are often slightly larger than those from the full-data analysis because MI incorporates additional uncertainty from the missing values.
Imputation diagnostics
After running mi estimate, Stata reports several diagnostic quantities in the output.
These help describe how much uncertainty is due to missing data and how the pooled standard errors are calculated.
To request more detailed diagnostic tables, use:
mi estimate, vartable dftable
The vartable and dftable options report variance components, relative increase in variance, fraction of missing information, relative efficiency, and degrees of freedom.
. mi estimate, vartable dftable
Multiple-imputation estimates Imputations = 10
Linear regression
Variance information
------------------------------------------------------------------------------
| Imputation variance Relative
| Within Between Total RVI FMI efficiency
-------------+----------------------------------------------------------------
write | .005939 .000669 .006675 .123977 .113855 .988743
female | 1.24261 .059921 1.30852 .053044 .051507 .994876
math | .005947 .001438 .007529 .265958 .219719 .9785
progcat1 | 2.31652 .474974 2.83899 .225541 .191897 .981172
progcat2 | 1.9623 .549235 2.56646 .307883 .246847 .97591
_cons | 11.4877 1.91258 13.5915 .183139 .160802 .984174
------------------------------------------------------------------------------
Multiple-imputation estimates Imputations = 10
Linear regression Number of obs = 200
Average RVI = 0.1503
Largest FMI = 0.2468
Complete DF = 194
DF adjustment: Small sample DF: min = 77.11
avg = 114.71
max = 173.44
Model F test: Equal FMI F( 5, 174.4) = 35.62
Within VCE type: OLS Prob > F = 0.0000
------------------------------------------------------------------------------
| % increase
read | Coefficient Std. err. t P>|t| df std. err.
-------------+----------------------------------------------------------------
write | .3893681 .0817014 4.77 0.000 138.8 6.02
female | -2.747417 1.143906 -2.40 0.017 173.4 2.62
math | .4019581 .086767 4.63 0.000 87.0 12.51
progcat1 | .5163472 1.68493 0.31 0.760 98.6 10.70
progcat2 | 2.812402 1.602017 1.76 0.083 77.1 14.36
_cons | 10.35629 3.68667 2.81 0.006 113.3 8.77
------------------------------------------------------------------------------
Variance components in MI
Multiple imputation combines two main sources of variation:
Within-imputation variance
The average sampling variance across the imputed data sets.
For example, if you squared the standard error for write in each of the 10 imputations, summed the results, and divided by 10, the result would equal \(V_W = 0.0059\).
This estimates the sampling variability that we would have expected had there been no missing data.
Between-imputation variance
The variability in parameter estimates across the imputed data sets.
For example, if you took all 10 of the parameter estimates for write and calculated their variance, the result would equal \(V_B = 0.00067\).
This variability estimates the additional variation (uncertainty) that results from missing data.
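The two verbal definitions above can be written compactly. In this sketch, \(\hat{Q}_j\) and \(SE_j\) denote the estimate and standard error of a coefficient in imputed data set \(j\), and \(\bar{Q}\) is the pooled estimate; these symbols are notation introduced here for illustration and do not appear in the Stata output.

\[
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j
\]

\[
V_W = \frac{1}{m}\sum_{j=1}^{m} SE_j^2
\]

\[
V_B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{Q}_j - \bar{Q}\right)^2
\]

The pooled coefficient reported by mi estimate is \(\bar{Q}\), the simple average of the \(m\) estimates.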
Total variance
The total variance combines both sources, plus a correction for using a finite number of imputations.
\[
V_T = V_W + V_B + \frac{V_B}{m}
\]
For example, using the values reported for write in the variance table, the total variance is \(V_T = V_W + V_B + V_B/m = 0.005939 + 0.000669 + 0.000669/10 \approx 0.00667\).
Relative increase in variance
The relative increase in variance, often called RVI or RIV, measures how much larger the sampling variance is because of missing data.
\[
RVI = \frac{V_B + V_B/m}{V_W}
\]
A higher RVI means more uncertainty is being added by the missing data.
Variables with more missing information, or variables weakly predicted by the imputation model, tend to have higher RVI values.
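As a worked example, the RVI for write can be reproduced from the variance table using \(V_W = 0.005939\), \(V_B = 0.000669\), and \(m = 10\):

\[
RVI_{write} = \frac{0.000669 + 0.000669/10}{0.005939} = \frac{0.000736}{0.005939} \approx 0.124
\]

This agrees, up to rounding of the inputs, with the value of .123977 in the RVI column.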
Fraction of missing information
The fraction of missing information, or FMI, is the proportion of total sampling variance attributable to missing data.
\[
FMI = \frac{V_B + V_B/m}{V_T}
\]
An FMI of 0.20 means that about 20% of the total sampling variance is due to missing data.
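As a worked example, applying this ratio to write with \(V_B = 0.000669\), \(m = 10\), and \(V_T = 0.006675\) gives:

\[
FMI_{write} \approx \frac{0.000669 + 0.000669/10}{0.006675} = \frac{0.000736}{0.006675} \approx 0.110
\]

The FMI column in the Stata output shows .113855 for write, which is slightly higher. This indicates that Stata's calculation includes an adjustment beyond the simplified ratio shown here, so the formula should be read as an approximation to the reported value.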
Note
A practical rule of thumb is to use at least as many imputations as the highest FMI percentage. For example, if the largest FMI is 25%, consider using at least 25 imputations.
Relative efficiency
Relative efficiency compares using \(m\) imputations with using an infinite number of imputations.
\[
RE = \frac{1}{1 + FMI/m}
\]
When the amount of missing information is low, a small number of imputations may give high relative efficiency.
However, more imputations may still be needed to estimate standard errors well.
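As a worked example, the relative efficiency for write follows from its reported FMI of .113855 and \(m = 10\):

\[
RE_{write} = \frac{1}{1 + 0.113855/10} = \frac{1}{1.0113855} \approx 0.9887
\]

This matches the relative efficiency column in the variance table. Even for progcat2, which has the largest FMI (.246847), the same calculation gives \(1/(1 + 0.0246847) \approx 0.976\), which shows how quickly relative efficiency approaches 1 as \(m\) grows.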
Note
Good relative efficiency does not necessarily mean the variance estimates are stable.
Degrees of freedom
In MI analysis, degrees of freedom are not determined only by sample size.
They also depend on:
the number of imputations
the amount of missing information
the variability between imputations
Stata uses a small-sample correction (Barnard and Rubin, 1999) by default to avoid inflated degrees of freedom.
Note
Fractional degrees of freedom are normal in multiple-imputation output.
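The degrees of freedom reported for write can be reproduced by hand using the Barnard and Rubin (1999) adjustment. The symbols \(\nu_m\), \(\nu_{obs}\), \(\nu_{com}\), \(\gamma\), and \(\nu^*\) are notation introduced here for illustration; \(\nu_{com} = 194\) is the complete-data degrees of freedom shown in the output.

\[
\gamma = \frac{V_B + V_B/m}{V_T} \approx 0.1103, \qquad
\nu_m = (m-1)\left(1 + \frac{1}{RVI}\right)^2 = 9\left(1 + \frac{1}{0.123977}\right)^2 \approx 739.7
\]

\[
\nu_{obs} = \frac{\nu_{com}+1}{\nu_{com}+3}\,\nu_{com}\,(1-\gamma) = \frac{195}{197}(194)(1 - 0.1103) \approx 170.9
\]

\[
\nu^* = \left(\frac{1}{\nu_m} + \frac{1}{\nu_{obs}}\right)^{-1} = \left(\frac{1}{739.7} + \frac{1}{170.9}\right)^{-1} \approx 138.8
\]

The result agrees with the df of 138.8 shown for write in the dftable output. Without the small-sample correction, the degrees of freedom would be \(\nu_m \approx 739.7\), which exceeds the 194 available in the complete data.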
Additional diagnostic checks
After imputation, it is useful to compare observed and imputed values.
Possible checks include:
means
frequencies
box plots
distributions of observed versus imputed values
residuals and outliers within imputed data sets
If unusual values appear in only a few imputations, this may indicate a problem with the imputation model.
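One convenient tool for these comparisons is mi xeq, which runs a command on selected imputations. The sketch below uses Stata's built-in auto data rather than the seminar data; the imputation model and the indicator was_missing are assumptions made only for the illustration.

```stata
* Illustration with Stata's built-in auto data (not the seminar data)
sysuse auto, clear
generate byte was_missing = missing(rep78)   // hypothetical flag, created before imputing
mi set mlong
mi register imputed rep78
mi impute regress rep78 mpg weight, add(5) rseed(1234)

* Summarize rep78 in the original data (m = 0) and in two completed data sets
mi xeq 0 1 5: summarize rep78

* Look only at the values that were filled in for imputation 1
mi xeq 1: summarize rep78 if was_missing
```

Because rep78 is imputed here with a linear regression, the filled-in values need not be whole numbers or fall within the observed range, which is the kind of feature these checks are meant to reveal.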
Checking convergence
For MVN imputation, convergence means that the data augmentation algorithm has reached a stable posterior distribution.
Convergence should be checked for several imputed variables, especially those with a high proportion of missing values or a high FMI.
Convergence is often assessed visually using trace plots.
Trace plots show estimated parameters across iterations. They can be requested using the saveptrace() and mcmconly options.
The mcmconly option runs the MCMC algorithm for the same number of iterations it would take to obtain 10 imputations, without actually producing the 10 imputed data sets. It is typically used in combination with saveptrace() or savewlf() to examine the convergence of the MCMC before imputation. No imputation is performed when mcmconly is specified, so the add() or replace options are not required with mi impute mvn.
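A command along the following lines would run the chain and save the trace file. This is a sketch rather than the seminar's exact command: the variable order is taken from the trace-file description, and burnin(1000) is an assumed value chosen so that the chain length matches the 1,000 records reported for the trace file.

```stata
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
quietly tab prog, gen(progcat)
mi set mlong
mi register imputed female write read math progcat1 progcat2 science

* Run the MCMC chain only and save the parameter trace; no data sets are imputed.
* burnin(1000) is an assumption, not a value confirmed by the seminar.
mi impute mvn write read female math science progcat1 progcat2 = socst, ///
    mcmconly burnin(1000) rseed(53421) saveptrace(trace, replace)
```

The file is written to the current working directory as trace.stptrace and can then be inspected with mi ptrace.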
Performing EM optimization:
observed log likelihood = -1601.2096 at iteration 12
Performing MCMC data augmentation ...
Note: No imputation performed.
Long-term trends in trace plots or high autocorrelation may indicate slow convergence.
Trace files for convergence diagnostics
The trace file saved by saveptrace() is not a regular Stata data set, but Stata can read it using mi ptrace.
You can describe the trace file without opening it:
mi ptrace describe trace
. mi ptrace describe trace
file trace.stptrace created on 1 Jun 2026 04:29 contains 1,000 records (obs) on
m 1 variable
iter 1 variable
b[y, x] 14 variables (7 x 2)
v[y, y] 28 variables (7 x 7, symmetric)
where y and x are
y: (1) write (2) read (3) female (4) math (5) science (6) progcat1 (7) progcat2
x: (1) socst (2) _cons
You can also load it into memory:
mi ptrace use trace, clear
The trace file contains information such as:
imputation number
iteration number
regression coefficients
variances and covariances
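If the model has many parameters, it may not be feasible to examine the convergence of each one individually. In that case you can use savewlf(). WLF stands for worst linear function: a single linear combination of the model parameters that is expected to converge most slowly, which corresponds to the highest fraction of missing information. If the WLF series appears stable, the individual parameters are likely to have converged as well.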
Trace plots for imputed variables
As an example, we examine diagnostics for female, one of the variables with fewer complete observations.
After loading the trace file, we declare the iteration number as the time variable.
tsset iter
Then we graph the coefficient and variance series for female.
Trace plots are used to check whether the MCMC algorithm appears stable across iterations.
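The trace plots can be drawn with tsline. The sketch below assumes that trace.stptrace was saved earlier with saveptrace(trace, replace), and it uses the names b_y3x1 and v_y3y3, which follow the numbering in the trace-file description (female is y3 and socst is x1). The titles and graph names are arbitrary choices.

```stata
* Assumes trace.stptrace was saved earlier with saveptrace(trace, replace)
mi ptrace use trace, clear
tsset iter

* b_y3x1: coefficient on socst in the equation for female
* v_y3y3: variance of female
tsline b_y3x1, ytitle("Coefficient on socst" "predicting female") ///
    xtitle("") name(trace1, replace)
tsline v_y3y3, ytitle("Variance of female") ///
    xtitle("") name(trace2, replace)
graph combine trace1 trace2, xcommon cols(1) ///
    title("Trace plots") b1title("Iteration")
```

A series that wanders without settling, or that drifts steadily up or down, would suggest that more burn-in iterations are needed.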
Figure 7: Trace plots for female. The coefficient and variance series fluctuate around stable levels across iterations, suggesting that the imputation algorithm has reached a stationary distribution.
Autocorrelation diagnostics
Another useful convergence diagnostic is the autocorrelation plot.
Autocorrelation measures the correlation between values from different MCMC iterations.
Because the imputation process is intended to produce sufficiently independent draws, we do not want strong correlation between values across iterations.
After loading the trace file, we can use Stata’s ac command to examine autocorrelation.
ac b_y3x1, ///
    ytitle("Coefficient on Socst" "used to predict Female") ///
    xtitle("") ciopts(astyle(none)) note("") ///
    name(ac1, replace) lags(100)
ac v_y3y3, ///
    ytitle("Female Variance") ///
    xtitle("") ciopts(astyle(none)) note("") ///
    name(ac2, replace) lags(100)
graph combine ac1 ac2, ///
    xcommon cols(1) title(Autocorrelations) b1title(Lag)
Figure 8: Autocorrelation plots for female. The autocorrelations drop close to zero after the first few lags, suggesting that the MCMC draws are not strongly dependent across iterations.
Interpreting autocorrelation plots
In an autocorrelation plot:
the x-axis shows the lag
the y-axis shows the correlation between values separated by that lag
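For reference, the quantity on the y-axis can be written as the usual sample autocorrelation at lag k. The notation here is an assumption introduced for illustration: θ_t is the saved parameter value at iteration t, θ̄ is its mean over the n saved iterations, and r_k is the lag-k autocorrelation.

```latex
r_k = \frac{\sum_{t=1}^{n-k} (\theta_t - \bar{\theta})(\theta_{t+k} - \bar{\theta})}
           {\sum_{t=1}^{n} (\theta_t - \bar{\theta})^2}
```

A value of r_k near zero means that draws k iterations apart are roughly uncorrelated.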
Ideally, autocorrelation should drop close to zero after a small number of lags.
If autocorrelation remains high for many lags, the imputed data sets may be too similar to each other.
The time it takes for autocorrelation to approach zero gives information about convergence.
If autocorrelation drops quickly, this suggests that the chain is mixing well and that successive draws are not strongly dependent.
If autocorrelation declines slowly, the algorithm may need more iterations before drawing imputations.
mi impute mvn ..., burnbetween(200)
The burnbetween() option increases the number of iterations between imputed data sets.
Warning
If autocorrelation remains high, consider increasing the number of iterations between imputations using the burnbetween() option.
MI using chained equations
A second method available in Stata is multiple imputation by chained equations, or MICE.
MICE is also known as:
fully conditional specification
sequential generalized regression
Unlike MVN imputation, MICE does not require assuming that all variables jointly follow a multivariate normal distribution.
Instead, MICE uses a separate conditional model for each variable being imputed.
Why use MICE?
MICE is useful when variables have different measurement scales or distributions.
For example, we may need different imputation models for:
binary variables
ordinal variables
nominal categorical variables
continuous variables
count variables
This flexibility is useful when imputed variables must take on specific types of values, such as 0/1 values for binary variables.
Stata’s chained-equations approach supports different models for different variable types.
Examples include:
logistic regression for binary variables
ordered logistic regression for ordinal variables
multinomial logistic regression for nominal variables
linear regression or predictive mean matching for continuous variables
Poisson or negative binomial regression for count variables
If no model is specified, Stata uses linear regression by default.
Predictive mean matching
Predictive mean matching, or PMM, is often used for continuous variables.
PMM imputes values by selecting observed values from cases with similar predicted means.
This helps keep imputed values within the range of observed data.
Warning
When using PMM in Stata, pay attention to the number of nearest-neighbor matches. Using too few matches can lead to underestimated standard errors.
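As a hedged illustration, the number of nearest neighbors is set with the knn() suboption of pmm. The value knn(5) below is an assumption chosen for illustration, not a recommendation from this seminar; the data set and variables are the ones used in the MICE example.

```stata
* Sketch only: knn(5) is an illustrative choice.
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
mi set mlong
mi register imputed female write read math prog science

* Continuous variables are imputed by PMM, drawing from the 5 closest donors
mi impute chained (logit) female (mlogit) prog ///
    (pmm, knn(5)) write read math science = socst, ///
    add(10) rseed(53421)
```

A larger donor pool adds variability to the imputed values, while a very small pool can reuse the same observed values too often.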
MICE imputation phase
Before using chained equations, reload the data and declare it as MI data.
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
mi set mlong
The mlong (marginal long) style stores the original data and, for each imputation, only the observations that had missing values, stacked in long form.
The basic setup is similar to MVN imputation, but with two key differences:
We use mi impute chained instead of mi impute mvn.
We specify an imputation model for each type of variable.
For example:
logit for binary variables
mlogit for unordered categorical variables
regress for continuous variables
Register variables for imputation
First, register the variables that will be imputed.
mi register imputed female write read math prog science
Unlike the MVN example, we no longer need dummy variables for prog.
Because MICE can impute unordered categorical variables directly, prog can be imputed using a multinomial logistic model.
Chained-equations imputation
Next, use mi impute chained to specify the imputation model.
The model in parentheses applies to the variable or variables listed after it.
For example, (logit) female tells Stata to impute female using logistic regression.
add(10) creates 10 imputed data sets.
rseed(53421) makes the random process reproducible.
savetrace(trace1, replace) saves trace information for convergence diagnostics.
socst is included as a predictor in the imputation model.
By default, Stata imputes variables from most observed to least observed. Use orderasis to keep the order specified in the command.
. mi impute chained (logit) female (mlogit) prog (regress) write read math science = socst, ///
> add(10) rseed(53421) savetrace(trace1, replace)
Conditional models:
read: regress read math science write i.female i.prog socst
math: regress math read science write i.female i.prog socst
science: regress science read math write i.female i.prog socst
write: regress write read math science i.female i.prog socst
female: logit female read math science write i.prog socst
prog: mlogit prog read math science write i.female socst
Performing chained iterations ...
Multivariate imputation Imputations = 10
Chained equations added = 10
Imputed: m=1 through m=10 updated = 0
Initialization: monotone Iterations = 100
burn-in = 10
female: logistic regression
prog: multinomial logistic regression
write: linear regression
read: linear regression
math: linear regression
science: linear regression
------------------------------------------------------------------
| Observations per m
|----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-------------------+-----------------------------------+----------
female | 182 18 18 | 200
prog | 182 18 18 | 200
write | 183 17 17 | 200
read | 191 9 9 | 200
math | 185 15 15 | 200
science | 184 16 16 | 200
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
of the number of filled-in observations.)
MICE analysis phase
Once the 10 imputed data sets have been created, we can run the analysis model.
Because female and prog were imputed using models appropriate for categorical variables, they can be used as factor variables in the regression model.
mi estimate: regress read write i.female math ib3.prog
The mi estimate: prefix runs the regression model in each imputed data set and then pools the results.
Stata fits the regression model separately in each of the 10 imputed data sets.
This produces 10 sets of:
regression coefficients
standard errors
test statistics
Stata then combines these results into one set of inferential statistics.
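The combining step follows Rubin's rules. The notation below is an assumption introduced for illustration: Q̂_i and U_i are the estimate and its squared standard error from imputed data set i, and m is the number of imputations (10 here).

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m} \hat{Q}_i
\qquad
W = \frac{1}{m}\sum_{i=1}^{m} U_i
\qquad
B = \frac{1}{m-1}\sum_{i=1}^{m} \bigl(\hat{Q}_i - \bar{Q}\bigr)^2
\qquad
T = W + \Bigl(1 + \frac{1}{m}\Bigr) B
```

The pooled coefficient is Q̄, and the pooled standard error is the square root of the total variance T, which adds the between-imputation variance B to the average within-imputation variance W.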
. mi estimate: regress read write i.female math ib3.prog
Multiple-imputation estimates Imputations = 10
Linear regression Number of obs = 200
Average RVI = 0.1649
Largest FMI = 0.2121
Complete DF = 194
DF adjustment: Small sample DF: min = 90.00
avg = 117.29
max = 146.03
Model F test: Equal FMI F( 5, 170.7) = 35.22
Within VCE type: OLS Prob > F = 0.0000
------------------------------------------------------------------------------
read | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
write | .4028188 .0827066 4.87 0.000 .2391084 .5665291
|
female |
female | -2.650018 1.201493 -2.21 0.029 -5.026381 -.273656
math | .4089138 .0844608 4.84 0.000 .2414949 .5763326
|
prog |
general | .0134051 1.710516 0.01 0.994 -3.384835 3.411645
academic | 2.341625 1.558824 1.50 0.136 -.75001 5.433259
|
_cons | 9.647476 3.617 2.67 0.009 2.499048 16.7959
------------------------------------------------------------------------------
Note
As with MVN imputation, the pooled standard errors include additional uncertainty due to the missing values.
Interpreting the MICE results
The MICE estimates are generally comparable to the complete-data demonstration results.
In this example:
write, female, and math are statistically significant
the standard errors are larger than in the complete-data analysis
the standard errors are still smaller than those from complete-case analysis
the estimates for prog should be examined carefully
Note
The larger standard errors are expected because multiple imputation incorporates uncertainty about the missing values.
Comparing regression estimates
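Parameter     | Full data β | Full data SE | Complete-case β | Complete-case SE | MICE β | MICE SE | MICE p-value
--------------+-------------+--------------+-----------------+------------------+--------+---------+-------------
Intercept     |        9.62 |        3.410 |           13.03 |            4.124 |   9.65 |   3.617 |        0.009
Write         |        0.37 |        0.075 |            0.44 |            0.093 |   0.40 |   0.083 |       <.0001
Female        |       -2.70 |        1.095 |           -2.71 |            1.365 |  -2.65 |   1.201 |        0.029
Math          |        0.44 |        0.075 |            0.32 |            0.095 |   0.41 |   0.084 |       <.0001
PROG academic |        1.88 |        1.423 |            1.81 |            1.655 |   2.34 |   1.559 |        0.136
PROG general  |        0.23 |        1.512 |            0.52 |            1.881 |   0.01 |   1.711 |        0.994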
The complete-data results are used for demonstration only, not as a formal benchmark for evaluating multiple imputation.
Interpreting the MICE results
Compared with complete-case analysis, the MICE standard errors are generally smaller, because all 200 observations contribute information through the imputation model.
Compared with the complete-data demonstration results, the MICE standard errors are somewhat larger, which is expected because MI incorporates uncertainty from missing values.
The estimates for write, female, and math are fairly consistent across approaches.
Warning
The estimates for prog differ more noticeably, so we should examine imputation diagnostics and consider whether the imputation model for prog can be improved.
MICE imputation diagnostics
After fitting the MICE model, we can examine diagnostic measures such as:
RVI: relative increase in variance
FMI: fraction of missing information
DF: degrees of freedom
RE: relative efficiency
between-imputation variance
within-imputation variance
mi estimate, vartable dftable
. mi estimate, vartable dftable
Multiple-imputation estimates Imputations = 10
Linear regression
Variance information
------------------------------------------------------------------------------
| Imputation variance Relative
| Within Between Total RVI FMI efficiency
-------------+----------------------------------------------------------------
write | .005903 .000852 .00684 .158707 .141937 .986005
|
female |
female | 1.27246 .155565 1.44359 .134481 .12251 .987897
math | .005955 .001072 .007134 .197998 .17193 .983098
|
prog |
general | 2.33199 .539888 2.92587 .254665 .212116 .979229
academic | 2.004 .387216 2.42993 .212544 .18258 .982069
|
_cons | 11.7972 1.16864 13.0827 .108968 .101237 .989978
------------------------------------------------------------------------------
Multiple-imputation estimates Imputations = 10
Linear regression Number of obs = 200
Average RVI = 0.1649
Largest FMI = 0.2121
Complete DF = 194
DF adjustment: Small sample DF: min = 90.00
avg = 117.29
max = 146.03
Model F test: Equal FMI F( 5, 170.7) = 35.22
Within VCE type: OLS Prob > F = 0.0000
------------------------------------------------------------------------------
| % increase
read | Coefficient Std. err. t P>|t| df std. err.
-------------+----------------------------------------------------------------
write | .4028188 .0827066 4.87 0.000 123.2 7.64
|
female |
female | -2.650018 1.201493 -2.21 0.029 133.9 6.51
math | .4089138 .0844608 4.84 0.000 107.8 9.45
|
prog |
general | .0134051 1.710516 0.01 0.994 90.0 12.01
academic | 2.341625 1.558824 1.50 0.136 102.8 10.12
|
_cons | 9.647476 3.617 2.67 0.009 146.0 5.31
------------------------------------------------------------------------------
In this example, the largest RVI and FMI are associated with prog.
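Several columns in these tables can be reproduced by hand from the within and between variances. The worked example below uses the row for write (W = 0.005903, B = 0.000852, FMI = 0.141937, m = 10). The symbols are assumptions introduced for illustration, and small differences from the Stata output are due to rounding of the displayed values.

```latex
\begin{aligned}
T &= W + \Bigl(1 + \tfrac{1}{m}\Bigr) B = 0.005903 + 1.1 \times 0.000852 \approx 0.00684 \\
\mathrm{RVI} &= \frac{(1 + 1/m)\,B}{W} = \frac{1.1 \times 0.000852}{0.005903} \approx 0.159 \\
\mathrm{RE} &= \frac{1}{1 + \mathrm{FMI}/m} = \frac{1}{1 + 0.141937/10} \approx 0.986 \\
\%\ \text{increase in SE} &= \Bigl(\sqrt{T/W} - 1\Bigr) \times 100
  = \Bigl(\sqrt{0.00684 / 0.005903} - 1\Bigr) \times 100 \approx 7.6
\end{aligned}
```

The same arithmetic applied to the prog (general) row gives an RVI of about 0.25 and a standard-error increase of about 12%, matching the largest values in the output.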
What do the diagnostics suggest?
The highest estimated values are associated with prog:
RVI is about 25%
FMI is about 21%
This suggests that the imputation for prog may contribute more uncertainty than the other variables.
Possible next steps include:
increasing the number of imputations to about 20 or 25
adding auxiliary variables associated with prog
checking trace plots or other diagnostics for convergence
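As one possible sketch of the first step, more imputations can be added to the existing ones by rerunning mi impute chained with add(). This assumes the MI data with the 10 existing imputations are still in memory. The seed 64532 is a hypothetical value, chosen to differ from the first run so that the new draws do not simply repeat the earlier ones.

```stata
* Sketch only: adds 10 imputations to the existing 10, for 20 in total.
mi impute chained (logit) female (mlogit) prog ///
    (regress) write read math science = socst, ///
    add(10) rseed(64532)

* Confirm the number of imputations, then refit and review the diagnostics
mi describe
mi estimate, vartable dftable: regress read write i.female math ib3.prog
```

If the FMI and RVI for prog remain high after adding imputations, the imputation model itself, for example its auxiliary variables, is the next place to look.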
MICE trace file
With mi impute chained, the savetrace() option saves the means and standard deviations of imputed values at each iteration.
The saved file, trace1, can be opened like a regular Stata data set.
use trace1, clear
describe
. use trace1,clear
(Summaries of imputed values from -mi impute chained-)
. describe
Contains data from trace1.dta
Observations: 110 Summaries of imputed values from -mi impute chained-
Variables: 14 1 Jun 2026 06:04
-----------------------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-----------------------------------------------------------------------------------------------------------------
iter byte %12.0g Iteration numbers
m byte %12.0g Imputation numbers
read_mean float %9.0g Mean of read
read_sd float %9.0g Std. dev. of read
math_mean float %9.0g Mean of math
math_sd float %9.0g Std. dev. of math
science_mean float %9.0g Mean of science
science_sd float %9.0g Std. dev. of science
write_mean float %9.0g Mean of write
write_sd float %9.0g Std. dev. of write
female_mean float %9.0g Mean of female
female_sd float %9.0g Std. dev. of female
prog_mean float %9.0g Mean of prog
prog_sd float %9.0g Std. dev. of prog
-----------------------------------------------------------------------------------------------------------------
Preparing the trace file
The trace file is stored in long form, with a row for each chain at each iteration.
Because there are multiple chains, the iteration number is repeated. Before using tsset, we reshape the data to wide form.
reshape wide *mean *sd, i(iter) j(m)
tsset iter
Now the mean and standard deviation for each variable are stored separately by chain.
After reshaping the trace file, we can graph the mean of the imputed values for a variable.
tsline read_mean1, ///
    name(mice1, replace) ///
    legend(off) ///
    ytitle("Mean of Read")
MICE trace plot for read
Figure 9: Trace plot of the mean of read across iterations for one MICE imputation chain.
This graph shows the mean of the imputed values of read across iterations for the first imputation chain.
Note
As with MVN imputation, we expect the trace plot to fluctuate randomly around a stable level, with no obvious long-term trend.
Trace plots across all imputation chains
All 10 imputation chains can also be graphed at the same time to check whether any single chain behaves unexpectedly.
Each colored line represents a different imputation chain, initialized with different starting values.
tsline read_mean*, name(mice1, replace) legend(off) ///
    ytitle("Mean of Read")
tsline read_sd*, name(mice2, replace) legend(off) ///
    ytitle("SD of Read")
graph combine mice1 mice2, xcommon cols(1) ///
    title("Trace plots of summaries of imputed values")
Figure 10: Trace plots of summaries of imputed values for read across 10 MICE imputation chains. The chains fluctuate around similar levels, suggesting no obvious convergence problem.
Note
When the chains are overlaid, we want to see that they all fluctuate around a similar stable level, with no single chain showing unusual drift or a different pattern.
MICE: strengths and cautions
Autocorrelation plots are mainly useful for MVN imputation. For MICE, they are less informative because the algorithm is iterative by design: each iteration uses the observed data and imputed values from the previous iteration, so some autocorrelation is expected.
MICE is attractive because each variable can be imputed using an appropriate conditional model. This is especially useful for variables that must take specific types of values, such as binary, categorical, ordinal, or count variables.
However, this flexibility can also create problems, including slow convergence, non-convergence, incompatible conditional models, or complete and quasi-complete separation when imputing categorical variables.
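As a hedged sketch of one way to respond to separation and slow convergence, Stata's augment suboption for the categorical methods and a longer burn-in can be requested as below. The burn-in of 20 and the trace file name trace2 are assumptions chosen for illustration.

```stata
* Sketch only: burnin(20) and trace2 are illustrative choices.
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
mi set mlong
mi register imputed female write read math prog science

* augment requests augmented regression when perfect prediction occurs
mi impute chained (logit, augment) female (mlogit, augment) prog ///
    (regress) write read math science = socst, ///
    add(10) rseed(53421) burnin(20) savetrace(trace2, replace)
```

The saved trace file can then be checked in the same way as trace1 to see whether the longer burn-in changes the picture.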
Warning
When using MICE, allow enough time to build, diagnose, and revise the imputation model before moving to the final analysis.
Passive imputation
Passive variables are variables created as functions of imputed variables.
For example, if the analytic model includes math^2, we may want math_sq to update automatically whenever math is imputed.
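mi register imputed math read write female prog
mi impute chained ///
    (regress) math read write ///
    (logit) female ///
    (mlogit) prog = socst, ///
    add(20) rseed(53421)
* Create the passive variable after imputation so that it is computed
* from the imputed values of math in every imputed data set
mi passive: generate math_sq = math^2
mi estimate: regress read math math_sq write i.female i.prog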
Passive imputation keeps math_sq mathematically consistent with math.
Warning
Passive imputation keeps derived variables mathematically consistent with their components. However, for nonlinear terms and interactions, the imputation model still needs to be compatible with the analytic model; otherwise, the relationship of interest may be attenuated.
Practical MI decisions
Before running multiple imputation, make several modeling decisions carefully:
Include variables that help predict missingness or the values of the incomplete variables.
Include the dependent variable in the imputation model.
Include transformations, interactions, and nonlinear terms needed in the final analysis.
Choose an imputation method appropriate for the variable types, such as MVN or MICE.
Use enough imputations, especially when the fraction of missing information is high.
Note
Multiple imputation should be treated as part of the analysis plan, not as a mechanical preprocessing step.
Main takeaways
Multiple imputation is generally preferable to single-imputation methods because it does not rely on one filled-in value and it incorporates uncertainty due to missing data.
The goal is not to recover the exact missing values, but to preserve the relationships among variables and obtain valid estimates and standard errors.
MI can improve estimation and power, but it is not magic: it depends on a reasonable imputation model, plausible assumptions, and careful diagnostics.
Warning
Multiple imputation helps address missing data, but it does not fix poor measurement, weak study design, or unsupported modeling assumptions.
References
Allison, P. D. (2002). Missing Data. Sage.
Allison, P. D. (2012). Handling missing data by maximum likelihood. SAS Global Forum: Statistics and Data Analysis.
Barnard, J., & Rubin, D. B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4), 948–955.
Bartlett, J. W., et al. (2014). Multiple imputation of covariates by fully conditional specification. Statistical Methods in Medical Research.
Bodner, T. E. (2008). What improves with increased missing data imputations? Structural Equation Modeling, 15(4), 651–675.
Demirtas, H., et al. (2008). Plausibility of multivariate normality when imputing non-Gaussian outcomes. Journal of Statistical Computation and Simulation, 78(1).
Enders, C. K. (2010). Applied Missing Data Analysis. Guilford Press.
Graham, J. W., et al. (2007). How many imputations are really needed? Prevention Science, 8, 206–213.
Lee, K. J., & Carlin, J. B. (2010). Fully conditional specification versus multivariate normal imputation. American Journal of Epidemiology, 171(5), 624–632.
Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). Wiley.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.
van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16, 219–242.
von Hippel, P. T. (2009). How to impute interactions, squares, and other transformed variables. Sociological Methodology, 39, 265–291.
White, I. R., et al. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399.