Multiple Imputation in Stata

UCLA Office of Advanced Research Computing, Statistical Methods and Data Analytics

Introduction

Missing data are common in applied research. Sometimes, missing data are handled in an ad hoc way, but the choice of method can affect the validity of the analysis.

In this seminar, we focus on multiple imputation, one of the most commonly used approaches for handling missing data.

Note

The goal is not to advocate for one universal method. Depending on the data structure and analytic model, other approaches, such as direct maximum likelihood, may be more appropriate.

Multiple imputation in practice

Multiple imputation requires careful attention to:

  • the structure of the data
  • the amount and pattern of missingness
  • the assumptions needed for imputation
  • the analytic model that will be estimated after imputation

The purpose of this seminar is to understand the issues that arise when using multiple imputation in practice.

Goals of statistical analysis with missing data

When analyzing data with missing values, we generally want to:

  1. Minimize bias
  2. Maximize use of available information
  3. Obtain appropriate estimates of uncertainty

Note

Multiple imputation is designed to help with all three goals, but only when the imputation model and assumptions are appropriate.

Exploring missing-data mechanisms

A missing-data mechanism describes the process believed to have generated the missing values.

Missing-data mechanisms are usually discussed in three broad categories:

  • Missing completely at random: MCAR
  • Missing at random: MAR
  • Missing not at random: MNAR

Note

There are precise technical definitions for these terms. The explanations here are simplified for applied data analysis. See Figure 1.

Missing completely at random: MCAR

A variable is missing completely at random if missingness is not predicted by:

  • other variables in the data set
  • the unobserved value of the variable itself

In other words, the probability that a value is missing is unrelated to the observed or unobserved data.

Note

MCAR is a strong assumption and is often unlikely in real data.

Example of MCAR

One situation where MCAR may be reasonable is planned missingness.

For example, in a health survey, a random subset of participants may be selected for a more extensive physical examination.

Only those selected participants have complete information for the additional measurements.

Because the subset was randomly selected, missingness is unrelated to participant characteristics or to the missing values themselves.

A note about MCAR

MCAR can still allow missingness on one variable to be related to missingness on another variable.

For example:

  • var1 is missing whenever var2 is missing
  • a husband and wife are both missing information on height

This can still be MCAR if the missingness is not related to observed variables or to the true unobserved values.

Missing at random: MAR

A variable is missing at random if missingness can be predicted by other observed variables in the data set, but not by the missing value itself after controlling for those observed variables.

For example, in a survey, men may be more likely than women to skip a particular question.

In that case, gender predicts missingness on another variable.

Under MAR, the probability of missingness may depend on observed variables, but not on the missing value itself after conditioning on those observed variables.
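
One informal way to look for this kind of dependence is to regress a missingness indicator on a fully observed variable. The sketch below assumes the hsb2_mar.dta file used later in this seminar. The indicator name miss_math and the choice of socst as the predictor are illustrative assumptions, not part of the original analysis.

```stata
* Illustrative check: does a fully observed variable predict missingness on math?
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

* miss_math is a hypothetical name: 1 = math is missing, 0 = math is observed
generate byte miss_math = missing(math)

* socst has no missing values in this file, so all 200 cases enter the model
logit miss_math socst
```

A clear association is consistent with missingness that depends on observed data. The absence of one does not establish MCAR, and no check based on observed data alone can rule out MNAR.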

MAR is closely related to the idea of an ignorable missing-data mechanism, which is needed for likelihood-based methods and multiple imputation.

Note

Ignorability is an important assumption for multiple imputation and direct maximum likelihood approaches.

Missing not at random: MNAR

Data are missing not at random if the unobserved value itself predicts missingness, even after accounting for the observed variables.

A classic example is income.

People with very high incomes may be more likely to decline to answer questions about their income than people with more moderate incomes.

Statistical models have been developed for MNAR mechanisms, but those models are beyond the scope of this seminar.

Why the mechanism matters

The missing-data mechanism matters because different mechanisms require different treatments.

If data are MCAR, complete-case analysis will not usually bias parameter estimates, such as regression coefficients.

However, complete-case analysis can still reduce sample size and increase standard errors.

If data are MAR or MNAR, analyzing only complete cases can lead to biased parameter estimates.

Modern missing-data methods, such as:

  • multiple imputation
  • direct maximum likelihood

generally assume that the data are at least MAR.

In this seminar, we focus on multiple imputation under the MCAR/MAR framework. Methods that are valid under MAR can also be used when data are MCAR.
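
The three mechanisms can be made concrete with a small simulation. The sketch below is not part of the seminar data: the variables x, y, y_mcar, y_mar, and y_mnar, the sample size, the seed, and the coefficients are all hypothetical choices made for illustration.

```stata
* Hypothetical simulation of the three missing-data mechanisms
clear
set seed 12345
set obs 5000
generate x = rnormal()                   // fully observed variable
generate y = 0.6*x + rnormal(0, 0.8)     // variable that will lose values

* MCAR: every value of y has the same 30% chance of being missing
generate y_mcar = y
replace  y_mcar = . if runiform() < 0.30

* MAR: the chance of missingness depends only on the observed x
generate y_mar = y
replace  y_mar = . if runiform() < invlogit(-1 + 1.5*x)

* MNAR: the chance of missingness depends on the value of y itself
generate y_mnar = y
replace  y_mnar = . if runiform() < invlogit(-1 + 1.5*y)

* Compare the observed means with the mean of the complete variable
summarize y y_mcar y_mar y_mnar
```

In a run of this sketch, the mean of y_mcar would be expected to stay close to the mean of y, while the observed means of y_mar and y_mnar would be expected to fall below it, because larger values are more likely to be missing under both of those setups.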

Missing-data mechanisms: diagram

Diagram comparing three missing-data mechanisms. For MCAR, missingness R is unrelated to both the fully observed variable X and the partially missing variable Y. For MAR, missingness R is related to the fully observed variable X but not directly to Y. For MNAR, missingness R is related to Y, meaning the probability of missingness depends on the value that may be missing.
Figure 1: Graphical representations of (a) missing completely at random (MCAR), (b) missing at random (MAR), and (c) missing not at random (MNAR) in a univariate missing-data pattern. \(X\) represents variables that are completely observed, \(Y\) represents a variable that is partly missing, \(Z\) represents the component of the causes of missingness unrelated to \(X\) and \(Y\), and \(R\) represents the missingness indicator. Source: Schafer and Graham (2002).

Further reading

For more information on missing-data mechanisms, see:

  • Allison, 2002
  • Enders, 2010
  • Little & Rubin, 2002
  • Rubin, 1976
  • Schafer & Graham, 2002

Seminar data

This seminar uses the complete hsb2.dta data set for demonstration.

A modified version, hsb2_mar.dta, contains missing values and will be used later for the missing-data examples.

The complete-data regression results provide a reference point for comparing complete-case analysis and multiple imputation.

Note

The complete-data results are used for demonstration only. A formal evaluation of multiple imputation would require a simulation study with repeated samples and a known data-generating process.

Full-data regression model

Below is a regression model predicting read using the complete data set (hsb2) that was used to create hsb2_mar. The data set contains test scores, as well as demographic and school information, for 200 high school students.

use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear
regress read write i.female math ib3.prog
. use https://stats.idre.ucla.edu/stat/stata/notes/hsb2
(highschool and beyond (200 cases))


. regress read write i.female math ib3.prog

      Source |       SS           df       MS      Number of obs   =       200
-------------+----------------------------------   F(5, 194)       =     41.53
       Model |  10814.6553         5  2162.93105   Prob > F        =    0.0000
    Residual |  10104.7647       194  52.0864161   R-squared       =    0.5170
-------------+----------------------------------   Adj R-squared   =    0.5045
       Total |    20919.42       199  105.122714   Root MSE        =    7.2171

------------------------------------------------------------------------------
        read | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       write |   .3747415   .0746281     5.02   0.000     .2275549     .521928
             |
      female |
     female  |   -2.69884   1.095408    -2.46   0.015    -4.859277   -.5384027
        math |   .4418632   .0749972     5.89   0.000     .2939487    .5897778
             |
        prog |
    general  |   .2320562   1.512195     0.15   0.878    -2.750396    3.214509
   academic  |   1.879263   1.423068     1.32   0.188    -.9274069    4.685933
             |
       _cons |   9.623172   3.409797     2.82   0.005     2.898141     16.3482
------------------------------------------------------------------------------

Common techniques for missing data

Before focusing on multiple imputation, we will briefly review several common approaches for handling missing data:

  • Complete-case analysis, also called listwise deletion
  • Available-case analysis, also called pairwise deletion
  • Mean imputation
  • Single imputation
  • Stochastic imputation

Each method is easy to use, but each has important limitations.

Missingness in hsb2_mar

Below we look at some of the descriptive statistics of the data set hsb2_mar.dta.

use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
sum
. use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
(highschool and beyond (200 cases))

. sum

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          id |        200       100.5    57.87918          1        200
      female |        182    .5549451    .4983428          0          1
        race |        200        3.43    1.039472          1          4
         ses |        200       2.055    .7242914          1          3
      schtyp |        200        1.16     .367526          1          2
-------------+---------------------------------------------------------
        prog |        182    2.027473    .6927511          1          3
        read |        191    52.28796    10.21072         28         76
       write |        183    52.95082    9.257773         31         67
        math |        185     52.8973    9.360837         33         75
     science |        184    51.30978    9.817833         26         74
-------------+---------------------------------------------------------
       socst |        200      52.405    10.73579         26         71

Although the data set contains 200 cases, several variables have fewer than 200 observed values.

The amount of missing information varies by variable:

  • read: 9 missing observations, or 4.5%
  • female and prog: 18 missing observations, or 9%

This doesn’t seem like a lot of missing data, so we might be inclined to analyze the observed data as they are, a strategy sometimes referred to as complete-case analysis.
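
Small percentages on individual variables can still add up across variables. As a minimal sketch, the code below counts how many of the five regression variables (read, write, female, math, prog) are missing for each student. The variable name nmiss is a hypothetical helper introduced here.

```stata
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

* nmiss (hypothetical name): number of model variables missing for each case
egen nmiss = rowmiss(read write female math prog)
tabulate nmiss

* Cases with nmiss == 0 are the only ones a complete-case regression keeps
count if nmiss == 0
```

The final count should agree with the number of observations reported by any regression that uses these five variables.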

Complete-case analysis in Stata

One simple strategy is to analyze only the observations with complete data on all variables in the model.

In Stata, the default behavior of regress is complete-case analysis, also called listwise deletion.

regress read write i.female math ib3.prog
. regress read write i.female math ib3.prog

      Source |       SS           df       MS      Number of obs   =       130
-------------+----------------------------------   F(5, 124)       =     23.69
       Model |  5895.48143         5  1179.09629   Prob > F        =    0.0000
    Residual |  6172.12627       124  49.7752118   R-squared       =    0.4885
-------------+----------------------------------   Adj R-squared   =    0.4679
       Total |  12067.6077       129  93.5473465   Root MSE        =    7.0552

------------------------------------------------------------------------------
        read | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       write |   .4410834   .0926477     4.76   0.000     .2577076    .6244592
             |
      female |
     female  |  -2.706338   1.365195    -1.98   0.050     -5.40844   -.0042351
        math |   .3210525   .0951436     3.37   0.001     .1327367    .5093682
             |
        prog |
    general  |   .5177428   1.880833     0.28   0.784    -3.204953    4.240438
   academic  |   1.811155   1.654859     1.09   0.276    -1.464274    5.086585
             |
       _cons |    13.0265   4.123545     3.16   0.002     4.864848    21.18815
------------------------------------------------------------------------------

Complete-data vs. complete-case results

Comparison of regression estimates using the complete data and complete-case analysis.

                  Full data                    Complete-case
Parameter         β       SE      p-value      β       SE      p-value
-----------------------------------------------------------------------
Intercept        9.62    3.410    0.0053      13.03    4.124    0.002
Write            0.37    0.075    <.0001       0.44    0.093    <.0001
Female          -2.70    1.095    0.0146      -2.71    1.365    0.0496
Math             0.44    0.075    <.0001       0.32    0.095    0.001
PROG academic    1.88    1.423    0.1882       1.81    1.655    0.2759
PROG general     0.23    1.512    0.8782       0.52    1.881    0.7836

Only 130 cases were used in the complete-case regression.

This means that 70 of 200 cases were excluded because they had missing values on at least one variable in the model.

The smaller sample size can reduce statistical power and increase standard errors.

In this example, the standard errors are larger in the complete-case analysis, and the p-value for female rises from 0.015 to 0.050, right at the conventional 0.05 threshold.

Complete-case analysis can also produce biased estimates unless the missing-data mechanism is MCAR.

Again, the complete-data results are used for demonstration only. They help illustrate how estimates can change after missing data are introduced, but they are not a formal benchmark for evaluating missing-data methods.
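
To see which cases listwise deletion removed, the estimation sample can be flagged after fitting the model. The sketch below is illustrative: the indicator name used is hypothetical, and socst is chosen for the comparison only because it has no missing values in this file.

```stata
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
quietly regress read write i.female math ib3.prog

* used (hypothetical name): 1 = included in the regression, 0 = dropped
generate byte used = e(sample)

* Compare a fully observed variable across retained and dropped cases
tabstat socst, by(used) statistics(n mean sd)
```

A marked difference between the two groups on an observed variable would be a hint that the retained cases are not a random subset of the sample.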

Available-case analysis

Available-case analysis, also called pairwise deletion, uses all available nonmissing data for each statistic being estimated.

For example, each variance or covariance is estimated using all cases that have nonmissing values for the variables involved in that calculation.

Available-case analysis may use more data than complete-case analysis, so the loss of power can be less severe.

However, it has important drawbacks:

  • the sample size is not consistent across estimates
  • different correlations or covariances may be based on different subsets of cases
  • parameter estimates can differ from both complete-case and full-data analyses
  • unless the data are MCAR, estimates may be biased

Available-case analysis in Stata

The command below creates dummy variables for the categories of prog.

tab prog, gen(progcat)

We then examine summaries and correlations among the variables used in the regression model.

sum female write read math progcat1 progcat2, sep(6)
corr female write read math progcat1 progcat2
pwcorr female write read math progcat1 progcat2, obs
. tab prog, gen(progcat)

    type of |
    program |      Freq.     Percent        Cum.
------------+-----------------------------------
    general |         41       22.53       22.53
   academic |         95       52.20       74.73
   vocation |         46       25.27      100.00
------------+-----------------------------------
      Total |        182      100.00

. sum female write read math progcat1 progcat2, sep(6)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      female |        182    .5549451    .4983428          0          1
       write |        183    52.95082    9.257773         31         67
        read |        191    52.28796    10.21072         28         76
        math |        185     52.8973    9.360837         33         75
    progcat1 |        182    .2252747    .4189156          0          1
    progcat2 |        182     .521978    .5008947          0          1

. corr female write read math progcat1 progcat2
(obs=130)

             |   female    write     read     math progcat1 progcat2
-------------+------------------------------------------------------
      female |   1.0000
       write |   0.2369   1.0000
        read |  -0.0512   0.6174   1.0000
        math |  -0.0636   0.6322   0.6175   1.0000
    progcat1 |  -0.0852  -0.0190  -0.0544  -0.0923   1.0000
    progcat2 |   0.0630   0.3194   0.3288   0.3924  -0.5530   1.0000


. pwcorr female write read math progcat1 progcat2, obs

             |   female    write     read     math progcat1 progcat2
-------------+------------------------------------------------------
      female |   1.0000 
             |      182
             |
       write |   0.2508   1.0000 
             |      166      183
             |
        read |  -0.0174   0.5872   1.0000 
             |      173      174      191
             |
        math |  -0.0241   0.6182   0.6589   1.0000 
             |      168      170      176      185
             |
    progcat1 |  -0.0317  -0.0604  -0.1058  -0.1651   1.0000 
             |      165      166      173      168      182
             |
    progcat2 |   0.0500   0.3439   0.3902   0.4457  -0.5635   1.0000 
             |      165      166      173      168      182      182
             |

The key comparison is between corr and pwcorr, obs. The pwcorr command uses all available nonmissing observations for each pair of variables, so the sample size can differ across correlations.

Unconditional mean imputation

Unconditional mean imputation replaces missing values for a variable with the mean of the observed values for that same variable.

This method is simple and easy to implement, but it has serious limitations.

The main problem is that it places imputed values at the center of the distribution, which artificially reduces variability. As a result, mean imputation:

  • reduces the variance of the imputed variable
  • changes correlations between the imputed variable and other variables
  • treats imputed values as if they were observed with certainty
  • can produce biased estimates of variances and correlations even when the missing-data mechanism is MCAR

Mean imputation in Stata

The code below replaces missing values with the observed mean for each variable.

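* Replace each variable's missing values with the mean of its observed values
foreach var of varlist female write read math progcat1 progcat2 progcat3 {
    egen mean`var' = mean(`var')                  // mean of the observed values
    replace `var' = mean`var' if missing(`var')   // fill in the missing values
    drop mean`var'                                // remove the helper variable
}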

After imputing the missing values, we examine the descriptive statistics and correlations.

sum female write read math progcat1 progcat2, sep(6)
corr female write read math progcat1 progcat2
. sum female write read math progcat1 progcat2, sep(6)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      female |        200    .5549451    .4752706          0          1
       write |        200    52.95082    8.853514         31         67
        read |        200    52.28796     9.97715         28         76
        math |        200     52.8973     9.00113         33         75
    progcat1 |        200    .2252747    .3995207          0          1
    progcat2 |        200     .521978    .4777044          0          1

. corr female write read math progcat1 progcat2
(obs=200)

             |   female    write     read     math progcat1 progcat2
-------------+------------------------------------------------------
      female |   1.0000
       write |   0.2286   1.0000
        read |  -0.0151   0.5480   1.0000
        math |  -0.0208   0.5491   0.6159   1.0000
    progcat1 |  -0.0285  -0.0538  -0.0988  -0.1478   1.0000
    progcat2 |   0.0453   0.3128   0.3611   0.4066  -0.5635   1.0000
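
The drop in the standard deviations follows directly from how mean imputation works. The imputed values sit exactly at the observed mean, so they leave the mean and the sum of squared deviations unchanged while the denominator of the variance grows from \(n_{\text{obs}} - 1\) to \(n - 1\). As a worked check for write (183 observed values out of 200), using the two summary tables above:

\[ s_{\text{imputed}} = s_{\text{observed}} \sqrt{\frac{n_{\text{obs}} - 1}{n - 1}} = 9.257773 \times \sqrt{\frac{182}{199}} \approx 8.8535 \]

This agrees with the standard deviation of 8.853514 reported for write after mean imputation.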
    

Single or deterministic imputation

Single imputation replaces each missing value with one estimated value. A common form is regression imputation, where missing values are replaced by predicted scores from a regression model.

This uses more information than mean imputation because the imputed value is conditional on other variables. However, the predicted values fall directly on the regression line, so the method does not preserve the natural variability in the data. See Figure 2.

As a result, deterministic imputation can:

  • underestimate variability
  • inflate associations among variables
  • bias correlations and \(R^2\) upward
  • treat imputed values as if they were observed with certainty

Warning

Because uncertainty in the imputed values is ignored, standard errors and statistical tests can be misleading.

Scatterplot illustrating deterministic regression imputation. Observed cases show a positive relationship between IQ and job performance. Several imputed values fall exactly on the fitted regression line, showing that deterministic imputation adds predicted values without residual variation.
Figure 2: Deterministic regression imputation places imputed values directly on the regression line. Source: adapted from Enders (2010), p. 46.
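
As a minimal sketch of deterministic regression imputation in Stata, the code below fills in missing math scores with predicted values. Using socst as the single predictor and the names math_hat and math_det are assumptions made for illustration; this is not a recommended imputation model.

```stata
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

* socst has no missing values, so a prediction exists for every case
regress math socst
predict double math_hat, xb

* math_det (hypothetical name): observed math where available, prediction otherwise
generate double math_det = cond(missing(math), math_hat, math)

* The filled-in values lie exactly on the fitted line
summarize math math_det
```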

Stochastic imputation

Stochastic imputation modifies regression imputation by adding random error to the predicted values.

Instead of imputing only the predicted score,

\[ \hat{Y} \]

we impute:

\[ \hat{Y} + e \]

where \(e\) is randomly drawn from a distribution with mean 0 and variance equal to the residual variance from the regression model.

This restores some of the lost variability, but it still uses only one imputed value per missing observation.

Stochastic imputation can improve on deterministic regression imputation, but standard errors may still be too small because uncertainty about the imputation process is not fully incorporated.

Traditional methods such as complete-case analysis, mean imputation, deterministic imputation, and stochastic imputation are easy to apply, but they can distort estimates or uncertainty.

A key limitation is that single-imputation methods fill in one value and then treat it as if it were observed.

Multiple imputation addresses this by creating several plausible completed data sets and combining results across them.

Scatterplot illustrating stochastic regression imputation. Observed cases show a positive relationship between IQ and job performance. Imputed values are placed around the regression line rather than exactly on it, showing that stochastic imputation adds random residual variation to predicted values.
Figure 3: Stochastic regression imputation adds random residual variation to the predicted values, so imputed values no longer fall exactly on the regression line. Source: adapted from Enders (2010), p. 48.
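
A stochastic version can be sketched by adding a random normal draw to each predicted value. The predictor socst, the seed, and the names math_hat and math_sto are illustrative assumptions; the residual standard deviation is taken from e(rmse), which regress stores after estimation.

```stata
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
set seed 2024

regress math socst
predict double math_hat, xb

* math_sto (hypothetical name): prediction plus a random residual where math is missing
generate double math_sto = math
replace math_sto = math_hat + rnormal(0, e(rmse)) if missing(math)

summarize math math_sto
```

This still produces a single filled-in value per missing score, so the uncertainty about the regression coefficients themselves is not reflected.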

Multiple imputation

Multiple imputation is an extension of stochastic imputation.

Instead of filling in one value for each missing observation, multiple imputation creates several plausible completed data sets.

Each imputed value includes a random component, reflecting uncertainty about the missing value.

Multiple imputation has three basic phases:

  1. Imputation phase
    Missing values are filled in with estimated values. This process is repeated \(m\) times, creating \(m\) completed data sets.

  2. Analysis phase
    Each completed data set is analyzed using the statistical model of interest.

  3. Pooling phase
    The estimates and standard errors from the \(m\) analyses are combined for inference.

Note

The goal is not to recover the one “true” missing value. The goal is to preserve the relationships among variables and account for uncertainty due to missing data.
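
The three phases can be sketched with Stata's mi commands. This is only an outline under stated assumptions: the use of chained equations, the 10 imputations, the seed, and socst as the only fully observed predictor are illustrative choices, and a carefully built imputation model may differ.

```stata
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
mi set mlong

* Declare the variables whose missing values will be imputed
mi register imputed female write read math prog

* Imputation phase: create 10 completed data sets (illustrative settings)
mi impute chained (logit) female (mlogit) prog (regress) write read math = socst, add(10) rseed(53421)

* Analysis and pooling phases: fit the model in each data set and combine
mi estimate: regress read write i.female math ib3.prog
```

The mi estimate prefix runs the analysis model once per completed data set and reports the pooled coefficients and standard errors.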

Imputation and analytic models

The imputation model should be compatible, or congenial, with the planned analytic model.

At a minimum, the imputation model should include all variables used in the final analysis model.

It should also include important features of the analytic model, such as:

  • transformations
  • interactions
  • nonlinear terms
  • recoded or categorized variables

The reason is that multiple imputation aims to preserve the relationships among variables.

Note

A compatible, or congenial, imputation model helps preserve the variance-covariance structure needed for valid inference.

Further reading

For more on imputation-model compatibility, see:

  • von Hippel, 2009
  • von Hippel, 2013
  • White et al., 2010

Multiple imputation workflow

Workflow diagram of multiple imputation. An incomplete data set is used to create multiple completed data sets. Each completed data set is analyzed separately using the model of interest. The separate results are then pooled to obtain combined estimates, standard errors, confidence intervals, and p-values.
Figure 4: Multiple imputation workflow: impute missing values, analyze each completed data set, and pool the results for inference.
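
The pooling step is usually carried out with Rubin's rules. The notation below is introduced here for illustration: \(\hat{Q}_i\) is the estimate of a parameter from completed data set \(i\), and \(U_i\) is its squared standard error.

\[ \bar{Q} = \frac{1}{m}\sum_{i=1}^{m} \hat{Q}_i, \qquad \bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i, \qquad B = \frac{1}{m-1}\sum_{i=1}^{m} \left(\hat{Q}_i - \bar{Q}\right)^2 \]

\[ T = \bar{U} + \left(1 + \frac{1}{m}\right) B \]

Here \(\bar{Q}\) is the pooled estimate, \(\bar{U}\) is the within-imputation variance, \(B\) is the between-imputation variance, and \(\sqrt{T}\) is the pooled standard error. The between-imputation term is what single-imputation methods leave out.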

Preparing to conduct MI

The first step is to examine the number and proportion of missing values among the variables of interest.

The user-written Stata command mdesc can be used to summarize missingness.

use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

mdesc female write read math prog
. mdesc female write read math prog 

    Variable    |     Missing          Total     Percent Missing
----------------+-----------------------------------------------
         female |          18            200           9.00
          write |          17            200           8.50
           read |           9            200           4.50
           math |          15            200           7.50
           prog |          18            200           9.00
----------------+-----------------------------------------------

If mdesc is not installed, search for it and install it:

search mdesc
ssc install mdesc

The mdesc output reports the number and percentage of missing observations for each variable.

In this example, the variables with the highest proportion of missing information are:

  • prog: 9.0% missing
  • female: 9.0% missing

Variables with higher missingness can have a greater impact on the stability and convergence of the imputation model.

MI data storage styles

Stata stores multiply imputed data using an MI style. The style determines how the original data and imputed data sets are arranged in memory.

Figure 5 illustrates three common MI styles:

  • mlong: stores the original data in long form and, for each imputation, adds only the observations that have missing values
  • flong: stores the original data and all imputed data sets in long form
  • wide: stores imputations as additional variables in wide form

Diagram comparing Stata multiple imputation storage styles. The mlong and flong styles stack the original data and imputed data sets vertically, with rows labeled by imputation number m equals 0, 1, and 2. The wide style stores the original data and each imputation side by side in separate blocks labeled m equals 0, m equals 1, and m equals 2. Red cells indicate missing values in the original data, and green cells indicate imputed values.
Figure 5: Illustration of Stata MI storage styles: mlong, flong, and wide. Red cells represent missing values in the original data, and green cells represent imputed values.

Declaring the data as MI data

For this seminar, we use the mlong style.

This is done with mi set.

use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

mi set mlong

The mlong style tells Stata how the multiply imputed data will be stored after imputation.

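The MI storage style can be changed later with mi convert followed by the name of the new style. The clear option allows the conversion when the data in memory have unsaved changes. For example:

mi convert wide, clear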

For more information about available MI storage styles, use:

help mi styles
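
As a short sketch, the current style can be checked with mi query and mi describe before and after a conversion. Converting to wide here is only an example; the rest of the seminar continues with mlong.

```stata
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
mi set mlong
mi query                  // reports the style and the number of imputations

* Switch styles; clear allows the conversion with unsaved changes in memory
mi convert wide, clear
mi query
mi describe               // fuller report on the MI setup
```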

Data structure after mi set mlong

After running mi set mlong, Stata adds variables used to track the imputed data structure:

  • _mi_miss: marks observations in the original data that have missing values
  • _mi_m: indicates the imputation number; _mi_m = 0 is the original data
  • _mi_id: identifies observations and links them across imputed data sets

Screenshot of the Stata Data Editor after running mi set mlong. The data set includes the original variables, such as write, math, science, socst, and program-category indicators, followed by Stata-created MI system variables named _mi_miss, _mi_m, and _mi_id. The _mi_m variable is 0 for the original data, and _mi_id identifies observations.
Figure 6: Stata adds MI system variables after mi set mlong.

Note

These variables are created by Stata to manage the multiple-imputation data structure. You usually do not edit them directly.
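
These variables can be inspected without editing them. The sketch below assumes the five analysis variables are registered as imputed, which is the step that lets Stata mark incomplete observations; the choice of variables to register is an assumption made for illustration.

```stata
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
mi set mlong

* Registering the variables to be imputed updates the incomplete-case marker
mi register imputed female write read math prog

describe _mi_miss _mi_m _mi_id
tabulate _mi_miss         // 1 = incomplete on at least one registered variable
```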

Examining missing-data patterns

After declaring the data as MI data, Stata’s mi misstable commands can be used to examine missingness.

mi misstable summarize female write read math prog
mi misstable patterns female write read math prog
. mi misstable summarize female write read math prog
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
        female |        18                 182  |      2          0           1
         write |        17                 183  |     29         31          67
          read |         9                 191  |     30         28          76
          math |        15                 185  |     39         33          75
          prog |        18                 182  |      3          1           3
  -----------------------------------------------------------------------------

. mi misstable patterns female write read math prog

      Missing-value patterns
        (1 means complete)

              |   Pattern
    Percent   |  1  2  3  4    5
  ------------+------------------
       65%    |  1  1  1  1    1
              |
        8     |  1  1  1  0    1
        8     |  1  1  1  1    0
        6     |  1  1  0  1    1
        6     |  1  0  1  1    1
        4     |  0  1  1  1    1
        1     |  1  0  0  1    1
       <1     |  1  0  1  0    1
       <1     |  1  0  1  1    0
       <1     |  1  1  0  0    1
       <1     |  1  1  0  1    0
       <1     |  1  1  1  0    0
  ------------+------------------
      100%    |

  Variables are  (1) read  (2) math  (3) write  (4) female  (5) prog

  • summarize reports the amount of missing information for each variable.
  • patterns shows groups of observations with the same missing-data pattern.

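Stata treats system and extended missing values as larger than any nonmissing number. This is why the summarize output above labels system missing values Obs=., extended missing values Obs>., and nonmissing values Obs<.:

\[ \text{all nonmissing numbers} < . < .a < .b < \cdots < .z \]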
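
This ordering matters in everyday if conditions, because a comparison such as math > 70 is also true for missing values. A brief sketch, where the cutoff of 70 is an arbitrary illustration:

```stata
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

count if math > 70                     // also counts cases where math is missing
count if math > 70 & !missing(math)    // nonmissing scores above 70 only
count if math >= .                     // all missing values, system or extended
```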

Identifying auxiliary variables

Auxiliary variables are variables that are not part of the main analytic model but may improve the imputation model.

Good auxiliary variables are often:

  • correlated with variables that have missing values
  • associated with the probability that values are missing
  • theoretically important based on subject-matter knowledge

Auxiliary variables can help make the MAR assumption more plausible and may improve the quality of the imputed values.

Note

A common guideline is to consider variables correlated at about \(r > .40\) with variables being imputed, but this is not a strict rule.

Finding auxiliary variables in Stata

One way to identify potential auxiliary variables is to examine correlations between the analysis variables and other variables in the data set.

tab prog, gen(progcat)

pwcorr female write read math progcat1 progcat2 socst science, obs
             |   female    write     read     math progcat1 progcat2    socst  science
-------------+------------------------------------------------------------------------
      female |   1.0000 
             |      182
             |
       write |   0.2508   1.0000 
             |      166      183
             |
        read |  -0.0174   0.5872   1.0000 
             |      173      174      191
             |
        math |  -0.0241   0.6182   0.6589   1.0000 
             |      168      170      176      185
             |
    progcat1 |  -0.0317  -0.0604  -0.1058  -0.1651   1.0000 
             |      165      166      173      168      182
             |
    progcat2 |   0.0500   0.3439   0.3902   0.4457  -0.5635   1.0000 
             |      165      166      173      168      182      182
             |
       socst |   0.0889   0.5975   0.6160   0.5451  -0.0768   0.4096   1.0000 
             |      182      183      191      185      182      182      200
             |
     science |  -0.0918   0.5498   0.6329   0.6296   0.0567   0.2038   0.4512   1.0000 
             |      166      168      176      169      167      167      184      184
             |

In this example, science and socst may be useful auxiliary variables because they are correlated with several test-score variables.

Good auxiliary variables do not need to be correlated with every variable, and they do not need to be fully observed.
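
The screening described above can also be done in a loop rather than by reading the full correlation matrix. The following is a minimal sketch, assuming the seminar data set at the URL used later in this seminar and treating socst and science as the candidate auxiliary variables. The 0.40 cutoff is only the guideline mentioned above, not a fixed rule.

```stata
* Sketch: list candidate auxiliary variables whose pairwise correlation
* with an analysis variable exceeds the |r| > .40 guideline.
* Assumption: this is the same data set used for the correlation table above.
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

foreach aux of varlist socst science {
    foreach v of varlist write read math {
        * correlate with two variables uses the cases observed on both
        quietly correlate `aux' `v'
        if abs(r(rho)) > 0.40 {
            display "`aux' with `v': r = " %6.4f r(rho) "  (n = " r(N) ")"
        }
    }
}
```

Given the correlation table above, all six pairs would be expected to pass this cutoff.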

Auxiliary variables as predictors of missingness

Auxiliary variables can also be useful if they help predict whether a variable is missing.

First, create indicators that equal 1 when a variable is observed and 0 when it is missing.

gen female_flag = !missing(female)
gen write_flag  = !missing(write)
gen read_flag   = !missing(read)
gen math_flag   = !missing(math)
gen prog_flag   = !missing(prog)

Then compare auxiliary-variable means between observed and missing groups.

foreach var of varlist female_flag-prog_flag {
    display "`var'"
    ttest socst, by(`var')
}

foreach var of varlist female_flag-prog_flag {
    display "`var'"
    ttest science, by(`var')
}

Selected output (socst by math_flag):

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       0 |      15    45.33333    3.080919    11.93235    38.72542    51.94125
       1 |     185    52.97838    .7690379    10.46005    51.46111    54.49564
---------+--------------------------------------------------------------------
Combined |     200      52.405    .7591352    10.73579    50.90802    53.90198
---------+--------------------------------------------------------------------
    diff |           -7.645045    2.837886               -13.24141   -2.048684
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -2.6939
H0: diff = 0                                     Degrees of freedom =      198

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0038         Pr(|T| > |t|) = 0.0077          Pr(T > t) = 0.9962

If an auxiliary variable differs between missing and observed cases, it may help explain missingness and support the MAR assumption.

The only significant difference was found when examining missingness on math with socst. In the output above, the mean socst score is significantly lower among respondents who are missing on math (45.3) than among those with math observed (53.0), with a two-sided p-value of 0.0077. This suggests that socst is a potential correlate of missingness (Enders, 2010), so including it in the imputation model may help us satisfy the MAR assumption for multiple imputation.
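
An alternative to running separate t tests is to model missingness directly. The following is a minimal sketch, assuming the same seminar data set. The variable name math_miss is hypothetical and is coded in the opposite direction from math_flag (1 = missing).

```stata
* Sketch: logistic regression of missingness on candidate auxiliary variables.
* Assumption: same data set as in the t tests above.
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

* math_miss is a hypothetical name: 1 = math is missing, 0 = math is observed
generate math_miss = missing(math)

* socst is fully observed; science has missing values of its own,
* so cases missing science are dropped from this model
logit math_miss socst science
```

A coefficient that is clearly different from zero suggests that the variable helps predict missingness on math, which is the same idea as the t tests above.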

MI using multivariate normal distribution

One imputation option in Stata is multivariate normal imputation.

This approach assumes that the variables in the imputation model follow a joint multivariate normal distribution.

The algorithm uses data augmentation, a Markov chain Monte Carlo method, to draw missing values from their conditional distribution given the observed data.

Because MVN imputation draws values from a multivariate normal distribution, imputed values can be decimal or negative.

This is not necessarily a problem for estimation, but categorical variables need special handling.

For nominal categorical variables, we create indicator variables before imputation.

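For example, instead of imputing the original categorical variable prog, we create the indicators progcat1, progcat2, and progcat3 and include two of them (progcat1 and progcat2) in the imputation model. The third category is implied by the other two.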

Note

MVN imputation can often perform reasonably well even when normality is not perfect, especially with sufficient sample size. Problems are more likely when the sample size is small and the fraction of missing information is high.

Imputation phase

After the data are declared as MI data, Stata needs to know which variables will be imputed.

This is done with the mi register imputed command.

mi register imputed female write read math progcat1 progcat2 science

Then we specify the imputation model and the number of imputed data sets using mi impute mvn.

mi impute mvn female write read math progcat1 progcat2 science = socst, ///
    add(10) rseed(53421)

In this command:

  • mi impute mvn requests multivariate normal imputation.
  • Variables before = have missing values and are imputed.
  • Variables after = are predictors only.
  • add(10) creates 10 imputed data sets.
  • rseed(53421) makes the random imputation process reproducible.

. mi impute mvn female write read math progcat1 progcat2 science = socst, add(10) rseed (53421)

Performing EM optimization:
  observed log likelihood = -1601.2096 at iteration 12

Performing MCMC data augmentation ... 

Multivariate imputation                     Imputations =       10
Multivariate normal regression                    added =       10
Imputed: m=1 through m=10                       updated =        0

Prior: uniform                               Iterations =     1000
                                                burn-in =      100
                                                between =      100

------------------------------------------------------------------
                   |               Observations per m             
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
            female |        182           18        18 |       200
             write |        183           17        17 |       200
              read |        191            9         9 |       200
              math |        185           15        15 |       200
          progcat1 |        182           18        18 |       200
          progcat2 |        182           18        18 |       200
           science |        184           16        16 |       200
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
 of the number of filled-in observations.)

Note

Even though science is an auxiliary variable, it has missing values, so it must be included on the left side of the equals sign and imputed.

Analysis and pooling phase

After creating the imputed data sets, we use mi estimate to run the analysis model.

In this example, the analytic model is a linear regression predicting read.

mi estimate: regress read write female math progcat1 progcat2

The mi estimate: prefix tells Stata to:

  1. run the regression model in each imputed data set
  2. combine the estimates across imputations
  3. report one set of coefficients, standard errors, confidence intervals, and p-values
. mi estimate: regress read write female math progcat1 progcat2

Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =        200
                                                Average RVI       =     0.1503
                                                Largest FMI       =     0.2468
                                                Complete DF       =        194
DF adjustment:   Small sample                   DF:     min       =      77.11
                                                        avg       =     114.71
                                                        max       =     173.44
Model F test:       Equal FMI                   F(   5,  174.4)   =      35.62
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
        read | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       write |   .3893681   .0817014     4.77   0.000     .2278278    .5509084
      female |  -2.747417   1.143906    -2.40   0.017    -5.005185   -.4896491
        math |   .4019581    .086767     4.63   0.000      .229499    .5744173
    progcat1 |   .5163472    1.68493     0.31   0.760    -2.827101    3.859795
    progcat2 |   2.812402   1.602017     1.76   0.083    -.3775466    6.002351
       _cons |   10.35629    3.68667     2.81   0.006     3.052563    17.66001
------------------------------------------------------------------------------

Comparing regression results

                  Full data               Complete case           MVN imputation
Parameter             β      SE        p      β      SE        p      β      SE        p
----------------------------------------------------------------------------------------
Intercept          9.62   3.410   0.0053  13.03   4.124    0.002  10.36   3.687    0.006
Write              0.37   0.075   <.0001   0.44   0.093   <.0001   0.39   0.082   <.0001
Female            -2.70   1.095   0.0146  -2.71   1.365   0.0496  -2.75   1.144    0.017
Math               0.44   0.075   <.0001   0.32   0.095    0.001   0.40   0.087   <.0001
PROG academic      1.88   1.423   0.1882   1.81   1.655   0.2759   2.81   1.602    0.083
PROG general       0.23   1.512   0.8782   0.52   1.881   0.7836   0.52   1.685    0.760

The MVN imputation estimates are generally closer to the complete-data demonstration results than the complete-case estimates.

The standard errors from multiple imputation are often slightly larger than those from the full-data analysis because MI incorporates additional uncertainty from the missing values.

Imputation diagnostics

After running mi estimate, Stata reports several diagnostic quantities in the output.

These help describe how much uncertainty is due to missing data and how the pooled standard errors are calculated.

To request more detailed diagnostic tables, use:

mi estimate, vartable dftable

The vartable and dftable options report variance components, relative increase in variance, fraction of missing information, relative efficiency, and degrees of freedom.

. mi estimate, vartable dftable

Multiple-imputation estimates                   Imputations       =         10
Linear regression

Variance information
------------------------------------------------------------------------------
             |        Imputation variance                             Relative
             |    Within   Between     Total       RVI       FMI    efficiency
-------------+----------------------------------------------------------------
       write |   .005939   .000669   .006675   .123977   .113855       .988743
      female |   1.24261   .059921   1.30852   .053044   .051507       .994876
        math |   .005947   .001438   .007529   .265958   .219719         .9785
    progcat1 |   2.31652   .474974   2.83899   .225541   .191897       .981172
    progcat2 |    1.9623   .549235   2.56646   .307883   .246847        .97591
       _cons |   11.4877   1.91258   13.5915   .183139   .160802       .984174
------------------------------------------------------------------------------


Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =        200
                                                Average RVI       =     0.1503
                                                Largest FMI       =     0.2468
                                                Complete DF       =        194
DF adjustment:   Small sample                   DF:     min       =      77.11
                                                        avg       =     114.71
                                                        max       =     173.44
Model F test:       Equal FMI                   F(   5,  174.4)   =      35.62
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
             |                                                      % increase
        read | Coefficient  Std. err.      t    P>|t|           df   std. err.
-------------+----------------------------------------------------------------
       write |   .3893681   .0817014     4.77   0.000        138.8        6.02
      female |  -2.747417   1.143906    -2.40   0.017        173.4        2.62
        math |   .4019581    .086767     4.63   0.000         87.0       12.51
    progcat1 |   .5163472    1.68493     0.31   0.760         98.6       10.70
    progcat2 |   2.812402   1.602017     1.76   0.083         77.1       14.36
       _cons |   10.35629    3.68667     2.81   0.006        113.3        8.77
------------------------------------------------------------------------------

Variance components in MI

Multiple imputation combines two main sources of variation:

  • Within-imputation variance
    The average sampling variance across the imputed data sets.

For example, if you squared the standard errors for write from all 10 imputations, summed them, and divided by 10, the result would be \(V_W = 0.0059\).

This estimates the sampling variability that we would have expected had there been no missing data.

  • Between-imputation variance
    The variability in parameter estimates across the imputed data sets.

For example, if you took all 10 parameter estimates for write and calculated their variance, the result would be \(V_B = 0.00067\).

This variability estimates the additional variation (uncertainty) that results from missing data.

  • Total variance

The total variance combines both sources, plus a correction for using a finite number of imputations.

\[ V_T = V_W + V_B + \frac{V_B}{m} \]

For example, using the unrounded values from the variance table, the total variance for write is \(V_W + V_B + V_B/m = 0.005939 + 0.000669 + 0.000669/10 = 0.006675\).
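
The same arithmetic can be reproduced with Stata scalars. This is a minimal sketch that copies the within and between variances for write from the vartable output. The scalar names are chosen here for illustration only.

```stata
* Sketch: combine within- and between-imputation variance for write by hand.
* Values are copied from the "Variance information" table (m = 10 imputations).
scalar m   = 10
scalar V_W = .005939      // within-imputation variance
scalar V_B = .000669      // between-imputation variance

* Total variance, including the correction for a finite number of imputations
scalar V_T = V_W + V_B + V_B/m

* The pooled standard error is the square root of the total variance
scalar se = sqrt(V_T)

display "Total variance: " %8.6f V_T
display "Pooled SE:      " %8.6f se
```

The pooled standard error computed this way is about 0.0817, which agrees with the standard error reported for write by mi estimate.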

Relative increase in variance

The relative increase in variance, often called RVI or RIV, measures how much larger the sampling variance is because of missing data.

\[ RVI = \frac{V_B + V_B/m}{V_W} \]

A higher RVI means more uncertainty is being added by the missing data.

Variables with more missing information, or variables weakly predicted by the imputation model, tend to have higher RVI values.
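
As a check, plugging the variance-table values for write into this formula gives:

\[ RVI_{write} = \frac{0.000669 + 0.000669/10}{0.005939} = \frac{0.000736}{0.005939} \approx 0.124 \]

This agrees, up to rounding, with the RVI of .123977 reported for write.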

Fraction of missing information

The fraction of missing information, or FMI, is the proportion of total sampling variance attributable to missing data.

\[ FMI = \frac{V_B + V_B/m}{V_T} \]

An FMI of 0.20 means that about 20% of the total sampling variance is due to missing data.
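
As a rough check, the same variance-table values for write give:

\[ FMI_{write} \approx \frac{0.000669 + 0.000669/10}{0.006675} = \frac{0.000736}{0.006675} \approx 0.110 \]

This is close to, but slightly smaller than, the FMI of .113855 reported in the variance table. The difference is consistent with Stata applying an additional degrees-of-freedom adjustment when it computes FMI, so the simple ratio above is best read as an approximation to the reported value.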

Note

A practical rule of thumb is to use at least as many imputations as the highest FMI percentage. For example, if the largest FMI is 25%, consider using at least 25 imputations.
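
Applied to this example, the largest FMI is about 25%, so the rule of thumb suggests roughly 25 imputations. The following is a minimal sketch of one way to get there, assuming the 10 MVN imputations created earlier are still in memory. The seed value is arbitrary and chosen here only for illustration.

```stata
* Sketch: add 15 imputations to the existing 10, for 25 in total.
* Assumption: the mi data with m = 10 MVN imputations are in memory.
mi impute mvn female write read math progcat1 progcat2 science = socst, ///
    add(15) rseed(98765)

* Refit the analysis model and check whether RVI and FMI have stabilized
mi estimate, vartable: regress read write female math progcat1 progcat2
```

Comparing the variance table before and after adding imputations gives a sense of how stable the FMI and standard error estimates are.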

Relative efficiency

Relative efficiency compares using \(m\) imputations with using an infinite number of imputations.

\[ RE = \frac{1}{1 + FMI/m} \]

When the amount of missing information is low, a small number of imputations may give high relative efficiency.

However, more imputations may still be needed to estimate standard errors well.

Note

Good relative efficiency does not necessarily mean the variance estimates are stable.
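
For example, progcat2 has the largest FMI in the variance table (.246847). With \(m = 10\):

\[ RE = \frac{1}{1 + 0.246847/10} \approx 0.976 \]

which matches the relative efficiency of .97591 reported for progcat2. With \(m = 25\) and the same FMI, the formula gives:

\[ RE = \frac{1}{1 + 0.246847/25} \approx 0.990 \]

The gain in relative efficiency is small, which illustrates why relative efficiency alone is not a good guide for choosing the number of imputations.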

Degrees of freedom

In MI analysis, degrees of freedom are not determined only by sample size.

They also depend on:

  • the number of imputations
  • the amount of missing information
  • the variability between imputations

Stata uses a small-sample correction (Barnard and Rubin, 1999) by default to avoid inflated degrees of freedom.

Note

Fractional degrees of freedom are normal in multiple-imputation output.
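
To see where a fractional value comes from, here is a sketch of the calculation for write, assuming the standard form of the Barnard and Rubin (1999) adjustment. The symbols \(\nu_{large}\), \(\nu_{obs}\), and \(\nu_{com}\) are notation introduced here. \(\nu_{com} = 194\) is the "Complete DF" shown in the output, and \(RVI = 0.123977\) comes from the variance table.

\[ \nu_{large} = (m - 1)\left(1 + \frac{1}{RVI}\right)^2 = 9\left(1 + \frac{1}{0.123977}\right)^2 \approx 739.7 \]

\[ \nu_{obs} = \frac{\nu_{com}(\nu_{com} + 1)}{\nu_{com} + 3} \cdot \frac{1}{1 + RVI} = \frac{194 \times 195}{197} \cdot \frac{1}{1.123977} \approx 170.9 \]

\[ \nu = \left(\frac{1}{\nu_{large}} + \frac{1}{\nu_{obs}}\right)^{-1} \approx 138.8 \]

The result matches the df of 138.8 reported for write in the dftable output.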

Additional diagnostic checks

After imputation, it is useful to compare observed and imputed values.

Possible checks include:

  • means
  • frequencies
  • box plots
  • distributions of observed versus imputed values
  • residuals and outliers within imputed data sets

If unusual values appear in only a few imputations, this may indicate a problem with the imputation model.
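
One simple way to run such a comparison is with mi xeq, which executes a command on selected imputations. This is a minimal sketch, assuming the MVN-imputed data from the imputation phase are in memory. The choice of imputations 1 and 10 is arbitrary.

```stata
* Sketch: compare summary statistics in the original data (m = 0)
* with two of the imputed data sets (m = 1 and m = 10).
* Assumption: the mi data with 10 MVN imputations are in memory.
mi xeq 0 1 10: summarize female progcat1 progcat2 write read math science
```

Because MVN imputation draws from a normal distribution, the minimum and maximum of female, progcat1, and progcat2 in the imputed data sets may fall outside the 0 to 1 range.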

Checking convergence

For MVN imputation, convergence means that the data augmentation algorithm has reached a stationary distribution.

Convergence should be checked for each imputed variable, especially those with a high proportion of missing values (e.g., high FMI).

Convergence is often assessed visually using trace plots.

Trace plots show estimated parameters across iterations. The parameter estimates needed for these plots can be saved by combining the saveptrace() and mcmconly options.

The mcmconly option runs the MCMC algorithm for the length of the burn-in period without producing any imputed data sets. Here burnin(1000) matches the 1,000 iterations needed to obtain 10 imputations. It is typically used in combination with saveptrace() or savewlf() to examine the convergence of the MCMC prior to imputation. Because no imputation is performed when mcmconly is specified, the add() and replace options are not required with mi impute mvn.

mi impute mvn write read female math science progcat1 progcat2 = socst, ///
    mcmconly burnin(1000) rseed(53421) saveptrace(trace, replace)
Performing EM optimization:
  observed log likelihood = -1601.2096 at iteration 12

Performing MCMC data augmentation ... 

Note: No imputation performed.

Long-term trends in trace plots or high autocorrelation may indicate slow convergence.

Trace files for convergence diagnostics

The trace file saved by saveptrace() is not a regular Stata data set, but Stata can read it using mi ptrace.

You can describe the trace file without opening it:

mi ptrace describe trace
. mi ptrace describe trace 

  file trace.stptrace created on 1 Jun 2026 04:29 contains 1,000 records (obs) on
      m                      1 variable
      iter                   1 variable
      b[y, x]               14 variables (7 x 2)
      v[y, y]               28 variables (7 x 7, symmetric)

  where y and x are
      y: (1) write  (2) read  (3) female  (4) math  (5) science  (6) progcat1  (7) progcat2 
      x: (1) socst  (2) _cons 

You can also load it into memory:

mi ptrace use trace, clear

The trace file contains information such as:

  • imputation number
  • iteration number
  • regression coefficients
  • variances and covariances

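If you have a lot of parameters in your model, it may not be feasible to examine the convergence of each individual parameter. In that case you can use the savewlf() option. WLF stands for worst linear function: a single linear combination of the model parameters that is expected to converge most slowly, which is associated with the largest fraction of missing information. Checking the trace and autocorrelation of this one series serves as a summary convergence check.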
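
A minimal sketch of this check is shown below. It assumes the mi data set with progcat1 and progcat2 is in memory and that the saved file is an ordinary Stata data set containing variables named iter and wlf. The file name wlf is arbitrary.

```stata
* Sketch: save and plot the worst linear function (WLF).
* Assumption: the mi data with progcat1 and progcat2 are in memory.
mi impute mvn write read female math science progcat1 progcat2 = socst, ///
    mcmconly burnin(1000) rseed(53421) savewlf(wlf, replace)

* Load the saved series without losing the mi data in memory
preserve
use wlf, clear
tsset iter

* Trace plot and autocorrelation plot for the single WLF series
tsline wlf, ytitle("Worst linear function") xtitle("Iteration")
ac wlf, lags(100) ytitle("WLF autocorrelation")
restore
```

These plots are read in the same way as the trace and autocorrelation plots for individual parameters.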

Trace plots for imputed variables

As an example, we examine diagnostics for female, one of the variables with the fewest complete observations.

After loading the trace file, we declare the iteration number as the time variable.

tsset iter

Then we graph the coefficient and variance series for female.

Trace plots are used to check whether the MCMC algorithm appears stable across iterations.

tsline b_y3x1, name(gr1, replace) ///
    ytitle("Coefficient on Socst" "used to predict Female") ///
    xtitle("iter")

tsline v_y3y3, name(gr2, replace) ///
    ytitle("Female Variance") ///
    xtitle("iter")

graph combine gr1 gr2, xcommon cols(1) b1title(Iteration)

Figure 7: Trace plots for female. The coefficient and variance series fluctuate around stable levels across iterations, suggesting that the imputation algorithm has reached a stationary distribution.

Autocorrelation diagnostics

Another useful convergence diagnostic is the autocorrelation plot.

Autocorrelation measures the correlation between values from different MCMC iterations.

Because the imputation process is intended to produce sufficiently independent draws, we do not want strong correlation between values across iterations.

After loading the trace file, we can use Stata’s ac command to examine autocorrelation.

ac b_y3x1, ///
    ytitle("Coefficient on Socst" "used to predict Female") ///
    xtitle("") ciopts(astyle(none)) note("") ///
    name(ac1, replace) lags(100)

ac v_y3y3, ///
    ytitle("Female Variance") ///
    xtitle("") ciopts(astyle(none)) note("") ///
    name(ac2, replace) lags(100)

graph combine ac1 ac2, ///
    xcommon cols(1) title(Autocorrelations) b1title(Lag)

Figure 8: Autocorrelation plots for female. The autocorrelations drop close to zero after the first few lags, suggesting that the MCMC draws are not strongly dependent across iterations.

Interpreting autocorrelation plots

In an autocorrelation plot:

  • the x-axis shows the lag
  • the y-axis shows the correlation between values separated by that lag

Ideally, autocorrelation should drop close to zero after a small number of lags.

If autocorrelation remains high for many lags, imputed data sets drawn only a few iterations apart may be too similar to each other.

The time it takes for autocorrelation to approach zero gives information about convergence.

If autocorrelation drops quickly, this suggests that the chain is mixing well and that successive draws are not strongly dependent.

If autocorrelation declines slowly, the algorithm may need more iterations before drawing imputations.

mi impute mvn ..., burnbetween(200)

The burnbetween() option increases the number of iterations between imputed data sets.

Warning

If autocorrelation remains high, consider increasing the number of iterations between imputations using the burnbetween() option.

MI using chained equations

A second method available in Stata is multiple imputation by chained equations, or MICE.

MICE is also known as:

  • fully conditional specification
  • sequential generalized regression

Unlike MVN imputation, MICE does not require assuming that all variables jointly follow a multivariate normal distribution.

Instead, MICE uses a separate conditional model for each variable being imputed.

Why use MICE?

MICE is useful when variables have different measurement scales or distributions.

For example, we may need different imputation models for:

  • binary variables
  • ordinal variables
  • nominal categorical variables
  • continuous variables
  • count variables

This flexibility is useful when imputed variables must take on specific types of values, such as 0/1 values for binary variables.

Stata’s chained-equations approach supports different models for different variable types.

Examples include:

  • logistic regression for binary variables
  • ordered logistic regression for ordinal variables
  • multinomial logistic regression for nominal variables
  • linear regression or predictive mean matching for continuous variables
  • Poisson or negative binomial regression for count variables

If no model is specified, Stata uses linear regression by default.

Predictive mean matching

Predictive mean matching, or PMM, is often used for continuous variables.

PMM imputes values by selecting observed values from cases with similar predicted means.

This helps keep imputed values within the range of observed data.

Warning

When using PMM in Stata, pay attention to the number of nearest-neighbor matches. Using too few matches can lead to underestimated standard errors.
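
In Stata, the number of nearest neighbors is set with the knn() option of the pmm method. The following is a minimal sketch of a chained-equations model that uses PMM for the continuous variables. The value knn(5) is only an illustrative choice, not a recommendation from this seminar, and the rest of the setup mirrors the MICE example that follows.

```stata
* Sketch: predictive mean matching for continuous variables.
use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear
mi set mlong
mi register imputed female write read math prog science

* knn(5) is an illustrative assumption: each missing value is drawn from
* the 5 observed cases with the closest predicted means
mi impute chained ///
    (logit) female ///
    (mlogit) prog ///
    (pmm, knn(5)) write read math science = socst, ///
    add(10) rseed(53421)
```

Because PMM draws from observed values, the imputed test scores stay within the observed range of each variable.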

MICE imputation phase

Before using chained equations, reload the data and declare it as MI data.

use https://stats.idre.ucla.edu/wp-content/uploads/2017/05/hsb2_mar.dta, clear

mi set mlong

The mlong (marginal long) style stores the original data plus, for each imputation, only the observations that have missing values, stacked in long form.

The basic setup is similar to MVN imputation, but with two key differences:

  1. We use mi impute chained instead of mi impute mvn.
  2. We specify an imputation model for each type of variable.

For example:

  • logit for binary variables
  • mlogit for unordered categorical variables
  • regress for continuous variables

Register variables for imputation

First, register the variables that will be imputed.

mi register imputed female write read math prog science

Unlike the MVN example, we no longer need dummy variables for prog.

Because MICE can impute unordered categorical variables directly, prog can be imputed using a multinomial logistic model.

Chained-equations imputation

Next, use mi impute chained to specify the imputation model.

mi impute chained ///
    (logit) female ///
    (mlogit) prog ///
    (regress) write read math science = socst, ///
    add(10) rseed(53421) savetrace(trace1, replace)

The model in parentheses applies to the variable or variables listed after it.

For example, (logit) female tells Stata to impute female using logistic regression.

  • add(10) creates 10 imputed data sets.
  • rseed(53421) makes the random process reproducible.
  • savetrace(trace1, replace) saves trace information for convergence diagnostics.
  • socst is included as a predictor in the imputation model.

By default, Stata imputes variables from most observed to least observed. Use orderasis to keep the order specified in the command.

. mi impute chained (logit) female (mlogit) prog (regress) write read math science = socst, ///
> add(10) rseed (53421) savetrace(trace1,replace)

Conditional models:
              read: regress read math science write i.female i.prog socst
              math: regress math read science write i.female i.prog socst
           science: regress science read math write i.female i.prog socst
             write: regress write read math science i.female i.prog socst
            female: logit female read math science write i.prog socst
              prog: mlogit prog read math science write i.female socst

Performing chained iterations ...

Multivariate imputation                     Imputations =       10
Chained equations                                 added =       10
Imputed: m=1 through m=10                       updated =        0

Initialization: monotone                     Iterations =      100
                                                burn-in =       10

            female: logistic regression
              prog: multinomial logistic regression
             write: linear regression
              read: linear regression
              math: linear regression
           science: linear regression

------------------------------------------------------------------
                   |               Observations per m             
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
            female |        182           18        18 |       200
              prog |        182           18        18 |       200
             write |        183           17        17 |       200
              read |        191            9         9 |       200
              math |        185           15        15 |       200
           science |        184           16        16 |       200
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
 of the number of filled-in observations.)

MICE analysis phase

Once the 10 imputed data sets have been created, we can run the analysis model.

Because female and prog were imputed using models appropriate for categorical variables, they can be used as factor variables in the regression model.

mi estimate: regress read write i.female math ib3.prog

The mi estimate: prefix runs the regression model in each imputed data set and then pools the results.

Stata fits the regression model separately in each of the 10 imputed data sets.

This produces 10 sets of:

  • regression coefficients
  • standard errors
  • test statistics

Stata then combines these results into one set of inferential statistics.

. mi estimate: regress read write i.female math ib3.prog

Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =        200
                                                Average RVI       =     0.1649
                                                Largest FMI       =     0.2121
                                                Complete DF       =        194
DF adjustment:   Small sample                   DF:     min       =      90.00
                                                        avg       =     117.29
                                                        max       =     146.03
Model F test:       Equal FMI                   F(   5,  170.7)   =      35.22
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
        read | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       write |   .4028188   .0827066     4.87   0.000     .2391084    .5665291
             |
      female |
     female  |  -2.650018   1.201493    -2.21   0.029    -5.026381    -.273656
        math |   .4089138   .0844608     4.84   0.000     .2414949    .5763326
             |
        prog |
    general  |   .0134051   1.710516     0.01   0.994    -3.384835    3.411645
   academic  |   2.341625   1.558824     1.50   0.136      -.75001    5.433259
             |
       _cons |   9.647476      3.617     2.67   0.009     2.499048     16.7959
------------------------------------------------------------------------------

Note

As with MVN imputation, the pooled standard errors include additional uncertainty due to the missing values.

Interpreting the MICE results

The MICE estimates are generally comparable to the complete-data demonstration results.

In this example:

  • write, female, and math are statistically significant
  • the standard errors are larger than in the complete-data analysis
  • the standard errors are still smaller than those from complete-case analysis
  • the estimates for prog should be examined carefully

Note

The larger standard errors are expected because multiple imputation incorporates uncertainty about the missing values.

Comparing regression estimates

                  Full data      Complete case  MICE
Parameter             β      SE      β      SE      β      SE        p
----------------------------------------------------------------------
Intercept          9.62   3.410  13.03   4.124   9.65   3.617    0.009
Write              0.37   0.075   0.44   0.093   0.40   0.083   <.0001
Female            -2.70   1.095  -2.71   1.365  -2.65   1.201    0.029
Math               0.44   0.075   0.32   0.095   0.41   0.084   <.0001
PROG academic      1.88   1.423   1.81   1.655   2.34   1.559    0.136
PROG general       0.23   1.512   0.52   1.881   0.01   1.711    0.994

The complete-data results are used for demonstration only, not as a formal benchmark for evaluating multiple imputation.

Interpreting the MICE results

Compared with complete-case analysis, the MICE standard errors are generally smaller, because all 200 observations contribute information through the imputation model.

Compared with the complete-data demonstration results, the MICE standard errors are somewhat larger, which is expected because MI incorporates uncertainty from missing values.

The estimates for write, female, and math are fairly consistent across approaches.

Warning

The estimates for prog differ more noticeably, so we should examine imputation diagnostics and consider whether the imputation model for prog can be improved.
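
One quick check is to compare the distribution of prog in the original data with its distribution in a few imputed data sets. This is a minimal sketch, assuming the MICE-imputed data are in memory. The choice of imputations 1 and 2 is arbitrary.

```stata
* Sketch: compare prog frequencies in the original data (m = 0)
* and in two imputed data sets (m = 1 and m = 2).
* Assumption: the mi data with 10 chained-equations imputations are in memory.
mi xeq 0 1 2: tabulate prog, missing
```

If the category proportions in the imputed data sets differ sharply from those in the observed data, the imputation model for prog may need additional predictors.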

MICE imputation diagnostics

After fitting the MICE model, we can examine diagnostic measures such as:

  • RVI: relative increase in variance
  • FMI: fraction of missing information
  • DF: degrees of freedom
  • RE: relative efficiency
  • between-imputation variance
  • within-imputation variance

mi estimate, vartable dftable
. mi estimate, vartable dftable

Multiple-imputation estimates                   Imputations       =         10
Linear regression

Variance information
------------------------------------------------------------------------------
             |        Imputation variance                             Relative
             |    Within   Between     Total       RVI       FMI    efficiency
-------------+----------------------------------------------------------------
       write |   .005903   .000852    .00684   .158707   .141937       .986005
             |
      female |
     female  |   1.27246   .155565   1.44359   .134481    .12251       .987897
        math |   .005955   .001072   .007134   .197998    .17193       .983098
             |
        prog |
    general  |   2.33199   .539888   2.92587   .254665   .212116       .979229
   academic  |     2.004   .387216   2.42993   .212544    .18258       .982069
             |
       _cons |   11.7972   1.16864   13.0827   .108968   .101237       .989978
------------------------------------------------------------------------------


Multiple-imputation estimates                   Imputations       =         10
Linear regression                               Number of obs     =        200
                                                Average RVI       =     0.1649
                                                Largest FMI       =     0.2121
                                                Complete DF       =        194
DF adjustment:   Small sample                   DF:     min       =      90.00
                                                        avg       =     117.29
                                                        max       =     146.03
Model F test:       Equal FMI                   F(   5,  170.7)   =      35.22
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
             |                                                      % increase
        read | Coefficient  Std. err.      t    P>|t|           df   std. err.
-------------+----------------------------------------------------------------
       write |   .4028188   .0827066     4.87   0.000        123.2        7.64
             |
      female |
     female  |  -2.650018   1.201493    -2.21   0.029        133.9        6.51
        math |   .4089138   .0844608     4.84   0.000        107.8        9.45
             |
        prog |
    general  |   .0134051   1.710516     0.01   0.994         90.0       12.01
   academic  |   2.341625   1.558824     1.50   0.136        102.8       10.12
             |
       _cons |   9.647476      3.617     2.67   0.009        146.0        5.31
------------------------------------------------------------------------------

In this example, the largest RVI and FMI are associated with prog.

What do the diagnostics suggest?

The highest values belong to prog, specifically the coefficient for the general category:

  • RVI is about 25% (0.255)
  • FMI is about 21% (0.212)

This suggests that missing data contribute more uncertainty to the prog coefficients than to the other coefficients.
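
Most of the entries for the general category of prog can be checked by hand from the variance table. The following is a worked check under Rubin's rules, with W the within-imputation variance, B the between-imputation variance, and m = 10 imputations. FMI is taken as reported in the output rather than recomputed.

```latex
\begin{aligned}
T &= W + \left(1 + \tfrac{1}{m}\right) B
   = 2.33199 + 1.1 \times 0.539888 \approx 2.92587 \\
\mathrm{RVI} &= \frac{(1 + 1/m)\,B}{W}
   = \frac{1.1 \times 0.539888}{2.33199} \approx 0.2547 \\
\mathrm{SE} &= \sqrt{T} \approx 1.7105 \\
\text{\% increase in SE} &= 100\left(\sqrt{T/W} - 1\right)
   = 100\left(\sqrt{1.2547} - 1\right) \approx 12.0 \\
\mathrm{RE} &= \frac{1}{1 + \mathrm{FMI}/m}
   = \frac{1}{1 + 0.212116/10} \approx 0.9792
\end{aligned}
```

These values agree with the Total, RVI, Std. err., % increase std. err., and Relative efficiency entries reported for prog general.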

Possible next steps include:

  • increasing the number of imputations to about 20 or 25
  • adding auxiliary variables associated with prog
  • checking trace plots or other diagnostics for convergence
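
A minimal sketch of the first step, assuming the data are already mi set and contain the 10 imputations analyzed above. The variable list and methods are assumptions inferred from the trace file and the passive-imputation example in this seminar, not the original imputation command; the seed and the trace file name trace2 are hypothetical. Because add() appends to the existing imputations, the model should match the one that produced the first 10. If auxiliary variables are added instead, it is probably safer to re-impute all data sets from scratch so that every imputation comes from the same model.

```stata
* Assumed imputation model (hypothetical): it should match the model
* that produced the existing 10 imputations.
mi impute chained ///
    (regress) read math science write ///
    (logit) female ///
    (mlogit) prog = socst, ///
    add(15) rseed(20260601) savetrace(trace2, replace)

* Refit the analysis model on all 25 imputations and recheck RVI and FMI.
* The base category of prog is Stata's default for i.prog, which may
* differ from the output shown above.
mi estimate, vartable dftable: regress read write i.female math i.prog
```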

MICE trace file

With mi impute chained, the savetrace() option saves the means and standard deviations of imputed values at each iteration.

The saved file, trace1, can be opened like a regular Stata data set.

use trace1, clear

describe
. use trace1,clear
(Summaries of imputed values from -mi impute chained-)

. describe

Contains data from trace1.dta
 Observations:           110                  Summaries of imputed values from -mi impute chained-
    Variables:            14                  1 Jun 2026 06:04
-----------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-----------------------------------------------------------------------------------------------------------------
iter            byte    %12.0g                Iteration numbers
m               byte    %12.0g                Imputation numbers
read_mean       float   %9.0g                 Mean of read
read_sd         float   %9.0g                 Std. dev. of read
math_mean       float   %9.0g                 Mean of math
math_sd         float   %9.0g                 Std. dev. of math
science_mean    float   %9.0g                 Mean of science
science_sd      float   %9.0g                 Std. dev. of science
write_mean      float   %9.0g                 Mean of write
write_sd        float   %9.0g                 Std. dev. of write
female_mean     float   %9.0g                 Mean of female
female_sd       float   %9.0g                 Std. dev. of female
prog_mean       float   %9.0g                 Mean of prog
prog_sd         float   %9.0g                 Std. dev. of prog
-----------------------------------------------------------------------------------------------------------------

Preparing the trace file

The trace file is stored in long form, with a row for each chain at each iteration.

Because there are multiple chains, the iteration number is repeated. Before using tsset, we reshape the data to wide form.

reshape wide *mean *sd, i(iter) j(m)

tsset iter

Now the mean and standard deviation for each variable are stored separately by chain.

After reshaping the trace file, we can graph the mean of the imputed values for a variable.

tsline read_mean1, ///
    name(mice1, replace) ///
    legend(off) ///
    ytitle("Mean of Read")

MICE trace plot for read

Figure 9: Trace plot of the mean of read across 10 iterations for one MICE imputation chain. The values fluctuate between about 48 and 56, with no clear long-term increasing or decreasing trend.

This graph shows the mean of the imputed values of read across iterations for the first imputation chain.

Note

As with MVN imputation, we expect the trace plot to fluctuate randomly around a stable level, with no obvious long-term trend.

Trace plots across all imputation chains

All 10 imputation chains can also be graphed at the same time to check whether any single chain behaves unexpectedly.

Each colored line represents a different imputation chain, initialized with different starting values.

tsline read_mean*, name(mice1, replace) legend(off) ///
    ytitle("Mean of Read")

tsline read_sd*, name(mice2, replace) legend(off) ///
    ytitle("SD of Read")

graph combine mice1 mice2, xcommon cols(1) ///
    title("Trace plots of summaries of imputed values")

Figure 10: Trace plots of the mean (top) and standard deviation (bottom) of imputed values for read across 10 MICE imputation chains, one colored line per chain. The chains fluctuate around similar levels, suggesting no obvious convergence problem.

Note

When the chains are overlaid, we want to see that they all fluctuate around a similar stable level, with no single chain showing unusual drift or a different pattern.
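
Because the diagnostics pointed to prog, the same overlay can be drawn for its trace summaries. This is a sketch, assuming trace1 has already been reshaped to wide form and tsset by iter as above; the graph names prog_m and prog_s are arbitrary. Since prog is categorical, the mean and standard deviation of its category codes are only rough summaries of the imputed values.

```stata
* Overlay all chains for the mean and SD of the imputed prog values
tsline prog_mean*, name(prog_m, replace) legend(off) ///
    ytitle("Mean of prog")

tsline prog_sd*, name(prog_s, replace) legend(off) ///
    ytitle("SD of prog")

graph combine prog_m prog_s, xcommon cols(1) ///
    title("Trace plots of imputed values for prog")
```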

MICE: strengths and cautions

Autocorrelation plots are mainly useful for MVN imputation. For MICE, they are less informative: each iteration conditions on the observed data and the imputed values from the previous iteration, so some autocorrelation within a chain is expected, and mi impute chained runs a separate chain for each imputation rather than taking successive imputations from one chain.

MICE is attractive because each variable can be imputed using an appropriate conditional model. This is especially useful for variables that must take specific types of values, such as binary, categorical, ordinal, or count variables.

However, this flexibility can also create problems, including slow convergence, non-convergence, incompatible conditional models, or complete and quasi-complete separation when imputing categorical variables.

Warning

When using MICE, allow enough time to build, diagnose, and revise the imputation model before moving to the final analysis.

Passive imputation

Passive variables are variables created as functions of imputed variables.

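For example, if the analytic model includes math^2, math_sq should be recomputed from the imputed values of math in every imputed data set. Because mi passive computes the variable in the original data and in each existing imputation, it is run after mi impute.

* Register the incomplete variables before imputing
mi register imputed math read write female prog

* Impute first; socst is complete and serves as a predictor
mi impute chained ///
    (regress) math read write ///
    (logit) female ///
    (mlogit) prog = socst, ///
    add(20) rseed(53421)

* Then create the passive variable in m = 0 and in each imputed data set
mi passive: generate math_sq = math^2

* Fit the analytic model, including the squared term
mi estimate: regress read math math_sq write i.female i.prog

Creating math_sq this way keeps it mathematically consistent with math in every imputed data set.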

Warning

Passive imputation keeps derived variables mathematically consistent with their components. However, for nonlinear terms and interactions, the imputation model still needs to be compatible with the analytic model; otherwise, the relationship of interest may be attenuated.
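
One alternative discussed in the literature (von Hippel, 2009, listed in the references) is to treat the squared term as just another incomplete variable and impute it directly, rather than deriving it passively. The following is a minimal sketch, assuming data that are mi set but not yet imputed and in which math_sq does not already exist; the number of imputations and the seed simply mirror the example above. With this approach the imputed math_sq will generally not equal math^2 exactly, which is the trade-off for keeping the imputation model closer to the analytic model.

```stata
* Create the squared term before imputation; it is missing wherever math is missing
generate math_sq = math^2

* Register it as an imputed variable alongside the others
mi register imputed math math_sq read write female prog

* Impute math_sq with its own regression model, like any other variable
mi impute chained ///
    (regress) math math_sq read write ///
    (logit) female ///
    (mlogit) prog = socst, ///
    add(20) rseed(53421)

mi estimate: regress read math math_sq write i.female i.prog
```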

Practical MI decisions

Before running multiple imputation, make several modeling decisions carefully:

  • Include variables that predict missingness or the values of the incomplete variables.
  • Include the dependent variable in the imputation model.
  • Include transformations, interactions, and nonlinear terms needed in the final analysis.
  • Choose an imputation method appropriate for the variable types, such as MVN or MICE.
  • Use enough imputations, especially when the fraction of missing information is high.
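
For the first item, a quick check of whether a candidate auxiliary variable is worth including might look like the sketch below. It assumes the original, not yet imputed, data are in memory and uses socst, the complete variable from the examples above, as the candidate; miss_prog is a hypothetical indicator name.

```stata
* Indicator for missingness in prog (1 = missing, 0 = observed)
generate byte miss_prog = missing(prog)

* Does socst predict whether prog is missing?
logit miss_prog socst

* Is socst associated with prog among the observed cases?
mlogit prog socst
```

A variable that is related to either the missingness indicator or the observed values is a reasonable candidate for the imputation model.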

Note

Multiple imputation should be treated as part of the analysis plan, not as a mechanical preprocessing step.

Main takeaways

Multiple imputation is generally preferable to single-imputation methods because it does not rely on one filled-in value and it incorporates uncertainty due to missing data.

The goal is not to recover the exact missing values, but to preserve the relationships among variables and obtain valid estimates and standard errors.

MI can improve estimation and power, but it is not magic: it depends on a reasonable imputation model, plausible assumptions, and careful diagnostics.

Warning

Multiple imputation helps address missing data, but it does not fix poor measurement, weak study design, or unsupported modeling assumptions.

References

  • Allison, P. D. (2002). Missing Data. Sage.
  • Allison, P. D. (2012). Handling missing data by maximum likelihood. SAS Global Forum: Statistics and Data Analysis.
  • Barnard, J., & Rubin, D. B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4), 948–955.
  • Bartlett, J. W., et al. (2014). Multiple imputation of covariates by fully conditional specification. Statistical Methods in Medical Research.
  • Bodner, T. E. (2008). What improves with increased missing data imputations? Structural Equation Modeling, 15(4), 651–675.
  • Demirtas, H., et al. (2008). Plausibility of multivariate normality when imputing non-Gaussian outcomes. Journal of Statistical Computation and Simulation, 78(1).
  • Enders, C. K. (2010). Applied Missing Data Analysis. Guilford Press.
  • Graham, J. W., et al. (2007). How many imputations are really needed? Prevention Science, 8, 206–213.
  • Lee, K. J., & Carlin, J. B. (2010). Fully conditional specification versus multivariate normal imputation. American Journal of Epidemiology, 171(5), 624–632.
  • Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). Wiley.
  • Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
  • Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
  • Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.
  • van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16, 219–242.
  • von Hippel, P. T. (2009). How to impute interactions, squares, and other transformed variables. Sociological Methodology, 39, 265–291.
  • White, I. R., et al. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399.

Thank you

Questions?

