How can I use multiply imputed data where original data is not included?

Note that Stata’s mi commands were introduced in version 11; the code below does not apply to earlier versions.

One common storage method for multiply imputed (MI) datasets is to stack all m imputed datasets (where m is the number of imputations) in a single file. For example, if five imputations were created, there would be five copies of each case (i.e. five rows in the dataset for each case) in a single file. Some MI datasets also contain an additional copy of the data, the original (pre-imputation) data, so that there would be six rows for each case if there were five imputations. Either format provides the information necessary to carry out data analysis on the MI datasets. However, Stata’s mi commands expect the original (pre-imputation) data to be included in the MI dataset. If the original data is not included, the commands won’t work properly. Below we explain the problem and describe how to modify a dataset released without the original data so that the original data is included in the MI file.

Note that if you are working with National Health and Nutrition Examination Survey (NHANES) data or a similarly formatted MI dataset, you may want to use Stata’s mi import nhanes1 command instead of the procedure described below. For information on using mi import nhanes1, type "help mi import nhanes1" (without the quotes) in the Stata command window.

Explanation of the problem

Below is a small example of an original (pre-imputation) dataset, where id is the case id variable and v1–v3 are variables with some missing values.

id v1 v2 v3
1  9  3  4
2  4  .  2
3  .  2  .

If we created three imputations (i.e. m=3), the dataset might look like the one shown below, where m is the imputation number. Since case 1 (id=1) had complete data, all three of its rows are identical. For the other two cases, the imputed values vary across the imputations.

m id v1 v2 v3
1  1  9  3  4
1  2  4  2  2
1  3  4  2  5
2  1  9  3  4
2  2  4  3  2
2  3  2  2  3
3  1  9  3  4
3  2  4  3  2
3  3  2  2  4

As we discussed above, there is nothing wrong with this format. However, Stata expects that the original (unimputed) dataset is included (denoted m=0). In this format, the example dataset from above would look like this:

m id v1 v2 v3
0  1  9  3  4
0  2  4  .  2
0  3  .  2  .
1  1  9  3  4
1  2  4  2  2
1  3  4  2  5
2  1  9  3  4
2  2  4  3  2
2  3  2  2  3
3  1  9  3  4
3  2  4  3  2
3  3  2  2  4
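
As a minimal sketch, the toy dataset above can be typed in by hand and passed to mi import flong to see how Stata treats a file that does include m=0. The variable names follow the table above; the clear option is added here only because data entered with input count as unsaved changes.

```stata
* Toy data from the table above, entered by hand (m=0 holds the original data).
clear
input m id v1 v2 v3
0 1 9 3 4
0 2 4 . 2
0 3 . 2 .
1 1 9 3 4
1 2 4 2 2
1 3 4 2 5
2 1 9 3 4
2 2 4 3 2
2 3 2 2 3
3 1 9 3 4
3 2 4 3 2
3 3 2 2 4
end

* Declare the data as flong-style MI data; v1-v3 are the imputed variables.
mi import flong, m(m) id(id) imputed(v1 v2 v3) clear

* Summarize complete/incomplete observations and the number of imputations.
mi describe
```

Because cases 2 and 3 have missing values in m=0, mi describe should report them as incomplete rather than silently overwriting their imputed values.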

Below we show what happens when one tries to use the mi import command to import data without the original data (m=0). First we open a dataset and tabulate the variable m, which takes on five values, one for each imputation.

use https://stats.idre.ucla.edu/stat/stata/faq/hsb2_no_m_0
(highschool and beyond (200 cases))

tab m

 imputation |
     number |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        200       20.00       20.00
          2 |        200       20.00       40.00
          3 |        200       20.00       60.00
          4 |        200       20.00       80.00
          5 |        200       20.00      100.00
------------+-----------------------------------
      Total |      1,000      100.00
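
Before importing, it can help to check programmatically whether the file contains m=0 at all. The sketch below is one way to do that, assuming the imputed variables are female, math, write, read and science (as in this example); the helper variable nmiss and the local macro names are made up for illustration.

```stata
use https://stats.idre.ucla.edu/stat/stata/faq/hsb2_no_m_0, clear

* Lowest imputation number present in the file.
quietly summarize m
local lowest = r(min)

* Number of rows that claim to be the original data (m == 0).
quietly count if m == 0
local n_m0 = r(N)

* Rows in the lowest-numbered copy that contain any missing value.
egen nmiss = rowmiss(female math write read science)
quietly count if m == `lowest' & nmiss > 0
local n_incomplete = r(N)
drop nmiss

display "rows with m == 0: `n_m0'"
display "incomplete rows in m == `lowest': `n_incomplete'"

* No m == 0 rows means the original data must be rebuilt before mi import.
if `n_m0' == 0 {
    display as error "no m == 0 copy found; mi import flong would treat m == `lowest' as the original data"
}
```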

Below we use the mi import command to tell Stata that our data is multiply imputed. The m(…) option identifies the variable that contains the imputation number, id(…) gives the individual id number for each case, and imputed(…) gives the names of the variables that have been imputed.

mi import flong, m(m) id(id) imputed(female math write read science)
(36 values of imputed variable female in m>0 updated to match values in m=0)
(48 values of imputed variable math in m>0 updated to match values in m=0)
(52 values of imputed variable write in m>0 updated to match values in m=0)
(76 values of imputed variable read in m>0 updated to match values in m=0)
(92 values of imputed variable science in m>0 updated to match values in m=0)

This produces five messages from Stata. Each message informs us that Stata has changed values in the imputed datasets to match existing values in what it assumes to be the original data (i.e. m=0). But the dataset we started with didn’t contain m=0, so Stata assumed that the lowest value of m (m=1) was actually m=0. Below is a cross tab of the m variable from our dataset and the system variable _mi_m (created by mi import to index the imputations). The cross tab shows how Stata renumbered the imputations. In and of itself, this isn’t a problem: m is just an identifier, so in many ways its value is arbitrary. However, Stata assumes that m=0 (i.e. _mi_m=0) is the pre-imputation dataset. Since m=1 (which became _mi_m=0) contains no missing data, Stata treats all of its values as observed and replaces the imputed values in the other four MI datasets with the values from m=1. If m=1 were the original data, this would make perfect sense; after all, we don’t need imputed values for cases where we have observed values. However, because m=1 does not contain the original (unimputed) data, this creates problems.

tab m _mi_m

imputation |                         _mi_m
    number |         0          1          2          3          4 |     Total
-----------+-------------------------------------------------------+----------
         1 |       200          0          0          0          0 |       200 
         2 |         0        200          0          0          0 |       200 
         3 |         0          0        200          0          0 |       200 
         4 |         0          0          0        200          0 |       200 
         5 |         0          0          0          0        200 |       200 
-----------+-------------------------------------------------------+----------
     Total |       200        200        200        200        200 |     1,000

To further show what is going on, we use the mi describe command. It tells us that there are 200 complete observations (since all observations in m=1 are complete) and that M=4 (i.e. we have four imputed datasets), when there are actually five.

mi describe

  Style:  flong
          last mi update approximately 1 minute ago

  Obs.:   complete          200
          incomplete          0  (M = 4 imputations)
          ---------------------
          total             200

  Vars.:  imputed:  5; female(0) math(0) write(0) read(0) science(0)

          passive: 0

          regular: 0

          system:  3; _mi_m _mi_id _mi_miss

         (there are 12 unregistered variables)

The solution

For mi import to import our data properly, we need to create a dataset of the form Stata expects, that is, a dataset where m=0 contains the original (unimputed) data and m>0 contains the multiply imputed datasets.

Below we start by loading the MI dataset. Next we keep only one of the imputations (keep if m==1) and set the value of m to 0 (using the replace command).

use https://stats.idre.ucla.edu/stat/stata/faq/hsb2_no_m_0, clear
(highschool and beyond (200 cases))

keep if m==1
(800 observations deleted)

replace m=0
(200 real changes made)

For the next step, we need to know which variables have imputed values, and for each imputed variable, we need a variable that indicates which observations were imputed. In our case, each indicator variable is named after the imputed variable with the prefix i_. For example, the variable i_female is equal to 1 when the value of female has been imputed, and 0 when it has not. (If your dataset does not have this type of imputation indicator, it can be created; see below.) We can use these indicator variables to recreate the missing values in the original dataset. For the variable female, the command to create missing values where female has been imputed is: replace female = . if i_female==1 . However, if you had more than a few imputed variables, writing out the command for each variable would be tedious, so instead we use a loop to do the same thing. The foreach command tells Stata that for each variable that follows the keyword varlist, it should perform the action in the braces, filling in `var' with the name of the variable. The output from running this command shows how many values were changed for each variable. For example, for female (the first variable in the list), 9 values were changed to missing.

foreach var of varlist female math write read science {
     replace `var' = . if i_`var'==1
}
(9 real changes made, 9 to missing)
(12 real changes made, 12 to missing)
(13 real changes made, 13 to missing)
(19 real changes made, 19 to missing)
(23 real changes made, 23 to missing)
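
A quick sanity check on the rebuilt m=0 copy can catch a mistyped variable name or a mismatched indicator. The sketch below repeats the steps above and then verifies that a value is missing exactly where its i_ indicator equals 1; the helper variable nmiss is hypothetical and is dropped before any further steps.

```stata
use https://stats.idre.ucla.edu/stat/stata/faq/hsb2_no_m_0, clear
keep if m == 1
replace m = 0

foreach var of varlist female math write read science {
    replace `var' = . if i_`var' == 1
    * A value should be missing if and only if it was flagged as imputed.
    assert missing(`var') == (i_`var' == 1)
}

* Count the cases that now have at least one missing value.
egen nmiss = rowmiss(female math write read science)
count if nmiss > 0
drop nmiss
```

If any assert fails, Stata stops with an error, which signals that the indicator and the recreated missing values disagree for that variable.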

We now have a dataset with the same missing data structure as the original (unimputed) dataset. Below we use append to add the cases from our starting dataset (hsb2_no_m_0.dta); as written, the command looks for that file in the current working directory. The output tells us that the value labels in the dataset we’re appending already exist in the current dataset. This is expected because they began as the same dataset. Next, we tabulate m, the variable for imputation number. Note that we now have values of m from 0 to 5. Finally, we save the new dataset under a different name.

append using hsb2_no_m_0
(label rl already defined)
(label sl already defined)
(label scl already defined)
(label sel already defined)
(label fl already defined)

tab m

 imputation |
     number |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        200       16.67       16.67
          1 |        200       16.67       33.33
          2 |        200       16.67       50.00
          3 |        200       16.67       66.67
          4 |        200       16.67       83.33
          5 |        200       16.67      100.00
------------+-----------------------------------
      Total |      1,200      100.00

save hsb2_m
file hsb2_m.dta saved

Now when we use the mi import command, instead of changing values as it did above, Stata marks the 66 observations with missing values as incomplete. The output from the command mi describe reports 66 incomplete observations and M = 5 imputations. Further down, the output also lists the variables that have been imputed, along with what we know to be the correct number of imputed values for each.

mi import flong, m(m) id(id) imputed(female math write read science)
(66 m=0 obs. now marked as incomplete)

mi describe

  Style:  flong
          last mi update 0 seconds ago

  Obs.:   complete          134
          incomplete         66  (M = 5 imputations)
          ---------------------
          total             200

  Vars.:  imputed:  5; female(9) math(12) write(13) read(19) science(23)

          passive: 0

          regular: 0

          system:  3; _mi_m _mi_id _mi_miss

         (there are 12 unregistered variables)
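
The steps above can be collected into a single do-file. The following is a sketch rather than a definitive implementation: the local macros src and ivars are introduced here for convenience, the file is appended directly from its URL instead of from a local copy, and the regression at the end is an arbitrary illustrative model, not one taken from this page.

```stata
* Source file and list of imputed variables (macro names are illustrative).
local src "https://stats.idre.ucla.edu/stat/stata/faq/hsb2_no_m_0"
local ivars female math write read science

* Build the m=0 copy from one imputation by restoring the missing values.
use "`src'", clear
keep if m == 1
replace m = 0
foreach var of local ivars {
    replace `var' = . if i_`var' == 1
}

* Stack the five imputations underneath the rebuilt original data.
append using "`src'"

* Declare the data as MI; clear is needed because the data are unsaved.
mi import flong, m(m) id(id) imputed(`ivars') clear
mi describe

* Illustrative analysis: combine estimates across the imputations.
mi estimate: regress read write math female
```

Keeping the variable list in one macro means the loop and the imputed() option cannot drift apart when variables are added or removed.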

Creating indicators for imputed values

The example above assumes that there are variables that mark the imputed observations. That is, for each variable that has been imputed, there is a variable marking which cases have imputed values and which have observed values. If you are working with a dataset that does not have these indicators, it is possible to create them. The first two lines of code below open the dataset and sort by case id (id). The third line of code below creates a new variable i_female that is equal to the standard deviation of female, for each value of id. If female was observed for a given case, the value of female will be the same across the imputed datasets and the standard deviation within that case will be equal to zero. If female was imputed, the value is likely to vary across the imputations (see note below), and the standard deviation will be greater than 0. In the final line, we replace values of i_female greater than zero with the value 1, so that i_female is equal to 1 if female appears to have been imputed, and 0 otherwise.

use https://stats.idre.ucla.edu/stat/stata/faq/hsb2_no_m_0_ind, clear
sort id
by id: egen i_female = sd(female)
replace i_female = 1 if i_female>0

For a single variable, this works fine. However, if the dataset contains many imputed variables, this process would be labor intensive and error prone. So instead of writing out the code for each variable, we can use a loop. Below we open the dataset and sort the cases by id. Next we use the foreach command to generate the indicator variables for each of the variables listed after the keyword varlist. Stata will run the commands in the braces once for each variable in the list, each time replacing `var' with the name of the appropriate variable.

use hsb2_no_m_0_ind, clear
sort id
foreach var of varlist female math write read science {
    by id: egen i_`var' = sd(`var')
    replace i_`var' = 1 if i_`var'>0 
}
(45 real changes made)
(60 real changes made)
(65 real changes made)
(95 real changes made)
(115 real changes made)
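
The counts reported by replace are numbers of rows, not numbers of cases, because every case appears once per imputation. As a rough check, the sketch below rebuilds the indicators and counts flagged cases instead. It differs slightly from the loop above: bysort replaces the separate sort, and a !missing() guard is added on the assumption that a missing standard deviation should not be flagged as imputed. The variable first is a hypothetical helper.

```stata
use https://stats.idre.ucla.edu/stat/stata/faq/hsb2_no_m_0_ind, clear

foreach var of varlist female math write read science {
    * Within-case standard deviation across the imputations.
    bysort id: egen i_`var' = sd(`var')
    * Flag cases whose value varies; leave missing SDs unflagged (assumption).
    replace i_`var' = 1 if i_`var' > 0 & !missing(i_`var')
}

* Tag one row per case so that each case is counted once.
egen first = tag(id)
foreach var of varlist female math write read science {
    quietly count if first & i_`var' == 1
    display "`var': " r(N) " cases flagged as imputed"
}
drop first
```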

Note that this technique assumes that the imputed value for a single case will vary across imputations. This is likely to be true for continuous variables and for categorical variables imputed using the multivariate normal approach. For categorical variables imputed using the chained equations approach (implemented in the Stata package ice as well as in other packages), it will usually be true, but there may be exceptions (i.e. the same value was imputed in all imputations). That said, if the imputed value does not vary across imputations, the approach outlined above will still work to set up the data for use with Stata’s mi commands, at least mechanically.