How can I perform multiple imputation on longitudinal data using chained equations (mi impute chained)?
Imputing longitudinal or panel data poses special problems. If the data are in long form, each case has multiple rows in the dataset, so this structure needs to be accounted for in the estimation of any analytic model. At the same time, the information from other time points can be an important predictor of missing values, so we want to take advantage of this by incorporating it into our imputation model. The following example shows how to impute longitudinal data while accommodating the structure of this type of data. The example dataset contains students' reading and math scores at three time points (read and math, respectively), as well as the time-invariant covariates female, private, and ses. The data are in long form, so there are 3 rows in the data for each of the 200 students for whom we have data. The data also contain an id variable, which allows us to match the cases across the three waves of data collection, and a variable time, which tells us when the data were collected. There are missing data on every variable except female. This FAQ page will address the following questions: (1) How does one create multiply imputed datasets that account for the clustering in the data (multiple observations per student)? (2) How does one take advantage of the fact that reading or math scores at the other two time points are likely to be good predictors of any missing values of the time-varying variables?
First we want to look at our data to confirm that there are missing data. We can do this using the summarize command (which can be abbreviated to sum). We can also use the user-written command nmissing to look at the amount of missingness per variable. You can download nmissing from within Stata by typing search nmissing (see How can I use the search command to search for programs and get additional help? for more information about using search).
use "https://stats.idre.ucla.edu/stat/stata/faq/mi_longi.dta"

sum female ses private read math

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      female |        600        .545    .4983864          0          1
         ses |        531    2.011299    .7136568          1          3
     private |        570    .1526316    .3599479          0          1
        read |        538    52.66543    10.29398         26         76
        math |        554    52.24676    9.339812         26         75

nmissing

ses       69
read      62
math      46
private   30
Stata has a suite of multiple imputation (mi) commands to help users not only impute their data but also explore the patterns of missingness present in the data.
In order to use these commands, the dataset in memory must be declared, or mi set, as an "mi" dataset. A dataset that is mi set is given an mi style, which tells Stata how the multiply imputed data are to be stored once the imputation has been completed. For information on these styles, type help mi styles in the command window. We will use the style mlong. The chosen style can be changed later using mi convert.
mi set mlong
You will notice that executing the previous command adds three new variables to your dataset. These variables are used by Stata to track the imputed datasets and values.
- _mi_miss: marks the observations in the original dataset that have missing values.
- _mi_m: indicates the imputation number. The value is 0 for the original dataset.
- _mi_id: an identifier for the observations in the original dataset, repeated across the imputed datasets to mark the corresponding imputed observations.
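You can use these tracking variables to spot-check the setup. A minimal sketch (the tabulation of _mi_miss should match the missingness counts seen earlier; changing the storage style with mi convert is optional):

```stata
* observations flagged as having missing values in the original data
tab _mi_miss

* switch to another storage style later if desired, e.g. full long
mi convert flong, clear
```

The clear option on mi convert allows the conversion even though the data in memory have not been saved.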
The mi misstable commands help users tabulate the amount of missingness in their variables of interest (summarize) as well as examine patterns of missingness (patterns).
mi misstable summarize female private ses read math

                                                             Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
       private |        30               570    |      2          0           1
           ses |        69               531    |      3          1           3
          read |        62               538    |     42         26          76
          math |        46               554    |    225         26          75
  -----------------------------------------------------------------------------

mi misstable patterns female private ses read math

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Percent   |  1  2  3  4
  ------------+-------------
        70%   |  1  1  1  1
              |
          9   |  1  1  1  0
          8   |  1  1  0  1
          6   |  1  0  1  1
          4   |  0  1  1  1
          2   |  1  1  0  0
         <1   |  1  0  0  1
         <1   |  0  0  1  1
         <1   |  0  1  0  0
         <1   |  1  0  1  0
         <1   |  0  1  0  1
         <1   |  0  1  1  0
         <1   |  1  0  0  0
  ------------+-------------
       100%   |

  Variables are  (1) private  (2) math  (3) read  (4) ses
Once we are familiar with our data, the first step in the imputation process is to reshape the data from long to wide. Having the data in wide form takes care of both the nesting issue (there is now only one row of data per student) and allows us to easily use variables from the other time periods as predictors of missing values, since in wide form, they are just other variables in the dataset (rather than being part of another row in the dataset). We do this using the mi reshape command, and then check the output from reshape to make sure everything went the way it should, and it has. This version of reshape maintains the structure of the multiply imputed dataset as we switch between wide and long. Note that the variable time is dropped, and that there are now three read variables and three math variables after we reshape.
mi reshape wide read math, i(id) j(time)

reshaping m=0 data ...
(note: j = 1 2 3)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                      600   ->   200
Number of variables                   7   ->   10
j variable (3 values)              time   ->   (dropped)
xij variables:
                                   read   ->   read1 read2 read3
                                   math   ->   math1 math2 math3
-----------------------------------------------------------------------------
After reshaping the data, and checking to make sure that the reshape command worked as we want it to, we can do whatever steps are necessary to impute the missing values. The important point is that since our data are in wide (rather than long) format, the fact that data are longitudinal does not create any additional complications.
After the data are mi set, Stata requires three additional commands to complete our analysis. The first is mi register imputed. This command identifies which variables in the imputation model have missing information.
mi register imputed private ses read1 read2 read3 math1 math2 math3
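Although not strictly required, complete variables that will serve only as predictors (here, female) can also be registered so that their role in the mi dataset is explicit. A minimal sketch:

```stata
* mark fully observed variables that will not be imputed
mi register regular female
```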
The second command is mi impute chained, where the user specifies the imputation model to be used and the number of imputed datasets to be created. Within this command we can specify a particular distribution under which to impute each variable. The chosen imputation method is listed in parentheses directly preceding the variable(s) to which that distribution applies. Note that we also use rseed to set the seed for the random-number generator, which will enable you to reproduce the results of our imputation.
On the mi impute chained command line we can use the add option to specify the number of imputations to be performed. In this example we chose 10 imputations. Note that the choice of 10 imputations is just for illustrative purposes; your data may require more for valid estimation. Variables on the left side of the equal sign have missing information, while the right side is reserved for variables with no missing information, which are therefore solely "predictors" of missing values.
mi impute chained (logit) private (ologit) ses ///
    (regress) read1 read2 read3 math1 math2 math3 = female, ///
    add(10) rseed(091107)

             math1: regress math1 read1 i.private math2 math3 i.ses read3 read2 female
             read1: regress read1 math1 i.private math2 math3 i.ses read3 read2 female
           private: logit private math1 read1 math2 math3 i.ses read3 read2 female
             math2: regress math2 math1 read1 i.private math3 i.ses read3 read2 female
             math3: regress math3 math1 read1 i.private math2 i.ses read3 read2 female
               ses: ologit ses math1 read1 i.private math2 math3 read3 read2 female
             read3: regress read3 math1 read1 i.private math2 math3 i.ses read2 female
             read2: regress read2 math1 read1 i.private math2 math3 i.ses read3 female

Performing chained iterations ...

Multivariate imputation                     Imputations =       10
Chained equations                                 added =       10
Imputed: m=1 through m=10                       updated =        0

Initialization: monotone                     Iterations =      100
                                                burn-in =       10

           private: logistic regression
               ses: ordered logistic regression
             read1: linear regression
             read2: linear regression
             read3: linear regression
             math1: linear regression
             math2: linear regression
             math3: linear regression

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
           private |        190           10        10 |       200
               ses |        177           23        23 |       200
             read1 |        194            6         6 |       200
             read2 |        168           32        32 |       200
             read3 |        176           24        24 |       200
             math1 |        195            5         5 |       200
             math2 |        180           20        20 |       200
             math3 |        179           21        21 |       200
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)
So now we have our multiply imputed data, but they are still in wide format, and we will probably want them in long form to run the analyses. We can again use mi reshape to reshape the data back to long.
mi reshape long read math, i(id) j(time)

reshaping m=0 data ...
(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      200   ->   600
Number of variables                  10   ->   7
j variable (3 values)                     ->   time
xij variables:
                      read1 read2 read3   ->   read
                      math1 math2 math3   ->   math
-----------------------------------------------------------------------------

reshaping m=1 data ...
reshaping m=2 data ...
reshaping m=3 data ...
reshaping m=4 data ...
reshaping m=5 data ...
reshaping m=6 data ...
reshaping m=7 data ...
reshaping m=8 data ...
reshaping m=9 data ...
reshaping m=10 data ...

assembling results ...
After reshaping the data, we will want to explore our imputations. It is important to make sure that the imputed values make sense, that they are not out of range of the original values, and so on. We can start by summarizing our data; we may also want to use by to look at the values generated by each imputation separately. It might also be useful to generate either boxplots or histograms of our variables, to see that the distributions look reasonable after imputation.
sum female private ses read math

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      female |      2,430    .5296296     .499224          0          1
     private |      2,400       .1725    .3778932          0          1
         ses |      2,361    1.991529    .7415714          1          3
        read |      2,368    52.87335    10.31176   17.72421    81.7084
        math |      2,384    52.43168    9.857669         26         75
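One way to carry out the per-imputation checks described above is with mi xeq, which executes a command within specified imputations (m=0 is the original data; the graph name read_m1 is arbitrary):

```stata
* compare the observed data (m=0) with the first two imputations
mi xeq 0 1 2: sum read math

* histogram of read in the first imputed dataset
mi xeq 1: histogram read, name(read_m1, replace)
```

If an imputed distribution looks very different from the observed one, or imputed values fall far outside the observed range, the imputation model should be revisited.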
Once we have carefully checked our data to make sure there were no problems in the imputation, we can run an analysis on our data. The third step is mi estimate, which runs the analytic model of interest within each of the imputed datasets. It then combines the estimates (coefficients and standard errors) across all the imputed datasets to provide us with one set of estimates. Below we have used the mi estimate prefix with the command xtreg to predict reading test scores using time, math test scores (math), and gender (female), accounting for the fact that there are multiple observations per student. The command syntax is the same as for xtreg; all that needs to be added is the mi estimate prefix.
mi estimate: xtreg read math time female, i(id)

Multiple-imputation estimates                   Imputations       =         10
Random-effects GLS regression                   Number of obs     =        600
Group variable: id                              Number of groups  =        200

                                                Obs per group: min =         3
                                                               avg =       3.0
                                                               max =         3

Average RVI        =     0.0900
Largest FMI        =     0.1563
DF adjustment:   Large sample                   DF:     min       =     389.84
                                                        avg       =   1,816.79
                                                        max       =   3,299.97
Model F test:       Equal FMI                   F(   3, 2669.5)   =      49.94
Within VCE type:    Conventional                Prob > F          =     0.0000

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .5416812   .0463393    11.69   0.000     .4505749    .6327874
        time |   .0614452   .3543575     0.17   0.862    -.6333505    .7562409
      female |   2.361504   .9042511     2.61   0.009     .5885544    4.134454
       _cons |   22.71853   2.663597     8.53   0.000     17.48349    27.95358
-------------+----------------------------------------------------------------
     sigma_u |  4.4672309
     sigma_e |  6.5185302
         rho |  .31956744   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Note: sigma_u and sigma_e are combined in the original metric.
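If you want to see how much the imputation contributed to the uncertainty of each estimate, mi estimate can display a variance table reporting the within- and between-imputation variances, the relative variance increase (RVI), and the fraction of missing information (FMI) per coefficient. A sketch (this reruns the estimation):

```stata
* display the MI variance information alongside the combined estimates
mi estimate, vartable: xtreg read math time female, i(id)
```

Large FMI values for a coefficient suggest that more imputations may be needed for stable estimates.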