### How can I perform multiple imputation on longitudinal data using ICE?

Imputing longitudinal or panel data poses special problems. If the data
are in long form, each case has multiple rows in the dataset, so this needs to
be accounted for in the estimation of any analytic model. At the same time, the information from other time
points can be important predictors of missing values, so we want to take
advantage of this and incorporate this into our imputation model. The following example shows how to impute longitudinal data,
accommodating the structure of this type of data. The example dataset contains data on students' reading and math scores at
three time points (**read** and **math** respectively),
as well as data on the time invariant covariates **female**, **private**, and **ses**. The
data are in long form, so there are 3 rows in the data for each of the 200 students
for whom we have
data. The data also contain an **id** variable, which allows
us to match the cases across the three waves of data collection, and a variable
**time** which tells us when the data were collected.
There are missing data on four of the five substantive variables. This FAQ page will address the following questions:
(1) How does one create multiple imputed datasets that account for the clustering in the data (multiple
observations per student); (2) How does one take advantage of the fact that reading or math scores at
the other two time points are likely to be good predictors of any missing values
of the time-varying variables?

First we want to look at our data to confirm that there are missing data. We
can do this using the **summarize** command (which can be abbreviated to
**sum**). We can also use the user-written command **nmissing** to look at the amount
of missingness per variable within our data. You can download **nmissing** from
within Stata by typing **search nmissing** (see
How can I use the search command to search for programs and get additional
help? for more information about using **search**).

```
use "https://stats.idre.ucla.edu/stat/stata/faq/mi_longi.dta"
sum female ses private read math

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      female |        600        .545    .4983864          0          1
         ses |        531    2.011299    .7136568          1          3
     private |        570    .1526316    .3599479          0          1
        read |        538    52.66543    10.29398         26         76
        math |        554    52.24676    9.339812         26         75

nmissing

ses          69
read         62
math         46
private      30
```

Stata has a suite of multiple imputation (mi) commands to help users not only impute their data but also explore the patterns of missingness present in the data.

In order to use these commands, the dataset in memory must be declared, or **mi set**, as an "mi" dataset. A dataset that is **mi set** is given an mi style, which tells Stata how the multiply imputed data are to be stored once the imputation has been completed. For information on these styles, type **help mi styles** in the command window. We will use the style mlong. The chosen style can be changed later using **mi convert**.
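As an illustrative sketch of switching styles with **mi convert** (this assumes the data have already been **mi set**; the conversions below are not part of the workflow on this page):

```
* switch to wide style: one row per case, imputed values in extra variables
mi convert wide, clear

* switch back to marginal long (mlong) style
mi convert mlong, clear
```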

```
mi set mlong
```

You will notice that executing the previous command adds three new variables to your dataset. Stata uses these variables to track the imputed datasets and values.

- _mi_miss: marks the observations in the original dataset that have missing values.
- _mi_m: indicates the imputation number. The value is 0 for the original dataset.
- _mi_id: an identifier for the observations in the original dataset; it is repeated across the imputed datasets to mark the imputed observations.
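At any point after the data are **mi set**, you can confirm the current style, the registered variables, and the number of imputations. A minimal sketch:

```
* report the mi style, registered variables, and number of imputations (M)
mi describe
```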

The **mi misstable** command helps users tabulate the amount of missingness in their variables of interest (**summarize**) as well as examine patterns of missingness (**patterns**).

```
mi misstable summarize female private ses read math

                                                             Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  |  values        Min        Max
  -------------+--------------------------------+------------------------------
       private |        30               570    |       2          0          1
           ses |        69               531    |       3          1          3
          read |        62               538    |      42         26         76
          math |        46               554    |     225         26         75
  -----------------------------------------------------------------------------

mi misstable patterns female private ses read math

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Percent   |  1  2  3  4
  ------------+-------------
        70%   |  1  1  1  1
              |
         9    |  1  1  1  0
         8    |  1  1  0  1
         6    |  1  0  1  1
         4    |  0  1  1  1
         2    |  1  1  0  0
        <1    |  1  0  0  1
        <1    |  0  0  1  1
        <1    |  0  1  0  0
        <1    |  1  0  1  0
        <1    |  0  1  0  1
        <1    |  0  1  1  0
        <1    |  1  0  0  0
  ------------+-------------
       100%   |

  Variables are  (1) private  (2) math  (3) read  (4) ses
```

Once we are familiar with our data, the first step in the imputation process is to reshape
the data from long to wide. Having the
data in wide form takes care of both the nesting issue (there is now only one
row of data per student) and allows us to easily use variables from the other
time periods as predictors of missing values, since in wide form, they are just
other variables in the dataset (rather than being part of another row in the dataset).
We do this using the **mi reshape** command, and then check the output from **reshape** to
make sure everything went the way it should (and it has). This version of **reshape** maintains the structure of the multiply imputed dataset as we switch between wide and long. Note that the variable **time** is dropped, and that there are now three **read** variables and three **math** variables after we reshape.

```
mi reshape wide read math, i(id) j(time)

reshaping m=0 data ...
(note: j = 1 2 3)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                      600   ->    200
Number of variables                   7   ->     10
j variable (3 values)              time   ->   (dropped)
xij variables:
                                   read   ->   read1 read2 read3
                                   math   ->   math1 math2 math3
-----------------------------------------------------------------------------
```

After reshaping the data, and checking to make sure that the **reshape** command worked
as we want it to, we can do whatever steps are necessary to impute the missing values. The important point is that since our data are in wide (rather than long) format, the fact that data are longitudinal does not create any additional complications.

After the data are **mi set**, Stata requires three additional commands to complete our analysis. The first is **mi register imputed**. This command identifies which variables in the imputation model have missing information.

```
mi register imputed private ses read1 read2 read3 math1 math2 math3
```
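Complete variables that appear in the imputation model, such as **female**, can optionally be registered as regular. This is not required, but it documents their role in the mi dataset; a sketch:

```
* optional: register a fully observed predictor as a regular variable
mi register regular female
```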

The second command is **mi impute chained**, where the user specifies the imputation model to be used and the number of imputed datasets to be created. Within this command we can specify a particular distribution under which to impute each variable. The chosen imputation method is listed in parentheses directly preceding the variable(s) to which that distribution applies. Note that we also use **rseed** to set the seed for the random number generator; this will enable you to reproduce the results of our imputation.

On the **mi impute chained** command line we use the **add** option to specify the number of imputations to be performed. In this example we chose 10 imputations. Note that the chosen number of 10 imputations is just for illustrative purposes; your data may require more for valid estimation. Variables on the left side of the equal sign have missing information, while the right side is reserved for variables with no missing information, which are therefore solely "predictors" of missing values.

```
mi impute chained (logit) private (ologit) ses (regress) read1 read2 read3 math1 math2 math3 = female, ///
   add(10) rseed(091107)

            math1: regress math1 read1 i.private math2 math3 i.ses read3 read2 female
            read1: regress read1 math1 i.private math2 math3 i.ses read3 read2 female
          private: logit private math1 read1 math2 math3 i.ses read3 read2 female
            math2: regress math2 math1 read1 i.private math3 i.ses read3 read2 female
            math3: regress math3 math1 read1 i.private math2 i.ses read3 read2 female
              ses: ologit ses math1 read1 i.private math2 math3 read3 read2 female
            read3: regress read3 math1 read1 i.private math2 math3 i.ses read2 female
            read2: regress read2 math1 read1 i.private math2 math3 i.ses read3 female

Performing chained iterations ...

Multivariate imputation                     Imputations =       10
Chained equations                                 added =       10
Imputed: m=1 through m=10                       updated =        0

Initialization: monotone                     Iterations =      100
                                                burn-in =       10

           private: logistic regression
               ses: ordered logistic regression
             read1: linear regression
             read2: linear regression
             read3: linear regression
             math1: linear regression
             math2: linear regression
             math3: linear regression

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
           private |        190           10        10 |       200
               ses |        177           23        23 |       200
             read1 |        194            6         6 |       200
             read2 |        168           32        32 |       200
             read3 |        176           24        24 |       200
             math1 |        195            5         5 |       200
             math2 |        180           20        20 |       200
             math3 |        179           21        21 |       200
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)
```
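If you want to assess convergence of the chained equations, **mi impute chained** can save the per-iteration means and standard deviations of the imputed values with its **savetrace()** option, and the burn-in period can be lengthened with **burnin()**. A sketch of this variant (the filename `trace` and the burn-in of 20 are illustrative choices, not part of the analysis above):

```
* variant of the imputation command that saves trace data for convergence checks
mi impute chained (logit) private (ologit) ses ///
    (regress) read1 read2 read3 math1 math2 math3 = female, ///
    add(10) rseed(091107) burnin(20) savetrace(trace, replace)
```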

So now we have our multiply imputed data, but they are still in wide format,
and we will probably want them in long form to run the analyses. We can again use **mi reshape** to reshape the data back to long.

```
mi reshape long read math, i(id) j(time)

reshaping m=0 data ...
(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      200   ->    600
Number of variables                  10   ->      7
j variable (3 values)                     ->   time
xij variables:
                      read1 read2 read3   ->   read
                      math1 math2 math3   ->   math
-----------------------------------------------------------------------------

reshaping m=1 data ...
reshaping m=2 data ...
reshaping m=3 data ...
reshaping m=4 data ...
reshaping m=5 data ...
reshaping m=6 data ...
reshaping m=7 data ...
reshaping m=8 data ...
reshaping m=9 data ...
reshaping m=10 data ...
assembling results ...
```

After reshaping the data, we will want to explore our imputations. It is important to make sure that the imputed values make sense, that they are not out of range of the original values, etc. We can start by summarizing our data; we may also want to use **by** to look at the values generated by each imputation separately. It might also be useful to generate either boxplots or histograms of our variables, to see that the distributions look reasonable after imputation.
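One convenient way to inspect individual imputations is **mi xeq**, which executes a command within the specified imputations. A sketch (the choice of imputations 1 and 2 is arbitrary):

```
* summarize the imputed variables within imputations 1 and 2 only
mi xeq 1 2: summarize read math
```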

```
sum female private ses read math

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      female |      2,430    .5296296     .499224          0          1
     private |      2,400       .1725    .3778932          0          1
         ses |      2,361    1.991529    .7415714          1          3
        read |      2,368    52.87335    10.31176   17.72421    81.7084
        math |      2,384    52.43168    9.857669         26         75
```

Once we have carefully checked our data to make sure there were no problems
in the imputation, we can run an analysis on our data. The third step is **mi estimate**, which runs the analytic model of interest within each of the imputed datasets. It also combines the estimates (coefficients and standard errors) across all the imputed datasets to provide us with one set of estimates.
Below we have used
the **mi estimate** prefix with the command **xtreg** to predict reading test scores
using **time**, math test scores (**math**), and gender (**female**), accounting for the fact that there are multiple
observations per student. The command syntax is the same as for **xtreg**; all
that needs to be added is the **mi estimate** prefix.

```
mi estimate: xtreg read math time female, i(id)

Multiple-imputation estimates                   Imputations       =         10
Random-effects GLS regression                   Number of obs     =        600
Group variable: id                              Number of groups  =        200

                                                Obs per group:
                                                              min =          3
                                                              avg =        3.0
                                                              max =          3

                                                Average RVI       =     0.0900
                                                Largest FMI       =     0.1563
DF adjustment:   Large sample                   DF:     min       =     389.84
                                                        avg       =   1,816.79
                                                        max       =   3,299.97
Model F test:       Equal FMI                   F(   3, 2669.5)   =      49.94
Within VCE type:   Conventional                 Prob > F          =     0.0000

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .5416812   .0463393    11.69   0.000     .4505749    .6327874
        time |   .0614452   .3543575     0.17   0.862    -.6333505    .7562409
      female |   2.361504   .9042511     2.61   0.009     .5885544    4.134454
       _cons |   22.71853   2.663597     8.53   0.000     17.48349    27.95358
-------------+----------------------------------------------------------------
     sigma_u |  4.4672309
     sigma_e |  6.5185302
         rho |  .31956744   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Note: sigma_u and sigma_e are combined in the original metric.
```
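To see per-coefficient imputation diagnostics, such as the relative variance increase (RVI) and fraction of missing information (FMI) summarized in the header above, **mi estimate** accepts the **vartable** option. A sketch:

```
* display within-, between-, and total-imputation variance for each coefficient
mi estimate, vartable: xtreg read math time female, i(id)
```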
