The appropriate imputation procedure depends on the type of variables to be imputed (categorical, continuous, or binary) and on the pattern of missingness in the data (discussed below). Many of these procedures can be performed with PROC MI. This page covers imputing missing data when the variables to be imputed are all continuous and have a monotone missing data pattern.
Examples
This example uses data from the 200-subject version of the High School and Beyond dataset. The dataset contains high school students’ scores on tests in different academic areas, and it has been modified so that some of the cases have missing values. You can download the dataset here: https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hsb2_w_missing.sas7bdat. For this example, let's assume that a researcher wants to use these data to test the theory that science aptitude is predicted by students’ aptitude in reading, writing, and math. Unfortunately, some students are missing data on the four variables in our analysis.
Patterns of Missingness
Dataset 1: Monotone Missing Data
id V1 V2 V3 V4
1 2 5 9 3
2 3 1 2 .
3 2 6 5 .
4 1 4 . .
5 3 . . .
Dataset 2: Non-Monotone Missing Data
id V1 V2 V3 V4
1 2 5 9 3
2 3 7 . .
3 2 . 5 9
4 1 4 . 2
5 3 . 5 .
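As a minimal sketch, the two toy datasets above can be entered with DATALINES and passed to PROC MI with NIMPUTE=0, which skips imputation and prints only the model information and the Missing Data Patterns table. The dataset names work.monotone and work.nonmonotone are hypothetical and chosen only for this illustration.

```sas
/* Sketch: enter the two toy datasets shown above.
   The names work.monotone and work.nonmonotone are hypothetical. */
data work.monotone;
  input id V1 V2 V3 V4;   /* a lone period is read as a missing value */
  datalines;
1 2 5 9 3
2 3 1 2 .
3 2 6 5 .
4 1 4 . .
5 3 . . .
;
run;

data work.nonmonotone;
  input id V1 V2 V3 V4;
  datalines;
1 2 5 9 3
2 3 7 . .
3 2 . 5 9
4 1 4 . 2
5 3 . 5 .
;
run;

/* NIMPUTE=0 requests no imputations, so PROC MI only reports
   the missing data patterns for the variables on the VAR statement. */
proc mi data=work.monotone nimpute=0;
  var V1 V2 V3 V4;
run;

proc mi data=work.nonmonotone nimpute=0;
  var V1 V2 V3 V4;
run;
```

In the first dataset every case that is missing a variable is also missing all variables to its right, which is what makes the pattern monotone. The second dataset has gaps that cannot be arranged that way.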
proc means data=mi.hsb2_w_missing ;
var read science math write;
run;
The MEANS Procedure
Variable      N          Mean       Std Dev       Minimum       Maximum
-------------------------------------------------------------------------------
READ        200    52.2300000    10.2529368    28.0000000    76.0000000
SCIENCE     193    52.3367876     9.7004910    26.0000000    74.0000000
MATH        185    53.5567568     9.1001834    35.0000000    75.0000000
WRITE       175    54.1542857     8.8261739    31.0000000    67.0000000
-------------------------------------------------------------------------------
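The N column above counts nonmissing values, so the number of missing values per variable has to be worked out by subtracting from 200. A small variation, sketched here, asks PROC MEANS for the missing counts directly with the NMISS statistic keyword. It assumes the libref mi is already assigned, as in the code above.

```sas
/* Sketch: report nonmissing (N) and missing (NMISS) counts per variable.
   Assumes the libref mi already points at the folder holding the data. */
proc means data=mi.hsb2_w_missing n nmiss;
  var read science math write;
run;
```

Given the N values shown above and 200 cases in total, the missing counts should come out to 0 for READ, 7 for SCIENCE, 15 for MATH, and 25 for WRITE.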
Examining distributions of missing values in SAS
We can examine the patterns of missing values by recoding each variable into a dummy variable in which 1 is missing and 0 is nonmissing. We then use PROC FREQ with a TABLES statement and the LIST option to compute the frequency of each pattern of missing data.
data mi.hsb2_w_missing2 (drop=i);
  set mi.hsb2_w_missing;
  array test1{*} read science math write;
  /* recode each score: 1 = missing, 0 = nonmissing */
  do i = 1 to dim(test1);
    if test1{i} = . then test1{i} = 1;
    else test1{i} = 0;
  end;
run;

proc freq data=mi.hsb2_w_missing2;
  tables read*science*math*write / list;
run;
READ  SCIENCE  MATH  WRITE  Frequency  Percent  Cumulative  Cumulative
                                                 Frequency     Percent
-----------------------------------------------------------------------
   0        0     0      0        175    87.50         175       87.50
   0        0     0      1         10     5.00         185       92.50
   0        0     1      1          8     4.00         193       96.50
   0        1     1      1          7     3.50         200      100.00
This table shows us that 175 cases have no missing data, 10 cases are missing values on just the WRITE variable, 8 cases are missing data on the MATH and WRITE variables, and 7 cases are missing data on SCIENCE, MATH, and WRITE. Since all of the cases that have missing values on SCIENCE also have missing values on both MATH and WRITE, and all of the cases that have missing values on MATH also have missing values on WRITE, we say that the pattern of missingness is monotone. Any time the missing data is, or can be, arranged to form the triangle of 1’s seen in the above table, the pattern is said to be monotone.
Imputing the Missing Values
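A minimal sketch of one way to impute these data with PROC MI is shown here. It uses the MONOTONE statement with the REG (regression) method, which applies to continuous variables with a monotone pattern. The output dataset name work.hsb_mi, the seed value, and the choice of five imputations are assumptions made for illustration, not values taken from this page.

```sas
/* Sketch: multiple imputation for a monotone missing data pattern.
   Hypothetical choices: out=work.hsb_mi, seed=12345, nimpute=5.
   Assumes the libref mi is already assigned. */
proc mi data=mi.hsb2_w_missing out=work.hsb_mi nimpute=5 seed=12345;
  /* regression method for continuous variables under a monotone pattern */
  monotone reg;
  /* order variables from least to most missing so the pattern is monotone:
     READ is complete, then SCIENCE, MATH, and WRITE */
  var read science math write;
run;
```

PROC MI stacks the imputed datasets in the OUT= dataset and adds the variable _Imputation_ to index them. NIMPUTE= is set explicitly because its default differs across SAS/STAT releases. Note that the input is the original dataset, not the recoded 0/1 copy used for the frequency table.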
Analyzing Multiply Imputed Datasets
There are actually two steps in analyzing a multiply imputed dataset. First, we use an analysis procedure to analyze our multiply imputed datasets, and then we use PROC MIANALYZE to combine the results.
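The two steps can be sketched as follows for the regression of science on reading, writing, and math. The sketch assumes the imputed datasets are stacked in a dataset named work.hsb_mi (a hypothetical name) that contains the _Imputation_ index variable created by PROC MI. The dataset name work.regout is also hypothetical.

```sas
/* Step 1 (sketch): fit the same regression within each imputed dataset.
   OUTEST= with COVOUT saves the estimates and their covariance matrices. */
proc reg data=work.hsb_mi outest=work.regout covout noprint;
  model science = read write math;
  by _Imputation_;   /* one model per imputation */
run;

/* Step 2 (sketch): combine the per-imputation results. */
proc mianalyze data=work.regout;
  modeleffects Intercept read write math;
run;
```

PROC MIANALYZE reports a single set of parameter estimates whose standard errors reflect both the within-imputation and the between-imputation variability.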
Discussion
See Also