missing_data

SAS Textbook Examples
Missing Data by Paul D. Allison

Chapter 4: Maximum Likelihood

Table 4.1 on page 21 showing the number of nonmissing cases for each variable, and the means and standard deviations for those cases with data present.

data college;
  set 'd:datamiusnews';
run;

proc means data=college n mean std ;
  var gradrat csat lenroll private stufac rmbrd act;
run;

The MEANS Procedure

Variable       N            Mean         Std Dev
------------------------------------------------
gradrat     1204      60.4053156      18.8890580
csat         779     967.9781772     123.5774927
lenroll     1297       6.1675196       0.9971460
private     1302       0.6390169       0.4804702
stufac      1300      14.8587692       5.1863985
rmbrd        783       4.1451149       1.1695883
act          714      22.1204482       2.5798990
------------------------------------------------

Table 4.2 on page 22 using listwise deletion.

proc reg data = college;
  model gradrat = csat lenroll private stufac rmbrd;
  run;
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: gradrat

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     5          75529          15106      78.68    <.0001
Error                   449          86199      191.97929
Corrected Total         454         161728

Root MSE             13.85566    R-Square     0.4670
Dependent Mean       63.38462    Adj R-Sq     0.4611
Coeff Var            21.85966

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1      -35.02840        7.68461      -4.56      <.0001
csat          1        0.06727        0.00643      10.47      <.0001
lenroll       1        2.41705        0.95904       2.52      0.0121
private       1       13.58806        1.94612       6.98      <.0001
stufac        1       -0.12306        0.13188      -0.93      0.3513
rmbrd         1        2.16169        0.71364       3.03      0.0026

Table 4.3 on page 22 using EM algorithm. The output from proc mi is shown using proc print and proc iml is used to create the mean and standard deviation for each variable.

proc mi data = college seed = 37851 nimpute = 0 noprint;
  em itprint outem = outem ;
  var gradrat csat lenroll private stufac rmbrd act;
run;
proc print data = outem;
run;

Obs  _TYPE_  _NAME_    gradrat       csat   lenroll   private    stufac     rmbrd       act

 1    MEAN               59.86     957.88    6.1694   0.63902    14.864    4.0726    22.220
 2    COV    gradrat    355.71    1352.99   -0.4998   3.60825   -31.142   10.3847    30.584
 3    COV    csat      1352.99   14745.15   23.2381   9.38160  -198.406   67.1206   298.906
 4    COV    lenroll     -0.50      23.24    0.9937  -0.29640     1.382   -0.0188     0.470
 5    COV    private      3.61       9.38   -0.2964   0.23067    -0.916    0.1885     0.291
 6    COV    stufac     -31.14    -198.41    1.3822  -0.91560    26.886   -1.6854    -4.122
 7    COV    rmbrd       10.38      67.12   -0.0188   0.18853    -1.685    1.3290     1.514
 8    COV    act         30.58     298.91    0.4695   0.29118    -4.122    1.5143     7.353

proc iml;
  use outem; 
  read all var _num_ into A;
  read all var {_name_} where (_type_ ^="MEAN") into variable ;
  r=nrow(A);
  v=ncol(A);
  mvec = t(A[1,]);
  var = sqrt(vecdiag(A[2:r,]));
  print variable "   "  mvec[colname="Mean" label=" " format=F6.2] "   "
        var[colname="Std. Dev" label=" " format=F6.3];
quit;

VARIABLE       Mean     Std. Dev
gradrat       59.86       18.860
csat         957.88       121.43
lenroll        6.17        0.997
private        0.64        0.480
stufac        14.86        5.185
rmbrd          4.07        1.153
act           22.22        2.712

Table 4.4 for the correlations.

proc iml;
  use outem ;
  read all var _num_ where (_type_ ^="MEAN") into A;
  b = sqrt(vecdiag(A));
  corrb = diag(1/b)*A*diag(1/b);
  print "Correlation From the EM Algorithm";
  print corrb[label=" "];
quit;

Correlation From the EM Algorithm

        1 0.5907701 -0.026587 0.3983338 -0.318444 0.4776146 0.5980197
0.5907701         1 0.1919785 0.1608618 -0.315116 0.4794722  0.907774
-0.026587 0.1919785         1   -0.6191 0.2674224 -0.016402 0.1737042
0.3983338 0.1608618   -0.6191         1 -0.367662 0.3405044 0.2235772
-0.318444 -0.315116 0.2674224 -0.367662         1 -0.281956  -0.29315
0.4776146 0.4794722 -0.016402 0.3405044 -0.281956         1 0.4843963
0.5980197  0.907774 0.1737042 0.2235772  -0.29315 0.4843963         1

Table 4.5 based on EM algorithm result. We first create a covariance type of data adding the number of observations to the matrix from proc mi.

data outem1;
  set outem end=last;
  array temp(*) _numeric_;
  output;
  if last then do;
  _type_ = "N";
  _name_ = " ";
  do i = 1 to dim(temp);
  temp(i) = 1302; 
  end;
  output;
  end;
drop i;
run;
proc reg data = outem1(type=cov) ;
   model gradrat = csat lenroll private stufac rmbrd;
  run;
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: gradrat

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     5         216812          43362     228.47    <.0001
Error                  1296         245971      189.79261
Corrected Total        1301         462783


Root MSE             13.77652    R-Square     0.4685
Dependent Mean       59.86180    Adj R-Sq     0.4664
Coeff Var            23.01388


                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1      -32.39455        4.35463      -7.44      <.0001
csat          1        0.06688        0.00390      17.14      <.0001
lenroll       1        2.08321        0.53938       3.86      0.0001
private       1       12.91450        1.14657      11.26      <.0001
stufac        1       -0.18139        0.08412      -2.16      0.0312
rmbrd         1        2.40383        0.40010       6.01      <.0001

Table 4.6 on page 27, difficult to do in SAS.

Chapter 5: Multiple Imputation: Basics

Single Random Imputation

The program here is to illustrate the idea presented in this section. The results are not exactly the same because the randomness involved in generating the data sets. We first generate a data set of 10,000 cases, where variable x and y are drawn from a standard, bivariate normal distribution with a correlation .3.

data page28;
     do subject=1 to 10000;
     y=rannor(123456);
	 x=.3*y+sqrt(1-(.3)**2)*rannor(0);
    output;
    end;
 run;
proc corr data = page28 noprob;
  var x y;
run;

The CORR Procedure

   2  Variables:    x        y

                                    Simple Statistics

Variable           N          Mean       Std Dev           Sum       Minimum       Maximum

x              10000       0.00897       1.00804      89.68395      -4.12924       4.09470
y              10000       0.01985       0.99246     198.45285      -4.37004       4.14573

Pearson Correlation Coefficients, N = 10000

              x             y

x       1.00000       0.30019

y       0.30019       1.00000

At this point, we are going to create a data set with half of the x values assigned to be missing completely at random. The new variable is called mx in the new data set.

data page28_miss;
  set page28 nobs=n;
  retain index 10000 count 5000;
  percent = count /index;
  if uniform(_n_) < percent then do;
  a = 1; mx = .;
  count = count - 1;
  end;
  else do;
  a = 0;
  mx = x;
  end;
  index = index - 1;
run;

proc corr data = page28_miss noprob;
  var mx y;
run;

The CORR Procedure

   2  Variables:    mx       y


                                    Simple Statistics

Variable           N          Mean       Std Dev           Sum       Minimum       Maximum

mx              5000       0.00400       1.00469      19.97631      -3.59469       4.09470
y              10000       0.01985       0.99246     198.45285      -4.37004       4.14573


Pearson Correlation Coefficients
     Number of Observations

              mx             y

mx       1.00000       0.29610
            5000          5000

y        0.29610       1.00000
            5000         10000

Now we use the predicted values from the regression of mx on y to substitute for the missing values and check on the correlation based on the substitution. The correlation is .397, overestimating.


proc reg data = page28_miss;
  model mx = y;
run;
quit;

data page28_impute;
  set page28_miss;
  if mx = . then 
  imx = -.00194 + .29571*y;
  else imx = x;
run;

proc corr data = page28_impute noprob;
  var imx y;
run;
                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       -0.00194        0.01358      -0.14      0.8862
y             1        0.29571        0.01349      21.92      <.0001

The CORR Procedure

   2  Variables:    imx      y


                                    Simple Statistics

Variable           N          Mean       Std Dev           Sum       Minimum       Maximum

imx            10000       0.00393       0.73928      39.27347      -3.59469       4.09470
y              10000       0.01985       0.99246     198.45285      -4.37004       4.14573


Pearson Correlation Coefficients, N = 10000

              imx             y

imx       1.00000       0.39698

y         0.39698       1.00000

Finally, we correct the bias by taking random draws from the residual distribution of mx and adding them back to the predicted values of mx.

data page28_impute_random;
  set page28_miss;
  if mx = . then 
  imx = -.00194 + .29571*y + .95073*rannor(_n_);
  else imx = x;
run;

proc corr data = page28_impute_random noprob;
var imx y;
run;

The CORR Procedure

   2  Variables:    imx      y


                                    Simple Statistics

Variable           N          Mean       Std Dev           Sum       Minimum       Maximum

imx            10000      -0.00443       1.00081     -44.33937      -3.67296       4.09470
y              10000       0.01985       0.99246     198.45285      -4.37004       4.14573


Pearson Correlation Coefficients, N = 10000

              imx             y

imx       1.00000       0.29643

y         0.29643       1.00000

Multiple Random Imputation

Table 5.1 on page 30 and the calculation of standard error for the mean of the correlation estimates.

Since we have to run a loop here for at least 8 times as shown in the text, we make use of some SAS macro programming here. The macro loop is based on the data set created earlier in the previous section and takes one parameter of number of times that we want to estimate the correlation coefficient.

%macro loop(num);

   %do i = 1 %to #
		data page28_impute_random;
  			set page28_miss;
  			if mx = . then 
  			imx = -.00194 + .29571*y + .95073*rannor(_n_+&i*10000);
  			else imx = x;
		run;
	
  proc corr data = page28_impute_random noprob OUTP=pcorr&i ;
    var imx y;
  run;
%end;

data pcorr;
  set %do i = 1 %to #
    pcorr&i
  %end;
  ;
  if _name_="imx";
  se = (1 - y*y)/sqrt(10000);
   keep y se;
   label y = "correlation";
   label se = "S.E.";
run;

proc print data=pcorr label noobs;
format se 7.5;
run;

proc sql;
  select ymean as mean_of_r, semean as mean_of_se, 
  sqrt((1/&num)*set + (1+1/&num)*(1/(&num-1))*(y2sum -&num*ymean**2)) as se_r_avg,
  sqrt((1/&num)*set + (1+1/&num)*(1/(&num-1))*(y2sum -&num*ymean**2))/semean*100-100 as ratio
from (select sum(se**2) as set, sum(y*y) as y2sum, mean(y) as ymean, mean(se) as semean
      from pcorr);
quit;
%mend;
%loop(8)

correlation       S.E.

  0.29445      0.00913
  0.28309      0.00920
  0.28426      0.00919
  0.28972      0.00916
  0.29658      0.00912
  0.29449      0.00913
  0.30680      0.00906
  0.30502      0.00907

mean_of_r  mean_of_se  se_r_avg     ratio
-----------------------------------------
 0.294303    0.009133  0.012945  41.74063

Allowing for Random Variation in the Parameter Estimates

Table 5.2 on page 32.

proc mi data=page28_miss seed=37851  nimpute = 8 out=outmi;;
mcmc nbiter = 500 niter = 200;
var mx y;
run;
proc corr data = outmi noprob;
by _imputation_;
  var mx y;
  ods output PearsonCorr=test;
run;

data pcorr;
  set test;
  if variable ="mx";
  se = (1 - y*y)/sqrt(10000);
   keep y se;
   label y = "correlation";
   label se = "S.E.";
run;

proc print data=pcorr label noobs;
format se 7.5;
run;
%let num = 8;
proc sql;
  select ymean as mean_of_r, semean as mean_of_se, 
  sqrt((1/&num)*set + (1+1/&num)*(1/(&num-1))*(y2sum -&num*ymean**2)) as se_r_avg,
  sqrt((1/&num)*set + (1+1/&num)*(1/(&num-1))*(y2sum -&num*ymean**2))/semean*100-100 as ratio
from (select sum(se**2) as set, sum(y*y) as y2sum, mean(y) as ymean, mean(se) as semean
      from pcorr);
quit;

correlation       S.E.

  0.27640      0.00924
  0.28710      0.00918
  0.29265      0.00914
  0.30048      0.00910
  0.29319      0.00914
  0.30283      0.00908
  0.28291      0.00920
  0.30148      0.00909

mean_of_r  mean_of_se  se_r_avg     ratio
-----------------------------------------
 0.292131    0.009146  0.013601  48.71186

MI Example 1 on page 41

Note: The SAS code in the book is for SAS 8.1. Since the publishing of Missing Data, SAS has modified the imputation procedures. The results on this webpage will not match with the results in the book precisely, even though the code is the same.

Figure 5.1 on page 44.

proc mi data = mycollege out = college nimpute=1 seed = 1401
    minimum = 0 600 . . 0 1.26 11
	maximum = 100 1400 . . 100 8.70 31
    round = 1 1 1 1 . 1 1 1;
	 var gradrat csat lenroll private stufac rmbrd act;
	 mcmc nbiter = 100 niter=1 outiter (cov) =g100;
run;
quit;
data fig5_1;
  set g100 (where =(_name_="csat") keep = _iteration_ _name_ gradrat csat);
  slope = gradrat/csat;
  iteration = 100 + _iteration_;
run;
symbol i = spline;
proc gplot data = fig5_1;
  plot slope*iteration ;
run;
quit;

Image fig5_1-1

Table 5.3 on page 45.

proc mi data = mycollege out = college seed = 1401
    minimum = 0 600 . . 0 1.26 11
	maximum = 100 1400 . . 100 8.70 31
    round = 1 1 1 1 . 1 1 1;
	 var gradrat csat lenroll private stufac rmbrd act;
	 mcmc nbiter = 500 niter = 200 chain = multiple  ;
run;

proc reg data = college outest=estimate outseb;
  model gradrat = csat lenroll private stufac rmbrd;
  by _imputation_;
run;
proc transpose data = estimate out=wide (drop=_label_) ;
  by _imputation_;
  var intercept csat lenroll private stufac rmbrd;
run;
proc sort data = wide;
 by _name_;
 run;
proc print data = wide (rename=(col1=est col2=stderr));
run;

Obs    _Imputation_    _NAME_            est     stderr

  1          1         Intercept    -32.7293    4.35540
  2          2         Intercept    -33.6320    4.33494
  3          3         Intercept    -31.0291    4.36411
  4          4         Intercept    -29.5485    4.38421
  5          5         Intercept    -30.7938    4.29643
  6          1         csat           0.0645    0.00390
  7          2         csat           0.0692    0.00387
  8          3         csat           0.0670    0.00398
  9          4         csat           0.0654    0.00400
 10          5         csat           0.0646    0.00388
 11          1         lenroll        2.3325    0.53602
 12          2         lenroll        2.0885    0.53881
 13          3         lenroll        2.0575    0.54997
 14          4         lenroll        1.9638    0.54982
 15          5         lenroll        2.0654    0.54046
 16          1         private       13.6415    1.12855
 17          2         private       12.9080    1.13730
 18          3         private       13.2238    1.14985
 19          4         private       12.8049    1.16374
 20          5         private       12.7397    1.14009
 21          1         rmbrd          2.5091    0.39269
 22          2         rmbrd          2.0074    0.39868
 23          3         rmbrd          2.0282    0.41058
 24          4         rmbrd          2.1758    0.40537
 25          5         rmbrd          2.6136    0.39714
 26          1         stufac        -0.1677    0.08419
 27          2         stufac        -0.1445    0.08445
 28          3         stufac        -0.1927    0.08535
 29          4         stufac        -0.1667    0.08487
 30          5         stufac        -0.1805    0.08368

Figure 5.2 on page 46.

We have created a data set called fig5_1 for figure 5.1 previously and we will continue to use it here.

ods output corrgraph=fig5_2 (keep = lag correlation);
proc arima data=fig5_1;
  identify var= slope  nlag=100;
run;
quit;

symbol i=needle w=5;
axis1 order=(1 to 100 by 1) ;
axis2 order = (-.2 to .3 by .05) label=(a=90);
proc gplot data = fig5_2;
  where lag~=0;
  format correlation 5.2;
  plot correlation*lag /haxis=axis1 vaxis=axis2;
run;
quit;

Image fig5_3

Figure 5.3 on page 48.

proc mi data = mycollege out = college seed = 1401
    minimum = 0 600 . . 0 1.26 11
	maximum = 100 1400 . . 100 8.70 31
    round = 1 1 1 1 . 1 1 1;
	 var gradrat csat lenroll private stufac rmbrd act;
	 mcmc nbiter = 500 niter = 200   ;
run;

proc reg data = college outest=estimate covout;
  model gradrat = csat lenroll private stufac rmbrd;
  by _imputation_;
run;

proc mianalyze data = estimate;
  modeleffects intercept csat lenroll private stufac rmbrd;
run;

                           Multiple Imputation Variance Information

                                                                       Relative       Fraction
             -----------------Variance-----------------                Increase        Missing
Parameter         Between         Within          Total       DF    in Variance    Information

intercept        5.112237      19.334782      25.469467   68.947       0.317287       0.261967
csat          0.000008409    0.000015729    0.000025820   26.191       0.641501       0.432540
lenroll          0.042972       0.293037       0.344604   178.63       0.175973       0.159004
private          0.086331       1.313014       1.416612   747.93       0.078900       0.075599
stufac           0.002583       0.007161       0.010260   43.837       0.432809       0.331872
rmbrd            0.093636       0.158752       0.271115   23.287       0.707789       0.458998

                  Multiple Imputation Parameter Estimates

Parameter        Estimate      Std Error    95% Confidence Limits        DF

intercept      -34.159501       5.046728     -44.2276     -24.0914   68.947
csat             0.068021       0.005081       0.0576       0.0785   26.191
lenroll          2.176581       0.587030       1.0182       3.3350   178.63
private         13.519104       1.190215      11.1825      15.8557   747.93
stufac          -0.132473       0.101291      -0.3366       0.0717   43.837
rmbrd            2.181653       0.520687       1.1053       3.2580   23.287

                       Multiple Imputation Parameter Estimates

                                                                 t for H0:
Parameter         Minimum        Maximum         Theta0   Parameter=Theta0   Pr > |t|

intercept      -37.231787     -32.226869              0              -6.77     <.0001
csat             0.064459       0.071636              0              13.39     <.0001
lenroll          1.891638       2.410736              0               3.71     0.0003
private         13.014425      13.719135              0              11.36     <.0001
stufac          -0.178505      -0.049020              0              -1.31     0.1977
rmbrd            1.842707       2.509079              0               4.19     0.0003

Figure 5.4 on page 49. We choose seed to be 12345 for this example.

proc mi data = mycollege out = college seed = 12345
    minimum = 0 600 . . 0 1.26 11
	maximum = 100 1400 . . 100 8.70 31
    round = 1 1 1 1 . 1 1 1;
	 var gradrat csat lenroll private stufac rmbrd act;
	 mcmc nbiter = 500 niter = 200   ;
run;

proc reg data = college outest=estimate covout;
  model gradrat = csat lenroll private stufac rmbrd;
  by _imputation_;
run;

proc mianalyze data = estimate;
  modeleffects intercept csat lenroll private stufac rmbrd;
run;

                           Multiple Imputation Variance Information

                                                                       Relative       Fraction
             -----------------Variance-----------------                Increase        Missing
Parameter         Between         Within          Total       DF    in Variance    Information

intercept        3.045733      19.316622      22.971502   158.01       0.189209       0.169550
csat          0.000004876    0.000015251    0.000021102   52.026       0.383665       0.303550
lenroll          0.128509       0.291455       0.445666   33.408       0.529109       0.381950
private          0.318826       1.315225       1.697816   78.772       0.290895       0.244290
stufac           0.005102       0.007133       0.013255   18.751       0.858295       0.511354
rmbrd            0.211943       0.159753       0.414084   10.603       1.592022       0.670922

                  Multiple Imputation Parameter Estimates

Parameter        Estimate      Std Error    95% Confidence Limits        DF

intercept      -32.615425       4.792859     -42.0818     -23.1491   158.01
csat             0.068855       0.004594       0.0596       0.0781   52.026
lenroll          2.053894       0.667582       0.6963       3.4115   33.408
private         13.054872       1.303003      10.4612      15.6486   78.772
stufac          -0.194243       0.115129      -0.4354       0.0469   18.751
rmbrd            2.036734       0.643494       0.6139       3.4595   10.603

                       Multiple Imputation Parameter Estimates

                                                                 t for H0:
Parameter         Minimum        Maximum         Theta0   Parameter=Theta0   Pr > |t|

intercept      -34.976524     -30.583509              0              -6.81     <.0001
csat             0.066110       0.072156              0              14.99     <.0001
lenroll          1.664030       2.472921              0               3.08     0.0042
private         12.541527      13.984207              0              10.02     <.0001
stufac          -0.311137      -0.122689              0              -1.69     0.1081
rmbrd            1.469074       2.718011              0               3.17     0.0094

Chapter 6: Multiple Imputation: Complications

Table 6.1 on page 51.

Method 1: Interaction term is created based on the imputed data.

proc mi data = mycollege out = college seed = 1401
    minimum = 0 600 . . 0 1.26 11
	maximum = 100 1400 . . 100 8.70 31
    round = 1 1 1 1 . 1 1 1;
	 var gradrat csat lenroll private stufac rmbrd act;
	 mcmc nbiter = 500 niter = 200   ;
run;

data method1;
  set college;
  csatXpri = csat*private;
run;
proc reg data = method1 outest=estimate covout;
  model gradrat = csat lenroll private stufac rmbrd csatXpri;
  by _imputation_;
run;

proc mianalyze data = estimate;
  modeleffects intercept csat lenroll private stufac rmbrd csatXpri;
run;

Parameter        Estimate      Std Error

intercept      -39.490344       6.718937
csat             0.073566       0.006494
lenroll          2.195024       0.589931
private         21.183207       7.870483
stufac          -0.136234       0.101501
rmbrd            2.210600       0.529051
csatXpri        -0.008134       0.008130

Parameter   Pr > |t|

intercept     <.0001
csat          <.0001
lenroll       0.0003
private       0.0090
stufac        0.1865
rmbrd         0.0004
csatXpri      0.3202

Method 2: Separate chains of data augmentation for private colleges and for public colleges.

data private public;
  set mycollege ;
  if private = 1 then output private;
  else output public;
run;

proc mi data = private out = private_imp seed = 1401
    minimum = 0 600 .  0 1.26 11
	maximum = 100 1400 .  100 8.70 31
    round = 1 1 1 1  1 1 1;
	 var gradrat csat lenroll  stufac rmbrd act;
	 mcmc nbiter = 500 niter = 200   ;
run;

proc mi data = public out = public_imp seed = 1401
    minimum = 0 600 .  0 1.26 11
	maximum = 100 1400 .  100 8.70 31
    round = 1 1 1 1  1 1 1;
	 var gradrat csat lenroll  stufac rmbrd act;
	 mcmc nbiter = 500 niter = 200   ;
run;
data method2;
  set private_imp public_imp;
  csatXpri = csat*private;
run;
proc sort data = method2;
  by _imputation_;
run;
proc reg data =  method2 outest=estimate2 covout;
  model gradrat = csat lenroll private stufac rmbrd csatXpri;
  by _imputation_;
run;
proc mianalyze data = estimate2;
  modeleffects intercept csat lenroll private stufac rmbrd csatXpri;
run;

Parameter        Estimate

intercept      -49.672789
csat             0.084559
lenroll          2.230984
private         35.992044
stufac          -0.209713
rmbrd            2.384819
csatXpri        -0.023945

Parameter   Pr > |t|

intercept     <.0001
csat          <.0001
lenroll       0.0004
private       0.0001
stufac        0.1062
rmbrd         0.0412
csatXpri      0.0076

Method 3: Creating interaction term before imputation.

data method3;
  set mycollege;
  csatXpri = csat*private;
run;

proc mi data = method3 out = m3_out seed = 1401
    minimum = 0 600 . . 0 1.26 11 .
	maximum = 100 1400 . . 100 8.70 31 1400
    round = 1 1 1 1 . 1 1 1 1;
	 var gradrat csat lenroll private stufac rmbrd act csatXpri;
	 mcmc nbiter = 500 niter = 200  ;
run;

proc reg data =  m3_out outest=estimate3 covout;
  model gradrat = csat lenroll private stufac rmbrd csatXpri;
  by _imputation_;
run;
proc mianalyze data = estimate3;
  modeleffects intercept csat lenroll private stufac rmbrd csatXpri;
run;

Parameter        Estimate

intercept      -54.317589
csat             0.088825
lenroll          2.147010
private         41.531167
stufac          -0.165995
rmbrd            2.479377
csatXpri        -0.029622

Parameter    Pr > |t|

intercept      <.0001
csat           <.0001
lenroll        0.0002
private        <.0001
stufac         0.0752
rmbrd          0.0003
csatXpri       0.0028

Example on page 61. The data set is created based on a text file. You can download the text file here.

data college;
infile "https://stats.idre.ucla.edu/wp-content/uploads/2016/02/usnews3.txt" missover ;
input  #2 prit 1-2 csat 11-15 act 16-18 nfresh 53-57 
       #3 tuition_in 13-18   tuition_out 19-24
       #3 rm 30-34 board 35-39 stufac 63-67 gradrat 77-80;
rmbrd = (rm + board)/1000;
private = prit - 1;
lenroll = log(nfresh);
run;

data page61;
  set college;
  if tuition_in ~=. ;
run;

The listwise deletion method result for the mean and standard deviation is obtained by using proc means.

proc means data = page61 mean stderr min max ;
  var board ;
run;

The MEANS Procedure

Variable              Mean       Std Error         Minimum         Maximum
--------------------------------------------------------------------------
board              2060.28      23.3527380     531.0000000         6250.00
--------------------------------------------------------------------------

The EM estimate of correlation between variable board and tuition_in obtained using proc mi. Based on the covariance matrix from proc mi, we calculated the correlation using a little bit of proc iml.

proc mi data = page61 seed = 37851 nimpute = 0 ;
  em itprint outem = outem ;
  var board tuition_in;
run;

The MI Procedure
            EM (MLE) Parameter Estimates
_TYPE_    _NAME_               board      tuition_in
MEAN                     2031.715414     7897.274371
COV       board               424466         1933499
COV       tuition_in         1933499        28580357

           EM (Posterior Mode) Estimates
_TYPE_    _NAME_               board      tuition_in
MEAN                     2031.694808     7897.274371
COV       board               423074         1929028
COV       tuition_in         1929028        28513109

proc iml;
  use outem ;
  read all var _num_ where (_type_ ^="MEAN") into A;
  b = sqrt(vecdiag(A));
  corrb = diag(1/b)*A*diag(1/b);
  print "Correlation From the EM Algorithm";
  print corrb[label=" "];
quit;

Correlation From the EM Algorithm

        1 0.5551223
0.5551223         1

The multiple imputation using data augmentation is obtained below using proc mi. The result is not exactly the same from the book. It may be because of the different choice of seed.

proc mi data = page61 out = impage61 seed = 12345;
   var  board tuition_in;
   mcmc nbiter = 500 niter = 200 outiter (cov) = outiter ;
run;

The MI Procedure

                  Multiple Imputation Parameter Estimates

Variable              Mean      Std Error    95% Confidence Limits        DF

board          2036.013053      21.328071     1993.225     2078.802   52.486

                    Multiple Imputation Parameter Estimates

                                                           t for H0:
Variable           Minimum        Maximum            Mu0    Mean=Mu0   Pr > |t|

board          2025.248823    2049.634738              0       95.46     <.0001

To see how the data augmentation has done, we can use proc univariate to see the extreme values imputed. There is a negative imputed value and the five lowest values are way below the observed minimum for variable board.

proc univariate data = impage61 ;
var board ;
run;

The UNIVARIATE Procedure
Variable:  board

          Extreme Observations

------Lowest------        ----Highest---

    Value      Obs        Value      Obs

 -23.8777     3850         6250      937
  34.4726      583         6250     2209
  34.5104     4320         6250     3481
  72.0237     5418         6250     4753
 223.5688     2429         6250     6025

Now the regression result is as follows and will be used to do the sampling.

proc reg data = page61;
  model board = tuition_in;
run;
quit;

The REG Procedure
Dependent Variable: board

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F
Model                     1      111364505      111364505     378.29    <.0001
Error                   794      233744119         294388
Corrected Total         795      345108624

Root MSE            542.57540    R-Square     0.3227
Dependent Mean     2060.27764    Adj R-Sq     0.3218
Coeff Var            26.33506

                         Parameter Estimates

                      Parameter       Standard
Variable      DF       Estimate          Error    t Value    Pr > |t|
Intercept      1     1497.41087       34.74671      43.10      <.0001
tuition_in     1        0.06765        0.00348      19.45      <.0001

MI Example 3 using hip data set. Notice that hip data set here has four waves rather than three. The last three waves are after releasing from the hospital. The output for parameter estimates on each individual SID has been omitted in the output part.

Table 6.4 on page 76.

Part 1: Listwise deletion by person

data hip2;
  set hip (where = (wave~=1));
  wdum2 = (wave=2);
  wdum3 = (wave=3);
run;
proc sql;
  create table tokeep as
  select sid, count(cesd)+ count(srh) + count(walk) + count(adl) + count(pain) as check
  from hip2
  group by sid;
quit;
proc sql;
  create table hip_list as
  select hip2.*
  from hip2, tokeep
  where tokeep.check = 15 and tokeep.sid = hip2.sid;
quit;

proc glm data = hip_list;
  class sid;
  model cesd = srh walk adl pain  wdum2 wdum3 sid/solution;
run;
quit;

Dependent Variable: CESD

                                      Sum of
Source                     DF        Squares    Mean Square   F Value   Pr > F

Model                     106    12565.89863      118.54621      7.94   <.0001

Error                     196     2926.95063       14.93342

Corrected Total           302    15492.84925

R-Square     Coeff Var      Root MSE     CESD Mean

0.811077      41.46229      3.864379      9.320224

                                        Standard
Parameter             Estimate             Error    t Value    Pr > |t|

Intercept           9.35357013 B      2.76111252       3.39      0.0009
SRH                 2.34162590        0.58561467       4.00      <.0001
WALK               -1.55214163        0.77085849      -2.01      0.0454
ADL                -0.67651277        0.52774928      -1.28      0.2014
PAIN                0.03117661        0.17860643       0.17      0.8616
wdum2               8.00426420        0.64960304      12.32      <.0001
wdum3               7.04479504        0.57893929      12.17      <.0001

Part 2: Listwise deletion by person-time

proc glm data = hip2;
  class sid;
  model cesd = srh walk adl pain  wdum2 wdum3 sid/solution;
run;
quit;

Dependent Variable: CESD

                                      Sum of
Source                     DF        Squares    Mean Square   F Value   Pr > F

Model                     200    23301.31535      116.50658      6.43   <.0001

Error                     252     4565.64090       18.11762

Corrected Total           452    27866.95625

R-Square     Coeff Var      Root MSE     CESD Mean

0.836163      44.23388      4.256480      9.622669

Parameter             Estimate             Error    t Value    Pr > |t|

Intercept           9.37277265 B      2.95992726       3.17      0.0017
SRH                 1.64132997        0.55572595       2.95      0.0034
WALK               -1.38062437        0.76131946      -1.81      0.0710
ADL                -0.33546552        0.53881181      -0.62      0.5341
PAIN                0.21506213        0.16756607       1.28      0.2005
wdum2               8.78676092        0.61275103      14.34      <.0001
wdum3               7.93044189        0.51985970      15.25      <.0001

Part 3: Data augmentation by person-time. Notice that the result here is quite different from the result in the book. This may be due to several reasons. First of all, the result in the book was done in an earlier version of SAS. Secondly, there is a natural randomness, e.g., the choice of seed. The result below is much closer to the second method above.

proc mi data = hip2 out = hip2_out1 seed = 1401;
 	 var cesd srh walk adl pain wdum2 wdum3;
	 mcmc nbiter = 500 niter = 200  ;
run;

ods output ParameterEstimates = est3;
proc glm data = hip2_out1 outstat= test ;
  by _imputation_;
  class sid;
  model cesd = srh walk adl pain  wdum2 wdum3 sid /solution;
run;
quit;
proc mianalyze parms = est3;
  modeleffects intercept srh walk adl pain wdum2 wdum3;
run;

                  Multiple Imputation Parameter Estimates

Parameter        Estimate      Std Error    95% Confidence Limits        DF

intercept        7.112057       3.260271      0.71830     13.50582   2065.1
srh              2.331491       0.422612      1.50305      3.15993   7312.9
walk            -1.364671       0.780848     -2.92266      0.19332    68.41
adl             -0.287099       0.372109     -1.05010      0.47590   27.386
pain             0.456069       0.199120      0.03956      0.87258   19.174
wdum2            8.805726       0.691685      7.37941     10.23204   24.406
wdum3            7.692605       0.725346      6.13175      9.25347   13.525

Part 4: Data augmentation by person.

%macro towide(var);
proc transpose data = hip2  out=hip_&var prefix=&var;
  by sid;
  id wave;
  var &var ;
run;
%mend;
%towide(cesd);
%towide(srh);
%towide(walk);
%towide(adl);
%towide(pain);
data hip_wide;
  merge hip_cesd hip_srh hip_walk hip_adl hip_pain;
  by sid;
run;

proc mi data = hip_wide out = hip3 seed = 1401;
 	 var cesd: srh: walk: adl: pain:;
	 mcmc nbiter = 500 niter = 200  ;
run;

data hip4;
  set hip3;	
  	array c(3) cesd:;
  	array s(3) srh:;
	array w(3) walk:;
	array a(3) adl:;
	array p(3) pain:;
do wave = 1 to 3;
   cesd = c(wave);
   srh = s(wave);
   walk = w(wave);
   adl = a(wave);
   pain = p(wave);
   output;
end;
keep sid cesd srh walk adl pain wave _imputation_;
run; 
data hip4a;
  set hip4;
   wdum2 = (wave=1);
   wdum3 = (wave=2);
run;
ods output parameterestimates = est4;
proc glm data = hip4a  ;
  by _imputation_;
  class sid;
  model cesd = srh walk adl pain  wdum2 wdum3 sid /solution;
run;
quit;
proc mianalyze parms = est4;
  modeleffects intercept srh walk adl pain wdum2 wdum3;
run;

                  Multiple Imputation Parameter Estimates

Parameter        Estimate      Std Error    95% Confidence Limits        DF

intercept        9.250156       3.604442      1.95764     16.54267   38.694
srh              1.480490       0.659000      0.04960      2.91138   12.386
walk            -0.889094       0.770854     -2.43282      0.65463   56.799
adl             -0.113741       0.665701     -1.72931      1.50183   6.2114
pain             0.214135       0.216721     -0.26201      0.69028   11.163
wdum2            8.943499       0.616950      7.68654     10.20046   31.824
wdum3            8.166227       0.568707      7.00730      9.32515   31.648

SAS Textbook Examples Missing Data by Paul D. Allison

Chapter 4: Maximum Likelihood

Chapter 5: Multiple Imputation: Basics

Chapter 6: Multiple Imputation: Complications

SAS Textbook Examples
Missing Data by Paul D. Allison