Applied Regression Analysis Chapter 11: Unusual and Influential Data

Regression analysis, middle of page 269. Create the measwt by female interaction, then run the regression with the interaction.

use https://stats.idre.ucla.edu/stat/stata/examples/ara/davis, clear

generate measwt_f = measwt * female
regress reptwt measwt female measwt_f

  Source |       SS       df       MS                  Number of obs =     183
---------+------------------------------               F(  3,   179) =  470.41
   Model |  30654.7294     3  10218.2431               Prob > F      =  0.0000
Residual |  3888.25423   179  21.7220907               R-squared     =  0.8874
---------+------------------------------               Adj R-squared =  0.8856
   Total |  34542.9836   182  189.796613               Root MSE      =  4.6607

------------------------------------------------------------------------------
  reptwt |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  measwt |   .9898221   .0425995     23.236   0.000       .9057602    1.073884
  female |   39.96412   3.929322     10.171   0.000       32.21037    47.71787
measwt_f |  -.7253627   .0559804    -12.957   0.000      -.8358292   -.6148963
   _cons |    1.35864   3.277192      0.415   0.679      -5.108262    7.825541
------------------------------------------------------------------------------
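The product term built by hand above can also be requested with factor-variable notation. The following is a minimal sketch, assuming Stata 11 or later (factor variables are not available in older releases) and assuming female is coded 0/1, as the later graph commands suggest. The scalar name b_int is illustrative only.

```stata
* Sketch: same interaction model, once with the hand-built product term
* and once with factor-variable notation (assumes Stata 11 or later)
use https://stats.idre.ucla.edu/stat/stata/examples/ara/davis, clear
generate measwt_f = measwt * female
quietly regress reptwt measwt female measwt_f
scalar b_int = _b[measwt_f]

* c. marks measwt as continuous, i. marks female as categorical,
* ## requests both main effects and their interaction
regress reptwt c.measwt##i.female

* the interaction coefficient should agree with the hand-built version
assert reldif(_b[1.female#c.measwt], b_int) < 1e-6
```

Both forms fit the same model; the factor-variable form avoids having to rebuild measwt_f whenever measwt changes.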

Save version with error (for use later) as davis_er.

save davis_er, replace
file davis_er.dta saved

Regression at bottom of page 269. Fix the error in case 12 (measwt and measht were swapped), then run the regression from above again.

generate t = measwt in 12
(199 missing values generated)

replace measwt = measht in 12
(1 real change made)

replace measht = t in 12
(1 real change made)

drop t
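The swap above selects the case by row position (in 12), which only works if subject 12 is stored in observation 12. A short check, offered as a sketch to run after the swap, confirms this and that the corrected weight is now the smaller of the two values:

```stata
* Sketch: sanity checks after the swap
* observation 12 should hold subject 12
assert subject[12] == 12

* show the corrected values
list subject measwt measht in 12

* after the fix, weight should be smaller than height for this case
assert measwt < measht in 12
```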

Make measwt by female interaction again.

replace measwt_f = measwt * female

Run regression.

regress reptwt measwt female measwt_f

  Source |       SS       df       MS                  Number of obs =     183
---------+------------------------------               F(  3,   179) = 2228.78
   Model |  33642.3446     3  11214.1149               Prob > F      =  0.0000
Residual |  900.638968   179  5.03150261               R-squared     =  0.9739
---------+------------------------------               Adj R-squared =  0.9735
   Total |  34542.9836   182  189.796613               Root MSE      =  2.2431

------------------------------------------------------------------------------
  reptwt |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  measwt |   .9898221   .0205023     48.279   0.000       .9493648    1.030279
  female |   1.982518   2.450282      0.809   0.420      -2.852638    6.817673
measwt_f |  -.0566831   .0384548     -1.474   0.142      -.1325662    .0191999
   _cons |    1.35864   1.577248      0.861   0.390      -1.753752    4.471032
------------------------------------------------------------------------------

Save corrected version (for use later) as davis_co.

save davis_co, replace
file davis_co.dta saved
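To see the effect of the single bad case side by side, the two fits can be stored and tabulated together. This is a sketch that assumes davis_er.dta and davis_co.dta were saved in the current working directory by the commands above; the names with_error and corrected are illustrative.

```stata
* Sketch: compare the fit with the error and the corrected fit
use davis_er, clear
quietly regress reptwt measwt female measwt_f
estimates store with_error

use davis_co, clear
quietly regress reptwt measwt female measwt_f
estimates store corrected

* coefficients with standard errors, plus N, R-squared, and root MSE
estimates table with_error corrected, b(%9.4f) se(%9.4f) stats(N r2 rmse)
```

The measwt and _cons rows are the same in both columns because they describe the male group, which does not contain case 12; only the female and measwt_f rows change.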

Page 270, figure 11.2. Show graph similar to figure 11.2 showing outlier. Use the data file with the error.

use https://stats.idre.ucla.edu/stat/stata/examples/ara/davis_er, clear

Run the regression.

regress reptwt measwt female measwt_f

  Source |       SS       df       MS                  Number of obs =     183
---------+------------------------------               F(  3,   179) =  470.41
   Model |  30654.7294     3  10218.2431               Prob > F      =  0.0000
Residual |  3888.25423   179  21.7220907               R-squared     =  0.8874
---------+------------------------------               Adj R-squared =  0.8856
   Total |  34542.9836   182  189.796613               Root MSE      =  4.6607

------------------------------------------------------------------------------
  reptwt |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  measwt |   .9898221   .0425995     23.236   0.000       .9057602    1.073884
  female |   39.96412   3.929322     10.171   0.000       32.21037    47.71787
measwt_f |  -.7253627   .0559804    -12.957   0.000      -.8358292   -.6148963
   _cons |    1.35864   3.277192      0.415   0.679      -5.108262    7.825541
------------------------------------------------------------------------------

predict yhat
graph twoway (scatter reptwt measwt, mlabel(female)) (line yhat measwt if female == 1, sort) ///
	(line yhat measwt if female == 0, sort), xlabel(25(25)175)

Middle of page 270, regression analysis. (Note erratum: the last term should be reptwt x female.) Stata results match the results in Fox.

generate reptwt_f = reptwt * female
(17 missing values generated)

regress measwt reptwt female reptwt_f

  Source |       SS       df       MS                  Number of obs =     183
---------+------------------------------               F(  3,   179) =  139.07
   Model |  29786.3783     3  9928.79278               Prob > F      =  0.0000
Residual |  12779.4359   179  71.3934965               R-squared     =  0.6998
---------+------------------------------               Adj R-squared =  0.6947
   Total |  42565.8142   182    233.8781               Root MSE      =  8.4495

------------------------------------------------------------------------------
  measwt |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  reptwt |   .9689183   .0764096     12.681   0.000       .8181387    1.119698
  female |   2.074211   9.297269      0.223   0.824      -16.27214    20.42056
reptwt_f |  -.0095251   .1468546     -0.065   0.948      -.2993141    .2802639
   _cons |    1.79428   5.923944      0.303   0.762       -9.89547    13.48403
------------------------------------------------------------------------------

Page 271, make hat value and show largest hat values.

Run the regression.

regress reptwt measwt female measwt_f

  Source |       SS       df       MS                  Number of obs =     183
---------+------------------------------               F(  3,   179) =  470.41
   Model |  30654.7294     3  10218.2431               Prob > F      =  0.0000
Residual |  3888.25423   179  21.7220907               R-squared     =  0.8874
---------+------------------------------               Adj R-squared =  0.8856
   Total |  34542.9836   182  189.796613               Root MSE      =  4.6607

------------------------------------------------------------------------------
  reptwt |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  measwt |   .9898221   .0425995     23.236   0.000       .9057602    1.073884
  female |   39.96412   3.929322     10.171   0.000       32.21037    47.71787
measwt_f |  -.7253627   .0559804    -12.957   0.000      -.8358292   -.6148963
   _cons |    1.35864   3.277192      0.415   0.679      -5.108262    7.825541
------------------------------------------------------------------------------

Generate hat values, calling the result myhat.

predict myhat, hat

Middle of page 271, get the largest hat value, .714.

summarize myhat, detail

                          Leverage
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .0099067       .0099067
 5%     .0099302       .0099067
10%     .0101496       .0099067       Obs                 200
25%     .0110274       .0099067       Sum of Wgt.         200

50%     .0131432                      Mean           .0212232
                        Largest       Std. Dev.      .0514178
75%     .0185647       .0687759
90%     .0285695       .0732077       Variance       .0026438
95%     .0456122       .1668405       Skewness       12.45924
99%     .1200241       .7141857       Kurtosis       166.9527

Lower part of page 271, subject 12 has the hat value over .7.

list subject myhat if myhat > .7

       subject      myhat 
 12.        12   .7141857
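A hat value can be judged against the average hat value, (k + 1)/n, where k is the number of regressors and n the number of observations in the fit. A commonly cited rule of thumb flags hat values above twice the average. The following sketch assumes the regression above is still the most recent estimation and that myhat is in memory; the scalar name hbar is illustrative.

```stata
* Sketch: flag hat values above twice the average hat value
* e(df_m) is the number of regressors, e(N) the number of observations used
scalar hbar = (e(df_m) + 1) / e(N)
display "average hat = " hbar "    twice the average = " 2*hbar

* missing values compare as larger than any number, so exclude them explicitly
list subject myhat if myhat > 2*hbar & !missing(myhat) & e(sample)
```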

Page 274, middle of page. Make studentized residual and show largest value.

Make residual.

predict res, rstud
(17 missing values generated)

Get the studentized residual that is largest in absolute value, -24.3.

summarize res, detail

                    Studentized residuals
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -2.349376      -24.30446
 5%    -1.466444      -2.349376
10%    -.9816217      -2.189853       Obs                 183
25%    -.5037653      -1.959426       Sum of Wgt.         183

50%    -.0284926                      Mean          -.0961781
                        Largest       Std. Dev.      2.008318
75%     .4462518       2.393195
90%     1.040694        2.90657       Variance       4.033341
95%     1.566641       3.081378       Skewness      -9.557869
99%     3.081378       3.496628       Kurtosis       116.8471

Subject 12 had the largest residual.

list subject res if res < -24.3

       subject        res 
 12.        12  -24.30446
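One way to judge whether the most extreme studentized residual is surprising is a Bonferroni-adjusted t test: the residual is referred to a t distribution with n - k - 2 degrees of freedom, and the two-sided p-value is multiplied by n. The sketch below assumes the regression above is still the most recent estimation and that res is in memory; the scalar names are illustrative.

```stata
* Sketch: Bonferroni outlier test for the largest absolute studentized residual
quietly summarize res
scalar tmax = max(abs(r(min)), abs(r(max)))

* e(df_r) is n - k - 1, so the studentized residual has e(df_r) - 1 df
scalar p_unadj = 2 * ttail(e(df_r) - 1, tmax)

* multiply by the number of residuals examined, capping at 1
scalar p_bonf = min(1, e(N) * p_unadj)
display "max |t| = " tmax "    Bonferroni p = " p_bonf
```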

Middle of page 276, computing DFBETA and making index plot (plot described but not shown). The dfbeta command computes the DFBETA for measwt (called DFmeaswt), for female (called DFfemale), and for measwt_f (called DF1).

dfbeta
(17 missing values generated)
DFmeaswt:  DFbeta(measwt)
(17 missing values generated)
DFfemale:  DFbeta(female)
(17 missing values generated)
DF1:       DFbeta(measwt_f)

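If your version of Stata stores the DFBETA for measwt_f under the name DFmeaswt_f instead of DF1, rename it so that the commands below work as written. Skip this step if DF1 already exists.

rename DFmeaswt_f DF1

The index plot shows an observation influencing female and influencing female*measwt.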

graph twoway scatter DFmeaswt DFfemale DF1 subject

Show same plot, but use subject number as symbol to identify which subject has influential data. It is subject 12.

graph twoway scatter DFmeaswt DFfemale DF1 subject, mlabel(subject subject subject)

Scatterplot of dfbetas, suggested near bottom page 276.

graph matrix DFmeaswt DFfemale DF1 subject, mlabel(subject)
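Besides reading the plots, the scaled DFBETA values that dfbeta produces can be screened numerically. A commonly cited cutoff is 2/sqrt(n) in absolute value. The sketch below assumes the variables DFmeaswt, DFfemale, and DF1 exist under those names and that the regression above is still the most recent estimation; the scalar name dfb_cut is illustrative.

```stata
* Sketch: list cases whose DFBETA exceeds 2/sqrt(n) for any coefficient
scalar dfb_cut = 2 / sqrt(e(N))
display "DFBETA cutoff = " dfb_cut

* exclude missing values, which would otherwise satisfy the > comparisons
list subject DFmeaswt DFfemale DF1 ///
    if (abs(DFmeaswt) > dfb_cut | abs(DFfemale) > dfb_cut | abs(DF1) > dfb_cut) ///
    & !missing(DFmeaswt, DFfemale, DF1)
```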

Bottom part of page 277, computing and showing Cook’s D, DFFITS, DFBETAS.

Already computed DFBETA above using dfbeta command.

Compute Cook’s D.

predict d, cooksd
(17 missing values generated)

Compute DFFITS.

predict dfit, dfits
(17 missing values generated)

Use summarize with the detail option to see the most extreme values of Cook’s D, DFFITS, and DFBETA, as at the bottom of page 277.

summarize d dfit DFmeaswt DFfemale DF1 , detail

                          Cook's D
-------------------------------------------------------------
      Percentiles      Smallest
 1%     2.11e-06       2.11e-06
 5%     .0000228       2.11e-06
10%      .000026       2.11e-06       Obs                 183
25%     .0001244       .0000157       Sum of Wgt.         183

50%     .0007961                      Mean           .4738773
                        Largest       Std. Dev.      6.351621
75%     .0032174       .0651359
90%     .0096058       .0701759       Variance       40.34309
95%     .0199879       .0856294       Skewness       13.41655
99%     .0856294       85.92735       Kurtosis       181.0043

                            Dfits
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.4308479      -38.41931
 5%    -.1958499      -.4308479
10%    -.1174221       -.296132       Obs                 183
25%    -.0592021      -.2823458       Sum of Wgt.         183

50%    -.0028959                      Mean          -.2012365
                        Largest       Std. Dev.      2.843795
75%     .0560052       .5108672
90%     .1362741       .5115646       Variance       8.087169
95%     .2121069        .540725       Skewness      -13.37203
99%      .540725       .6033236       Kurtosis        180.215

                          DFmeaswt
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.1318417      -.1449406
 5%     -.059406      -.1318417
10%    -.0427876      -.0978029       Obs                 183
25%    -.0000766      -.0921938       Sum of Wgt.         183

50%    -5.92e-16                      Mean           .0003997
                        Largest       Std. Dev.      .0540191
75%     1.22e-16       .1096412
90%     .0141305       .2565254       Variance       .0029181
95%     .0314289       .2809305       Skewness       5.222216
99%     .2809305       .4918421       Kurtosis       45.52623

                          DFfemale
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.2217265      -.2219922
 5%    -.1140623      -.2217265
10%     -.063526      -.2097966       Obs                 183
25%    -.0294111      -.1823008       Sum of Wgt.         183

50%    -.0050063                      Mean           .0941992
                        Largest       Std. Dev.      1.482754
75%     .0039615       .1956965
90%     .0166306       .2036534       Variance       2.198559
95%     .0342111       .3870315       Skewness       13.38578
99%     .3870315       20.02775       Kurtosis       180.4558

                             DF1
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.3742779       -24.7525
 5%    -.0256302      -.3742779
10%    -.0133058      -.2137802       Obs                 183
25%    -.0021874      -.1952085       Sum of Wgt.         183

50%     .0054611                      Mean          -.1163372
                        Largest       Std. Dev.      1.832269
75%     .0328629       .2331503
90%     .0701569       .2636165       Variance        3.35721
95%     .1102957       .2943538       Skewness      -13.39209
99%     .2943538       .3174025       Kurtosis       180.5693

We can see these values for subject 12.

list d dfit DFmeaswt DFfemale DF1 if subject==12

             d       dfit   DFmeaswt   DFfemale        DF1 
 12.  85.92735  -38.41931   8.62e-13   20.02775   -24.7525
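Cook’s D and DFFITS can also be screened against numerical cutoffs. Two commonly cited rules of thumb are D > 4/(n - k - 1) and |DFFITS| > 2*sqrt((k + 1)/(n - k - 1)). The sketch below assumes d and dfit are in memory and the regression above is still the most recent estimation; the scalar names are illustrative.

```stata
* Sketch: flag cases by rule-of-thumb cutoffs for Cook's D and DFFITS
* e(df_r) is n - k - 1 and e(df_m) is k
scalar d_cut    = 4 / e(df_r)
scalar dfit_cut = 2 * sqrt((e(df_m) + 1) / e(df_r))
display "Cook's D cutoff = " d_cut "    DFFITS cutoff = " dfit_cut

* exclude missing values, which would otherwise satisfy the > comparisons
list subject d dfit if (d > d_cut | abs(dfit) > dfit_cut) & !missing(d, dfit)
```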

Top of page 279, computing COVRATIO. We can compute a variable covrat containing the COVRATIO.

predict covrat, covratio
(17 missing values generated)

Look at summary stats for covrat.

summarize covrat

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  covrat |     183    1.018453   .0833018   .0102869    1.19215

You can see that subject 12 has the smallest covrat.

list subject covrat if covrat < .02

       subject     covrat 
 12.        12   .0102869
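COVRATIO values near 1 indicate little effect on the precision of the estimates. A commonly cited rule of thumb flags cases with |COVRATIO - 1| > 3(k + 1)/n. The sketch below assumes covrat is in memory and the regression above is still the most recent estimation; the scalar name cov_cut is illustrative.

```stata
* Sketch: flag cases whose COVRATIO is far from 1
scalar cov_cut = 3 * (e(df_m) + 1) / e(N)
display "COVRATIO cutoff for |covrat - 1| = " cov_cut

* exclude missing values, which would otherwise satisfy the > comparison
list subject covrat if abs(covrat - 1) > cov_cut & !missing(covrat)
```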

Page 283 bottom, and figure 11.5 page 284, partial regression plots.

use https://stats.idre.ucla.edu/stat/stata/examples/ara/duncan, clear

Fit regression at bottom of page 283.

regress prestige income educ

  Source |       SS       df       MS                  Number of obs =      45
---------+------------------------------               F(  2,    42) =  101.22
   Model |  36180.9458     2  18090.4729               Prob > F      =  0.0000
Residual |  7506.69865    42   178.73092               R-squared     =  0.8282
---------+------------------------------               Adj R-squared =  0.8200
   Total |  43687.6444    44   992.90101               Root MSE      =  13.369

------------------------------------------------------------------------------
prestige |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  income |   .5987328   .1196673      5.003   0.000       .3572343    .8402313
    educ |   .5458339   .0982526      5.555   0.000       .3475521    .7441158
   _cons |  -6.064663   4.271941     -1.420   0.163      -14.68579    2.556463
------------------------------------------------------------------------------

Figure 11.5 on page 284 can be obtained by using the avplots command.

avplots
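A partial regression (added-variable) plot for income can also be built by hand: regress prestige on the other predictor and keep the residuals, regress income on the other predictor and keep the residuals, then plot one set of residuals against the other. The slope of that plot equals the income coefficient from the full model. This is a sketch; the names e_y, e_x, and b_income are illustrative, and the full model is refit at the end so that later predict commands refer to it.

```stata
* Sketch: added-variable plot for income, built by hand
quietly regress prestige income educ
scalar b_income = _b[income]

* residuals of the response after removing educ
quietly regress prestige educ
predict double e_y, residuals

* residuals of income after removing educ
quietly regress income educ
predict double e_x, residuals

* the slope here reproduces the income coefficient from the full model
regress e_y e_x
assert reldif(_b[e_x], b_income) < 1e-6

twoway (scatter e_y e_x) (lfit e_y e_x)

* refit the full model so it is again the most recent estimation
quietly regress prestige income educ
```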

Middle of page 284 influence statistics, and figure 11.6 on page 285.

Make and show hat, residual, Cook’s D.

predict hat1, hat
predict res, rstud
predict d, cooksd
summarize hat1 res d, detail

                          Leverage
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .0241298       .0241298
 5%     .0262859       .0246816
10%     .0327049       .0262859       Obs                  45
25%     .0467892        .031326       Sum of Wgt.          45

50%       .05732                      Mean           .0666667
                        Largest       Std. Dev.      .0438265
75%     .0705812       .0878518
90%      .082588       .1730582       Variance       .0019208
95%     .1730582       .1945416       Skewness       3.050461
99%     .2690896       .2690896       Kurtosis       13.16723

                    Studentized residuals
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -2.397022      -2.397022
 5%    -1.760491      -1.930919
10%    -1.433249      -1.760491       Obs                  45
25%    -.4980818      -1.704032       Sum of Wgt.          45

50%     .0505098                      Mean            .006828
                        Largest       Std. Dev.      1.055861
75%     .5083882       1.602429
90%     1.068858       1.887047       Variance       1.114843
95%     1.887047       2.043805       Skewness        .296872
99%     3.134519       3.134519       Kurtosis       3.928454

                          Cook's D
-------------------------------------------------------------
      Percentiles      Smallest
 1%     1.20e-08       1.20e-08
 5%     .0000537       .0000128
10%     .0001334       .0000537       Obs                  45
25%     .0016885       .0000784       Sum of Wgt.          45

50%     .0058424                      Mean           .0317011
                        Largest       Std. Dev.      .0898235
75%     .0236292       .0809681
90%     .0585235       .0989846       Variance       .0080683
95%     .0989846       .2236412       Skewness       5.064094
99%     .5663797       .5663797       Kurtosis       29.68371

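Figure 11.6 on page 285. Scatterplot of the studentized residuals (res) against the hat values (hat1), with marker size weighted by Cook’s D. Observations with a hat value of at least 0.13 or a studentized residual of -2.1 or less are labeled with their occupational title through the mlabel(occtitle) option.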

graph twoway (scatter res hat1 [w=d],msymbol(Oh)) ///
		 (scatter res hat1 if res <= -2.1 | hat1 >= .13, mlabel(occtitle) msymbol(i)), ///
		 xlabel(0(.05).3) ylabel(-2.5(2.5)5) yline(-2.1 0 2.1) xline(.13 .20)
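The vertical reference lines at .13 and .20 correspond to twice and three times the average hat value, (k + 1)/n: with k = 2 regressors and n = 45 observations, 2(3)/45 is about 0.133 and 3(3)/45 is 0.20. The sketch below recomputes these values and lists the cases the plot draws attention to. It assumes regress prestige income educ is still the most recent estimation and that hat1, res, and d are in memory. Unlike the graph command, it also lists large positive residuals.

```stata
* Sketch: recompute the hat-value reference lines and list notable cases
display "2 x average hat = " 2 * (e(df_m) + 1) / e(N)
display "3 x average hat = " 3 * (e(df_m) + 1) / e(N)

* cases with a high hat value or a large studentized residual in either direction
list occtitle hat1 res d if (abs(res) >= 2.1 | hat1 >= .13) & !missing(res, hat1)
```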