Regression Analysis by Example, by Chatterjee, Hadi and Price
Chapter 4: Regression Diagnostics: Detection of Model Violations

Inputting the Hamilton data, table 4.1, p. 95

data p095;
  input Y X1 X2;
cards;
12.37 2.23 9.66
12.66 2.57 8.94
12    3.87 4.4
11.93 3.1  6.64
11.06 3.39 4.91
13.03 2.83 8.52
13.13 3.02 8.04
11.44 2.14 9.05
12.86 3.04 7.71
10.84 3.26 5.11
11.2  3.39 5.05
11.56 2.35 8.51
10.83 2.76 6.59
12.63 3.9  4.9
12.46 3.16 6.96
;
run;

Correlations and scatter plot matrix, p. 94
The scatter plot matrix is created by invoking the %scatter macro, which must already be defined in the session.

proc corr data = p095;
  var y x1 x2;
run;
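/* Positional arguments may not follow keyword arguments in a macro call. */
/* This assumes the macro's variable-list parameter is named var. */
%scatter(data=p095, var=y x1 x2);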
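If the %scatter macro is not available, a comparable scatter plot matrix can be sketched with PROC SGSCATTER, which ships with SAS 9.2 and later. This is an alternative, not the macro used on this page, and the DIAGONAL option is included only for illustration.

```sas
/* Scatter plot matrix of Y, X1 and X2 without a user-written macro. */
proc sgscatter data = p095;
  matrix y x1 x2 / diagonal = (histogram);
run;
```

The plot makes the pattern in the correlation table easy to see: X1 and X2 are strongly negatively related, while Y shows little marginal relationship with either.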

The CORR Procedure

3 Variables:    Y    X1    X2

                          Simple Statistics

Variable     N        Mean     Std Dev         Sum     Minimum     Maximum
Y           15    12.00000     0.80217   180.00000    10.83000    13.13000
X1          15     3.00067     0.53470    45.01000     2.14000     3.90000
X2          15     6.99933     1.78145   104.99000     4.40000     9.66000

Pearson Correlation Coefficients, N = 15
Prob > |r| under H0: Rho=0

              Y          X1          X2
Y       1.00000     0.00250     0.43407
                     0.9930      0.1060

X1      0.00250     1.00000    -0.89978
         0.9930                  <.0001

X2      0.43407    -0.89978     1.00000
         0.1060      <.0001

Fitted Equations, p. 95.

proc reg data = p095;
 model y = x1;
 model y = x2;
 model y=x1 x2;
run;
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: Y

                        Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               1   0.00005621   0.00005621      0.00   0.9930
Error              13      9.00854      0.69296
Corrected Total    14      9.00860

Root MSE            0.83245    R-Square     0.0000
Dependent Mean     12.00000    Adj R-Sq    -0.0769
Coeff Var           6.93704

                        Parameter Estimates

                  Parameter     Standard
Variable    DF     Estimate        Error   t Value   Pr > |t|
Intercept    1     11.98876      1.26689      9.46     <.0001
X1           1      0.00375      0.41608      0.01     0.9930

The REG Procedure
Model: MODEL2
Dependent Variable: Y

                        Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               1      1.69736      1.69736      3.02   0.1060
Error              13      7.31124      0.56240
Corrected Total    14      9.00860

Root MSE            0.74994    R-Square     0.1884
Dependent Mean     12.00000    Adj R-Sq     0.1260
Coeff Var           6.24946

                        Parameter Estimates

                  Parameter     Standard
Variable    DF     Estimate        Error   t Value   Pr > |t|
Intercept    1     10.63194      0.81094     13.11     <.0001
X2           1      0.19546      0.11251      1.74     0.1060

The REG Procedure
Model: MODEL3
Dependent Variable: Y

                        Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               2      9.00722      4.50361   39222.3   <.0001
Error              12      0.00138   0.00011482
Corrected Total    14      9.00860

Root MSE            0.01072    R-Square     0.9998
Dependent Mean     12.00000    Adj R-Sq     0.9998
Coeff Var           0.08930

                        Parameter Estimates

                  Parameter     Standard
Variable    DF     Estimate        Error   t Value   Pr > |t|
Intercept    1     -4.51541      0.06114    -73.85     <.0001
X1           1      3.09701      0.01227    252.31     <.0001
X2           1      1.03186      0.00368    280.08     <.0001

Inputting the New York Rivers data, p. 10.

data p010;
  length river $ 20 ;
  input river x1 x2 x3 x4 y;
  label x1 = 'Agriculture'
        x2 = 'Forest'
        x3 = 'Residential'
        x4 = 'Commercial/Industrial'
        y  = 'Nitrogen';
cards;
Olean 26 63 1.2 0.29 1.1
Cassadaga 29 57 0.7 0.09 1.01
Oatka 54 26 1.8 0.58 1.9
Neversink 2 84 1.9 1.98 1
Hackensack 3 27 29.4 3.11 1.99
Wappinger 19 61 3.4 0.56 1.42
Fishkill 16 60 5.6 1.11 2.04
Honeoye  40 43 1.3 0.24 1.65
Susquehanna 28 62 1.1 0.15 1.01
Chenango 26 60 0.9 0.23 1.21
Tioughnioga 26 53 0.9 0.18 1.33
West_Canada 15 75 0.7 0.16 0.75
East_Canada 6 84 0.5 0.12 0.73
Saranac 3 81 0.8 0.35 0.8
Ausable 2 89 0.7 0.35 0.76
Black 6 82 0.5 0.15 0.87
Schoharie 22 70 0.9 0.22 0.8
Raquette 4 75 0.4 0.18 0.87
Oswegatchie 21 56 0.5 0.13 0.66
Cohocton 40 49 1.1 0.13 1.25
;
run;

Table 4.2, p. 99

proc reg data = p010;
  model y = x1-x4;
run; 
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: y Nitrogen

                        Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               4      2.56985      0.64246      9.15   0.0006
Error              15      1.05273      0.07018
Corrected Total    19      3.62257

Root MSE            0.26492    R-Square     0.7094
Dependent Mean      1.15750    Adj R-Sq     0.6319
Coeff Var          22.88715

                        Parameter Estimates

                                           Parameter    Standard
Variable    Label                    DF     Estimate       Error   t Value   Pr > |t|
Intercept   Intercept                 1      1.72221     1.23408      1.40     0.1832
x1          Agriculture               1      0.00581     0.01503      0.39     0.7046
x2          Forest                    1     -0.01297     0.01393     -0.93     0.3667
x3          Residential               1     -0.00723     0.03383     -0.21     0.8337
x4          Commercial/Industrial     1      0.30503     0.16382      1.86     0.0823

Excluding the observation where River = Neversink.

proc reg data = p010; 
  model y = x1-x4;
  where river ~= 'Neversink';
run; 
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: y Nitrogen

                        Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               4      3.07765      0.76941     20.76   <.0001
Error              14      0.51881      0.03706
Corrected Total    18      3.59646

Root MSE            0.19250    R-Square     0.8557
Dependent Mean      1.16579    Adj R-Sq     0.8145
Coeff Var          16.51280

                        Parameter Estimates

                                           Parameter    Standard
Variable    Label                    DF     Estimate       Error   t Value   Pr > |t|
Intercept   Intercept                 1      1.09947     0.91164      1.21     0.2478
x1          Agriculture               1      0.01014     0.01098      0.92     0.3717
x2          Forest                    1     -0.00759     0.01022     -0.74     0.4701
x3          Residential               1     -0.12379     0.03934     -3.15     0.0071
x4          Commercial/Industrial     1      1.52896     0.34372      4.45     0.0006

Excluding the observation where River = Hackensack.

proc reg data = p010; 
  model y = x1-x4;
  where river ~= 'Hackensack';
run; 
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: y Nitrogen

                        Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               4      2.49968      0.62492     22.24   <.0001
Error              14      0.39336      0.02810
Corrected Total    18      2.89304

Root MSE            0.16762    R-Square     0.8640
Dependent Mean      1.11368    Adj R-Sq     0.8252
Coeff Var          15.05109

                        Parameter Estimates

                                           Parameter    Standard
Variable    Label                    DF     Estimate       Error   t Value   Pr > |t|
Intercept   Intercept                 1      1.62601     0.78109      2.08     0.0562
x1          Agriculture               1      0.00235     0.00954      0.25     0.8088
x2          Forest                    1     -0.01276     0.00881     -1.45     0.1698
x3          Residential               1      0.18116     0.04439      4.08     0.0011
x4          Commercial/Industrial     1      0.07562     0.11396      0.66     0.5177

Fig. 4.5, p. 102.

symbol v=dot h=.8 c=blue;
proc gplot data = p010;
  plot y*x4;
run;
quit;

Fig. 4.6, p. 102-103.

proc reg data = p010 noprint;
  model y=x4;
  plot student.*obs. h.*obs.;
  output out=resid student=stdresid H=leverage; 
run;
quit;

Table 4.3, p. 103. The standardized residuals and the leverage values.

proc print data = resid;
 var stdresid leverage;
run;

Obs    stdresid    leverage

  1     0.03228     0.05469
  2    -0.04502     0.06670
  3     1.95292     0.05038
  4    -1.84723     0.24787
  5     0.15529     0.67101
  6     0.67231     0.05018
  7     1.92326     0.08261
  8     1.56562     0.05700
  9    -0.09515     0.06232
 10     0.38082     0.05752
 11     0.74924     0.06038
 12    -0.81033     0.06166
 13    -0.83246     0.06443
 14    -0.82939     0.05253
 15    -0.93761     0.05253
 16    -0.47590     0.06232
 17    -0.72323     0.05806
 18    -0.50049     0.06038
 19    -1.03103     0.06371
 20     0.57473     0.06371
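A common rule of thumb treats a leverage value above 2(p+1)/n, twice the average leverage, as high. The sketch below applies that cutoff to the resid data set created above. The rule itself and the names highlev, obs and cutoff are assumptions added for illustration, not part of the original page.

```sas
/* Flag high-leverage observations in the simple regression of y on x4. */
/* p = 1 predictor, so the cutoff is 2*(1 + 1)/n.                        */
data highlev;
  set resid nobs = n;       /* n = number of observations in resid */
  obs = _n_;                /* keep the original observation number */
  cutoff = 2*(1 + 1)/n;
  if leverage > cutoff;     /* keep only the flagged rows */
run;

proc print data = highlev;
  var obs stdresid leverage cutoff;
run;
```

With n = 20 the cutoff is 0.2, so observations 4 (Neversink) and 5 (Hackensack) are the only rows in the table above that exceed it.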

Fig. 4.7(a) and 4.7(b), p. 107.
Plotting Cook's distance and DFFITS against the observation number.

symbol v=dot h=.8 c=blue;
proc reg data = p010 noprint;
  model y = x4;
  plot cookd.*obs. dffits.*obs.;
  output out=resid r=r H=h cookd=CookD dffits=dffits ; 
run;
quit;
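For reference, the two measures requested by cookd= and dffits= can be written in terms of the leverage values and studentized residuals. The notation is an assumption chosen to match the rest of this page: p_ii is the leverage (H= in the OUTPUT statement), r_i the internally studentized residual (student=), r_i^* the externally studentized residual, and p the number of predictors.

```latex
C_i = \frac{r_i^{2}}{p+1}\cdot\frac{p_{ii}}{1-p_{ii}},
\qquad
\mathrm{DFFITS}_i = r_i^{*}\,\sqrt{\frac{p_{ii}}{1-p_{ii}}},
\qquad
r_i^{*} = r_i\,\sqrt{\frac{n-p-2}{\,n-p-1-r_i^{2}\,}} .
```

As a check against the tables on this page, observation 4 has r = -1.84723 and p_ii = 0.24787 with p = 1, which gives C = (3.41226/2)(0.32956), approximately 0.562, in agreement with the listed Cook's distance.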

Creating the values for the Hadi influence measure.

ods listing close;
proc reg data = p010;
  model y = x4;
  ods output anova=temp;
run;
quit;
ods listing;
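data temp;
  set temp;
  /* symputx converts the numeric value without a log note and strips blanks */
  if source = 'Error' then call symputx('sse', ss);  /* residual sum of squares */
  if source = 'Model' then call symputx('p', df);    /* number of predictors */
run;
%put &sse &p ; /* Write the values of sse and p to the log. */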
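data resid2;
   set resid;
   id = _N_;               /* observation number, used as the plot axis */
   d = r/sqrt(&sse);       /* normalized residual */
   /* Hadi's influence measure: potential term plus residual term */
   hadi = h/(1-h) + (&p + 1)*d**2/((1-h)*(1-d**2));
   keep hadi d id dffits cookd h;
run;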
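Written out, the quantity computed in the data step above is the following, where p_ii corresponds to the variable h, e_i to the residual r, SSE to the macro variable &sse, and p to &p.

```latex
H_i = \frac{p_{ii}}{1-p_{ii}}
      + \frac{p+1}{1-p_{ii}}\cdot\frac{d_i^{2}}{1-d_i^{2}},
\qquad
d_i = \frac{e_i}{\sqrt{\mathrm{SSE}}} .
```

The first term grows with the leverage of the observation and the second with the size of its normalized residual, so a large H_i can come from either source.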

Table 4.4, p. 106.

proc print data = resid2;
  var CookD dffits hadi;
run;

Obs      CookD      dffits       hadi

  1    0.00003     0.00755    0.05797
  2    0.00007    -0.01170    0.07170
  3    0.10118     0.49244    0.58357
  4    0.56225    -1.14475    0.77174
  5    0.02459     0.21567    2.04228
  6    0.01194     0.15210    0.10428
  7    0.16653     0.62923    0.59652
  8    0.07408     0.40249    0.37293
  9    0.00030    -0.02385    0.06747
 10    0.00443     0.09180    0.07727
 11    0.01804     0.18753    0.12852
 12    0.02157    -0.20566    0.14126
 13    0.02386    -0.21651    0.14874
 14    0.01907    -0.19351    0.13474
 15    0.02437    -0.21998    0.15786
 16    0.00753    -0.11999    0.09193
 17    0.01612    -0.17708    0.12139
 18    0.00805    -0.12417    0.09247
 19    0.03617    -0.26945    0.19307
 20    0.01124     0.14705    0.10539

Fig. 4.7(c), p. 107. Plotting the Hadi influence measure.

symbol v=dot h=.8 c=blue;
axis2 order=(0 to 20 by 4);
 
proc gplot data = resid2;
  plot hadi*id / haxis=axis2;  
run; 
quit;

Creating the potential and residual components of the Hadi influence measure.

data resid3;
  set resid2;
  po = h/(1-h);                                 /* potential component */
  re = (&p + 1)*d**2/( (1-h)*(1-d**2) );        /* residual component; &p = 1 here */
run;
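Because the two components are the two terms of the Hadi measure, their sum should reproduce the hadi variable carried over from resid2. A minimal check, assuming resid3 has just been created as above (the tolerance 1e-8 and the variable nbad are arbitrary choices for this sketch):

```sas
/* Verify that po + re equals hadi for every observation. */
data _null_;
  set resid3 end = last;
  if abs(hadi - (po + re)) > 1e-8 then do;
    put 'Mismatch at observation ' id;
    nbad + 1;                       /* count rows that do not agree */
  end;
  if last then put 'Rows where po + re differs from hadi: ' nbad;
run;
```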

Fig 4.8, p. 108. The Potential-Residual Plot.

symbol v=dot h=.8 c=blue;
axis2 order=(0 to 2.5 by .5);
 
proc gplot data = resid3;
  plot po*re/ vaxis=axis2;
run; 
quit;

Inputting the Scottish Hills Race data, table 4.5, p. 112

data p112;
  length Race $ 30;
  infile cards expandtabs;  /* the data lines below contain tab characters */
  input Race Time  Distance  Climb;
cards;
Greenmantle_New_Year_Dash  	965  	2.5  	650  
Carnethy  		 	2901  	6  	2500  
Craig_Dunain  			2019  	6  	900  
Ben_Rha  		 	2736  	7.5  	800  
Ben_Lomond  			3736  	8  	3070  
Goatfell  			4393  	8  	2866  
Bens_of_Jura  		  	12277  	16  	7500  
Cairnpapple  			2182  	6  	800  
Scolty  			1785  	5  	800  
Traprain_Law  			2385  	6  	650  
Lairig_Ghru  			11560  	28  	2100  
Dollar  			2583  	5  	2000  
Lomonds_of_Fife  		3900  	9.5  	2200  
Cairn_Table  			2648  	6  	500  
Eildon_Two  			1616  	4.5  	1500  
Cairngorm  			4335  	10  	3000  
Seven_Hills_of_Edinburgh  	5905  	14  	2200  
Knock_Hill  			4719  	3  	350  
Black_Hill  			1045  	4.5  	1000  
Creag_Beag  			1954  	5.5  	600  
Kildoon  			957  	3  	300  
Meall_Ant_Suiche  		1674  	3.5	1500  
Half_Ben_Nevis  		2859  	6  	2200  
Cow_Hill  			1076  	2  	900  
North_Berwick_Law  		1121  	3  	600  
Creag_Dubh  			1573  	4  	2000  
Burnswark  			2066  	6  	800  
Largo  	        		1714  	5  	950  
Criffel  			3030  	6.5  	1750  
Achmony  			1257  	5  	500  
Ben_Nevis  			5135  	10  	4400  
Knockfarrel  			1943  	6  	600  
Two_Breweries_Fell  		10215  	18  	5200  
Cockleroi  			1686  	4.5  	850  
Moffat_Chase  			9590  	20  	5000  
;
run;

Fig. 4.11(a) and 4.11(b), p. 114. The added-variable plots. The same run also saves the residuals and leverage values (out=resid) and the ANOVA table (temp), which are used for the Potential-Residual plot further down.

proc reg data = p112;
  model time = distance climb/partial;
  output out=resid H=h r=r ;
  ods output anova=temp;
run;
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: Time

                        Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               2    202712289    101356144    147.51   <.0001
Error              27     18552505       687130
Corrected Total    29    221264794

Root MSE           828.93294    R-Square     0.9162
Dependent Mean    3341.26667    Adj R-Sq     0.9099
Coeff Var           24.80894

                        Parameter Estimates

                  Parameter     Standard
Variable    DF     Estimate        Error   t Value   Pr > |t|
Intercept    1   -304.14317    263.96279     -1.15     0.2593
Distance     1    395.48217     35.81177     11.04     <.0001
Climb        1      0.40133      0.14962      2.68     0.0123

[Partial regression residual plots (line-printer output from PROC REG, MODEL1): Time against Intercept, Time against Distance, and Time against Climb.]

Fig. 4.12, the partial residual (residual plus component) plot for each predictor.
Note: the par_resid_plot macro generates only one plot per call, so it is invoked once for each predictor.

%par_resid_plot(p112, time, climb distance, distance)

%par_resid_plot(p112, time, climb distance, climb)
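If the par_resid_plot macro is not at hand, the same kind of plot can be sketched directly: a residual plus component plot for predictor X_j graphs e + b_j*X_j against X_j. The code below is an assumption about how this can be done with PROC REG and is not the macro's actual implementation. The data set names est, pr and pr2 and the variables b_distance, b_climb, pr_distance and pr_climb are made up for the example.

```sas
/* Save the coefficient estimates (outest=) and the residuals (r=). */
proc reg data = p112 noprint outest = est;
  model time = distance climb;
  output out = pr r = r;
run;
quit;

/* Attach the two slopes to every row and form residual + component. */
data pr2;
  if _n_ = 1 then set est(keep = distance climb
                          rename = (distance = b_distance climb = b_climb));
  set pr;
  pr_distance = r + b_distance*distance;
  pr_climb    = r + b_climb*climb;
run;

symbol v=dot h=.8 c=blue;
proc gplot data = pr2;
  plot pr_distance*distance pr_climb*climb;
run;
quit;
```

The OUTEST= data set stores each coefficient in a variable named after its regressor, which is why the slopes are renamed before being merged onto the data.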

Generating the variables to be used in the Potential-Residual plot.

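data temp;
  set temp;
  /* residual sum of squares from the ANOVA table saved above */
  if source = 'Error' then call symputx('sse', ss);
run;
data resid2;
  set resid;
  d = r/sqrt(&sse);                      /* normalized residual */
  po = h/(1-h);                          /* potential component */
  re = 3*d**2/( (1-h)*(1-d**2) );        /* residual component, p + 1 = 3 */
run;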

The residual component uses 3 because d squared is multiplied by p + 1, the number of predictors in the model plus one (here 2 + 1 = 3).

The Potential-Residual Plot, fig. 4.13, p. 114.

symbol v=dot h=.8 c=blue; 
proc gplot data = resid2; 
  plot po*re ;
run; 
quit;
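The hard-coded 3 can be avoided by reading the model degrees of freedom from the saved ANOVA table, as was done for the New York Rivers data. A minimal sketch, assuming the temp and resid data sets from the PROC REG step above are still available:

```sas
/* Store SSE and the number of predictors p as macro variables. */
data _null_;
  set temp;
  if source = 'Error' then call symputx('sse', ss);
  if source = 'Model' then call symputx('p', df);
run;

/* Same computation as before, with p + 1 taken from the fitted model. */
data resid2;
  set resid;
  d  = r/sqrt(&sse);
  po = h/(1-h);
  re = (&p + 1)*d**2/( (1-h)*(1-d**2) );
run;
```

Written this way, the step does not need editing if a predictor is added to or dropped from the model statement.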