Applied Regression Analysis by John Fox Chapter 13: Collinearity and Its Purported Remedies

Section 13.1 Detecting Collinearity

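Table 13.1 and Table 13.2, using the data file ericksen. PROC REG fits the regression of the undercount on the eight predictors, and PROC CORR gives the correlations among the predictors.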

proc reg data=ericksen;
  model undcount=perc_min crimrate poverty diffeng hsgrad housing city countprc ;
run;
quit;
proc corr data=ericksen;
  var perc_min crimrate poverty diffeng hsgrad housing city countprc ;
run;

The REG Procedure
Model: MODEL1
Dependent Variable: undcount

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     8      280.79543       35.09943      17.25    <.0001
Error                    57      115.98480        2.03482
Corrected Total          65      396.78023


Root MSE              1.42647    R-Square     0.7077
Dependent Mean        1.92106    Adj R-Sq     0.6667
Coeff Var            74.25437

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       -1.77139        1.38218      -1.28      0.2052
perc_min      1        0.07983        0.02261       3.53      0.0008
crimrate      1        0.03012        0.01300       2.32      0.0241
poverty       1       -0.17837        0.08492      -2.10      0.0401
diffeng       1        0.21512        0.09221       2.33      0.0232
hsgrad        1        0.06129        0.04477       1.37      0.1764
housing       1       -0.03496        0.02463      -1.42      0.1613
city          1        1.15998        0.77064       1.51      0.1378
countprc      1        0.03699        0.00925       4.00      0.0002

The CORR Procedure

   8  Variables:    perc_min crimrate poverty  diffeng  hsgrad   housing  city     countprc

                                    Simple Statistics

Variable           N          Mean       Std Dev           Sum       Minimum       Maximum

perc_min          66      19.43636      17.51441          1283       0.70000      72.60000
crimrate          66      63.06061      24.89107          4162      25.00000     143.00000
poverty           66      13.46818       4.48108     888.90000       6.80000      23.90000
diffeng           66       1.92576       2.45396     127.10000       0.20000      12.70000
hsgrad            66      33.64697       8.49286          2221      17.50000      51.80000
housing           66      15.66515       9.82810          1034       7.00000      52.10000
city              66       0.24242       0.43183      16.00000             0       1.00000
countprc          66      11.72727      24.86737     774.00000             0     100.00000

                           Pearson Correlation Coefficients, N = 66
                                   Prob > |r| under H0: Rho=0

           perc_min   crimrate    poverty    diffeng     hsgrad    housing       city   countprc

perc_min    1.00000    0.65490    0.73842    0.39545    0.53516    0.35679    0.75774   -0.33444
                        <.0001     <.0001     0.0010     <.0001     0.0033     <.0001     0.0061

crimrate    0.65490    1.00000    0.36911    0.51165    0.06656    0.53172    0.72857   -0.23309
             <.0001                0.0023     <.0001     0.5954     <.0001     <.0001     0.0596

poverty     0.73842    0.36911    1.00000    0.15157    0.75064    0.33522    0.53752   -0.15704
             <.0001     0.0023                0.2244     <.0001     0.0059     <.0001     0.2079

diffeng     0.39545    0.51165    0.15157    1.00000   -0.11640    0.34021    0.48036   -0.10819
             0.0010     <.0001     0.2244                0.3520     0.0052     <.0001     0.3872

hsgrad      0.53516    0.06656    0.75064   -0.11640    1.00000    0.23485    0.31482   -0.41422
             <.0001     0.5954     <.0001     0.3520                0.0577     0.0100     0.0005

housing     0.35679    0.53172    0.33522    0.34021    0.23485    1.00000    0.56570   -0.08629
             0.0033     <.0001     0.0059     0.0052     0.0577                <.0001     0.4909

city        0.75774    0.72857    0.53752    0.48036    0.31482    0.56570    1.00000   -0.26882
             <.0001     <.0001     <.0001     <.0001     0.0100     <.0001                0.0291

countprc   -0.33444   -0.23309   -0.15704   -0.10819   -0.41422   -0.08629   -0.26882    1.00000
             0.0061     0.0596     0.2079     0.3872     0.0005     0.4909     0.0291
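
Pairwise correlations can miss near-linear dependencies that involve three or more regressors. As a minimal sketch, assuming the same ericksen data set, the VIF and TOL options on the MODEL statement ask PROC REG to print a variance-inflation factor and a tolerance for each coefficient. The choice of options is an illustration, not necessarily how the textbook tables were produced.

```sas
/* Sketch: the regression from above with collinearity diagnostics added. */
proc reg data=ericksen;
  model undcount = perc_min crimrate poverty diffeng hsgrad housing city countprc
        / vif tol;   /* variance-inflation factors and tolerances */
run;
quit;
```

The variance-inflation factor for a regressor is 1/(1 - R-square), where the R-square comes from regressing that variable on the other regressors, so values well above 1 mark coefficients whose standard errors are inflated by collinearity.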

Section 13.2 Coping With Collinearity: No Quick Fix

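Figure 13.6 on page 358, using the data set ericksen. The example also shows how to create an annotate data set and how to use the COMPRESS function to remove the blanks from a character variable. The regressors are first copied to one-letter names so that each model's variable list makes a short plot label.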

data temp;
  set ericksen;
  M=perc_min;
  C=crimrate;
  P=poverty;
  L=diffeng;
  H=hsgrad;
  O=housing;
  I=city;
  N=countprc;
run;
proc reg data=temp;
   model undcount= M C P L H O I N / selection=cp ;
   ods output  SubsetSelSummary=cperick;
run;
quit;
data subcperick;
  set cperick;
  np=NumInModel+1;
  diff=cp-np;
  if diff <= 10;
  var=compress(VarsInModel); /* remove the blanks from the variable list */
  output;
  drop Model Dependent Control RSquare;
run;
data labels; /* annotate data set that labels each point with its model */
  length function style text $ 8;
  retain function 'label' xsys ysys '2' style 'swiss'
         size 1 when 'a' color 'black';
  set subcperick;
  x=np; y=diff;              /* label coordinates, in data units */
  text=left(put(var, $8.));  /* label text: the compressed variable list */
  position='B';
run;
axis1 order=(4 to 9 by 1) offset=(3, 5);
axis2 order=(0 to 10 by 2) label=(r=0 a=90);
symbol c=black i=none v=none;
proc gplot data=subcperick;
  plot diff*np=1 / annotate=labels vminor=0 hminor=0 haxis=axis1 vaxis=axis2  ;
  label np='p';
  label diff='Cp-p';
run;
quit;

Image chp13Fig6
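
The vertical axis of the plot is Mallows's Cp minus p. As a reminder of what the data step computes, here is the statistic in conventional notation (the symbols are an assumption and may differ from the book's), with p = NumInModel + 1 so that the intercept is counted, followed by a check against the analysis-of-variance table in Section 13.1:

```latex
C_p = \frac{\mathrm{RSS}_p}{S_E^2} + 2p - n,
\qquad
S_E^2 = \frac{\mathrm{RSS}_{\text{full}}}{n - 9} = \frac{115.98480}{57} = 2.03482

% Full model: p = 9 parameters, n = 66 observations
C_9 = \frac{115.98480}{2.03482} + 2(9) - 66 = 57 + 18 - 66 = 9
```

The full model therefore always has Cp equal to p, so its value of diff is 0. For the other models, values of diff near 0 indicate little bias relative to the full model.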

To create Table 13.4 on page 359 from the data file ericksen, we first build a table that lists, for the "best" model of each size, the number of variables in the model, their names, and the R-square. Table 13.4 can then be filled in by fitting each of these models in turn.

proc reg data=ericksen;
  model undcount= perc_min crimrate poverty diffeng hsgrad housing city countprc
        / selection=cp ;
  ods output  SubsetSelSummary=cperick;
run;
quit;
proc sort data=cperick;
  by NumInModel RSquare;
run;
data short; /* keep the model with the largest R-square for each model size */
  set cperick;
  by NumInModel;
  if last.NumInModel then output;
  drop Model Dependent Control;
run;
proc print data=short;
run;

The REG Procedure
Model: MODEL1
Dependent Variable: undcount

C(p) Selection Method

Number in
  Model        C(p)  R-Square  Variables in Model

       5     7.3196    0.6855  perc_min crimrate poverty diffeng countprc
       6     7.9829    0.6924  perc_min crimrate poverty diffeng city countprc
       .....
       6     8.2075    0.6912  perc_min crimrate poverty diffeng hsgrad countprc
       4     8.5149    0.6691  perc_min crimrate diffeng countprc
       5     8.8253    0.6778  perc_min poverty diffeng city countprc

Obs  NumInModel       Cp  RSquare  VarsInModel

  1           1  36.7625   0.4935  perc_min
  2           2  23.9334   0.5696  perc_min hsgrad
  3           3  12.6764   0.6375  perc_min crimrate countprc
  4           4   8.5149   0.6691  perc_min crimrate diffeng countprc
  5           5   7.3196   0.6855  perc_min crimrate poverty diffeng countprc
  6           6   7.9829   0.6924  perc_min crimrate poverty diffeng city countprc
  7           7   8.8738   0.6981  perc_min crimrate poverty diffeng housing city countprc
  8           8   9.0000   0.7077  perc_min crimrate poverty diffeng hsgrad housing city countprc
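
The eight models can be fitted in a single PROC REG step, since the procedure accepts several labeled MODEL statements. The following is a sketch: the variable lists are copied from the listing above, and the labels m1 through m8 are chosen here for illustration.

```sas
/* Sketch: fit the best model of each size; the labels m1-m8 are illustrative. */
proc reg data=ericksen;
  m1: model undcount = perc_min;
  m2: model undcount = perc_min hsgrad;
  m3: model undcount = perc_min crimrate countprc;
  m4: model undcount = perc_min crimrate diffeng countprc;
  m5: model undcount = perc_min crimrate poverty diffeng countprc;
  m6: model undcount = perc_min crimrate poverty diffeng city countprc;
  m7: model undcount = perc_min crimrate poverty diffeng housing city countprc;
  m8: model undcount = perc_min crimrate poverty diffeng hsgrad housing city countprc;
run;
quit;
```

Each label appears in the "Model:" line of the corresponding output, which makes it easier to match the coefficient tables to the rows of the summary.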