Applied Regression Analysis by John Fox Chapter 12: Nonlinearity and Other Ills

Figure 12.1, figure 12.2 and table 12.1, page 297-299 on data file ornstein. We skip the section on confidence envelopes. We only have the usual normal quantile plots without the confidence envelopes.

use https://stats.idre.ucla.edu/stat/stata/examples/ara/ornstein, clear

tab sector, gen(sect) /*generate dummy variables*/
(Output omitted here)

tab nation, gen(nat)/*generate dummy variables*/
(Output omitted here)

gen asset1=sqrt(assets)/*generate a new regressor*/
regress intrlcks asset1 nat2-nat4 sect1-sect5 sect7-sect10

  Source |       SS       df       MS                  Number of obs =     248
---------+------------------------------               F( 13,   234) =   34.11
   Model |  41816.5529    13  3216.65791               Prob > F      =  0.0000
Residual |  22069.8342   234  94.3155309               R-squared     =  0.6545
---------+------------------------------               Adj R-squared =  0.6354
   Total |  63886.3871   247   258.64934               Root MSE      =  9.7116

------------------------------------------------------------------------------
intrlcks |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asset1 |   .2517892   .0185222     13.594   0.000       .2152976    .2882808
    nat2 |  -1.158915   2.664005     -0.435   0.664      -6.407413    4.089584
    nat3 |  -4.444009   2.649276     -1.677   0.095       -9.66349    .7754719
    nat4 |  -8.089053   1.481003     -5.462   0.000      -11.00686   -5.171248
   sect1 |  -1.199539    2.04038     -0.588   0.557      -5.219402    2.820323
   sect2 |  -14.37594   5.576992     -2.578   0.011      -25.36347   -3.388413
   sect3 |   -5.12563   4.698778     -1.091   0.276      -14.38294    4.131683
   sect4 |  -5.698507   2.925712     -1.948   0.053      -11.46261     .065596
   sect5 |  -2.430237   4.014109     -0.605   0.545      -10.33865    5.478175
   sect7 |  -.8668517    2.63433     -0.329   0.742      -6.056886    4.323183
   sect8 |   .3422809   2.012105      0.170   0.865      -3.621874    4.306436
   sect9 |  -.3810376   2.819743     -0.135   0.893      -5.936365     5.17429
  sect10 |   5.151303   2.682082      1.921   0.056      -.1328104    10.43542
   _cons |   4.190453   1.846039      2.270   0.024       .5534732    7.827433
------------------------------------------------------------------------------

predict student, rstudent
qnorm student

Figure 12.1 (b) on page 297.

kdensity student, ylabel(0(.25).75)

gen inlck=sqrt(intrlcks+1)
regress inlck asset1 nat2-nat4 sect1-sect5 sect7-sect10

  Source |       SS       df       MS                  Number of obs =     248
---------+------------------------------               F( 13,   234) =   24.89
   Model |  477.296638    13   36.715126               Prob > F      =  0.0000
Residual |  345.199855   234  1.47521306               R-squared     =  0.5803
---------+------------------------------               Adj R-squared =  0.5570
   Total |  822.496493   247  3.32994531               Root MSE      =  1.2146

------------------------------------------------------------------------------
   inlck |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asset1 |   .0260108   .0023165     11.229   0.000        .021447    .0305747
    nat2 |  -.1139589   .3331737     -0.342   0.733      -.7703624    .5424445
    nat3 |  -.5266014   .3313317     -1.589   0.113      -1.179376    .1261729
    nat4 |  -1.105112   .1852217     -5.966   0.000      -1.470027    -.740197
   sect1 |  -.0567192   .2551801     -0.222   0.824      -.5594633    .4460249
   sect2 |  -2.250759   .6974864     -3.227   0.001      -3.624915   -.8766039
   sect3 |  -.7399749   .5876526     -1.259   0.209      -1.897741     .417791
   sect4 |  -.0880438   .3659042     -0.241   0.810      -.8089313    .6328437
   sect5 |  -.2453168   .5020245     -0.489   0.626      -1.234382    .7437487
   sect7 |   .1479077   .3294624      0.449   0.654      -.5011839    .7969993
   sect8 |   .3562041   .2516439      1.416   0.158       -.139573    .8519811
   sect9 |   .3540146   .3526512      1.004   0.316      -.3407624    1.048792
  sect10 |   .7860363   .3354345      2.343   0.020       .1251788    1.446894
   _cons |   2.329308   .2308748     10.089   0.000       1.874449    2.784167
------------------------------------------------------------------------------

predict student1, rstudent

qnorm student1, ylabel(-5(2.5)5) xlabel(-3(1)3)

Figure 12.2 (b), page 299.

kdensity student1, xlabel(-3(1)3) ylabel(0 .2 .4)

Remark: Combining the results from the two regression procedures, we get the result for Table 12.1, page 298.

Figure 12.3 (a) on page 303 on data file ornstein.

regress intrlcks asset1 nat2-nat4 sect1-sect5 sect7-sect10
(Output omitted here.)

predict fitted1, xb
 /*use student residuals from the first regression in previous section*/
 
graph twoway scatter student fitted1, yline(0) ylabel(-4 0 4) xlabel(-20(20)100)

Figure 12.3 (b) on page 303.

gen logfit=log10(2+fitted1)
gen logres = log10(abs(student))
graph twoway (scatter logres logfit)(lfit logres logfit), ylabel(-2(1)1)

regress inlck asset1 nat2-nat4 sect1-sect5 sect7-sect10
(Output omitted here)  

predict fitted2, xb
gen logfit2=log10(fitted2)
gen logres2=log10(abs(student1))
graph twoway (scatter logres2 logfit2)(lfit logres2 logfit2), ylabel(-3(1)1) xlabel(0(.2)1.2)

Figure 12.6 (a), (b) and (c) on page 312. Notice that procedure cprplot gives partial residual plot and we use the lowess option to get the nonparametric-regression smoothing shown in the book.

use https://stats.idre.ucla.edu/stat/stata/examples/ara/prestige, clear
(From Fox, Applied Regression Analysis.  Use 'notes' command for source of data)

regress prestige percwomn educat income
(Output omitted here.)

cprplot educat, lowess ylabel(0(25)100) xlabel(5(2.5)17.5)

cprplot income, lowess ylabel(-25(25)50) xlabel(0(10000)30000)

cprplot percwomn, lowess ylabel(-20 0 20) xlabel(0(25)100)

Formula in the middle of page 313 still using the data file prestige.

gen loginc=log10(income)/log10(2)
gen edu2=educat*educat
gen edu3=edu2*educat
gen w2=percwomn*percwomn
regress prestige loginc percwomn w2 educat edu2 edu3 

  Source |       SS       df       MS                  Number of obs =     102
---------+------------------------------               F(  6,    95) =   94.46
   Model |  25603.6492     6  4267.27487               Prob > F      =  0.0000
Residual |  4291.77689    95  45.1765988               R-squared     =  0.8564
---------+------------------------------               Adj R-squared =  0.8474
   Total |  29895.4261   101  295.994318               Root MSE      =  6.7214

------------------------------------------------------------------------------
prestige |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  loginc |   8.783331   1.272748      6.901   0.000       6.256608    11.31005
percwomn |   -.179323    .085088     -2.108   0.038      -.3482439    -.010402
      w2 |   .0025001   .0009245      2.704   0.008       .0006648    .0043354
  educat |  -29.92002   15.25152     -1.962   0.053      -60.19811    .3580668
    edu2 |   2.915935   1.414413      2.062   0.042       .1079704    5.723901
    edu3 |  -.0806755   .0422082     -1.911   0.059      -.1644694    .0031185
   _cons |   20.83852   56.89955      0.366   0.715      -92.12136    133.7984
------------------------------------------------------------------------------

predict res2, r
summarize income percwomn /*get mean for income and percwomn*/

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
  income |     102    6797.902   4245.922        611      25879  
percwomn |     102    28.97902   31.72493          0      97.51  

gen pm=20.8+8.78*log10(6797.902)/log10(2)-0.179*28.979+0.0025*28.979*28.979-29.9*educat+2.91*edu2-.0807*edu3
gen pmr=res2 + pm
graph twoway scatter pmr pm educat, connect(i l) sort msymbol(O i O)

Table 12.2 on page 319.

use https://stats.idre.ucla.edu/stat/stata/examples/ara/vocab, clear
(From Fox, Applied Regression Analysis.  Use 'notes' command for source of data)

regress vocab educ

  Source |       SS       df       MS                  Number of obs =     968
---------+------------------------------               F(  1,   966) =  318.92
   Model |  1175.11129     1  1175.11129               Prob > F      =  0.0000
Residual |  3559.41351   966  3.68469307               R-squared     =  0.2482
---------+------------------------------               Adj R-squared =  0.2474
   Total |  4734.52479   967  4.89609596               Root MSE      =  1.9196
(Further output omitted.)

xi: regress vocab i.educ

i.educ                Ieduc_0-20   (naturally coded; Ieduc_0 omitted)

  Source |       SS       df       MS                  Number of obs =     968
---------+------------------------------               F( 19,   948) =   18.13
   Model |  1261.69388    19  66.4049413               Prob > F      =  0.0000
Residual |  3472.83091   948  3.66332374               R-squared     =  0.2665
---------+------------------------------               Adj R-squared =  0.2518
   Total |  4734.52479   967  4.89609596               Root MSE      =   1.914

(Further output omitted)

With the results from these two regressions, we can calculate the nonlinear effect as follows.

display 1261.69399-1175.11129/*Sum of Squares for nonlinear effect*/
86.5827

display (86.57/18)/(3472.8/948)/*F-value*/
1.3128753

display fprob(18, 948, 1.31)/*p-value*/
.17255134

The first part is on Box-Cox transformation on the dependent variable using the data file ornstein, calculation on page 323 and 324.

use https://stats.idre.ucla.edu/stat/stata/examples/ara/ornstein, clear

gen dep=intrlcks+1 /*This is the dependent variable used through the example.*/
gen asset1=sqrt(assets)  /*regressor obtained from independent variable assets. */
gen logy=ln(dep)  /*To get geometric mean suggested in the footnote.*/
egen c1=mean(logy)  /*Contiue on geometric mean*/
gen c=exp(c1)
display c  /* showing geometric mean of the dependent variable*/
8.2501268

gen cv=dep*(ln(dep/c)-1)  /*generate constructed variable*/
xi: regress dep asset1 i.nation i.sector cv

i.nation              Inatio_1-4   (Inatio_1 for nation==CAN omitted)
i.sector              Isect_1-10   (Isect_1 for sector==AGR omitted)

  Source |       SS       df       MS                  Number of obs =     248
---------+------------------------------               F( 14,   233) =  103.67
   Model |  55048.7339    14  3932.05242               Prob > F      =  0.0000
Residual |  8837.65321   233  37.9298421               R-squared     =  0.8617
---------+------------------------------               Adj R-squared =  0.8534
   Total |  63886.3871   247   258.64934               Root MSE      =  6.1587

------------------------------------------------------------------------------
     dep |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  asset1 |   .0699101   .0152576      4.582   0.000       .0398497    .0999705
Inatio_2 |  -.1373967   1.690291     -0.081   0.935      -3.467603     3.19281.....(continued)
      cv |   .5850363   .0313226     18.678   0.000       .5233246     .646748
   _cons |   11.38911   1.132516     10.056   0.000        9.15783    13.62039
------------------------------------------------------------------------------

test cv=0

 ( 1)  cv = 0.0

       F(  1,   233) =  348.86
            Prob > F =    0.0000
            
display 1-_coef[cv]
.41496366

A couple of new programs have been developed for Stata that deal with Box-Cox and Box-Tidwell transform. We will show how to use them here. The first procedure is boxcox, which is available in Stata 7. This procedure finds the maximum likelihood estimate of the parameter(s) of Box-Cox transform.

xi: boxcox dep asset1 i.nation i.sector, model(lhsonly)

i.nation          _Ination_1-4        (_Ination_1 for nation==CAN omitted)
i.sector          _Isector_1-10       (_Isector_1 for sector==AGR omitted)
Estimating comparison model

Iteration 0:   log likelihood = -1040.2744  
Iteration 1:   log likelihood = -917.82824  
Iteration 2:   log likelihood =  -909.0709  
Iteration 3:   log likelihood = -909.05843  
Iteration 4:   log likelihood = -909.05843  

Estimating full model

Iteration 0:   log likelihood =  -908.4755  
Iteration 1:   log likelihood = -820.15739  
Iteration 2:   log likelihood = -819.63494  
Iteration 3:   log likelihood =  -819.6349  

                                                  Number of obs   =        248
                                                  LR chi2(13)     =     178.85
Log likelihood = -819.6349                        Prob > chi2     =      0.000
 
------------------------------------------------------------------------------
         dep |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   .3063002   .0528239     5.80   0.000     .2027673    .4098331
------------------------------------------------------------------------------
 
Estimates of scale-variant parameters
----------------------------
             |      Coef.
-------------+--------------
Notrans      |
      asset1 |   .0299857
  _Ination_2 |  -.1459837
  _Ination_3 |  -.6117602
  _Ination_4 |  -1.407129
  _Isector_2 |  -2.923013
  _Isector_3 |  -.9114291
  _Isector_4 |   .1482129
  _Isector_5 |  -.2716637
  _Isector_6 |   .0036382
  _Isector_7 |   .3405276
  _Isector_8 |   .5982639
  _Isector_9 |   .6550884
 _Isector_10 |   1.048606
       _cons |   2.115848
-------------+--------------
      /sigma |   1.525277
----------------------------

---------------------------------------------------------
   Test         Restricted     LR statistic      P-Value
    H0:       log likelihood       chi2       Prob > chi2
---------------------------------------------------------
theta = -1      -1063.8257       488.38           0.000
theta =  0      -835.82415        32.38           0.000
theta =  1       -908.4755       177.68           0.000
---------------------------------------------------------

The Box-Tidwell transformation on page 325 using data file prestige.

use https://stats.idre.ucla.edu/stat/stata/examples/ara/prestige, clear

gen w2=percwomn*percwomn
gen linc =income*ln(income)
gen ledu =educat*ln(educat)
regress prestige percwomn w2 educat income
(Output omitted.)

gen b1=_coef[educat]
gen b2=_coef[income]
regress prestige percwomn w2 educat income linc ledu
(Output omitted.)

gen d1=_coef[ledu]
gen d2=_coef[linc]
display 1+d1/b1 /*frist-step approximation on educat*/
2.2435437

display 1+d2/b2 /*first-step approximation on income*/
-.91030351

We now use the procedure boxtid to get the fully iterated MLEs of the transformation parameters for educat and income. The boxtid command can be downloaded within Stata by typing search boxtid (see How can I use the search command to search for programs and get additional help? for more information about using search), as shown below.

boxtid regress prestige percwomn w2 educat income, df(percwomn w2:1)

Iteration 0:  Deviance =  675.2403
Iteration 1:  Deviance =  674.8334 (change = -.4068906)
Iteration 2:  Deviance =  674.8169 (change = -.0164992)
Iteration 3:  Deviance =  674.8155 (change = -.0014157)
Iteration 4:  Deviance =  674.8154 (change = -.0001051)
-> gen double Ieduc__1 = educat^2.1939-182.7 if e(sample) 
-> gen double Ieduc__2 = educat^2.1939*ln(educat)-433.7 if e(sample) 
-> gen double Iinco__1 = X^-0.0375-1.015 if e(sample) 
-> gen double Iinco__2 = X^-0.0375*ln(X)+.3916 if e(sample) 
   (where: X = income/10000)
-> gen double Iperc__1 = percwomn-28.98 if e(sample) 
-> gen double Iw2__1 = w2-1836 if e(sample) 

[Total iterations: 8]

Box-Tidwell regression model

  Source |       SS       df       MS                  Number of obs =     102
---------+------------------------------               F(  6,    95) =   90.29
   Model |  25435.2987     6  4239.21644               Prob > F      =  0.0000
Residual |  4460.12743    95  46.9487098               R-squared     =  0.8508
---------+------------------------------               Adj R-squared =  0.8414
   Total |  29895.4261   101  295.994318               Root MSE      =  6.8519

------------------------------------------------------------------------------
prestige |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
Ieduc__1 |   .0984751   .1767093      0.557   0.579      -.2523372    .4492873
Ieduc_p1 |  -.0001244    .060728     -0.002   0.998      -.1206846    .1204359
Iinco__1 |  -333.2435   1640.086     -0.203   0.839      -3589.226    2922.739
Iinco_p1 |   .1065035   59.25056      0.002   0.999      -117.5207    117.7337
Iperc__1 |  -.1658225   .0903862     -1.835   0.070      -.3452618    .0136168
  Iw2__1 |   .0024899    .000972      2.562   0.012       .0005603    .0044195
   _cons |   47.51562   1.063237     44.690   0.000       45.40483    49.62641
------------------------------------------------------------------------------
educat   |   3.806089   .3444647     11.049   Nonlin. dev. 5.019   (P = 0.031)
      p1 |   2.193927   .6218453      3.528
------------------------------------------------------------------------------
income   |   .0011613     .00028      4.147   Nonlin. dev. 26.178  (P = 0.000)
      p1 |  -.0375134   .1638984     -0.229
------------------------------------------------------------------------------
Deviance:  674.815.

Figure 12.10. (a) and (b) on page 326 using the data file prestige. The constructed-variable plot in this case is the added variable plot for the constructed variable X*log(X).

regress prestige percwomn w2 educat income
(Output omitted)

avplot linc, ylabel(-40(20)20) xlabel(-4000(4000)8000)

avplot ledu, ylabel(-40(20)20) xlabel(-.8(.4)1.2)

use https://stats.idre.ucla.edu/stat/stata/examples/ara/ornstein, clear

gen asset1=sqrt(assets)
gen y=intrlcks+1
xi: regress y asset1 i.nation i.sector  /*generate residuals*/
(Output omitted)

predict res, r
predict fitted, xb
gen res2=res*res
egen m= mean(res2)
gen u=res2/m  /*generate U_i**/
regress u fitted

  Source |       SS       df       MS                  Number of obs =     248
---------+------------------------------               F(  1,   246) =   45.73
   Model |   147.64882     1   147.64882               Prob > F      =  0.0000
Residual |  794.296284   246  3.22884668               R-squared     =  0.1567
---------+------------------------------               Adj R-squared =  0.1533
   Total |  941.945104   247  3.81354293               Root MSE      =  1.7969

------------------------------------------------------------------------------
       u |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  fitted |   .0594211   .0087872      6.762   0.000       .0421134    .0767288
   _cons |   .1336017   .1715663      0.779   0.437      -.2043246    .4715279
------------------------------------------------------------------------------

display 147.6488/2
73.8244

egen ybar=mean(fitted)
display 1-0.5*0.0594*ybar
.56695483

xi: regress u asset1 i.nation i.sector

i.nation              Inatio_1-4   (Inatio_1 for nation==CAN omitted)
i.sector              Isect_1-10   (Isect_1 for sector==AGR omitted)

  Source |       SS       df       MS                  Number of obs =     248
---------+------------------------------               F( 13,   234) =    4.04
   Model |  172.590223    13   13.276171               Prob > F      =  0.0000
Residual |   769.35488   234  3.28784137               R-squared     =  0.1832
---------+------------------------------               Adj R-squared =  0.1379
   Total |  941.945104   247  3.81354293               Root MSE      =  1.8132

(More output omitted)