Regression with Stata Chapter 3: Self Assessment Answers

1. Using the elemapi2 data file ( use https://stats.idre.ucla.edu/stat/stata/examples/ara/elemapi2 ) convert the variable ell into 2 categories using the following coding, 0-25 on ell becomes 0, and 26-100 on ell becomes 1. Use this recoded version of ell to predict api00 and interpret the results.

Answer 1.
We first use the elemapi2 data file.

use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2, clear

We convert ell into a 0/1 variable called ell_bin.

gen ell_bin = ell
recode ell_bin 0/25 = 0 26/100=1
(398 changes made)
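
As an aside, the copy-then-recode steps above can usually be collapsed into one command with the generate() option of recode. This is only a sketch: ell_bin2 is a hypothetical name, chosen so that it does not clash with ell_bin, and the final command simply checks that the two versions agree.

* one-step alternative: recode ell into a new 0/1 variable
* (ell_bin2 is an illustrative name, not part of the original answer)
recode ell (0/25=0) (26/100=1), generate(ell_bin2)
* stop with an error if the two versions ever disagree
assert ell_bin2 == ell_bin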

We tabulate ell_bin to see that the recoding looks OK.

tab ell_bin

    ell_bin |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        201       50.25       50.25
          1 |        199       49.75      100.00
------------+-----------------------------------
      Total |        400      100.00
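
The frequency table shows the group sizes but not where the cut fell. As a further check (a sketch, not part of the original answer), we could list the smallest and largest value of ell within each group. If the recode worked, group 0 should run no higher than 25 and group 1 should start no lower than 26.

* range of the original ell within each recoded group
tabstat ell, by(ell_bin) statistics(min max n)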

We now include ell_bin in the regression model.

regress api00 ell_bin

      Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  1,   398) =  451.15
       Model |  4289511.71     1  4289511.71           Prob > F      =  0.0000
    Residual |  3784160.29   398  9507.94043           R-squared     =  0.5313
-------------+------------------------------           Adj R-squared =  0.5301
       Total |  8073672.00   399  20234.7669           Root MSE      =  97.509

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     ell_bin |   -207.114   9.750989   -21.24   0.000    -226.2838   -187.9441
       _cons |   750.6617   6.877731   109.14   0.000     737.1405    764.1829
------------------------------------------------------------------------------

The coefficient for _cons is the mean api score for the schools where ell_bin is coded 0 (low number of English language learners). The coefficient for ell_bin is the mean api score for the schools with a high number of English language learners minus the mean api score for the schools with a low number of English language learners. When broken into these two categories, the schools with a high number of English language learners score about 207 points lower on api00 than the schools with a low number of English language learners, and this difference is significant.
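
Because ell_bin is the only predictor, the fitted values are just the two group means, so the coefficients can be checked against a table of means. A minimal sketch, typing in the coefficients reported above:

* mean api00 within each level of ell_bin
tabulate ell_bin, summarize(api00)
* mean for the high group implied by the regression: _cons plus the ell_bin coefficient
display 750.6617 + (-207.114)

The mean for ell_bin = 0 should match _cons, and the mean for ell_bin = 1 should match the displayed sum, about 543.55.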

2. Convert the variable ell into 3 categories, coding those scoring 0-14 on ell as 1, those scoring 15-41 as 2, and those scoring 42-100 as 3. Do an analysis predicting api00 from the ell variable converted to a 1/2/3 variable. Interpret the results.

Answer 2.
First we create the categorical variable called ell_cat.

generate ell_cat = ell
recode ell_cat 0/14=1 15/41=2 42/100=3
(385 changes made)

We check the creation of ell_cat using the tabulate command below.

tabulate ell_cat

    ell_cat |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        136       34.00       34.00
          2 |        129       32.25       66.25
          3 |        135       33.75      100.00
------------+-----------------------------------
      Total |        400      100.00

We use xi with the regress command to perform this analysis, and this creates two dummy codes with category 1 (low number of English language learners) as the reference category.

xi : regress api00 i.ell_cat

i.ell_cat         _Iell_cat_1-3       (naturally coded; _Iell_cat_1 omitted)

      Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  2,   397) =  252.88
       Model |  4523139.17     2  2261569.59           Prob > F      =  0.0000
    Residual |  3550532.82   397  8943.40762           R-squared     =  0.5602
-------------+------------------------------           Adj R-squared =  0.5580
       Total |  8073672.00   399  20234.7669           Root MSE      =   94.57

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _Iell_cat_2 |   -141.319   11.62278   -12.16   0.000    -164.1689   -118.4691
 _Iell_cat_3 |  -257.9758   11.48947   -22.45   0.000    -280.5636    -235.388
       _cons |   780.2647   8.109276    96.22   0.000     764.3222    796.2072
------------------------------------------------------------------------------

The _cons represents the mean api score for the reference category, where ell_cat is coded 1. The coefficient for _Iell_cat_2 is the difference in the mean api score between the ell_cat=2 group and the reference group, ell_cat=1, and this difference is significant. The schools with a middle number of English language learners score about 141 points lower on their api score than the schools with a low number of English language learners. The coefficient for _Iell_cat_3 is the difference in the mean api score between the ell_cat=3 group and the reference group, and this is significant as well. The schools with a high number of English language learners score about 258 points lower than the schools with a low number of English language learners.
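
The output compares groups 2 and 3 only with the reference group. One way to compare them with each other, sketched here on the assumption that the xi regression above is still the most recent estimation, is lincom:

* difference between the high and middle groups, with a standard error and test
lincom _Iell_cat_3 - _Iell_cat_2

The point estimate is -257.9758 - (-141.319), or about -116.66. In Stata 11 and later, factor-variable notation makes the xi prefix unnecessary; the coefficients are then labeled by level rather than with _I names:

* same model with factor-variable notation (Stata 11 or later)
regress api00 i.ell_cat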

3. Do a regression analysis predicting api00 from yr_rnd and the ell variable converted to a 0/1 variable. Then create an interaction term and run the analysis again. Interpret the results of these analyses.

Answer 3.
We use the regress command to perform this analysis below.

regress api00 yr_rnd ell_bin

      Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  2,   397) =  270.17
       Model |  4654146.48     2  2327073.24           Prob > F      =  0.0000
    Residual |  3419525.51   397  8613.41439           R-squared     =  0.5765
-------------+------------------------------           Adj R-squared =  0.5743
       Total |  8073672.00   399  20234.7669           Root MSE      =  92.808

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      yr_rnd |   -77.6647   11.93665    -6.51   0.000    -101.1316   -54.19776
     ell_bin |   -182.082   10.04678   -18.12   0.000    -201.8335   -162.3304
       _cons |   756.0712   6.598791   114.58   0.000     743.0982    769.0441
------------------------------------------------------------------------------

These results indicate that year round schools (yr_rnd=1) score about 78 points lower on the api test than non-year round schools (yr_rnd=0), holding ell_bin constant. Also, schools with high numbers of English language learners score about 182 points lower on the api test than the schools with low numbers of English language learners, holding yr_rnd constant. Both of these effects are significant.

Now we include an interaction term in the analysis.

generate yr_ell = yr_rnd*ell_bin
regress api00 yr_rnd ell_bin yr_ell

      Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  3,   396) =  179.67
       Model |  4654224.91     3  1551408.30           Prob > F      =  0.0000
    Residual |  3419447.09   396  8634.96739           R-squared     =  0.5765
-------------+------------------------------           Adj R-squared =  0.5733
       Total |  8073672.00   399  20234.7669           Root MSE      =  92.925

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      yr_rnd |  -75.49121     25.748    -2.93   0.004    -126.1111   -24.87135
     ell_bin |  -181.6966   10.84157   -16.76   0.000    -203.0109   -160.3824
      yr_ell |  -2.770387   29.06936    -0.10   0.924    -59.91995    54.37918
       _cons |   755.9198   6.795314   111.24   0.000     742.5604    769.2792
------------------------------------------------------------------------------

The main effects of yr_rnd and ell_bin are still significant, but the interaction term yr_ell is not significant. This suggests that the effects we described in the analysis above are consistent across the levels of yr_rnd and ell_bin. In other words, we can say that the effect of ell_bin is much the same for the year round schools as for the non-year round schools.

We could also have run this analysis using the anova command, which can be much more convenient for models like these.

anova api00 yr_rnd ell_bin yr_rnd*ell_bin

                           Number of obs =     400     R-squared     =  0.5765
                           Root MSE      = 92.9245     Adj R-squared =  0.5733

                  Source |  Partial SS    df       MS           F     Prob > F
          ---------------+----------------------------------------------------
                   Model |  4654224.91     3  1551408.30     179.67     0.0000
                         |
                  yr_rnd |  241566.044     1  241566.044      27.98     0.0000
                 ell_bin |  1370062.10     1  1370062.10     158.66     0.0000
          yr_rnd*ell_bin |  78.4279246     1  78.4279246       0.01     0.9241
                         |
                Residual |  3419447.09   396  8634.96739   
          ---------------+----------------------------------------------------
                   Total |  8073672.00   399  20234.7669

And we can use the adjust command to get the means for the cells. You can relate the coefficients from the regression model to the means below. For example, the _cons is the mean for the cell where all the variables are 0, and so forth.

adjust , by(yr_rnd ell_bin)

-----------------------------------------------------
     Dependent variable: api00     Command: anova
-----------------------------------------------------

----------------------------
year      |
round     |     ell_bin     
school    |       0        1
----------+-----------------
       No |  755.92  574.223
      Yes | 680.429  495.962
----------------------------
     Key:  Linear Prediction
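
To make the correspondence concrete, each cell mean can be rebuilt from the coefficients of the interaction regression above. The arithmetic below is a sketch that types the reported coefficients in by hand:

* not year round, low ell: _cons
display 755.9198
* year round, low ell: _cons + yr_rnd
display 755.9198 - 75.49121
* not year round, high ell: _cons + ell_bin
display 755.9198 - 181.6966
* year round, high ell: _cons + yr_rnd + ell_bin + yr_ell
display 755.9198 - 75.49121 - 181.6966 - 2.770387

Up to rounding, these reproduce the four cells of the table: 755.92, 680.429, 574.223 and 495.962. Note that the anova and adjust commands shown above use older syntax. In Stata 11 and later the interaction is written with # or ##, and margins takes the place of adjust, so a roughly equivalent sequence would be:

* factor-variable form of the same two-way model (Stata 11 or later)
anova api00 yr_rnd##ell_bin
* predicted cell means
margins yr_rnd#ell_bin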

4. Do a regression analysis predicting api00 from ell coded as 0/1 (from question 1) and some_col, and the interaction of these two variables. Interpret the results, including showing a graph of the results.

Answer 4.
We create an interaction term and run the analysis.

gen ell_col = ell_bin*some_col
regress api00 ell_bin some_col ell_col

      Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  3,   396) =  167.96
       Model |  4520787.76     3  1506929.25           Prob > F      =  0.0000
    Residual |  3552884.24   396   8971.9299           R-squared     =  0.5599
-------------+------------------------------           Adj R-squared =  0.5566
       Total |  8073672.00   399  20234.7669           Root MSE      =   94.72

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     ell_bin |  -291.6353   19.90341   -14.65   0.000    -330.7649   -252.5057
    some_col |  -1.443942   .5595276    -2.58   0.010    -2.543958   -.3439265
     ell_col |   4.622981   .9174408     5.04   0.000     2.819317    6.426644
       _cons |   784.5261   14.72533    53.28   0.000     755.5765    813.4757
------------------------------------------------------------------------------

Make a graph to help in the interpretation.

predict predapi
separate predapi, by(ell_bin)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
predapi0        float  %9.0g                  predapi, ell_bin == 0
predapi1        float  %9.0g                  predapi, ell_bin == 1

graph twoway scatter api00 predapi0 predapi1 some_col, ///
	connect(i l l) msymbol(o i i) pstyle(p1 p2 p2) sort

The graph helps us visually understand the interaction represented by ell_col. We can see that the regression lines relating some_col to api00 are not parallel. Specifically, the line for the schools with a low number of English language learners has a downward slope, and the line for the schools with a high number of English language learners has an upward slope. From the regression equation, we see that the slope of the line when ell_bin is 0 (low number of English language learners) is -1.44. This corresponds to the solid regression line we see in the above graph. The coefficient for ell_col, 4.62, is the difference between the slope for the schools with a high number of English language learners and the slope for the schools with a low number. To get the slope for the schools with a high number of English language learners we add 4.62 to -1.44, which yields 3.18. This corresponds to the dotted regression line that we see in the above graph.
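
The slope for the schools with a high number of English language learners can also be obtained directly, along with a standard error and confidence interval. A sketch, assuming the regression of api00 on ell_bin, some_col and ell_col is still the most recent estimation (predict, separate and graph do not replace it):

* slope of some_col when ell_bin is 1
lincom some_col + ell_col

The point estimate should equal -1.443942 + 4.622981, or about 3.18.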

5. Use the variable ell converted into 3 categories (from question 2) and predict api00 from ell in 3 categories, from some_col, and from the interaction of these two variables. Interpret the results, including showing a graph.

Answer 5.
We use the xi prefix with regress to perform the analysis looking at the effects of some_col and ell_cat and their interaction.

xi : regress api00 i.ell_cat*some_col

i.ell_cat         _Iell_cat_1-3       (naturally coded; _Iell_cat_1 omitted)
i.ell_~t*some~l   _IellXsome__#       (coded as above)

      Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  5,   394) =  109.56
       Model |  4696120.14     5  939224.028           Prob > F      =  0.0000
    Residual |  3377551.86   394  8572.46664           R-squared     =  0.5817
-------------+------------------------------           Adj R-squared =  0.5763
       Total |  8073672.00   399  20234.7669           Root MSE      =  92.588

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _Iell_cat_2 |  -199.7005    25.2889    -7.90   0.000    -249.4186   -149.9825
 _Iell_cat_3 |  -349.7611   23.60764   -14.82   0.000    -396.1738   -303.3484
    some_col |  -2.056112   .6695881    -3.07   0.002    -3.372524   -.7396998
_IellXsome~2 |    2.48773   1.003112     2.48   0.014     .5156074    4.459852
_IellXsome~3 |   5.112258   1.159782     4.41   0.000     2.832123    7.392393
       _cons |   829.4451   17.87578    46.40   0.000     794.3012    864.5889
------------------------------------------------------------------------------

To help interpretation, let's make a graph of the predicted values.

predict yhat

(option xb assumed; fitted values)

separate yhat, by(ell_cat)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
yhat1           float  %9.0g                  yhat, ell_cat == 1
yhat2           float  %9.0g                  yhat, ell_cat == 2
yhat3           float  %9.0g                  yhat, ell_cat == 3

graph twoway scatter api00 yhat1 yhat2 yhat3 some_col, ///
	connect(i l l l) msymbol(o i i i) pstyle(p1 p2 p2 p2) sort

We can use the information in the graph and in the regression equation to help interpret these results. Looking first at the graph, we see that the three regression lines are not parallel. For the schools with a low number of English language learners (ell_cat is 1) the regression line has a downward slope, for the schools with a middle number of English language learners (ell_cat is 2) the regression line is nearly flat, and for the schools with a high number of English language learners (ell_cat is 3) the regression line has an upward tilt. We can use the regression model to compute the exact slopes of all three of these regression lines. Since group 1 is the reference category, the slope for that regression line is the coefficient for some_col, which is -2.06.

The coefficient for _IellXsome~2 (2.49) tells us how much we need to add to -2.06 to get the slope for the second group. Adding the coefficients, -2.056112 + 2.48773, gives .43, the slope for the second group. Because the coefficient for _IellXsome~2 is significant, we can say that the slope for group 2 is significantly different from the slope for group 1.

The coefficient for _IellXsome~3 (5.11) tells us how much we need to add to -2.06 to get the slope for the third group. Adding the coefficients, -2.056112 + 5.112258, gives 3.06, the slope for the third group. Because the coefficient for _IellXsome~3 is significant, we can say that the slope for group 3 is significantly different from the slope for group 1.
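
The slopes for groups 2 and 3 can also be requested directly instead of added by hand. This is a sketch that assumes the xi regression above is still the most recent estimation. The full variable names _IellXsome__2 and _IellXsome__3 are inferred from the xi note in the output (_IellXsome__#), so it is worth confirming them first with describe.

* confirm the names of the variables that xi created
describe _I*
* slope of some_col in group 2 and in group 3
lincom some_col + _IellXsome__2
lincom some_col + _IellXsome__3
* do the slopes for groups 2 and 3 differ from each other?
test _IellXsome__2 = _IellXsome__3
* joint test of the interaction as a whole
testparm _IellXsome__2 _IellXsome__3

The regression output only compares each group with the reference group, so the test of group 2 against group 3 and the joint test add information that the coefficient table does not show.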