**1**.
Using the **elemapi2** data file ( use
https://stats.idre.ucla.edu/stat/stata/examples/ara/elemapi2 ), convert the variable **ell**
into 2 categories using the following coding: 0-25 on **ell** becomes 0, and
26-100 on **ell** becomes 1. Use this recoded version of **ell**
to predict **api00** and interpret the results.

**Answer 1**.

We first use the elemapi2 data file.

    use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2, clear

We convert **ell** into a 0/1 variable called **ell_bin**.

    gen ell_bin = ell
    recode ell_bin 0/25=0 26/100=1
    (398 changes made)

We tabulate **ell_bin** to see that the recoding looks OK.

    tab ell_bin

        ell_bin |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        201       50.25       50.25
              1 |        199       49.75      100.00
    ------------+-----------------------------------
          Total |        400      100.00
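The two-step **generate**/**recode** approach above can also be written in one step. The following is a minimal sketch, assuming a Stata version that supports the **generate()** option of **recode**; the name **ell_bin2** is hypothetical and used only for illustration.

```stata
* Sketch only: ell_bin2 is a hypothetical variable name used for illustration.
* recode with generate() leaves ell untouched and writes the result to a new variable.
recode ell (0/25 = 0) (26/100 = 1), generate(ell_bin2)

* assert stops with an error if the two versions ever disagree.
assert ell_bin2 == ell_bin
```

If the **assert** line runs silently, the one-step version matches the variable created above.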

We now include **ell_bin** in the regression model.

    regress api00 ell_bin

          Source |       SS       df       MS              Number of obs =     400
    -------------+------------------------------           F(  1,   398) =  451.15
           Model |  4289511.71     1  4289511.71           Prob > F      =  0.0000
        Residual |  3784160.29   398  9507.94043           R-squared     =  0.5313
    -------------+------------------------------           Adj R-squared =  0.5301
           Total |  8073672.00   399  20234.7669           Root MSE      =  97.509

    ------------------------------------------------------------------------------
           api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         ell_bin |   -207.114   9.750989   -21.24   0.000    -226.2838   -187.9441
           _cons |   750.6617   6.877731   109.14   0.000     737.1405    764.1829
    ------------------------------------------------------------------------------

The coefficient for **_cons** is the mean api score for the schools
where **ell_bin** is coded 0 (low number of English language learners).
The coefficient for **ell_bin** is the mean api score for the
schools with a high number of English language learners minus the mean api
score for the schools with a low number of English language learners. When broken into
these two categories, the schools with a high number of English language learners score
about 207 points lower on the api, on average, than the schools with a low number of
English language learners, and this difference is significant (t = -21.24, p < 0.001).
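As a quick check on this interpretation, the fitted equation can be evaluated at both values of **ell_bin**. The arithmetic below uses only the coefficients reported in the regression output above; with a single 0/1 predictor, the two fitted values are the two group means.

```latex
\begin{align*}
\widehat{\text{api00}} &= 750.6617 - 207.114 \times \text{ell\_bin} \\
\text{ell\_bin} = 0:\quad \widehat{\text{api00}} &= 750.6617 \\
\text{ell\_bin} = 1:\quad \widehat{\text{api00}} &= 750.6617 - 207.114 = 543.5477
\end{align*}
```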

**2**.
Convert the variable **ell** into 3 categories, coding those scoring 0-14 on
**ell** as 1, those scoring 15-41 as 2, and those scoring 42-100 as 3. Do an analysis
predicting **api00** from the **ell** variable converted to a 1/2/3 variable. Interpret the
results.

**Answer 2**.

First we create the categorical variable called **ell_cat**.

    generate ell_cat = ell
    recode ell_cat 0/14=1 15/41=2 42/100=3
    (385 changes made)

We check the creation of **ell_cat** using the **tabulate**
command below.

    tabulate ell_cat

        ell_cat |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |        136       34.00       34.00
              2 |        129       32.25       66.25
              3 |        135       33.75      100.00
    ------------+-----------------------------------
          Total |        400      100.00

We use **xi** with the **regress** command to perform this
analysis. This creates two dummy variables, with category 1 (low number of English
language learners) as the reference category.

    xi : regress api00 i.ell_cat

    i.ell_cat         _Iell_cat_1-3       (naturally coded; _Iell_cat_1 omitted)

          Source |       SS       df       MS              Number of obs =     400
    -------------+------------------------------           F(  2,   397) =  252.88
           Model |  4523139.17     2  2261569.59           Prob > F      =  0.0000
        Residual |  3550532.82   397  8943.40762           R-squared     =  0.5602
    -------------+------------------------------           Adj R-squared =  0.5580
           Total |  8073672.00   399  20234.7669           Root MSE      =   94.57

    ------------------------------------------------------------------------------
           api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
     _Iell_cat_2 |   -141.319   11.62278   -12.16   0.000    -164.1689   -118.4691
     _Iell_cat_3 |  -257.9758   11.48947   -22.45   0.000    -280.5636    -235.388
           _cons |   780.2647   8.109276    96.22   0.000     764.3222    796.2072
    ------------------------------------------------------------------------------

The **_cons** represents the mean for the reference category, when
**ell_cat** is coded 1. The coefficient for **_Iell_cat_2** is the
difference in the mean api score between the **ell_cat**=2 group and the
reference group, **ell_cat**=1, and this difference is significant. The
schools with a middle number of English language learners score about 141 points lower on
their api score than the schools with a low number of English language learners. The
coefficient for **_Iell_cat_3** is the difference in the mean api score between the
**ell_cat**=3 group and the reference group, and this is significant as well. The schools
with a high number of English language learners score about 258 points lower than the
schools with a low number of English language learners.
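The output compares groups 2 and 3 only with the reference group, not with each other. The sketch below shows one way to get that remaining comparison. It assumes Stata 11 or later, where factor-variable notation can be used in place of **xi**; the point estimate is simply the difference of the two coefficients, -257.9758 - (-141.319) = -116.6568.

```stata
* Sketch assuming Stata 11 or later, where factor-variable notation replaces xi.
* ib1. makes category 1 the base (reference) level, matching the xi output above.
regress api00 ib1.ell_cat

* Difference between the high (3) and middle (2) groups, with its standard error.
lincom 3.ell_cat - 2.ell_cat
```

**lincom** also reports a t test and confidence interval for the difference, which the regression table alone does not provide.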

**3**.
Do a regression analysis predicting **api00** from **yr_rnd**
and the **ell** variable converted to a 0/1 variable. Then create an
interaction term and run the analysis again. Interpret the results of these analyses.

**Answer 3**.

We use the **regress** command to perform this analysis below.

    regress api00 yr_rnd ell_bin

          Source |       SS       df       MS              Number of obs =     400
    -------------+------------------------------           F(  2,   397) =  270.17
           Model |  4654146.48     2  2327073.24           Prob > F      =  0.0000
        Residual |  3419525.51   397  8613.41439           R-squared     =  0.5765
    -------------+------------------------------           Adj R-squared =  0.5743
           Total |  8073672.00   399  20234.7669           Root MSE      =  92.808

    ------------------------------------------------------------------------------
           api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          yr_rnd |   -77.6647   11.93665    -6.51   0.000    -101.1316   -54.19776
         ell_bin |   -182.082   10.04678   -18.12   0.000    -201.8335   -162.3304
           _cons |   756.0712   6.598791   114.58   0.000     743.0982    769.0441
    ------------------------------------------------------------------------------

These results indicate that year round schools (**yr_rnd**=1) score
about 78 points lower on the api test than non-year round schools (**yr_rnd**=0),
holding **ell_bin** constant. Also, schools with high numbers of English language
learners score about 182 points lower on the api test than the schools with low numbers
of English language learners, holding **yr_rnd** constant.
Both of these effects are significant.

Now we include an interaction term in the analysis.

    generate yr_ell = yr_rnd*ell_bin
    regress api00 yr_rnd ell_bin yr_ell

          Source |       SS       df       MS              Number of obs =     400
    -------------+------------------------------           F(  3,   396) =  179.67
           Model |  4654224.91     3  1551408.30           Prob > F      =  0.0000
        Residual |  3419447.09   396  8634.96739           R-squared     =  0.5765
    -------------+------------------------------           Adj R-squared =  0.5733
           Total |  8073672.00   399  20234.7669           Root MSE      =  92.925

    ------------------------------------------------------------------------------
           api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          yr_rnd |  -75.49121     25.748    -2.93   0.004    -126.1111   -24.87135
         ell_bin |  -181.6966   10.84157   -16.76   0.000    -203.0109   -160.3824
          yr_ell |  -2.770387   29.06936    -0.10   0.924    -59.91995    54.37918
           _cons |   755.9198   6.795314   111.24   0.000     742.5604    769.2792
    ------------------------------------------------------------------------------

The main effects of **yr_rnd** and **ell_bin** are still
significant, but the interaction term **yr_ell** is not significant.
This suggests that the effects we described in the analysis above are consistent
across the levels of **yr_rnd** and **ell_bin**. In other
words, we can say that the effect of **ell_bin** is much the same for the
year round schools as for the non-year round schools.
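To see how similar the two effects are, the effect of **ell_bin** can be computed separately for each type of school. The following is a sketch, assuming Stata 11 or later so that factor-variable notation is available; it refits the same model without creating the product term by hand. From the coefficients above, the two estimates should be -181.6966 and -181.6966 + (-2.770387) = -184.466987.

```stata
* Sketch assuming Stata 11 or later (factor-variable notation).
* ## expands to both main effects plus their interaction.
regress api00 i.yr_rnd##i.ell_bin

* Effect of ell_bin among non-year-round schools (yr_rnd == 0).
lincom 1.ell_bin

* Effect of ell_bin among year-round schools (yr_rnd == 1).
lincom 1.ell_bin + 1.yr_rnd#1.ell_bin
```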

We could also have run this analysis using the **anova** command, which
can be much more convenient for models like these.

    anova api00 yr_rnd ell_bin yr_rnd*ell_bin

                               Number of obs =     400     R-squared     =  0.5765
                               Root MSE      = 92.9245     Adj R-squared =  0.5733

                      Source |  Partial SS    df       MS           F     Prob > F
              ---------------+----------------------------------------------------
                       Model |  4654224.91     3  1551408.30     179.67     0.0000
                             |
                      yr_rnd |  241566.044     1  241566.044      27.98     0.0000
                     ell_bin |  1370062.10     1  1370062.10     158.66     0.0000
              yr_rnd*ell_bin |  78.4279246     1  78.4279246       0.01     0.9241
                             |
                    Residual |  3419447.09   396  8634.96739
              ---------------+----------------------------------------------------
                       Total |  8073672.00   399  20234.7669

And we can use the **adjust** command to get the means for the cells.
You can relate the coefficients from the regression model to the means below.
For example, the **_cons** is the mean for the cell where all the
variables are 0, and so forth.

    adjust , by(yr_rnd ell_bin)

    -----------------------------------------------------
         Dependent variable: api00     Command: anova
    -----------------------------------------------------

    ----------------------------
    year      |
    round     |     ell_bin
    school    |       0        1
    ----------+-----------------
           No |  755.92  574.223
          Yes | 680.429  495.962
    ----------------------------
         Key:  Linear Prediction
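The link between the regression coefficients and the four cell means can be written out explicitly. The arithmetic below uses only the coefficients from the regression with the **yr_ell** interaction term; each result matches the corresponding cell of the table above after rounding.

```latex
\begin{align*}
\text{yr\_rnd}=0,\ \text{ell\_bin}=0:&\quad 755.9198 \\
\text{yr\_rnd}=0,\ \text{ell\_bin}=1:&\quad 755.9198 - 181.6966 = 574.2232 \\
\text{yr\_rnd}=1,\ \text{ell\_bin}=0:&\quad 755.9198 - 75.49121 = 680.4286 \\
\text{yr\_rnd}=1,\ \text{ell\_bin}=1:&\quad 755.9198 - 75.49121 - 181.6966 - 2.770387 = 495.9616
\end{align*}
```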

**4**.
Do a regression analysis predicting **api00** from **ell** coded
as 0/1 (from question 1) and **some_col**, and the interaction of these two
variables. Interpret the results, including showing a graph of the results.

**Answer 4**.

Create an interaction term and run the analysis.

    gen ell_col = ell_bin*some_col
    regress api00 ell_bin some_col ell_col

          Source |       SS       df       MS              Number of obs =     400
    -------------+------------------------------           F(  3,   396) =  167.96
           Model |  4520787.76     3  1506929.25           Prob > F      =  0.0000
        Residual |  3552884.24   396   8971.9299           R-squared     =  0.5599
    -------------+------------------------------           Adj R-squared =  0.5566
           Total |  8073672.00   399  20234.7669           Root MSE      =   94.72

    ------------------------------------------------------------------------------
           api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         ell_bin |  -291.6353   19.90341   -14.65   0.000    -330.7649   -252.5057
        some_col |  -1.443942   .5595276    -2.58   0.010    -2.543958   -.3439265
         ell_col |   4.622981   .9174408     5.04   0.000     2.819317    6.426644
           _cons |   784.5261   14.72533    53.28   0.000     755.5765    813.4757
    ------------------------------------------------------------------------------

Make a graph to help in the interpretation.

    predict predapi
    separate predapi, by(ell_bin)

                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    predapi0        float  %9.0g                  predapi, ell_bin == 0
    predapi1        float  %9.0g                  predapi, ell_bin == 1

    graph twoway scatter api00 predapi0 predapi1 some_col, ///
        connect(i l l i) msymbol(o i i o) pstyle(p1 p2 p2 p1) sort

The graph helps us visually understand the interaction represented by **ell_col**.
We can see that the regression lines relating **some_col** to **api00**
are not parallel. Specifically, the line for the schools with a low number of English
language learners has a downward slope, and the line for the schools
with a high number of English language learners has an upward slope. From
the regression equation, we see that the slope of the line when **ell_bin**
is 0 (low number of English language learners) is -1.44. This corresponds
to the solid regression line in the graph above. The coefficient for **ell_col**,
4.62, is the difference between the slope for the schools with a high number of
English language learners and the slope for the schools with a low number of
English language learners. To get the slope for the schools with a high number of
English language learners, we add 4.62 to -1.44, which yields 3.18. This
corresponds to the dotted regression line in the graph above.
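Adding the coefficients by hand gives the slope but not its standard error. A minimal sketch using **lincom** is shown below; it assumes the **regress** results for this model are still the most recent estimation results in memory, which holds here because **predict**, **separate**, and **graph** do not replace them.

```stata
* Run after: regress api00 ell_bin some_col ell_col
* (the estimates stay in memory until another estimation command is run).

* Slope of some_col when ell_bin == 0.
lincom some_col

* Slope of some_col when ell_bin == 1: main-effect slope plus the interaction.
lincom some_col + ell_col
```

The second **lincom** reproduces the sum -1.443942 + 4.622981 = 3.179039 and adds a test of whether that slope differs from zero.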

**5**.
Use the variable **ell** converted into 3 categories (from question 2) and
predict **api00** from **ell** in 3 categories, from **some_col**,
and from the interaction of these two variables. Interpret the results, including showing a
graph.

**Answer 5**.

We use the **xi** command with **regress** to perform the
analysis looking at the effect of **some_col** and **ell_cat**
and the interaction.

    xi : regress api00 i.ell_cat*some_col

    i.ell_cat         _Iell_cat_1-3       (naturally coded; _Iell_cat_1 omitted)
    i.ell_~t*some~l   _IellXsome__#       (coded as above)

          Source |       SS       df       MS              Number of obs =     400
    -------------+------------------------------           F(  5,   394) =  109.56
           Model |  4696120.14     5  939224.028           Prob > F      =  0.0000
        Residual |  3377551.86   394  8572.46664           R-squared     =  0.5817
    -------------+------------------------------           Adj R-squared =  0.5763
           Total |  8073672.00   399  20234.7669           Root MSE      =  92.588

    ------------------------------------------------------------------------------
           api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
     _Iell_cat_2 |  -199.7005    25.2889    -7.90   0.000    -249.4186   -149.9825
     _Iell_cat_3 |  -349.7611   23.60764   -14.82   0.000    -396.1738   -303.3484
        some_col |  -2.056112   .6695881    -3.07   0.002    -3.372524   -.7396998
    _IellXsome~2 |    2.48773   1.003112     2.48   0.014     .5156074    4.459852
    _IellXsome~3 |   5.112258   1.159782     4.41   0.000     2.832123    7.392393
           _cons |   829.4451   17.87578    46.40   0.000     794.3012    864.5889
    ------------------------------------------------------------------------------

To help interpretation, let's make a graph of the predicted values.

    predict yhat
    (option xb assumed; fitted values)

    separate yhat, by(ell_cat)

                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    yhat1           float  %9.0g                  yhat, ell_cat == 1
    yhat2           float  %9.0g                  yhat, ell_cat == 2
    yhat3           float  %9.0g                  yhat, ell_cat == 3

    graph twoway scatter api00 yhat1 yhat2 yhat3 some_col, ///
        connect(i l l l i) msymbol(o i i i o) pstyle(p1 p2 p2 p2 p1) sort

We can use the information in the graph and in the regression equation to
help interpret these results. First looking at the graph, we see that the
three regression lines are not parallel. For the schools
with a low number of English language learners (when **ell_cat** is 1) the
regression line has a downward slope, for the schools with a middle number of
English language learners (when **ell_cat** is 2) the regression line is
pretty flat, and for the schools with a high number of English language learners
(when **ell_cat** is 3) the regression line has an upward tilt. We can use
the regression model to compute the exact slopes of all three of these
regression lines. Since group 1 is the reference category, the slope
for that regression line is the coefficient for **some_col**, which is -2.056.

The coefficient for **_IellXsome~2** (2.488) tells us how much we need to
add to -2.056 to get the slope for the second group. When we add 2.488 to -2.056
we get .432, the slope for the second group. Because the
coefficient **_IellXsome~2** is significant, we can say that the slope
for group 2 is significantly different from the slope for group 1.

The coefficient for **_IellXsome~3** (5.112) tells us how much we need to
add to -2.056 to get the slope for the third group. When we add 5.112
to -2.056 we get 3.056, the slope for the third group. Because the
coefficient **_IellXsome~3** is significant, we can say that the slope
for group 3 is significantly different from the slope for group 1.
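The group slopes can also be obtained directly, along with a comparison the output does not show: whether the slopes for groups 2 and 3 differ from each other. The sketch below assumes that **xi** named the interaction variables **_IellXsome__2** and **_IellXsome__3**, following the **_IellXsome__#** stub in the **xi** header (the regression table abbreviates these names), and that the **xi : regress** results are still in memory.

```stata
* Run after: xi : regress api00 i.ell_cat*some_col
* Assumes xi named the interaction terms _IellXsome__2 and _IellXsome__3,
* following the _IellXsome__# stub shown in the xi header.

* Slope of some_col for ell_cat == 2.
lincom some_col + _IellXsome__2

* Slope of some_col for ell_cat == 3.
lincom some_col + _IellXsome__3

* Do the slopes for groups 2 and 3 differ from each other?
test _IellXsome__2 = _IellXsome__3
```

If the variable names differ in your session, **describe _Iell*** lists the variables that **xi** created.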