Sometimes your research hypothesis predicts that the size of a regression coefficient will vary across groups. For example, you might believe that the regression coefficient of height predicting weight differs across three age groups (young, middle aged, senior citizen). Below, we have a data file with 10 fictional young people, 10 fictional middle-aged people, and 10 fictional senior citizens, along with their height in inches and their weight in pounds. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.
id  age  height  weight
 1    1      56     140
 2    1      60     155
 3    1      64     143
 4    1      68     161
 5    1      72     139
 6    1      54     159
 7    1      62     138
 8    1      65     121
 9    1      65     161
10    1      70     145
11    2      56     117
12    2      60     125
13    2      64     133
14    2      68     141
15    2      72     149
16    2      54     109
17    2      62     128
18    2      65     131
19    2      65     131
20    2      70     145
21    3      64     211
22    3      68     223
23    3      72     235
24    3      76     247
25    3      80     259
26    3      62     201
27    3      69     228
28    3      74     245
29    3      75     241
30    3      82     269
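The listing above can also be typed straight into Stata, which is handy if you prefer to work offline. This is a minimal sketch: the values are copied from the listing, and the value label name agegrp is an illustrative choice that is not part of the original data file.

```stata
* Enter the 30 observations by hand (values copied from the listing above)
clear
input id age height weight
 1 1 56 140
 2 1 60 155
 3 1 64 143
 4 1 68 161
 5 1 72 139
 6 1 54 159
 7 1 62 138
 8 1 65 121
 9 1 65 161
10 1 70 145
11 2 56 117
12 2 60 125
13 2 64 133
14 2 68 141
15 2 72 149
16 2 54 109
17 2 62 128
18 2 65 131
19 2 65 131
20 2 70 145
21 3 64 211
22 3 68 223
23 3 72 235
24 3 76 247
25 3 80 259
26 3 62 201
27 3 69 228
28 3 74 245
29 3 75 241
30 3 82 269
end

* Optional: attach readable labels to the age codes
* (the label name agegrp is hypothetical, chosen for this sketch)
label define agegrp 1 "young" 2 "middle aged" 3 "senior"
label values age agegrp

* Confirm that each age group has 10 observations
tabulate age
```

The variable age stays numeric after labeling, so it can still be used with by age: and with factor-variable notation.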
We analyze the data for each age group separately using the regress command below, after first sorting by age.
use https://stats.idre.ucla.edu/stat/stata/faq/compreg3, clear
sort age
by age: regress weight height
The parameter estimates (coefficients) for the young, middle aged, and senior citizens are shown below. The results seem to suggest that height is a stronger predictor of weight for seniors (3.19) than for the middle aged (2.10). They also seem to suggest that height does not predict weight as strongly for the young (-0.38, not significantly different from zero) as for the middle aged and seniors. However, we would need to perform specific significance tests to be able to make claims about the differences among these regression coefficients.
-> age= 1
------------------------------------------------------------------------------
  weight |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  height |  -.3768309   .7743341     -0.487   0.640      -2.162449    1.408787
   _cons |   170.1664   49.43018      3.443   0.009       56.18024    284.1526
------------------------------------------------------------------------------

-> age= 2
------------------------------------------------------------------------------
  weight |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  height |   2.095872    .110491     18.969   0.000        1.84108    2.350665
   _cons |   -2.39747   7.053272     -0.340   0.743      -18.66234     13.8674
------------------------------------------------------------------------------

-> age= 3
------------------------------------------------------------------------------
  weight |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  height |   3.189727   .1232367     25.883   0.000       2.905543    3.473912
   _cons |   5.601677   8.930197      0.627   0.548      -14.99139    26.19475
------------------------------------------------------------------------------
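To see the three sets of estimates side by side, the by-group regressions can be collected into one small dataset. The sketch below uses statsby, and it assumes the data file at the URL used earlier is still reachable. The clear option replaces the data in memory, so reload the file before continuing.

```stata
* Load the example data (URL taken from the text above)
use https://stats.idre.ucla.edu/stat/stata/faq/compreg3, clear

* Fit weight on height within each age group and keep the
* coefficients (_b) and standard errors (_se); clear replaces
* the data in memory with one row per age group
statsby _b _se, by(age) clear: regress weight height

* One row per group: slope, its SE, intercept, its SE
list age _b_height _se_height _b_cons _se_cons, noobs
```

This only tabulates the separate estimates. It does not test whether the slopes differ.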
We can compare the regression coefficients among these three age groups to test the null hypothesis
Ho: B1 = B2 = B3
where B1 is the regression coefficient for the young, B2 is the regression coefficient for the middle aged, and B3 is the regression coefficient for senior citizens. To do this analysis, we first make two dummy variables: age1, which is coded 1 if young (age=1) and 0 otherwise, and age2, which is coded 1 if middle aged (age=2) and 0 otherwise. Senior citizens (age=3) are therefore the omitted (reference) group. We also create age1ht, which is age1 times height, and age2ht, which is age2 times height.
generate age1 = 0
generate age2 = 0
replace age1 = 1 if age==1
replace age2 = 1 if age==2
generate age1ht = age1*height
generate age2ht = age2*height
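It may help to write out the model that these variables define. The notation below is an assumption made for illustration: the symbols \beta_0 through \beta_5 and \varepsilon do not appear in the original text, and B1, B2, B3 are the group slopes from the null hypothesis.

```latex
\[
\text{weight} = \beta_0 + \beta_1\,\text{age1} + \beta_2\,\text{age2}
  + \beta_3\,\text{height} + \beta_4\,\text{age1ht} + \beta_5\,\text{age2ht}
  + \varepsilon
\]
% Slope of height within each age group:
\[
B_1 = \beta_3 + \beta_4 \quad (\text{young}), \qquad
B_2 = \beta_3 + \beta_5 \quad (\text{middle aged}), \qquad
B_3 = \beta_3 \quad (\text{senior citizens})
\]
% Hence equality of the three slopes is equivalent to:
\[
H_0 : B_1 = B_2 = B_3
\quad\Longleftrightarrow\quad
\beta_4 = \beta_5 = 0
\]
```

In other words, the coefficients on age1ht and age2ht are the differences between each group's slope and the slope for senior citizens.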
We can now use age1, age2, height, age1ht, and age2ht as predictors in the regress command below. The regress command will be followed by the command:
test age1ht age2ht
which tests the null hypothesis:
Ho: B1 = B2 = B3
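This test has 2 numerator degrees of freedom: the three regression coefficients are equal exactly when the coefficients on both age1ht and age2ht are zero, so the hypothesis imposes two restrictions.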
regress weight age1 age2 height age1ht age2ht

  Source |       SS       df       MS                  Number of obs =      30
---------+------------------------------               F(  5,    24) =  220.26
   Model |  69595.3546     5  13919.0709               Prob > F      =  0.0000
Residual |  1516.64536    24  63.1935565               R-squared     =  0.9787
---------+------------------------------               Adj R-squared =  0.9742
   Total |    71112.00    29  2452.13793               Root MSE      =  7.9494

------------------------------------------------------------------------------
  weight |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
    age1 |   164.5648    41.5549      3.960   0.001       78.79966    250.3299
    age2 |  -7.999147    41.5549     -0.192   0.849      -93.76425    77.76596
  height |   3.189727   .4069417      7.838   0.000       2.349841    4.029614
  age1ht |  -3.566558   .6131609     -5.817   0.000       -4.83206   -2.301057
  age2ht |  -1.093855   .6131609     -1.784   0.087      -2.359357    .1716466
   _cons |   5.601677   29.48854      0.190   0.851      -55.25967    66.46303
------------------------------------------------------------------------------
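The combined model contains the same three slopes as the separate regressions: the coefficient on height is the slope for senior citizens, and adding an interaction coefficient gives the slope for the corresponding group. One way to recover each slope with a standard error is lincom, sketched below under the assumption that the data file is loaded from the URL used earlier.

```stata
* Load the data and rebuild the dummy and interaction variables
use https://stats.idre.ucla.edu/stat/stata/faq/compreg3, clear
generate age1 = (age == 1)
generate age2 = (age == 2)
generate age1ht = age1*height
generate age2ht = age2*height

regress weight age1 age2 height age1ht age2ht

* Slope of height for senior citizens (the reference group)
lincom height

* Slope of height for the middle aged
lincom height + age2ht

* Slope of height for the young
lincom height + age1ht
```

The point estimates match the by-group regressions (for example, 3.189727 - 3.566558 = -.376831 for the young). The standard errors differ from the by-group output because the combined model pools the residual variance across all three groups.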
The analysis below shows that the null hypothesis
Ho: B1 = B2 = B3
can be rejected (F(2, 24) = 17.29, p < 0.0001). This means that the regression coefficients for height predicting weight do indeed differ significantly across the 3 age groups (young, middle aged, senior citizen).
test age1ht age2ht

 ( 1)  age1ht = 0.0
 ( 2)  age2ht = 0.0

       F(  2,    24) =   17.29
            Prob > F =    0.0000
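The F statistic reported by test can also be understood as a comparison of two nested models: one with a separate height slope for each age group and one with a single common slope. The sketch below computes it by hand from the residual sums of squares. The scalar names (rss_full, df_full, rss_red, Fstat) are illustrative, and the data file is assumed to be reachable at the URL used earlier.

```stata
* Load the data and rebuild the dummy and interaction variables
use https://stats.idre.ucla.edu/stat/stata/faq/compreg3, clear
generate age1 = (age == 1)
generate age2 = (age == 2)
generate age1ht = age1*height
generate age2ht = age2*height

* Full model: each age group has its own height slope
quietly regress weight age1 age2 height age1ht age2ht
scalar rss_full = e(rss)
scalar df_full  = e(df_r)

* Reduced model: one common height slope (the 2 interactions are dropped)
quietly regress weight age1 age2 height
scalar rss_red = e(rss)

* F = [(RSS_reduced - RSS_full) / 2] / [RSS_full / residual df]
scalar Fstat = ((rss_red - rss_full)/2) / (rss_full/df_full)
display "F(2, " df_full ") = " %6.2f Fstat
display "Prob > F = " %6.4f Ftail(2, df_full, Fstat)
```

For a linear regression this should agree with the F(2, 24) = 17.29 reported by test age1ht age2ht above.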
Note that we constructed all of the variables manually to make it very clear what each variable represented. However, in day-to-day use, you would probably be more likely to use factor-variable notation and let Stata generate the dummy variables and interactions for you. For example,
regress weight i.age##c.height

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(5, 24)      =  220.26
       Model |  69595.3546     5  13919.0709           Prob > F      =  0.0000
    Residual |  1516.64536    24  63.1935565           R-squared     =  0.9787
-------------+------------------------------           Adj R-squared =  0.9742
       Total |       71112    29  2452.13793           Root MSE      =  7.9494

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |
          2  |  -172.5639   41.40619    -4.17   0.000    -258.0221   -87.10574
          3  |  -164.5648    41.5549    -3.96   0.001    -250.3299   -78.79966
             |
      height |  -.3768309   .4586553    -0.82   0.419    -1.323449    .5697872
             |
age#c.height |
          2  |   2.472703   .6486366     3.81   0.001     1.133983    3.811423
          3  |   3.566558   .6131609     5.82   0.000     2.301056     4.83206
             |
       _cons |   170.1664    29.2786     5.81   0.000     109.7384    230.5945
------------------------------------------------------------------------------
However, you may notice that in this example the first age group is the omitted group, whereas previously the third group was the omitted group. We can set the base (or reference) group to 3 by specifying “b3” after the “i” in the factor-variable notation. (The “b” is for “base”.)
regress weight ib3.age##c.height

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(5, 24)      =  220.26
       Model |  69595.3546     5  13919.0709           Prob > F      =  0.0000
    Residual |  1516.64536    24  63.1935565           R-squared     =  0.9787
-------------+------------------------------           Adj R-squared =  0.9742
       Total |       71112    29  2452.13793           Root MSE      =  7.9494

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |
          1  |   164.5648    41.5549     3.96   0.001     78.79966    250.3299
          2  |  -7.999147    41.5549    -0.19   0.849    -93.76425    77.76596
             |
      height |   3.189727   .4069417     7.84   0.000     2.349841    4.029614
             |
age#c.height |
          1  |  -3.566558   .6131609    -5.82   0.000     -4.83206   -2.301056
          2  |  -1.093855   .6131609    -1.78   0.087    -2.359357    .1716466
             |
       _cons |   5.601677   29.48854     0.19   0.851    -55.25967    66.46303
------------------------------------------------------------------------------
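With factor-variable notation, the joint test of the interaction terms can be requested without building any variables by hand. The sketch below is one way to do it, assuming the data file is reachable at the URL used earlier. It fits the model with the default base group (age 1); the joint test does not depend on which group is the base, so it should reproduce the F(2, 24) = 17.29 obtained from test age1ht age2ht.

```stata
use https://stats.idre.ucla.edu/stat/stata/faq/compreg3, clear

* Same model as above, with the default base group (age = 1)
regress weight i.age##c.height

* Joint test that all age-by-height interaction coefficients are zero,
* i.e. that the three height slopes are equal
testparm i.age#c.height

* A single pairwise comparison: middle aged versus senior citizen slopes
test 2.age#c.height = 3.age#c.height

* Estimated slope of height within each age group
margins age, dydx(height)
```

The pairwise test is shown only as an illustration. If you run several such comparisons, you may want to adjust for multiple testing.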