Sometimes your research hypothesis may predict that the size of a regression coefficient varies across groups. For example, you might believe that the regression coefficient of height predicting weight differs across three age groups (young, middle aged, senior citizen). Below, we have a data file with 10 fictional young people, 10 fictional middle aged people, and 10 fictional senior citizens, along with their height in inches and their weight in pounds. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.
DATA htwt;
  INPUT id age height weight ;
CARDS;
 1 1 56 140
 2 1 60 155
 3 1 64 143
 4 1 68 161
 5 1 72 139
 6 1 54 159
 7 1 62 138
 8 1 65 121
 9 1 65 161
10 1 70 145
11 2 56 117
12 2 60 125
13 2 64 133
14 2 68 141
15 2 72 149
16 2 54 109
17 2 62 128
18 2 65 131
19 2 65 131
20 2 70 145
21 3 64 211
22 3 68 223
23 3 72 235
24 3 76 247
25 3 80 259
26 3 62 201
27 3 69 228
28 3 74 245
29 3 75 241
30 3 82 269
;
RUN;
We first analyze the data for each age group separately using the proc reg below.
PROC REG DATA=htwt;
  BY age ;
  MODEL weight = height ;
RUN;
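BY-group processing expects the input data to be sorted by the BY variable. The htwt data above happen to be entered in age order already, so the step runs as written. If your own data are not sorted, a sketch like the following, which simply adds a proc sort in front of the same proc reg, may help.

```sas
/* Sketch: sort by the BY variable first, then fit one regression per group. */
PROC SORT DATA=htwt ;
  BY age ;
RUN;

PROC REG DATA=htwt ;
  BY age ;
  MODEL weight = height ;
RUN;
```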
The parameter estimates (coefficients) for the young, middle aged, and senior citizens are shown below. The results do seem to suggest that height is a stronger predictor of weight for seniors (3.19) than for the middle aged (2.10). The results also seem to suggest that height does not predict weight as strongly for the young (-0.38) as for the middle aged and seniors. However, we would need to perform specific significance tests to be able to make claims about the differences among these regression coefficients.
AGE=1

                Parameter      Standard     T for H0:
Variable  DF     Estimate         Error   Parameter=0   Prob > |T|
INTERCEP   1   170.166445   49.43018216         3.443       0.0088
HEIGHT     1    -0.376831    0.77433413        -0.487       0.6396

AGE=2

                Parameter      Standard     T for H0:
Variable  DF     Estimate         Error   Parameter=0   Prob > |T|
INTERCEP   1    -2.397470    7.05327189        -0.340       0.7427
HEIGHT     1     2.095872    0.11049098        18.969       0.0001

AGE=3

                Parameter      Standard     T for H0:
Variable  DF     Estimate         Error   Parameter=0   Prob > |T|
INTERCEP   1     5.601677    8.93019669         0.627       0.5480
HEIGHT     1     3.189727    0.12323669        25.883       0.0001
We can compare the regression coefficients among these three age groups to test the null hypothesis
Ho: B1 = B2 = B3
where B1 is the regression coefficient for the young, B2 is the regression coefficient for the middle aged, and B3 is the regression coefficient for senior citizens. To do this analysis, we first make a dummy variable called age1 that is coded 1 if young (age=1), 0 otherwise, and a second dummy variable called age2 that is coded 1 if middle aged (age=2), 0 otherwise. Senior citizens (age=3) are coded 0 on both dummy variables, so they serve as the reference group. We also create age1ht that is age1 times height, and age2ht that is age2 times height.
data htwt2;
  set htwt;
  age1 = . ;
  age2 = . ;
  IF age = 1 then age1 = 1; ELSE age1 = 0 ;
  IF age = 2 then age2 = 1; ELSE age2 = 0 ;
  age1ht = age1*height ;
  age2ht = age2*height ;
RUN;
We can now use age1, age2, height, age1ht, and age2ht as predictors in the regression equation in proc reg below. In the proc reg we use the
TEST age1ht=0, age2ht=0;
statement to test the null hypothesis
Ho: B1 = B2 = B3
This test will have two degrees of freedom because it compares among three regression coefficients.
PROC REG DATA=htwt2 ;
  MODEL weight = age1 age2 height age1ht age2ht ;
  TEST age1ht=0, age2ht=0 ;
RUN;
The output below shows that the null hypothesis
Ho: B1 = B2 = B3
can be rejected (F = 17.29, p = 0.0001). This means that the regression coefficients of height predicting weight do indeed differ significantly across the 3 age groups (young, middle aged, senior citizen).
Model: MODEL1
Dependent Variable: WEIGHT

Analysis of Variance

                     Sum of          Mean
Source     DF       Squares        Square    F Value   Prob>F
Model       5   69595.35464   13919.07093    220.261   0.0001
Error      24    1516.64536      63.19356
C Total    29   71112.00000

Root MSE     7.94944    R-square   0.9787
Dep Mean   171.00000    Adj R-sq   0.9742
C.V.         4.64879

Parameter Estimates

                Parameter      Standard     T for H0:
Variable  DF     Estimate         Error   Parameter=0   Prob > |T|
INTERCEP   1     5.601677   29.48853690         0.190       0.8509
AGE1       1   164.564768   41.55490307         3.960       0.0006
AGE2       1    -7.999147   41.55490307        -0.192       0.8490
HEIGHT     1     3.189727    0.40694172         7.838       0.0001
AGE1HT     1    -3.566558    0.61316088        -5.817       0.0001
AGE2HT     1    -1.093855    0.61316088        -1.784       0.0871

Dependent Variable: WEIGHT
Test:   Numerator:    1092.7718   DF:  2   F value:  17.2925
        Denominator:   63.19356   DF: 24   Prob>F:    0.0001
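To see why testing age1ht = 0 and age2ht = 0 amounts to testing B1 = B2 = B3, it may help to write out the slope that the combined model implies for each age group. In the sketch below, the subscripted b symbols are labels introduced here for the HEIGHT, AGE1HT, and AGE2HT estimates in the Parameter Estimates table; they are not names that appear in the SAS output.

```latex
\begin{align*}
\widehat{\text{weight}} &= b_0 + b_{\text{age1}}\,\text{age1} + b_{\text{age2}}\,\text{age2}
   + b_{\text{height}}\,\text{height} + b_{\text{age1ht}}\,\text{age1ht} + b_{\text{age2ht}}\,\text{age2ht} \\[4pt]
B_3 \text{ (seniors)}     &= b_{\text{height}} = 3.189727 \\
B_1 \text{ (young)}       &= b_{\text{height}} + b_{\text{age1ht}} = 3.189727 - 3.566558 = -0.376831 \\
B_2 \text{ (middle aged)} &= b_{\text{height}} + b_{\text{age2ht}} = 3.189727 - 1.093855 = 2.095872
\end{align*}
```

These are the same HEIGHT estimates obtained from the three separate by-group regressions. Because age1ht and age2ht carry the differences B1 - B3 and B2 - B3, setting both to zero is the same as setting all three slopes equal.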
It is also possible to run such an analysis in proc glm, using syntax as shown below. Instead of using a test statement, the contrast statement is used to test the null hypothesis
Ho: B1 = B2 = B3
The contrast statement uses the comma to join together what would have been two separate one degree of freedom tests into a single two degree of freedom test that tests the null hypothesis above.
PROC GLM DATA=htwt2 ;
  CLASS age ;
  MODEL weight = age height age*height / SOLUTION ;
  CONTRAST 'test equal slopes' age*height 1 -1 0,
                               age*height 0 1 -1 ;
RUN;
If you compare the contrast output from proc glm (labeled test equal slopes, found below) with the output of the test statement from proc reg above, you will see that the F values and p values are the same. This is because these two tests are equivalent.
General Linear Models Procedure
Class Level Information

Class    Levels    Values
AGE           3    1 2 3

Number of observations in data set = 30

General Linear Models Procedure

Dependent Variable: WEIGHT

                              Sum of           Mean
Source             DF        Squares         Square   F Value   Pr > F
Model               5   69595.354644   13919.070929    220.26   0.0001
Error              24    1516.645356      63.193557
Corrected Total    29   71112.000000

R-Square       C.V.    Root MSE    WEIGHT Mean
0.978672   4.648794   7.9494375      171.00000

Source        DF        Type I SS     Mean Square   F Value   Pr > F
AGE            2     64350.600000    32175.300000    509.15   0.0001
HEIGHT         1      3059.211075     3059.211075     48.41   0.0001
HEIGHT*AGE     2      2185.543569     1092.771784     17.29   0.0001

Source        DF      Type III SS     Mean Square   F Value   Pr > F
AGE            2     1395.9046778     697.9523389     11.04   0.0004
HEIGHT         1     2597.0189017    2597.0189017     41.10   0.0001
HEIGHT*AGE     2     2185.5435689    1092.7717845     17.29   0.0001

Contrast             DF     Contrast SS     Mean Square   F Value   Pr > F
test equal slopes     2    2185.5435689    1092.7717845     17.29   0.0001

                                    T for H0:                Std Error of
Parameter            Estimate     Parameter=0   Pr > |T|       Estimate
INTERCEPT           5.6016771 B          0.19     0.8509    29.48853690
AGE        1      164.5647676 B          3.96     0.0006    41.55490307
           2       -7.9991472 B         -0.19     0.8490    41.55490307
           3        0.0000000 B           .        .           .
HEIGHT              3.1897275 B          7.84     0.0001     0.40694172
HEIGHT*AGE 1       -3.5665584 B         -5.82     0.0001     0.61316088
           2       -1.0938553 B         -1.78     0.0871     0.61316088
           3        0.0000000 B           .        .           .

NOTE: The X'X matrix has been found to be singular and a generalized inverse was used to solve the normal equations. Estimates followed by the letter 'B' are biased, and are not unique estimators of the parameters.
You might notice that the null hypothesis that we are testing
Ho: B1 = B2 = B3
is similar to the null hypothesis that you might test using ANOVA to compare the means of the three groups,
Ho: Mu1 = Mu2 = Mu3
In ANOVA, you can get an overall F test of this null hypothesis. In addition to that overall test, you could perform planned comparisons among the three groups. So far we have seen how to do an overall test of the equality of the three regression coefficients, and now we will test planned comparisons among the regression coefficients. Below, we show how you can perform two such tests using the contrast statement in proc glm. The first contrast compares the regression coefficients of the middle aged vs. seniors.
Ho: B2 = B3
The second contrast compares the regression coefficient of the young vs. the average of the middle aged and seniors.
Ho: B1 = (B2 + B3)/2
PROC GLM DATA=htwt2 ;
  CLASS age ;
  MODEL weight = age height age*height ;
  CONTRAST 'Mid Age vs. Sen. '  age*height  0 1 -1 ;
  CONTRAST 'Yng vs (Mid & Sen)' age*height -2 1  1 ;
RUN;
The output from the contrast statements indicates that the regression coefficients for the middle aged and seniors do not significantly differ (F=3.18, p=0.0871). The second contrast was significant (F=29.96, p=0.0001), indicating that the regression coefficient for the young differs from that of the middle aged and seniors combined.
General Linear Models Procedure
Class Level Information

Class    Levels    Values
AGE           3    1 2 3

Number of observations in data set = 30

General Linear Models Procedure

Dependent Variable: WEIGHT

                              Sum of           Mean
Source             DF        Squares         Square   F Value   Pr > F
Model               5   69595.354644   13919.070929    220.26   0.0001
Error              24    1516.645356      63.193557
Corrected Total    29   71112.000000

R-Square       C.V.    Root MSE    WEIGHT Mean
0.978672   4.648794   7.9494375      171.00000

Source        DF        Type I SS     Mean Square   F Value   Pr > F
AGE            2     64350.600000    32175.300000    509.15   0.0001
HEIGHT         1      3059.211075     3059.211075     48.41   0.0001
HEIGHT*AGE     2      2185.543569     1092.771784     17.29   0.0001

Source        DF      Type III SS     Mean Square   F Value   Pr > F
AGE            2     1395.9046778     697.9523389     11.04   0.0004
HEIGHT         1     2597.0189017    2597.0189017     41.10   0.0001
HEIGHT*AGE     2     2185.5435689    1092.7717845     17.29   0.0001

Contrast              DF     Contrast SS     Mean Square   F Value   Pr > F
Mid Age vs. Sen.       1     201.1146303     201.1146303      3.18   0.0871
Yng vs (Mid & Sen)     1    1893.2074903    1893.2074903     29.96   0.0001
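The same two planned comparisons can also be written as linear hypotheses on the dummy-coded model, without recoding the age variables. The sketch below assumes the htwt2 data set created earlier, in which seniors are the reference group, so that B2 - B3 is the age2ht coefficient and B1 - (B2 + B3)/2 is the age1ht coefficient minus half the age2ht coefficient. The statement labels MidVsSen and YngVsRest are hypothetical names chosen here for illustration.

```sas
/* Sketch only: planned comparisons as TEST statements on the dummy-coded model. */
PROC REG DATA=htwt2 ;
  MODEL weight = age1 age2 height age1ht age2ht ;
  /* Ho: B2 = B3, i.e. the age2ht coefficient is zero */
  MidVsSen:  TEST age2ht = 0 ;
  /* Ho: B1 = (B2 + B3)/2, i.e. 2*age1ht equals age2ht */
  YngVsRest: TEST 2*age1ht = age2ht ;
RUN;
```

Because these hypotheses are algebraically the same as the two contrasts, the F values from these tests would be expected to agree with the contrast output above.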
We can do the exact same analysis in proc reg by coding age1 and age2 like the coding shown in the contrast statements above. We will create age1 that will be:
 0 for young
 1 for middle age
-1 for senior
and we will create age2 that will be:
-2 for young
 1 for middle age
 1 for senior
The significance tests in proc reg below for age1ht and age2ht will correspond to the contrast statements we used in proc glm above.
data htwt3;
  set htwt;
  age1 = . ;
  age2 = . ;
  IF age = 1 then age1 = 0;
  IF age = 2 then age1 = 1;
  IF age = 3 then age1 = -1;
  IF age = 1 then age2 = -2;
  IF age = 2 then age2 = 1;
  IF age = 3 then age2 = 1;
  age1ht = age1*height ;
  age2ht = age2*height ;
RUN;

PROC REG DATA=htwt3 ;
  MODEL weight = age1 age2 height age1ht age2ht ;
RUN;
The results below correspond to the proc glm results above, except that the proc glm results are reported as F values and the proc reg results are reported as t values. We can square the t values to make them comparable to the F values. Indeed, for the comparison of middle aged vs. seniors, the t value of -1.784 when squared becomes 3.183, the same (up to rounding) as the F value of 3.18 from proc glm. Likewise, for the comparison of young vs. middle aged and seniors, the t value from proc reg is 5.473 and when squared becomes 29.954, the same (up to rounding) as the F value of 29.96 from proc glm.
Model: MODEL1
Dependent Variable: WEIGHT

Analysis of Variance

                     Sum of          Mean
Source     DF       Squares        Square    F Value   Prob>F
Model       5   69595.35464   13919.07093    220.261   0.0001
Error      24    1516.64536      63.19356
C Total    29   71112.00000

Root MSE     7.94944    R-square   0.9787
Dep Mean   171.00000    Adj R-sq   0.9742
C.V.         4.64879

Parameter Estimates

                Parameter      Standard     T for H0:
Variable  DF     Estimate         Error   Parameter=0   Prob > |T|
INTERCEP   1    57.790217   16.94450462         3.411       0.0023
AGE1       1    -3.999574   20.77745154        -0.192       0.8490
AGE2       1   -56.188114   11.96726393        -4.695       0.0001
HEIGHT     1     1.636256    0.25524084         6.411       0.0001
AGE1HT     1    -0.546928    0.30658044        -1.784       0.0871
AGE2HT     1     1.006544    0.18389498         5.473       0.0001
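As a check on the contrast coding, the group slopes can be recovered from these estimates by plugging each group's age1 and age2 codes into the height terms. The subscripted b symbols below are labels introduced here for the HEIGHT, AGE1HT, and AGE2HT estimates, not names from the SAS output, and the results agree with the separate by-group regressions up to rounding in the last digit.

```latex
\begin{align*}
\text{slope}_g &= b_{\text{height}} + b_{\text{age1ht}}\,\text{age1}_g + b_{\text{age2ht}}\,\text{age2}_g \\[4pt]
B_1 \text{ (young)}       &= 1.636256 + (0)(-0.546928) + (-2)(1.006544) = -0.376832 \\
B_2 \text{ (middle aged)} &= 1.636256 + (1)(-0.546928) + (1)(1.006544) = 2.095872 \\
B_3 \text{ (seniors)}     &= 1.636256 + (-1)(-0.546928) + (1)(1.006544) = 3.189728 \\[4pt]
B_2 - B_3 &= 2\,b_{\text{age1ht}}, \qquad
\tfrac{1}{2}(B_2 + B_3) - B_1 = 3\,b_{\text{age2ht}}
\end{align*}
```

The last line shows why the t test for AGE1HT is a test of middle aged vs. seniors and the t test for AGE2HT is a test of young vs. the average of the other two groups: each coefficient is a constant multiple of the corresponding difference in slopes. Under this coding the HEIGHT coefficient (1.636256) is the unweighted mean of the three group slopes.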