Suppose we have an ANOVA model and would like to compare the mean of one group with the mean of another group (or set of groups). In SAS, this is commonly done with the estimate statement. Let’s look at a simple example using the elemapi2 data set (https://stats.idre.ucla.edu/wp-content/uploads/2016/02/elemapi2.sas7bdat). The dependent variable, api00, is the school’s API index (a continuous variable). The variables mealcat and collcat are categorical variables, each with three levels; they are used as the predictor variables. The model is shown below. We have included the lsmeans statement to get the expected mean for each cell, which is helpful if you want to calculate a contrast estimate by hand.
proc glm data = elemapi2;
  class collcat mealcat;
  model api00 = collcat mealcat collcat*mealcat / ss3;
  lsmeans collcat*mealcat;
run;
quit;

Least Squares Means

collcat   mealcat   api00 LSMEAN
1         1         816.914286
1         2         589.350000
1         3         493.918919
2         1         825.651163
2         2         636.604651
2         3         508.833333
3         1         782.150943
3         2         655.637681
3         3         541.733333
Let’s say that we want to make a simple comparison of collcat group 1 versus groups 2 and 3 combined when mealcat = 1. One way of doing this is to use proc glm with an estimate statement. We use the e option on the estimate statement to have SAS print the contrast coefficients that are applied to each group.
proc glm data = elemapi2;
  class collcat mealcat;
  model api00 = collcat mealcat collcat*mealcat / ss3;
  estimate 'collcat 1 vs 2+ within mealcat = 1'
           collcat 1 -.5 -.5
           collcat*mealcat 1 0 0 -.5 0 0 -.5 0 0 / e;
run;
quit;
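To see where the coefficients in this estimate statement come from, it can help to write each cell mean in terms of the parameters of the full model. The notation below is introduced here for illustration only and does not appear in the SAS output: $\mu$ is the intercept, $\alpha_i$ the collcat effect, $\beta_j$ the mealcat effect, and $\gamma_{ij}$ the interaction effect for the cell with collcat = $i$ and mealcat = $j$.

```latex
\begin{align*}
\mu_{ij} &= \mu + \alpha_i + \beta_j + \gamma_{ij} \\[4pt]
\mu_{11} - \tfrac{1}{2}\,(\mu_{21} + \mu_{31})
  &= \alpha_1 - \tfrac{1}{2}\alpha_2 - \tfrac{1}{2}\alpha_3
   + \gamma_{11} - \tfrac{1}{2}\gamma_{21} - \tfrac{1}{2}\gamma_{31}
\end{align*}
```

The intercept and the mealcat effect cancel, which is why the statement lists coefficients only for collcat (1 -.5 -.5) and for collcat*mealcat (1 0 0 -.5 0 0 -.5 0 0).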
While this estimate statement will run the analysis we want, it is a little difficult to write. Another way of accomplishing the same thing is to use a cell means model. A cell means model estimates only one parameter for each cell and sets the intercept to 0. In general, the cell means model is not used to produce an overall test of model fit, but it is often used to write simpler estimate or contrast statements. So, in practice, we need to write the proc glm code twice, once for the model fit and the second time for the estimates or contrasts. In the code shown below, the first proc glm is for model fit and the second one with the estimate statement is used to estimate the simple comparison. We use the ss3 option to limit the output to only the Type III sums of squares.
In the second call to proc glm, which is a cell means model, the main effects are omitted; only the interaction is included in the model. We use the noint option on the model statement to specify that we are not going to estimate the intercept; therefore, we will estimate one parameter per cell. We use the e option to show us the contrast codes that were used. This is a useful way to be sure that the contrast codings were assigned as you intended.
* estimating the overall model;
proc glm data = elemapi2;
  class collcat mealcat;
  model api00 = collcat mealcat collcat*mealcat / ss3;
run;
quit;

* estimating the cell means model to get the desired estimate;
proc glm data = elemapi2;
  class collcat mealcat;
  model api00 = collcat*mealcat / noint;
  estimate 'collcat 1 vs 2+ within mealcat = 1'
           collcat*mealcat 1 0 0 -.5 0 0 -.5 0 0 / e;
run;
quit;
Notice that the order of the categorical variables on the class statement determines which variable is the row variable and which is the column variable. In the code above, collcat is the row variable and mealcat is the column variable. Therefore, the simple comparison we are interested in can be formulated as shown in the following table.
collcat / mealcat | mealcat = 1 | mealcat = 2 | mealcat = 3 |
collcat = 1 | 1 | 0 | 0 |
collcat = 2 | -.5 | 0 | 0 |
collcat = 3 | -.5 | 0 | 0 |
Writing the numbers in the table one row at a time, we can write our estimate statement as:
estimate 'simple comparison' collcat*mealcat 1 0 0 -.5 0 0 -.5 0 0 / e;
Equivalently, we can make use of the option divisor = to rewrite the statement in terms of whole numbers as shown below.
estimate 'simple comparison' collcat*mealcat 2 0 0 -1 0 0 -1 0 0 /divisor=2 e;
If we switch the order of the variables on the class statement, we have to rewrite our estimate statement accordingly. For example, with class mealcat collcat, mealcat becomes the row variable, and reading the table below one row at a time gives the coefficients 1 -.5 -.5 0 0 0 0 0 0. The estimate statement produces exactly the same result, since the corresponding table is simply transposed.
mealcat / collcat | collcat = 1 | collcat = 2 | collcat = 3 |
mealcat = 1 | 1 | -.5 | -.5 |
mealcat = 2 | 0 | 0 | 0 |
mealcat = 3 | 0 | 0 | 0 |
Let’s run the analysis model. Notice that both collcat and the collcat*mealcat interaction need to be specified on the estimate statement.
proc glm data = elemapi2;
  class collcat mealcat;
  model api00 = collcat mealcat collcat*mealcat / ss3;
  estimate 'collcat 1 vs 2+ within mealcat = 1'
           collcat 1 -.5 -.5
           collcat*mealcat 1 0 0 -.5 0 0 -.5 0 0 / e;
run;
quit;

The GLM Procedure

Class Level Information

Class     Levels   Values
mealcat   3        1 2 3
collcat   3        1 2 3

Number of Observations Read   400
Number of Observations Used   400

Coefficients for Estimate collcat 1 vs 2+ within mealcat = 1

                        Row 1
Intercept                0
collcat 1                1
collcat 2               -0.5
collcat 3               -0.5
mealcat 1                0
mealcat 2                0
mealcat 3                0
collcat*mealcat 1 1      1
collcat*mealcat 1 2      0
collcat*mealcat 1 3      0
collcat*mealcat 2 1     -0.5
collcat*mealcat 2 2      0
collcat*mealcat 2 3      0
collcat*mealcat 3 1     -0.5
collcat*mealcat 3 2      0
collcat*mealcat 3 3      0

Dependent Variable: api00

Source            DF    Sum of Squares   Mean Square   F Value   Pr > F
Model               8      6243714.810    780464.351    166.76   <.0001
Error             391      1829957.187      4680.197
Corrected Total   399      8073671.998

R-Square   Coeff Var   Root MSE   api00 Mean
0.773343   10.56356    68.41197   647.6225

Source            DF   Type III SS    Mean Square   F Value   Pr > F
collcat            2     42140.566      21070.283      4.50   0.0117
mealcat            2   4764843.563    2382421.781    509.04   <.0001
collcat*mealcat    4    124167.809      31041.952      6.63   <.0001

Parameter                              Estimate     Standard Error   t Value   Pr > |t|
collcat 1 vs 2+ within mealcat = 1   13.0132326       13.5279998      0.96     0.3367
This simple comparison is not statistically significant (t = 0.96, p = 0.3367).
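The estimate can also be checked by hand from the LS means reported at the start of this page. The data step below is a minimal sketch of that arithmetic; the variable names m11, m21, m31, and est are made up for this illustration, and the three values are typed in from the lsmeans output rather than read from the data set.

```sas
* hand calculation of the contrast from the reported LS means;
data _null_;
  m11 = 816.914286;  * collcat = 1, mealcat = 1;
  m21 = 825.651163;  * collcat = 2, mealcat = 1;
  m31 = 782.150943;  * collcat = 3, mealcat = 1;
  * same coefficients as on the estimate statement: 1, -.5, -.5;
  est = 1*m11 - 0.5*m21 - 0.5*m31;
  put est= 10.4;
run;
```

The value written to the log should agree with the estimate of 13.0132 reported above, up to the rounding of the LS means. The hand calculation gives only the point estimate; the standard error and t test still come from the estimate statement.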
Now let’s run the cell means model. Notice that only the interaction term is used on the model and the estimate statements.
proc glm data = elemapi2;
  class mealcat collcat;
  model api00 = mealcat*collcat / noint ss3;
  estimate 'collcat 1 vs 2+ within mealcat = 1'
           collcat*mealcat 1 -.5 -.5 0 0 0 0 0 0 / e;
run;
quit;

< some output omitted >

Coefficients for Estimate collcat 1 vs 2+ within mealcat = 1

                        Row 1
collcat*mealcat 1 1      1
collcat*mealcat 1 2      0
collcat*mealcat 1 3      0
collcat*mealcat 2 1     -0.5
collcat*mealcat 2 2      0
collcat*mealcat 2 3      0
collcat*mealcat 3 1     -0.5
collcat*mealcat 3 2      0
collcat*mealcat 3 3      0

< some output omitted >

Parameter                              Estimate     Standard Error   t Value   Pr > |t|
collcat 1 vs 2+ within mealcat = 1   13.0132326       13.5279998      0.96     0.3367
Brief summary
- A cell means model is used only for the purpose of making the contrast on an estimate statement easier to write. The only part of the output that is considered is the part related to the contrast estimate (this is usually found at the bottom of the output).
- Writing the estimate statement with a cell means model is easier because it includes only one vector of coefficients (for the highest-order interaction). Using the analysis model, the estimate statement for the same contrast may contain multiple vectors and/or matrices and is therefore more difficult to specify correctly.
- The cell means model approach can be used with models that include three-way or higher interactions. Only the highest-order interaction is included on the model statement, and only this term is used on the estimate statement.
- A cell means model can be used with other procedures, such as proc mixed.
- The technique can also be used with the contrast statement.
- You may notice a note in the SAS log file that reads:
Due to the presence of CLASS variables, an intercept is implicitly fitted. R-Square has been corrected for the mean.
This note may at first seem confusing, because we specified the noint option on the model statement. However, instead of estimating an intercept, we are estimating the mean for one of the groups in our model; hence, the same number of parameters is being estimated. (In the full model, there are eight parameters plus the intercept, for a total of nine parameters; in the cell means model, nine parameters are estimated.) The R-square is the same for the two models (R-square = .773343), as we would expect.
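As a sketch of the two summary points about the contrast statement and proc mixed, the code below reuses the coefficients from the first example. It is not taken from the original analysis, and the output is not shown here; it only illustrates where the same coefficient vector would go.

```sas
* cell means model with a contrast statement instead of estimate;
proc glm data = elemapi2;
  class collcat mealcat;
  model api00 = collcat*mealcat / noint;
  contrast 'collcat 1 vs 2+ within mealcat = 1'
           collcat*mealcat 1 0 0 -.5 0 0 -.5 0 0 / e;
run;
quit;

* the same cell means model and estimate statement in proc mixed;
proc mixed data = elemapi2;
  class collcat mealcat;
  model api00 = collcat*mealcat / noint;
  estimate 'collcat 1 vs 2+ within mealcat = 1'
           collcat*mealcat 1 0 0 -.5 0 0 -.5 0 0 / e;
run;
```

The contrast statement reports an F test rather than a t test; for a single-row contrast such as this one, the F value is the square of the t value from the corresponding estimate statement.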
Example 2
In our second example, we will compare the means between two cells in the design. Specifically, we will compare collcat=2 at mealcat=1 to collcat=3 at mealcat=2.
collcat /mealcat | mealcat = 1 | mealcat = 2 | mealcat = 3 |
collcat = 1 | 0 | 0 | 0 |
collcat =2 | 1 | 0 | 0 |
collcat = 3 | 0 | -1 | 0 |
The estimate statement for this comparison is shown below.
proc glm data = elemapi2;
  class collcat mealcat;
  model api00 = collcat mealcat collcat*mealcat / ss3;
  estimate 'collcat 2/mealcat 1 vs. collcat 3/mealcat 2'
           collcat 0 1 -1
           mealcat 1 -1 0
           collcat*mealcat 0 0 0 1 0 0 0 -1 0 / e;
run;
quit;

< some output omitted >

Coefficients for Estimate collcat 2/mealcat 1 vs. collcat 3/mealcat 2

                        Row 1
Intercept                0
collcat 1                0
collcat 2                1
collcat 3               -1
mealcat 1                1
mealcat 2               -1
mealcat 3                0
collcat*mealcat 1 1      0
collcat*mealcat 1 2      0
collcat*mealcat 1 3      0
collcat*mealcat 2 1      1
collcat*mealcat 2 2      0
collcat*mealcat 2 3      0
collcat*mealcat 3 1      0
collcat*mealcat 3 2     -1
collcat*mealcat 3 3      0

< some output omitted >

Parameter                                       Estimate     Standard Error   t Value   Pr > |t|
collcat 2/mealcat 1 vs. collcat 3/mealcat 2   170.013482       13.2917549     12.79     <.0001
This comparison is statistically significant (t = 12.79, p < .0001).
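The three coefficient vectors on this estimate statement follow from writing the two cell means in terms of the full model's parameters. The symbols are assumptions made for this illustration, not SAS output: $\mu$ is the intercept, $\alpha_i$ the collcat effect, $\beta_j$ the mealcat effect, and $\gamma_{ij}$ the interaction effect, so that $\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$.

```latex
\begin{align*}
\mu_{21} - \mu_{32}
  &= (\mu + \alpha_2 + \beta_1 + \gamma_{21})
   - (\mu + \alpha_3 + \beta_2 + \gamma_{32}) \\
  &= (\alpha_2 - \alpha_3) + (\beta_1 - \beta_2) + (\gamma_{21} - \gamma_{32})
\end{align*}
```

Each bracketed term corresponds to one vector: collcat 0 1 -1, mealcat 1 -1 0, and collcat*mealcat 0 0 0 1 0 0 0 -1 0. Because the two cells differ on both factors, neither main effect cancels, so all three vectors are needed in the full model.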
Using the cell means model, the estimate statement is constructed as shown below.
proc glm data = elemapi2;
  class collcat mealcat;
  model api00 = collcat*mealcat / noint;
  estimate 'collcat 2/mealcat 1 vs. collcat 3/mealcat 2'
           collcat*mealcat 0 0 0 1 0 0 0 -1 0 / e;
run;
quit;

< some output omitted >

Coefficients for Estimate collcat 2/mealcat 1 vs. collcat 3/mealcat 2

                        Row 1
collcat*mealcat 1 1      0
collcat*mealcat 1 2      0
collcat*mealcat 1 3      0
collcat*mealcat 2 1      1
collcat*mealcat 2 2      0
collcat*mealcat 2 3      0
collcat*mealcat 3 1      0
collcat*mealcat 3 2     -1
collcat*mealcat 3 3      0

< some output omitted >

Parameter                                       Estimate     Standard Error   t Value   Pr > |t|
collcat 2/mealcat 1 vs. collcat 3/mealcat 2   170.013482       13.2917549     12.79     <.0001
Example 3
In our last example, we will look at a difference in differences. We will take the difference between two differences: ([collcat=1 and mealcat=1] minus [collcat=1 and mealcat=3]) and ([collcat=3 and mealcat=1] minus [collcat=3 and mealcat=3]).
Remember that a little bit of math needs to be done to get the correct signs of the contrast coefficients: (collcat=1/mealcat=1 – collcat=1/mealcat=3) – (collcat=3/mealcat=1 – collcat=3/mealcat=3) = collcat=1/mealcat=1 – collcat=1/mealcat=3 – collcat=3/mealcat=1 + collcat=3/mealcat=3.
collcat /mealcat | mealcat = 1 | mealcat = 2 | mealcat = 3 |
collcat = 1 | 1 | 0 | -1 |
collcat =2 | 0 | 0 | 0 |
collcat = 3 | -1 | 0 | 1 |
proc glm data = elemapi2;
  class collcat mealcat;
  model api00 = collcat mealcat collcat*mealcat / ss3;
  estimate 'differences in differences'
           collcat*mealcat 1 0 -1 0 0 0 -1 0 1 / e;
run;
quit;

< some output omitted >

Coefficients for Estimate differences in differences

                        Row 1
Intercept                0
collcat 1                0
collcat 2                0
collcat 3                0
mealcat 1                0
mealcat 2                0
mealcat 3                0
collcat*mealcat 1 1      1
collcat*mealcat 1 2      0
collcat*mealcat 1 3     -1
collcat*mealcat 2 1      0
collcat*mealcat 2 2      0
collcat*mealcat 2 3      0
collcat*mealcat 3 1     -1
collcat*mealcat 3 2      0
collcat*mealcat 3 3      1

< some output omitted >

Parameter                      Estimate     Standard Error   t Value   Pr > |t|
differences in differences   82.5777567       24.4394069      3.38     0.0008
The comparison is statistically significant (t = 3.38, p = .0008).
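The sign expansion can be checked numerically by plugging in the LS means reported in the first example (cells 1/1, 1/3, 3/1, and 3/3):

```latex
\begin{align*}
(816.914286 - 493.918919) - (782.150943 - 541.733333)
  &= 322.995367 - 240.417610 \\
  &= 82.577757
\end{align*}
```

This agrees with the estimate of 82.5777567 in the output, up to the rounding of the LS means.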
Here is the estimate statement using the cell means model.
proc glm data = elemapi2;
  class collcat mealcat;
  model api00 = collcat*mealcat / noint;
  estimate 'differences in differences'
           collcat*mealcat 1 0 -1 0 0 0 -1 0 1 / e;
run;
quit;

< some output omitted >

Coefficients for Estimate differences in differences

                        Row 1
collcat*mealcat 1 1      1
collcat*mealcat 1 2      0
collcat*mealcat 1 3     -1
collcat*mealcat 2 1      0
collcat*mealcat 2 2      0
collcat*mealcat 2 3      0
collcat*mealcat 3 1     -1
collcat*mealcat 3 2      0
collcat*mealcat 3 3      1

< some output omitted >

Parameter                      Estimate     Standard Error   t Value   Pr > |t|
differences in differences   82.5777567       24.4394069      3.38     0.0008