How can I test differences in means using a cell means model?

Suppose we have an ANOVA model and would like to compare the mean of one group with the mean of another group. This is commonly done with the estimate statement in SAS. Let’s look at a simple example using the data set elemapi2 (https://stats.idre.ucla.edu/wp-content/uploads/2016/02/elemapi2.sas7bdat). The dependent variable, api00, is the school’s API index (a continuous variable). The predictor variables, mealcat and collcat, are categorical variables with three levels each. The model is shown below. We have included the lsmeans statement to get the expected mean for each cell, which is helpful if you want to calculate a contrast estimate by hand.

proc glm data = elemapi2;
class collcat mealcat;
model api00 = collcat mealcat collcat*mealcat/ss3;
lsmeans collcat*mealcat;
run;
quit;

Least Squares Means

collcat    mealcat    api00 LSMEAN

1          1            816.914286
1          2            589.350000
1          3            493.918919
2          1            825.651163
2          2            636.604651
2          3            508.833333
3          1            782.150943
3          2            655.637681
3          3            541.733333

Let’s say that we want to look at a simple comparison of group 1 versus groups 2 and 3 combined of collcat when mealcat = 1. One way of doing this is to use proc glm with an estimate statement. We use the e option on the estimate statement to have SAS print out the contrast coefficients that are applied to each group.

proc glm data = elemapi2;
class collcat mealcat;
model api00 = collcat mealcat collcat*mealcat/ss3;
estimate 'collcat 1 vs 2+ within mealcat = 1'
              collcat 1 -.5 -.5
              collcat*mealcat   1  0  0
                              -.5  0  0
                              -.5  0  0 / e;
run;
quit;

While this estimate statement will run the analysis we want, it is a little difficult to write. Another way of accomplishing the same thing is to use a cell means model. A cell means model estimates only one parameter for each cell and sets the intercept to 0. In general, the cell means model is not used to produce an overall test of model fit, but it is often used to write simpler estimate or contrast statements. So, in practice, we need to write the proc glm code twice, once for the model fit and the second time for the estimates or contrasts. In the code shown below, the first proc glm is for model fit and the second one with the estimate statement is used to estimate the simple comparison. We use the ss3 option to limit the output to only the Type III sums of squares.

In the second call to proc glm, which is a cell means model, the main effects are omitted; only the interaction is included in the model. We use the noint option on the model statement to specify that we are not going to estimate the intercept; therefore, we will estimate one parameter per cell. We use the e option to show us the contrast codes that were used. This is a useful way to be sure that the contrast codings were assigned as you intended.

* estimating the overall model;
proc glm data = elemapi2;
class collcat mealcat;
model api00 = collcat mealcat collcat*mealcat/ss3;
run;
quit;

* estimating the cell means model to get the desired estimate;
proc glm data = elemapi2;
class collcat mealcat;
model api00 = collcat*mealcat/noint;
estimate 'collcat 1 vs 2+ within mealcat = 1'
          collcat*mealcat  1 0 0 -.5 0 0 -.5 0 0 / e;
run;
quit;
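
One way to see that the cell means model estimates one parameter per cell is to print the parameter estimates. The sketch below adds the solution option to the model statement; it assumes the same elemapi2 data set as above. With noint and only the interaction in the model, each of the nine estimates should equal the corresponding LS-mean from the first model (for example, 816.914286 for collcat = 1 and mealcat = 1).

```sas
* cell means model with the parameter estimates printed;
proc glm data = elemapi2;
class collcat mealcat;
model api00 = collcat*mealcat/noint solution;
run;
quit;
```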

Notice that the order of the categorical variables on the class statement determines which variable is the row variable and which is the column variable. In the code above, collcat is listed first, so collcat is the row variable and mealcat is the column variable. The simple comparison we are interested in can therefore be laid out as shown in the following table.

collcat /mealcat	mealcat = 1	mealcat = 2	mealcat = 3
collcat = 1	1	0	0
collcat = 2	-.5	0	0
collcat = 3	-.5	0	0

Writing the numbers in the table one row at a time, we can write our estimate statement as:

  estimate 'simple comparison'
              collcat*mealcat 1 0 0 -.5 0 0 -.5 0 0 / e;

Equivalently, we can make use of the option divisor = to rewrite the statement in terms of whole numbers as shown below.

  estimate 'simple comparison'
              collcat*mealcat 2 0 0 -1 0 0 -1 0 0 /divisor=2 e;

If we switch the order of the variables on the class statement (class mealcat collcat), we have to rewrite the estimate statement accordingly. The corresponding table is simply transposed, as shown below, so reading it one row at a time gives the coefficients 1 -.5 -.5 0 0 0 0 0 0. The result of the estimate statement is exactly the same.

mealcat /collcat	collcat = 1	collcat = 2	collcat = 3
mealcat = 1	1	-.5	-.5
mealcat = 2	0	0	0
mealcat = 3	0	0	0

Let’s run the analysis model. Notice that both collcat and the collcat*mealcat interaction need to be specified on the estimate statement.

proc glm data = elemapi2;
class collcat mealcat;
model api00 = collcat mealcat collcat*mealcat/ss3;
estimate 'collcat 1 vs 2+ within mealcat = 1'
              collcat 1 -.5 -.5
              collcat*mealcat   1  0  0
                              -.5  0  0
                              -.5  0  0 / e;
run;
quit;

The GLM Procedure

   Class Level Information

Class         Levels    Values

mealcat            3    1 2 3

collcat            3    1 2 3

Number of Observations Read         400
Number of Observations Used         400

Coefficients for Estimate collcat 1 vs 2+ within mealcat = 1

                              Row 1

Intercept                         0

collcat         1                 1
collcat         2              -0.5
collcat         3              -0.5

mealcat         1                 0
mealcat         2                 0
mealcat         3                 0

collcat*mealcat 1 1               1
collcat*mealcat 1 2               0
collcat*mealcat 1 3               0
collcat*mealcat 2 1            -0.5
collcat*mealcat 2 2               0
collcat*mealcat 2 3               0
collcat*mealcat 3 1            -0.5
collcat*mealcat 3 2               0
collcat*mealcat 3 3               0

Dependent Variable: api00

                                        Sum of
Source                      DF         Squares     Mean Square    F Value    Pr > F

Model                        8     6243714.810      780464.351     166.76    <.0001

Error                      391     1829957.187        4680.197

Corrected Total            399     8073671.998

R-Square     Coeff Var      Root MSE    api00 Mean

0.773343      10.56356      68.41197      647.6225

Source                      DF     Type III SS     Mean Square    F Value    Pr > F

collcat                      2       42140.566       21070.283       4.50    0.0117
mealcat                      2     4764843.563     2382421.781     509.04    <.0001
collcat*mealcat              4      124167.809       31041.952       6.63    <.0001

                                                          Standard
Parameter                                 Estimate           Error    t Value    Pr > |t|

collcat 1 vs 2+ within mealcat = 1      13.0132326      13.5279998       0.96      0.3367

This simple comparison is not statistically significant (t = 0.96, p = 0.3367).
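
The estimate can also be checked by hand from the LS-means reported at the top of this page. The data step below is only an illustrative sketch: the cell means are typed in from that output, in the order collcat*mealcat 1 1, 1 2, 1 3, 2 1, and so on, and the array names m and c are arbitrary. It multiplies each cell mean by its contrast coefficient and sums the products, which should give approximately 13.0132, the value in the Estimate column.

```sas
* hand check of the contrast estimate using the LS-means;
data _null_;
  /* cell means copied from the lsmeans output, one row of the table at a time */
  array m[9] _temporary_ (816.914286, 589.350000, 493.918919,
                          825.651163, 636.604651, 508.833333,
                          782.150943, 655.637681, 541.733333);
  /* contrast coefficients: collcat 1 vs 2+ within mealcat = 1 */
  array c[9] _temporary_ (1, 0, 0, -.5, 0, 0, -.5, 0, 0);
  est = 0;
  do i = 1 to 9;
    est = est + c[i]*m[i];
  end;
  put 'Estimate by hand: ' est 12.6;
run;
```

The result is written to the SAS log rather than to the output window.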

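Now let’s run the cell means model. Notice that only the interaction term is used on the model and the estimate statements. Because mealcat is listed first on the class statement here, the coefficients are written in the transposed order, 1 -.5 -.5 0 0 0 0 0 0.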

proc glm data = elemapi2;
class  mealcat collcat;
model api00 = mealcat*collcat/noint ss3;
estimate 'collcat 1 vs 2+ within mealcat = 1'
            collcat*mealcat 1 -.5 -.5  0 0 0 0 0 0 /e;
run;
quit;

< some output omitted >

Coefficients for Estimate collcat 1 vs 2+ within mealcat = 1

                              Row 1

collcat*mealcat 1 1               1
collcat*mealcat 1 2               0
collcat*mealcat 1 3               0
collcat*mealcat 2 1            -0.5
collcat*mealcat 2 2               0
collcat*mealcat 2 3               0
collcat*mealcat 3 1            -0.5
collcat*mealcat 3 2               0
collcat*mealcat 3 3               0

< some output omitted >

                                                          Standard
Parameter                                 Estimate           Error    t Value    Pr > |t|

collcat 1 vs 2+ within mealcat = 1      13.0132326      13.5279998       0.96      0.3367

Brief summary

A cell means model is used only for the purpose of making the contrast on an estimate statement easier to write. The only part of the output that is considered is the part related to the contrast estimate (this is usually found at the bottom of the output).
Writing the estimate statement with a cell means model is easier because it includes only one vector (for the highest-order interaction). Using the analysis model, the estimate statement for the same contrast may contain multiple vectors and/or matrices and is therefore more difficult to specify correctly.
The cell means model approach can be used with models that include three-way or higher interactions. Only the highest-order interaction is included on the model statement, and only this term is used on the estimate statement.
A cell means model can be used with other procedures, such as proc mixed.
The technique can also be used with the contrast statement.
You may notice a note in the SAS log file that reads:

Due to the presence of CLASS variables, an intercept is implicitly fitted.  
R-Square has been corrected for the mean.

This note may at first seem confusing, because we specified the noint option on the model statement. However, instead of estimating an intercept, we are estimating the mean for one of the groups in our model; hence, the same number of parameters is being estimated. (In the full model, there are eight parameters plus the intercept, for a total of nine parameters; in the cell means model, nine parameters are estimated.) As we would expect, the R-square is the same for the two models (R-square = .773343).
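
As a sketch of the last two uses mentioned in the summary, the code below writes the first comparison with a contrast statement in proc glm and with an estimate statement in proc mixed. Both steps assume the same elemapi2 data set and the same class order as above; the proc mixed step assumes a model with fixed effects only, so its results should agree with those from proc glm. A contrast statement reports an F test instead of a t test; for a single-row contrast, the F value is the square of the t value.

```sas
* cell means model with a contrast statement;
proc glm data = elemapi2;
class collcat mealcat;
model api00 = collcat*mealcat/noint;
contrast 'collcat 1 vs 2+ within mealcat = 1'
          collcat*mealcat 1 0 0 -.5 0 0 -.5 0 0 / e;
run;
quit;

* cell means model in proc mixed, with fixed effects only;
proc mixed data = elemapi2;
class collcat mealcat;
model api00 = collcat*mealcat/noint;
estimate 'collcat 1 vs 2+ within mealcat = 1'
          collcat*mealcat 1 0 0 -.5 0 0 -.5 0 0 / e;
run;
```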

Example 2

In our second example, we will compare the means between two cells in the design. Specifically, we will compare collcat=2 at mealcat=1 to collcat=3 at mealcat=2.

collcat /mealcat	mealcat = 1	mealcat = 2	mealcat = 3
collcat = 1	0	0	0
collcat = 2	1	0	0
collcat = 3	0	-1	0

The estimate statement for this comparison is shown below.

proc glm data = elemapi2;
class collcat mealcat;
model api00 = collcat mealcat collcat*mealcat/ss3;
estimate 'collcat 2/mealcat 1 vs. collcat 3/mealcat 2'
              collcat 0 1 -1
              mealcat 1 -1 0
              collcat*mealcat 0  0 0
                              1  0 0
                              0 -1 0 /e;
run;
quit;

< some output omitted >

Coefficients for Estimate collcat 2/mealcat 1 vs. collcat 3/mealcat 2

                              Row 1

Intercept                         0

collcat         1                 0
collcat         2                 1
collcat         3                -1

mealcat         1                 1
mealcat         2                -1
mealcat         3                 0

collcat*mealcat 1 1               0
collcat*mealcat 1 2               0
collcat*mealcat 1 3               0
collcat*mealcat 2 1               1
collcat*mealcat 2 2               0
collcat*mealcat 2 3               0
collcat*mealcat 3 1               0
collcat*mealcat 3 2              -1
collcat*mealcat 3 3               0

< some output omitted >

                                                                   Standard
Parameter                                          Estimate           Error    t Value    Pr > |t|

collcat 2/mealcat 1 vs. collcat 3/mealcat 2      170.013482      13.2917549      12.79      <.0001

This comparison is statistically significant (t = 12.79, p < .0001).

Using the cell means model, the estimate statement is constructed as shown below.

proc glm data = elemapi2;
class collcat mealcat;
model api00 = collcat*mealcat/noint;
estimate 'collcat 2/mealcat 1 vs. collcat 3/mealcat 2'
collcat*mealcat 0 0 0 1 0 0 0 -1 0 /e;
run;
quit;

< some output omitted >

Coefficients for Estimate collcat 2/mealcat 1 vs. collcat 3/mealcat 2

                              Row 1

collcat*mealcat 1 1               0
collcat*mealcat 1 2               0
collcat*mealcat 1 3               0
collcat*mealcat 2 1               1
collcat*mealcat 2 2               0
collcat*mealcat 2 3               0
collcat*mealcat 3 1               0
collcat*mealcat 3 2              -1
collcat*mealcat 3 3               0

< some output omitted >

                                                                   Standard
Parameter                                          Estimate           Error    t Value    Pr > |t|

collcat 2/mealcat 1 vs. collcat 3/mealcat 2      170.013482      13.2917549      12.79      <.0001

Example 3

In our last example, we will look at a difference in differences. We will take the difference between two differences: the difference between [collcat=1 and mealcat=1] and [collcat=1 and mealcat=3], minus the difference between [collcat=3 and mealcat=1] and [collcat=3 and mealcat=3].

Remember that a little bit of math needs to be done to get the correct signs of the contrast coefficients: (collcat=1/mealcat=1 – collcat=1/mealcat=3) – (collcat=3/mealcat=1 – collcat=3/mealcat=3) = collcat=1/mealcat=1 – collcat=1/mealcat=3 – collcat=3/mealcat=1 + collcat=3/mealcat=3.

collcat /mealcat	mealcat = 1	mealcat = 2	mealcat = 3
collcat = 1	1	0	-1
collcat = 2	0	0	0
collcat = 3	-1	0	1

proc glm data = elemapi2;
class collcat mealcat;
model api00 = collcat mealcat collcat*mealcat/ss3;
estimate 'differences in differences'
              collcat*mealcat  1 0 -1
                               0 0  0
                              -1 0  1 /e;
run;
quit;

< some output omitted >

Coefficients for Estimate differences in differences

                              Row 1

Intercept                         0

collcat         1                 0
collcat         2                 0
collcat         3                 0

mealcat         1                 0
mealcat         2                 0
mealcat         3                 0

collcat*mealcat 1 1               1
collcat*mealcat 1 2               0
collcat*mealcat 1 3              -1
collcat*mealcat 2 1               0
collcat*mealcat 2 2               0
collcat*mealcat 2 3               0
collcat*mealcat 3 1              -1
collcat*mealcat 3 2               0
collcat*mealcat 3 3               1

< some output omitted >

                                                  Standard
Parameter                         Estimate           Error    t Value    Pr > |t|

differences in differences      82.5777567      24.4394069       3.38      0.0008

The comparison is statistically significant (t = 3.38, p = .0008).
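
The difference in differences can be checked by hand as well. The data step below is an illustrative sketch that uses the four relevant LS-means from the lsmeans output at the top of this page; the variable names (m11, m13, m31, m33, diff1, diff3, did) are made up for this example. It computes the two within-collcat differences first and then subtracts one from the other, which should give approximately 82.5778.

```sas
* hand check of the difference in differences;
data _null_;
  /* LS-means copied from the lsmeans output: m<collcat><mealcat> */
  m11 = 816.914286;
  m13 = 493.918919;
  m31 = 782.150943;
  m33 = 541.733333;
  diff1 = m11 - m13;   /* mealcat 1 vs 3 within collcat = 1 */
  diff3 = m31 - m33;   /* mealcat 1 vs 3 within collcat = 3 */
  did = diff1 - diff3; /* difference in differences */
  put diff1= 12.6 diff3= 12.6 did= 12.6;
run;
```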

Here is the estimate statement using the cell means model.

proc glm data = elemapi2;
class collcat mealcat;
model api00 = collcat*mealcat/noint;
estimate 'differences in differences'
          collcat*mealcat 1 0 -1 0 0 0 -1 0 1 /e;
run;
quit;

< some output omitted >

Coefficients for Estimate differences in differences

                              Row 1

collcat*mealcat 1 1               1
collcat*mealcat 1 2               0
collcat*mealcat 1 3              -1
collcat*mealcat 2 1               0
collcat*mealcat 2 2               0
collcat*mealcat 2 3               0
collcat*mealcat 3 1              -1
collcat*mealcat 3 2               0
collcat*mealcat 3 3               1

< some output omitted >

                                                  Standard
Parameter                         Estimate           Error    t Value    Pr > |t|

differences in differences      82.5777567      24.4394069       3.38      0.0008