Consider this simple data file with nine subjects (**sub**)
in three groups (**iv**), each with a score on the dependent variable (**dv**).

    DATA dummy;
      INPUT sub iv dv;
    CARDS;
    1 1 48
    2 1 49
    3 1 50
    4 2 17
    5 2 20
    6 2 23
    7 3 28
    8 3 30
    9 3 32
    ;
    RUN;

Below we run **proc means** once to find the
overall mean, and a second time with a **class** statement to find the means for the
three groups.

    PROC MEANS DATA=dummy;
      VAR dv;
    RUN;

    PROC MEANS DATA=dummy;
      CLASS iv;
      VAR dv;
    RUN;

As we see below, the overall mean is 33, and the means for groups 1, 2 and 3 are 49, 20 and 30 respectively.

    Analysis Variable : DV

     N          Mean       Std Dev       Minimum       Maximum
    ---------------------------------------------------------
     9    33.0000000    12.8937970    17.0000000    50.0000000
    ---------------------------------------------------------

    Analysis Variable : DV

     IV    N Obs    N          Mean       Std Dev       Minimum       Maximum
    ------------------------------------------------------------------------------
      1        3    3    49.0000000     1.0000000    48.0000000    50.0000000
      2        3    3    20.0000000     3.0000000    17.0000000    23.0000000
      3        3    3    30.0000000     2.0000000    28.0000000    32.0000000
    ------------------------------------------------------------------------------

Let’s run a standard ANOVA on this data using **proc glm**.

    PROC GLM DATA=dummy;
      CLASS iv;
      MODEL dv = iv;
    RUN;

The results of the ANOVA are shown below.

    General Linear Models Procedure
    Class Level Information

    Class    Levels    Values
    IV            3    1 2 3

    Number of observations in data set = 9

    General Linear Models Procedure

    Dependent Variable: DV
                                  Sum of          Mean
    Source            DF         Squares        Square    F Value    Pr > F
    Model              2    1302.0000000   651.0000000     139.50    0.0001
    Error              6      28.0000000     4.6666667
    Corrected Total    8    1330.0000000

    R-Square        C.V.     Root MSE      DV Mean
    0.978947    6.546203    2.1602469    33.000000

    Source    DF       Type I SS    Mean Square    F Value    Pr > F
    IV         2    1302.0000000    651.0000000     139.50    0.0001

    Source    DF     Type III SS    Mean Square    F Value    Pr > F
    IV         2    1302.0000000    651.0000000     139.50    0.0001

Now, let’s take this information we have found, and
relate it to the results that we get when we run a similar analysis using dummy coding.
Let’s make a data file called **dummy2** that has dummy variables called **iv1**
(1 if iv=1), **iv2** (1 if iv=2) and **iv3** (1 if iv=3).
Note that **iv3** is not really necessary, but it could be useful for further
exploring the meaning of dummy variables. We will then use **proc reg** to
predict **dv** from **iv1** and **iv2**.

    DATA dummy2;
      SET dummy;
      IF (iv = 1) THEN iv1 = 1; ELSE iv1 = 0;
      IF (iv = 2) THEN iv2 = 1; ELSE iv2 = 0;
      IF (iv = 3) THEN iv3 = 1; ELSE iv3 = 0;
    RUN;

    PROC REG DATA=dummy2;
      MODEL dv = iv1 iv2;
    RUN;

The output is shown below.

    Model: MODEL1
    Dependent Variable: DV

    Analysis of Variance

                         Sum of         Mean
    Source      DF      Squares       Square    F Value    Prob>F
    Model        2   1302.00000    651.00000    139.500    0.0001
    Error        6     28.00000      4.66667
    C Total      8   1330.00000

    Root MSE      2.16025    R-square    0.9789
    Dep Mean     33.00000    Adj R-sq    0.9719
    C.V.          6.54620

    Parameter Estimates

                      Parameter      Standard    T for H0:
    Variable    DF     Estimate         Error    Parameter=0    Prob > |T|
    INTERCEP     1    30.000000    1.24721913         24.054        0.0001
    IV1          1    19.000000    1.76383421         10.772        0.0001
    IV2          1   -10.000000    1.76383421         -5.669        0.0013

First, note that the F value from the ANOVA using **proc glm**
was 139.5, and the F value for the model in the regression using **proc reg**
is also 139.5. This illustrates that the overall test of the model
using regression is really the same as doing an ANOVA.
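
As a quick check on where that shared F value comes from, the sketch below recomputes it from the sums of squares and degrees of freedom printed in both outputs. The variable names (**ms_model**, **ms_error**, **f**, **p**) are made up for this illustration and do not appear in the original analysis; **PROBF** is the SAS cumulative F distribution function.

```sas
/* Illustrative only: recompute the model F test from the printed ANOVA table. */
DATA _NULL_;
  ms_model = 1302 / 2;           /* model sum of squares / model DF */
  ms_error = 28 / 6;             /* error sum of squares / error DF */
  f = ms_model / ms_error;       /* F ratio on 2 and 6 degrees of freedom */
  p = 1 - PROBF(f, 2, 6);        /* upper-tail probability of that F */
  PUT f= p=;                     /* writes both values to the SAS log */
RUN;
```

The value of **f** written to the log should be 139.5, matching both procedures, because each one divides the same model mean square by the same error mean square.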

After the **Analysis of Variance** section, there is a section titled
**Parameter Estimates**. What is the interpretation of the values listed there, the 30, 19 and
-10? Notice that **iv1** and **iv2** refer to group
1 and group 2, but we did not include any dummy variable referring to group 3. Group 3 is
often called the **omitted group** or **reference group**.
Recall that the means of the 3 groups were 49, 20 and 30 respectively. The **intercept**
term is the mean of the **omitted group**, and indeed the intercept estimate
in the output, 30, is the mean of group 3. The parameter estimate for **iv1**
is the mean of group 1 minus the mean of group 3, 49 - 30 = 19, which matches the
output. Likewise, the parameter estimate for **iv2**
is the mean of group 2 minus the mean of group 3, 20 - 30 = -10, which also matches the
output.

So, in summary:

    Intercept    mean of group 3 (mean of omitted group)
    iv1          mean of group 1 - mean of group 3 (omitted group)
    iv2          mean of group 2 - mean of group 3 (omitted group)

Try running this example, but use **iv2** and **iv3** in **proc reg** (making group 1 the omitted group) and see what happens.
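
The exercise above can be set up as in the sketch below, assuming the **dummy2** data file created earlier (which already contains **iv3**). The **QUIT** statement is added here only to end the interactive **proc reg** session.

```sas
/* Same regression, but with group 1 as the omitted (reference) group. */
PROC REG DATA=dummy2;
  MODEL dv = iv2 iv3;   /* iv1 is left out, so group 1 becomes the reference */
RUN;
QUIT;
```

Going by the group means reported earlier, the intercept should come out as 49 (the mean of group 1), with **iv2** estimated as 20 - 49 = -29 and **iv3** as 30 - 49 = -19. The F test for the model should be unchanged, since only the choice of reference group differs.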

Finally, consider how the parameter estimates can be used in the regression model to obtain the means for the groups (the predicted values). The regression model is:

    Ypredicted = 30 + iv1*19 + iv2*(-10)

    For group 1: Ypredicted = 30 + 1*19 + 0*(-10) = 49
    For group 2: Ypredicted = 30 + 0*19 + 1*(-10) = 20
    For group 3: Ypredicted = 30 + 0*19 + 0*(-10) = 30

As you can see, the regression equation predicts, for each group, exactly that group's mean.
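
One way to confirm this in SAS rather than by hand is to ask **proc reg** to save the predicted values and then list them next to the observed scores. This is a minimal sketch, assuming the **dummy2** data file from above; the output data set name **pred** and the variable name **yhat** are chosen here for illustration.

```sas
/* Save the predicted value for every subject, then compare with the group means. */
PROC REG DATA=dummy2;
  MODEL dv = iv1 iv2;
  OUTPUT OUT=pred P=yhat;   /* P= names the variable that holds the predicted values */
RUN;
QUIT;

PROC PRINT DATA=pred;
  VAR sub iv dv yhat;       /* yhat should be constant within each group */
RUN;

PROC MEANS DATA=pred MEAN;
  CLASS iv;
  VAR dv yhat;              /* mean of dv and mean of yhat should agree per group */
RUN;
```

If the model behaves as described, **yhat** should be 49 for every subject in group 1, 20 in group 2 and 30 in group 3.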