Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model. There are a variety of coding systems that can be used when recoding categorical variables, and which one you select depends on the comparisons that you want to make. For example, you may want to compare each level of the categorical variable to the lowest level (or any given level). In that case you would use a system called **simple coding**. Or you may want to compare each level to the next higher level, in which case you would want to use **repeated coding**. We will discuss two general types of coding and when to use them: dummy coding and effect coding.

The examples on this page use the dataset at https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hsb2-2.sav. We focus on the categorical variable **race**, which has four levels (1 = Hispanic, 2 = Asian, 3 = African American and 4 = white), and we use **write** as our dependent variable. Although our example uses a variable with four levels, these coding systems work with variables that have more or fewer categories. No matter which coding system you select, you will always have one fewer recoded variable than levels of the original variable. In our example, the categorical variable has four levels, so we will have three new variables. (A variable corresponding to the final level of the categorical variable would be redundant and therefore unnecessary.)

## DUMMY CODING

Perhaps the simplest and most common coding system is called **dummy coding**. It is a way to make the categorical variable into a series of dichotomous variables (variables that can have a value of zero or one only). For all but one of the levels of the categorical variable, a new variable will be created that has a value of one for each observation at that level and zero for all others. In our example using the variable **race**, the first new variable (**x1**) will have a value of one for each observation in which race is Hispanic, and zero for all other observations. Likewise, we create **x2** to be 1 when the person is Asian, and 0 otherwise, and **x3** to be 1 when the person is African American, and 0 otherwise. The level of the categorical variable that is coded as zero in all of the new variables is the reference level, or the level to which all of the other levels are compared. In our example, white is the reference level. You can select any level of the categorical variable as the reference level.

DUMMY CODING

Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
---|---|---|---|
1 (Hispanic) | 1 | 0 | 0 |
2 (Asian) | 0 | 1 | 0 |
3 (African American) | 0 | 0 | 1 |
4 (white) | 0 | 0 | 0 |

After the new variables are created, they are entered into the regression in place of the original variable, so we would enter **x1**, **x2** and **x3** instead of **race** into our regression equation, and the regression output will include a coefficient for each of these variables. The coefficient for **x1** is the mean of the dependent variable for group 1 minus the mean of the dependent variable for the omitted group. In our example, the coefficient for **x1** would be the mean of **write** for the Hispanic group minus the mean of **write** for the white group. Likewise, the coefficient for **x2** would be the mean of **write** for the Asian group minus the mean of **write** for the white group, and the coefficient for **x3** would be the mean of **write** for the African American group minus the mean of **write** for the white group.
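The dummy-coding rule described above can be sketched in a few lines of Python. This is only an illustration outside SPSS: the function name `dummy_code` and its parameters are hypothetical, and the sketch assumes the four race levels are stored as the integers 1 to 4 with level 4 (white) as the reference.

```python
# Hypothetical helper (not part of SPSS): dummy-code one observation.
def dummy_code(level, levels=(1, 2, 3, 4), reference=4):
    """Return one 0/1 indicator per non-reference level.

    The reference level gets 0 on every indicator.
    """
    return [1 if level == lv else 0 for lv in levels if lv != reference]


# Print the x1, x2, x3 values for each level of race.
for race in (1, 2, 3, 4):
    print(race, dummy_code(race))
```

Passing a different `reference` value changes which level is coded zero on all of the new variables, which mirrors the point that any level can serve as the reference level.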

### EFFECT CODING

Other coding systems use more values than just zero and one, and therefore allow you to make other types of comparisons. Unlike dummy coding, effect coding allows you to assign different weights to the various levels of the categorical variable. While the "rule" in dummy coding is that only values of zero and one are valid, the "rule" in effect coding is that all of the values in any new variable must sum to zero. Which level is assigned a positive or negative value is not very important: 0 1 -1 0 is the same as 0 -1 1 0 in that both of these codings compare the second and the third levels of the variable. Only the sign of the coefficient would change.
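As a small arithmetic illustration of the sum-to-zero rule and of the sign flip, the Python sketch below applies both codings to the four level means of write that are reported in the means table later on this page. The function name `contrast_estimate` is hypothetical, and treating a contrast as a weighted sum of level means is the assumption being illustrated here.

```python
# Means of write for race levels 1-4, copied from the means table on this page.
means = [46.4583, 58.0, 48.2, 54.0552]


def contrast_estimate(weights, group_means):
    """Weighted sum of level means; the weights must sum to zero."""
    assert abs(sum(weights)) < 1e-9, "effect-coded weights must sum to zero"
    return sum(w * m for w, m in zip(weights, group_means))


forward = contrast_estimate([0, 1, -1, 0], means)   # level 2 minus level 3
reverse = contrast_estimate([0, -1, 1, 0], means)   # level 3 minus level 2
print(forward, reverse)
```

Both codings compare levels 2 and 3; the two estimates have the same size and opposite signs.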

Another point to consider is that while you can use dummy coding with any type of categorical variable, some forms of effect coding make more sense with ordinal categorical variables than with nominal categorical variables. For our example we use the variable race, which is a nominal categorical variable. Because dummy coding compares the mean of the dependent variable for each level of the categorical variable to the mean of the dependent variable for the reference group, it makes sense with a nominal variable. However, it may not make as much sense to use a coding scheme that tests the **linear** effect of race. As we describe each type of coding system, we note those coding systems with which it does not make as much sense to use a nominal variable.

Within SPSS there are two general commands that you can use for analyzing data with a continuous dependent variable and one or more categorical predictors: the **regression** command and the **glm** command (which replaced the **manova** command, not discussed on this page). If using the **regression** command, you would create one fewer new variable than there are levels in your categorical variable and use these new variables as predictors in your regression model. The values for these new variables will depend on how many levels are in your categorical variable and the coding system you choose. From this point on we will refer to the coding scheme used with the **regression** command as **regression** coding. Another method for analyzing categorical data would be to use the **glm** command, and then you could use the **/lmatrix** or the **/contrast** subcommands to perform comparisons among the groups. We will refer to this coding scheme as **contrast** coding. So, if you are using the **regression** command, be sure to choose the **regression** coding scheme, and if you are using the **glm** command, be sure to choose the **contrast** coding scheme.

Below is a table listing various types of contrasts and the
comparison that they make.

Name of contrast | Comparison made |
---|---|
Simple | Compares each level of a variable to the last level (or whichever level is specified). |
Deviation | Compares deviations from the grand mean. |
Difference | Compares levels of a variable with the mean of the previous levels of the variable. |
Helmert | Compares levels of a variable with the mean of the subsequent levels of the variable. |
Polynomial | Orthogonal polynomial contrasts. |
Repeated | Compares adjacent levels of a variable. |
Special | User-defined contrast. |

## SIMPLE EFFECT CODING

The results of simple effect coding are very similar to those of dummy coding in that each group is compared to the reference group. In the example below, group 4 is the reference group and the first comparison compares group 1 to group 4, the second comparison compares group 2 to group 4, and the third comparison compares group 3 to group 4.

The **regression** coding is a bit more complex than dummy coding. In our example below, group 4 is the reference group and **x1** compares group 1 to group 4, **x2** compares group 2 to group 4, and **x3** compares group 3 to group 4. For **x1** the coding is 3/4 (.75) for group 1, and -1/4 (-.25) for all other groups. Likewise, for **x2** the coding is 3/4 (.75) for group 2, and -1/4 (-.25) for all other groups, and for **x3** the coding is 3/4 (.75) for group 3, and -1/4 (-.25) for all other groups. Note that the values of each new variable sum to 0.

SIMPLE regression coding

Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
---|---|---|---|
1 (Hispanic) | .75 | -.25 | -.25 |
2 (Asian) | -.25 | .75 | -.25 |
3 (African American) | -.25 | -.25 | .75 |
4 (white) | -.25 | -.25 | -.25 |

The **contrast** coding, shown below, is more straightforward. It also follows the rule for effect coding that the values in each new variable sum to zero. The first contrast compares group 1 to group 4, so group 1 is coded "1" and group 4 is coded "-1". Likewise, the second contrast compares group 2 to group 4 by coding group 2 "1" and group 4 "-1". As you can see, with contrast coding you can discern the meaning of the comparisons simply by inspecting the contrast coefficients. For example, looking at the contrast coefficients for **c3**, you can see that this contrast compares group 3 to group 4.

SIMPLE effect contrast coding

Level of race | New variable 1 (c1) | New variable 2 (c2) | New variable 3 (c3) |
---|---|---|---|
1 (Hispanic) | 1 | 0 | 0 |
2 (Asian) | 0 | 1 | 0 |
3 (African American) | 0 | 0 | 1 |
4 (white) | -1 | -1 | -1 |

In the above examples, both the regression coefficient for **x1** and the contrast estimate for **c1** would be the mean of **write** for level 1 (Hispanic) minus the mean of **write** for level 4 (white). Likewise, the regression coefficient for **x2** and the contrast estimate for **c2** would be the mean of **write** for level 2 (Asian) minus the mean of **write** for level 4 (white).
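The equivalence of the two simple codings can be checked numerically with a short Python sketch. It relies on one assumption worth stating: with a single categorical predictor the fitted value for each level equals that level's mean, so the regression coefficients solve "level mean = intercept + codes times coefficients". The helper `solve` is a hypothetical exact solver written for this example, and the level means are copied from the means table later on this page.

```python
from fractions import Fraction as F


def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination (exact)."""
    n = len(a)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]


# Means of write for race levels 1-4 (from the means table on this page).
means = [F("46.4583"), F("58"), F("48.2"), F("54.0552")]

q = F(1, 4)
simple_regression = [  # rows = race levels 1-4, columns = x1, x2, x3
    [3 * q, -q, -q],
    [-q, 3 * q, -q],
    [-q, -q, 3 * q],
    [-q, -q, -q],
]
# Prepend a column of ones for the intercept, then solve for b0..b3.
design = [[F(1)] + row for row in simple_regression]
b0, b1, b2, b3 = solve(design, means)

# Contrast coding: each estimate is a weighted sum of the level means.
simple_contrast = [[1, 0, 0, -1], [0, 1, 0, -1], [0, 0, 1, -1]]
estimates = [sum(w * mu for w, mu in zip(c, means)) for c in simple_contrast]

print([float(b) for b in (b1, b2, b3)])
print([float(e) for e in estimates])
```

Under these assumptions the three regression coefficients and the three contrast estimates coincide, and the intercept under simple regression coding is the unweighted mean of the four level means.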

## DEVIATION EFFECT CODING

This coding system compares the mean of the dependent variable for a given level to the grand mean of the dependent variable, that is, the unweighted mean of the means of all of the levels. In our example below, the first comparison compares level 1 (Hispanic) to the mean of all four levels, the second comparison compares level 2 (Asian) to the mean of all four levels, and the third comparison compares level 3 (African American) to the mean of all four levels.

As you see in the example below, the **regression**
coding is accomplished by assigning "1" to group 1 for the first
comparison (since group 1 is the group to be compared to all others), a
"1" to group 2 for the second comparison (since group 2 is to be
compared to all others), and "1" to group 3 for the third comparison
(since group 3 is to be compared to all others). Note that a
"-1" is assigned to group 4 for all 3 comparisons (since it is the
group that is never compared to the other groups) and all other values are
assigned a 0. This **regression** coding scheme yields the comparisons
described above.

DEVIATION regression coding

Level of race | New variable 1 (x1): Level 1 v. Mean | New variable 2 (x2): Level 2 v. Mean | New variable 3 (x3): Level 3 v. Mean |
---|---|---|---|
1 (Hispanic) | 1 | 0 | 0 |
2 (Asian) | 0 | 1 | 0 |
3 (African American) | 0 | 0 | 1 |
4 (white) | -1 | -1 | -1 |

The **contrast** coding is shown below. The first comparison, which compares group 1 to groups 2, 3 and 4, assigns 3/4 (.75) to group 1 and -1/4 (-.25) to groups 2, 3 and 4. Likewise, the second comparison, which compares group 2 to groups 1, 3 and 4, assigns 3/4 (.75) to group 2 and -1/4 (-.25) to groups 1, 3 and 4, and so forth for the third comparison. Note that you could substitute 3 for 3/4 and 1 for 1/4 and you would get the same test of significance, but the contrast coefficient would be different.

DEVIATION contrast coding

Level of race | New variable 1 (c1): Level 1 v. Mean | New variable 2 (c2): Level 2 v. Mean | New variable 3 (c3): Level 3 v. Mean |
---|---|---|---|
1 (Hispanic) | .75 | -.25 | -.25 |
2 (Asian) | -.25 | .75 | -.25 |
3 (African American) | -.25 | -.25 | .75 |
4 (white) | -.25 | -.25 | -.25 |

In the above examples, both the regression coefficient for **x1** and the contrast estimate for **c1** would be the mean of **write** for level 1 (Hispanic) minus the grand mean of **write** across all four levels, which equals three quarters of the difference between the mean for level 1 and the mean for levels 2, 3 and 4 combined. Likewise, the regression coefficient for **x2** and the contrast estimate for **c2** would be the mean of **write** for level 2 (Asian) minus the grand mean of **write** across all four levels.
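A quick arithmetic check, sketched in Python, shows what quantity the deviation contrast codes in the table above produce. The level means are copied from the means table later on this page, and "grand mean" is taken here to mean the unweighted mean of the four level means, which is an assumption of this sketch rather than something SPSS prints.

```python
# Means of write for race levels 1-4 (from the means table on this page).
means = [46.4583, 58.0, 48.2, 54.0552]

# Unweighted mean of the four level means.
grand_mean = sum(means) / len(means)

deviation_contrasts = [
    [0.75, -0.25, -0.25, -0.25],   # level 1 v. mean
    [-0.25, 0.75, -0.25, -0.25],   # level 2 v. mean
    [-0.25, -0.25, 0.75, -0.25],   # level 3 v. mean
]
estimates = [sum(w * m for w, m in zip(c, means)) for c in deviation_contrasts]

# The same quantity written as a difference from the other three levels.
mean_of_others = sum(means[1:]) / 3
scaled_difference = 0.75 * (means[0] - mean_of_others)

print(estimates)
print(means[0] - grand_mean, scaled_difference)
```

The first estimate equals the level 1 mean minus the grand mean, and it is also three quarters of the difference between level 1 and the average of the other three levels, which reconciles the two ways the comparison is often described.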

## DIFFERENCE CODING

In this coding system, each level is compared to the mean of the previous levels. In our example, the first comparison compares the mean of the dependent variable for level 2 of race with the mean of the dependent variable for level 1 of race. The second comparison compares the mean of the dependent variable for level 3 of race with the mean of the dependent variable for levels 1 and 2 combined, and the third comparison compares the mean of the dependent variable for level 4 of race with the mean for levels 1, 2 and 3 combined. Clearly, this coding system does not make much sense with our example of race because it is a nominal variable. However, this system is useful when the levels of the categorical variable are ordered in a meaningful way. For example, if we had a categorical variable in which work-related stress was coded as low, medium or high, then comparing each level with the mean of the previous levels would make more sense.

Below we see an example of **regression** coding. For the first comparison, where the first and second levels are compared, **x1** is coded -1/2 (-.5) and 1/2 (.5) and the rest 0. For the second comparison, the values of **x2** are coded -1/3 (-.333) then -1/3 (-.333) then 2/3 (.666) and then 0. Finally, for the third comparison, the values of **x3** are coded -1/4 -1/4 -1/4 and then 3/4.

DIFFERENCE regression coding

Level of race | New variable 1 (x1): Level 2 v. Level 1 | New variable 2 (x2): Level 3 v. Previous | New variable 3 (x3): Level 4 v. Previous |
---|---|---|---|
1 (Hispanic) | -.5 | -.333 | -.25 |
2 (Asian) | .5 | -.333 | -.25 |
3 (African American) | 0 | .666 | -.25 |
4 (white) | 0 | 0 | .75 |

For **contrast** coding, the first comparison, between groups 1 and 2, is coded -1 and 1 for those groups and 0 otherwise. The second comparison, between groups 1 and 2 and group 3, is coded -.5 -.5 1 and 0, and the last comparison, between groups 1, 2 and 3 and group 4, is coded -.333 -.333 -.333 and 1.

DIFFERENCE contrast coding

Level of race | New variable 1 (c1): Level 2 v. Level 1 | New variable 2 (c2): Level 3 v. Previous | New variable 3 (c3): Level 4 v. Previous |
---|---|---|---|
1 (Hispanic) | -1 | -.5 | -.333 |
2 (Asian) | 1 | -.5 | -.333 |
3 (African American) | 0 | 1 | -.333 |
4 (white) | 0 | 0 | 1 |

In the above examples, both the regression coefficient for **x1** and the contrast estimate for **c1** would be the mean of **write** for level 2 (Asian) minus the mean of **write** for level 1 (Hispanic), because level 2 carries the positive code. Likewise, the regression coefficient for **x2** and the contrast estimate for **c2** would be the mean of **write** for level 3 minus the mean of **write** for levels 1 and 2 combined. Finally, the regression coefficient for **x3** and the contrast estimate for **c3** would be the mean of **write** for level 4 minus the mean of **write** for levels 1, 2 and 3 combined.
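The pattern in the difference contrast table generalizes to any number of levels, as the Python sketch below suggests. The function name `difference_contrasts` is hypothetical, "combined" is taken to mean the unweighted average of the level means, and the level means are copied from the means table later on this page.

```python
from fractions import Fraction


def difference_contrasts(k):
    """Contrast j (1-based) compares level j+1 with the mean of levels 1..j."""
    rows = []
    for j in range(1, k):
        rows.append(
            [Fraction(-1, j)] * j + [Fraction(1)] + [Fraction(0)] * (k - j - 1)
        )
    return rows


# Means of write for race levels 1-4 (from the means table on this page).
means = [Fraction("46.4583"), Fraction("58"), Fraction("48.2"), Fraction("54.0552")]

contrasts = difference_contrasts(4)
estimates = [sum(w * m for w, m in zip(c, means)) for c in contrasts]

for c, e in zip(contrasts, estimates):
    print([str(w) for w in c], float(e))
```

Each row is one contrast (one column of the table above). With these codes the positive weight sits on the later level, so each estimate is the later level's mean minus the average of the earlier levels' means.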

## HELMERT EFFECT CODING

Helmert coding is just the opposite of difference coding: instead of comparing each level of the categorical variable to the mean of the previous levels, each level is compared to the mean of the subsequent levels. Hence, the first contrast compares the mean of the dependent variable for level 1 of race with the mean of all of the subsequent levels of race (levels 2, 3 and 4), the second contrast compares the mean of the dependent variable for level 2 of race with the mean of all of the subsequent levels of race (levels 3 and 4), and the third contrast compares the mean of the dependent variable for level 3 of race with the mean of all of the subsequent levels of race (level 4). Like difference coding, this type of coding is most useful in situations where the levels of the categorical variable are ordered, say, from lowest to highest or smallest to largest.

Below we see an example of **regression** coding, and you can see that the coding is simply the mirror image of the difference coding. For the first comparison (comparing level 1 with levels 2, 3 and 4) the codes are 3/4 and -1/4 -1/4 -1/4. The second comparison compares level 2 with levels 3 and 4 and is coded 0 2/3 -1/3 -1/3. The third comparison compares levels 3 and 4 and is coded 0 0 1/2 -1/2.

HELMERT regression coding

Level of race | New variable 1 (x1): Level 1 v. Later | New variable 2 (x2): Level 2 v. Later | New variable 3 (x3): Level 3 v. Later |
---|---|---|---|
1 (Hispanic) | .75 | 0 | 0 |
2 (Asian) | -.25 | .666 | 0 |
3 (African American) | -.25 | -.333 | .5 |
4 (white) | -.25 | -.333 | -.5 |

For **contrast** coding, we see that the first comparison, comparing group 1 with groups 2, 3 and 4, is coded 1 -.333 -.333 -.333, reflecting the comparison of group 1 with all other groups. The second comparison is coded 0 1 -.5 -.5, reflecting that it compares group 2 with groups 3 and 4. The third comparison is coded 0 0 1 -1, reflecting that group 3 is compared to group 4.

HELMERT contrast coding

Level of race | New variable 1 (c1): Level 1 v. Later | New variable 2 (c2): Level 2 v. Later | New variable 3 (c3): Level 3 v. Later |
---|---|---|---|
1 (Hispanic) | 1 | 0 | 0 |
2 (Asian) | -.333 | 1 | 0 |
3 (African American) | -.333 | -.5 | 1 |
4 (white) | -.333 | -.5 | -1 |

In the above examples, both the regression coefficient for **x1** and the contrast estimate for **c1** would be the mean of **write** for level 1 (Hispanic) minus the mean of **write** for all subsequent levels (levels 2, 3 and 4). Likewise, the regression coefficient for **x2** and the contrast estimate for **c2** would be the mean of **write** for level 2 minus the mean of **write** for levels 3 and 4. Finally, the regression coefficient for **x3** and the contrast estimate for **c3** would be the mean of **write** for level 3 minus the mean of **write** for level 4.
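The Helmert contrast codes can be generated for any number of levels in the same spirit. In this Python sketch the function name `helmert_contrasts` is hypothetical, the mean of the "subsequent levels" is assumed to be the unweighted average of their level means, and the level means are copied from the means table later on this page.

```python
from fractions import Fraction


def helmert_contrasts(k):
    """Contrast j (1-based) compares level j with the mean of levels j+1..k."""
    rows = []
    for j in range(1, k):
        later = k - j  # number of subsequent levels
        rows.append(
            [Fraction(0)] * (j - 1) + [Fraction(1)] + [Fraction(-1, later)] * later
        )
    return rows


# Means of write for race levels 1-4 (from the means table on this page).
means = [Fraction("46.4583"), Fraction("58"), Fraction("48.2"), Fraction("54.0552")]

contrasts = helmert_contrasts(4)
estimates = [sum(w * m for w, m in zip(c, means)) for c in contrasts]

for c, e in zip(contrasts, estimates):
    print([str(w) for w in c], float(e))
```

Each row is one contrast (one column of the table above). Reading a row from right to left gives the corresponding difference contrast, which is the mirror-image relationship mentioned in the text.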

## ORTHOGONAL POLYNOMIAL CODING

Orthogonal polynomial coding is a form of trend analysis in that it looks for the linear, quadratic and cubic trends in the categorical variable. This type of coding system should be used only with an ordinal variable in which the levels are equally spaced. An example of such a variable might be income or education.


POLYNOMIAL

Level of race | Linear (x1) | Quadratic (x2) | Cubic (x3) |
---|---|---|---|
1 (Hispanic) | -.671 | .5 | -.224 |
2 (Asian) | -.224 | -.5 | .671 |
3 (African American) | .224 | -.5 | -.671 |
4 (white) | .671 | .5 | .224 |
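The values in the polynomial table can be reproduced, to three decimals, by orthogonalizing the powers of equally spaced level scores. The Python sketch below is one way to do that (Gram-Schmidt on 1, x, x squared and x cubed with x = 1, 2, 3, 4, scaled to unit length); it is not a description of how SPSS computes its codes, and the function name `orthogonal_polynomials` is hypothetical.

```python
import math


def orthogonal_polynomials(k):
    """Return k-1 unit-length polynomial contrasts for levels 1..k.

    Gram-Schmidt is applied to 1, x, x**2, ... so each contrast is
    orthogonal to the constant and to all lower-order contrasts.
    """
    xs = [float(i) for i in range(1, k + 1)]
    basis = [[1.0] * k]  # constant term, used only for orthogonalization
    for degree in range(1, k):
        v = [x ** degree for x in xs]
        for b in basis:
            proj = sum(vi * bi for vi, bi in zip(v, b)) / sum(bi * bi for bi in b)
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    return basis[1:]  # drop the constant


linear, quadratic, cubic = orthogonal_polynomials(4)
for name, row in (("linear", linear), ("quadratic", quadratic), ("cubic", cubic)):
    print(name, [round(v, 3) for v in row])
```

Because the construction assumes equally spaced scores, it also illustrates why this coding is recommended only for ordinal variables with equally spaced levels.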

## REPEATED EFFECT CODING

In this coding system, the mean of the dependent variable for one level of the categorical variable is compared to the mean of the dependent variable for the adjacent level. In our example below, the first comparison compares the mean of write for level 1 with the mean of write for level 2 of race (Hispanic minus Asian). The second comparison is the mean of write for level 2 minus the mean for level 3, and the third comparison is the mean of write for level 3 minus the mean for level 4. This type of coding may be useful with either a nominal or an ordinal variable.

Below we see an example of **regression** coding. For the first comparison, where the first and second levels are compared, **x1** is coded 3/4 for level 1 and -1/4 for the rest. For the second comparison, where level 2 is compared with level 3, **x2** is coded 1/2 1/2 -1/2 -1/2, and for the third comparison, where level 3 is compared with level 4, **x3** is coded 1/4 1/4 1/4 and -3/4.

REPEATED regression coding

Level of race | New variable 1 (x1): Level 1 v. Level 2 | New variable 2 (x2): Level 2 v. Level 3 | New variable 3 (x3): Level 3 v. Level 4 |
---|---|---|---|
1 (Hispanic) | .75 | .5 | .25 |
2 (Asian) | -.25 | .5 | .25 |
3 (African American) | -.25 | -.5 | .25 |
4 (white) | -.25 | -.5 | -.75 |

For **contrast** coding, the coding more naturally reflects the
comparisons being made. The first comparison is coded 1 -1 0 0 reflecting
that group 1 is compared to group 2. The second comparison is coded 0 1 -1
0 reflecting that group 2 is compared to group 3, and the third comparison is
coded 0 0 1 -1 reflecting that group 3 is compared with group 4.

REPEATED contrast coding

Level of race | New variable 1 (c1): Level 1 v. Level 2 | New variable 2 (c2): Level 2 v. Level 3 | New variable 3 (c3): Level 3 v. Level 4 |
---|---|---|---|
1 (Hispanic) | 1 | 0 | 0 |
2 (Asian) | -1 | 1 | 0 |
3 (African American) | 0 | -1 | 1 |
4 (white) | 0 | 0 | -1 |

In the above examples, both the regression coefficient for **x1** and the contrast estimate for **c1** would be the mean of **write** for level 1 (Hispanic) minus the mean of **write** for level 2 (Asian). Likewise, the regression coefficient for **x2** and the contrast estimate for **c2** would be the mean of **write** for level 2 (Asian) minus the mean of **write** for level 3 (African American), and the regression coefficient for **x3** and the contrast estimate for **c3** would be the mean of **write** for level 3 (African American) minus the mean of **write** for level 4 (white).
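One way to see why the repeated regression codes yield adjacent-level differences is to subtract each row of the coding table from the row above it. The Python sketch below does this with exact fractions; it assumes, as in any one-factor model, that each level mean equals the intercept plus that level's codes times the coefficients, so the intercept cancels in every difference.

```python
from fractions import Fraction as F

# Rows = race levels 1-4, columns = x1, x2, x3 (REPEATED regression coding).
repeated_regression = [
    [F(3, 4), F(1, 2), F(1, 4)],
    [F(-1, 4), F(1, 2), F(1, 4)],
    [F(-1, 4), F(-1, 2), F(1, 4)],
    [F(-1, 4), F(-1, 2), F(-3, 4)],
]

# Codes for level i minus codes for level i+1.
row_differences = [
    [a - b for a, b in zip(upper, lower)]
    for upper, lower in zip(repeated_regression, repeated_regression[1:])
]

# Each new variable should also sum to zero across the four levels.
column_sums = [sum(col) for col in zip(*repeated_regression)]

print(row_differences)
print(column_sums)
```

Every difference row contains a single 1 and zeros elsewhere, so the difference between the means of levels 1 and 2 is exactly the coefficient for x1, the difference between levels 2 and 3 is the coefficient for x2, and so on.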

## SYNTAX

For most coding systems, there are two ways to code categorical variables: manually coding them and having SPSS code them for you. There are benefits and drawbacks to both approaches. The benefit of manually coding variables is that you have absolute control over how they are coded. The drawback to this approach is that it is relatively easy to make an error in writing the syntax. An error may be difficult to find, particularly if the error is a logic error instead of a syntax error. (SPSS will give you an error message in the output window if there is a syntax error, but not if there is a logic error.) One way to avoid having an error in your syntax is to allow SPSS to code the variable(s) for you, but in doing so, you may have to give up some control over how the codes are assigned. Also, SPSS will not create certain kinds of codes for you, most notably dummy codes. Below we show two ways to create dummy codes and three ways to create each type of effect coding for our example using the four-level categorical variable race.

Before considering any analyses, let’s look at the mean of the dependent variable, write, for each level of race. This will help in interpreting the output from the analyses.

means tables = write by race.

Cases | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
---|---|---|---|---|---|---|
writing score * RACE | 200 | 100.0% | 0 | .0% | 200 | 100.0% |

RACE | Mean | N |
---|---|---|
hispanic | 46.4583 | 24 |
asian | 58.0000 | 11 |
african-amer | 48.2000 | 20 |
white | 54.0552 | 145 |
Total | 52.7750 | 200 |
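When reading the means table, it helps to keep two different "overall" means apart. The Python sketch below recomputes them from the rows of the table: the Total row is the mean over all 200 cases (a mean weighted by group size), whereas the unweighted mean of the four level means is a different number because the groups are very unequal in size. The dictionary names are hypothetical.

```python
# Values copied from the means table above.
group_means = {"hispanic": 46.4583, "asian": 58.0, "african-amer": 48.2, "white": 54.0552}
group_n = {"hispanic": 24, "asian": 11, "african-amer": 20, "white": 145}

total_n = sum(group_n.values())

# Mean over all cases: each level mean weighted by its number of cases.
weighted_mean = sum(group_means[g] * group_n[g] for g in group_means) / total_n

# Simple average of the four level means, ignoring group sizes.
unweighted_mean = sum(group_means.values()) / len(group_means)

print(total_n, round(weighted_mean, 4), round(unweighted_mean, 4))
```

The weighted mean reproduces the Total row up to rounding of the level means, while the unweighted mean is about one point lower.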

## DUMMY CODING

In Method 1, we create a new variable (i.e., x1) that is set equal to zero. Then we change the value of this new variable to equal one if the level in the original (categorical) variable is one. We repeat this process for each new variable that we need to create. In Method 2, we use a "do-loop" to generate the new variables, which can be useful if your categorical variable has a large number of levels.

    * Method 1 for creating dummy variables.
    compute x1 = 0.
    if race = 1 x1 = 1.
    compute x2 = 0.
    if race = 2 x2 = 1.
    compute x3 = 0.
    if race = 3 x3 = 1.
    execute.

    * Method 2 for creating dummy variables.
    do repeat A = x1 x2 x3 / B = 1 2 3.
    compute A = (race = B).
    end repeat.
    execute.

    regression /dep write /method = enter x1 x2 x3.

Model | Variables Entered | Variables Removed | Method |
---|---|---|---|
1 | X3, X2, X1(a) | . | Enter |

a All requested variables entered.

b Dependent Variable: writing score

The table above shows which variables were entered into the regression equation. It also indicates that the method used was "enter", as opposed to other possible methods that could have been specified, such as backward, forward or stepwise. The table also indicates that all of the variables listed on the /method= statement were entered into the regression equation.

Model Summary

Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
---|---|---|---|---|
1 | .327(a) | .107 | .093 | 9.02511 |

a Predictors: (Constant), X3, X2, X1

ANOVA

Model | | Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|---|
1 | Regression | 1914.158 | 3 | 638.053 | 7.833 | .000(a) |
 | Residual | 15964.717 | 196 | 81.453 | | |
 | Total | 17878.875 | 199 | | | |

a Predictors: (Constant), X3, X2, X1

b Dependent Variable: writing score

The table above entitled "Model Summary" indicates that one model was tested, that 10.7% of the variance in the dependent variable is accounted for by the independent variable, and that 9.3% of the variance of the dependent variable is accounted for by the independent variable when the number of independent variables in the equation is taken into consideration. The standard error of the estimate is also given. The table entitled "ANOVA" gives the sum of squares and the degrees of freedom (in the column labeled "df") for the regression, the residual and the total (regression plus residual). The mean square is given for the regression and the residual, and the F-value and the associated p-value (in the column labeled Sig.) are displayed. These results indicate that the regression is statistically significant at the .05 alpha level. As you will see, the overall test of race is the same regardless of the coding system used.
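The quantities in the two tables are linked by standard formulas, which the following Python sketch recomputes from the sums of squares and degrees of freedom printed in the ANOVA table. The variable names are hypothetical, and the results are compared only to the three decimals that SPSS displays.

```python
import math

# Values copied from the ANOVA table above.
ss_regression, ss_residual = 1914.158, 15964.717
df_regression, df_residual = 3, 196

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

r_square = ss_regression / ss_total
# Adjusted R square penalizes R square for the number of predictors.
adjusted_r_square = 1 - (1 - r_square) * df_total / df_residual
f_value = ms_regression / ms_residual
std_error_estimate = math.sqrt(ms_residual)

print(round(r_square, 3), round(adjusted_r_square, 3))
print(round(f_value, 3), round(std_error_estimate, 5))
```

Up to rounding, these match the R Square, Adjusted R Square, F and Std. Error of the Estimate values shown in the output.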

Coefficients

Model | | Unstandardized B | Unstandardized Std. Error | Standardized Beta | t | Sig. |
---|---|---|---|---|---|---|
1 | (Constant) | 54.055 | .749 | | 72.122 | .000 |
 | X1 | -7.597 | 1.989 | -.261 | -3.820 | .000 |
 | X2 | 3.945 | 2.823 | .095 | 1.398 | .164 |
 | X3 | -5.855 | 2.153 | -.186 | -2.720 | .007 |

a Dependent Variable: writing score

The table above gives the unstandardized coefficients for the regression equation (in the column labeled B) and the standard error (in the column labeled Std. Error). When using dummy coding, the constant is the mean of the omitted level of the categorical variable. The coefficient for x1 is the mean of the dependent variable for level 1 of race minus the mean of the dependent variable at level 4 of race (the reference level). Likewise, the coefficients for x2 and x3 are each the mean of the dependent variable at that level of race minus the mean of the dependent variable for the reference level. The standardized coefficients are given in the column labeled Beta. The t-values and associated p-values are also given. The statistical significance of the constant is rarely of interest to researchers. The coefficients for x1 and x3 are statistically significant at the .05 (and .01) alpha level, while the coefficient for x2 is not. This indicates that level 1 of race (Hispanic) is significantly different from level 4 (white), and that level 3 (African American) is significantly different from level 4 (white).
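The interpretation of the dummy-coded coefficients can be checked against the means table shown earlier. This Python sketch subtracts the reference-level mean from each of the other level means and compares the results with the B column; the variable names are hypothetical and agreement is only expected to three decimals because the printed means are rounded.

```python
# Means of write for race levels 1-4 (from the means table on this page).
means = {1: 46.4583, 2: 58.0, 3: 48.2, 4: 54.0552}
reference = 4

# B column from the coefficients table above.
reported = {"constant": 54.055, "x1": -7.597, "x2": 3.945, "x3": -5.855}

constant = means[reference]
differences = {
    f"x{level}": means[level] - means[reference]
    for level in means
    if level != reference
}

print(round(constant, 3), {k: round(v, 3) for k, v in differences.items()})
```

The constant reproduces the mean of the reference level, and each coefficient reproduces the corresponding difference in means.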


### EFFECT CODING

When doing any sort of effect coding, there are three approaches to the coding of the variables. The first approach is to manually compute them for use in OLS regression, which is shown in Method 1. You create a new variable, setting it equal to one of the values that it will assume, and then use "if" statements to change the value according to the values in the original (categorical) variable. If you use this approach, you can use either "regression" or "glm".

The second approach is to use "glm" with "/lmatrix" statements. You will need to use one "/lmatrix" statement for each contrast. Hence, in our example, because we have a four-level categorical variable, we will need to use three "/lmatrix" statements (all of which are part of the same "glm" command).

The third approach is to use "glm" and include a "/contrast () =" statement, placing the name of the categorical variable in the parentheses and the name of the contrast to be used after the equal sign.

Below are examples of all three approaches. In Method 3, we include a "/print" statement with the "test(lmatrix)" option so that SPSS prints out the coding system used for the contrasts. For the example using difference coding, we also include the "parameter" option on the print statement. This causes SPSS to print out the coding system used for the regression analysis as well as the results of the regression analysis. This illustrates how the two coding systems are different and shows that the results of the regression are the same as when dummy coding is used.

In the interest of conserving space, we include the output only for the third method of creating the codes. However, the output from the other methods will be very similar and will contain all of the same values for parameter estimates, tests of statistical significance, etc. We have interspersed explanations into the following output. For the other types of coding systems, we omit the output that is the same and only discuss the output that changes as a result of the different coding system used.

## SIMPLE EFFECT CODING

Method 1:

    if race = 1 x1 = .75.
    if any(race,2,3,4) x1 = -.25.
    if race = 2 x2 = .75.
    if any(race,1,3,4) x2 = -.25.
    if race = 3 x3 = .75.
    if any(race,1,2,4) x3 = -.25.
    execute.

    regression /dependent = write /method = enter x1 x2 x3.

**< output omitted >**

Method 2:

    glm write by race
      /lmatrix "group 1 versus group 4" race 1 0 0 -1
      /lmatrix "group 2 versus group 4" race 0 1 0 -1
      /lmatrix "group 3 versus group 4" race 0 0 1 -1.

**< output omitted >**

Method 3:

    glm write by race
      /contrast (race)=simple
      /print = parameter test(lmatrix).

Between-Subjects Factors

 | | Value Label | N |
---|---|---|---|
RACE | 1.00 | hispanic | 24 |
 | 2.00 | asian | 11 |
 | 3.00 | african-amer | 20 |
 | 4.00 | white | 145 |

Tests of Between-Subjects Effects

Source | Type III Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|
Corrected Model | 1914.158(a) | 3 | 638.053 | 7.833 | .000 |
Intercept | 225523.580 | 1 | 225523.580 | 2768.770 | .000 |
RACE | 1914.158 | 3 | 638.053 | 7.833 | .000 |
Error | 15964.717 | 196 | 81.453 | | |
Total | 574919.000 | 200 | | | |
Corrected Total | 17878.875 | 199 | | | |

a R Squared = .107 (Adjusted R Squared = .093)

The table above entitled "Between-Subjects Factors" shows the levels of the categorical variable, the value label associated with each level (if any) and the number of observations at each level (in the column N). The table entitled "Tests of Between-Subjects Effects" shows the source, the type III sums of squares, the degrees of freedom (called "df"), the mean square, the F values and the corresponding p-values. The F-value for the corrected model of 7.833 and its p-value of .000 indicate that the overall model is statistically significant. The F- and p-values for race are the same because in this model, we have only one independent variable. If we had more than one independent variable, the F- and p-values for the overall model would be different from those for the independent variables. The F- and p-values for the intercept are also statistically significant, but those are rarely of interest.

Parameter Estimates

Parameter | B | Std. Error | t | Sig. | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|
Intercept | 54.055 | .749 | 72.122 | .000 | 52.577 | 55.533 |
[RACE=1.00] | -7.597 | 1.989 | -3.820 | .000 | -11.519 | -3.675 |
[RACE=2.00] | 3.945 | 2.823 | 1.398 | .164 | -1.622 | 9.511 |
[RACE=3.00] | -5.855 | 2.153 | -2.720 | .007 | -10.101 | -1.610 |
[RACE=4.00] | 0(a) | . | . | . | . | . |

a This parameter is set to zero because it is redundant.

The table above entitled "Parameter Estimates" gives the coefficients (in the column labeled B), the associated standard errors (in the column labeled Std. Error), the associated t-values, the associated p-values (in the column labeled Sig.), and the lower and upper bounds for the 95% confidence interval.

For our example, the regression equation would be: y = 54.055 - 7.597*x1 + 3.945*x2 - 5.855*x3. All of the coefficients are statistically significant at the .05 alpha level except the one for x2. In other words, the means of the dependent variable (write) for levels 1 and 3 of race are each statistically significantly different from the mean of the dependent variable for level 4 (the omitted level), but the mean for level 2 is not. Furthermore, the 95% confidence interval for the coefficient for x1 runs from -11.519 to -3.675. Likewise, the 95% confidence interval for the coefficient for x2 runs from -1.622 to 9.511, and so on.
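The confidence limits in the Parameter Estimates table follow the usual pattern of coefficient plus or minus a critical t value times the standard error. The Python sketch below checks this against the printed output. The critical value of 1.972 for 196 error degrees of freedom is an approximation supplied for this example (it is not printed by SPSS), so agreement is only expected to about two decimals.

```python
# Approximate two-sided 95% critical t value for 196 df (an assumption here).
T_CRITICAL = 1.972

# (B, Std. Error) pairs copied from the Parameter Estimates table.
coefficients = {
    "x1": (-7.597, 1.989),
    "x2": (3.945, 2.823),
    "x3": (-5.855, 2.153),
}

# Lower and upper bounds as printed by SPSS.
reported = {
    "x1": (-11.519, -3.675),
    "x2": (-1.622, 9.511),
    "x3": (-10.101, -1.610),
}

intervals = {
    name: (b - T_CRITICAL * se, b + T_CRITICAL * se)
    for name, (b, se) in coefficients.items()
}

for name, (low, high) in intervals.items():
    print(name, round(low, 3), round(high, 3), reported[name])
```

An interval that excludes zero, as for x1 and x3, corresponds to a coefficient that is significant at the .05 level; the interval for x2 includes zero.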

You will notice that the values given in the "ANOVA" and "Coefficients" tables in the section on dummy coding are the same as the values given in the "Tests of Between-Subjects Effects" and "Parameter Estimates" tables. This is because, as mentioned previously, dummy coding and simple effect coding yield the same results when the same reference level is used in both coding systems.

**Intercept**

| Parameter | Contrast L1 |
|---|---|
| Intercept | 1.000 |
| [RACE=1.00] | .250 |
| [RACE=2.00] | .250 |
| [RACE=3.00] | .250 |
| [RACE=4.00] | .250 |

The default display of this matrix is the transpose of the corresponding L matrix. Based on Type III Sums of Squares.

**Race**

| Parameter | Contrast L2 | Contrast L3 | Contrast L4 |
|---|---|---|---|
| Intercept | 0 | 0 | 0 |
| [RACE=1.00] | 1 | 0 | 0 |
| [RACE=2.00] | 0 | 1 | 0 |
| [RACE=3.00] | 0 | 0 | 1 |
| [RACE=4.00] | -1 | -1 | -1 |

The default display of this matrix is the transpose of the corresponding L matrix. Based on Type III Sums of Squares.

The table above entitled "Intercept" shows the coding that SPSS used for the intercept. Each level of race was given an equal value (.250), and the sum of those is the intercept (1.000). The table entitled "Race" shows the coding for race that was used in the calculations regarding the regression above. Notice that it is simple effect coding, but that the same results would have been obtained using dummy coding. In this instance, the only difference between simple effect coding and dummy coding is the value assigned to the reference level (race = 4). Because it is the reference level, the only important point is that it have the same value in each of the new variables (called L2, L3 and L4). What that value is, either negative one in simple effect coding or zero in dummy coding, is irrelevant. Regardless of the coding system requested, SPSS will calculate the regression using simple effect coding. The coding system you specify on the /contrast= statement will be used only in calculating the contrast estimates.

**Contrast Coefficients (L' Matrix): RACE Simple Contrast(a)**

| Parameter | Level 1 vs. Level 4 | Level 2 vs. Level 4 | Level 3 vs. Level 4 |
|---|---|---|---|
| Intercept | 0 | 0 | 0 |
| [RACE=1.00] | 1 | 0 | 0 |
| [RACE=2.00] | 0 | 1 | 0 |
| [RACE=3.00] | 0 | 0 | 1 |
| [RACE=4.00] | -1 | -1 | -1 |

The default display of this matrix is the transpose of the corresponding L matrix.

a. Reference category = 4

**Contrast Results (K Matrix)**

| RACE Simple Contrast(a) | | Dependent Variable: writing score |
|---|---|---|
| Level 1 vs. Level 4 | Contrast Estimate | -7.597 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | -7.597 |
| | Std. Error | 1.989 |
| | Sig. | .000 |
| | 95% Confidence Interval for Difference: Lower Bound | -11.519 |
| | 95% Confidence Interval for Difference: Upper Bound | -3.675 |
| Level 2 vs. Level 4 | Contrast Estimate | 3.945 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | 3.945 |
| | Std. Error | 2.823 |
| | Sig. | .164 |
| | 95% Confidence Interval for Difference: Lower Bound | -1.622 |
| | 95% Confidence Interval for Difference: Upper Bound | 9.511 |
| Level 3 vs. Level 4 | Contrast Estimate | -5.855 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | -5.855 |
| | Std. Error | 2.153 |
| | Sig. | .007 |
| | 95% Confidence Interval for Difference: Lower Bound | -10.101 |
| | 95% Confidence Interval for Difference: Upper Bound | -1.610 |

a. Reference category = 4

**Test Results**

| Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
| Error | 15964.717 | 196 | 81.453 | | |

The table above entitled "Contrast Coefficients (L' Matrix)" shows the coding scheme that was used for each comparison. The table entitled "Contrast Results (K Matrix)" shows the results of the various contrasts. In our example, the difference between level 1 of race and level 4 of race is statistically significant. You will notice that the contrast estimate is the mean of the dependent variable for the first level minus the mean of the dependent variable for the reference level (level 4). In other words, 46.4583 - 54.0552 = -7.597. The hypothesized value is zero (and is zero for all contrast tests). This means that the null hypothesis is that the contrast equals zero, which is almost always the null hypothesis in which researchers are interested. The row labeled Difference (Estimate - Hypothesized) gives the difference between the contrast estimate and the hypothesized value. Because the hypothesized value is always zero here, the contrast estimate and the difference between the contrast estimate and the hypothesized value are the same. Therefore, you can refer to either the contrast estimate or the difference as being statistically significant or not.

In our example, the difference between level 2 of race and level 4 of race is not statistically significant, and the difference between level 3 of race and level 4 of race is statistically significant. You will notice that the values given in this table are the same as those given in the "Parameter Estimates" table. This is because both used the same coding system and the same reference level. If a different coding system had been requested on the /contrast= statement, or if a different reference level had been specified, the two tables would not have the same numbers.

The table entitled "Test Results" shows the test of all of the contrasts taken together, and it indicates that the effect of race is statistically significant. The results of this test are identical to the overall test of race because there are no other independent variables in the model. If there were, the results of the two tests would be different from one another.
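The simple contrast estimates can be reproduced by hand from the group means. The Python sketch below is only an arithmetic check (the names are ours); it subtracts the mean of the reference level from the mean of each other level.

```python
# Group means of write by race, taken from the "means" output.
means = {1: 46.4583, 2: 58.0, 3: 48.2, 4: 54.0552}
reference = 4  # reference level used by the simple contrasts

# Each simple contrast is (mean of level k) - (mean of the reference level).
simple = {
    f"Level {k} vs. Level {reference}": means[k] - means[reference]
    for k in means
    if k != reference
}

for label, estimate in simple.items():
    print(f"{label}: {estimate:.3f}")
```

The three values should match the Contrast Estimate rows of the "Contrast Results (K Matrix)" table up to rounding.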

## DEVIATION CODING

Method 1:

if race = 1 x1 = 1. if any(race,2,3) x1 = 0. if race = 4 x1 = -1. if race = 2 x2 = 1. if any(race,1,3) x2 = 0. if race = 4 x2 = -1. if race = 3 x3 = 1. if any(race,1,2) x3 = 0. if race = 4 x3 = -1. execute. regression /dep write /method = enter x1 x2 x3.

**< output omitted >**

Method 2:

glm write by race /lmatrix "group 1 versus groups 2 3 and 4" race .75 -.25 -.25 -.25 /lmatrix "group 2 versus groups 1 3 and 4" race -.25 .75 -.25 -.25 /lmatrix "group 3 versus groups 1 2 and 4" race -.25 -.25 .75 -.25.

**< output omitted >**

Method 3:

glm write by race /contrast (race)=deviation /print = parameter test(lmatrix).

**Between-Subjects Factors**

| | | Value Label | N |
|---|---|---|---|
| RACE | 1.00 | hispanic | 24 |
| | 2.00 | asian | 11 |
| | 3.00 | african-amer | 20 |
| | 4.00 | white | 145 |

**Tests of Between-Subjects Effects**

| Source | Type III Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Corrected Model | 1914.158(a) | 3 | 638.053 | 7.833 | .000 |
| Intercept | 225523.580 | 1 | 225523.580 | 2768.770 | .000 |
| RACE | 1914.158 | 3 | 638.053 | 7.833 | .000 |
| Error | 15964.717 | 196 | 81.453 | | |
| Total | 574919.000 | 200 | | | |
| Corrected Total | 17878.875 | 199 | | | |

a. R Squared = .107 (Adjusted R Squared = .093)

**Parameter Estimates**

| Parameter | B | Std. Error | t | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|
| Intercept | 54.055 | .749 | 72.122 | .000 | 52.577 | 55.533 |
| [RACE=1.00] | -7.597 | 1.989 | -3.820 | .000 | -11.519 | -3.675 |
| [RACE=2.00] | 3.945 | 2.823 | 1.398 | .164 | -1.622 | 9.511 |
| [RACE=3.00] | -5.855 | 2.153 | -2.720 | .007 | -10.101 | -1.610 |
| [RACE=4.00] | 0(a) | . | . | . | . | . |

a. This parameter is set to zero because it is redundant.

**Intercept**

| Parameter | Contrast L1 |
|---|---|
| Intercept | 1.000 |
| [RACE=1.00] | .250 |
| [RACE=2.00] | .250 |
| [RACE=3.00] | .250 |
| [RACE=4.00] | .250 |

The default display of this matrix is the transpose of the corresponding L matrix. Based on Type III Sums of Squares.

**Race**

| Parameter | Contrast L2 | Contrast L3 | Contrast L4 |
|---|---|---|---|
| Intercept | 0 | 0 | 0 |
| [RACE=1.00] | 1 | 0 | 0 |
| [RACE=2.00] | 0 | 1 | 0 |
| [RACE=3.00] | 0 | 0 | 1 |
| [RACE=4.00] | -1 | -1 | -1 |

The default display of this matrix is the transpose of the corresponding L matrix. Based on Type III Sums of Squares.

**Contrast Coefficients (L' Matrix): RACE Deviation Contrast(a)**

| Parameter | Level 1 vs. Mean | Level 2 vs. Mean | Level 3 vs. Mean |
|---|---|---|---|
| Intercept | .000 | .000 | .000 |
| [RACE=1.00] | .750 | -.250 | -.250 |
| [RACE=2.00] | -.250 | .750 | -.250 |
| [RACE=3.00] | -.250 | -.250 | .750 |
| [RACE=4.00] | -.250 | -.250 | -.250 |

The default display of this matrix is the transpose of the corresponding L matrix.

a. Omitted category = 4

**Contrast Results (K Matrix)**

| RACE Deviation Contrast(a) | | Dependent Variable: writing score |
|---|---|---|
| Level 1 vs. Mean | Contrast Estimate | -5.220 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | -5.220 |
| | Std. Error | 1.631 |
| | Sig. | .002 |
| | 95% Confidence Interval for Difference: Lower Bound | -8.437 |
| | 95% Confidence Interval for Difference: Upper Bound | -2.003 |
| Level 2 vs. Mean | Contrast Estimate | 6.322 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | 6.322 |
| | Std. Error | 2.160 |
| | Sig. | .004 |
| | 95% Confidence Interval for Difference: Lower Bound | 2.061 |
| | 95% Confidence Interval for Difference: Upper Bound | 10.582 |
| Level 3 vs. Mean | Contrast Estimate | -3.478 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | -3.478 |
| | Std. Error | 1.732 |
| | Sig. | .046 |
| | 95% Confidence Interval for Difference: Lower Bound | -6.895 |
| | 95% Confidence Interval for Difference: Upper Bound | -6.203E-02 |

a. Omitted category = 4

**Test Results**

| Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
| Error | 15964.717 | 196 | 81.453 | | |

Notice the two different coding systems that are presented in this output.
In the table entitled "Race", you see the coding system that was used
to calculate the regression. In the table entitled "Contrast Coefficients (L' Matrix)",
you see the coding system that was used to calculate the contrast
coefficients. It is important to understand why two different coding
systems are displayed in the output and to which analysis they refer. From
now on, we will not include the "parameter" option on the print
statement so that the results of the regression analysis will not be
shown. These results would be the same for each example below.

The contrast estimates in the table entitled "Contrast Results (K Matrix)"
are the mean of the particular level minus the grand (unweighted) mean.
This grand mean is not the mean of the dependent variable that is listed in the
output of the "means" command above. Rather, it is the mean of the means of the
dependent variable at each level of the categorical variable: (46.4583 +
58 + 48.2 + 54.0552) / 4 = 51.678375. The contrast estimate for level 1
versus the mean is then 46.4583 - 51.678375 = -5.220.
The difference between this value and zero (the null hypothesis that the
contrast is zero) is statistically significant (p = .002). The
contrast estimates for the other comparisons are calculated in the same
manner. As with the output of the code using simple effect coding, the
table "Test Results" shows the test of all of the contrasts taken
together. As expected, the values in this table are the same as those
shown previously.
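The deviation contrasts can be checked the same way. This Python sketch (an arithmetic check only, with names of our choosing) computes the unweighted grand mean as the mean of the four group means and subtracts it from the mean of each of the first three levels.

```python
# Group means of write by race, taken from the "means" output.
means = {1: 46.4583, 2: 58.0, 3: 48.2, 4: 54.0552}

# Unweighted grand mean: the mean of the group means, ignoring group sizes.
grand_mean = sum(means.values()) / len(means)

# Deviation contrasts for levels 1 to 3 (level 4 is the omitted category).
deviation = {k: means[k] - grand_mean for k in (1, 2, 3)}

print(f"grand mean = {grand_mean:.6f}")
for level, estimate in deviation.items():
    print(f"Level {level} vs. Mean: {estimate:.3f}")
```

The results should agree with the contrast estimates -5.220, 6.322 and -3.478 up to rounding.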

## DIFFERENCE CODING

Method 1:

if race = 1 x1 = -.5. if race = 2 x1 = .5. if any(race,3,4) x1 = 0. if any(race,1,2) x2 = -.333. if race = 3 x2 = .667. if race = 4 x2 = 0. if any(race,1,2,3) x3 = -.25. if race = 4 x3 = .75. execute. regression /dep write /method = enter x1 x2 x3.

**< output omitted >**

Method 2:

glm write by race /lmatrix "group 2 versus group 1" race -1 1 0 0 /lmatrix "group 3 versus groups 1 and 2" race -.5 -.5 1 0 /lmatrix "group 4 versus groups 1 2 and 3" race -1/3 -1/3 -1/3 1.

**< output omitted >**

Method 3:

glm write by race /contrast (race)=difference /print = test(lmatrix).

**< some output omitted >**

**Contrast Coefficients (L' Matrix): RACE Difference Contrast**

| Parameter | Level 2 vs. Level 1 | Level 3 vs. Previous | Level 4 vs. Previous |
|---|---|---|---|
| Intercept | .000 | .000 | .000 |
| [RACE=1.00] | -1.000 | -.500 | -.333 |
| [RACE=2.00] | 1.000 | -.500 | -.333 |
| [RACE=3.00] | .000 | 1.000 | -.333 |
| [RACE=4.00] | .000 | .000 | 1.000 |

The default display of this matrix is the transpose of the corresponding L matrix.

**Contrast Results (K Matrix)**

| RACE Difference Contrast | | Dependent Variable: writing score |
|---|---|---|
| Level 2 vs. Level 1 | Contrast Estimate | 11.542 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | 11.542 |
| | Std. Error | 3.286 |
| | Sig. | .001 |
| | 95% Confidence Interval for Difference: Lower Bound | 5.061 |
| | 95% Confidence Interval for Difference: Upper Bound | 18.022 |
| Level 3 vs. Previous | Contrast Estimate | -4.029 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | -4.029 |
| | Std. Error | 2.602 |
| | Sig. | .123 |
| | 95% Confidence Interval for Difference: Lower Bound | -9.161 |
| | 95% Confidence Interval for Difference: Upper Bound | 1.103 |
| Level 4 vs. Previous | Contrast Estimate | 3.169 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | 3.169 |
| | Std. Error | 1.488 |
| | Sig. | .034 |
| | 95% Confidence Interval for Difference: Lower Bound | .235 |
| | 95% Confidence Interval for Difference: Upper Bound | 6.104 |

**Test Results**

| Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
| Error | 15964.717 | 196 | 81.453 | | |

The contrast estimate for the first comparison shown in this output was
calculated by subtracting the mean of the dependent variable for level 1 of the
categorical variable from the mean of the dependent variable for level 2:
58 - 46.4583 = 11.542. This result is statistically significant. The
contrast estimate for the second comparison (between level 3 and the previous
levels) was calculated by subtracting the mean of the dependent variable for
levels 1 and 2 from that of level 3: 48.2 - [(46.4583 + 58) / 2] =
-4.029. This result is not statistically significant, meaning that there
is not a reliable difference between the mean of write for level 3 of race
compared to the mean of write for levels 1 and 2 (Hispanics and Asians).
As noted above, this type of coding system does not make much sense for a
nominal variable such as race. For the comparison of level 4 and the
previous levels, you take the mean of the dependent variable for those
levels and subtract it from the mean of the dependent variable for level
4: 54.0552 - [(46.4583 + 58 + 48.2) / 3] = 3.169. This result is
statistically significant.
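The three difference contrasts follow one rule: the mean of a level minus the unweighted mean of all previous levels. The Python sketch below (ours, for checking the arithmetic only) applies that rule to the group means.

```python
# Group means of write for race = 1, 2, 3, 4, taken from the "means" output.
means = [46.4583, 58.0, 48.2, 54.0552]

# Difference coding: each level (from the second on) is compared with the
# unweighted mean of all levels that come before it.
difference = []
for i in range(1, len(means)):
    previous_mean = sum(means[:i]) / i
    difference.append(means[i] - previous_mean)

labels = ["Level 2 vs. Level 1", "Level 3 vs. Previous", "Level 4 vs. Previous"]
for label, estimate in zip(labels, difference):
    print(f"{label}: {estimate:.3f}")
```

The output should agree with the contrast estimates 11.542, -4.029 and 3.169 up to rounding.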

Note the use of fractions on the "/lmatrix" statement in Method
2. As mentioned above, you need to use numbers that sum to zero, such as
1/3 + 1/3 + 1/3 - 1. You cannot use .333 instead of 1/3: SPSS will
give an error message and fail to calculate the contrast coefficient. The
problem is that .333 + .333 + .333 - 1 is not sufficiently close to
zero.
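The size of the problem is easy to see outside SPSS. The Python sketch below only illustrates the arithmetic; it makes no claim about the exact tolerance SPSS applies. It sums the rounded weights and, for comparison, the exact thirds.

```python
from fractions import Fraction

# Weights for "group 4 versus groups 1 2 and 3", written two ways.
rounded = [-0.333, -0.333, -0.333, 1.0]         # .333 typed in place of 1/3
exact = [Fraction(-1, 3)] * 3 + [Fraction(1)]   # exact thirds

rounded_sum = sum(rounded)   # off from zero by about .001
exact_sum = sum(exact)       # exactly zero

print(rounded_sum)
print(exact_sum)
```

The rounded weights miss zero by about .001, while the exact fractions sum to zero, which is why the fractions are written out on the /lmatrix statement.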

## HELMERT CODING

Method 1:

if race = 1 x1 = .75. if any(race,2,3,4) x1 = -.25. if race = 1 x2 = 0. if race = 2 x2 = .667. if any(race,3,4) x2 = -.333. if any(race,1,2) x3 = 0. if race = 3 x3 = .5. if race = 4 x3 = -.5. execute. regression /dep write /method = enter x1 x2 x3.

**< output omitted >**

Method 2:

glm write by race /lmatrix "group 1 versus groups 2 3 and 4" race 1 -1/3 -1/3 -1/3 /lmatrix "group 2 versus groups 3 and 4" race 0 1 -.5 -.5 /lmatrix "group 3 versus group 4" race 0 0 1 -1.

**< output omitted >**

Method 3:

glm write by race /contrast (race)=helmert /print = test(lmatrix).

**< some output omitted >**

**Contrast Coefficients (L' Matrix): RACE Helmert Contrast**

| Parameter | Level 1 vs. Later | Level 2 vs. Later | Level 3 vs. Level 4 |
|---|---|---|---|
| Intercept | .000 | .000 | .000 |
| [RACE=1.00] | 1.000 | .000 | .000 |
| [RACE=2.00] | -.333 | 1.000 | .000 |
| [RACE=3.00] | -.333 | -.500 | 1.000 |
| [RACE=4.00] | -.333 | -.500 | -1.000 |

The default display of this matrix is the transpose of the corresponding L matrix.

**Contrast Results (K Matrix)**

| RACE Helmert Contrast | | Dependent Variable: writing score |
|---|---|---|
| Level 1 vs. Later | Contrast Estimate | -6.960 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | -6.960 |
| | Std. Error | 2.175 |
| | Sig. | .002 |
| | 95% Confidence Interval for Difference: Lower Bound | -11.250 |
| | 95% Confidence Interval for Difference: Upper Bound | -2.670 |
| Level 2 vs. Later | Contrast Estimate | 6.872 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | 6.872 |
| | Std. Error | 2.926 |
| | Sig. | .020 |
| | 95% Confidence Interval for Difference: Lower Bound | 1.101 |
| | 95% Confidence Interval for Difference: Upper Bound | 12.644 |
| Level 3 vs. Level 4 | Contrast Estimate | -5.855 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | -5.855 |
| | Std. Error | 2.153 |
| | Sig. | .007 |
| | 95% Confidence Interval for Difference: Lower Bound | -10.101 |
| | 95% Confidence Interval for Difference: Upper Bound | -1.610 |

**Test Results**

| Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
| Error | 15964.717 | 196 | 81.453 | | |

The contrast estimate for the comparison between level 1 and the remaining levels (called "later" in the output) is calculated by subtracting the mean of the dependent variable for levels 2, 3 and 4 from the mean of the dependent variable for level 1: 46.4583 - [(58 + 48.2 + 54.0552) / 3] = -6.960, which is statistically significant. This means that the mean of write for level 1 of race is statistically significantly different from the mean of write for levels 2 through 4. As noted above, this comparison probably is not meaningful because the variable race is nominal. This type of comparison would be more meaningful if the categorical variable were ordinal. To calculate the contrast estimate for the comparison between level 2 and the later levels, you subtract the mean of the dependent variable for levels 3 and 4 from the mean of the dependent variable for level 2: 58 - [(48.2 + 54.0552) / 2] = 6.872, which is statistically significant. The contrast estimate for the comparison between level 3 and level 4 is the difference between the mean of the dependent variable for the two levels: 48.2 - 54.0552 = -5.855, which is also statistically significant.
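Helmert coding mirrors difference coding: each level is compared with the unweighted mean of the levels that follow it. The Python sketch below (an arithmetic check, not SPSS syntax) reproduces the three contrast estimates from the group means.

```python
# Group means of write for race = 1, 2, 3, 4, taken from the "means" output.
means = [46.4583, 58.0, 48.2, 54.0552]

# Helmert coding: each level (except the last) is compared with the
# unweighted mean of all later levels.
helmert = []
for i in range(len(means) - 1):
    later = means[i + 1:]
    helmert.append(means[i] - sum(later) / len(later))

labels = ["Level 1 vs. Later", "Level 2 vs. Later", "Level 3 vs. Level 4"]
for label, estimate in zip(labels, helmert):
    print(f"{label}: {estimate:.3f}")
```

The values should match the Contrast Estimate rows of the "Contrast Results (K Matrix)" table (-6.960, 6.872 and -5.855) up to rounding.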

## ORTHOGONAL POLYNOMIAL CODING

Method 1:

if race = 1 x1 = -.671. if race = 2 x1 = -.224. if race = 3 x1 = .224. if race = 4 x1 = .671. if race = 1 x2 = .5. if race = 2 x2 = -.5. if race = 3 x2 = -.5. if race = 4 x2 = .5. if race = 1 x3 = -.224. if race = 2 x3 = .671. if race = 3 x3 = -.671. if race = 4 x3 = .224. execute. regression /dep write /method = enter x1 x2 x3.

**< output omitted >**

Method 2:

glm write by race /lmatrix "linear" race -.671 -.224 .224 .671 /lmatrix "quadratic" race .5 -.5 -.5 .5 /lmatrix "cubic" race -.224 .671 -.671 .224.

**< output omitted >**

Method 3:

glm write by race /contrast (race)=polynomial /print = test(lmatrix).

**< some output omitted >**

**Contrast Coefficients (L' Matrix): RACE Polynomial Contrast(a)**

| Parameter | Linear | Quadratic | Cubic |
|---|---|---|---|
| Intercept | .000 | .000 | .000 |
| [RACE=1.00] | -.671 | .500 | -.224 |
| [RACE=2.00] | -.224 | -.500 | .671 |
| [RACE=3.00] | .224 | -.500 | -.671 |
| [RACE=4.00] | .671 | .500 | .224 |

The default display of this matrix is the transpose of the corresponding L matrix.

a. Metric = 1.000, 2.000, 3.000, 4.000

**Contrast Results (K Matrix)**

| RACE Polynomial Contrast(a) | | Dependent Variable: writing score |
|---|---|---|
| Linear | Contrast Estimate | 2.905 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | 2.905 |
| | Std. Error | 1.534 |
| | Sig. | .060 |
| | 95% Confidence Interval for Difference: Lower Bound | -.121 |
| | 95% Confidence Interval for Difference: Upper Bound | 5.931 |
| Quadratic | Contrast Estimate | -2.843 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | -2.843 |
| | Std. Error | 1.964 |
| | Sig. | .149 |
| | 95% Confidence Interval for Difference: Lower Bound | -6.717 |
| | 95% Confidence Interval for Difference: Upper Bound | 1.031 |
| Cubic | Contrast Estimate | 8.273 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | 8.273 |
| | Std. Error | 2.316 |
| | Sig. | .000 |
| | 95% Confidence Interval for Difference: Lower Bound | 3.706 |
| | 95% Confidence Interval for Difference: Upper Bound | 12.840 |

a. Metric = 1.000, 2.000, 3.000, 4.000

**Test Results**

| Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
| Error | 15964.717 | 196 | 81.453 | | |

To calculate the contrast estimates for these comparisons, you need to multiply the code used in the new variable by the mean for the dependent variable for each level of the categorical variable, and then sum the values. For example, the code used in x1 for level 1 of race is -.671 and the mean of write for level 1 is 46.4583. Hence, you would multiply -.671 and 46.4583 and add that to the product of the code for level 2 of x1 and its mean, and so on. To obtain the contrast estimate for the linear contrast, you would do the following: -.671*46.4583 + -.224*58 + .224*48.2 + .671*54.0552 = 2.905 (with rounding error). This result is not statistically significant at the .05 alpha level, but it is close. The quadratic component is also not statistically significant, but the cubic one is. This suggests that, if the mean of the dependent variable were plotted against race, the line would tend to have two bends. As noted earlier, this type of coding system does not make much sense with a nominal variable such as race.
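The "rounding error" comes from using codes rounded to three decimals. The Python sketch below assumes the unrounded codes are the usual orthogonal polynomial weights for four equally spaced levels, scaled to unit length: (-3, -1, 1, 3)/sqrt(20), (1, -1, -1, 1)/2 and (-1, 3, -3, 1)/sqrt(20). These round to the .671, .224 and .5 values shown above, but the exact form is our assumption, not something printed in the SPSS output.

```python
import math

# Group means of write for race = 1, 2, 3, 4, taken from the "means" output.
means = [46.4583, 58.0, 48.2, 54.0552]

# Assumed unrounded orthogonal polynomial weights for four equally spaced levels.
s = math.sqrt(20)
weights = {
    "Linear": [-3 / s, -1 / s, 1 / s, 3 / s],
    "Quadratic": [0.5, -0.5, -0.5, 0.5],
    "Cubic": [-1 / s, 3 / s, -3 / s, 1 / s],
}

# Each contrast estimate is the sum of weight * group mean.
estimates = {
    name: sum(w * m for w, m in zip(ws, means)) for name, ws in weights.items()
}

# The same linear contrast computed with the three-decimal codes.
linear_rounded = sum(w * m for w, m in zip([-0.671, -0.224, 0.224, 0.671], means))

for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
print(f"Linear with rounded codes: {linear_rounded:.3f}")
```

With the unrounded weights the estimates agree with the table (2.905, -2.843 and 8.273); with the three-decimal codes the linear estimate comes out slightly lower, at about 2.902.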

## REPEATED EFFECT CODING

Method 1:

if race = 1 x1 = .75. if any(race,2,3,4) x1 = -.25. if any(race,1,2) x2 = .5. if any(race,3,4) x2 = -.5. if any(race,1,2,3) x3 = .25. if race = 4 x3 = -.75. execute. regression /dep write /method = enter x1 x2 x3.

**< output omitted >**

Method 2:

glm write by race /lmatrix "group 1 versus group 2" race 1 -1 0 0 /lmatrix "group 2 versus group 3" race 0 1 -1 0 /lmatrix "group 3 versus group 4" race 0 0 1 -1.

**< output omitted >**

Method 3:

glm write by race /contrast (race)=repeated /print = test(lmatrix).

**< some output omitted >**

**Contrast Coefficients (L' Matrix): RACE Repeated Contrast**

| Parameter | Level 1 vs. Level 2 | Level 2 vs. Level 3 | Level 3 vs. Level 4 |
|---|---|---|---|
| Intercept | 0 | 0 | 0 |
| [RACE=1.00] | 1 | 0 | 0 |
| [RACE=2.00] | -1 | 1 | 0 |
| [RACE=3.00] | 0 | -1 | 1 |
| [RACE=4.00] | 0 | 0 | -1 |

The default display of this matrix is the transpose of the corresponding L matrix.

**Contrast Results (K Matrix)**

| RACE Repeated Contrast | | Dependent Variable: writing score |
|---|---|---|
| Level 1 vs. Level 2 | Contrast Estimate | -11.542 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | -11.542 |
| | Std. Error | 3.286 |
| | Sig. | .001 |
| | 95% Confidence Interval for Difference: Lower Bound | -18.022 |
| | 95% Confidence Interval for Difference: Upper Bound | -5.061 |
| Level 2 vs. Level 3 | Contrast Estimate | 9.800 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | 9.800 |
| | Std. Error | 3.388 |
| | Sig. | .004 |
| | 95% Confidence Interval for Difference: Lower Bound | 3.119 |
| | 95% Confidence Interval for Difference: Upper Bound | 16.481 |
| Level 3 vs. Level 4 | Contrast Estimate | -5.855 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | -5.855 |
| | Std. Error | 2.153 |
| | Sig. | .007 |
| | 95% Confidence Interval for Difference: Lower Bound | -10.101 |
| | 95% Confidence Interval for Difference: Upper Bound | -1.610 |

**Test Results**

| Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
| Error | 15964.717 | 196 | 81.453 | | |

With this coding system, adjacent levels of the categorical variable are compared. Hence, the mean of the dependent variable at level 1 is compared to the mean of the dependent variable at level 2: 46.4583 - 58 = -11.542, which is statistically significant. For the comparison between levels 2 and 3, the contrast estimate is 58 - 48.2 = 9.8, which is also statistically significant. Finally, comparing levels 3 and 4, 48.2 - 54.0552 = -5.855, a statistically significant difference. One would conclude from this that each pair of adjacent levels of race differs significantly.
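The adjacent-level comparisons can be checked with a few lines of Python (an arithmetic check only; the names are ours): each estimate is the mean of one level minus the mean of the next.

```python
# Group means of write for race = 1, 2, 3, 4, taken from the "means" output.
means = [46.4583, 58.0, 48.2, 54.0552]

# Repeated coding: each level is compared with the level that follows it.
repeated = [means[i] - means[i + 1] for i in range(len(means) - 1)]

labels = ["Level 1 vs. Level 2", "Level 2 vs. Level 3", "Level 3 vs. Level 4"]
for label, estimate in zip(labels, repeated):
    print(f"{label}: {estimate:.3f}")
```

The results should agree with the contrast estimates -11.542, 9.800 and -5.855 up to rounding.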

## SPECIAL USER-DEFINED CODING SYSTEM

Let's compare: 1) level 1 to level 3; 2) level 2 to levels 1 and 4;
and 3) levels 1 and 2 to levels 3 and 4.

Method 1:

if race = 1 x1 = -.5. if race = 2 x1 = .5. if race = 3 x1 = -1.5. if race = 4 x1 = 1.5.

if any(race,1,3) = 1 x2 = -1. if any(race,2,4) = 1 x2 = 1.

if any(race,1,3) = 1 x3 = 1.5. if race = 2 x3 = -.5. if race = 4 x3 = -2.5. execute.

regression /dep write /method = enter x1 x2 x3.

**< output omitted >**
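The x1, x2 and x3 codes in Method 1 look unrelated to the three comparisons, so a consistency check may help. The Python sketch below rests on our assumption (not stated in the SPSS output) that the regression codes should "undo" the contrast weights: when the three contrast rows, plus an intercept row that weights each level by .25, are multiplied by the coding columns (a constant plus x1, x2 and x3), the result should be an identity matrix.

```python
# Contrast weights for race = 1, 2, 3, 4. The first row is an assumed
# intercept row (unweighted mean of the four levels); the other three rows
# are the comparisons requested in this section.
contrast_rows = [
    [0.25, 0.25, 0.25, 0.25],
    [1.0, 0.0, -1.0, 0.0],     # level 1 vs. level 3
    [-0.5, 1.0, 0.0, -0.5],    # level 2 vs. levels 1 and 4
    [0.5, 0.5, -0.5, -0.5],    # levels 1 and 2 vs. levels 3 and 4
]

# Regression codes from Method 1, one list per variable, for race = 1, 2, 3, 4.
coding_columns = [
    [1.0, 1.0, 1.0, 1.0],      # constant
    [-0.5, 0.5, -1.5, 1.5],    # x1
    [-1.0, 1.0, -1.0, 1.0],    # x2
    [1.5, -0.5, 1.5, -2.5],    # x3
]

# Multiply each contrast row by each coding column.
product = [
    [sum(r * c for r, c in zip(row, col)) for col in coding_columns]
    for row in contrast_rows
]

for row in product:
    print(row)
```

Each row of the product has a single 1 and zeros elsewhere, which is consistent with x1, x2 and x3 estimating the three requested comparisons in turn.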

Method 2:

glm write by race /lmatrix "compare group 1 to group 3" race 1 0 -1 0 /lmatrix "compare group 2 to groups 1 and 4" race -.5 1 0 -.5 /lmatrix "compare groups 1 and 2 to groups 3 and 4" race .5 .5 -.5 -.5.

**< output omitted >**

Method 3:

glm write by race /contrast (race)=special(1 0 -1 0, -.5 1 0 -.5, .5 .5 -.5 -.5) /print = test(lmatrix).

**< some output omitted >**

**Contrast Coefficients (L' Matrix): RACE Special Contrast**

| Parameter | L1 | L2 | L3 |
|---|---|---|---|
| Intercept | .000 | .000 | .000 |
| [RACE=1.00] | 1.000 | -.500 | .500 |
| [RACE=2.00] | .000 | 1.000 | .500 |
| [RACE=3.00] | -1.000 | .000 | -.500 |
| [RACE=4.00] | .000 | -.500 | -.500 |

The default display of this matrix is the transpose of the corresponding L matrix.

**Contrast Results (K Matrix)**

| RACE Special Contrast | | Dependent Variable: writing score |
|---|---|---|
| L1 | Contrast Estimate | -1.742 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | -1.742 |
| | Std. Error | 2.732 |
| | Sig. | .525 |
| | 95% Confidence Interval for Difference: Lower Bound | -7.131 |
| | 95% Confidence Interval for Difference: Upper Bound | 3.647 |
| L2 | Contrast Estimate | 7.743 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | 7.743 |
| | Std. Error | 2.897 |
| | Sig. | .008 |
| | 95% Confidence Interval for Difference: Lower Bound | 2.030 |
| | 95% Confidence Interval for Difference: Upper Bound | 13.457 |
| L3 | Contrast Estimate | 1.102 |
| | Hypothesized Value | 0 |
| | Difference (Estimate - Hypothesized) | 1.102 |
| | Std. Error | 1.964 |
| | Sig. | .576 |
| | 95% Confidence Interval for Difference: Lower Bound | -2.772 |
| | 95% Confidence Interval for Difference: Upper Bound | 4.975 |

**Test Results**

| Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
| Error | 15964.717 | 196 | 81.453 | | |

The first comparison of the mean of the dependent variable for level 1 to level 3 of the categorical variable was not statistically significant, while the comparison of the mean of the dependent variable for level 2 to that of levels 1 and 4 was. The comparison of the mean of the dependent variable for levels 1 and 2 to that of levels 3 and 4 was not statistically significant.
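As with the built-in coding systems, each user-defined contrast estimate is the sum of the contrast weights multiplied by the group means. The Python sketch below (an arithmetic check only; the labels are ours) applies the three sets of weights from the /contrast statement to the means of write.

```python
# Group means of write for race = 1, 2, 3, 4, taken from the "means" output.
means = [46.4583, 58.0, 48.2, 54.0552]

# Weights from the special() contrast, one list per comparison.
special = {
    "L1: level 1 vs. level 3": [1.0, 0.0, -1.0, 0.0],
    "L2: level 2 vs. levels 1 and 4": [-0.5, 1.0, 0.0, -0.5],
    "L3: levels 1 and 2 vs. levels 3 and 4": [0.5, 0.5, -0.5, -0.5],
}

# Each estimate is the weighted sum of the group means.
estimates = {
    label: sum(w * m for w, m in zip(ws, means)) for label, ws in special.items()
}

for label, value in estimates.items():
    print(f"{label}: {value:.3f}")
```

The values should agree with the contrast estimates -1.742, 7.743 and 1.102 up to rounding.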