Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model. There are a variety of coding systems that can be used when recoding categorical variables, and which one you select depends on the comparisons that you want to make. For example, you may want to compare each level of the categorical variable to the lowest level (or any given level). In that case you would use a system called simple coding. Or you may want to compare each level to the next higher level, in which case you would want to use repeated coding. We will discuss two general types of coding and when to use them: dummy coding and effect coding.
The examples on this page use the dataset at https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hsb2-2.sav. We will focus on the categorical variable race, which has four levels (1 = Hispanic, 2 = Asian, 3 = African American and 4 = white), and we will use write as our dependent variable. Although our example uses a variable with four levels, these coding systems work with variables that have more categories or fewer categories. No matter which coding system you select, you will always have one fewer recoded variables than levels of the original variable. In our example, our categorical variable has four levels. We will therefore have three new variables. (A variable corresponding to the final level of the categorical variable would be redundant and therefore unnecessary.)
DUMMY CODING
Perhaps the simplest and most common coding system is called dummy coding. It is a way to make the categorical variable into a series of dichotomous variables (variables that can have a value of zero or one only). For all but one of the levels of the categorical variable, a new variable will be created that has a value of one for each observation at that level and zero for all others. In our example using the variable race, the first new variable (x1) will have a value of one for each observation in which race is Hispanic, and zero for all other observations. Likewise, we create x2 to be 1 when the person is Asian, and 0 otherwise, and x3 to be 1 when the person is African American, and 0 otherwise. The level of the categorical variable that is coded as zero in all of the new variables is the reference level, or the level to which all of the other levels are compared. In our example, white is the reference level. You can select any level of the categorical variable as the reference level.
DUMMY CODING
Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
1 (Hispanic) | 1 | 0 | 0 |
2 (Asian) | 0 | 1 | 0 |
3 (African American) | 0 | 0 | 1 |
4 (white) | 0 | 0 | 0 |
After the new variables are created, they are entered into the regression in place of the original variable, so we would enter x1, x2 and x3 instead of race into our regression equation. The regression output will include a coefficient for each of these variables. The coefficient for x1 is the mean of the dependent variable for group 1 minus the mean of the dependent variable for the omitted group. In our example, the coefficient for x1 would be the mean of write for the Hispanic group minus the mean of write for the white group. Likewise, the coefficient for x2 would be the mean of write for the Asian group minus the mean of write for the white group, and the coefficient for x3 would be the mean of write for the African American group minus the mean of write for the white group.
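The recoding rule described above can be sketched outside SPSS as well. The following is a minimal Python illustration, not part of the SPSS workflow on this page; the function name `dummy_code` and the six race values are hypothetical and chosen only for demonstration.

```python
def dummy_code(values, levels, reference):
    """Return one row of 0/1 indicators per observation, omitting the reference level."""
    kept = [level for level in levels if level != reference]
    return [[int(value == level) for level in kept] for value in values]

# Hypothetical race codes for six observations (1 = Hispanic ... 4 = white).
race = [1, 2, 3, 4, 4, 1]
rows = dummy_code(race, levels=[1, 2, 3, 4], reference=4)

for value, (x1, x2, x3) in zip(race, rows):
    print(f"race={value}  x1={x1} x2={x2} x3={x3}")
```

Observations at the reference level (race = 4) receive zero on all three new variables, which matches the last row of the dummy coding table.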
EFFECT CODING
Other coding systems use more values than just zero and one, and therefore allow you to make other types of comparisons. Unlike dummy coding, effect coding allows you to assign different weights to the various levels of the categorical variable. While the “rule” in dummy coding is that only values of zero and one are valid, the “rule” in effect coding is that all of the values in any new variable must sum to zero. Which level is assigned a positive or negative value is not very important: 0 1 -1 0 is the same as 0 -1 1 0 in that both of these codings compare the second and the third levels of the variable. Only the sign of the coefficient would change.
Another point to consider is that while you can use dummy coding with any type of categorical variable, some forms of effect coding make more sense with ordinal categorical variables than with nominal categorical variables. For our example we use the variable race, which is a nominal categorical variable. Because dummy coding compares the mean of the dependent variable for each level of the categorical variable to the mean of the dependent variable for the reference group, it makes sense with a nominal variable. However, it may not make as much sense to use a coding scheme that tests the linear effect of race. As we describe each type of coding system, we note those coding systems with which it does not make as much sense to use a nominal variable.
Within SPSS there are two general commands that you can use for analyzing data with a continuous dependent variable and one or more categorical predictors, the regression command and the glm command (that replaced the manova command, not discussed in this page). If using the regression command, you would create one fewer new variables than there are levels in your categorical variable and use these new variables as predictors in your regression model. The values for these new variables will depend on how many levels are in your categorical variable and the coding system you choose. From this point we will refer to the coding scheme as used in the regression command as regression coding. Another method for analyzing categorical data would be to use the glm command and then you could use the lmatrix or the contrast subcommands to perform comparisons among the groups. We will refer to this coding scheme as contrast coding. So, if you are using the regression command, be sure to choose the regression coding scheme and if you are using the glm command be sure to choose the contrast coding scheme.
Below is a table listing various types of contrasts and the comparison that they make.
Name of contrast | Comparison made |
Simple | Compares each level of a variable to the first level (or whichever level is specified) |
Deviation | Compares deviations from the grand mean |
Difference | Compares levels of a variable with the mean of the previous levels of the variable; also known as reverse-Helmert; this is an orthogonal contrast |
Helmert | Compares levels of a variable with the mean of the subsequent levels of the variable; this is an orthogonal contrast |
Polynomial | Orthogonal polynomial contrasts; the first degree of freedom contains the linear effect across the levels of the factor, the second degree of freedom contains the quadratic effect, and so on. In a balanced design, polynomial contrasts are orthogonal. |
Repeated | Compares adjacent levels of a variable; this is not an orthogonal contrast |
Special | User-defined contrast |
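The orthogonality notes in the table can be checked by hand: two contrasts are orthogonal when the products of their coefficients sum to zero. The sketch below is an illustration in Python (not SPSS) using the Helmert and repeated contrast coefficients for a four-level factor. It assumes equal group sizes, which is the setting in which a zero dot product of the coefficients corresponds to orthogonal contrasts; the groups in our data set are not equal in size.

```python
from fractions import Fraction as F
from itertools import combinations

# Contrast coefficients for a four-level factor (one tuple per contrast).
helmert = [
    (F(1), F(-1, 3), F(-1, 3), F(-1, 3)),
    (F(0), F(1), F(-1, 2), F(-1, 2)),
    (F(0), F(0), F(1), F(-1)),
]
repeated = [(1, -1, 0, 0), (0, 1, -1, 0), (0, 0, 1, -1)]

def is_orthogonal(contrasts):
    """True if every pair of contrast vectors has a zero dot product."""
    return all(
        sum(a * b for a, b in zip(u, v)) == 0
        for u, v in combinations(contrasts, 2)
    )

print("Helmert orthogonal: ", is_orthogonal(helmert))
print("Repeated orthogonal:", is_orthogonal(repeated))
```

The Helmert set passes the check, while the repeated set fails because adjacent comparisons share a level (for example, level 2 appears in both the first and second contrasts).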
SIMPLE EFFECT CODING
The results of simple effect coding are very similar to those of dummy coding in that each group is compared to the reference group. In the example below, group 4 is the reference group and the first comparison compares group 1 to group 4, the second comparison compares group 2 to group 4, and the third comparison compares group 3 to group 4.
The regression coding is a bit more complex than dummy coding. In our example below, x1 compares group 1 to group 4, x2 compares group 2 to group 4, and x3 compares group 3 to group 4. For x1 the coding is 3/4 (.75) for group 1, and -1/4 (-.25) for all other groups. Likewise, for x2 the coding is 3/4 (.75) for group 2, and -1/4 (-.25) for all other groups, and for x3 the coding is 3/4 (.75) for group 3, and -1/4 (-.25) for all other groups. Note that the values in each new variable must sum to 0.
SIMPLE regression coding
Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
1 (Hispanic) | .75 | -.25 | -.25 |
2 (Asian) | -.25 | .75 | -.25 |
3 (African American) | -.25 | -.25 | .75 |
4 (white) | -.25 | -.25 | -.25 |
The contrast coding, shown below, is more straightforward. It also follows the rule for effect coding that the values in each new variable sum to zero. The first contrast compares group 1 to group 4, and group 1 is coded “1” and group 4 is coded “-1”. Likewise, the second contrast compares group 2 to group 4 by coding group 2 “1” and group 4 “-1”. As you can see with contrast coding, you can discern the meaning of the comparisons simply by inspecting the contrast coefficients. For example, looking at the contrast coefficients for c3 you can see that this compares group 3 to group 4.
SIMPLE effect contrast coding
Level of race | New variable 1 (c1) | New variable 2 (c2) | New variable 3 (c3) |
1 (Hispanic) | 1 | 0 | 0 |
2 (Asian) | 0 | 1 | 0 |
3 (African American) | 0 | 0 | 1 |
4 (white) | -1 | -1 | -1 |
In the above examples, both the regression coefficient for x1 and the contrast estimate for c1 would be the mean of write for level 1 (Hispanic) minus the mean of write for level 4 (white). Likewise, the regression coefficient for x2 and the contrast estimate for c2 would be the mean of write for level 2 (Asian) minus the mean of write for level 4 (white).
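One way to convince yourself that the simple regression coding yields these mean differences is to plug the claimed coefficients back into the regression equation and confirm that each group's mean is reproduced. The Python sketch below does this with the group means of write reported in the means output later on this page. The intercept value (the unweighted average of the four group means) is not stated in the text; it is derived here from the algebra of the coding and should be read as an assumption of this illustration.

```python
# Group means of write by race, from the means table on this page.
means = {1: 46.4583, 2: 58.0000, 3: 48.2000, 4: 54.0552}

# Simple regression coding (rows = levels 1-4, columns = x1, x2, x3).
simple = {
    1: (0.75, -0.25, -0.25),
    2: (-0.25, 0.75, -0.25),
    3: (-0.25, -0.25, 0.75),
    4: (-0.25, -0.25, -0.25),
}

# Claimed coefficients: intercept = unweighted average of the group means,
# slopes = each group's mean minus the reference (level 4) mean.
intercept = sum(means.values()) / len(means)
slopes = [means[g] - means[4] for g in (1, 2, 3)]

def predict(level):
    """Fitted value for a level: intercept plus coded values times slopes."""
    return intercept + sum(x * b for x, b in zip(simple[level], slopes))

for level, mean in means.items():
    print(level, round(predict(level), 4), mean)
```

Each fitted value equals the corresponding group mean, so the slopes are indeed the level 1, 2 and 3 means minus the level 4 mean. Unlike dummy coding, the intercept here is not the mean of the reference group.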
DEVIATION EFFECT CODING
This coding system compares the mean of the dependent variable for a given level to the grand mean of the dependent variable. In our example below, the first comparison compares level 1 (Hispanic) to the grand mean, the second comparison compares level 2 (Asian) to the grand mean, and the third comparison compares level 3 (African American) to the grand mean. Up to a scaling factor, this is the same as comparing each of these levels to the 3 other groups.
As you see in the example below, the regression coding is accomplished by assigning “1” to group 1 for the first comparison (since group 1 is the group to be compared to all others), a “1” to group 2 for the second comparison (since group 2 is to be compared to all others), and “1” to group 3 for the third comparison (since group 3 is to be compared to all others). Note that a “-1” is assigned to group 4 for all 3 comparisons (since it is the group that is never compared to the other groups) and all other values are assigned a 0. This regression coding scheme yields the comparisons described above.
DEVIATION regression coding
Level of race | New variable 1 (x1): Level 1 v. Mean | New variable 2 (x2): Level 2 v. Mean | New variable 3 (x3): Level 3 v. Mean |
1 (Hispanic) | 1 | 0 | 0 |
2 (Asian) | 0 | 1 | 0 |
3 (African American) | 0 | 0 | 1 |
4 (white) | -1 | -1 | -1 |
The contrast coding is shown below. The first comparison, which compares group 1 to groups 2, 3 and 4, assigns 3/4 (.75) to group 1 and -1/4 (-.25) to groups 2, 3 and 4. Likewise, the second comparison, which compares group 2 to groups 1, 3 and 4, assigns 3/4 (.75) to group 2 and -1/4 (-.25) to groups 1, 3 and 4, and so forth for the third comparison. Note that you could substitute 3 for 3/4 and -1 for -1/4 and you would get the same test of significance, but the contrast estimate would be different.
DEVIATION contrast coding
Level of race | New variable 1 (c1): Level 1 v. Mean | New variable 2 (c2): Level 2 v. Mean | New variable 3 (c3): Level 3 v. Mean |
1 (Hispanic) | .75 | -.25 | -.25 |
2 (Asian) | -.25 | .75 | -.25 |
3 (African American) | -.25 | -.25 | .75 |
4 (white) | -.25 | -.25 | -.25 |
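In the above examples, both the regression coefficient for x1 and the contrast estimate for c1 would be the mean of write for level 1 (Hispanic) minus the grand mean, where the grand mean is the unweighted average of the four level means. This is three-fourths of the difference between the mean of write for level 1 and the average of the means for levels 2, 3 and 4, which is why the comparison can also be described as level 1 versus the other three levels. Likewise, the regression coefficient for x2 and the contrast estimate for c2 would be the mean of write for level 2 (Asian) minus the grand mean.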
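As a rough numeric illustration of the deviation contrasts, the Python sketch below applies the contrast coefficients from the table above to the group means of write reported in the means output later on this page. It treats the grand mean as the unweighted average of the four level means, which follows from the coefficients themselves rather than from anything SPSS prints.

```python
# Group means of write for levels 1-4 of race, from the means table on this page.
means = [46.4583, 58.0000, 48.2000, 54.0552]

# Deviation contrast coefficients (one tuple per contrast).
contrasts = {
    "c1": (0.75, -0.25, -0.25, -0.25),
    "c2": (-0.25, 0.75, -0.25, -0.25),
    "c3": (-0.25, -0.25, 0.75, -0.25),
}

grand_mean = sum(means) / len(means)  # unweighted average of the level means

# A contrast estimate is the sum of coefficient times group mean.
estimates = {
    name: sum(c * m for c, m in zip(coefs, means))
    for name, coefs in contrasts.items()
}

for i, (name, est) in enumerate(estimates.items()):
    print(name, round(est, 4), round(means[i] - grand_mean, 4))
```

The two printed columns agree for every contrast: each estimate is the level mean minus the unweighted grand mean.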
DIFFERENCE CODING
In this coding system, each level is compared to the mean of the previous levels. In our example, the first comparison compares the mean of the dependent variable for level 2 of race with the mean of the dependent variable for level 1 of race. The second comparison compares the mean of the dependent variable for level 3 of race with the mean of the dependent variable for levels 1 and 2 of race, and the third comparison compares the mean of the dependent variable for level 4 of race with the mean of the dependent variable for levels 1, 2 and 3 of race. Clearly, this coding system does not make much sense with our example of race because it is a nominal variable. However, this system is useful when the levels of the categorical variable are ordered in a meaningful way. For example, if we had a categorical variable in which work-related stress was coded as low, medium or high, then comparing each level with the mean of the previous levels would make more sense.
Below we see an example of regression coding. For the first comparison, where the first and second levels are compared, x1 is coded -1/2 (-.5) and 1/2 (.5) and the rest 0. For the second comparison, the values of x2 are coded -1/3 (-.333) then -1/3 (-.333) then 2/3 (.666) and then 0. Finally, for the third comparison, the values of x3 are coded -1/4 -1/4 -1/4 and then 3/4.
DIFFERENCE regression coding
Level of race | New variable 1 (x1): Level 2 v. Level 1 | New variable 2 (x2): Level 3 v. Previous | New variable 3 (x3): Level 4 v. Previous |
1 (Hispanic) | -.5 | -.333 | -.25 |
2 (Asian) | .5 | -.333 | -.25 |
3 (African American) | 0 | .666 | -.25 |
4 (white) | 0 | 0 | .75 |
For contrast coding, we see that the first comparison, which compares groups 1 and 2, is coded -1 and 1 for these groups, and 0 otherwise. The second comparison, which compares groups 1 and 2 with group 3, is coded -.5 -.5 1 and 0, and the last comparison, which compares groups 1, 2 and 3 with group 4, is coded -.333 -.333 -.333 and 1.
DIFFERENCE contrast coding
Level of race | New variable 1 (c1): Level 2 v. Level 1 | New variable 2 (c2): Level 3 v. Previous | New variable 3 (c3): Level 4 v. Previous |
1 (Hispanic) | -1 | -.5 | -.333 |
2 (Asian) | 1 | -.5 | -.333 |
3 (African American) | 0 | 1 | -.333 |
4 (white) | 0 | 0 | 1 |
In the above examples, both the regression coefficient for x1 and the contrast estimate for c1 would be the mean of write for level 2 (Asian) minus the mean of write for level 1 (Hispanic). Likewise, the regression coefficient for x2 and the contrast estimate for c2 would be the mean of write for level 3 minus the mean of write for levels 1 and 2 combined. Finally, the regression coefficient for x3 and the contrast estimate for c3 would be the mean of write for level 4 minus the mean of write for levels 1, 2 and 3 combined. Here “combined” refers to the average of the level means, as implied by the equal weights in the coding.
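The pattern in the difference regression coding table generalizes to any number of levels. The Python sketch below is one way to generate it; the function name `difference_coding` and the generating rule are inferred from the four-level table above rather than taken from SPSS, so treat them as assumptions. Exact fractions are used so the output can be compared with the table without rounding.

```python
from fractions import Fraction

def difference_coding(k):
    """Return k rows (levels) by k-1 columns of difference regression codes.

    Column j compares level j+1 with the mean of levels 1..j: the earlier
    levels get -1/(j+1), level j+1 gets j/(j+1), and later levels get 0.
    """
    rows = [[Fraction(0)] * (k - 1) for _ in range(k)]
    for j in range(1, k):
        for level in range(j):          # 0-based indices of the previous levels
            rows[level][j - 1] = Fraction(-1, j + 1)
        rows[j][j - 1] = Fraction(j, j + 1)
    return rows

coding = difference_coding(4)
for level, row in enumerate(coding, start=1):
    print(level, [str(value) for value in row])
```

For four levels this reproduces the table (-1/2 and 1/2 for x1, -1/3 -1/3 2/3 0 for x2, and -1/4 -1/4 -1/4 3/4 for x3), and every column sums to zero as effect coding requires.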
HELMERT EFFECT CODING
Helmert coding is just the opposite of difference coding: instead of comparing each level of the categorical variable to the mean of the previous levels, it compares each level to the mean of the subsequent levels. Hence, the first contrast compares the mean of the dependent variable for level 1 of race with the mean of all of the subsequent levels of race (levels 2, 3, and 4), the second contrast compares the mean of the dependent variable for level 2 of race with the mean of all of the subsequent levels of race (levels 3 and 4), and the third contrast compares the mean of the dependent variable for level 3 of race with the mean of all of the subsequent levels of race (level 4). As with difference coding, this type of coding is useful in situations where the levels of the categorical variable are ordered, say, from lowest to highest, or smallest to largest.
Below we see an example of regression coding, and you can see that the coding is simply the mirror image of the difference coding. For the first comparison (comparing 1 with 2, 3, and 4) the codes are 3/4 and -1/4 -1/4 -1/4. The second comparison compares groups 2 with 3 and 4 and is coded 0 2/3 -1/3 -1/3. The third comparison compares levels 3 and 4 and is coded 0 0 1/2 -1/2.
HELMERT regression coding
Level of race | New variable 1 (x1): Level 1 v. Later | New variable 2 (x2): Level 2 v. Later | New variable 3 (x3): Level 3 v. Later |
1 (Hispanic) | .75 | 0 | 0 |
2 (Asian) | -.25 | .666 | 0 |
3 (African American) | -.25 | -.333 | .5 |
4 (white) | -.25 | -.333 | -.5 |
For contrast coding, we see that the first comparison comparing group 1 with groups 2, 3 and 4 is coded 1 -.333 -.333 -.333 reflecting the comparison of group 1 versus all other groups. The second comparison is coded 0 1 -.5 -.5 reflecting that it compares group 2 with groups 3 and 4. The 3rd comparison is coded 0 0 1 -1 reflecting that group 3 is compared to group 4.
HELMERT contrast coding
Level of race | New variable 1 (c1): Level 1 v. Later | New variable 2 (c2): Level 2 v. Later | New variable 3 (c3): Level 3 v. Later |
1 (Hispanic) | 1 | 0 | 0 |
2 (Asian) | -.333 | 1 | 0 |
3 (African American) | -.333 | -.5 | 1 |
4 (white) | -.333 | -.5 | -1 |
In the above examples, both the regression coefficient for x1 and the contrast estimate for c1 would be the mean of write for level 1 (Hispanic) minus the mean of write for all subsequent levels (levels 2, 3 and 4). Likewise, the regression coefficient for x2 and the contrast estimate for c2 would be the mean of write for level 2 minus the mean of write for levels 3 and 4. Finally, the regression coefficient for x3 and the contrast estimate for c3 would be the mean of write for level 3 minus the mean of write for level 4. In each case the mean of the subsequent levels is the average of their level means.
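To make the Helmert comparisons concrete, the Python sketch below builds the contrast coefficients for a four-level factor and applies them to the group means of write reported in the means output later on this page. The helper name `helmert_contrasts` is hypothetical, and the resulting estimates are simple arithmetic on the reported means rather than SPSS output.

```python
def helmert_contrasts(k):
    """Contrast j gives level j a 1 and each later level -1/(number of later levels)."""
    contrasts = []
    for j in range(k - 1):
        later = k - j - 1
        contrasts.append([0.0] * j + [1.0] + [-1.0 / later] * later)
    return contrasts

# Group means of write for levels 1-4 of race, from the means table on this page.
means = [46.4583, 58.0000, 48.2000, 54.0552]

estimates = [
    sum(c * m for c, m in zip(contrast, means))
    for contrast in helmert_contrasts(4)
]

for number, estimate in enumerate(estimates, start=1):
    print(f"c{number}: {estimate:.4f}")
```

The third estimate is simply the level 3 mean minus the level 4 mean, which is the same difference that the dummy-coded x3 coefficient reports.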
ORTHOGONAL POLYNOMIAL CODING
Orthogonal polynomial coding is a form of trend analysis in that it is looking for the linear, quadratic and cubic trends in the categorical variable. This type of coding system should be used only with an ordinal variable in which the levels are equally spaced. An example of such a variable might be income, or education.
POLYNOMIAL
Level of race | Linear (x1) | Quadratic (x2) | Cubic (x3) |
1 (Hispanic) | -.671 | .5 | -.224 |
2 (Asian) | -.224 | -.5 | .671 |
3 (African American) | .224 | -.5 | -.671 |
4 (white) | .671 | .5 | .224 |
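The values in the polynomial table can be reproduced from the standard integer trend coefficients for four equally spaced levels by scaling each column to unit length. The Python sketch below shows this; the integer coefficients are the usual textbook values and are an assumption here in the sense that the page itself does not list them.

```python
import math

# Standard integer trend coefficients for four equally spaced levels.
raw = {
    "linear": (-3, -1, 1, 3),
    "quadratic": (1, -1, -1, 1),
    "cubic": (-1, 3, -3, 1),
}

def normalize(vector):
    """Scale a coefficient vector so that its squared values sum to one."""
    length = math.sqrt(sum(v * v for v in vector))
    return [v / length for v in vector]

codes = {name: [round(v, 3) for v in normalize(vec)] for name, vec in raw.items()}

for name, values in codes.items():
    print(name, values)

# Each pair of trends has a zero dot product, and each column sums to zero.
dot_linear_quadratic = sum(a * b for a, b in zip(raw["linear"], raw["quadratic"]))
dot_linear_cubic = sum(a * b for a, b in zip(raw["linear"], raw["cubic"]))
dot_quadratic_cubic = sum(a * b for a, b in zip(raw["quadratic"], raw["cubic"]))
```

The rounded values match the Linear, Quadratic and Cubic columns of the table, and the three dot products are all zero.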
REPEATED EFFECT CODING
In this coding system, the mean of the dependent variable for one level of the categorical variable is compared to the mean of the dependent variable for the adjacent level. In our example below, the first comparison compares the mean of write for level 1 with the mean of write for level 2 of race (Hispanic minus Asian). The second comparison compares the mean of write for level 2 minus level 3, and the third comparison compares the mean of write for level 3 minus level 4. This type of coding may be useful with either a nominal or an ordinal variable.
Below we see an example of regression coding. For the first comparison, where the first and second levels are compared, x1 is coded 3/4 for level 1 and -1/4 for the other levels. For the second comparison, where level 2 is compared with level 3, x2 is coded 1/2 1/2 -1/2 -1/2, and for the third comparison, where level 3 is compared with level 4, x3 is coded 1/4 1/4 1/4 and -3/4.
REPEATED regression
Level of race | New variable 1 (x1): Level 1 v. Level 2 | New variable 2 (x2): Level 2 v. Level 3 | New variable 3 (x3): Level 3 v. Level 4 |
1 (Hispanic) | .75 | .5 | .25 |
2 (Asian) | -.25 | .5 | .25 |
3 (African American) | -.25 | -.5 | .25 |
4 (white) | -.25 | -.5 | -.75 |
For contrast coding, the coding more naturally reflects the comparisons being made. The first comparison is coded 1 -1 0 0 reflecting that group 1 is compared to group 2. The second comparison is coded 0 1 -1 0 reflecting that group 2 is compared to group 3, and the third comparison is coded 0 0 1 -1 reflecting that group 3 is compared with group 4.
REPEATED contrast coding
Level of race | New variable 1 (c1): Level 1 v. Level 2 | New variable 2 (c2): Level 2 v. Level 3 | New variable 3 (c3): Level 3 v. Level 4 |
1 (Hispanic) | 1 | 0 | 0 |
2 (Asian) | -1 | 1 | 0 |
3 (African American) | 0 | -1 | 1 |
4 (white) | 0 | 0 | -1 |
In the above examples, both the regression coefficient for x1 and the contrast estimate for c1 would be the mean of write for level 1 (Hispanic) minus the mean of write for level 2 (Asian). Likewise, the regression coefficient for x2 and the contrast estimate for c2 would be the mean of write for level 2 (Asian) minus the mean of write for level 3 (African American), and the regression coefficient for x3 and the contrast estimate for c3 would be the mean of write for level 3 (African American) minus the mean of write for level 4 (white).
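The regression coding and the contrast coding for a given system are two views of the same comparisons. One way to see the link, sketched below in Python, is to stack an averaging row (1/4 for each level, which defines the intercept as the unweighted mean of the level means) on top of the three contrast rows and invert that matrix. The columns of the inverse after the first are then the regression codes. This derivation is an illustration using the repeated contrasts above and is not something SPSS prints; the helper `invert` is a hypothetical name for a plain Gauss-Jordan routine.

```python
from fractions import Fraction

def invert(matrix):
    """Invert a square matrix exactly with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

quarter = Fraction(1, 4)
contrast_rows = [
    [quarter] * 4,   # intercept row: unweighted mean of the four levels
    [1, -1, 0, 0],   # c1: level 1 v. level 2
    [0, 1, -1, 0],   # c2: level 2 v. level 3
    [0, 0, 1, -1],   # c3: level 3 v. level 4
]

inverse = invert(contrast_rows)
regression_coding = [row[1:] for row in inverse]  # drop the column of ones

for level, row in enumerate(regression_coding, start=1):
    print(level, [str(value) for value in row])
```

The printed rows agree with the REPEATED regression table (3/4 1/2 1/4 for level 1 through -1/4 -1/2 -3/4 for level 4), which is why the two codings give the same estimates.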
SYNTAX
For most coding systems, there are two ways to code categorical variables: manually coding them and having SPSS code them for you. There are benefits and drawbacks to both approaches. The benefit of manually coding variables is that you have absolute control over how they are coded. The drawback to this approach is that it is relatively easy to make an error in writing the syntax. An error may be difficult to find, particularly if the error is a logic error instead of a syntax error. (SPSS will give you an error message in the output window if there is a syntax error, but not if there is a logic error.) One way to avoid having an error in your syntax is by allowing SPSS to code the variable(s) for you, but in doing so, you may have to give up some control over how the codes are assigned. Also, SPSS will not create certain kinds of codes for you, most notably dummy codes. Below we show two ways to create dummy codes and three ways to create each type of effect coding for our example using the four-level categorical variable race.
Before considering any analyses, let’s look at the mean of the dependent variable, write, for each level of race. This will help in interpreting the output from the analyses.
means tables = write by race.
Cases | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
---|---|---|---|---|---|---|
writing score * RACE | 200 | 100.0% | 0 | .0% | 200 | 100.0% |
RACE | Mean | N |
---|---|---|
hispanic | 46.4583 | 24 |
asian | 58.0000 | 11 |
african-amer | 48.2000 | 20 |
white | 54.0552 | 145 |
Total | 52.7750 | 200 |
DUMMY CODING
In Method 1, we create a new variable (i.e., x1) that is set equal to zero. Then we change the value of this new variable to equal one if the level in the original (categorical) variable is one. We repeat this process for each new variable that we need to create. In Method 2, we use a “do-loop” to generate the new variables, which can be useful if your categorical variable has a large number of levels.
* Method 1 for creating dummy variables.
compute x1 = 0.
if race = 1 x1 = 1.
compute x2 = 0.
if race = 2 x2 = 1.
compute x3 = 0.
if race = 3 x3 = 1.
execute.
* Method 2 for creating dummy variables.
do repeat A=x1 x2 x3 /B=1 2 3.
compute A=(race=B).
end repeat.
execute.
regression /dep write /method = enter x1 x2 x3.
Model | Variables Entered | Variables Removed | Method |
---|---|---|---|
1 | X3, X2, X1(a) | . | Enter |
a All requested variables entered. | b Dependent Variable: writing score |
The table above shows which variables were entered into the regression equation. It also indicates that the method used was “enter”, as opposed to other possible methods that could have been specified, such as backward, forward or stepwise. The table also indicates that all of the variables listed on the /method= statement were entered into the regression equation.
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
---|---|---|---|---|
1 | .327(a) | .107 | .093 | 9.02511 |
a Predictors: (Constant), X3, X2, X1 |
Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|---|
1 | Regression | 1914.158 | 3 | 638.053 | 7.833 | .000(a) |
1 | Residual | 15964.717 | 196 | 81.453 | | |
1 | Total | 17878.875 | 199 | | | |
a Predictors: (Constant), X3, X2, X1 | b Dependent Variable: writing score |
The table above entitled “Model Summary” indicates that one model was tested, that 10.7% of the variance in the dependent variable is accounted for by the independent variables, and that 9.3% of the variance of the dependent variable is accounted for by the independent variables when the number of independent variables in the equation is taken into consideration. The standard error of the estimate is also given. The table entitled “ANOVA” gives the sum of squares and the degrees of freedom (in the column labeled “df”) for the regression, the residual and the total (regression plus residual). The mean square is given for the regression and the residual, and the F-value and the associated p-value (in the column labeled Sig.) are displayed. These results indicate that the regression is statistically significant at the .05 alpha level. As you will see, the overall test of race is the same regardless of the coding system used.
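The summary statistics in these two tables are tied together by simple arithmetic. As a quick check, the Python sketch below recomputes R Square, Adjusted R Square, the F-value and the standard error of the estimate from the sums of squares and degrees of freedom in the ANOVA table, using the standard textbook formulas.

```python
import math

# Values taken from the ANOVA table above.
ss_regression, df_regression = 1914.158, 3
ss_residual, df_residual = 15964.717, 196

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_value = ms_regression / ms_residual
r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * df_total / df_residual
std_error_estimate = math.sqrt(ms_residual)

print(f"R Square          = {r_square:.3f}")
print(f"Adjusted R Square = {adjusted_r_square:.3f}")
print(f"F                 = {f_value:.3f}")
print(f"Std. Error of Est = {std_error_estimate:.5f}")
```

The results agree with the printed output to the number of decimals shown (.107, .093, 7.833 and 9.02511).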
Model | Term | B (Unstandardized) | Std. Error (Unstandardized) | Beta (Standardized) | t | Sig. |
---|---|---|---|---|---|---|
1 | (Constant) | 54.055 | .749 | | 72.122 | .000 |
1 | X1 | -7.597 | 1.989 | -.261 | -3.820 | .000 |
1 | X2 | 3.945 | 2.823 | .095 | 1.398 | .164 |
1 | X3 | -5.855 | 2.153 | -.186 | -2.720 | .007 |
a Dependent Variable: writing score |
The table above gives the unstandardized coefficients for the regression equation (in the column labeled B) and their standard errors (in the column labeled Std. Error). When using dummy coding, the constant is the mean of the dependent variable for the omitted level of the categorical variable. The coefficient for x1 is the mean of the dependent variable for level 1 of race minus the mean of the dependent variable for level 4 of race (the reference level). Likewise, the coefficients for x2 and x3 are each the mean of the dependent variable at that level of race minus the mean of the dependent variable for the reference level. The standardized coefficients are given in the column labeled Beta. The t-values and associated p-values are also given. The statistical significance of the constant is rarely of interest to researchers. The coefficients for x1 and x3 are statistically significant at the .05 (and .01) alpha level, while the coefficient for x2 is not. This indicates that level 1 of race (Hispanic) is significantly different from level 4 (white), and that level 3 (African American) is significantly different from level 4 (white).
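The link between the coefficients and the group means can be verified with the numbers already shown on this page. The short Python check below subtracts the reference-group mean from each of the other group means in the means table and compares the result with the B column of the Coefficients table.

```python
# Group means of write by race, from the means table on this page.
group_means = {
    "hispanic": 46.4583,
    "asian": 58.0000,
    "african-amer": 48.2000,
    "white": 54.0552,   # reference level
}

# Unstandardized coefficients for x1, x2, x3 from the Coefficients table.
reported = {"hispanic": -7.597, "asian": 3.945, "african-amer": -5.855}

reference = group_means["white"]
differences = {
    group: round(mean - reference, 3)
    for group, mean in group_means.items()
    if group != "white"
}

print("constant:", round(reference, 3))
for group, diff in differences.items():
    print(group, diff, reported[group])
```

Each difference matches the reported coefficient, and the reference-group mean rounds to the reported constant of 54.055.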
EFFECT CODING
When doing any sort of effect coding, there are three approaches to the coding of the variables. The first approach is to manually compute them for use in OLS regression, which is shown in Method 1. You create a new variable, setting it equal to one of the values that it will assume, and then use “if” statements to change the value according to the values in the original (categorical) variable. If you use this approach, you can use either “regression” or “glm”. The second approach is to use “glm” with “/lmatrix” statements. You will need to use one “/lmatrix” statement for each contrast. Hence, in our example, because we have a four-level categorical variable, we will need to use three “/lmatrix” statements (all of which are part of the same “glm” command). The third approach is to use “glm” and include a “/contrast () =” statement, placing the name of the categorical variable in the parentheses and the name of the contrast to be used after the equal sign. Below are examples of all three approaches. In Method 3, we include a “/print” statement with the “test(lmatrix)” option so that SPSS prints out the coding system used for the contrasts. For the example using difference coding, we also include the “parameter” option on the print statement. This causes SPSS to print out the coding system used for the regression analysis as well as the results of the regression analysis. This illustrates how the two coding systems are different and shows that the results of the regression are the same as when dummy coding is used. In the interest of conserving space, we include the output only for the third method of creating the codes. However, the output from the other methods will be very similar and will contain all of the same values for parameter estimates, tests of statistical significance, etc. We have interspersed explanations into the following output. 
For the other types of coding systems, we omit the output that is the same and only discuss the output that changes as a result of the different coding system used.
SIMPLE EFFECT CODING
Method 1:
if race = 1 x1 = .75.
if any(race,2,3,4) x1 = -.25.
if race = 2 x2 = .75.
if any(race,1,3,4) x2 = -.25.
if race = 3 x3 = .75.
if any(race,1,2,4) x3 = -.25.
execute.
regression /dependent = write /method = enter x1 x2 x3.
< output omitted >
Method 2:
glm write by race
  /lmatrix "group 1 versus group 4" race 1 0 0 -1
  /lmatrix "group 2 versus group 4" race 0 1 0 -1
  /lmatrix "group 3 versus group 4" race 0 0 1 -1.
< output omitted >
Method 3:
glm write by race
  /contrast (race)=simple
  /print = parameter test(lmatrix).
Factor | Value | Value Label | N |
---|---|---|---|
RACE | 1.00 | hispanic | 24 |
RACE | 2.00 | asian | 11 |
RACE | 3.00 | african-amer | 20 |
RACE | 4.00 | white | 145 |
Source | Type III Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|
Corrected Model | 1914.158(a) | 3 | 638.053 | 7.833 | .000 |
Intercept | 225523.580 | 1 | 225523.580 | 2768.770 | .000 |
RACE | 1914.158 | 3 | 638.053 | 7.833 | .000 |
Error | 15964.717 | 196 | 81.453 | ||
Total | 574919.000 | 200 | |||
Corrected Total | 17878.875 | 199 | |||
a R Squared = .107 (Adjusted R Squared = .093) |
The table above entitled “Between-Subjects Factors” shows the levels of the categorical variable, the value label associated with each level (if any) and the number of observations at each level (in the column N). The table entitled “Tests of Between-Subjects Effects” shows the source, the type III sums of squares, the degrees of freedom (called “df”), the mean square, the F values and the corresponding p-values. The F-value for the corrected model of 7.833 and its p-value of .000 indicate that the overall model is statistically significant. The F- and p-values for race are the same because in this model, we have only one independent variable. If we had more than one independent variable, the F- and p-values for the overall model would be different from those for the independent variables. The F- and p-values for the intercept are also statistically significant, but those are rarely of interest.
Parameter | B | Std. Error | t | Sig. | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|
Intercept | 54.055 | .749 | 72.122 | .000 | 52.577 | 55.533 |
[RACE=1.00] | -7.597 | 1.989 | -3.820 | .000 | -11.519 | -3.675 |
[RACE=2.00] | 3.945 | 2.823 | 1.398 | .164 | -1.622 | 9.511 |
[RACE=3.00] | -5.855 | 2.153 | -2.720 | .007 | -10.101 | -1.610 |
[RACE=4.00] | 0(a) | . | . | . | . | . |
a This parameter is set to zero because it is redundant. |
The table above entitled “Parameter Estimates” gives the coefficients (in the column labeled B), the associated standard errors (in the column labeled Std. Error), the associated t-values, the associated p-values (in the column labeled Sig.), and the lower and upper bounds for the 95% confidence interval.
For our example, the regression equation would be: y = 54.055 - 7.597*x1 + 3.945*x2 - 5.855*x3. All of the coefficients are statistically significant at the .05 alpha level except the one for x2. In other words, the means of the dependent variable (write) for levels 1 and 3 of race are each statistically significantly different from the mean of the dependent variable for level 4 (the omitted level), but the mean for level 2 is not. Furthermore, the 95% confidence interval for the coefficient for x1 runs from -11.519 to -3.675. Likewise, the 95% confidence interval for the coefficient for x2 runs from -1.622 to 9.511, and so on.
You will notice that the values given in the “ANOVA” and “Coefficients” tables in the section on dummy coding are the same as the values given in the “Tests of Between-Subjects Effects” and “Parameter Estimates” tables. This is because, as mentioned previously, dummy coding and simple effect coding yield the same results when the same reference level is used in both coding systems.
Contrast | |
---|---|
Parameter | L1 |
Intercept | 1.000 |
[RACE=1.00] | .250 |
[RACE=2.00] | .250 |
[RACE=3.00] | .250 |
[RACE=4.00] | .250 |
The default display of this matrix is the transpose of the corresponding L matrix. Based on Type III Sums of Squares. |
Contrast | |||
---|---|---|---|
Parameter | L2 | L3 | L4 |
Intercept | 0 | 0 | 0 |
[RACE=1.00] | 1 | 0 | 0 |
[RACE=2.00] | 0 | 1 | 0 |
[RACE=3.00] | 0 | 0 | 1 |
[RACE=4.00] | -1 | -1 | -1 |
The default display of this matrix is the transpose of the corresponding L matrix. Based on Type III Sums of Squares. |
The table above entitled “Intercept” shows the coding that SPSS used for the intercept. Each level of race was given an equal value (.250), and the sum of those is the intercept (1.000). The table entitled “Race” shows the coding for race that was used in the calculations regarding the regression above. Notice that it is simple effect coding, but that the same results would have been obtained using dummy coding. In this instance, the only difference between simple effect coding and dummy coding is the value assigned to the reference level (race = 4). Because it is the reference level, the only important point is that it have the same value in each of the new variables (called L2, L3 and L4). What that value is, either negative one in simple effect coding or zero in dummy coding, is irrelevant. Regardless of the coding system requested, SPSS will calculate the regression using simple effect coding. The coding system you specify on the /contrast= statement will be used only in calculating the contrast estimates.
RACE Simple Contrast(a) | |||||||
---|---|---|---|---|---|---|---|
Parameter | Level 1 vs. Level 4 | Level 2 vs. Level 4 | Level 3 vs. Level 4 | ||||
Intercept | 0 | 0 | 0 | ||||
[RACE=1.00] | 1 | 0 | 0 | ||||
[RACE=2.00] | 0 | 1 | 0 | ||||
[RACE=3.00] | 0 | 0 | 1 | ||||
[RACE=4.00] | -1 | -1 | -1 | ||||
The default display of this matrix is the transpose of the corresponding L matrix. | a Reference category = 4 |
RACE Simple Contrast(a) (dependent variable: writing score) | Contrast Estimate | Hypothesized Value | Difference (Estimate – Hypothesized) | Std. Error | Sig. | 95% Confidence Interval for Difference, Lower Bound | 95% Confidence Interval for Difference, Upper Bound |
---|---|---|---|---|---|---|---|
Level 1 vs. Level 4 | -7.597 | 0 | -7.597 | 1.989 | .000 | -11.519 | -3.675 |
Level 2 vs. Level 4 | 3.945 | 0 | 3.945 | 2.823 | .164 | -1.622 | 9.511 |
Level 3 vs. Level 4 | -5.855 | 0 | -5.855 | 2.153 | .007 | -10.101 | -1.610 |
a Reference category = 4 |
Source | Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|
Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
Error | 15964.717 | 196 | 81.453 |
The table above entitled “Contrast Coefficients (L’ Matrix)” shows the coding scheme that was used for each comparison. The table entitled “Contrast Results (K Matrix)” shows the results of the various contrasts. In our example, the difference between level 1 of race and level 4 of race is statistically significant. You will notice that the contrast estimate is the mean of the dependent variable for the first level minus the mean of the dependent variable for the reference level (level 4). In other words, 46.4583 – 54.0552 = -7.597. The hypothesized value is zero (and is zero for all contrast tests). This means that the null hypothesis is that the contrast equals zero, which is almost always the null hypothesis in which researchers are interested. The entry labeled Difference (Estimate – Hypothesized) gives the difference between the contrast estimate and the hypothesized value. Because the hypothesized value is always zero, the contrast estimate and the difference between the contrast estimate and the hypothesized value are the same. Therefore, you can refer to either the contrast estimate or the difference as being statistically significant or not. In our example, the difference between level 2 of race and level 4 of race is not statistically significant, and the difference between level 3 of race and level 4 of race is statistically significant. You will notice that the values given in this table are the same as those given in the “Parameter Estimates” table. This is because both used the same coding system and the same reference level. If a different coding system had been requested on the /contrast= statement, or if a different reference level had been specified, the two tables would not have the same numbers. The table entitled “Test Results” gives a test of all of the contrasts taken together, and it indicates that the effect of race is statistically significant.
The results of this test are identical to the overall test of race because there are no other independent variables in the model. If there were, the results of the two tests would be different from one another.
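The simple contrast estimates can be checked directly from the group means of write quoted on this page. The following is a minimal Python sketch of that subtraction, included for illustration only; the variable names are our own.

```python
# Mean of write at each level of race, as quoted on this page.
means = {1: 46.4583, 2: 58.0, 3: 48.2, 4: 54.0552}
reference_level = 4  # white

# Simple contrast: mean of a level minus the mean of the reference level.
simple_contrasts = {
    level: mean - means[reference_level]
    for level, mean in means.items()
    if level != reference_level
}

for level, estimate in simple_contrasts.items():
    print(f"Level {level} vs. Level {reference_level}: {estimate:.3f}")
```

The three differences match the contrast estimates in the “Contrast Results (K Matrix)” table and the coefficients in the “Parameter Estimates” table.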
DEVIATION CODING
Method 1:
if race = 1 x1 = 1.
if any(race,2,3) x1 = 0.
if race = 4 x1 = -1.
if race = 2 x2 = 1.
if any(race,1,3) x2 = 0.
if race = 4 x2 = -1.
if race = 3 x3 = 1.
if any(race,1,2) x3 = 0.
if race = 4 x3 = -1.
execute.
regression /dep write /method = enter x1 x2 x3.
< output omitted >
Method 2:
glm write by race
  /lmatrix "group 1 versus groups 2 3 and 4" race .75 -.25 -.25 -.25
  /lmatrix "group 2 versus groups 1 3 and 4" race -.25 .75 -.25 -.25
  /lmatrix "group 3 versus groups 1 2 and 4" race -.25 -.25 .75 -.25.
< output omitted >
Method 3:
glm write by race /contrast (race)=deviation /print = parameter test(lmatrix).
RACE | Value Label | N |
---|---|---|
1.00 | hispanic | 24 |
2.00 | asian | 11 |
3.00 | african-amer | 20 |
4.00 | white | 145 |
Source | Type III Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|
Corrected Model | 1914.158(a) | 3 | 638.053 | 7.833 | .000 |
Intercept | 225523.580 | 1 | 225523.580 | 2768.770 | .000 |
RACE | 1914.158 | 3 | 638.053 | 7.833 | .000 |
Error | 15964.717 | 196 | 81.453 | ||
Total | 574919.000 | 200 | |||
Corrected Total | 17878.875 | 199 | |||
a R Squared = .107 (Adjusted R Squared = .093) |
Parameter | B | Std. Error | t | Sig. | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|
Intercept | 54.055 | .749 | 72.122 | .000 | 52.577 | 55.533 |
[RACE=1.00] | -7.597 | 1.989 | -3.820 | .000 | -11.519 | -3.675 |
[RACE=2.00] | 3.945 | 2.823 | 1.398 | .164 | -1.622 | 9.511 |
[RACE=3.00] | -5.855 | 2.153 | -2.720 | .007 | -10.101 | -1.610 |
[RACE=4.00] | 0(a) | . | . | . | . | . |
a This parameter is set to zero because it is redundant. |
Contrast | |
---|---|
Parameter | L1 |
Intercept | 1.000 |
[RACE=1.00] | .250 |
[RACE=2.00] | .250 |
[RACE=3.00] | .250 |
[RACE=4.00] | .250 |
The default display of this matrix is the transpose of the corresponding L matrix. Based on Type III Sums of Squares. |
Contrast | |||
---|---|---|---|
Parameter | L2 | L3 | L4 |
Intercept | 0 | 0 | 0 |
[RACE=1.00] | 1 | 0 | 0 |
[RACE=2.00] | 0 | 1 | 0 |
[RACE=3.00] | 0 | 0 | 1 |
[RACE=4.00] | -1 | -1 | -1 |
The default display of this matrix is the transpose of the corresponding L matrix. Based on Type III Sums of Squares. |
RACE Deviation Contrast(a) | |||||||
---|---|---|---|---|---|---|---|
Parameter | Level 1 vs. Mean | Level 2 vs. Mean | Level 3 vs. Mean | ||||
Intercept | .000 | .000 | .000 | ||||
[RACE=1.00] | .750 | -.250 | -.250 | ||||
[RACE=2.00] | -.250 | .750 | -.250 | ||||
[RACE=3.00] | -.250 | -.250 | .750 | ||||
[RACE=4.00] | -.250 | -.250 | -.250 | ||||
The default display of this matrix is the transpose of the corresponding L matrix. | a Omitted category = 4 |
RACE Deviation Contrast(a) (dependent variable: writing score) | Contrast Estimate | Hypothesized Value | Difference (Estimate – Hypothesized) | Std. Error | Sig. | 95% Confidence Interval for Difference, Lower Bound | 95% Confidence Interval for Difference, Upper Bound |
---|---|---|---|---|---|---|---|
Level 1 vs. Mean | -5.220 | 0 | -5.220 | 1.631 | .002 | -8.437 | -2.003 |
Level 2 vs. Mean | 6.322 | 0 | 6.322 | 2.160 | .004 | 2.061 | 10.582 |
Level 3 vs. Mean | -3.478 | 0 | -3.478 | 1.732 | .046 | -6.895 | -6.203E-02 |
a Omitted category = 4 |
Source | Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|
Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
Error | 15964.717 | 196 | 81.453 |
Notice the two different coding systems that are presented in this output. In the table entitled “Race”, you see the coding system that was used to calculate the regression. In the table entitled “Contrast Coefficients (L’ Matrix)”, you see the coding system that was used to calculate the contrast estimates. It is important to understand why two different coding systems are displayed in the output and to which analysis each one refers. From now on, we will not include the “parameter” option on the /print = statement, so the results of the regression analysis will not be shown. These results would be the same for each example below.
The contrast estimates in the table entitled “Contrast Results (K Matrix)” are the mean of the particular level minus the grand (unweighted) mean. This grand mean is not the mean of the dependent variable that is listed in the output of the “means” command above. Rather, it is the mean of the means of the dependent variable at each level of the categorical variable: (46.4583 + 58 + 48.2 + 54.0552) / 4 = 51.678375. The contrast estimate for level 1 versus mean is then 46.4583 – 51.678375 = -5.220. The difference between this value and zero (the null hypothesis that the contrast equals zero) is statistically significant (p = .002). The contrast estimates for the other comparisons are calculated in the same manner. As with the output of the code using simple effect coding, the table “Test Results” shows the test of all of the contrasts taken together. As expected, the values in this table are the same as those shown previously.
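The distinction between the unweighted grand mean and the ordinary mean of all observations is easy to see numerically. The sketch below is a minimal Python illustration (not part of the SPSS analysis); it uses the group means quoted above and the group sizes from the “Between-Subjects Factors” table, and the variable names are our own.

```python
# Mean of write and number of observations at each level of race.
means = {1: 46.4583, 2: 58.0, 3: 48.2, 4: 54.0552}
counts = {1: 24, 2: 11, 3: 20, 4: 145}

# Unweighted grand mean: the mean of the four group means.
unweighted_grand_mean = sum(means.values()) / len(means)

# Weighted mean: each group mean weighted by its group size. Shown only for
# comparison; the deviation contrasts do not use it.
weighted_mean = sum(means[k] * counts[k] for k in means) / sum(counts.values())

# Deviation contrast: level mean minus the unweighted grand mean.
# Level 4 is the omitted category, so it gets no contrast of its own.
deviation_contrasts = {level: means[level] - unweighted_grand_mean
                       for level in (1, 2, 3)}

print(f"Unweighted grand mean: {unweighted_grand_mean:.6f}")
print(f"Weighted mean:         {weighted_mean:.3f}")
for level, estimate in deviation_contrasts.items():
    print(f"Level {level} vs. Mean: {estimate:.3f}")
```

Because the white group is much larger than the others, the weighted mean sits closer to that group's mean than the unweighted grand mean does.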
DIFFERENCE CODING
Method 1:
if race = 1 x1 = -.5.
if race = 2 x1 = .5.
if any(race,3,4) x1 = 0.
if any(race,1,2) x2 = -.333.
if race = 3 x2 = .667.
if race = 4 x2 = 0.
if any(race,1,2,3) x3 = -.25.
if race = 4 x3 = .75.
execute.
regression /dep write /method = enter x1 x2 x3.
< output omitted >
Method 2:
glm write by race
  /lmatrix "group 2 versus group 1" race -1 1 0 0
  /lmatrix "group 3 versus groups 1 and 2" race -.5 -.5 1 0
  /lmatrix "group 4 versus groups 1 2 and 3" race -1/3 -1/3 -1/3 1.
< output omitted >
Method 3:
glm write by race /contrast (race)=difference /print = test(lmatrix).
< some output omitted >
RACE Difference Contrast | |||
---|---|---|---|
Parameter | Level 2 vs. Level 1 | Level 3 vs. Previous | Level 4 vs. Previous |
Intercept | .000 | .000 | .000 |
[RACE=1.00] | -1.000 | -.500 | -.333 |
[RACE=2.00] | 1.000 | -.500 | -.333 |
[RACE=3.00] | .000 | 1.000 | -.333 |
[RACE=4.00] | .000 | .000 | 1.000 |
The default display of this matrix is the transpose of the corresponding L matrix. |
RACE Difference Contrast (dependent variable: writing score) | Contrast Estimate | Hypothesized Value | Difference (Estimate – Hypothesized) | Std. Error | Sig. | 95% Confidence Interval for Difference, Lower Bound | 95% Confidence Interval for Difference, Upper Bound |
---|---|---|---|---|---|---|---|
Level 2 vs. Level 1 | 11.542 | 0 | 11.542 | 3.286 | .001 | 5.061 | 18.022 |
Level 3 vs. Previous | -4.029 | 0 | -4.029 | 2.602 | .123 | -9.161 | 1.103 |
Level 4 vs. Previous | 3.169 | 0 | 3.169 | 1.488 | .034 | .235 | 6.104 |
Source | Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|
Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
Error | 15964.717 | 196 | 81.453 |
The contrast estimate for the first comparison shown in this output was calculated by subtracting the mean of the dependent variable for level 1 of the categorical variable from the mean of the dependent variable for level 2: 58 – 46.4583 = 11.542. This result is statistically significant. The contrast estimate for the second comparison (between level 3 and the previous levels) was calculated by subtracting the mean of the dependent variable for levels 1 and 2 from that of level 3: 48.2 – [(46.4583 + 58) / 2] = -4.029. This result is not statistically significant, meaning that there is not a reliable difference between the mean of write for level 3 of race and the mean of write for levels 1 and 2 (Hispanics and Asians). As noted above, this type of coding system does not make much sense for a nominal variable such as race. For the comparison of level 4 and the previous levels, you take the mean of the dependent variable for those levels and subtract it from the mean of the dependent variable for level 4: 54.0552 – [(46.4583 + 58 + 48.2) / 3] = 3.169. This result is statistically significant.
Note the use of fractions on the “/lmatrix” statement in Method 2. As mentioned above, you need to use numbers that sum to zero, such as 1/3 + 1/3 + 1/3 – 1. You cannot use .333 instead of 1/3: SPSS will give an error message and fail to calculate the contrast coefficient. The problem is that .333 + .333 + .333 – 1 is not sufficiently close to zero.
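The difference contrasts, and the reason the weights should be entered as fractions, can be illustrated with a short calculation. The following is a minimal Python sketch using the weights from the “/lmatrix” statements in Method 2 and the group means quoted above; it is an illustration only, the names are our own, and it says nothing about the exact tolerance SPSS applies when it checks that the weights sum to zero.

```python
from fractions import Fraction

# Mean of write at levels 1 through 4 of race.
means = [46.4583, 58.0, 48.2, 54.0552]

half = Fraction(1, 2)
third = Fraction(1, 3)

# Contrast weights as written on the /lmatrix statements in Method 2.
weights = {
    "Level 2 vs. Level 1": [-1, 1, 0, 0],
    "Level 3 vs. Previous": [-half, -half, 1, 0],
    "Level 4 vs. Previous": [-third, -third, -third, 1],
}

def contrast_estimate(contrast_weights, level_means):
    """Weighted sum of the level means."""
    return sum(float(w) * m for w, m in zip(contrast_weights, level_means))

estimates = {label: contrast_estimate(w, means) for label, w in weights.items()}
for label, estimate in estimates.items():
    print(f"{label}: {estimate:.3f}")

# Exact fractions sum to zero; the rounded decimals leave a remainder.
exact_sum = sum(weights["Level 4 vs. Previous"])
rounded_sum = sum([-0.333, -0.333, -0.333, 1])
print(f"Sum with 1/3:  {exact_sum}")
print(f"Sum with .333: {rounded_sum:.3f}")
```

With -1/3 the weights cancel exactly, whereas -.333 leaves a remainder of about .001, which is the discrepancy the note above describes.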
HELMERT CODING
Method 1:
if race = 1 x1 = .75.
if any(race,2,3,4) x1 = -.25.
if race = 1 x2 = 0.
if race = 2 x2 = .667.
if any(race,3,4) x2 = -.333.
if any(race,1,2) x3 = 0.
if race = 3 x3 = .5.
if race = 4 x3 = -.5.
execute.
regression /dep write /method = enter x1 x2 x3.
< output omitted >
Method 2:
glm write by race
  /lmatrix "group 1 versus groups 2 3 and 4" race 1 -1/3 -1/3 -1/3
  /lmatrix "group 2 versus groups 3 and 4" race 0 1 -.5 -.5
  /lmatrix "group 3 versus group 4" race 0 0 1 -1.
< output omitted >
Method 3:
glm write by race /contrast (race)=helmert /print = test(lmatrix).
< some output omitted >
RACE Helmert Contrast | |||
---|---|---|---|
Parameter | Level 1 vs. Later | Level 2 vs. Later | Level 3 vs. Level 4 |
Intercept | .000 | .000 | .000 |
[RACE=1.00] | 1.000 | .000 | .000 |
[RACE=2.00] | -.333 | 1.000 | .000 |
[RACE=3.00] | -.333 | -.500 | 1.000 |
[RACE=4.00] | -.333 | -.500 | -1.000 |
The default display of this matrix is the transpose of the corresponding L matrix. |
RACE Helmert Contrast (dependent variable: writing score) | Contrast Estimate | Hypothesized Value | Difference (Estimate – Hypothesized) | Std. Error | Sig. | 95% Confidence Interval for Difference, Lower Bound | 95% Confidence Interval for Difference, Upper Bound |
---|---|---|---|---|---|---|---|
Level 1 vs. Later | -6.960 | 0 | -6.960 | 2.175 | .002 | -11.250 | -2.670 |
Level 2 vs. Later | 6.872 | 0 | 6.872 | 2.926 | .020 | 1.101 | 12.644 |
Level 3 vs. Level 4 | -5.855 | 0 | -5.855 | 2.153 | .007 | -10.101 | -1.610 |
Source | Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|
Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
Error | 15964.717 | 196 | 81.453 |
The contrast estimate for the comparison between level 1 and the remaining levels (called “later” in the output) is calculated by subtracting the mean of the dependent variable for levels 2, 3 and 4 from the mean of the dependent variable for level 1: 46.4583 – [(58 + 48.2 + 54.0552) / 3] = -6.960, which is statistically significant. This means that the mean of write for level 1 of race is statistically significantly different from the mean of write for levels 2 through 4. As noted above, this comparison probably is not meaningful because the variable race is nominal. This type of comparison would be more meaningful if the categorical variable were ordinal. To calculate the contrast estimate for the comparison between level 2 and the later levels, you subtract the mean of the dependent variable for levels 3 and 4 from the mean of the dependent variable for level 2: 58 – [(48.2 + 54.0552) / 2] = 6.872, which is statistically significant. The contrast estimate for the comparison between level 3 and level 4 is the difference between the mean of the dependent variable for the two levels: 48.2 – 54.0552 = -5.855, which is also statistically significant.
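Each Helmert contrast is the mean of one level minus the mean of the means of all later levels, so the three estimates can be generated with a short loop. The following is a minimal Python sketch for illustration only; the function name is our own, and the group means are those quoted above.

```python
# Mean of write at levels 1 through 4 of race.
means = [46.4583, 58.0, 48.2, 54.0552]

def helmert_estimates(level_means):
    """Mean of each level minus the unweighted mean of all later levels."""
    estimates = []
    for i in range(len(level_means) - 1):
        later = level_means[i + 1:]
        estimates.append(level_means[i] - sum(later) / len(later))
    return estimates

helmert = helmert_estimates(means)
labels = ["Level 1 vs. Later", "Level 2 vs. Later", "Level 3 vs. Level 4"]
for label, estimate in zip(labels, helmert):
    print(f"{label}: {estimate:.3f}")
```

The results agree with the contrast estimates in the “Contrast Results (K Matrix)” table: -6.960, 6.872 and -5.855.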
ORTHOGONAL POLYNOMIAL CODING
Method 1:
if race = 1 x1 = -.671.
if race = 2 x1 = -.224.
if race = 3 x1 = .224.
if race = 4 x1 = .671.
if race = 1 x2 = .5.
if race = 2 x2 = -.5.
if race = 3 x2 = -.5.
if race = 4 x2 = .5.
if race = 1 x3 = -.224.
if race = 2 x3 = .671.
if race = 3 x3 = -.671.
if race = 4 x3 = .224.
execute.
regression /dep write /method = enter x1 x2 x3.
< output omitted >
Method 2:
glm write by race
  /lmatrix "linear" race -.671 -.224 .224 .671
  /lmatrix "quadratic" race .5 -.5 -.5 .5
  /lmatrix "cubic" race -.224 .671 -.671 .224.
< output omitted >
Method 3:
glm write by race /contrast (race)=polynomial /print = test(lmatrix).
< some output omitted >
RACE Polynomial Contrast(a) | |||||||
---|---|---|---|---|---|---|---|
Parameter | Linear | Quadratic | Cubic | ||||
Intercept | .000 | .000 | .000 | ||||
[RACE=1.00] | -.671 | .500 | -.224 | ||||
[RACE=2.00] | -.224 | -.500 | .671 | ||||
[RACE=3.00] | .224 | -.500 | -.671 | ||||
[RACE=4.00] | .671 | .500 | .224 | ||||
The default display of this matrix is the transpose of the corresponding L matrix. | a Metric = 1.000, 2.000, 3.000, 4.000 |
RACE Polynomial Contrast(a) (dependent variable: writing score) | Contrast Estimate | Hypothesized Value | Difference (Estimate – Hypothesized) | Std. Error | Sig. | 95% Confidence Interval for Difference, Lower Bound | 95% Confidence Interval for Difference, Upper Bound |
---|---|---|---|---|---|---|---|
Linear | 2.905 | 0 | 2.905 | 1.534 | .060 | -.121 | 5.931 |
Quadratic | -2.843 | 0 | -2.843 | 1.964 | .149 | -6.717 | 1.031 |
Cubic | 8.273 | 0 | 8.273 | 2.316 | .000 | 3.706 | 12.840 |
a Metric = 1.000, 2.000, 3.000, 4.000 |
Source | Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|
Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
Error | 15964.717 | 196 | 81.453 |
To calculate the contrast estimates for these comparisons, you need to multiply the code used in the new variable by the mean of the dependent variable for each level of the categorical variable, and then sum the values. For example, the code used in x1 for level 1 of race is -.671 and the mean of write for level 1 is 46.4583. Hence, you would multiply -.671 by 46.4583 and add that to the product of the code for level 2 of x1 and its mean, and so on. To obtain the contrast estimate for the linear contrast, you would do the following: -.671*46.4583 + -.224*58 + .224*48.2 + .671*54.0552 = 2.905 (with rounding error). This result is not statistically significant at the .05 alpha level, but it is close. The quadratic component is also not statistically significant, but the cubic one is. This suggests that, if the mean of the dependent variable were plotted against race, the line would tend to have two bends. As noted earlier, this type of coding system does not make much sense with a nominal variable such as race.
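The rounding error mentioned above comes from using codes with only three decimals. The sketch below is a minimal Python illustration that assumes the codes are rounded versions of the standard polynomial contrasts for four equally spaced levels, (-3, -1, 1, 3), (1, -1, -1, 1) and (-1, 3, -3, 1), each scaled to unit length; that assumption and all names in the code are ours, not taken from the SPSS output.

```python
import math

# Mean of write at levels 1 through 4 of race.
means = [46.4583, 58.0, 48.2, 54.0552]

# Assumed integer polynomial contrasts for four equally spaced levels.
raw = {
    "linear": [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic": [-1, 3, -3, 1],
}

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def normalize(vector):
    """Scale a vector so that its squared entries sum to one."""
    length = math.sqrt(dot(vector, vector))
    return [x / length for x in vector]

coefficients = {name: normalize(v) for name, v in raw.items()}
estimates = {name: dot(c, means) for name, c in coefficients.items()}

for name in raw:
    codes = ", ".join(f"{c:.3f}" for c in coefficients[name])
    print(f"{name:>9}: codes [{codes}], estimate {estimates[name]:.3f}")

# Orthogonality: the products of any two different contrasts sum to zero.
print(abs(dot(coefficients["linear"], coefficients["cubic"])) < 1e-12)
```

Under that assumption the unrounded codes reproduce the three-decimal codes used above and give estimates of 2.905, -2.843 and 8.273, matching the “Contrast Results (K Matrix)” table.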
REPEATED EFFECT CODING
Method 1:
if race = 1 x1 = .75.
if any(race,2,3,4) x1 = -.25.
if any(race,1,2) x2 = .5.
if any(race,3,4) x2 = -.5.
if any(race,1,2,3) x3 = .25.
if race = 4 x3 = -.75.
execute.
regression /dep write /method = enter x1 x2 x3.
< output omitted >
Method 2:
glm write by race
  /lmatrix "group 1 versus group 2" race 1 -1 0 0
  /lmatrix "group 2 versus group 3" race 0 1 -1 0
  /lmatrix "group 3 versus group 4" race 0 0 1 -1.
< output omitted >
Method 3:
glm write by race /contrast (race)=repeated /print = test(lmatrix).
< some output omitted >
RACE Repeated Contrast | |||
---|---|---|---|
Parameter | Level 1 vs. Level 2 | Level 2 vs. Level 3 | Level 3 vs. Level 4 |
Intercept | 0 | 0 | 0 |
[RACE=1.00] | 1 | 0 | 0 |
[RACE=2.00] | -1 | 1 | 0 |
[RACE=3.00] | 0 | -1 | 1 |
[RACE=4.00] | 0 | 0 | -1 |
The default display of this matrix is the transpose of the corresponding L matrix. |
RACE Repeated Contrast (dependent variable: writing score) | Contrast Estimate | Hypothesized Value | Difference (Estimate – Hypothesized) | Std. Error | Sig. | 95% Confidence Interval for Difference, Lower Bound | 95% Confidence Interval for Difference, Upper Bound |
---|---|---|---|---|---|---|---|
Level 1 vs. Level 2 | -11.542 | 0 | -11.542 | 3.286 | .001 | -18.022 | -5.061 |
Level 2 vs. Level 3 | 9.800 | 0 | 9.800 | 3.388 | .004 | 3.119 | 16.481 |
Level 3 vs. Level 4 | -5.855 | 0 | -5.855 | 2.153 | .007 | -10.101 | -1.610 |
Source | Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|
Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
Error | 15964.717 | 196 | 81.453 |
With this coding system, adjacent levels of the categorical variable are compared. Hence, the mean of the dependent variable at level 1 is compared to the mean of the dependent variable at level 2: 46.4583 – 58 = -11.542, which is statistically significant. For the comparison between levels 2 and 3, the contrast estimate is 58 – 48.2 = 9.8, which is also statistically significant. Finally, comparing levels 3 and 4, 48.2 – 54.0552 = -5.855, which is again a statistically significant difference. One would conclude from this that each pair of adjacent levels of race is statistically significantly different.
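Because each repeated contrast is simply the mean of one level minus the mean of the next, the estimates can be produced by pairing each mean with its successor. A minimal Python sketch, for illustration only and with names of our own choosing:

```python
# Mean of write at levels 1 through 4 of race.
means = [46.4583, 58.0, 48.2, 54.0552]

# Repeated contrast: mean of each level minus the mean of the next level.
repeated = [current - following for current, following in zip(means, means[1:])]

for level, estimate in enumerate(repeated, start=1):
    print(f"Level {level} vs. Level {level + 1}: {estimate:.3f}")
```

The three differences correspond to the contrast estimates -11.542, 9.800 and -5.855 in the “Contrast Results (K Matrix)” table.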
SPECIAL USER-DEFINED CODING SYSTEM
Let’s compare: 1) level 1 to level 3, 2) level 2 to levels 1 and 4, and 3) levels 1 and 2 to levels 3 and 4.
Method 1:
if race = 1 x1 = -.5.
if race = 2 x1 = .5.
if race = 3 x1 = -1.5.
if race = 4 x1 = 1.5.
if any(race,1,3) = 1 x2 = -1.
if any(race,2,4) = 1 x2 = 1.
if any(race,1,3) = 1 x3 = 1.5.
if race = 2 x3 = -.5.
if race = 4 x3 = -2.5.
execute.
regression /dep write /method = enter x1 x2 x3.
< output omitted >
Method 2:
glm write by race
  /lmatrix "compare group 1 to group 3" race 1 0 -1 0
  /lmatrix "compare group 2 to groups 1 and 4" race -.5 1 0 -.5
  /lmatrix "compare groups 1 and 2 to groups 3 and 4" race .5 .5 -.5 -.5.
< output omitted >
Method 3:
glm write by race /contrast (race)=special(1 0 -1 0, -.5 1 0 -.5, .5 .5 -.5 -.5) /print = test(lmatrix).
< some output omitted >
RACE Special Contrast | |||
---|---|---|---|
Parameter | L1 | L2 | L3 |
Intercept | .000 | .000 | .000 |
[RACE=1.00] | 1.000 | -.500 | .500 |
[RACE=2.00] | .000 | 1.000 | .500 |
[RACE=3.00] | -1.000 | .000 | -.500 |
[RACE=4.00] | .000 | -.500 | -.500 |
The default display of this matrix is the transpose of the corresponding L matrix. |
RACE Special Contrast (dependent variable: writing score) | Contrast Estimate | Hypothesized Value | Difference (Estimate – Hypothesized) | Std. Error | Sig. | 95% Confidence Interval for Difference, Lower Bound | 95% Confidence Interval for Difference, Upper Bound |
---|---|---|---|---|---|---|---|
L1 | -1.742 | 0 | -1.742 | 2.732 | .525 | -7.131 | 3.647 |
L2 | 7.743 | 0 | 7.743 | 2.897 | .008 | 2.030 | 13.457 |
L3 | 1.102 | 0 | 1.102 | 1.964 | .576 | -2.772 | 4.975 |
Source | Sum of Squares | df | Mean Square | F | Sig. |
---|---|---|---|---|---|
Contrast | 1914.158 | 3 | 638.053 | 7.833 | .000 |
Error | 15964.717 | 196 | 81.453 |
The first comparison of the mean of the dependent variable for level 1 to level 3 of the categorical variable was not statistically significant, while the comparison of the mean of the dependent variable for level 2 to that of levels 1 and 4 was. The comparison of the mean of the dependent variable for levels 1 and 2 to that of levels 3 and 4 was not statistically significant.
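The regression codes in Method 1 look unrelated to the contrast weights in Methods 2 and 3, but the two are linked: stacking the contrast weights under an intercept row and multiplying by the matrix of regression codes (with a leading column of ones) gives the identity matrix. The sketch below is a minimal Python check of that relationship and of the three contrast estimates. It is an illustration only; the intercept row of .25 per level is our assumption (it mirrors the “Intercept” coding shown earlier on this page), and the helper names are our own.

```python
# Mean of write at levels 1 through 4 of race.
means = [46.4583, 58.0, 48.2, 54.0552]

# Row 1: assumed intercept weights. Rows 2-4: weights from special(...).
contrast_matrix = [
    [0.25, 0.25, 0.25, 0.25],
    [1, 0, -1, 0],
    [-0.5, 1, 0, -0.5],
    [0.5, 0.5, -0.5, -0.5],
]

# Method 1 regression codes, one row per level of race: constant, x1, x2, x3.
design = [
    [1, -0.5, -1, 1.5],
    [1, 0.5, 1, -0.5],
    [1, -1.5, -1, 1.5],
    [1, 1.5, 1, -2.5],
]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# If the codes match the contrasts, this product is the 4 x 4 identity matrix.
product = matmul(contrast_matrix, design)
for row in product:
    print(row)

# Contrast estimates: each row of weights applied to the level means.
estimates = [sum(w * m for w, m in zip(row, means)) for row in contrast_matrix[1:]]
for label, estimate in zip(["L1", "L2", "L3"], estimates):
    print(f"{label}: {estimate:.3f}")
```

The same check can be applied to any user-defined set of contrasts: if the product is not the identity matrix, the regression codes do not correspond to the intended comparisons.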