Regression with Stata Chapter 5 – Additional coding systems for categorical variables in regression analysis

Chapter Outline
    5.1 Simple Coding
    5.2 Forward Difference Coding
    5.3 Backward Difference Coding
    5.4 Helmert Coding
    5.5 Reverse Helmert Coding
    5.6 Deviation Coding
    5.7 Orthogonal Polynomial Coding
    5.8 User-Defined Coding
    5.9 Summary

Please note: This page makes use of the program xi3, which is no longer being maintained and has been removed from our archives. References to xi3 will be left on this page because they illustrate specific principles of coding categorical variables.

5.0 Introduction

Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. For example, if you have a variable called race that is coded 1 = Hispanic, 2 = Asian, 3 = Black, 4 = White, then entering race in your regression will look at the linear effect of race, which is probably not what you intended. Instead, categorical variables like this need to be recoded into a series of variables which can then be entered into the regression model. There are a variety of coding systems that can be used when coding categorical variables. Ideally, you would choose a coding system that reflects the comparisons that you want to make. In Chapter 3 of the Regression with Stata Web Book we covered the use of categorical variables in regression analysis focusing on the use of dummy variables, but that is not the only coding scheme that you can use. For example, you may want to compare each level to the next higher level, in which case you would want to use "forward difference" coding, or you might want to compare each level to the mean of the subsequent levels of the variable, in which case you would want to use "Helmert" coding. By deliberately choosing a coding system, you can obtain comparisons that are most meaningful for testing your hypotheses. Regardless of the coding system you choose, the test of the overall effect of the categorical variable (i.e., the overall effect of race) will remain the same. Below is a table listing various types of contrasts and the comparison that they make.

Name of contrast	Comparison made
Simple Coding	Compares each level of a variable to the reference level
Forward Difference Coding	Adjacent levels of a variable (each level minus the next level)
Backward Difference Coding	Adjacent levels of a variable (each level minus the prior level)
Helmert Coding	Compares levels of a variable with the mean of the subsequent levels of the variable
Reverse Helmert Coding	Compares levels of a variable with the mean of the previous levels of the variable
Deviation Coding	Compares deviations from the grand mean
Orthogonal Polynomial Coding	Orthogonal polynomial contrasts
User-Defined Coding	User-defined contrast

There are a couple of notes to be made about the coding systems listed above. The first is that they represent planned comparisons and not post hoc comparisons. In other words, they are comparisons that you plan to do before you begin analyzing your data, not comparisons that you think of once you have seen the results of preliminary analyses. Also, some forms of coding make more sense with ordinal categorical variables than with nominal categorical variables. Below we will show examples using race as a categorical variable, which is a nominal variable. Because simple coding compares the mean of the dependent variable for each level of the categorical variable to the mean of the dependent variable for the reference level, it makes sense with a nominal variable. However, it may not make as much sense to use a coding scheme that tests the linear effect of race. As we describe each type of coding system, we note those coding systems with which it does not make as much sense to use a nominal variable. Also, you may notice that we follow several rules when creating the contrast coding schemes. For more information about these rules, please see the section on User-Defined Coding.

This page will illustrate two ways that you can conduct analyses using these coding schemes: 1) using the xi3 command (an extended version of the xi command) and 2) manually coding the variables and entering them using the regress command. When using regress to do contrasts, you first need to create k-1 new variables (where k is the number of levels of the categorical variable) and use these new variables as predictors in your regression model.

The Example Data File

The examples on this page will use a dataset called hsb2.dta that you can download from within Stata like this.

use https://stats.idre.ucla.edu/stat/stata/notes/hsb2

Within this data file, we will focus on the categorical variable race, which has four levels (1 = Hispanic, 2 = Asian, 3 = African American and 4 = white) and we will use write as our dependent variable. Although our example uses a variable with four levels, these coding systems work with variables that have more or fewer categories. No matter which coding system you select, you will always have one fewer recoded variables than levels of the original variable. In our example, our categorical variable has four levels so we will have three new variables (a variable corresponding to the final level of the categorical variable would be redundant and therefore unnecessary).

Before considering any analyses, let’s look at the mean of the dependent variable, write, for each level of race. This will help in interpreting the output from later analyses.

tabulate race, summarize(write)

            |      Summary of writing score
       race |        Mean   Std. Dev.       Freq.
------------+------------------------------------
   hispanic |   46.458333   8.2724223          24
      asian |          58   7.8993671          11
  african-a |        48.2   9.3222992          20
      white |   54.055172   9.1725582         145
------------+------------------------------------
      Total |      52.775    9.478586         200
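
The constant reported in the regressions on this page (51.678) differs from the overall mean of write shown above (52.775) because the groups are of unequal size. The sketch below is not part of the original analysis; it checks that the constant equals the unweighted average of the four group means. The scalar name cellavg is an arbitrary choice made for this illustration.

```stata
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear

* Mean of write and group size for each level of race
tabstat write, by(race) statistics(mean n)

* Reduce the data to one row per level of race (the four group means),
* then average those four means without weighting by group size
preserve
collapse (mean) write, by(race)
summarize write
scalar cellavg = r(mean)
restore

display "unweighted average of the group means = " cellavg
```

With the group means listed above, this average is (46.458 + 58 + 48.2 + 54.055) / 4 = 51.678.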

5.1 Simple Coding

The results of simple coding are very similar to dummy coding in that each level is compared to the reference level. In the example below, level 1 is the reference level and the first comparison compares level 2 to level 1, the second comparison compares level 3 to level 1, and the third comparison compares level 4 to level 1.

Method 1: Using xi3

When using xi3, we can refer to g.race to indicate that we wish to code race using simple coding comparing each group to a reference group, as shown in the example below.

xi3: regress write g.race
g.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
    _Irace_3 |   1.741667   2.732488     0.64   0.525    -3.647186    7.130519
    _Irace_4 |   7.596839    1.98887     3.82   0.000     3.674507    11.51917
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The coefficient for _Irace_2 compares the mean of the dependent variable, write, for levels 2 and 1, yielding 58 - 46.458 = 11.54, and is statistically significant (p = .001). The coefficient for _Irace_3 compares the mean of the dependent variable, write, for levels 3 and 1, yielding 48.2 - 46.46 = 1.74, and this is not statistically significant. Finally, the coefficient for _Irace_4 compares the mean of the dependent variable, write, for levels 4 and 1, yielding 54.055 - 46.458 = 7.60, and that is statistically significant.

Method 2: Manual Coding

If we wished, we could manually code race instead of allowing xi3 to do the coding for us. Below we see the coding that replicates the results we saw in the example above. In the coding below, level 1 is the reference level and x1 compares level 2 to level 1, x2 compares level 3 to level 1, and x3 compares level 4 to level 1. For x1 the coding is 3/4 for level 2, and -1/4 for all other levels. Likewise, for x2 the coding is 3/4 for level 3, and -1/4 for all other levels, and for x3 the coding is 3/4 for level 4, and -1/4 for all other levels. It is not intuitive that this regression coding scheme yields these comparisons; however, if you desire simple comparisons, you can follow this general rule to obtain these comparisons.

SIMPLE regression coding

Level of race	New variable 1 (x1)	New variable 2 (x2)	New variable 3 (x3)
1 (Hispanic)	-1/4	-1/4	-1/4
2 (Asian)	3/4	-1/4	-1/4
3 (African American)	-1/4	3/4	-1/4
4 (white)	-1/4	-1/4	3/4
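
One way to see why these fractions produce the simple comparisons is to start from the comparisons themselves. The Mata sketch below is an illustration added here, not part of the original page. It assumes the constant is defined as the average of the four level means and writes each desired comparison as a row of weights on those means; the matrix names L and C are arbitrary. Inverting that matrix of weights returns the coding values.

```stata
mata:
// Row 1: the constant, defined as the average of the four level means.
// Rows 2-4: the simple contrasts (level 2, 3 and 4, each minus level 1).
L = (1/4, 1/4, 1/4, 1/4 \ -1, 1, 0, 0 \ -1, 0, 1, 0 \ -1, 0, 0, 1)

// Inverting L gives the design values for levels 1-4 (one row per level):
// column 1 is the constant; columns 2-4 are x1, x2 and x3.
C = luinv(L)
C
end
```

Columns 2 through 4 of the displayed matrix should reproduce the x1, x2 and x3 columns of the table above. Replacing rows 2 through 4 of L with other sets of weights gives the coding for other contrasts in the same way.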

Below we show the more general rule for creating this kind of coding scheme using regression coding, where k is the number of levels of the categorical variable (in this instance, k = 4).

SIMPLE regression coding

Level of race	New variable 1 (x1)	New variable 2 (x2)	New variable 3 (x3)
1 (Hispanic)	-1 / k	-1 / k	-1 / k
2 (Asian)	(k-1) / k	-1 / k	-1 / k
3 (African American)	-1 / k	(k-1) / k	-1 / k
4 (white)	-1 / k	-1 / k	(k-1) / k
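
The general rule in the table above can be sketched as a loop that works for any number of levels. This is only an illustration: the variable prefix simp and the local macro names are hypothetical choices, and the lowest level is assumed to be the reference level.

```stata
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear

* Collect the observed levels of race and count them (k)
levelsof race, local(levels)
local k : word count `levels'

* Treat the first (lowest) level as the reference level
local ref : word 1 of `levels'

* Build one simple-coded variable for every non-reference level:
* (k-1)/k for the level being compared, -1/k for all other levels
local j = 0
foreach lev of local levels {
    if `lev' == `ref' continue
    local ++j
    generate double simp`j' = -1/`k'
    replace simp`j' = (`k' - 1)/`k' if race == `lev'
}

regress write simp*
```

With k = 4 this creates simp1, simp2 and simp3, which hold the same values as the x1, x2 and x3 columns in the tables above.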

Below we illustrate how to create x1, x2 and x3 and enter these new variables into the regression model using the regress command.

generate x1 = -1/4
replace x1 = 3/4 if race==2

generate x2 = -1/4
replace x2 = 3/4 if race==3

generate x3 = -1/4
replace x3 = 3/4 if race==4

regress write x1 x2 x3

As you can see, the results below match those when we used the xi3 command above.

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
          x2 |   1.741667   2.732488     0.64   0.525    -3.647186    7.130519
          x3 |   7.596839    1.98887     3.82   0.000     3.674507    11.51917
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
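
Because xi3 is no longer available, readers may want a way to obtain the same comparisons with built-in commands. The sketch below is an addition to the original page and assumes Stata 11 or later, where factor-variable notation is available. It uses ordinary indicator coding with level 1 as the base level, so the three slopes are the same level-versus-level-1 differences; only the constant differs, because it is now the mean of write for level 1 rather than the average of the four group means.

```stata
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear

* Indicator (dummy) coding; the lowest level, 1, is the default base level
regress write i.race

* Overall test of race, which does not depend on the coding system used
testparm i.race
```

The overall test should agree with the F(3, 196) = 7.83 shown in the regression output above.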

5.2 Forward Difference Coding

In this coding system, the mean of the dependent variable for one level of the categorical variable is compared to the mean of the dependent variable for the next (adjacent) level. In our example below, the first comparison compares the mean of write for level 1 with the mean of write for level 2 of race (Hispanics minus Asians). The second comparison compares the mean of write for level 2 minus level 3, and the third comparison compares the mean of write for level 3 minus level 4. This type of coding may be useful with either a nominal or an ordinal variable.

Method 1: Using xi3

We can indicate that we want forward adjacent difference coding for race by specifying a.race as shown below.

xi3 : regress write a.race

a.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_1 |  -11.54167   3.286129    -3.51   0.001    -18.02238   -5.060956
    _Irace_2 |        9.8   3.387834     2.89   0.004     3.118714    16.48129
    _Irace_3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

With this coding system, adjacent levels of the categorical variable are compared. Hence, the mean of the dependent variable at level 1 is compared to the mean of the dependent variable at level 2: 46.4583 - 58 = -11.542, which is statistically significant. For the comparison between levels 2 and 3, the calculation of the contrast coefficient would be 58 - 48.2 = 9.8, which is also statistically significant. Finally, comparing levels 3 and 4, 48.2 - 54.0552 = -5.855, a statistically significant difference. One would conclude from this that each pair of adjacent levels of race is statistically significantly different.

Method 2: Manual Coding

For the first comparison, where the first and second levels are compared, x1 is coded 3/4 for level 1 and the other levels are coded -1/4. For the second comparison where level 2 is compared with level 3, x2 is coded 1/2 1/2 -1/2 -1/2, and for the third comparison where level 3 is compared with level 4, x3 is coded 1/4 1/4 1/4 -3/4.

FORWARD DIFFERENCE regression coding

Level of race	New variable 1 (x1)	New variable 2 (x2)	New variable 3 (x3)
	Level 1 v. Level 2	Level 2 v. Level 3	Level 3 v. Level 4
1 (Hispanic)	3/4	1/2	1/4
2 (Asian)	-1/4	1/2	1/4
3 (African American)	-1/4	-1/2	1/4
4 (white)	-1/4	-1/2	-3/4

The general rule for this regression coding scheme is shown below, where k is the number of levels of the categorical variable (in this case k = 4).

FORWARD DIFFERENCE regression coding

Level of race	New variable 1 (x1)	New variable 2 (x2)	New variable 3 (x3)
	Level 1 v. Level 2	Level 2 v. Level 3	Level 3 v. Level 4
1 (Hispanic)	(k-1)/k	(k-2)/k	(k-3)/k
2 (Asian)	-1/k	(k-2)/k	(k-3)/k
3 (African American)	-1/k	-2/k	(k-3)/k
4 (white)	-1/k	-2/k	-3/k

generate x1 = 3/4 if race==1
replace x1 = -1/4 if inlist(race,2,3,4)

generate x2 = 1/2 if inlist(race,1,2)
replace x2 = -1/2 if inlist(race,3,4)

generate x3 = 1/4 if inlist(race,1,2,3)
replace x3 = -3/4 if race==4

regress write x1 x2 x3

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -11.54167   3.286129    -3.51   0.001    -18.02238   -5.060956
          x2 |        9.8   3.387834     2.89   0.004     3.118714    16.48129
          x3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

You can see the regression coefficient for x1 is the mean of write for level 1 (Hispanic) minus the mean of write for level 2 (Asian). Likewise, the regression coefficient for x2 is the mean of write for level 2 (Asian) minus the mean of write for level 3 (African American), and the regression coefficient for x3 is the mean of write for level 3 (African American) minus the mean of write for level 4 (white).
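
As a cross-check that does not rely on xi3, the same forward differences can be requested after an ordinary factor-variable regression. This sketch is an addition to the original page and assumes Stata 12 or later, where the contrast postestimation command and its a. operator (each level minus the next level) are available.

```stata
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear

* Fit the model with ordinary indicator coding
regress write i.race

* a. requests each level minus the next level; the effects option
* displays the estimates with standard errors, tests and intervals
contrast a.race, effects
```

The three contrasts (1 vs 2, 2 vs 3 and 3 vs 4) should match the coefficients for x1, x2 and x3 above.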

5.3 Backward Difference Coding

In this coding system, the mean of the dependent variable for one level of the categorical variable is compared to the mean of the dependent variable for the prior adjacent level. In our example below, the first comparison compares the mean of write for level 2 with the mean of write for level 1 of race (Asians minus Hispanics). The second comparison compares the mean of write for level 3 minus level 2, and the third comparison compares the mean of write for level 4 minus level 3. This type of coding may be useful with either a nominal or an ordinal variable.

Method 1: Using xi3

We can indicate that we want backward difference coding for race by specifying b.race as shown below.

xi3 : regress write b.race

b.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
    _Irace_3 |       -9.8   3.387834    -2.89   0.004    -16.48129   -3.118714
    _Irace_4 |   5.855172    2.15276     2.72   0.007     1.609626    10.10072
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

With this coding system, adjacent levels of the categorical variable are compared, with each level compared to the prior level. Hence, the mean of the dependent variable at level 2 is compared to the mean of the dependent variable at level 1: 58 - 46.4583 = 11.542, which is statistically significant. For the comparison between levels 3 and 2, we calculate 48.2 - 58 = -9.8, which is also statistically significant. Finally, comparing levels 4 and 3, 54.0552 - 48.2 = 5.855, a statistically significant difference. One would conclude from this that each pair of adjacent levels of race is statistically significantly different.

Method 2: Manual Coding

For the first comparison, where the second level is compared with the first, x1 is coded -3/4 for level 1 while the other levels are coded 1/4. For the second comparison, where level 3 is compared with level 2, x2 is coded -1/2 -1/2 1/2 1/2, and for the third comparison, where level 4 is compared with level 3, x3 is coded -1/4 -1/4 -1/4 3/4.

BACKWARD DIFFERENCE regression coding

Level of race	New variable 1 (x1)	New variable 2 (x2)	New variable 3 (x3)
	Level 2 v. Level 1	Level 3 v. Level 2	Level 4 v. Level 3
1 (Hispanic)	-3/4	-1/2	-1/4
2 (Asian)	1/4	-1/2	-1/4
3 (African American)	1/4	1/2	-1/4
4 (white)	1/4	1/2	3/4

The general rule for this regression coding scheme is shown below, where k is the number of levels of the categorical variable (in this case, k = 4).

BACKWARD DIFFERENCE regression coding

Level of race	New variable 1 (x1)	New variable 2 (x2)	New variable 3 (x3)
	Level 2 v. Level 1	Level 3 v. Level 2	Level 4 v. Level 3
1 (Hispanic)	-(k-1)/k	-(k-2)/k	-(k-3)/k
2 (Asian)	1/k	-(k-2)/k	-(k-3)/k
3 (African American)	1/k	2/k	-(k-3)/k
4 (white)	1/k	2/k	3/k

generate x1 = -3/4 if race==1
replace  x1 =  1/4 if inlist(race,2,3,4)

generate x2 = -1/2 if inlist(race,1,2)
replace  x2 =  1/2 if inlist(race,3,4)

generate x3 = -1/4 if inlist(race,1,2,3)
replace  x3 =  3/4 if race==4

regress write x1 x2 x3

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
          x2 |       -9.8   3.387834    -2.89   0.004    -16.48129   -3.118714
          x3 |   5.855172    2.15276     2.72   0.007     1.609626    10.10072
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

In the above example, the regression coefficient for x1 is the mean of write for level 2 minus the mean of write for level 1 (58 - 46.4583 = 11.542). Likewise, the regression coefficient for x2 is the mean of write for level 3 minus the mean of write for level 2, and the regression coefficient for x3 is the mean of write for level 4 minus the mean of write for level 3.
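
The backward difference codes are the forward difference codes from section 5.2 with every sign reversed, which is why the coefficients are the same size with the opposite sign. The sketch below, added here as an illustration, builds the backward variables that way; the names f1-f3 and b1-b3 are hypothetical and chosen so they do not collide with x1-x3.

```stata
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear

* Forward difference codes, as in the table in section 5.2
generate f1 = cond(race == 1, 3/4, -1/4)
generate f2 = cond(inlist(race, 1, 2), 1/2, -1/2)
generate f3 = cond(race == 4, -3/4, 1/4)

* Backward difference codes: reverse the sign of each forward code
forvalues j = 1/3 {
    generate b`j' = -f`j'
}

regress write b1 b2 b3
```

The coefficients for b1, b2 and b3 should match those for x1, x2 and x3 in the output above.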

5.4 Helmert Coding

Helmert coding compares each level of a categorical variable to the mean of the subsequent levels. Hence, the first contrast compares the mean of the dependent variable for level 1 of race with the mean of all of the subsequent levels of race (levels 2, 3, and 4), the second contrast compares the mean of the dependent variable for level 2 of race with the mean of all of the subsequent levels of race (levels 3 and 4), and the third contrast compares the mean of the dependent variable for level 3 of race with the mean of all of the subsequent levels of race (level 4). While this type of coding system does not make much sense with a nominal variable like race, it is useful in situations where the levels of the categorical variable are ordered, say, from lowest to highest or from smallest to largest.

Method 1: Using xi3

We can specify Helmert coding for race using h.race as shown below.

xi3 : regress write h.race

h.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_1 |  -6.960057   2.175211    -3.20   0.002    -11.24988   -2.670234
    _Irace_2 |   6.872414   2.926325     2.35   0.020     1.101287    12.64354
    _Irace_3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The regression coefficient for the comparison between level 1 and the remaining levels is calculated by taking the mean of the dependent variable for level 1 and subtracting the average of the means of the dependent variable for levels 2, 3 and 4: 46.4583 - [(58 + 48.2 + 54.0552) / 3] = -6.960, which is statistically significant. This means that the mean of write for level 1 of race is statistically significantly different from the average of the means of write for levels 2 through 4. As noted above, this comparison probably is not meaningful because the variable race is nominal. This type of comparison would be more meaningful if the categorical variable were ordinal.

To calculate the contrast coefficient for the comparison between level 2 and the later levels, you subtract the mean of the dependent variable for levels 3 and 4 from the mean of the dependent variable for level 2: 58 - [(48.2 + 54.0552) / 2] = 6.872, which is statistically significant. The regression coefficient for the comparison between level 3 and level 4 is the difference between the mean of the dependent variable for the two levels: 48.2 - 54.0552 = -5.855, which is also statistically significant.

Method 2: Manual Coding

Below we see an example of Helmert regression coding. For the first comparison (comparing level 1 with levels 2, 3 and 4) the codes are 3/4 and -1/4 -1/4 -1/4. The second comparison compares level 2 with levels 3 and 4 and is coded 0 2/3 -1/3 -1/3. The third comparison compares level 3 to level 4 and is coded 0 0 1/2 -1/2.

HELMERT regression coding

Level of race	New variable 1 (x1)	New variable 2 (x2)	New variable 3 (x3)
	Level 1 v. Later	Level 2 v. Later	Level 3 v. Later
1 (Hispanic)	3/4	0	0
2 (Asian)	-1/4	2/3	0
3 (African American)	-1/4	-1/3	1/2
4 (white)	-1/4	-1/3	-1/2

Below we illustrate how to create x1, x2 and x3 and enter these new variables into the regression model using the regress command.

generate x1 =  3/4 if race==1
replace  x1 = -1/4 if inlist(race,2,3,4)

generate x2 =    0 if race==1
replace  x2 =  2/3 if race==2
replace  x2 = -1/3 if inlist(race,3,4)

generate x3 =    0 if inlist(race,1,2)
replace  x3 =  1/2 if race==3
replace  x3 = -1/2 if race==4

regress write x1 x2 x3

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -6.960057   2.175211    -3.20   0.002    -11.24988   -2.670234
          x2 |   6.872414   2.926325     2.35   0.020     1.101287    12.64354
          x3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

As you can see above, the regression coefficient for x1 is the mean of write for level 1 (Hispanic) minus the mean of write for all subsequent levels (levels 2, 3 and 4). Likewise, the regression coefficient for x2 is the mean of write for level 2 minus the mean of write for levels 3 and 4. Finally, the regression coefficient for x3 is the mean of write for level 3 minus the mean of write for level 4.
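
The arithmetic behind the Helmert contrasts can also be checked directly. The sketch below is an addition to the original page: it fits a model with no constant and one indicator per level of race, so that each coefficient is a group mean, and then uses lincom to form each level's mean minus the average of the later levels' means. It assumes Stata 11 or later for the ibn. factor-variable notation.

```stata
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear

* Cell-means model: with no constant and no base level,
* each coefficient is the mean of write for that level of race
regress write ibn.race, noconstant

* Level 1 minus the average of the means for levels 2, 3 and 4
lincom 1.race - (2.race + 3.race + 4.race)/3

* Level 2 minus the average of the means for levels 3 and 4
lincom 2.race - (3.race + 4.race)/2

* Level 3 minus level 4
lincom 3.race - 4.race
```

The three estimates should agree with the Helmert coefficients reported by xi3 above (-6.960, 6.872 and -5.855).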

5.5 Reverse Helmert Coding

Reverse Helmert coding (also known as difference coding) is just the opposite of Helmert coding: instead of comparing each level of a categorical variable to the mean of the subsequent level(s), each is compared to the mean of the previous level(s). In our example, the first contrast codes the comparison of the mean of the dependent variable for level 2 of race to the mean of the dependent variable for level 1 of race. The second comparison compares the mean of the dependent variable for level 3 of race with the mean for levels 1 and 2 of race, and the third comparison compares the mean of the dependent variable for level 4 of race with the mean for levels 1, 2 and 3. Clearly, this coding system does not make much sense with our example of race because it is a nominal variable. However, this system is useful when the levels of the categorical variable are ordered in a meaningful way. For example, if we had a categorical variable in which work-related stress was coded as low, medium or high, then comparing each level to the mean of the previous levels of the variable would make more sense.

Method 1: Using xi3

We can specify reverse Helmert coding for race using r.race as shown below.

xi3 : regress write r.race

r.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
    _Irace_3 |  -4.029167   2.602363    -1.55   0.123    -9.161394    1.103061
    _Irace_4 |   3.169061   1.487987     2.13   0.034     .2345401    6.103582
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The regression coefficient for the first comparison shown in this output was calculated by subtracting the mean of the dependent variable for level 1 of the categorical variable from the mean of the dependent variable for level 2: 58 - 46.4583 = 11.542. This result is statistically significant. The regression coefficient for the second comparison (between level 3 and the previous levels) was calculated by subtracting the average of the means of the dependent variable for levels 1 and 2 from that of level 3: 48.2 - [(46.4583 + 58) / 2] = -4.029. This result is not statistically significant, meaning that there is not a reliable difference between the mean of write for level 3 of race and the average of the means of write for levels 1 and 2 (Hispanics and Asians). As noted above, this type of coding system does not make much sense for a nominal variable such as race. For the comparison of level 4 and the previous levels, you take the average of the means of the dependent variable for those levels and subtract it from the mean of the dependent variable for level 4: 54.0552 - [(46.4583 + 58 + 48.2) / 3] = 3.169. This result is statistically significant.

Method 2: Manual Coding

The regression coding for reverse Helmert coding is shown below. For the first comparison, where the first and second levels are compared, x1 is coded -1/2 for level 1, 1/2 for level 2, and 0 otherwise. For the second comparison, the values of x2 are coded -1/3 -1/3 2/3 and 0. Finally, for the third comparison, the values of x3 are coded -1/4 -1/4 -1/4 and 3/4.

REVERSE HELMERT regression coding

Level of race	New variable 1 (x1)	New variable 2 (x2)	New variable 3 (x3)
1 (Hispanic)	-1/2	-1/3	-1/4
2 (Asian)	1/2	-1/3	-1/4
3 (African American)	0	2/3	-1/4
4 (white)	0	0	3/4

Below we illustrate how to create x1, x2 and x3 and enter these new variables into the regression model using the regress command.

generate x1 = -1/2 if race==1
replace  x1 =  1/2 if race==2
replace  x1 =    0 if inlist(race,3,4)

generate x2 = -1/3 if inlist(race,1,2)
replace  x2 =  2/3 if race==3
replace  x2 =    0 if race==4

generate x3 = -1/4 if inlist(race,1,2,3)
replace  x3 =  3/4 if race==4

regress write x1 x2 x3

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
          x2 |  -4.029167   2.602363    -1.55   0.123    -9.161394    1.103061
          x3 |   3.169061   1.487987     2.13   0.034     .2345401    6.103582
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

In the above example, the regression coefficient for x1 is the mean of write for level 2 (Asian) minus the mean of write for level 1 (Hispanic). Likewise, the regression coefficient for x2 is the mean of write for level 3 minus the mean of the means of write for levels 1 and 2. Finally, the regression coefficient for x3 is the mean of write for level 4 minus the mean of the means of write for levels 1, 2 and 3.
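As a quick arithmetic check, the three coefficients can be rebuilt from the group means quoted earlier in this section. This is only a sketch: the scalar names m1 to m4 and b1 to b3 are invented here, and the means are typed in by hand rather than computed from the data.

```stata
* Group means of write for race levels 1-4, as quoted in the text
scalar m1 = 46.4583
scalar m2 = 58
scalar m3 = 48.2
scalar m4 = 54.0552

* Each coefficient is a level's mean minus the mean of the means
* of all previous levels
scalar b1 = m2 - m1
scalar b2 = m3 - (m1 + m2)/2
scalar b3 = m4 - (m1 + m2 + m3)/3

display b1
display b2
display b3
```

The displayed values should agree with the coefficients for x1, x2 and x3 in the regression output to about three decimal places; small differences come from the rounded group means.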

5.6 Deviation Coding

This coding system compares the mean of the dependent variable for a given level to the mean of the dependent variable across all levels of the variable (the grand mean). In our example below, the first comparison compares level 2 (Asians) to all levels of race, the second compares level 3 (African Americans) to all levels of race, and the third compares level 4 (White) to all levels of race.

Method 1: Using xi3

We request deviation (effect) coding for race by specifying e.race, as shown below.

. xi3 : regress write e.race
e.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   6.321624   2.160314     2.93   0.004     2.061179    10.58207
    _Irace_3 |  -3.478376   1.732305    -2.01   0.046    -6.894726    -.062027
    _Irace_4 |   2.376796   1.115991     2.13   0.034     .1759051    4.577687
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The regression coefficient for _Irace_2 is the mean for level 2 minus the grand mean. However, this grand mean is not the overall mean of the dependent variable that you would get from the summarize command. Rather, it is the mean of the means of the dependent variable at each level of the categorical variable: (46.4583 + 58 + 48.2 + 54.0552) / 4 = 51.678375. This regression coefficient is then 58 - 51.678375 = 6.32. Likewise, the coefficient for _Irace_3 is the mean for level 3 of race minus the grand mean, i.e., 48.2 - 51.678 = -3.478, and the coefficient for _Irace_4 is the mean for level 4 of race minus the grand mean, 54.055 - 51.678 = 2.377.
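The arithmetic in this paragraph can be reproduced with a few scalars. This is a sketch only; the names m1 to m4 and gm are invented for the illustration, and the group means are entered by hand from the text.

```stata
* Group means of write for race levels 1-4, as quoted in the text
scalar m1 = 46.4583
scalar m2 = 58
scalar m3 = 48.2
scalar m4 = 54.0552

* Grand mean used by deviation coding: the unweighted mean of the
* four group means, not the overall mean reported by summarize
scalar gm = (m1 + m2 + m3 + m4)/4
display gm

* Deviations of levels 2, 3 and 4 from that grand mean
display m2 - gm
display m3 - gm
display m4 - gm
```

The value of gm matches the _cons estimate in the output above, and the three deviations should match the coefficients to roughly three decimal places.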

Method 2: Manual Coding

As you see in the example below, the regression coding is accomplished by assigning 1 to level 2 for the first comparison (because level 2 is the level to be compared to all), 1 to level 3 for the second comparison (because level 3 is to be compared to all), and 1 to level 4 for the third comparison (because level 4 is to be compared to all). Note that a -1 is assigned to level 1 for all three comparisons (because it is the level that is never compared to the other levels) and all other values are assigned a 0. This regression coding scheme yields the comparisons described above.

DEVIATION regression coding

Level of race	New variable 1 (x1)	New variable 2 (x2)	New variable 3 (x3)
	Level 2 v. Mean	Level 3 v. Mean	Level 4 v. Mean
1 (Hispanic)	-1	-1	-1
2 (Asian)	1	0	0
3 (African American)	0	1	0
4 (white)	0	0	1

Below we illustrate how to create x1, x2 and x3 and enter these new variables into the regression model using the regress command.

generate x1 = -1 if race==1
replace  x1 =  1 if race==2
replace  x1 =  0 if inlist(race,3,4)

generate x2 = -1 if race==1
replace  x2 =  1 if race==3
replace  x2 =  0 if inlist(race,2,4)

generate x3 = -1 if race==1
replace  x3 =  1 if race==4
replace  x3 =  0 if inlist(race,2,3)

regress write x1 x2 x3

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   6.321624   2.160314     2.93   0.004     2.061179    10.58207
          x2 |  -3.478376   1.732305    -2.01   0.046    -6.894726    -.062027
          x3 |   2.376796   1.115991     2.13   0.034     .1759051    4.577687
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The regression coefficients for this analysis match those in the example above and have the same interpretation.
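Because xi3 is no longer maintained, readers on Stata 12 or later may prefer the built-in factor-variable notation. The sketch below is an alternative we suggest, not something the original analysis used. It assumes the chapter's data (with write and race) are already in memory, and it relies on the g. contrast operator, which compares each level with the balanced grand mean, that is, the mean of the level means.

```stata
* Fit the model with race entered as a factor variable
regress write i.race

* Deviations of each level from the balanced grand mean;
* the effects option requests a table of estimates with tests
contrast g.race, effects
```

The contrast output should list a deviation for every level of race, including level 1, so it contains one more line than the regression table above.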

5.7 Orthogonal Polynomial Coding

Orthogonal polynomial coding is a form of trend analysis in that it is looking for the linear, quadratic and cubic trends in the categorical variable. This type of coding system should be used only with an ordinal variable in which the levels are equally spaced. Examples of such a variable might be income or education. The table below shows the contrast coefficients for the linear, quadratic and cubic trends for the four levels. These could be obtained from most statistics books on linear models.

POLYNOMIAL

Level of race	Linear (x1)	Quadratic (x2)	Cubic (x3)
1 (Hispanic)	-.671	.5	-.224
2 (Asian)	-.224	-.5	.671
3 (African American)	.224	-.5	-.671
4 (white)	.671	.5	.224
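One way to see why these are called orthogonal polynomial coefficients is to check that the three columns are uncorrelated with one another. The sketch below does this with a matrix product; the matrix names P and G are invented here for illustration.

```stata
* One row per trend (linear, quadratic, cubic), one column per level
matrix P = (-.671, -.224, .224, .671 \ .5, -.5, -.5, .5 \ -.224, .671, -.671, .224)

* Cross-products of the trends: off-diagonal entries near 0 mean the
* trends are mutually orthogonal; diagonal entries near 1 mean each
* trend has (approximately) unit length
matrix G = P*P'
matrix list G
```

The diagonal is only approximately 1 because the coefficients in the table are rounded to three decimals.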

Method 1: Using xi3

We indicate we would like race to be coded using orthogonal polynomials by using o.race as shown below.

. xi3 : regress write o.race
o.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_1 |   2.080058   .6381718     3.26   0.001     .8214929    3.338622
    _Irace_2 |  -.2159021   .6381718    -0.34   0.735    -1.474467    1.042663
    _Irace_3 |   2.279811   .6381718     3.57   0.000     1.021246    3.538375
       _cons |     52.775   .6381718    82.70   0.000     51.51644    54.03356
------------------------------------------------------------------------------

The three coded variables, _Irace_1, _Irace_2 and _Irace_3, represent the linear, quadratic and cubic trends respectively. Of course, the term 'trend' doesn't make sense if the variable is nominal, like race. But if we pretend that race is ordinal, then there would be significant linear and cubic trends. It is also easy to test for a nonlinear trend.

. test _Irace_2 _Irace_3

 ( 1)  _Irace_2 = 0.0
 ( 2)  _Irace_3 = 0.0

       F(  2,   196) =    6.44
            Prob > F =    0.0020

The test for nonlinear trend is statistically significant. This example shows how to use xi3, but because race is nominal we need an example with an ordered variable whose trends can be interpreted.

Example 2

We will create our own categorical variable, readcat, from the continuous variable read.

. gen readcat = read
recode readcat 1/43=1 44/49=2 50/59=3 60/100=4

tab readcat

    readcat |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         39       19.50       19.50
          2 |         44       22.00       41.50
          3 |         61       30.50       72.00
          4 |         56       28.00      100.00
------------+-----------------------------------
      Total |        200      100.00

Now we can run the regression with xi3.

. xi3: regress write o.readcat
o.readcat         _Ireadcat_1-4       (naturally coded; _Ireadcat_4 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   29.64
       Model |  5579.22989     3   1859.7433           Prob > F      =  0.0000
    Residual |  12299.6451   196  62.7532914           R-squared     =  0.3121
-------------+------------------------------           Adj R-squared =  0.3015
       Total |   17878.875   199   89.843593           Root MSE      =  7.9217

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _Ireadcat_1 |    5.27249   .5601486     9.41   0.000     4.167798    6.377182
 _Ireadcat_2 |   .3097532   .5601486     0.55   0.581     -.794939    1.414445
 _Ireadcat_3 |  -.0324612   .5601486    -0.06   0.954    -1.137153    1.072231
       _cons |     52.775   .5601486    94.22   0.000     51.67031    53.87969
------------------------------------------------------------------------------

We see from the significant _Ireadcat_1 that the linear trend is significant, while neither the quadratic nor the cubic trend (_Ireadcat_2 and _Ireadcat_3) is significant. The test for nonlinear trend is also nonsignificant.

. test _Ireadcat_2 _Ireadcat_3

 ( 1)  _Ireadcat_2 = 0.0
 ( 2)  _Ireadcat_3 = 0.0

       F(  2,   196) =    0.15
            Prob > F =    0.8569

Method 2: Manual Coding

For the moment we are skipping manual coding.
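Although this page skips manual coding here, one possible route, offered only as a sketch, is Stata's orthpoly command, which generates orthogonal polynomial variables from a numeric variable. The variable names rc1, rc2 and rc3 are made up for this example, and the code assumes that readcat from Example 2 already exists in memory. Because orthpoly builds its polynomials from the observed distribution of readcat, its variables need not equal the equally spaced coefficients in the table at the start of this section, and we have not verified that they reproduce the xi3 output exactly.

```stata
* Generate linear, quadratic and cubic orthogonal polynomial
* variables from the four-level variable readcat
orthpoly readcat, generate(rc1 rc2 rc3) degree(3)

* Enter the three trend variables into the regression
regress write rc1 rc2 rc3

* Joint test of the quadratic and cubic (nonlinear) trends
test rc2 rc3
```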

5.8 User-Defined Coding

You can use the xi3 command to create your own regression coding system. For our example, we will make the following three comparisons:

1) level 1 to level 3
2) level 2 to levels 1 and 4
3) levels 1 and 2 to levels 3 and 4.

In order to compare level 1 to level 3, we use the contrast coefficients 1 0 -1 0. To compare level 2 to levels 1 and 4, we use the contrast coefficients -1/2 1 0 -1/2. Finally, to compare levels 1 and 2 with levels 3 and 4, we use the coefficients 1/2 1/2 -1/2 -1/2. Before proceeding to the Stata code necessary to conduct these analyses, let's take a moment to more fully explain the logic behind the selection of these contrast coefficients.

For the first contrast, we are comparing level 1 to level 3, and the contrast coefficients are 1 0 -1 0. This means that the levels associated with the contrast coefficients with opposite signs are being compared. In fact, the mean of the dependent variable is multiplied by the contrast coefficient. Hence, levels 2 and 4 are not involved in the comparison: they are multiplied by zero and "dropped out." You will also notice that the contrast coefficients sum to zero. This is necessary. If the contrast coefficients do not sum to zero, the contrast is not estimable and Stata will issue an error message. Which level of the categorical variable is assigned a positive or negative value is not terribly important: 1 0 -1 0 is the same as -1 0 1 0 in that both of these codings compare the first and the third levels of the variable. However, the sign of the regression coefficient would change.

Now let's look at the contrast coefficients for the second and third comparisons. You will notice that in both cases we use fractions whose positive coefficients sum to one and whose negative coefficients sum to minus one. This is not required; only the overall sum must be zero. You may wonder why we would use fractions like -1/2 1 0 -1/2 instead of whole numbers such as -1 2 0 -1. While -1/2 1 0 -1/2 and -1 2 0 -1 both compare level 2 with levels 1 and 4 and both will give you the same t-value and p-value for the regression coefficient, the regression coefficients themselves would be different, as would their interpretation. The coefficient for the -1/2 1 0 -1/2 contrast is the mean of level 2 minus the mean of the means for levels 1 and 4: 58 - (46.4583 + 54.0552)/2 = 7.74325. (Alternatively, you can multiply the contrast coefficients by the mean of the dependent variable for each level of the categorical variable: -1/2*46.4583 + 1*58.00 + 0*48.20 - 1/2*54.0552 = 7.74325. Clearly these are equivalent ways of thinking about how the regression coefficient is calculated.) By comparison, the coefficient for the -1 2 0 -1 contrast is two times the mean for level 2 minus the sum of the means of the dependent variable for levels 1 and 4: 2*58 - (46.4583 + 54.0552) = 15.4865, which is the same as -1*46.4583 + 2*58 + 0*48.20 - 1*54.0552 = 15.4865. Note that the regression coefficient using the contrast coefficients -1 2 0 -1 is twice the regression coefficient obtained when -1/2 1 0 -1/2 is used.
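The effect of rescaling a contrast can be checked by multiplying each set of contrast coefficients by the vector of group means. This is a sketch under the assumption that the group means quoted in the text are entered by hand; the matrix names m, c_frac, c_whole, b_frac and b_whole are invented for the illustration.

```stata
* Group means of write for race levels 1-4, as a column vector
matrix m = (46.4583 \ 58 \ 48.2 \ 54.0552)

* The same comparison (level 2 vs. levels 1 and 4) on two scales
matrix c_frac  = (-1/2, 1, 0, -1/2)
matrix c_whole = (-1, 2, 0, -1)

* Contrast coefficients times group means give the regression coefficient
matrix b_frac  = c_frac*m
matrix b_whole = c_whole*m

display b_frac[1,1]
display b_whole[1,1]
```

The second value should be exactly twice the first, which is why the t-value and p-value do not depend on the scale chosen while the coefficient and its interpretation do.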

Method 1: Using xi3

We use the char command to specify the contrast coefficients for race, as shown below. As described above, the three contrasts are 1 0 -1 0 (level 1 versus level 3), -1/2 1 0 -1/2 (level 2 versus levels 1 and 4) and 1/2 1/2 -1/2 -1/2 (levels 1 and 2 versus levels 3 and 4). The char race[user] command below stores these three contrasts (three because race has four levels) as a single list: (1 0 -1 0 -.5 1 0 -.5 .5 .5 -.5 -.5).

char race[user] (1 0 -1 0  -.5 1 0 -.5  .5 .5 -.5 -.5)

xi3 : regress write u.race
u.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_1 |  -1.741667   2.732488    -0.64   0.525    -7.130519    3.647186
    _Irace_2 |   7.743247   2.897186     2.67   0.008     2.029588    13.45691
    _Irace_3 |    1.10158   1.964244     0.56   0.576    -2.772186    4.975347
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The coefficient for _Irace_1 corresponds to the first contrast, comparing level 1 to level 3 of race. It is the mean of write for level 1 minus the mean of write for level 3 (46.4583 - 48.2 = -1.742), and it is not statistically significant (p = 0.525). The coefficient for _Irace_2 is 7.743, which is the mean for level 2 minus the mean of the means for levels 1 and 4, and this difference is significant (p = 0.008). The final regression coefficient is 1.102, which is the mean of the means for levels 1 and 2 minus the mean of the means for levels 3 and 4, and this contrast is not statistically significant (p = 0.576).
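For readers without xi3, the same three comparisons can probably be requested with the built-in contrast command available in Stata 12 and later. This is a sketch of an alternative rather than the method used on this page; it assumes the chapter's data (with write and race) are in memory and uses contrast's brace syntax for user-defined contrast coefficients.

```stata
* Fit the model with race entered as a factor variable
regress write i.race

* One user-defined contrast per command; coefficients are listed
* in the order of the levels of race (1, 2, 3, 4)
contrast {race 1 0 -1 0}, effects
contrast {race -.5 1 0 -.5}, effects
contrast {race .5 .5 -.5 -.5}, effects
```

Each contrast estimate should correspond to one of the coefficients in the table above, since both approaches apply the same contrast coefficients to the same group means.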

Method 2: Manual Coding

As in the prior examples, we will make the following three comparisons:

1) level 1 to level 3,
2) level 2 to levels 1 and 4 and
3) levels 1 and 2 to levels 3 and 4.

The xi3 command converts the contrast coding into regression coding for us. However, we could do this process manually as well.

In the examples above it was quite easy to translate the comparisons we wanted to make into contrast codings, but it is not as easy to translate the comparisons we want directly into a regression coding scheme. If we know the contrast coding system, then we can convert it into a regression coding system using the Stata commands shown below. As you can see, we place the three contrast codings we want into the rows of the matrix c and then perform a set of matrix operations on c, yielding the matrix x. We then display x using the matrix list command.

matrix input c = (1, 0, -1, 0 \ -.5, 1, 0, -.5 \ .5, .5, -.5, -.5)
matrix x = c'*inv(c*c')
matrix list x

x[4,3]
      r1    r2    r3
c1   -.5    -1   1.5
c2    .5     1   -.5
c3  -1.5    -1   1.5
c4   1.5     1  -2.5

This converted the contrast coding into the regression coding that we need for running this analysis with the regress command. Below, we use the generate and replace commands to create x1, x2 and x3 according to the coding shown above and then enter them into the regression analysis.

generate x1 =  -.5 if race == 1
replace  x1 =   .5 if race == 2
replace  x1 = -1.5 if race == 3
replace  x1 =  1.5 if race == 4

generate x2 =  -1 if race == 1
replace  x2 =   1 if race == 2
replace  x2 =  -1 if race == 3
replace  x2 =   1 if race == 4

generate x3 =  1.5 if race == 1
replace  x3 =  -.5 if race == 2
replace  x3 =  1.5 if race == 3
replace  x3 = -2.5 if race == 4

regress write x1 x2 x3

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -1.741667   2.732488    -0.64   0.525    -7.130519    3.647186
          x2 |   7.743247   2.897186     2.67   0.008     2.029588    13.45691
          x3 |    1.10158   1.964244     0.56   0.576    -2.772186    4.975347
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

As you can see, the results of this analysis match those produced using xi3.
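To see why the converted coding recovers the intended comparisons, it can help to multiply the contrast matrix by the regression coding and by the group means. The sketch below assumes the three contrasts are stored as the rows of c and that the group means quoted earlier in the chapter are typed in by hand; the names chk, m and b are invented for the illustration.

```stata
* Contrast coding (rows = comparisons) and its regression coding
matrix c = (1, 0, -1, 0 \ -.5, 1, 0, -.5 \ .5, .5, -.5, -.5)
matrix x = c'*inv(c*c')

* c*x should be the 3 x 3 identity matrix: each coded variable
* carries exactly one of the three comparisons
matrix chk = c*x
matrix list chk

* Applying the contrasts to the group means of write (levels 1-4)
* gives the values the regression coefficients estimate
matrix m = (46.4583 \ 58 \ 48.2 \ 54.0552)
matrix b = c*m
matrix list b
```

The entries of b should agree with the coefficients for x1, x2 and x3 to about three decimal places.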

5.9 Summary

This page has described a number of different coding systems that you could use for categorical data, and two different strategies you could use for performing the analyses. You can choose a coding system that yields comparisons that make the most sense for testing your hypotheses. Between the two strategies (xi3 and manual coding), you can see that xi3 automates the process of creating the coding, but this gives up a certain amount of control. If you like, you can use manual coding which gives you more control over creating the coding of the variables, but may be more laborious and tedious. In general we would recommend using the easiest method that accomplishes your goals.

5.10 Additional Information

Here are some additional resources.

Stata Textbook Examples from Design and Analysis: Chapter 6
Stata Textbook Examples from Design and Analysis: Chapter 7
Stata Textbook Examples: Applied Regression Analysis, Chapter 8
One-Way ANOVA Contrast Code Problems From Charles Judd and Gary McClelland
Two-way contrast code solutions