SPSS Library: Appendix for Dummy Coding

DUMMY CODING

Perhaps the simplest and most common coding system is dummy coding. It converts a categorical variable into a series of dichotomous variables (variables that can take only the values zero or one). For all but one of the levels of the categorical variable, a new variable is created that has a value of one for each observation at that level and zero for all others. In our example using the variable race, the first new variable (x1) has a value of one for each observation in which race is Hispanic, and zero for all other observations. Likewise, x2 is 1 when the person is Asian and 0 otherwise, and x3 is 1 when the person is African American and 0 otherwise. The level of the categorical variable that is coded as zero in all of the new variables is the reference level, that is, the level to which all of the other levels are compared. In our example, white is the reference level. You can select any level of the categorical variable as the reference level.

DUMMY CODING

Level of race	New variable 1 (x1)	New variable 2 (x2)	New variable 3 (x3)
1 (Hispanic)	1	0	0
2 (Asian)	0	1	0
3 (African American)	0	0	1
4 (white)	0	0	0
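
The coding scheme in the table can also be sketched outside SPSS. The Python snippet below is only an illustration and is not part of the SPSS workflow: the function name dummy_code and the six sample observations are hypothetical, while the numeric codes 1 to 4 follow the table above.

```python
def dummy_code(values, levels, reference):
    """Return one 0/1 indicator list per non-reference level.

    values    -- observed category codes, e.g. [1, 4, 2, ...]
    levels    -- all category codes of the variable
    reference -- the level coded 0 on every new variable
    """
    dummies = {}
    for level in levels:
        if level == reference:
            continue  # the reference level gets no variable of its own
        dummies[level] = [1 if v == level else 0 for v in values]
    return dummies


# Hypothetical race codes for six observations (1=Hispanic, 2=Asian,
# 3=African American, 4=white), with white as the reference level.
race = [1, 2, 3, 4, 4, 1]
coded = dummy_code(race, levels=[1, 2, 3, 4], reference=4)
x1, x2, x3 = coded[1], coded[2], coded[3]

# Print each observation's race code next to its three dummy values.
for r, a, b, c in zip(race, x1, x2, x3):
    print(r, a, b, c)
```

Passing a different value for reference changes which level is coded zero on all of the new variables, which mirrors the point that any level can serve as the reference level.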

After the new variables are created, they are entered into the regression in place of the original variable: we enter x1, x2 and x3 instead of race, and the regression output includes a coefficient for each of these variables. The coefficient for x1 is the mean of the dependent variable for group 1 minus the mean of the dependent variable for the omitted group. In our example, the coefficient for x1 would be the mean of write for the Hispanic group minus the mean of write for the white group. Likewise, the coefficient for x2 would be the mean of write for the Asian group minus the mean of write for the white group, and the coefficient for x3 would be the mean of write for the African American group minus the mean of write for the white group.
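
To see why each coefficient is a difference in means, it can help to write out the fitted equation. The notation below is introduced only for illustration and does not appear in the SPSS output: b_0 is the constant, b_1 to b_3 are the coefficients for x1 to x3, and each y-bar is the mean of write in the named group. Because the dummy variables are the only predictors, the fitted value for every observation in a group equals that group's mean.

```latex
\begin{align*}
\widehat{\text{write}} &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 \\[4pt]
\bar{y}_{\text{white}} &= b_0 && (x_1 = x_2 = x_3 = 0) \\
\bar{y}_{\text{Hispanic}} &= b_0 + b_1 && (x_1 = 1,\; x_2 = x_3 = 0) \\
\bar{y}_{\text{Asian}} &= b_0 + b_2 && (x_2 = 1,\; x_1 = x_3 = 0) \\
\bar{y}_{\text{African American}} &= b_0 + b_3 && (x_3 = 1,\; x_1 = x_2 = 0) \\[4pt]
b_1 &= \bar{y}_{\text{Hispanic}} - \bar{y}_{\text{white}} \\
b_2 &= \bar{y}_{\text{Asian}} - \bar{y}_{\text{white}} \\
b_3 &= \bar{y}_{\text{African American}} - \bar{y}_{\text{white}}
\end{align*}
```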

There are two ways to create the dummy variables in SPSS syntax. In Method 1, we create a new variable (e.g., x1) that is set equal to zero. Then we change the value of this new variable to one if the original (categorical) variable is at the corresponding level (level 1 for x1). We repeat this process for each new variable that we need to create, matching each one to its own level. In Method 2, we use a "do-loop" (the do repeat command) to generate the new variables, which can be useful if your categorical variable has a large number of levels.

Method 1 for creating dummy variables

* Start each dummy at 0, then set it to 1 for its own level of race.
compute x1 = 0.
if race = 1 x1 = 1.
compute x2 = 0.
if race = 2 x2 = 1.
compute x3 = 0.
if race = 3 x3 = 1.
execute.

Method 2 for creating dummy variables

* A stands in for x1 to x3 and B for levels 1 to 3; (race=B) is 1 when true and 0 when false.
do repeat A=x1 x2 x3
 /B=1 2 3.
compute A=(race=B).
end repeat.
execute.

regression
 /dep write
 /method = enter x1 x2 x3.

**Variables Entered/Removed(b)**
Model	Variables Entered	Variables Removed	Method
1	X3, X2, X1(a)	.	Enter
a All requested variables entered.
b Dependent Variable: writing score

The table above shows which variables were entered into the regression equation. It also indicates that the method used was "enter", as opposed to other methods that could have been specified, such as backward, forward or stepwise. Finally, the table indicates that all of the variables listed on the /method subcommand were entered into the regression equation.

**Model Summary**
Model	R	R Square	Adjusted R Square	Std. Error of the Estimate
1	.327(a)	.107	.093	9.02511
a Predictors: (Constant), X3, X2, X1

**ANOVA(b)**
Model		Sum of Squares	df	Mean Square	F	Sig.
1	Regression	1914.158	3	638.053	7.833	.000(a)
	Residual	15964.717	196	81.453
	Total	17878.875	199
a Predictors: (Constant), X3, X2, X1
b Dependent Variable: writing score

The table above entitled "Model Summary" indicates that one model was tested, that 10.7% of the variance in the dependent variable is accounted for by the independent variables (the three dummy variables that together represent race), and that 9.3% of the variance is accounted for once the number of independent variables in the equation is taken into consideration (the adjusted R square). The standard error of the estimate is also given. The table entitled "ANOVA" gives the sum of squares and the degrees of freedom (in the column labeled "df") for the regression, the residual and the total (regression plus residual). The mean square is given for the regression and the residual, and the F-value and the associated p-value (in the column labeled Sig.) are displayed. These results indicate that the regression is statistically significant at the .05 alpha level. As you will see, the overall test of race is the same regardless of the coding system used.
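
The quantities in the two tables are linked by simple arithmetic, which the following Python sketch retraces. It is an illustration only, not something SPSS requires: the variable names are hypothetical, the input numbers are copied from the ANOVA table, and the formulas are the standard textbook definitions of R square, adjusted R square, mean square and F.

```python
import math

# Values copied from the ANOVA table above.
ss_regression = 1914.158
ss_residual = 15964.717
df_regression = 3      # number of dummy variables
df_residual = 196      # n - number of dummies - 1, with n = 200

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual   # n - 1

# Proportion of variance accounted for, and its adjusted version.
r_square = ss_regression / ss_total
adj_r_square = 1 - (1 - r_square) * df_total / df_residual

# Mean squares, the F ratio, and the standard error of the estimate.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_value = ms_regression / ms_residual
std_error_estimate = math.sqrt(ms_residual)

print(f"R Square            = {r_square:.3f}")
print(f"Adjusted R Square   = {adj_r_square:.3f}")
print(f"F                   = {f_value:.3f}")
print(f"Std. Error of Est.  = {std_error_estimate:.5f}")
```

The printed values should agree with the Model Summary and ANOVA tables up to rounding.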

**Coefficients(a)**
		Unstandardized Coefficients		Standardized Coefficients
Model		B	Std. Error	Beta	t	Sig.
1	(Constant)	54.055	.749		72.122	.000
	X1	-7.597	1.989	-.261	-3.820	.000
	X2	3.945	2.823	.095	1.398	.164
	X3	-5.855	2.153	-.186	-2.720	.007
a Dependent Variable: writing score

The table above gives the unstandardized coefficients for the regression equation (in the column labeled B) and their standard errors (in the column labeled Std. Error). When using dummy coding, the constant is the mean of the dependent variable for the omitted (reference) level of the categorical variable. The coefficient for x1 is the mean of the dependent variable for level 1 of race minus the mean of the dependent variable at level 4 of race (the reference level). Likewise, the coefficients for x2 and x3 are each the mean of the dependent variable at that level of race minus the mean of the dependent variable for the reference level. The standardized coefficients are given in the column labeled Beta. The t-values and associated p-values are also given. The statistical significance of the constant is rarely of interest to researchers. The coefficients for x1 and x3 are statistically significant at the .05 (and .01) alpha level, while the coefficient for x2 is not. This indicates that level 1 of race (Hispanic) is significantly different from level 4 (white), and that level 3 (African American) is significantly different from level 4 (white), whereas level 2 (Asian) is not significantly different from level 4 (white).
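
Because each coefficient is a difference from the reference-level mean, the group means of write can be recovered from the Coefficients table by addition. The Python sketch below does this arithmetic; it is only an illustration, the variable names are hypothetical, and the numbers are copied from the column labeled B.

```python
# Constant = mean of write for the reference level (white).
constant = 54.055

# Coefficients for x1, x2 and x3, keyed by the level each dummy represents.
coefficients = {
    "Hispanic": -7.597,
    "Asian": 3.945,
    "African American": -5.855,
}

# Each group mean is the constant plus that group's coefficient.
group_means = {"white": constant}
for group, b in coefficients.items():
    group_means[group] = constant + b

for group, mean in group_means.items():
    print(f"{group:17s} {mean:.3f}")
```

These recovered means can be checked against the output of a descriptive procedure such as MEANS run on write by race.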