There are two easy ways to create dummy variables in Stata. Let's begin with a simple dataset that has three levels of the variable group:
input group
1
1
2
3
2
2
1
3
3
end
We can create dummy variables using the tabulate command with the generate( ) option, as shown below.
tabulate group, generate(dum)
      group |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          3       33.33       33.33
          2 |          3       33.33       66.67
          3 |          3       33.33      100.00
------------+-----------------------------------
      Total |          9      100.00

list
       group   dum1   dum2   dum3
  1.       1      1      0      0
  2.       1      1      0      0
  3.       2      0      1      0
  4.       3      0      0      1
  5.       2      0      1      0
  6.       2      0      1      0
  7.       1      1      0      0
  8.       3      0      0      1
  9.       3      0      0      1
The tabulate command with the generate option created three dummy variables called dum1, dum2, and dum3.
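The same dummies can also be created by hand with the generate command, using the fact that a logical expression in Stata evaluates to 1 when true and 0 when false. This is a minimal sketch (the names dum1-dum3 are chosen here to mirror those produced by tabulate above):

generate dum1 = (group == 1)
generate dum2 = (group == 2)
generate dum3 = (group == 3)

If group could contain missing values, you would typically add an if !missing(group) qualifier to each line so that missing observations are not coded 0.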
An Example Using the High School and Beyond Dataset
Using the High School and Beyond dataset, we wish to account for variability in the writing test scores using information on reading, math, and the program type the student is in. The categorical variable prog has three levels: 1) general program, 2) academic program, and 3) vocational program. First we will load the dataset from the Internet, then we will create dummy variables for prog using the tabulate command.
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear
tabulate prog, generate(prog)
    type of |
    program |      Freq.     Percent        Cum.
------------+-----------------------------------
    general |         45       22.50       22.50
   academic |        105       52.50       75.00
   vocation |         50       25.00      100.00
------------+-----------------------------------
      Total |        200      100.00
The tabulate command with the generate option created the following variables: prog1, prog2, and prog3. In a regression analysis we can use only two of the three dummy variables; since prog has three levels, it uses two degrees of freedom. Here is the regression analysis.
regress write read math prog2 prog3
      Source |       SS           df       MS      Number of obs   =       200
-------------+----------------------------------   F(4, 195)       =     41.03
       Model |  8170.58624         4  2042.64656   Prob > F        =    0.0000
    Residual |  9708.28876       195  49.7860962   R-squared       =    0.4570
-------------+----------------------------------   Adj R-squared   =    0.4459
       Total |   17878.875       199   89.843593   Root MSE        =    7.0559

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |    .289028   .0659478     4.38   0.000     .1589656    .4190905
        math |   .3587215   .0745443     4.81   0.000     .2117048    .5057381
       prog2 |   .6647754    1.32845     0.50   0.617    -1.955198    3.284749
       prog3 |  -2.253484   1.468445    -1.53   0.127    -5.149556    .6425886
       _cons |   19.00854    3.40933     5.58   0.000     12.28465    25.73243
------------------------------------------------------------------------------
In this analysis, all of the variables were statistically significant except for prog2 and prog3. However, remember that it is the combination of prog2 and prog3 that makes up the variable program type. Let's test prog2 and prog3 together.
test prog2 prog3
 ( 1)  prog2 = 0.0
 ( 2)  prog3 = 0.0

       F(  2,   195) =    2.32
            Prob > F =    0.1015
As it turns out, by testing prog2 and prog3 together, we find that the variable program type is not statistically significant.
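As a shorthand for this kind of joint test, Stata's testparm command accepts a variable list (including wildcards) and constructs the same F test; this sketch, which is not part of the original page, should reproduce the result above:

testparm prog2 prog3

Because testparm allows wildcards, testparm prog* would also work here, since prog1 was omitted from the model.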
We can also do this in one step using the i. (factor variable) notation, as shown below. Factor variables create indicator variables from categorical variables and are allowed with most estimation and postestimation commands. Note how the results below match those above exactly.
regress write read math i.prog
Source | SS df MS Number of obs = 200
-------------+---------------------------------- F(4, 195) = 41.03
Model | 8170.58624 4 2042.64656 Prob > F = 0.0000
Residual | 9708.28876 195 49.7860962 R-squared = 0.4570
-------------+---------------------------------- Adj R-squared = 0.4459
Total | 17878.875 199 89.843593 Root MSE = 7.0559
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
read | .289028 .0659478 4.38 0.000 .1589656 .4190905
math | .3587215 .0745443 4.81 0.000 .2117048 .5057381
|
prog |
academic | .6647754 1.32845 0.50 0.617 -1.955198 3.284749
vocation | -2.253484 1.468445 -1.53 0.127 -5.149556 .6425886
|
_cons | 19.00854 3.40933 5.58 0.000 12.28464 25.73243
------------------------------------------------------------------------------
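Factor-variable notation also makes it easy to change the reference category without creating any new variables. The ib#. operator sets the base level; for example, the following sketch (not shown in the original output) would use level 2, the academic program, as the reference instead of the default lowest level:

regress write read math ib2.prog

The overall model fit is unchanged; only the interpretation of the program-type coefficients shifts, since each is now a contrast with the academic program.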
As we did in the prior example, we can test the overall effect of program type with the test command, as shown below.

test 2.prog 3.prog
 ( 1)  2.prog = 0
 ( 2)  3.prog = 0

       F(  2,   195) =    2.32
            Prob > F =    0.1015

The contrast command can be used to get the multi-degree-of-freedom test of the categorical variable.

contrast prog

Contrasts of marginal linear predictions

Margins      : asbalanced

------------------------------------------------
             |         df           F        P>F
-------------+----------------------------------
        prog |          2        2.32     0.1015
             |
 Denominator |        195
------------------------------------------------

For more information
See the Stata manual on tabulate and factor variables.