This module shows how to create and recode variables. In Stata you can create new variables with generate and you can modify the values of an existing variable with replace and with recode.
Computing new variables using generate and replace
Let’s use the auto data for our examples. In this section we will see how to compute variables with generate and replace.
sysuse auto, clear
The variable length contains the length of the car in inches. Below we see summary statistics for length.
summarize length
Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- length | 74 187.9324 22.26634 142 233
Let’s use the generate command to make a new variable that has the length in feet instead of inches, called len_ft.
generate len_ft = length / 12
We should emphasize that generate is for creating a new variable. For an existing variable, you need to use the replace command (not generate). As shown below, we use replace to repeat the assignment to len_ft.
replace len_ft = length / 12 (49 real changes made) summarize length len_ft Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- length | 74 187.9324 22.26634 142 233 len_ft | 74 15.66104 1.855528 11.83333 19.41667
The syntax of generate and replace are identical, except: – generate works when the variable does not yet exist and will give an error if the variable already exists. – replace works when the variable already exists, and will give an error if the variable does not yet exist.
Suppose we wanted to make a variable called length2 which has length squared.
generate length2 = length^2 summarize length2 Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- length2 | 74 35807.69 8364.045 20164 54289
Or we might want to make loglen which is the natural log of length.
generate loglen = log(length) summarize loglen Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- loglen | 74 5.229035 .1201383 4.955827 5.451038
Let’s get the mean and standard deviation of length and we can make Z-scores of length.
summarize length Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- length | 74 187.9324 22.26634 142 233
The mean is 187.93 and the standard deviation is 22.27, so zlength can be computed as shown below.
generate zlength = (length - 187.93) / 22.27 summarize zlength Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- zlength | 74 .0001092 .9998357 -2.062416 2.023799
With generate and replace you can use + – for addition and subtraction you can use * / for multiplication and division you can use ^ for exponents (e.g., length^2) you can use ( ) for controlling order of operations.
Recoding new variables using generate and replace
Suppose that we wanted to break mpg down into three categories. Let’s look at a table of mpg to see where we might draw the lines for such categories.
tabulate mpg mpg | Freq. Percent Cum. ------------+----------------------------------- 12 | 2 2.70 2.70 14 | 6 8.11 10.81 15 | 2 2.70 13.51 16 | 4 5.41 18.92 17 | 4 5.41 24.32 18 | 9 12.16 36.49 19 | 8 10.81 47.30 20 | 3 4.05 51.35 21 | 5 6.76 58.11 22 | 5 6.76 64.86 23 | 3 4.05 68.92 24 | 4 5.41 74.32 25 | 5 6.76 81.08 26 | 3 4.05 85.14 28 | 3 4.05 89.19 29 | 1 1.35 90.54 30 | 2 2.70 93.24 31 | 1 1.35 94.59 34 | 1 1.35 95.95 35 | 2 2.70 98.65 41 | 1 1.35 100.00 ------------+----------------------------------- Total | 74 100.00
Let’s convert mpg into three categories to help make this more readable. Here we convert mpg into three categories using generate and replace.
generate mpg3 = . (74 missing values generated) replace mpg3 = 1 if (mpg <= 18) (27 real changes made) replace mpg3 = 2 if (mpg >= 19) & (mpg <=23) (24 real changes made) replace mpg3 = 3 if (mpg >= 24) & (mpg <.) (23 real changes made)
Let’s use tabulate to check that this worked correctly. Indeed, you can see that a value of 1 for mpg3 goes from 12-18, a value of 2 goes from 19-23, and a value of 3 goes from 24-41.
tabulate mpg mpg3 | mpg3 mpg | 1 2 3 | Total -----------+---------------------------------+---------- 12 | 2 0 0 | 2 14 | 6 0 0 | 6 15 | 2 0 0 | 2 16 | 4 0 0 | 4 17 | 4 0 0 | 4 18 | 9 0 0 | 9 19 | 0 8 0 | 8 20 | 0 3 0 | 3 21 | 0 5 0 | 5 22 | 0 5 0 | 5 23 | 0 3 0 | 3 24 | 0 0 4 | 4 25 | 0 0 5 | 5 26 | 0 0 3 | 3 28 | 0 0 3 | 3 29 | 0 0 1 | 1 30 | 0 0 2 | 2 31 | 0 0 1 | 1 34 | 0 0 1 | 1 35 | 0 0 2 | 2 41 | 0 0 1 | 1 -----------+---------------------------------+---------- Total | 27 24 23 | 74
Now, we could use mpg3 to show a crosstab of mpg3 by foreign to contrast the mileage of the foreign and domestic cars.
tabulate mpg3 foreign, column | foreign mpg3 | 0 1 | Total -----------+----------------------+---------- 1 | 22 5 | 27 | 42.31 22.73 | 36.49 -----------+----------------------+---------- 2 | 19 5 | 24 | 36.54 22.73 | 32.43 -----------+----------------------+---------- 3 | 11 12 | 23 | 21.15 54.55 | 31.08 -----------+----------------------+---------- Total | 52 22 | 74 | 100.00 100.00 | 100.00
The crosstab above shows that 21% of the domestic cars fall into the high mileage category, while 55% of the foreign cars fit into this category.
Recoding variables using recode
There is an easier way to recode mpg to three categories using generate and recode. First, we make a copy of mpg, calling it mpg3a. Then, we use recode to convert mpg3a into three categories: min-18 into 1, 19-23 into 2, and 24-max into 3.
generate mpg3a = mpg recode mpg3a (min/18=1) (19/23=2) (24/max=3) (74 changes made)
Let’s double check to see that this worked correctly. We see that it worked perfectly.
tabulate mpg mpg3a | mpg3a mpg | 1 2 3 | Total -----------+---------------------------------+---------- 12 | 2 0 0 | 2 14 | 6 0 0 | 6 15 | 2 0 0 | 2 16 | 4 0 0 | 4 17 | 4 0 0 | 4 18 | 9 0 0 | 9 19 | 0 8 0 | 8 20 | 0 3 0 | 3 21 | 0 5 0 | 5 22 | 0 5 0 | 5 23 | 0 3 0 | 3 24 | 0 0 4 | 4 25 | 0 0 5 | 5 26 | 0 0 3 | 3 28 | 0 0 3 | 3 29 | 0 0 1 | 1 30 | 0 0 2 | 2 31 | 0 0 1 | 1 34 | 0 0 1 | 1 35 | 0 0 2 | 2 41 | 0 0 1 | 1 -----------+---------------------------------+---------- Total | 27 24 23 | 74
Recodes with if
Let’s create a variable called mpgfd that assesses the mileage of the cars with respect to their origin. Let this be a 0/1 variable called mpgfd which is: 0 if below the median mpg for its group (foreign/domestic) 1 if at/above the median mpg for its group (foreign/domestic).
sort foreign by foreign: summarize mpg, detail -> foreign= 0 mpg ------------------------------------------------------------- Percentiles Smallest 1% 12 12 5% 14 12 10% 14 14 Obs 52 25% 16.5 14 Sum of Wgt. 52 50% 19 Mean 19.82692 Largest Std. Dev. 4.743297 75% 22 28 90% 26 29 Variance 22.49887 95% 29 30 Skewness .7712432 99% 34 34 Kurtosis 3.441459 -> foreign= 1 mpg ------------------------------------------------------------- Percentiles Smallest 1% 14 14 5% 17 17 10% 17 17 Obs 22 25% 21 18 Sum of Wgt. 22 50% 24.5 Mean 24.77273 Largest Std. Dev. 6.611187 75% 28 31 90% 35 35 Variance 43.70779 95% 35 35 Skewness .657329 99% 41 41 Kurtosis 3.10734
We see that the median is 19 for the domestic (foreign==0) cars and 24.5 for the foreign (foreign==1) cars. The generate and recode commands below recode mpg into mpgfd based on the domestic car median for the domestic cars, and based on the foreign car median for the foreign cars.
generate mpgfd = mpg recode mpgfd (min/18=0) (19/max=1) if foreign==0 (52 changes made) recode mpgfd (min/24=0) (25/max=1) if foreign==1 (22 changes made)
We can check using this below, and the recoded value mpgfd looks correct.
by foreign: tabulate mpg mpgfd -> foreign= 0 | mpgfd mpg | 0 1 | Total -----------+----------------------+---------- 12 | 2 0 | 2 14 | 5 0 | 5 15 | 2 0 | 2 16 | 4 0 | 4 17 | 2 0 | 2 18 | 7 0 | 7 19 | 0 8 | 8 20 | 0 3 | 3 21 | 0 3 | 3 22 | 0 5 | 5 24 | 0 3 | 3 25 | 0 1 | 1 26 | 0 2 | 2 28 | 0 2 | 2 29 | 0 1 | 1 30 | 0 1 | 1 34 | 0 1 | 1 -----------+----------------------+---------- Total | 22 30 | 52 -> foreign= 1 | mpgfd mpg | 0 1 | Total -----------+----------------------+---------- 14 | 1 0 | 1 17 | 2 0 | 2 18 | 2 0 | 2 21 | 2 0 | 2 23 | 3 0 | 3 24 | 1 0 | 1 25 | 0 4 | 4 26 | 0 1 | 1 28 | 0 1 | 1 30 | 0 1 | 1 31 | 0 1 | 1 35 | 0 2 | 2 41 | 0 1 | 1 -----------+----------------------+---------- Total | 11 11 | 22
Summary
Create a new variable len_ft which is length divided by 12.
generate len_ft = length / 12
Change values of an existing variable named len_ft.
replace len_ft = length / 12
Recode mpg into mpg3, having three categories using generate and replace if.
generate mpg3 = . replace mpg3 = 1 if (mpg <=18) replace mpg3 = 2 if (mpg >=19) & (mpg <=23) replace mpg3 = 3 if (mpg >=24) & (mpg <.)
Recode mpg into mpg3a, having three categories, 1 2 3, using generate and recode.
generate mpg3a = mpg recode mpg3a (min/18=1) (19/23=2) (24/max=3)
Recode mpg into mpgfd, having two categories, but using different cutoffs for foreign and domestic cars.
generate mpgfd = mpg recode mpgfd (min/18=0) (19/max=1) if foreign==0 recode mpgfd (min/24=0) (25/max=1) if foreign==1