codebook | Show codebook information for file |
order | Order the variables in a data set |
label data | Apply a label to a data set |
label variable | Apply a label to a variable |
label define | Define value labels for a categorical variable |
label values | Apply value labels to a variable |
encode | Create numeric version of a string variable |
list | Lists the observations |
rename | Rename a variable |
recode | Recode the values of a variable |
notes | Apply notes to the data file |
generate | Creates a new variable |
replace | Replaces values for an existing variable |
egen | Extended generate – has special functions that can be used when creating a new variable |
2.0 Demonstration and explanation
In this section we will use Stata commands to label and transform variables, and to create new variables that are functions of existing variables. We first load the data and use codebook to look at all variables, including labeling information.
use https://stats.idre.ucla.edu/stat/data/hs0, clear
codebook
A) Use order to control the ordering of variables as columns in the dataset
While several orderings of the variables would be logical, we will put the id variable first, followed by the demographic variables describing the students (gender, race, ses and prgtype). The last columns will then contain the test scores.
order id gender race ses prgtype, first
B) Use label data to describe the dataset and label variable to give variable names more meaning.
To remember the contents of a dataset, we can apply a label to it as well as some notes, using the note command.
label data "High School and Beyond"
notes id: anonymous id
notes
Short variable names are desirable to keep coding clean, but may obscure what the variable represents. Variable labels allow us to provide a longer description of the variable’s contents.
label variable schtyp "type of school"
C) Use label define to create a set of value labels, and then use label values to apply the value labels to a variable
All variables that undergo any kind of numerical calculation must have a numerical representation in Stata. Categorical variables should thus use numbers to define the categories and value labels to give meaning to the numbers. Below we create a set of value labels called “scl”, which gives meaning to the values 1 and 2. We then apply the “scl” label to the schtyp variable.
codebook schtyp /* check for value labels first */
label define scl 1 public 2 private
label values schtyp scl
codebook schtyp
Labels will typically be used in the output. Labels can be suppressed in many commands using the nolabel option.
list schtyp in 1/10
list schtyp in 1/10, nolabel
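To confirm which label set is attached to a variable and what it contains, a few inspection commands can help. The sketch below assumes the hs0 data are in memory and that scl has been defined and applied to schtyp as above; it only displays information and does not change the data.

```stata
* Show the value-to-text mapping stored in the label set scl
label list scl
* The "value label" column of the output should show scl for schtyp
describe schtyp
* Frequencies displayed with the labels, then with the underlying codes
tabulate schtyp
tabulate schtyp, nolabel
```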
D) The encode command will convert a string variable to numeric and will label its values automatically
Because the variable prgtype is a string variable, it cannot be used in any commands requiring numerical calculations. We use encode to create a numeric version of this variable, called prog, which will use the string values of prgtype as the value labels.
encode prgtype, gen(prog)
label variable prog "type of program"
codebook prog
list prog in 1/10
list prog in 1/10, nolabel
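encode numbers the categories 1, 2, 3, ... after sorting the string values alphabetically, and stores the mapping in a value label named after the new variable (here prog). A minimal sketch for inspecting that mapping and checking that nothing was lost in the conversion is shown below. It assumes prog was created as above; prgtype_check is a hypothetical temporary variable name that is dropped at the end.

```stata
* Display the numeric codes that encode assigned to each string value
label list prog
* Convert prog back to a string and compare it with the original
decode prog, generate(prgtype_check)
assert prgtype_check == prgtype
* Remove the temporary variable so it is not saved with the data
drop prgtype_check
```

If the assert command produces no output, the round trip reproduced the original strings exactly.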
E) Renaming and recoding variables
The variable gender may give us trouble in the future because it is difficult to know what the 1s and 2s mean. Consider giving dummy (indicator) variables the name signified by the value of 1. Below we use rename to rename gender to female, which is what female=1 will indicate after recoding. We then use recode to change the values of the female variable from 1,2 to 0,1. Dummy variables should be coded 0,1 rather than 1,2.
rename gender female
recode female (1=0)(2=1)
label define fm 1 female 0 male
label values female fm
codebook female
list female in 1/10
list female in 1/10, nolabel
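Because recode changes female in place, running the same recode command a second time would map the new 1s to 0 and leave every observation coded 0. A quick check after recoding, sketched below under the assumption that the commands above have been run exactly once, can catch this kind of mistake before moving on.

```stata
* female should now contain only 0, 1 or missing
assert inlist(female, 0, 1) | missing(female)
* Both categories should still have observations
tabulate female, missing
```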
F) Creating variables from other variables, generate and egen
The generate command creates new variables from existing variables through simple arithmetic or logical operations. Here we create a variable representing the total of four of the test score variables: read, write, math and science.
generate total = read + write + math + science
summarize total
Note that there are five missing values of total because there are five missing values of science.
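The missing values arise because arithmetic in generate returns missing whenever any operand is missing. The sketch below counts the affected observations and, for comparison, builds the sum with egen's rowtotal() function, which treats missing values as 0. It assumes total has been created as above; total_alt is a hypothetical temporary variable that is dropped at the end.

```stata
* Count the observations where science, and therefore total, is missing
count if missing(science)
count if missing(total)
* rowtotal() treats missing values as 0, so it returns a sum for every row
egen total_alt = rowtotal(read write math science)
list read write math science total total_alt if missing(total)
* Remove the temporary variable so it is not saved with the data
drop total_alt
```

Which behavior is appropriate depends on the analysis: a total over three tests is not comparable to a total over four, so leaving total missing is often the safer choice.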
Let’s now use recode to assign letter grades to ranges of the total score. For example, the rule (0/140=0 F) tells Stata to recode all values of total between 0 and 140 to 0, and then give the label “F” to the value 0. The recoded variable will be created as a new variable called grade.
recode total (0/140=0 F) (141/180=1 D) (181/210=2 C) (211/234=3 B) (235/300=4 A), gen(grade)
label variable grade "combined grades of read, write, math, science"
codebook grade
list read write math science total grade in 1/10
list read write math science total grade in 1/10, nolabel
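Values that match none of the recode rules are copied to the new variable unchanged, so it is worth checking that every nonmissing total landed in one of the five grades. The sketch below assumes grade was created as above; it only displays information and does not change the data.

```stata
* Minimum and maximum total within each grade should match the recode ranges
tabstat total, by(grade) statistics(min max n)
* grade should contain only the codes 0 through 4, or missing
assert inlist(grade, 0, 1, 2, 3, 4) | missing(grade)
* Observations with a missing total remain missing in grade
count if missing(grade)
```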
The Stata command egen, which stands for extended generate, is used to create variables that require some additional function in order to be generated. Examples of these functions include taking the mean, discretizing a continuous variable, and counting how many of a set of variables have missing values.
In our first example, we will use egen to create standard scores for the variable read.
egen zread = std(read)
summarize zread
list read zread in 1/10
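A standard score is the value minus the sample mean, divided by the sample standard deviation. To see what std() is doing, the same quantity can be computed by hand from the results that summarize leaves behind in r(mean) and r(sd). This is a sketch assuming zread was created as above; zread_check is a hypothetical temporary variable, and the tolerance of 1e-5 is an arbitrary choice to allow for rounding in float storage.

```stata
* summarize stores the mean and standard deviation in r(mean) and r(sd)
quietly summarize read
generate zread_check = (read - r(mean)) / r(sd)
* The two versions should agree up to rounding error
assert reldif(zread, zread_check) < 1e-5 if !missing(read)
* Remove the temporary variable so it is not saved with the data
drop zread_check
```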
Next we will use egen to create a variable that contains the mean of read for each level of ses.
egen readmean = mean(read), by(ses)
list read ses readmean in 1/10
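One way to check the new variable is to compare it with group means computed independently. In the sketch below, which assumes readmean was created as above, the two columns of the tabstat output should show the same mean within each level of ses. tabstat does not re-sort the data, so the observation order used by the list commands is preserved.

```stata
* Mean of read and of readmean within each level of ses; the columns should agree
tabstat read readmean, by(ses) statistics(mean n)
```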
Lastly, we will compute the average of several variables for each observation. Please note that there will be a mean for observation 9 even though it has a missing value for science, because rowmean() averages over the nonmissing values only.
egen row_mean = rowmean(read write math science)
list read write math science row_mean in 1/10
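To make the treatment of missing values visible, the sketch below counts the missing scores per observation with rowmiss() and contrasts row_mean with a mean computed by plain arithmetic, which is missing whenever any score is missing. It assumes row_mean was created as above; n_missing and mean_strict are hypothetical temporary variables that are dropped at the end.

```stata
* Number of missing values among the four scores in each observation
egen n_missing = rowmiss(read write math science)
* Arithmetic mean that is missing if any of the four scores is missing
generate mean_strict = (read + write + math + science) / 4
* Show only the observations where the two approaches differ
list read write math science row_mean mean_strict n_missing if n_missing > 0
* Remove the temporary variables so they are not saved with the data
drop n_missing mean_strict
```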
See help egen for a full list of the available functions.
Finally, we will save our data and continue on to the next unit.
save hs1
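Note that save will not overwrite an existing file unless told to. If hs1.dta already exists in the working directory, for example from an earlier run of these commands, the replace option is needed. A small sketch, assuming we do want to overwrite any earlier copy:

```stata
* Overwrite hs1.dta if it already exists in the working directory
save hs1, replace
* Describe the saved file on disk without loading it
describe using hs1
```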
3.0 For more information
- Data Management Using Stata: A Practical Handbook
  - Chapters 4-5
- Statistics with Stata 12
  - Chapter 2
- Gentle Introduction to Stata, Revised Third Edition
  - Chapter 3
- Data Analysis Using Stata, Third Edition
  - Chapter 5
- An Introduction to Stata for Health Researchers, Third Edition
  - Chapters 7-8
- Stata Learning Modules
  - Creating and recoding variables
  - How can I quickly convert many string variables into numeric variables?