Stata Class Notes: Modifying Data

codebook	Show codebook information for file
order	Order the variables in a data set
label data	Apply a label to a data set
label variable	Apply a label to a variable
label define	Define value labels for a categorical variable
label values	Apply value labels to a variable
encode	Create numeric version of a string variable
list	List the observations
rename	Rename a variable
recode	Recode the values of a variable
notes	Apply notes to the data file
generate	Create a new variable
replace	Replace values of an existing variable
egen	Extended generate: special functions for use when creating a new variable

2.0 Demonstration and explanation

In this section we will use Stata commands to label and transform variables, and to create new variables that are functions of existing variables. We first load the data and use codebook to look at all variables, including labeling information.

use https://stats.idre.ucla.edu/stat/data/hs0, clear
codebook

A) Use order to control the ordering of variables as columns in the dataset

Several orderings of the variables would be logical. Here we put the id variable first, followed by the variables describing the students (gender, race, ses and prgtype). The last columns will then contain the test scores.

order id gender race ses prgtype, first
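The first option is only one of the placement options that order accepts. As a minimal sketch on a toy dataset (the variable names and values below are made up for illustration and are not part of hs0), the last and after() options can be combined with it:

```stata
* Toy data; variable names are hypothetical
clear
set obs 3
generate score2 = _n
generate id     = _n
generate score1 = _n * 10
generate ses    = 1

* id becomes the first column
order id, first
* ses goes immediately after id
order ses, after(id)
* the score variables go to the end
order score1 score2, last
* show variable names in their current order
describe, simple
```

Running order several times like this is often easier to read than one long variable list, because each line states a single placement decision.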

B) Use label data to describe the dataset and label variable to give variable names more meaning.

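To remember the contents of a dataset, we can apply a label to it and attach notes with the notes command. A note can be attached to the dataset as a whole or, as below, to a single variable. Typing notes by itself lists all notes.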

label data "High School and Beyond"
notes id:  anonymous id
notes
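For comparison, here is a small sketch that attaches one note to the dataset itself and one to a variable. The dataset label and note text are invented for the example; _dta is Stata's name for the dataset as a whole.

```stata
* Toy dataset; label and note text are made up for illustration
clear
set obs 2
generate id = _n

label data "Toy example dataset"
* a note attached to the dataset as a whole
notes _dta: created for a class demonstration
* a note attached to a single variable
notes id: anonymous id

* describe shows the data label; notes lists every note
describe
notes
```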

Short variable names are desirable to keep coding clean, but may obscure what the variable represents. Variable labels allow us to provide a longer description of the variable’s contents.

label variable schtyp "type of school"

C) Use label define to create a set of value labels, and then use label values to apply the value labels to a variable

All variables that undergo any kind of numerical calculation must have a numerical representation in Stata. Categorical variables should thus use numbers to define the categories and value labels to give meaning to the numbers. Below we create a set of value labels called “scl”, which gives meaning to the values 1 and 2. We then apply the “scl” label to the schtyp variable.

codebook schtyp                    /*check for value labels first */
label define scl 1 public 2 private
label values schtyp scl
codebook schtyp
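A set of value labels can also be extended or changed after it has been defined. The sketch below uses a toy variable with a third category; the category "other" and the revised label text are assumptions made for the example, not values found in hs0.

```stata
* Toy data with a third, hypothetical category
clear
input schtyp
1
2
3
end

label define scl 1 "public" 2 "private"
label values schtyp scl
* value 3 has no label yet, so it is displayed as the number 3
tabulate schtyp

* add extends the label set with a new value
label define scl 3 "other", add
* modify changes the text of an existing label
label define scl 1 "public school", modify
label list scl
```

Without the add or modify option, label define refuses to touch a label set that already exists, which protects against overwriting labels by accident.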

Value labels are typically displayed in output in place of the numeric values. Many commands accept the nolabel option, which shows the underlying numbers instead.

list schtyp in 1/10
list schtyp in 1/10, nolabel

D) The encode command will convert a string variable to numeric and will label its values automatically

Because the variable prgtype is a string variable, it cannot be used in any commands requiring numerical calculations. We use encode to create a numeric version of this variable, called prog, which will use the string values of prgtype as the value labels.

encode prgtype, gen(prog)
label variable prog "type of program"
codebook prog
list prog in 1/10
list prog in 1/10, nolabel
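To see how encode assigns its numeric codes, it can help to run it on a few rows. In this sketch the string values are typed in by hand (they are illustrative, not read from hs0). The codes are assigned in alphabetical order of the strings, and decode reverses the conversion.

```stata
* Toy string variable; values are typed in for illustration
clear
input str10 prgtype
"vocation"
"academic"
"general"
"academic"
end

encode prgtype, generate(prog)
* codes follow alphabetical order: 1 academic, 2 general, 3 vocation
label list prog
* decode goes back from the labeled numeric variable to a string
decode prog, generate(prgtype2)
list prgtype prog prgtype2, nolabel
```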

E) Renaming and recoding variables

The variable gender may give us trouble in the future because it is difficult to know what the 1s and 2s mean. Consider naming dummy (indicator) variables after the category signified by the value 1. Below we use rename to change the name gender to female, so that female=1 indicates a female student. We then use recode to change the values of the renamed variable from 1,2 to 0,1. Dummy variables should always be valued 0,1 rather than 1,2.

rename gender female
recode female (1=0)(2=1)
label define fm 1 female 0 male
label values female fm
codebook female
list female in 1/10
list female in 1/10, nolabel
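A more cautious variant, sketched here on toy data, leaves the original variable untouched and writes the recoded values to a new variable with the generate() option of recode. A cross-tabulation then confirms that the mapping is the intended one. The five values below are made up for illustration.

```stata
* Toy data, including one missing value
clear
input gender
1
2
2
1
.
end

* keep gender as it is and store the recoded values in female
recode gender (1=0) (2=1), generate(female)
label define fm 0 "male" 1 "female"
label values female fm
* every 1 should map to male, every 2 to female, missing stays missing
tabulate gender female, missing
```

Once the check passes, the original variable can be dropped. The advantage over recoding in place is that running the do-file a second time cannot recode already recoded values.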

F) Creating variables from other variables, generate and egen

The generate command creates new variables from existing variables through simple arithmetic or logical operations. Here we create a variable representing the total of four test score variables: read, write, math and science.

generate total = read + write + math + science
summarize total

Note that there are five missing values of total because there are five missing values of science.
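The way a missing score propagates through generate can be seen on two toy rows (the scores are invented for the example). For contrast, the sketch also uses the egen functions rowtotal(), which treats missing values as 0, and rownonmiss(), which counts the nonmissing values in each row.

```stata
* Toy scores; the second row has a missing science score
clear
input read write math science
50 60 55 45
40 52 48  .
end

* arithmetic with a missing value yields a missing result
generate total = read + write + math + science
* rowtotal() treats missing values as 0
egen total_nm = rowtotal(read write math science)
* number of nonmissing scores in each row
egen n_scores = rownonmiss(read write math science)
list
```

Whether a total over three scores is comparable to a total over four is a substantive decision, so the missing result from generate is often the safer default.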

Let’s now use recode to assign letter grades to ranges of the total score. For example, the rule (0/140=0 F) tells Stata to recode all values of total from 0 through 140 to 0, and then give the label “F” to the value 0. Because of the gen option, the recoded values are stored in a new variable called grade and total is left unchanged.

recode total (0/140=0 F) (141/180=1 D) (181/210=2 C) (211/234=3 B) (235/300=4 A), gen(grade)
label variable grade "combined grades of read, write, math, science"
codebook grade
list read write math science total grade in 1/10
list read write math science total grade in 1/10, nolabel
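It is worth checking how the range rules treat values at the cutpoints and missing values. The sketch below applies the same rules to a handful of made-up totals chosen to sit on the boundaries, then summarizes the range of total within each grade.

```stata
* Toy totals placed at the cutpoints, plus one missing value
clear
input total
120
140
141
210
234
250
.
end

recode total (0/140=0 F) (141/180=1 D) (181/210=2 C) (211/234=3 B) (235/300=4 A), generate(grade)
* smallest and largest total within each grade
tabstat total, by(grade) statistics(min max n)
* missing totals stay missing in grade
list, nolabel
```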

The Stata command egen, short for extended generate, is used to create variables that require some additional function in order to be generated. Examples of these functions include taking the mean, discretizing a continuous variable, and counting how many of a set of variables have missing values.

In our first example, we will use egen to create standard scores for the variable read.

egen zread = std(read)
summarize zread
list read zread in 1/10
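What std() computes can be reproduced by hand, which is a useful check on one's understanding. This sketch, on four invented scores, subtracts the mean and divides by the standard deviation using the results that summarize leaves in r(mean) and r(sd).

```stata
* Toy reading scores
clear
input read
40
50
60
70
end

egen zread = std(read)
* the same calculation by hand: (x - mean) / sd
quietly summarize read
generate zmanual = (read - r(mean)) / r(sd)
* both versions should have mean 0 and standard deviation 1
summarize zread zmanual
```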

Next we will use egen to create a variable that contains the mean of read for each level of ses. Every observation receives the mean of its own ses group.

egen readmean = mean(read), by(ses)
list read ses readmean in 1/10
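The same group means can be requested with the by prefix instead of the by() option. The sketch below does this on five made-up observations and then uses the group mean to compute each student's deviation from it, a common reason for creating such a variable. The name read_dev is hypothetical.

```stata
* Toy data: two ses groups
clear
input ses read
1 40
1 50
2 60
2 70
2 80
end

* prefix form; bysort sorts the data by ses first
bysort ses: egen readmean = mean(read)
* deviation of each student from the mean of their ses group
generate read_dev = read - readmean
list, sepby(ses)
```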

Finally, we will compute the average of several variables for each observation. Note that rowmean() ignores missing values, so observation 9 still gets a mean even though its science score is missing; that mean is taken over its nonmissing scores.

egen row_mean = rowmean(read write math science)
list read write math science row_mean in 1/10
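The difference between rowmean() and an average computed with generate shows up as soon as a score is missing. In this sketch on two invented rows, rowmiss() is added to flag how many scores each row mean is missing.

```stata
* Toy scores; the second row has a missing science score
clear
input read write math science
50 60 55 45
40 50 60  .
end

* mean over the nonmissing scores in each row
egen row_mean = rowmean(read write math science)
* arithmetic with generate returns missing when any score is missing
generate avg = (read + write + math + science) / 4
* count of missing scores per row
egen n_miss = rowmiss(read write math science)
list
```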

See help egen for a full list of the available functions.

Finally, we will save our data and continue on to the next unit.

save hs1

3.0 For more information

Data Management Using Stata: A Practical Handbook
- Chapters 4-5
Statistics with Stata 12
- Chapter 2
Gentle Introduction to Stata, Revised Third Edition
- Chapter 3
Data Analysis Using Stata, Third Edition
- Chapter 5
An Introduction to Stata for Health Researchers, Third Edition
- Chapters 7-8
Stata Learning Modules
- Labeling data
- Creating and recoding variables
Stata Frequently Asked Questions
- How can I quickly convert many string variables into numeric variables?
- How can I quickly recode continuous variables into groups?
- How do I standardize variables in Stata?