Stata Class Notes: Managing Data

1.0 Stata commands in this unit

pwd	Show current directory (pwd = print working directory)
dir or ls	Show files in current directory
cd	Change directory
keep if	Keep observations if condition is met
keep	Keep variables or observations
drop	Drop variables or observations
drop if	Drop observations if condition is met
append	Append a data file to current file
sort	Sort observations
merge	Merge a data file with current file

2.0 Demonstration and explanation

We begin with the dataset we created in the last section.

use hs1, clear

A) Use keep and drop with an if statement to subset observations

When the commands keep and drop are followed by an if condition rather than by variable names, they keep or drop the observations that satisfy that condition. Suppose we wish to analyze two subsets of the hs1 data separately, males (female == 0) and females (female == 1). We subset the dataset twice, reloading hs1 in between because keep removes the other observations from the data in memory, and save each subset for later use.

keep if female == 0
count
save hsmale, replace
use hs1, clear
keep if female == 1
count
save hsfemale, replace

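Note that the same subsets would result from drop with the opposite condition: drop if female == 1 leaves the males, and drop if female == 0 leaves the females. The two approaches agree only when female has no missing values, because keep if discards observations with a missing value while drop if retains them.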
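
As a minimal sketch of subsetting with drop rather than keep, the lines below rebuild the male subset from hs1. The check for missing values is an addition to these notes, not part of the original demonstration; if it reports anything other than 0, drop if female == 1 and keep if female == 0 will not give the same result.

* Sketch only: rebuild the male subset with drop instead of keep
use hs1, clear
* Added check: how many observations have female missing?
count if missing(female)
* Remove the females; what remains should match hsmale
drop if female == 1
count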

B) Use keep and drop with variable names to remove variables from the dataset

Suppose that our data file had a very large number of variables, say 2000, but we care about only a handful of them: id, female, read and write. We can subset the dataset to keep just those variables as shown below.

use hs1, clear
keep id female read write
save hskept, replace
describe
list in 1/20
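
When a file is large, it can be convenient to take the subset while loading rather than afterwards. The sketch below is an addition to these notes and relies on the form of use that accepts a variable list and an if condition before using; it loads only the four variables of interest, and only for females, without first reading the whole of hs1 into memory.

* Sketch only: select variables and observations at load time
use id female read write if female == 1 using hs1, clear
describe
count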

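When there are more variables to keep than to drop, it is usually easier to list the variables to drop instead. Below we drop female, read and write but retain id, so that this file can be matched to hskept later.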

use hs1, clear
drop female read write
save hsdropped, replace
describe
list in 1/10

C) Adding observations with append

Suppose we now wish to put the separate male and female files back together. We accomplish this by appending the datasets, that is, stacking the observations of one file below those of the other. If a variable appears in only one of the two datasets, it will have missing values for the observations that came from the dataset in which it did not appear. Below we first load the male dataset into memory, and then append the female dataset.

use hsmale, clear
tabulate female
append using hsfemale
tabulate female
save hsmasters, replace
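
The following sketch, which is an addition to these notes, shows one way to record where each appended observation came from and then reorder the stacked file. It uses the generate() option of append; the variable name source is hypothetical and assumes no variable of that name already exists in either file. The new variable is 0 for observations that were in memory (hsmale) and 1 for those appended from hsfemale.

* Sketch only: tag the origin of each observation while appending
use hsmale, clear
append using hsfemale, generate(source)
* Every male should have source 0 and every female source 1
tabulate source female
* The stacked file lists all males first; sort by id to interleave them
sort id
list id female source in 1/10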

D) Adding variables with merge

At times we will need to merge variables from two different datasets together. The merge command requires one or more ID variables on which to match observations. Stata allows 1:1 merges, where each ID appears once in each file, as well as 1:many merges (written 1:m or m:1), where each ID appears once in one file and multiple times in the other. Merging creates a tracking variable, _merge, that describes the source of each observation: 1 if it was found only in the dataset in memory (the master), 2 if it was found only in the dataset named after using, and 3 if it was matched in both. Below, we perform a 1:1 merge on the id variable between the hskept and hsdropped datasets we created earlier.

use hskept, clear
list
merge 1:1 id using hsdropped
tab _merge
list
save hsmerged, replace
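
Because hskept and hsdropped were both cut from hs1, every id should match. The sketch below is an addition to these notes and shows one way to make that expectation explicit: isid stops with an error if id does not uniquely identify the observations, and assert stops with an error if any observation failed to match. Dropping _merge afterwards leaves the file ready for a further merge, since merge will not overwrite an existing _merge variable.

* Sketch only: the checks below are additions to the original demonstration
use hskept, clear
* Confirm that id uniquely identifies observations before a 1:1 merge
isid id
merge 1:1 id using hsdropped
* Every observation should be matched in both files (_merge == 3)
assert _merge == 3
drop _merge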

3.0 For more information

Data Management Using Stata: A Practical Handbook
- Chapters 6-8
Statistics with Stata 12
- Chapter 2
Gentle Introduction to Stata, Revised Third Edition
- Chapter 3
Data Analysis Using Stata, Third Edition
- Chapter 11
An Introduction to Stata for Health Researchers, Third Edition
- Chapter 9
Stata Learning Modules
- Subsetting variables and observations
- Combining Stata data files
Stata Frequently Asked Questions
- How can I merge multiple files in Stata?