| Command | Description |
| --- | --- |
| pwd | Show current directory (pwd = print working directory) |
| dir or ls | Show files in current directory |
| cd | Change directory |
| keep if | Keep observations if condition is met |
| keep | Keep variables or observations |
| drop | Drop variables or observations |
| drop if | Drop observations if condition is met |
| append | Append a data file to current file |
| sort | Sort observations |
| merge | Merge a data file with current file |
2.0 Demonstration and explanation
We begin with the dataset we created in the last section.
use hs1, clear
A) Use keep and drop with an if statement to subset observations
When the commands keep and drop are specified without variable names, they keep and drop observations according to some condition specified in an if statement. Suppose we wish to analyze 2 subsets of the hs1 data separately, males and females. We subset the dataset twice, reloading between, and save the datasets for use later.
    keep if female == 0
    count
    save hsmale, replace

    use hs1, clear
    keep if female == 1
    count
    save hsfemale, replace
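Note that the same subsetting would result from drop if with the complementary condition (for example, drop if female == 1 in place of keep if female == 0), provided female has no missing values.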
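The relationship between keep if and drop if can be checked directly. The following is a minimal sketch, assuming hs1.dta is in the working directory and that it is run from a do-file (the // comments are not accepted at the interactive command line). It uses preserve and restore so that the full dataset does not have to be reloaded between the two subsets.

```stata
* Sketch: keep if and drop if with complementary conditions select the same rows
use hs1, clear
count if missing(female)   // checks the assumption that female is never missing

preserve                   // take a snapshot of the full dataset
keep if female == 0
count                      // number of observations with female == 0
restore                    // return to the full dataset

drop if female == 1
count                      // matches the count above when female is never missing
```

If female did contain missing values, the two counts would differ, because keep if female == 0 discards missing values while drop if female == 1 retains them.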
B) Use keep and drop with variable names to remove variables from the dataset
Suppose that our data file had a very large number of variables, say 2000, but we care about only a handful of them: id, female, read and write. We can subset our dataset to keep just those variables as shown below.
    use hs1, clear
    keep id female read write
    save hskept, replace
    describe
    list in 1/20
When there are more variables to keep than to drop, it is usually shorter to list the variables to drop and use drop rather than keep.
    use hs1, clear
    drop female read write
    save hsdropped, replace
    describe
    list in 1/10
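Beyond reading the describe output by eye, one way to confirm which variables survive keep and drop is to test for them explicitly. This is a sketch only, assuming hs1.dta is in the working directory and that the commands are run from a do-file.

```stata
* Sketch: verify the result of keep and drop with variable names
use hs1, clear
keep id female read write
describe, short                        // should report 4 variables
confirm variable id female read write  // errors if any of these is missing

use hs1, clear
drop female read write
capture confirm variable female        // capture suppresses the error and stores its return code in _rc
display _rc                            // nonzero, because female was dropped
```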
C) Adding observations with append
Suppose we now wish to put the separate male and female files back together. We accomplish this by appending the datasets, that is, stacking the observations. If a variable appears in only one of the 2 datasets, it will have missing values for the observations from the dataset in which it did not originally appear. Below we first load the male dataset into memory, and then append the female dataset.
    use hsmale
    tabulate female
    append using hsfemale
    tabulate female
    save hsmasters, replace
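When stacking files it can help to record which file each observation came from. The sketch below uses the generate() option of append for that purpose. The variable name source is chosen here for illustration and is not part of the original example. It assumes hsmale.dta and hsfemale.dta were saved as in part A and that the commands are run from a do-file.

```stata
* Sketch: append while tracking the source file of each observation
use hsmale, clear
append using hsfemale, generate(source)  // source = 0 for rows from hsmale, 1 for rows from hsfemale
tabulate source female                   // each source file should hold only one value of female
```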
D) Adding variables with merge
At times we will need to merge variables from 2 different datasets together. The merge command requires an ID variable to match observations. Stata allows 1:1 merges, where each file has each ID represented once, as well as 1:many merges, where one file has each ID represented once while the other has each ID represented multiple times. A tracking variable, _merge, is created upon merging that describes the source of each observation: matched in both datasets or found in only one of the 2 datasets. Below, we perform a 1:1 merge matched on the id variable between the hskept and hsdropped datasets we created earlier.
    use hskept, clear
    list
    merge 1:1 id using hsdropped
    tab _merge
    list
    save hsmerged, replace
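After a merge it is common to check the result before going on. The following is a minimal sketch of such checks, assuming hskept.dta and hsdropped.dta were created as in part B and that the commands are run from a do-file. It is one possible workflow, not the only one.

```stata
* Sketch: validate a 1:1 merge using the _merge tracking variable
use hskept, clear
isid id                       // errors unless id uniquely identifies observations
merge 1:1 id using hsdropped
tabulate _merge               // 1 = master only, 2 = using only, 3 = matched
assert _merge == 3            // stops with an error if any observation failed to match
drop _merge                   // remove the tracking variable before any later merge
```

Dropping _merge matters because a subsequent merge will refuse to run if a variable named _merge already exists in the data.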
3.0 For more information
- Data Management Using Stata: A Practical Handbook
  - Chapters 6-8
- Statistics with Stata 12
  - Chapter 2
- Gentle Introduction to Stata, Revised Third Edition
  - Chapter 3
- Data Analysis Using Stata, Third Edition
  - Chapter 11
- An Introduction to Stata for Health Researchers, Third Edition
  - Chapter 9
- Stata Learning Modules
- Stata Frequently Asked Questions