This module shows how you can subset data in Stata. You can subset data by keeping or dropping variables, and you can subset data by keeping or dropping observations. You can also subset data as you use a data file if you are trying to read a file that is too big to fit into the memory on your computer.
Keeping and dropping variables
Sometimes you do not want all of the variables in a data file. You can use the keep and drop commands to subset variables.
If we think of your data like a spreadsheet, this section will show how you can remove columns (variables) from your data. Let’s illustrate this with the auto data file.
sysuse auto
We can use the describe command to see its variables.
describe Contains data from C:Program FilesStata10adobase/a/auto.dta obs: 74 1978 Automobile Data vars: 12 13 Apr 2007 17:45 size: 3,478 (99.7% of memory free) (_dta has notes) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- make str18 %-18s Make and Model price int %8.0gc Price mpg int %8.0g Mileage (mpg) rep78 int %8.0g Repair Record 1978 headroom float %6.1f Headroom (in.) trunk int %8.0g Trunk space (cu. ft.) weight int %8.0gc Weight (lbs.) length int %8.0g Length (in.) turn int %8.0g Turn Circle (ft.) displacement int %8.0g Displacement (cu. in.) gear_ratio float %6.2f Gear Ratio foreign byte %8.0g origin Car type ------------------------------------------------------------------------------- Sorted by: foreign
Suppose we want to just have make mpg and price, we can keep just those variables, as shown below.
keep make mpg price
If we issue the describe command again, we see that indeed those are the only variables left.
describe Contains data from C:Program FilesStata10adobase/a/auto.dta obs: 74 1978 Automobile Data vars: 3 13 Apr 2007 17:45 size: 1,924 (99.8% of memory free) (_dta has notes) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- make str18 %-18s Make and Model price int %8.0gc Price mpg int %8.0g Mileage (mpg) ------------------------------------------------------------------------------- Sorted by: Note: dataset has changed since last saved
Remember, this has not changed the file on disk, but only the copy we have in memory. If we saved this file calling it auto, it would mean that we would replace the existing file (with all the variables) with this file which just has make, mpg and price. In effect, we would permanently lose all of the other variables in the data file. It is important to be careful when using the save command after you have eliminated variables, and it is recommended that you save such files to a file with a new name, e.g., save auto2.
Let’s show how to use the drop command to drop variables. First, let’s clear out the data in memory and use the auto data file.
sysuse auto, clear
perhaps we are not interested in the variables displ and gear_ratio. We can get rid of them using the drop command shown below.
drop displ gear_ratio
Again, using describe shows that the variables have been eliminated.
describe Contains data from C:Program FilesStata10adobase/a/auto.dta obs: 74 1978 Automobile Data vars: 10 13 Apr 2007 17:45 size: 3,034 (99.7% of memory free) (_dta has notes) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- make str18 %-18s Make and Model price int %8.0gc Price mpg int %8.0g Mileage (mpg) rep78 int %8.0g Repair Record 1978 headroom float %6.1f Headroom (in.) trunk int %8.0g Trunk space (cu. ft.) weight int %8.0gc Weight (lbs.) length int %8.0g Length (in.) turn int %8.0g Turn Circle (ft.) foreign byte %8.0g origin Car type ------------------------------------------------------------------------------- Sorted by: foreign Note: dataset has changed since last save
If we wanted to make this change permanent, we could save the file as auto2.dta as shown below.
save auto2 file auto2.dta saved
Keeping and dropping observations
The above showed how to use keep and drop variables to eliminate variables from your data file. The keep if and drop if commands can be used to keep and drop observations. Thinking of your data like a spreadsheet, the keep if and drop if commands can be used to eliminate rows of your data.
Let’s illustrate this with the auto data. Let’s use the auto file and clear out the data currently in memory.
sysuse auto , clear
The variable rep78 has values 1 to 5, and also has some missing values, as shown below.
tabulate rep78 , missingRepair | Record 1978 | Freq. Percent Cum. ------------+----------------------------------- 1 | 2 2.70 2.70 2 | 8 10.81 13.51 3 | 30 40.54 54.05 4 | 18 24.32 78.38 5 | 11 14.86 93.24 . | 5 6.76 100.00 ------------+----------------------------------- Total | 74 100.00
We may want to eliminate the observations which have missing values using drop if as shown below. The portion after the drop if specifies which observations that should be eliminated.
drop if missing(rep78) (5 observations deleted)
Using the tabulate command again shows that these observations have been eliminated.
tabulate rep78 , missing rep78 | Freq. Percent Cum. ------------+----------------------------------- 1 | 2 2.90 2.90 2 | 8 11.59 14.49 3 | 30 43.48 57.97 4 | 18 26.09 84.06 5 | 11 15.94 100.00 ------------+----------------------------------- Total | 69 100.00
We could make this change permanent by using the save command to save the file.
Let’s illustrate using keep if to eliminate observations. First let’s clear out the current file and use the auto data file.
sysuse auto , clear
The keep if command can be used to eliminate observations, except that the part after the keep if specifies which observations should be kept. Suppose we want to keep just the cars which had a repair rating of 3 or less. The easiest way to do this would be using the keep if command, as shown below.
keep if (rep78 <= 3) (34 observations deleted)
The tabulate command shows that this was successful.
tabulate rep78, missing rep78 | Freq. Percent Cum. ------------+----------------------------------- 1 | 2 5.00 5.00 2 | 8 20.00 25.00 3 | 30 75.00 100.00 ------------+----------------------------------- Total | 40 100.00
Before we go on to the next section, let’s clear out the data that is currently in memory.
clear
Selecting variables and observations with "use"
The above sections showed how to use keep, drop, keep if, and drop if for eliminating variables and observations. Sometimes, you may want to use a data file which is bigger than you can fit into memory and you would wish to eliminate variables and/or observations as you use the file. This is illustrated below with the auto data file.
Selecting variables. You can specify just the variables you wish to bring in on the use command. For example, let’s use the auto data file with just make price and mpg.
use make price mpg using http://www.stata-press.com/data/r10/auto
The describe command shows us that this worked.
describe Contains data from http://www.stata-press.com/data/r10/auto.dta obs: 74 1978 Automobile Data vars: 3 13 Apr 2007 17:45 size: 1,924 (99.8% of memory free) (_dta has notes) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- make str18 %-18s Make and Model price int %8.0gc Price mpg int %8.0g Mileage (mpg) ------------------------------------------------------------------------------- Sorted by:
Let’s clear out the data before the next example.
clear
Suppose we want to just bring in the observations where rep78 is 3 or less. We can do this as shown below.
use http://www.stata-press.com/data/r10/auto if (rep78 <= 3)
We can use tabulate to double check that this worked.
tabulate rep78, missing rep78 | Freq. Percent Cum. ------------+----------------------------------- 1 | 2 5.00 5.00 2 | 8 20.00 25.00 3 | 30 75.00 100.00 ------------+----------------------------------- Total | 40 100.00
Let’s clear out the data before the next example.
clear
Let’s show another example. Lets read in just the cars that had a rating of 4 or higher.
use http://www.stata-press.com/data/r10/auto if (rep78 >= 4) & (rep78 <.)
Let’s check this using the tabulate command.
tabulate rep78, missing rep78 | Freq. Percent Cum. ------------+----------------------------------- 4 | 18 62.07 62.07 5 | 11 37.93 100.00 ------------+----------------------------------- Total | 29 100.00
Let’s clear out the data before the next example.
clear
You can both eliminate variables and observations with the use command. Let’s read in just make mpg price and rep78 for the cars with a repair record of 3 or lower.
use make mpg price rep78 if (rep78 <= 3) using http://www.stata-press.com/data/r10/auto
Let’s check this using describe and tabulate.
describe Contains data from http://www.stata-press.com/data/r10/auto.dta obs: 40 1978 Automobile Data vars: 4 13 Apr 2007 17:45 size: 1,120 (99.9% of memory free) (_dta has notes) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- make str18 %-18s Make and Model price int %8.0gc Price mpg int %8.0g Mileage (mpg) rep78 int %8.0g Repair Record 1978 ------------------------------------------------------------------------------- Sorted by:tabulate rep78 rep78 | Freq. Percent Cum. ------------+----------------------------------- 1 | 2 5.00 5.00 2 | 8 20.00 25.00 3 | 30 75.00 100.00 ------------+----------------------------------- Total | 40 100.00
Let’s clear out the data before the next example.
clear
Note that the ordering of if and using is arbitrary.
use make mpg price rep78 using http://www.stata-press.com/data/r10/auto if (rep78 <= 3)
Let’s check this using describe and tabulate.
describe Contains data from http://www.stata-press.com/data/r10/auto.dta obs: 40 1978 Automobile Data vars: 4 13 Apr 2007 17:45 size: 1,120 (99.9% of memory free) (_dta has notes) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- make str18 %-18s Make and Model price int %8.0gc Price mpg int %8.0g Mileage (mpg) rep78 int %8.0g Repair Record 1978 ------------------------------------------------------------------------------- Sorted by:tabulate rep78 rep78 | Freq. Percent Cum. ------------+----------------------------------- 1 | 2 5.00 5.00 2 | 8 20.00 25.00 3 | 30 75.00 100.00 ------------+----------------------------------- Total | 40 100.00
Have a look at this command. Do you think it will work?
use make mpg if (rep78 <= 3) using http://www.stata-press.com/data/r10/auto rep78 not found r(111);
You see, rep78 was not one of the variables read in, so it could not be used in the if portion. To use a variable in the if portion, it has to be one of the variables that is read in.
Summary
Using keep/drop to eliminate variables
keep make price mpgdrop displ gear_ratio
Using keep if/drop if to eliminate observations
drop if missing(rep78)keep if (rep78 <= 3)
Eliminating variables and/or observations with use
use make mpg price rep78 using autouse auto if (rep78 <= 3)
use make mpg price rep78 using auto if (rep78 <= 3)