This module will illustrate how you can combine files in Stata. Examples will include appending files, one to one match merging, and one to many match merging.
Appending data files
When you have two data files, you may want to combine them by stacking them one on top of the other. For example, we have a file containing dads and a file containing moms as shown below.
input famid str4 name inc 2 "Art" 22000 1 "Bill" 30000 3 "Paul" 25000 end save dads, replace listfamid name inc 1. 2 Art 22000 2. 1 Bill 30000 3. 3 Paul 25000clear input famid str4 name inc 1 "Bess" 15000 3 "Pat" 50000 2 "Amy" 18000 end save moms, replace listfamid name inc 1. 1 Bess 15000 2. 3 Pat 50000 3. 2 Amy 18000
If we wanted to combine these files by stacking them one atop the other, we can use the append command as shown below.
use dads, clearappend using moms
We can use the list command to see if this worked correctly.
listfamid name inc 1. 2 Art 22000 2. 1 Bill 30000 3. 3 Paul 25000 4. 1 Bess 15000 5. 3 Pat 50000 6. 2 Amy 18000
The append worked properly… the dads and moms are stacked together in one file. But, there is a little problem. We can’t tell the dads from the moms. Let’s try doing this again, but first we will create a variable called momdad in the dads and moms data file which will contain dad for the dads data file and mom for the moms data file. When we combine the two files together, the momdad variable will tell us who the moms and dads are.
Here we make momdad variable for the dads data file. We save the file calling it dads1.
use dads, cleargenerate str3 momdad = "dad"save dads1file dads1.dta saved
Here we make momdad variable for the moms data file. We save the file calling it moms1.
use moms, cleargenerate str3 momdad = "mom"save moms1file moms1.dta saved
Now, let’s append dads1 and moms1 together.
use dads1, clearappend using moms1
Now, when we list the data the momdad variable shows who the moms and dads are.
listfamid name inc momdad 1. 2 Art 22000 dad 2. 1 Bill 30000 dad 3. 3 Paul 25000 dad 4. 1 Bess 15000 mom 5. 3 Pat 50000 mom 6. 2 Amy 18000 mom
Match merging
Another way of combining data files is match merging. Say that we wanted to combine the dads with the faminc data file, having the dads information and the family information side by side. We can do this with a match merge.
Let’s have a look at the dads and faminc file.
use dads, clearlistfamid name inc 1. 2 Art 22000 2. 1 Bill 30000 3. 3 Paul 25000clear input famid faminc96 faminc97 faminc98 3 75000 76000 77000 1 40000 40500 41000 2 45000 45400 45800 end save faminc, replace listfamid faminc96 faminc97 faminc98 1. 3 75000 76000 77000 2. 1 40000 40500 41000 3. 2 45000 45400 45800
We want to combine the data files so they look like this.
famid name inc faminc96 faminc97 faminc98 1 Bill 30000 40000 40500 41000 2 Art 22000 45000 45400 45800 3 Paul 25000 75000 76000 77000
Notice that the famid variable is used to associate the observation from the
dads file with the appropriate observation from the faminc file. The strategy for merging the files goes like this.
1. sort
dads on famid and save that file (calling it dads2).
2. sort
faminc on famid and save that file (calling it faminc2).
3. use the
dads2 file.
4. merge the dads2 file with the faminc2 file using
famid to match them.
Here are those four steps.
1. Sort the dads file by famid and save it as dads2
use dads, clearsort famidsave dads2file dads2.dta saved
2. Sort the faminc file by famid and save it as faminc2.
use faminc, clearsort famidsave faminc2file faminc2.dta saved
3. Use the dads2 file
use dads2, clear
4. Merge with the faminc2 file using famid as the key variable.
merge famid using faminc2
It seems like this worked just fine, but what is that _merge variable?
list, nodisplay noobs
famid name inc faminc96 faminc97 faminc98 _merge 1 Bill 30000 40000 40500 41000 3 2 Art 22000 45000 45400 45800 3 3 Paul 25000 75000 76000 77000 3
The _merge variable indicates, for each observation, how the merge went. This is useful for identifying mismatched records.
_merge can have
one of three values
1 – The record contains information from file1 only (e.g., a
dad2 record with no corresponding faminc2 record.
2 – The record contains information from file2 only (e.g., a
faminc2 record with no corresponding dad2 record.
3 – The record contains information from both files (e.g., the
dad2 and faminc2 records matched up).
When you have many records, tabulating _merge is very useful to summarize how many mismatched you have. In our case, all of the records match so the value for _merge was always 3.
tabulate _merge_merge | Freq. Percent Cum. ------------+----------------------------------- 3 | 3 100.00 100.00 ------------+----------------------------------- Total | 3 100.00
One-to-many match merging
Another kind of merge is called a one to many merge. Our one to one merge matched up dads and faminc and there was a one to one matching of the files. If we merge dads with kids, there can be multiple kids per dad and hence this is a one to many merge.
As you see below, the strategy for the one to many merge is really the same as the one to one merge.
1. sort dads on famid and save that file as dads3
2. sort
kids on famid and save that file as kids3
3. use the
dads3 file
4. merge the dads3 file with the kids3 file using
famid to match them.
The 4 steps are shown below.
1. Sort the dads data file on famid and save that file as dads3.
use dads, clearsort famidsave dads3file dads3.dta savedlistfamid name inc 1. 1 Bill 30000 2. 2 Art 22000 3. 3 Paul 25000
2. Sort the kids data file on famid and save that file as kids3.
clear input famid str4 kidname birth age wt str1 sex 1 "Beth" 1 9 60 "f" 2 "Andy" 1 8 40 "m" 3 "Pete" 1 6 20 "f" 1 "Bob" 2 6 80 "m" 1 "Barb" 3 3 50 "m" 2 "Al" 2 6 20 "f" 2 "Ann" 3 2 60 "m" 3 "Pam" 2 4 40 "f" 3 "Phil" 3 2 20 "m" end sort famid save kids3 file kids3.dta saved list famid kidname birth age wt sex 1. 1 Beth 1 9 60 f 2. 1 Bob 2 6 40 m 3. 1 Barb 3 3 20 f 4. 2 Andy 1 8 80 m 5. 2 Al 2 6 50 m 6. 2 Ann 3 2 20 f 7. 3 Pete 1 6 60 m 8. 3 Pam 2 4 40 f 9. 3 Phil 3 2 20 m
3. Use the dads3 file.
use dads3, clear
4. Merge the dads3 file with the kids3 file using famid to match them.
merge famid using kids3
Let’s list out the results.
list famid name kidname birth age _mergefamid name kidname birth age _merge 1. 1 Bill Barb 3 3 3 2. 2 Art Al 2 6 3 3. 3 Paul Pam 2 4 3 4. 1 Bill Bob 2 6 3 5. 1 Bill Beth 1 9 3 6. 2 Art Andy 1 8 3 7. 2 Art Ann 3 2 3 8. 3 Paul Phil 3 2 3 9. 3 Paul Pete 1 6 3
The results are a bit easier to read if we sort the data on famid and birth.
sort famid birthlist famid name kidname birth age _mergefamid name kidname birth age _merge 1. 1 Bill Beth 1 9 3 2. 1 Bill Bob 2 6 3 3. 1 Bill Barb 3 3 3 4. 2 Art Andy 1 8 3 5. 2 Art Al 2 6 3 6. 2 Art Ann 3 2 3 7. 3 Paul Pete 1 6 3 8. 3 Paul Pam 2 4 3 9. 3 Paul Phil 3 2 3
As you see, this is basically the same as a one to one merge. You may wonder if the order of the files on the merge statement is relevant. Here, we switch the order of the files and the results are the same. The only difference is the order of the records after the merge.
use kids3, clearmerge famid using dads3list famid name kidname birth agefamid name kidname birth age 1. 1 Bill Beth 1 9 2. 1 Bill Bob 2 6 3. 1 Bill Barb 3 3 4. 2 Art Andy 1 8 5. 2 Art Al 2 6 6. 2 Art Ann 3 2 7. 3 Paul Pete 1 6 8. 3 Paul Pam 2 4 9. 3 Paul Phil 3 2
Summary
Appending data example
use dads, clear append using moms
Match merge example steps (one-to-one and one-to-many)
1. sort dads on famid and save that file 2. sort kids on famid and save that file 3. use the dads file 4. merge the dads file with the kids file using famid to match them.
Match merge example program
use dads, clear sort famid save dads2 use faminc, clear sort famid save faminc2 use dads2, clear merge famid using faminc2