1. Introduction
When you have two data files, you can combine them by merging them side by side, matching up observations based on an identifier. For example, below we have a data file containing information on dads and we have a file containing information on family income called faminc. We would like to match merge the files together so we have the dads observation on the same line with the faminc observation based on the key variable famid.
dads famid name inc 2 Art 22000 1 Bill 30000 3 Paul 25000faminc famid faminc96 faminc97 faminc98 3 75000 76000 77000 1 40000 40500 41000 2 45000 45400 45800
After match merging the files, they would look like this.
famid name inc faminc96 faminc97 faminc98 1 Bill 30000 40000 40500 41000 2 Art 22000 45000 45400 45800 3 Paul 25000 75000 76000 77000
2. One-to-one merge
There are three steps to match merge the dads file with the faminc file (this is called a one-to-one merge because there is a one to one correspondence between the dads and faminc records). These three steps are illustrated in the SAS program merge1.sas below.
- Use proc sort to sort dads on famid and save that file (we will call it dads2)
- Use proc sort to sort faminc on famid and save that file (we will call it faminc2)
- merge the dads2 and faminc2 files based on famid
These three steps are illustrated in the program below.
* We first created the dads and faminc data files below ; DATA dads; INPUT famid name $ inc ; CARDS; 2 Art 22000 1 Bill 30000 3 Paul 25000 ; RUN;DATA faminc; INPUT famid faminc96 faminc97 faminc98 ; CARDS; 3 75000 76000 77000 1 40000 40500 41000 2 45000 45400 45800* 1. Sort the dads file by "famid" & save sorted file as dads2 ;PROC SORT DATA=dads OUT=dads2; BY famid; RUN;* 2. Sort faminc by "famid" & save sorted file as faminc2 ;PROC SORT DATA=faminc OUT=faminc2; BY famid; RUN;* 3. Merge dads2 and faminc2 by famid in a data step ;DATA dadfam ; MERGE dads2 faminc2; BY famid; RUN;* Let's do a proc print and look at the results. ;PROC PRINT DATA=dadfam; RUN;
The output of the program is shown below.
OBS FAMID NAME INC FAMINC96 FAMINC97 FAMINC98 1 1 Bill 30000 40000 40500 41000 2 2 Art 22000 45000 45400 45800 3 3 Paul 25000 75000 76000 77000
The output from shows that the match merge worked properly. The dad and faminc are merged side by side. The next example considers a one-to-many merge where one observation in one file may have multiple matching records in another file. We will see that kind of merge is really no different from the one-to-one merge we saw here.
3. One-to-many merge
Imagine that we had a file with dads like we saw in the previous example, and we had a file with kids where a dad could have more than one kid. Matching up the "dads" with the "kids" is called a "one-to-many" merge since you are matching one dad observation to possibly many kids records. The dads and kids records are shown below.
dads famid name inc 2 Art 22000 1 Bill 30000 3 Paul 25000kids famid kidname birth age wt sex 1 Beth 1 9 60 f 1 Bob 2 6 40 m 1 Barb 3 3 20 f 2 Andy 1 8 80 m 2 Al 2 6 50 m 2 Ann 3 2 20 f 3 Pete 1 6 60 m 3 Pam 2 4 40 f 3 Phil 3 2 20 m
After matching the dads with the kids you get a file that looks like the one below. Bill is matched up with his kids Beth, Bob and Barb; Art is matched up with Andy Al, and Ann; and Paul is matched up with Pete, Pam and Phil.
dadkid FAMID NAME INC KIDNAME BIRTH AGE WT SEX 1 Bill 30000 Beth 1 9 60 f 1 Bill 30000 Bob 2 6 40 m 1 Bill 30000 Barb 3 3 20 f 2 Art 22000 Andy 1 8 80 m 2 Art 22000 Al 2 6 50 m 2 Art 22000 Ann 3 2 20 f 3 Paul 25000 Pete 1 6 60 m 3 Paul 25000 Pam 2 4 40 f 3 Paul 25000 Phil 3 2 20 m
Just like the "one-to-one" merge, we follow the same three steps for a "one-to-many" merge. These three steps are illustrated in the SAS program merge2.sas below.
- Use proc sort to sort dads on famid and save that file (we will call it dads2)
- Use proc sort to sort kids on famid and save that file (we will call it kids2)
- merge the dads2 and kids2 files based on famid
The program below illustrates these steps.
* first we make the "dads" data file ;DATA dads; INPUT famid name $ inc ; CARDS; 2 Art 22000 1 Bill 30000 3 Paul 25000 ; RUN;* Next we make the "kids" data file ;DATA kids; INPUT famid kidname $ birth age wt sex $ ; CARDS; 1 Beth 1 9 60 f 1 Bob 2 6 40 m 1 Barb 3 3 20 f 2 Andy 1 8 80 m 2 Al 2 6 50 m 2 Ann 3 2 20 f 3 Pete 1 6 60 m 3 Pam 2 4 40 f 3 Phil 3 2 20 m ; RUN;* 1. sort "dads" on famid and save the sorted file as "dads2" ;PROC SORT DATA=dads OUT=dads2; BY famid; RUN;* 2. sort "kids" on famid and save the sorted file as "kids2" ;PROC SORT DATA=kids OUT=kids2; BY famid; RUN;* 3. merge "dads2" and "kids2" based on famid, creating "dadkid" ;DATA dadkid; MERGE dads2 kids2; BY famid; RUN;* Let's do a PROC PRINT of "dadkid" to see if the merge worked ;PROC PRINT DATA=dadkid; RUN;
The output of the program is shown below.
OBS FAMID NAME INC KIDNAME BIRTH AGE WT SEX 1 1 Bill 30000 Beth 1 9 60 f 2 1 Bill 30000 Bob 2 6 40 m 3 1 Bill 30000 Barb 3 3 20 f 4 2 Art 22000 Andy 1 8 80 m 5 2 Art 22000 Al 2 6 50 m 6 2 Art 22000 Ann 3 2 20 f 7 3 Paul 25000 Pete 1 6 60 m 8 3 Paul 25000 Pam 2 4 40 f 9 3 Paul 25000 Phil 3 2 20 m
The output shows just what we hoped to see, the dads merged along side of their kids. You might have wondered what would have happened if the merge statement had reversed the order of the files, had we changed step 3 to look like below.
* 3. merge "dads2" and "kids2" based on famid, creating "dadkid" ;DATA dadkid; MERGE kids2 dads2; BY famid; RUN;* Let's do a PROC PRINT of "dadkid" see what happens ;PROC PRINT DATA=dadkid; RUN;
The output with the modified step 3 is shown below.
OBS FAMID KIDNAME BIRTH AGE WT SEX NAME INC 1 1 Beth 1 9 60 f Bill 30000 2 1 Bob 2 6 40 m Bill 30000 3 1 Barb 3 3 20 f Bill 30000 4 2 Andy 1 8 80 m Art 22000 5 2 Al 2 6 50 m Art 22000 6 2 Ann 3 2 20 f Art 22000 7 3 Pete 1 6 60 m Paul 25000 8 3 Pam 2 4 40 f Paul 25000 9 3 Phil 3 2 20 m Paul 25000
This output shows what happened when we switched the order of kids2 and dads2 in the merge statement. The merge results are basically the same, except that the order of the variables is modified — the kids variables are on the left and the dads variables are at the right. Other than that, the results are the same.
4. Problems to look out for
These examples cover situations where there are no complications. We show some examples of complications that can arise and how you can solve them below.
4.1 Mismatching records in one-to-one merge
The two data files have may have records that do not match. Below we illustrate this by including an extra dad (Karl in famid 4) that does not have a corresponding family, and there are two extra families (5 and 6) in the family file that do not have a corresponding dad.
DATA dads; INPUT famid name $ inc; DATALINES; 2 Art 22000 1 Bill 30000 3 Paul 25000 4 Karl 95000 ; RUN; DATA faminc; INPUT famid faminc96 faminc97 faminc98; DATALINES; 3 75000 76000 77000 1 40000 40500 41000 2 45000 45400 45800 5 55000 65000 70000 6 22000 24000 28000 ; RUN; PROC SORT DATA=dads; BY famid; RUN; PROC SORT DATA=faminc; BY famid; RUN; DATA merge121; MERGE dads(IN=fromdadx) faminc(IN=fromfamx); BY famid; fromdad = fromdadx; fromfam = fromfamx; RUN;
As you see above, we use the in option to create a 0/1 variable fromdadx that indicates whether the resulting file contains a record with data from the dads file. Likewise, we use IN option to create a 0/1 variable fromfamx that indicates if the observation came from the faminc file. The fromdadx and fromfamx variables are temporary, so we make copies of them in fromdad and fromfam so we have copies of these variables that stay with the file. We can then use proc print and proc freq to identify the mismatching records.
PROC PRINT DATA=merge121;
RUN;PROC FREQ DATA=merge121;
TABLES fromdad*fromfam;
RUN;
The output below illustrates that there were mismatching records. For famid 4, the value of fromdad is 1 and fromfam is 0, as we would expect since there was data from dads for famid 4, but no data from faminc. Also, as we expect, this record has valid data for the variables from the dads file (name and inc) and missing data for the variables from faminc (faminc96 faminc97 and faminc98). We see the reverse pattern for famid‘s5 and 6.
OBS FAMID NAME INC FAMINC96 FAMINC97 FAMINC98 FROMDAD FROMFAM 1 1 Bill 30000 40000 40500 41000 1 1 2 2 Art 22000 45000 45400 45800 1 1 3 3 Paul 25000 75000 76000 77000 1 1 4 4 Karl 95000 . . . 1 0 5 5 . 55000 65000 70000 0 1 6 6 . 22000 24000 28000 0 1
A closer look at the fromdad and fromfam variables reveals that there are three records that have matching data: one that has data from the dads only, and two records that have data from the faminc file only. The crosstab table below confirms this.
TABLE OF FROMDAD BY FROMFAM FROMDAD FROMFAM Frequency| Percent | Row Pct | Col Pct | 0| 1| Total ---------+--------+--------+ 0 | 0 | 2 | 2 | 0.00 | 33.33 | 33.33 | 0.00 | 100.00 | | 0.00 | 40.00 | ---------+--------+--------+ 1 | 1 | 3 | 4 | 16.67 | 50.00 | 66.67 | 25.00 | 75.00 | | 100.00 | 60.00 | ---------+--------+--------+ Total 1 5 6 16.67 83.33 100.00
You may want to use this strategy to check the matching of the two files. If there are unexpected mismatched records, then you should investigate to understand the cause of the mismatched records.
Use the where statement in a proc print to eliminate some of the non-matching records.
4.2 Variables with the same name, but different information
Below we have the files with the information about the dads and family, but look more closely at the names of the variables. In the dads file, there is a variable called inc98, and in the family file there are variables inc96, inc97 and inc98. Let’s attempt to merge these files and see what happens.
DATA dads; INPUT famid name $ inc98; DATALINES; 2 Art 22000 1 Bill 30000 3 Paul 25000 ; RUN; DATA faminc; INPUT famid inc96 inc97 inc98; DATALINES; 3 75000 76000 77000 1 40000 40500 41000 2 45000 45400 45800 ; RUN; PROC SORT DATA=dads; BY famid; RUN; PROC SORT DATA=faminc; BY famid; RUN; DATA merge121; MERGE faminc dads; BY famid; RUN;PROC PRINT DATA=merge121;
RUN;
The results are shown below. As you see, the variable inc98 has the data from the dads file, the file that appears last on the merge statement. When you merge files that have the same variable, SAS will use the values from the file that appears last on the merge statement.
OBS FAMID INC96 INC97 INC98 NAME 1 1 40000 40500 30000 Bill 2 2 45000 45400 22000 Art 3 3 75000 76000 25000 Paul
There are a couple of ways you can solve this problem.
Solution #1. The most obvious solution is to choose variable names in the original files that will not conflict with each other. However, you may have files where the names have already been chosen.
Solution #2. You can rename the variables in a data step using the rename option (which renames the variables before doing the merging). This allows you to select variable names that do not conflict with each other, as illustrated below.
DATA merge121;
MERGE faminc(RENAME=(inc96=faminc96 inc97=faminc97 inc98=faminc98))
dads(RENAME=(inc98=dadinc98));
BY famid;
RUN;PROC PRINT DATA=merge121;
RUN;
As you can see below, the variables were renamed as specified.
OBS FAMID FAMINC96 FAMINC97 FAMINC98 NAME DADINC98 1 1 40000 40500 41000 Bill 30000 2 2 45000 45400 45800 Art 22000 3 3 75000 76000 77000 Paul 25000
5. For more information
- For information on concatenating data files, see the SAS Learning Module on Concatenating Data Files in SAS.