Match merging data files using proc sql

1. One-to-one merge

Below we have a file containing family id, father’s name and income. We also have a file containing income information for multiple years. We would like to match merge the files together so we have the dads observation on the same line with the faminc observation based on the key variable famid. In proc sql we use where statement to do the matching as shown below.

data dads; 
  input famid name $ inc ; 
cards; 
2 Art  22000 
1 Bill 30000 
3 Paul 25000 
; 
run; 

data faminc; 
  input famid faminc96 faminc97 faminc98 ; 
cards; 
3 75000 76000 77000 
1 40000 40500 41000 
2 45000 45400 45800 
;
run;

proc sql;
  create table dadfam1 as
  select * 
  from dads, faminc
  where dads.famid=faminc.famid
  order by dads.famid;
quit;

proc print data=dadfam1;
run;

Obs    famid    name     inc     faminc96    faminc97    faminc98

 1       1      Bill    30000      40000       40500       41000
 2       2      Art     22000      45000       45400       45800
 3       3      Paul    25000      75000       76000       77000

2. One-to-many merge

Imagine that we had a file with dads like we saw in the previous example, and we had a file with kids where a dad could have more than one kid. Matching up the "dads" with the "kids" is called a "one-to-many" merge since you are matching one dad observation to possibly many kids records. The dads and kids records are shown below. Notice here we have variable fid in the first data set and famid in the second. These are the variables that we want to match. When we merge the two using proc sql, we don’t have to rename them, since we can use data set name identifier.

data dads; 
  input fid name $ inc ; 
cards; 
2 Art  22000 
1 Bill   30000 
3 Paul  25000 
; 
run; 

* Next we make the "kids" data file ;
data kids; 
  input famid kidname $ birth age wt sex $ ; 
cards; 
1 Beth 1 9 60 f 
1 Bob  2 6 40 m 
1 Barb 3 3 20 f 
2 Andy 1 8 80 m 
2 Al   2 6 50 m 
2 Ann  3 2 20 f 
3 Pete 1 6 60 m 
3 Pam  2 4 40 f 
3 Phil 3 2 20 m 
; 
run;

proc sql; 
  create table dadkid2 as
  select  * 
  from dads, kids
  where dads.fid=kids.famid
  order by dads.fid, kids.kidname;
quit;

proc print data=dadkid2;
run;

Obs    fid    name     inc     famid    kidname    birth    age    wt    sex

 1      1     Bill    30000      1       Barb        3       3     20     f
 2      1     Bill    30000      1       Beth        1       9     60     f
 3      1     Bill    30000      1       Bob         2       6     40     m
 4      2     Art     22000      2       Al          2       6     50     m
 5      2     Art     22000      2       Andy        1       8     80     m
 6      2     Art     22000      2       Ann         3       2     20     f
 7      3     Paul    25000      3       Pam         2       4     40     f
 8      3     Paul    25000      3       Pete        1       6     60     m
 9      3     Paul    25000      3       Phil        3       2     20     m

3. Renaming variables with the same name in merging

Below we have the files with the information about the dads and family, but look more closely at the names of the variables. In the dads file, there is a variable called inc98, and in the family file there are variables inc96, inc97 and inc98.

data dads;
 input famid name $ inc98;
cards;
2 Art  22000
1 Bill 30000
3 Paul 25000
;
run;

data faminc;
 input famid inc96 inc97 inc98;
cards;
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
;
run;

Let’s merge them using the same strategy used in our previous example on merging. We see below that we lost variable inc98 from the second dataset faminc. Proc sql uses the column from the first data set in case of same variable names from both datasets. This may not be what we want.

proc sql; 
  create table dadkid4 as
  select * 
  from dads, faminc
  where dads.famid=faminc.famid
  order by dads.famid;
quit;
proc print data=dadkid4;
run;

Obs    famid    name    inc98    inc96    inc97

 1       1      Bill    30000    40000    40500
 2       2      Art     22000    45000    45400
 3       3      Paul    25000    75000    76000

In proc sql we can rename the variables using the as statement shown below.

proc sql; 
  create table dadkid5 as
  select *, dads.inc98 as dadinc98, faminc.inc98 as faminc98  
  from dads, faminc
  where dads.famid=faminc.famid
  order by dads.famid;
quit;

proc print data=dadkid5;
run;

Obs    famid    name    inc98    inc96    inc97    dadinc98    faminc98

 1       1      Bill    30000    40000    40500      30000       41000
 2       2      Art     22000    45000    45400      22000       45800
 3       3      Paul    25000    75000    76000      25000       77000

4. Using full join to handle mismatching records in a one-to-one merge

The two datasets may have records that do not match. Below we illustrate this by including an extra dad (Karl in famid 4) that does not have a corresponding family, and there are two extra families (5 and 6) in the family file that do not have a corresponding dad.

data dads;
 input famid name $ inc;
cards;
2 Art  22000
1 Bill 30000
3 Paul 25000
4 Karl 95000
;
run;

data faminc;
 input famid faminc96 faminc97 faminc98;
cards;
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
5 55000 65000 70000
6 22000 24000 28000
;
run;

Let’s apply the previous example to these two datasets. We see that the unmatched records have been dropped out in the merged data set, since the where statement eliminated them.

proc sql; 
  create table dadkid3 as
  select  *
  from dads, faminc
  where dads.famid=faminc.famid
  order by dads.famid;
quit;

proc print data=dadkid3;
run;

Obs    famid    name     inc     faminc96    faminc97    faminc98

 1       1      Bill    30000      40000       40500       41000
 2       2      Art     22000      45000       45400       45800
 3       3      Paul    25000      75000       76000       77000

What if we want to keep all the records from both datasets even they do not match? The following proc sql does it in a more complex way. Here we create two new variables. One is indic, an indicator variable that indicates whether an observation is from both datasets, 1 being from both datasets and 0 otherwise. Another variable is fid, a coalesce of famid from both datasets. This gives us more control over our datasets. We can decide if we have a mismatch and where the mismatch happens.

proc sql; 
  create table dadkid4  as
  select  *, (dads.famid=faminc.famid) as indic, 
             (dads.famid ~=.) as dadind,
             (faminc.famid ~=.) as famind,
             coalesce(dads.famid, faminc.famid) as fid
  from dads full join faminc on dads.famid=faminc.famid;
quit;

proc print data=dadkid4;
run;

Obs   famid   name    inc    faminc96   faminc97   faminc98   indic   dadind   famind   fid
 1      1     Bill   30000     40000      40500      41000      1        1        1      1
 2      2     Art    22000     45000      45400      45800      1        1        1      2
 3      3     Paul   25000     75000      76000      77000      1        1        1      3
 4      4     Karl   95000         .          .          .      0        1        0      4
 5      .                .     55000      65000      70000      0        0        1      5
 6      .                .     22000      24000      28000      0        0        1      6

5. Producing all the possible distinct pairs of the values in a column

Let’s say that we have a data set containing a variable called city. We want to create all possible distinct pairs of cities appeared in the variable. This would be really tricky to do if we only use a data step. But it can be accomplished fairly straightforwardly with SAS proc sql as shown below. Proc sql is first used to select distinct cities and to save them to a new dataset. It is used again to create all distinct pairs of cities. As shown below, there are seven different places. Therefore there will be 7*6/2 =21 pairs of cities.

data places;
input pid city $12.;
cards; 
 1   LosAngeles
 2   Orlando
 3   London
 4   NewYork
 5   Boston
 6   Paris
 7   Washington
 8   LosAngeles
 9   Orlando
 10  London
;
run;

proc sql;
  create table discity  as 
  select distinct city 
  from places; 
quit;

proc print data=discity;
  title "Distinct Cities";
  format city $12.;
run;

proc sql;
   create table pair_places as
   select f1.city as orig ,
          f2.city as dest
    from  discity  as f1 , discity as f2
   where f1.city ne ' ' & f1.city < f2.city
   order by f1.city, f2.city; 
quit;

title 'All Possible Paired Places';
proc print data=pair_places;
  format orig dest $12.;
run;

Distinct Cities                                                     

Obs    city

 1     Boston
 2     London
 3     LosAngeles
 4     NewYork
 5     Orlando
 6     Paris
 7     Washington
 
All Possible Paired Places                                            

Obs    orig            dest

  1    Boston          London
  2    Boston          LosAngeles
  3    Boston          NewYork
  4    Boston          Orlando
  5    Boston          Paris
  6    Boston          Washington
  7    London          LosAngeles
  8    London          NewYork
  9    London          Orlando
 10    London          Paris
 11    London          Washington
 12    LosAngeles      NewYork
 13    LosAngeles      Orlando
 14    LosAngeles      Paris
 15    LosAngeles      Washington
 16    NewYork         Orlando
 17    NewYork         Paris
 18    NewYork         Washington
 19    Orlando         Paris
 20    Orlando         Washington
 21    Paris           Washington