1. One-to-one merge
Below we have a file containing family id, father’s name and income. We also have a file containing income information for multiple years. We would like to match merge the files together so we have the dads observation on the same line with the faminc observation based on the key variable famid. In proc sql we use where statement to do the matching as shown below.
data dads; input famid name $ inc ; cards; 2 Art 22000 1 Bill 30000 3 Paul 25000 ; run; data faminc; input famid faminc96 faminc97 faminc98 ; cards; 3 75000 76000 77000 1 40000 40500 41000 2 45000 45400 45800 ; run;proc sql; create table dadfam1 as select * from dads, faminc where dads.famid=faminc.famid order by dads.famid; quit; proc print data=dadfam1; run; Obs famid name inc faminc96 faminc97 faminc98 1 1 Bill 30000 40000 40500 41000 2 2 Art 22000 45000 45400 45800 3 3 Paul 25000 75000 76000 77000
2. One-to-many merge
Imagine that we had a file with dads like we saw in the previous example, and we had a file with kids where a dad could have more than one kid. Matching up the "dads" with the "kids" is called a "one-to-many" merge since you are matching one dad observation to possibly many kids records. The dads and kids records are shown below. Notice here we have variable fid in the first data set and famid in the second. These are the variables that we want to match. When we merge the two using proc sql, we don’t have to rename them, since we can use data set name identifier.
data dads; input fid name $ inc ; cards; 2 Art 22000 1 Bill 30000 3 Paul 25000 ; run; * Next we make the "kids" data file ; data kids; input famid kidname $ birth age wt sex $ ; cards; 1 Beth 1 9 60 f 1 Bob 2 6 40 m 1 Barb 3 3 20 f 2 Andy 1 8 80 m 2 Al 2 6 50 m 2 Ann 3 2 20 f 3 Pete 1 6 60 m 3 Pam 2 4 40 f 3 Phil 3 2 20 m ; run; proc sql; create table dadkid2 as select * from dads, kids where dads.fid=kids.famid order by dads.fid, kids.kidname; quit; proc print data=dadkid2; run;Obs fid name inc famid kidname birth age wt sex 1 1 Bill 30000 1 Barb 3 3 20 f 2 1 Bill 30000 1 Beth 1 9 60 f 3 1 Bill 30000 1 Bob 2 6 40 m 4 2 Art 22000 2 Al 2 6 50 m 5 2 Art 22000 2 Andy 1 8 80 m 6 2 Art 22000 2 Ann 3 2 20 f 7 3 Paul 25000 3 Pam 2 4 40 f 8 3 Paul 25000 3 Pete 1 6 60 m 9 3 Paul 25000 3 Phil 3 2 20 m
3. Renaming variables with the same name in merging
Below we have the files with the information about the dads and family, but look more closely at the names of the variables. In the dads file, there is a variable called inc98, and in the family file there are variables inc96, inc97 and inc98.
data dads; input famid name $ inc98; cards; 2 Art 22000 1 Bill 30000 3 Paul 25000 ; run; data faminc; input famid inc96 inc97 inc98; cards; 3 75000 76000 77000 1 40000 40500 41000 2 45000 45400 45800 ; run;
Let’s merge them using the same strategy used in our previous example on merging. We see below that we lost variable inc98 from the second dataset faminc. Proc sql uses the column from the first data set in case of same variable names from both datasets. This may not be what we want.
proc sql; create table dadkid4 as select * from dads, faminc where dads.famid=faminc.famid order by dads.famid; quit; proc print data=dadkid4; run; Obs famid name inc98 inc96 inc97 1 1 Bill 30000 40000 40500 2 2 Art 22000 45000 45400 3 3 Paul 25000 75000 76000
In proc sql we can rename the variables using the as statement shown below.
proc sql; create table dadkid5 as select *, dads.inc98 as dadinc98, faminc.inc98 as faminc98 from dads, faminc where dads.famid=faminc.famid order by dads.famid; quit; proc print data=dadkid5; run; Obs famid name inc98 inc96 inc97 dadinc98 faminc98 1 1 Bill 30000 40000 40500 30000 41000 2 2 Art 22000 45000 45400 22000 45800 3 3 Paul 25000 75000 76000 25000 77000
4. Using full join to handle mismatching records in a one-to-one merge
The two datasets may have records that do not match. Below we illustrate this by including an extra dad (Karl in famid 4) that does not have a corresponding family, and there are two extra families (5 and 6) in the family file that do not have a corresponding dad.
data dads; input famid name $ inc; cards; 2 Art 22000 1 Bill 30000 3 Paul 25000 4 Karl 95000 ; run; data faminc; input famid faminc96 faminc97 faminc98; cards; 3 75000 76000 77000 1 40000 40500 41000 2 45000 45400 45800 5 55000 65000 70000 6 22000 24000 28000 ; run;
Let’s apply the previous example to these two datasets. We see that the unmatched records have been dropped out in the merged data set, since the where statement eliminated them.
proc sql; create table dadkid3 as select * from dads, faminc where dads.famid=faminc.famid order by dads.famid; quit; proc print data=dadkid3; run;Obs famid name inc faminc96 faminc97 faminc98 1 1 Bill 30000 40000 40500 41000 2 2 Art 22000 45000 45400 45800 3 3 Paul 25000 75000 76000 77000
What if we want to keep all the records from both datasets even they do not match? The following proc sql does it in a more complex way. Here we create two new variables. One is indic, an indicator variable that indicates whether an observation is from both datasets, 1 being from both datasets and 0 otherwise. Another variable is fid, a coalesce of famid from both datasets. This gives us more control over our datasets. We can decide if we have a mismatch and where the mismatch happens.
proc sql; create table dadkid4 as select *, (dads.famid=faminc.famid) as indic, (dads.famid ~=.) as dadind, (faminc.famid ~=.) as famind, coalesce(dads.famid, faminc.famid) as fid from dads full join faminc on dads.famid=faminc.famid; quit; proc print data=dadkid4; run;Obs famid name inc faminc96 faminc97 faminc98 indic dadind famind fid 1 1 Bill 30000 40000 40500 41000 1 1 1 1 2 2 Art 22000 45000 45400 45800 1 1 1 2 3 3 Paul 25000 75000 76000 77000 1 1 1 3 4 4 Karl 95000 . . . 0 1 0 4 5 . . 55000 65000 70000 0 0 1 5 6 . . 22000 24000 28000 0 0 1 6
5. Producing all the possible distinct pairs of the values in a column
Let’s say that we have a data set containing a variable called city. We want to create all possible distinct pairs of cities appeared in the variable. This would be really tricky to do if we only use a data step. But it can be accomplished fairly straightforwardly with SAS proc sql as shown below. Proc sql is first used to select distinct cities and to save them to a new dataset. It is used again to create all distinct pairs of cities. As shown below, there are seven different places. Therefore there will be 7*6/2 =21 pairs of cities.
data places; input pid city $12.; cards; 1 LosAngeles 2 Orlando 3 London 4 NewYork 5 Boston 6 Paris 7 Washington 8 LosAngeles 9 Orlando 10 London ; run; proc sql; create table discity as select distinct city from places; quit; proc print data=discity; title "Distinct Cities"; format city $12.; run; proc sql; create table pair_places as select f1.city as orig , f2.city as dest from discity as f1 , discity as f2 where f1.city ne ' ' & f1.city < f2.city order by f1.city, f2.city; quit; title 'All Possible Paired Places'; proc print data=pair_places; format orig dest $12.; run; Distinct Cities Obs city 1 Boston 2 London 3 LosAngeles 4 NewYork 5 Orlando 6 Paris 7 Washington All Possible Paired Places Obs orig dest 1 Boston London 2 Boston LosAngeles 3 Boston NewYork 4 Boston Orlando 5 Boston Paris 6 Boston Washington 7 London LosAngeles 8 London NewYork 9 London Orlando 10 London Paris 11 London Washington 12 LosAngeles NewYork 13 LosAngeles Orlando 14 LosAngeles Paris 15 LosAngeles Washington 16 NewYork Orlando 17 NewYork Paris 18 NewYork Washington 19 Orlando Paris 20 Orlando Washington 21 Paris Washington