How can I merge two files on address?

Given two data sets containing different pieces of information for an overlapping set of addresses, you may be interested in merging your datasets by address. Different data sources can represent the same address in very different ways, so this can be a daunting task. This page aims to offer some data management tips that will prevent errors and help in detecting true matches. We have included some code fragments and suggestions for SAS procedures that are especially useful when working with addresses.

Data Management Tips

1. Separate the addresses into components (street number, street name, city, state, zip). This can be made easier with the SAS functions scan, substring, and the prx functions (prxparse, prxmatch, prxsubstr, etc.). You may later combine them again into one variable, but cleaning and formatting address components will be easier with separate variables.

In the dataset created below, the street number, street name, and street type all appear in the variable address1 and city, state, and zip code all appear in the variable address2. We use scan and substring here to separate address1 into three component variables. For details on how to use these functions, see SAS’s online documentation.

data test;
  length address1 $40. address2 $60.;
  input address1 $ 1-20 address2 $ 21-80;
  datalines;
  1234 Washington St     DC          12345
  1234 Irving St         Charlotte  NC      12345
  45 Wall     street     New           York  NY 90454
  ;
run;

proc print data = test; run;

Obs         address1         address2
 1     1234 Washington St    DC          12345
 2     1234 Irving St        Charlotte  NC      12345
 3     45 Wall     street    New           York  NY 90454

data test; set test;
  streettype=scan(address1,-1);
  streetnum=scan(address1,1);
  streettype_start = index(address1, trim(streettype));
  streetnum_end = length(streetnum);
  streetname = trim(substr(address1, streetnum_end+1, (streettype_start-streetnum_end-1)));
  drop streettype_start streetnum_end;
run;

proc print data = test;
  var streetnum streetname streettype;
run;

Obs    streetnum    streetname    streettype
 1       1234       Washington      St
 2       1234       Irving          St
 3       45         Wall            street

If your data are less consistent, regular expressions may be the best approach. For help with regular expressions in SAS, see our related code fragments page.

2. Within your string variables, trim leading and trailing spaces and check that the variable length is not truncating any of the values. For help with this, see SAS FAQ: How can I get rid of extra spaces in a string variable?. If your strings have both capital and lowercase letters, use either the upcase or lowcase function to force your string variables to be one or the other. If you have a string variable that contains only digits (like the variable streetnum in the above example), consider converting it to a numeric variable to avoid worrying about spacing or formatting.

data test; set test;
  streetname_cap = upcase(streetname);
  streetnum_num = 1*streetnum;
run;

3. Focus on the information in your data that is most meaningful in matching records. There is likely information within an address that is redundant and more difficult to match. For example, given a zip code, city and state are redundant pieces of information that may be difficult to match due to different spelling/spacing/capitalization patterns. In the example above, streettype looks like it does not add information, but might complicate a merge.

4. Test your code as you go. Start with a very small set of observations from each dataset that includes some observations that you believe should match and some that should not. Run your merge and see if it works they way you believe it should. Use in = options in your merge so that you can quickly see how many records are not being matched. In the code below, we create a variable that indicates the outcome of the merge for each observation.

data merged_data; merge address_A (in = A) address_B (in = B);
  by streetnum streetname zip;
  if A and not B then match = "A only";
  if B and not A then match = "B only";
  if A and B then match = "match";
run;

Common Issues to Look Out For

Numbered streets that can be presented as numbers (2nd St., 5th Ave.) or spelled out (Second St., Fifth Ave.).
Zip codes with leading zeroes (these occur in New England) that may or may not appear.
Directions on streets (N. Taylor Rd., E. 154th St.).
Fractions in addresses (332 1/2 Ashton Ave.).
Different indications of apartments/condos/units/suites/floors in a given building.

You may face all or none of the issues listed above. Most of them can be addressed with the procedures presented in the code fragments or links shown above.

Final Thoughts

Keep in mind that matching on addresses is difficult under the best circumstances. Being realistic in your expectations and knowing the quality of your data can allow you to know when you have hit a reasonable match rate.