1. Introduction
When you have two data files, you may want to combine them by stacking them one on top of the other (referred to as concatenating files). Below we have a file called dads and a file containing moms.
dads famid name inc 2 Art 22000 1 Bill 30000 3 Paul 25000 moms famid name inc 1 Bess 15000 3 Pat 50000 2 Amy 18000
Below we have stacked (concatenated) these files creating a file we called momdad. These examples will show how to concatenate files in SAS.
momdad famid name inc 2 Art 22000 1 Bill 30000 3 Paul 25000 1 Bess 15000 3 Pat 50000 2 Amy 18000
2. Concatenating the moms and dads
The SAS program below creates a SAS data file called dads and a file called moms. It then combines them (concatenates them) creating a file called dadmom.
* Here is a file with information about dads with their family id name and income ; DATA dads; INPUT famid name $ inc ; CARDS; 2 Art 22000 1 Bill 30000 3 Paul 25000 ; RUN; * Here is a file with information about moms with their family id name and income ; DATA moms; INPUT famid name $ inc ; CARDS; 1 Bess 15000 3 Pat 50000 2 Amy 18000 ; RUN; * We can combine these files by stacking them one on top the other ; * by setting them both together in the same data step as shown below ; DATA dadmom; SET dads moms; RUN; * Let's use PROC PRINT to look at the result ; PROC PRINT DATA=dadmom; RUN;
The output of this program is shown below.
OBS FAMID NAME INC 1 2 Art 22000 2 1 Bill 30000 3 3 Paul 25000 4 1 Bess 15000 5 3 Pat 50000 6 2 Amy 18000
The output from this program shows that the files were combined properly. The dads and moms are stacked together in one file. But, there is a little problem. We can’t tell the dads from the moms. Let’s try doing this again but in such a way that we can tell which observations are the moms and which are the dads.
3. Concatenating the moms and dads, a better example
In order to tell the dads from the moms, let’s create a variable called momdad in the dads and moms data files that will contain dad for the dads data file and mom for the moms data file. When we combine the two files together the momdad variable will tell us who the moms and dads are.
DATA dads; INPUT famid name $ inc ; momdad = "dad"; CARDS; 2 Art 22000 1 Bill 30000 3 Paul 25000 ; RUN;DATA moms; INPUT famid name $ inc ; momdad = "mom"; CARDS; 1 Bess 15000 3 Pat 50000 2 Amy 18000 ; RUN;DATA dadmom; SET dads moms; RUN;* Now when we do the proc print you can see the dads from the moms ;PROC PRINT DATA=dadmom; RUN;
The output of this program is shown below.
OBS FAMID NAME INC MOMDAD 1 2 Art 22000 dad 2 1 Bill 30000 dad 3 3 Paul 25000 dad 4 1 Bess 15000 mom 5 3 Pat 50000 mom 6 2 Amy 18000 mom
Here we get a more desirable result, because we can tell the dads from the moms by looking at the variable momdad. This required some thinking ahead because we had to put momdad in both the dads data file and the moms data file before we merged the data files.
4. Problems to look out for
These above examples cover situations where there are no complications. However, look out for the following problems.
4.1. The two data files have different variable names for the same thing
For example, income is called dadinc and in the dads file and called mominc in the moms file, as shown below.
DATA dads; INPUT famid name $ dadinc ; DATALINES; 2 Art 22000 1 Bill 30000 3 Paul 25000 ; RUN; DATA moms; INPUT famid name $ mominc ; DATALINES; 1 Bess 15000 3 Pat 50000 2 Amy 18000 ; RUN; DATA momdad; SET dads(IN=dad) moms(IN=mom); IF dad=1 THEN momdad="dad"; IF mom=1 THEN momdad="mom"; run;PROC PRINT DATA=momdad; RUN;
You can see the problem illustrated below.
OBS FAMID NAME DADINC MOMINC DAD MOM MOMDAD 1 2 Art 22000 . 1 0 dad 2 1 Bill 30000 . 1 0 dad 3 3 Paul 25000 . 1 0 dad 4 1 Bess . 15000 0 1 mom 5 3 Pat . 50000 0 1 mom 6 2 Amy . 18000 0 1 mom
Solution #1. The most obvious solution is to choose appropriate variable names for the original files (i.e., name the variable inc in both the moms and dads file). This solution is not always possible since you might be concatenating files that you did not originally create. To save space, we omit illustrating this solution.
Solution #2. If solution #1 is not possible, then this problem can be addressed using an if statement in a data step.
DATA momdad; SET dads(IN=dad) moms(IN=mom); IF dad=1 THEN DO; momdad="dad"; inc=dadinc; END; IF mom=1 THEN DO; momdad="mom"; inc=mominc; END; RUN; PROC PRINT DATA=momdad; RUN;
The results are shown below, where inc now has the income for both the moms and dads.
OBS FAMID NAME DADINC MOMINC DAD MOM MOMDAD INC 1 2 Art 22000 . 1 0 dad 22000 2 1 Bill 30000 . 1 0 dad 30000 3 3 Paul 25000 . 1 0 dad 25000 4 1 Bess . 15000 0 1 mom 15000 5 3 Pat . 50000 0 1 mom 50000 6 2 Amy . 18000 0 1 mom 18000
Solution 3. Another way you can fix this is by using the rename option on the set statement of a data step to rename the variables just before the files are combined.
DATA momdad; SET dads(RENAME=(dadinc=inc)) moms(RENAME=(mominc=inc)); RUN; PROC PRINT DATA=momdad; RUN;
The output for Solution 3 is below.
OBS FAMID NAME INC 1 2 Art 22000 2 1 Bill 30000 3 3 Paul 25000 4 1 Bess 15000 5 3 Pat 50000 6 2 Amy 18000
4.2 The two data files have different lengths for variables of the same name
In all of the examples above, the variable name was input with the format $ indicating name is an alphabetic (string) variable with a default length of 8. What would happen if name in the dads file was input using $3. and name in the moms file was input using $4. ? This is illustrated below.
DATA dads; INPUT famid name $3. inc; DATALINES; 2 Art 22000 1 Bob 30000 3 Tom 25000 RUN; DATA moms; INPUT famid name $4. inc; DATALINES; 1 Bess 15000 3 Rory 50000 2 Jane 18000 RUN; DATA momdad; SET dads moms; RUN; PROC PRINT DATA=momdad; RUN;
The output is below.
OBS FAMID NAME INC 1 2 Art 22000 2 1 Bob 30000 3 3 Tom 25000 4 1 Bes 15000 5 3 Ror 50000 6 2 Jan 18000
Note that the names for the moms are truncated to be length 3. This is because the length for names in the dads file is 3. To fix this, use the length statement in the data step that merges the two files.
DATA momdad; LENGTH name $ 4; SET dads moms; RUN; PROC PRINT DATA=momdad; RUN;
The output is below.
OBS NAME FAMID INC 1 Art 2 22000 2 Bob 1 30000 3 Tom 3 25000 4 Bess 1 15000 5 Rory 3 50000 6 Jane 2 18000
4.3 The two data files have variables with the same name but different codes
This problem is similar to the problem above, except that it has an additional wrinkle, illustrated below. In the dads file there is a variable called fulltime that is coded 1 if the dad is working full time, 0 if he is not. The moms file also has a variable called fulltime that is coded Y is she is working full time, and N if she is not. Not only are these variables of different types (numeric and character), but they are coded differently as well.
DATA dads; INPUT famid name $ inc fulltime; DATALINES; 2 Art 22000 0 1 Bill 30000 1 3 Paul 25000 1 ; RUN; DATA moms; INPUT famid name $ inc fulltime $1.; DATALINES; 1 Bess 15000 N 3 Pat 50000 Y 2 Amy 18000 N ; RUN;
Solution #1. Code the variables in the two files in the same way. For example, code fulltime using 0/1 for both files with 1 indicating working fulltime. This is the simplest solution if you are creating the files yourself. We will omit illustrating this solution to save space.
Solution #2. You may not have created the original raw data files, so solution #1 may not be possible for you. In that case, you can create a new variable in each file that has the same coding and will be compatible when you merge the files. Below we illustrate this strategy.
For the dads file, we make a variable called full that is the same as fulltime, and save the file as dads2, dropping fulltime. For the moms, we create full by recoding fulltime, and save the file as moms2, also dropping fulltime. The files dads2 and moms2 both have the variable full coded the same way (0/1 where 1=works full time) so we can combine those files together.
DATA dads2; SET dads; full=fulltime; DROP fulltime; RUN; DATA moms2; SET moms; IF fulltime="Y" THEN full=1; IF fulltime="N" THEN full=0; DROP fulltime; RUN; DATA momdad; SET dads2 moms2; RUN;PROC PRINT DATA=momdad; RUN;
The results are shown below.
OBS FAMID NAME INC FULL 1 2 Art 22000 0 2 1 Bill 30000 1 3 3 Paul 25000 1 4 1 Bess 15000 0 5 3 Pat 50000 1 6 2 Amy 18000 0
5. For more information
- For more information about concatenating data files, see the section on Combining Data Sets in Chapter 4 of the SAS Language Reference and Chapters 14 and 15 of the SAS Language and Procedures Guide .