Here we illustrate how to collapse data across observations using proc means. Our example uses a hypothetical data set containing information about kids in three families. These examples show how you can collapse across kids to form family records from the kids records.
1. Reading the data file
Here is the SAS program that makes a data file called kids. It contains three families (famid) each with three kids. It contains the family ID, the name of the kid, the order of birth (1 2 3 for 1st, 2nd, 3rd), and the age, weight and sex of each kid.
DATA kids; LENGTH kidname $ 4 sex $ 1; INPUT famid kidname birth age wt sex ; CARDS; 1 Beth 1 9 60 f 1 Bob 2 6 40 m 1 Barb 3 3 20 f 2 Andy 1 8 80 m 2 Al 2 6 50 m 2 Ann 3 2 20 f 3 Pete 1 6 60 m 3 Pam 2 4 40 f 3 Phil 3 2 20 m ; RUN; PROC PRINT DATA=kids; RUN;
Here is the output of the proc print.
OBS KIDNAME SEX FAMID BIRTH AGE WT 1 Beth f 1 1 9 60 2 Bob m 1 2 6 40 3 Barb f 1 3 3 20 4 Andy m 2 1 8 80 5 Al m 2 2 6 50 6 Ann f 2 3 2 20 7 Pete m 3 1 6 60 8 Pam f 3 2 4 40 9 Phil m 3 3 2 20
2. Using proc means to collapse data across records
We can use proc means to collapse across across families. The example below computes the average age of each child within each family (because of the class famid; statement) and then outputs the results into a SAS data file called fam2.
PROC MEANS DATA=kids ; CLASS famid; VAR age; OUTPUT OUT=fam2 MEAN= ; RUN;
The output of the proc means is shown below.
FAMID N Obs N Mean Std Dev Minimum Maximum --------------------------------------------------- 1 3 3 6.000 3.00000 3.00000 9.0000000 2 3 3 5.333 3.05505 2.00000 8.0000000 3 3 3 4.000 2.00000 2.00000 6.0000000 ----------------------------------------------------
And we use proc print to have a look at fam2.
PROC PRINT DATA=fam2; RUN;
And this output shows that the data file fam2 contains the average of age for the kids for each family.
OBS FAMID _TYPE_ _FREQ_ AGE 1 . 0 9 5.11111 2 1 1 3 6.00000 3 2 1 3 5.33333 4 3 1 3 4.00000
However, there is one extra record (the one shown below). This is the overall mean (notice that the _FREQ_ for it is 9, and there are a total of nine kids). We really don’t want this record.
OBS FAMID _TYPE_ _FREQ_ AGE 1 . 0 9 5.11111
We can suppress the creation of the record with the overall mean with the nway option on the proc means statement. In general, when you use proc means with the class statement and make an output data file, you usually will want to use the nway option as shown below.
PROC MEANS DATA=kids NWAY ; CLASS famid; VAR age; OUTPUT OUT=fam3 MEAN= ; RUN;[we omit the proc means output]
PROC PRINT DATA=fam3; RUN;
Now, the fam3 data file had just has three records with the average age for each family.
OBS FAMID _TYPE_ _FREQ_ AGE 1 1 1 3 6.00000 2 2 1 3 5.33333 3 3 1 3 4.00000
3. Explicitly naming the collapsed variables
The following proc means example does the exact same thing as the prior example, except that the average of age is explicitly named, calling it avgage.
PROC MEANS DATA=kids NWAY ; CLASS famid; VAR age; OUTPUT OUT=fam4 MEAN=avgage ; RUN;[we omit the proc means output]
PROC PRINT DATA=fam4; RUN;
The output is the same as before, except that the average of age is called avgage.
OBS FAMID _TYPE_ _FREQ_ AVGAGE 1 1 1 3 6.00000 2 2 1 3 5.33333 3 3 1 3 4.00000
The rest of the examples will explicitly name the collapsed variables (e.g., use mean=avgage instead of just mean= ). In general, it is better to explicitly name the variables to avoid confusion between the original variable and the collapsed variable.
4. Getting means of more than one variable
We can request averages for more than one variable. Here we get the average for age and for wt all in the same command.
PROC MEANS DATA=kids NWAY ; CLASS famid; VAR age wt; OUTPUT OUT=fam5 MEAN=avgage avgwt; RUN;FAMID N Obs Variable N Mean Std Dev Minimum ----------------------------------------------------------- 1 3 AGE 3 6.0000 3.0000000 3.0000000 WT 3 40.0000 20.0000000 20.0000000 2 3 AGE 3 5.3333 3.0550505 2.0000000 WT 3 50.0000 30.0000000 20.0000000 3 3 AGE 3 4.0000 2.0000000 2.0000000 WT 3 40.0000 20.0000000 20.0000000 ------------------------------------------------------------ [to save space, we omit the output with the maximum of age and wt] PROC PRINT DATA=fam5; RUN;
As you see in the output below, avgage is the average age and avgwt is the average weight of the kids in each family.
OBS FAMID _TYPE_ _FREQ_ AVGAGE AVGWT 1 1 1 3 6.00000 40 2 2 1 3 5.33333 50 3 3 1 3 4.00000 40
5. Requesting multiple statistics at once
We can request multiple statistics at once. The command below gets the mean, standard deviation and age (mean std and N) for age and wt within each family.
PROC MEANS DATA=kids NWAY; CLASS famid; VAR age wt; OUTPUT OUT=fam6 MEAN=avgage avgwt STD=stdage stdwt N=nage nwt ; RUN;
The results below shows the output of the proc means.
FAMID N Obs Variable N Mean Std Dev Minimum -------------------------------------------------------------- 1 3 AGE 3 6.0000000 3.0000000 3.0000000 WT 3 40.0000000 20.0000000 20.0000000 2 3 AGE 3 5.3333333 3.0550505 2.0000000 WT 3 50.0000000 30.0000000 20.0000000 3 3 AGE 3 4.0000000 2.0000000 2.0000000 WT 3 40.0000000 20.0000000 20.0000000 -------------------------------------------------------------- [to save space, we omit the output with the maximum of age and wt]PROC PRINT DATA=fam6; RUN;
The results below correspond to the proc means above. You can see that the average age and wt by family are in avgage and avgwt. Likewise stdage and stdwt contain the standard deviation of age and wt for each family, and nage and nwt have the valid number of observations for age and wt for each family.
OBS FAMID _TYPE_ _FREQ_ AVGAGE AVGWT STDAGE STDWT NAGE NWT 1 1 1 3 6.00000 40 3.00000 20 3 3 2 2 1 3 5.33333 50 3.05505 30 3 3 3 3 1 3 4.00000 40 2.00000 20 3 3
6. Suppressing proc means output
In our example, we have just three families. For your data, you might have dozens, hundreds, or thousands of families (or whatever grouping you are using). The output of the proc means can get very long, so you may want to suppress the output. You can do that with the noprint option as shown below.
PROC MEANS DATA=kids NWAY NOPRINT ; CLASS famid; VAR age wt; OUTPUT OUT=fam7 MEAN=avgage avgwt STD=stdage stdwt N=nage nwt ; RUN;
The output from the proc means is not printed due to the noprint option.
7. Counting the number of boys and girls in the family
Suppose you wanted a count of the number of boys and girls in the family. We can do that with one extra step. We will make a dummy variable that is 1 if a boy (0 if not), and a dummy variable that is 1 if a girl (and 0 if not). The sum of the boy dummy variable within a family is the number of boys in the family and the sum of the girl dummy variable within a family is the number of girls in the family.
First, we use a data step to make the boy and girl dummy variable.
DATA kids2 ; SET kids; If sex = "m" THEN boy = 1; ELSE boy = 0 ; If sex = "f" THEN girl= 1; ELSE girl= 0 ; RUN;
We use proc print to look at the boy and girl variables to double check them.
PROC PRINT DATA=kids2; VAR sex boy girl ; RUN;OBS SEX BOY GIRL 1 f 0 1 2 m 1 0 3 f 0 1 4 m 1 0 5 m 1 0 6 f 0 1 7 m 1 0 8 f 0 1 9 m 1 0
We use proc means to sum up the boy and girl dummy variables for each family and to create a data file called fam8 that contains the sum of boy in boys and the sum of girl in girls. We use the noprint option to suppress the output of the proc means.
PROC MEANS DATA=kids2 NWAY NOPRINT ; CLASS famid; VAR boy girl ; OUTPUT OUT=fam8 SUM=boys girls ; RUN;
We do a proc print to look at the output data file.
PROC PRINT DATA=fam8; RUN;
As we expect, the proc print shows that boys contains the count of boys in each family and girls contains the count of girls in each family.
OBS FAMID _TYPE_ _FREQ_ BOYS GIRLS 1 1 1 3 1 2 2 2 1 3 2 1 3 3 1 3 2 1
8. Merging the collapsed data back with the original data
Sometimes you want to merge the collapsed data back with the original data. Let’s use an example creating avgage and avgwt for each family, then merge those results back with the original kids data.
First, let’s collapse the data across families to make avgage and avgwt just as we have done before.
PROC MEANS DATA=kids NWAY NOPRINT ; CLASS famid; VAR age wt; OUTPUT OUT=fam9 MEAN=avgage avgwt; RUN;
Second, we sort kids and sort fam9 both on famid, preparing for merging them together.
PROC SORT DATA=kids OUT=skids ; BY famid ; RUN; PROC SORT DATA=fam9 OUT=sfam9 ; BY famid ; RUN;
Third, we merge the sorted files together (skids and sfam9) by famid. We can drop _type_ and _freq_ since they are not needed, but we don’t have to drop them.
DATA kidsmrg ; MERGE skids sfam9 ; BY famid ; DROP _type_ _freq_ ; RUN;
We can print out the results, showing that the variables avgage and avgwt are now merged back with the original kids so each kid has the associated average age and weight for their family.
PROC PRINT DATA=kidsmrg; RUN; OBS KIDNAME SEX FAMID BIRTH AGE WT AVGAGE AVGWT 1 Beth f 1 1 9 60 6.00000 40 2 Bob m 1 2 6 40 6.00000 40 3 Barb f 1 3 3 20 6.00000 40 4 Andy m 2 1 8 80 5.33333 50 5 Al m 2 2 6 50 5.33333 50 6 Ann f 2 3 2 20 5.33333 50 7 Pete m 3 1 6 60 4.00000 40 8 Pam f 3 2 4 40 4.00000 40 9 Phil m 3 3 2 20 4.00000 40
9. Problems to look out for
- You may end up with records that you were not expecting if you forget to use the nway option.
- If you collapse across records, and then remerge back with the original data, be sure that you explicitly name the variables when you collapse them. If you don’t, the variables from the collapsed data will have the same names as the original data, and they will clash when you remerge the data.
10. For more information
-
For more information about merging data files, see the SAS Learning Module on Match Merging Data Files in SAS.