1. Introduction
This module will illustrate how to collapse across variables. First, let’s read in a sample dataset named kids which includes the variables famid (family id) and wt (kids weight in pounds).
DATA kids; LENGTH kidname $ 4 sex $ 1; INPUT famid kidname birth age wt sex ; CARDS; 1 Beth 1 9 60 f 1 Bob 2 6 40 m 1 Barb 3 3 20 f 2 Andy 1 8 80 m 2 Al 2 6 50 m 2 Ann 3 2 20 f 3 Pete 1 6 60 m 3 Pam 2 4 40 f 3 Phil 3 2 20 m 4 Sam 1 11 100 m 4 Stu 2 8 90 m ; RUN; PROC PRINT DATA=kids; RUN;
The output is shown below.
OBS KIDNAME SEX FAMID BIRTH AGE WT 1 Beth f 1 1 9 60 2 Bob m 1 2 6 40 3 Barb f 1 3 3 20 4 Andy m 2 1 8 80 5 Al m 2 2 6 50 6 Ann f 2 3 2 20 7 Pete m 3 1 6 60 8 Pam f 3 2 4 40 9 Phil m 3 3 2 20 10 Sam m 4 1 11 100 11 Stu m 4 2 8 90
2. Collapsing and computing average weights using proc means
Next, by using proc means, one can create a variable that represents the sum of ALL the weights of each person within a family, a variable that represents the average weight of each person within a family, and a variable that counts the number of people within a family. This can be seen in the example below, where three new variables, sumwt, meanwt and cnt, are created by famid, and then written to the new dataset fam1.
PROC MEANS DATA=kids NWAY ; CLASS famid ; VAR wt ; OUTPUT OUT=fam1 SUM=sumwt MEAN=meanwt N=cnt ; RUN; PROC PRINT DATA=fam1; VAR famid sumwt meanwt cnt; RUN;
The output is shown below.
Analysis Variable : WTFAMID N Obs N Mean Std Dev Minimum Maximum ——————————————————————————- 1 3 3 40.0000000 20.0000000 20.0000000 60.0000000
2 3 3 50.0000000 30.0000000 20.0000000 80.0000000
3 3 3 40.0000000 20.0000000 20.0000000 60.0000000
4 2 2 95.0000000 7.0710678 90.0000000 100.0000000 ——————————————————————————- OBS FAMID SUMWT MEANWT CNT
1 1 120 40 3 2 2 150 50 3 3 3 120 40 3 4 4 190 95 2
3. Collapsing and computing average weights manually (collapsing across observations)
Of course, collapsing can always be done manually within a data step. This, however, requires a bit more complex SAS programming. To create sum, mean, and N (sample size) variables that summarize values within a group (e.g., families), one can count over observations within a group by using a retained variable and a counter. In the example below, retained counter variables are created that count across observations within families until the last record within a family is encountered. (This is possible because a retained variable allows the value for the last observation to be available for use when accessing the current observation.) Then, the retained variable from the last observation within each family is written to the new SAS dataset fam2. At the final step, only the variables famid, sumwt, meanwt, and cnt are kept in the dataset fam2. Note that the variable meanwt does NOT need to be retained. This is because at each step, it is simply a function of the retained variables sumwt and cnt.
PROC SORT DATA=kids OUT=sortkids ; BY famid ; RUN ; DATA fam2 ; SET sortkids ; BY famid ; RETAIN sumwt cnt; IF first.famid THEN DO; sumwt = 0; cnt = 0; END; sumwt = sumwt + wt ; cnt = cnt + 1; meanwt=sumwt/cnt; /* this outputs a record ONLY when at the last obs in a family*/ ; IF last.famid THEN OUTPUT; KEEP famid sumwt meanwt cnt ; RUN; PROC PRINT DATA=fam2 ; RUN;
The output is shown below.
OBS FAMID SUMWT CNT MEANWT 1 1 120 3 40 2 2 150 3 50 3 3 120 3 40 4 4 190 2 95
4. Computing sums, counts and other summary information
The above example illustrated how one can compute sums, means, and counts within groups using the retain statement within a data step. Other variables, such as dummy or flag variables, can also be computed using the retain statement. For example, say a study is interested in (1) the number of boys in each family, (2) whether or not there is a girl in the family and (3) if any of the children in each family are over 85 pounds in weight. All of this information can be collected and stored using the retain statement. The example below works similarly to the example above; however, this example additionally creates a variable numboys, which counts the number of boys in each family, and the flag variables hasgirl and over85, which take on the values of ‘1’ or ‘0’, depending on whether or not there is a girl in the family, or if a family has a child over 85 pounds, respectively.
PROC SORT DATA=kids OUT=sortkids ; BY famid ; RUN ; DATA fam3 ; SET sortkids ; BY famid ; RETAIN sumwt cnt numboys hasgirl over85 ; IF first.famid THEN DO; sumwt = 0; /* sum of weights for family */ ; cnt = 0; /* count of kids in family */; numboys= 0; /* number of boys in family */; hasgirl= 0; /* 1 if family has girl, 0 if no girl */; over85 = 0; /* 1 if family has child with wt over 85, 0 if not */; END; sumwt = sumwt + wt ; cnt = cnt + 1; IF (sex = 'm') THEN numboys = numboys + 1 ; IF (sex = 'f') THEN hasgirl = 1 ; IF (wt > 85) THEN over85 = 1 ; /* this outputs a record ONLY when at the last obs in a family */; IF last.famid THEN DO; meanwt = sumwt / cnt ; /* do any final computations before outputting record */; OUTPUT; END; KEEP famid sumwt cnt numboys hasgirl over85 meanwt ; RUN; PROC PRINT DATA=fam3 ; RUN;
The output is shown below.
OBS FAMID SUMWT CNT NUMBOYS HASGIRL OVER85 MEANWT1 1 120 3 1 1 0 40 2 2 150 3 2 1 0 50 3 3 120 3 2 1 0 40 4 4 190 2 2 0 1 95