Sometimes you have data files that need to be collapsed to be useful to you. For example, you might have student data but you really want classroom data, or you might have weekly data but you want monthly data, etc. We will illustrate this using an example showing how you can collapse data across kids to make family level data.
Here is a file containing information about the kids in three families. There is one record per kid. Birth is the order of birth (i.e., 1 is first), age wt and sex are the child’s age, weight and sex. We will use this file for showing how to collapse data across observations.
data list list / famid * kidname (A8) birth age wt * sex (A8). begin data 1 "Beth" 1 9 60 "f" 1 "Bob" 2 6 40 "m" 1 "Barb" 3 3 20 "f" 2 "Andy" 1 8 80 "m" 2 "Al" 2 6 50 "m" 2 "Ann" 3 2 20 "f" 3 "Pete" 1 6 60 "m" 3 "Pam" 2 4 40 "f" 3 "Phil" 3 2 20 "m" end data. save outfile 'd:kids.sav'.
First, a note about the way these data need to be entered into SPSS. The (A_) tells SPSS that the variable(s) before that option are string variables. If you are listing only one string variable, you need to put an asterisk before the name of the string variable to tell SPSS that the variables listed before the asterisk are numeric variables. Hence, the asterisks (*) after famid and wt are necessary because SPSS would have assumed that all variables listed before the (A8) option were string variables.
We will also need a constant in the data set, which we create below.
compute const = 1. execute. list. FAMID KIDNAME BIRTH AGE WT SEX CONST 1.00 Beth 1.00 9.00 60.00 f 1.00 1.00 Bob 2.00 6.00 40.00 m 1.00 1.00 Barb 3.00 3.00 20.00 f 1.00 2.00 Andy 1.00 8.00 80.00 m 1.00 2.00 Al 2.00 6.00 50.00 m 1.00 2.00 Ann 3.00 2.00 20.00 f 1.00 3.00 Pete 1.00 6.00 60.00 m 1.00 3.00 Pam 2.00 4.00 40.00 f 1.00 3.00 Phil 3.00 2.00 20.00 m 1.00 Number of cases read: 9 Number of cases listed: 9
Consider the aggregate command below. It collapses across all of the observations to make a single record with the average age of the kids. Note that you need to specify a new file into which the aggregated data will be placed. You also need to specify either a /break subcommand or a /missing, /presorted or /document subcommand with the aggregate command. In many cases, the /break subcommand is most useful. It tells SPSS by which variable to collapse the data. In this example, we collapsed by a constant, meaning that all cases were collapsed into a single case. You can see this in the output below.
aggregate outfile 'd:kids1.sav' /break = const /avgage=mean(age). get file 'd:kids1.sav'. list.CONST AVGAGE 1.00 5.11 Number of cases read: 1 Number of cases listed: 1
The following aggregate command will create one record for each family that contains the average age of the kids in the family. The average of age is named avgage and we specified that we want the mean.
get file 'd:kids.sav'. aggregate outfile 'd:kids2.sav' /break = famid /age1 = mean(age). get file 'd:kids2.sav'. list.FAMID AGE1 1.00 6.00 2.00 5.33 3.00 4.00 Number of cases read: 3 Number of cases listed: 3
We can request averages for more than one variable. Each new variable must be specified on its own subcommand.
get file 'd:kids.sav'. aggregate outfile 'd:kids3.sav' /break = famid /avgage = mean(age) /avgwt = mean(wt). get file 'd:kids3.sav'. list.FAMID AVGAGE AVGWT 1.00 6.00 40.00 2.00 5.33 50.00 3.00 4.00 40.00 Number of cases read: 3 Number of cases listed: 3
This command gets the average of age age wt like the command above, and also computes numkids which is the count of the number of kids in each family (obtained by counting the number of observations with valid values of birth). The nu function tells SPSS to provide the number of unweighted cases, i.e., to give the total count.
get file 'd:kids.sav'. aggregate outfile 'd:kids4.sav' /break = famid /avgage = mean(age) /avgwt = mean(wt) /numkids = nu(birth). get file 'd:kids4.sav'. list.FAMID AVGAGE AVGWT NUMKIDS 1.00 6.00 40.00 3 2.00 5.33 50.00 3 3.00 4.00 40.00 3 Number of cases read: 3 Number of cases listed: 3
Suppose you wanted a count of the number of boys & girls in the family. We can do that with one extra step. We will create a dummy variable that is 1 if the kid is a boy (0 if not), and a dummy variable that is 1 if the kid is a girl (and 0 if not). The sum of the boy dummy variable is the number of boys and the sum of the girl dummy variable is the number of girls.
get file 'd:kids.sav'. freq var = sex.
Statistics
SEXN Valid 9 Missing 0
SEX Frequency Percent Valid Percent Cumulative Percent Valid f 4 44.4 44.4 44.4 m 5 55.6 55.6 100.0 Total 9 100.0 100.0
We can look at the dummy variables. Sexdum1 is the dummy variable for girls. Sexdum2 is the dummy variable for boys. The sum of sexdum1 is the number of girls in the family. The sum of sexdum2 is the number of boys in the family.
compute sexdum1 = 0. if sex = "f" sexdum1 = 1. compute sexdum2 = 0. if sex = "m" sexdum2 = 1. execute. list famid sex sexdum1 sexdum2.FAMID SEX SEXDUM1 SEXDUM2 1.00 f 1.00 .00 1.00 m .00 1.00 1.00 f 1.00 .00 2.00 m .00 1.00 2.00 m .00 1.00 2.00 f 1.00 .00 3.00 m .00 1.00 3.00 f 1.00 .00 3.00 m .00 1.00 Number of cases read: 9 Number of cases listed: 9
The command below creates girls which is the number of girls in the family, and boys which is the number of boys in the family. You can also add a label after the new variable name by placing it in single quote marks. You can see the labels in SPSS data editor after clicking on the "variable view" tab in the lower left corner of the editor.
aggregate outfile 'd:kids5.sav' /break = famid /numkids = nu(birth) /girls 'number of girls' = sum(sexdum1) /boys 'number of boys' = sum(sexdum2). get file 'd:kids5.sav'.
We can list out the data to confirm that it worked correctly.
list.FAMID NUMKIDS GIRLS BOYS 1.00 3 2.00 1.00 2.00 3 1.00 2.00 3.00 3 1.00 2.00 Number of cases read: 3 Number of cases listed: 3
There are a number of functions that can be used with the aggregate command, including:
Function | Purpose |
sum | sum |
mean | mean |
sd | standard deviation |
max | maximum |
min | minimum |
pgt | percent of cases greater than value |
plt | percent of cases less than value |
pin | percent of cases between values |
pout | percent of cases not in range |
fgt | fraction greater than value |
flt | fraction less than value |
fin | fraction between values |
fout | fraction not in range |
n | weighted number of cases |
nu | unweighted number of cases |
nmiss | weighted number of missing cases |
numiss | unweighted number of missing cases |
first | first nonmissing value |
last | last nonmissing value |