Collapsing data across observations | SPSS Learning Modules

Sometimes you have data files that need to be collapsed to be useful to you. For example, you might have student data but you really want classroom data, or you might have weekly data but you want monthly data, etc. We will illustrate this using an example showing how you can collapse data across kids to make family level data.

Here is a file containing information about the kids in three families. There is one record per kid. Birth is the order of birth (i.e., 1 is first), age wt and sex are the child’s age, weight and sex. We will use this file for showing how to collapse data across observations.

data list list / famid * kidname (A8) birth age wt * sex (A8).
begin data
1 "Beth" 1 9 60 "f"
1 "Bob" 2 6 40 "m"
1 "Barb" 3 3 20 "f"
2 "Andy" 1 8 80 "m"
2 "Al" 2 6 50 "m"
2 "Ann" 3 2 20 "f"
3 "Pete" 1 6 60 "m"
3 "Pam" 2 4 40 "f"
3 "Phil" 3 2 20 "m"
end data.

save outfile 'd:kids.sav'.

First, a note about the way these data need to be entered into SPSS. The (A_) tells SPSS that the variable(s) before that option are string variables. If you are listing only one string variable, you need to put an asterisk before the name of the string variable to tell SPSS that the variables listed before the asterisk are numeric variables. Hence, the asterisks (*) after famid and wt are necessary because SPSS would have assumed that all variables listed before the (A8) option were string variables.

We will also need a constant in the data set, which we create below.

compute const = 1.
execute.
list. 
 
   FAMID KIDNAME     BIRTH      AGE       WT SEX         CONST

    1.00 Beth         1.00     9.00    60.00 f            1.00
    1.00 Bob          2.00     6.00    40.00 m            1.00
    1.00 Barb         3.00     3.00    20.00 f            1.00
    2.00 Andy         1.00     8.00    80.00 m            1.00
    2.00 Al           2.00     6.00    50.00 m            1.00
    2.00 Ann          3.00     2.00    20.00 f            1.00
    3.00 Pete         1.00     6.00    60.00 m            1.00
    3.00 Pam          2.00     4.00    40.00 f            1.00
    3.00 Phil         3.00     2.00    20.00 m            1.00

Number of cases read:  9    Number of cases listed:  9

Consider the aggregate command below. It collapses across all of the observations to make a single record with the average age of the kids. Note that you need to specify a new file into which the aggregated data will be placed. You also need to specify either a /break subcommand or a /missing, /presorted or /document subcommand with the aggregate command. In many cases, the /break subcommand is most useful. It tells SPSS by which variable to collapse the data. In this example, we collapsed by a constant, meaning that all cases were collapsed into a single case. You can see this in the output below.

aggregate outfile 'd:kids1.sav'
 /break = const
 /avgage=mean(age).

get file 'd:kids1.sav'.

list.

   CONST   AVGAGE

    1.00     5.11

Number of cases read:  1    Number of cases listed:  1

The following aggregate command will create one record for each family that contains the average age of the kids in the family. The average of age is named avgage and we specified that we want the mean.

get file 'd:kids.sav'.

aggregate outfile 'd:kids2.sav'
 /break = famid
 /age1 = mean(age).

get file 'd:kids2.sav'.

list.

   FAMID     AGE1

    1.00     6.00
    2.00     5.33
    3.00     4.00

Number of cases read:  3    Number of cases listed:  3

We can request averages for more than one variable. Each new variable must be specified on its own subcommand.

get file 'd:kids.sav'.

aggregate outfile 'd:kids3.sav'
 /break = famid
 /avgage = mean(age)
 /avgwt = mean(wt).

get file 'd:kids3.sav'.

list.

   FAMID   AVGAGE    AVGWT

    1.00     6.00    40.00
    2.00     5.33    50.00
    3.00     4.00    40.00

Number of cases read:  3    Number of cases listed:  3

This command gets the average of age age wt like the command above, and also computes numkids which is the count of the number of kids in each family (obtained by counting the number of observations with valid values of birth). The nu function tells SPSS to provide the number of unweighted cases, i.e., to give the total count.

get file 'd:kids.sav'.

aggregate outfile 'd:kids4.sav'
 /break = famid
 /avgage = mean(age)
 /avgwt = mean(wt)
 /numkids = nu(birth).

get file 'd:kids4.sav'.

list.

   FAMID   AVGAGE    AVGWT NUMKIDS

    1.00     6.00    40.00       3
    2.00     5.33    50.00       3
    3.00     4.00    40.00       3

Number of cases read:  3    Number of cases listed:  3

Suppose you wanted a count of the number of boys & girls in the family. We can do that with one extra step. We will create a dummy variable that is 1 if the kid is a boy (0 if not), and a dummy variable that is 1 if the kid is a girl (and 0 if not). The sum of the boy dummy variable is the number of boys and the sum of the girl dummy variable is the number of girls.

get file 'd:kids.sav'.

freq var = sex.
Statistics
SEX
N Valid 9

Missing 0

**Statistics**
SEX
N	Valid	9
Missing	0

SEX

Frequency Percent Valid Percent Cumulative Percent

Valid f 4 44.4 44.4 44.4

m 5 55.6 55.6 100.0

Total 9 100.0 100.0

**SEX**
	Frequency	Percent	Valid Percent	Cumulative Percent
Valid	f	4	44.4	44.4	44.4
m	5	55.6	55.6	100.0
Total	9	100.0	100.0

We can look at the dummy variables. Sexdum1 is the dummy variable for girls. Sexdum2 is the dummy variable for boys. The sum of sexdum1 is the number of girls in the family. The sum of sexdum2 is the number of boys in the family.

compute sexdum1 = 0.
if sex = "f" sexdum1 = 1.
compute sexdum2 = 0.
if sex = "m" sexdum2 = 1.
execute.

list famid sex sexdum1 sexdum2.

   FAMID SEX       SEXDUM1  SEXDUM2

    1.00 f            1.00      .00
    1.00 m             .00     1.00
    1.00 f            1.00      .00
    2.00 m             .00     1.00
    2.00 m             .00     1.00
    2.00 f            1.00      .00
    3.00 m             .00     1.00
    3.00 f            1.00      .00
    3.00 m             .00     1.00

Number of cases read:  9    Number of cases listed:  9

The command below creates girls which is the number of girls in the family, and boys which is the number of boys in the family. You can also add a label after the new variable name by placing it in single quote marks. You can see the labels in SPSS data editor after clicking on the "variable view" tab in the lower left corner of the editor.

aggregate outfile 'd:kids5.sav'
 /break = famid
 /numkids = nu(birth)
 /girls 'number of girls' = sum(sexdum1)
 /boys 'number of boys' = sum(sexdum2).

get file 'd:kids5.sav'.

We can list out the data to confirm that it worked correctly.

list.

   FAMID NUMKIDS    GIRLS     BOYS

    1.00       3     2.00     1.00
    2.00       3     1.00     2.00
    3.00       3     1.00     2.00

Number of cases read:  3    Number of cases listed:  3

There are a number of functions that can be used with the aggregate command, including:

Function	Purpose
sum	sum
mean	mean
sd	standard deviation
max	maximum
min	minimum
pgt	percent of cases greater than value
plt	percent of cases less than value
pin	percent of cases between values
pout	percent of cases not in range
fgt	fraction greater than value
flt	fraction less than value
fin	fraction between values
fout	fraction not in range
n	weighted number of cases
nu	unweighted number of cases
nmiss	weighted number of missing cases
numiss	unweighted number of missing cases
first	first nonmissing value
last	last nonmissing value