Sometimes you have data files that need to be collapsed to be useful to you. For example, you might have student data but really want classroom data, or you might have weekly data but want monthly data. We will illustrate this with an example showing how you can collapse data across kids to make family-level data.
Here is a file containing information about the kids in three families. There is one record per kid. The variable birth is the order of birth (i.e., 1 is first), and age, wt, and sex are the child’s age, weight, and sex. We will use this file to show how to collapse data across observations.
use https://stats.idre.ucla.edu/stat/stata/modules/kids, clear
list

       famid   kidname   birth   age   wt   sex
  1.       1      Beth       1     9   60     f
  2.       1       Bob       2     6   40     m
  3.       1      Barb       3     3   20     f
  4.       2      Andy       1     8   80     m
  5.       2        Al       2     6   50     m
  6.       2       Ann       3     2   20     f
  7.       3      Pete       1     6   60     m
  8.       3       Pam       2     4   40     f
  9.       3      Phil       3     2   20     m
Consider the collapse command below. It collapses across all of the observations to make a single record with the average age of the kids.
collapse age
list

           age
  1.  5.111111
The above collapse command was not very useful, but you can combine it with the by(famid) option, and then it creates one record for each family that contains the average age of the kids in the family.
use https://stats.idre.ucla.edu/stat/stata/modules/kids, clear
collapse age, by(famid)
list

       famid        age
  1.       1          6
  2.       2   5.333333
  3.       3          4
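Note that collapse replaces the data in memory with the collapsed records, which is why each example here reloads the kids file. If you would rather keep working with the kid-level data afterwards, one option is to wrap the collapse in preserve and restore. The following is a minimal sketch of that pattern, not the only way to do it.

```stata
use https://stats.idre.ucla.edu/stat/stata/modules/kids, clear

* Save a temporary copy of the kid-level data
preserve

* Collapse to one record per family and look at the result
collapse age, by(famid)
list

* Bring back the kid-level data (all nine records)
restore
list
```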
The following collapse command does the exact same thing as above, except that the average of age is named avgage and we have explicitly told the collapse command that we want it to compute the mean.
use https://stats.idre.ucla.edu/stat/stata/modules/kids, clear
collapse (mean) avgage=age, by(famid)
list

       famid     avgage
  1.       1          6
  2.       2   5.333333
  3.       3          4
We can request averages for more than one variable. Here we get the average for age and for wt all in the same command.
use https://stats.idre.ucla.edu/stat/stata/modules/kids, clear
collapse (mean) avgage=age avgwt=wt, by(famid)
list

       famid     avgage   avgwt
  1.       1          6      40
  2.       2   5.333333      50
  3.       3          4      40
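If you want to cross-check family averages like these without replacing the data in memory, one possibility is the tabstat command, which only reports the statistics. This is a sketch of such a check rather than a substitute for collapse, since it does not create a family-level file.

```stata
use https://stats.idre.ucla.edu/stat/stata/modules/kids, clear

* Report the mean of age and wt within each family;
* the kid-level data in memory are left unchanged
tabstat age wt, by(famid) statistics(mean)
```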
This command gets the average of age and wt like the command above, and also computes numkids which is the count of the number of kids in each family (obtained by counting the number of observations with valid values of birth).
use https://stats.idre.ucla.edu/stat/stata/modules/kids, clear
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
list

       famid     avgage   avgwt   numkids
  1.       1          6      40         3
  2.       2   5.333333      50         3
  3.       3          4      40         3
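The same pattern works with other statistics that collapse supports, such as (min) and (max). As a sketch, the command below gets the age of the youngest and oldest kid in each family along with the count. The names youngest and oldest are made up for this illustration. Because the same variable (age) is summarized twice, each result needs its own name.

```stata
use https://stats.idre.ucla.edu/stat/stata/modules/kids, clear

* youngest and oldest are illustrative names chosen for this example
collapse (min) youngest=age (max) oldest=age (count) numkids=birth, by(famid)
list
```

Working from the listing of the kids file, family 1 should show ages 3 and 9, family 2 should show 2 and 8, and family 3 should show 2 and 6.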
Suppose you wanted a count of the number of boys and girls in the family. We can do that with one extra step. We will create a dummy variable that is 1 if the kid is a boy (0 if not), and a dummy variable that is 1 if the kid is a girl (and 0 if not). The sum of the boy dummy variable is the number of boys and the sum of the girl dummy variable is the number of girls.
First, let’s use the kids file (and clear out the existing data).
use https://stats.idre.ucla.edu/stat/stata/modules/kids, clear
We use tabulate with the generate option to make the dummy variables.
tabulate sex, generate(sexdum)

        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
          f |          4       44.44       44.44
          m |          5       55.56      100.00
------------+-----------------------------------
      Total |          9      100.00

We can look at the dummy variables. The dummies are numbered in the sorted order of the values of sex, so sexdum1 is the dummy variable for girls (f) and sexdum2 is the dummy variable for boys (m). The sum of sexdum1 within a family is the number of girls in the family, and the sum of sexdum2 is the number of boys in the family.
list famid sex sexdum1 sexdum2

       famid   sex   sexdum1   sexdum2
  1.       1     f         1         0
  2.       1     m         0         1
  3.       1     f         1         0
  4.       2     m         0         1
  5.       2     m         0         1
  6.       2     f         1         0
  7.       3     m         0         1
  8.       3     f         1         0
  9.       3     m         0         1

The command below creates girls, which is the number of girls in the family, and boys, which is the number of boys in the family. It also creates numkids, the number of kids in the family, as before.
collapse (count) numkids=birth (sum) girls=sexdum1 boys=sexdum2, by(famid)
We can list out the data to confirm that it worked correctly.
list famid boys girls numkids

       famid   boys   girls   numkids
  1.       1      1       2         3
  2.       2      2       1         3
  3.       3      2       1         3
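As an alternative to tabulate with the generate option, the dummy variables can be built directly with generate and a logical expression. The sketch below assumes that sex is stored as a string variable with the values "f" and "m", as the listing suggests. If sex were a numeric variable with value labels, the comparisons would need to use the numeric codes instead. The names girl and boy are chosen for this illustration.

```stata
use https://stats.idre.ucla.edu/stat/stata/modules/kids, clear

* Assumes sex is a string variable coded "f" / "m";
* each expression evaluates to 1 when true and 0 when false
generate byte girl = (sex == "f")
generate byte boy  = (sex == "m")

* Sum the dummies within each family to count girls and boys
collapse (count) numkids=birth (sum) girls=girl boys=boy, by(famid)
list famid boys girls numkids
```

One difference to keep in mind: if sex were missing for a kid, these expressions would code both dummies as 0, whereas tabulate with generate sets its dummies to missing for that observation.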
Summary
To create one record per family (famid) with the average of age within each family.
collapse age, by(famid)
To create one record per family (famid) with the average of age (called avgage) and average weight (called avgwt) within each family.
collapse (mean) avgage=age avgwt=wt, by(famid)
Same as above example, but also counts the number of kids within each family calling that numkids.
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
Counts the number of boys and girls in each family by using tabulate to create dummy variables based on sex and then summing the dummy variables within each family.
tabulate sex, generate(sexdum)
collapse (sum) girls=sexdum1 boys=sexdum2, by(famid)