This module illustrates the power (and simplicity) of Stata in its ability to reshape data files. These examples take wide data files and reshape them into long form. These show common examples of reshaping data, but do not exhaustively demonstrate the different kinds of data reshaping that you could encounter.
Example #1: Reshaping data wide to long
Consider the family income data file below.
use https://stats.idre.ucla.edu/stat/stata/modules/faminc, clearlistfamid faminc96 faminc97 faminc98 1. 3 75000 76000 77000 2. 1 40000 40500 41000 3. 2 45000 45400 45800
This is called a wide format since the years of data are wide. We may want the data to be long, where each year of data is in a separate observation. The reshape command can accomplish this, as shown below.
reshape long faminc, i(famid) j(year)(note: j = 96 97 98) Data wide -> long ----------------------------------------------------------------------------- Number of obs. 3 -> 9 Number of variables 4 -> 3 j variable (3 values) -> year xij variables: faminc96 faminc97 faminc98 -> faminc -----------------------------------------------------------------------------
The list command shows that the data are now in long form, where each year is represented as its own observation.
listfamid year faminc 1. 1 96 40000 2. 1 97 40500 3. 1 98 41000 4. 2 96 45000 5. 2 97 45400 6. 2 98 45800 7. 3 96 75000 8. 3 97 76000 9. 3 98 77000
Let’s look at the wide format and contrast it with the long format.
The reshape wide command puts the data back into wide format. We then list out the wide file.
reshape wide(note: j = 96 97 98) Data long -> wide ----------------------------------------------------------------------------- Number of obs. 9 -> 3 Number of variables 3 -> 4 j variable (3 values) year -> (dropped) xij variables: faminc -> faminc96 faminc97 faminc98 -----------------------------------------------------------------------------listfamid faminc96 faminc97 faminc98 1. 1 40000 40500 41000 2. 2 45000 45400 45800 3. 3 75000 76000 77000
The reshape long command puts the data back into long format. We then list out the long file.
reshape long(note: j = 96 97 98) Data wide -> long ----------------------------------------------------------------------------- Number of obs. 3 -> 9 Number of variables 4 -> 3 j variable (3 values) -> year xij variables: faminc96 faminc97 faminc98 -> faminc -----------------------------------------------------------------------------listfamid year faminc 1. 1 96 40000 2. 1 97 40500 3. 1 98 41000 4. 2 96 45000 5. 2 97 45400 6. 2 98 45800 7. 3 96 75000 8. 3 97 76000 9. 3 98 77000
Now let’s look at the pieces of the original reshape command.
reshape long faminc, i(famid) j(year)
long tells reshape that we want to go from wide to long
faminc tells
Stata that the stem of the variable to be converted from wide to
long is faminc
i(famid) option tells reshape that famid is the unique identifier for records in their
wide format
j(year) tells reshape that the suffix of faminc (i.e., 96 97 98) should be placed in a variable called
year
Example #2: Reshaping data wide to long
Consider the file containing the kids and their heights at 1 year of age (ht1) and at 2 years of age (ht2).
use https://stats.idre.ucla.edu/stat/stata/modules/kidshtwt, clearlist famid birth ht1 ht2famid birth ht1 ht2 1. 1 1 2.8 3.4 2. 1 2 2.9 3.8 3. 1 3 2.2 2.9 4. 2 1 2 3.2 5. 2 2 1.8 2.8 6. 2 3 1.9 2.4 7. 3 1 2.2 3.3 8. 3 2 2.3 3.4 9. 3 3 2.1 2.9
Lets reshape this data into a long format. The critical questions
are:
Q: What is the stem of the variable going from wide to long.
A: The stem is
ht
Q: What variable uniquely identifies an observation when it is in the
wide form.
A: famid and birth together uniquely identify the
wide observations.
Q: What do we want to call the variable which contains the suffix of
ht, i.e., 1
and 2.
A: Lets call the suffix age.
With the answers to these questions, the reshape command will look like this.
reshape long ht, i(famid birth) j(age)
Let’s look at the wide data, and then the data reshaped to be long.
list famid birth ht1 ht2famid birth ht1 ht2 1. 1 1 2.8 3.4 2. 1 2 2.9 3.8 3. 1 3 2.2 2.9 4. 2 1 2 3.2 5. 2 2 1.8 2.8 6. 2 3 1.9 2.4 7. 3 1 2.2 3.3 8. 3 2 2.3 3.4 9. 3 3 2.1 2.9reshape long ht, i(famid birth) j(age)(note: j = 1 2) Data wide -> long ----------------------------------------------------------------------------- Number of obs. 9 -> 18 Number of variables 7 -> 7 j variable (2 values) -> age xij variables: ht1 ht2 -> ht -----------------------------------------------------------------------------list famid birth age htfamid birth age ht 1. 1 1 1 2.8 2. 1 1 2 3.4 3. 1 2 1 2.9 4. 1 2 2 3.8 5. 1 3 1 2.2 6. 1 3 2 2.9 7. 2 1 1 2 8. 2 1 2 3.2 9. 2 2 1 1.8 10. 2 2 2 2.8 11. 2 3 1 1.9 12. 2 3 2 2.4 13. 3 1 1 2.2 14. 3 1 2 3.3 15. 3 2 1 2.3 16. 3 2 2 3.4 17. 3 3 1 2.1 18. 3 3 2 2.9
Example #3: Reshaping data wide to long
The file with the kids heights at age 1 and age 2 also contains their weights at age 1 and age 2 (called wt1 and wt2).
use https://stats.idre.ucla.edu/stat/stata/modules/kidshtwt, clearlist famid birth ht1 ht2 wt1 wt2famid birth ht1 ht2 wt1 wt2 1. 1 1 2.8 3.4 19 28 2. 1 2 2.9 3.8 21 28 3. 1 3 2.2 2.9 20 23 4. 2 1 2 3.2 25 30 5. 2 2 1.8 2.8 20 33 6. 2 3 1.9 2.4 22 33 7. 3 1 2.2 3.3 22 28 8. 3 2 2.3 3.4 20 30 9. 3 3 2.1 2.9 22 31
Let’s reshape this data into a long format. This is basically the same as the previous command except that ht is replaced with ht wt.
reshape long ht wt, i(famid birth) j(age)
Let’s look at the wide data, and then the data reshaped to be long.
list famid birth ht1 ht2 wt1 wt2famid birth ht1 ht2 wt1 wt2 1. 1 1 2.8 3.4 19 28 2. 1 2 2.9 3.8 21 28 3. 1 3 2.2 2.9 20 23 4. 2 1 2 3.2 25 30 5. 2 2 1.8 2.8 20 33 6. 2 3 1.9 2.4 22 33 7. 3 1 2.2 3.3 22 28 8. 3 2 2.3 3.4 20 30 9. 3 3 2.1 2.9 22 31reshape long ht wt, i(famid birth) j(age)(note: j = 1 2) Data wide -> long ----------------------------------------------------------------------------- Number of obs. 9 -> 18 Number of variables 7 -> 6 j variable (2 values) -> age xij variables: ht1 ht2 -> ht wt1 wt2 -> wt -----------------------------------------------------------------------------list famid birth age ht wtfamid birth age ht wt 1. 1 1 1 2.8 19 2. 1 1 2 3.4 28 3. 1 2 1 2.9 21 4. 1 2 2 3.8 28 5. 1 3 1 2.2 20 6. 1 3 2 2.9 23 7. 2 1 1 2 25 8. 2 1 2 3.2 30 9. 2 2 1 1.8 20 10. 2 2 2 2.8 33 11. 2 3 1 1.9 22 12. 2 3 2 2.4 33 13. 3 1 1 2.2 22 14. 3 1 2 3.3 28 15. 3 2 1 2.3 20 16. 3 2 2 3.4 30 17. 3 3 1 2.1 22 18. 3 3 2 2.9 31
Example #4: Reshaping data wide to long with character suffixes
It also is possible to reshape a wide data file to be long when there are character suffixes. Look at the dadmomw file below.
use https://stats.idre.ucla.edu/stat/stata/modules/dadmomw, clearlistfamid named incd namem incm 1. 1 Bill 30000 Bess 15000 2. 2 Art 22000 Amy 18000 3. 3 Paul 25000 Pat 50000
We would like to make name and inc into long formats but their suffixes are characters (d & m) instead of numbers. Stata can handle that as long as you use the string option to indicate that the suffix is a character. In the example below, the string option has been added at the end of the command.
reshape long name inc, i(famid) j(dadmom) string
Let’s look at the data before and after reshaping.
listfamid named incd namem incm 1. 1 Bill 30000 Bess 15000 2. 2 Art 22000 Amy 18000 3. 3 Paul 25000 Pat 50000reshape long name inc, i(famid) j(dadmom) string(note: j = d m) Data wide -> long ----------------------------------------------------------------------------- Number of obs. 3 -> 6 Number of variables 5 -> 4 j variable (2 values) -> dadmom xij variables: named namem -> name incd incm -> inc -----------------------------------------------------------------------------listfamid dadmom name inc 1. 1 d Bill 30000 2. 1 m Bess 15000 3. 2 d Art 22000 4. 2 m Amy 18000 5. 3 d Paul 25000 6. 3 m Pat 50000
Summary reshaping data wide to long
Wide format famid faminc96 faminc97 faminc98 1. 1 40000 40500 41000 2. 2 45000 45400 45800 3. 3 75000 76000 77000 reshape long faminc, i(famid) j(year) Long Format famid year faminc 1. 1 96 40000 2. 1 97 40500 3. 1 98 41000 4. 2 96 45000 5. 2 97 45400 6. 2 98 45800 7. 3 96 75000 8. 3 97 76000 9. 3 98 77000
The general syntax of reshape long can be expressed as…
reshape long stem-of-wide-vars, i(wide-id-var) j(var-for-suffix)
where
stem-of-wide-vars is the stem of the wide variables, e.g., faminc wide-id-var is the variable that uniquely identifies wide observations, e.g., famid var-for-suffix is the variable that will contain the suffix of the wide variables, e.g., year