Reshaping data wide to long | Stata Learning Modules

This module illustrates the power (and simplicity) of Stata in its ability to reshape data files. These examples take wide data files and reshape them into long form. These show common examples of reshaping data, but do not exhaustively demonstrate the different kinds of data reshaping that you could encounter.

Example #1: Reshaping data wide to long

Consider the family income data file below.

use https://stats.idre.ucla.edu/stat/stata/modules/faminc, clear

list

         famid   faminc96   faminc97   faminc98 
  1.         3      75000      76000      77000  
  2.         1      40000      40500      41000  
  3.         2      45000      45400      45800

This is called a wide format since the years of data are wide. We may want the data to be long, where each year of data is in a separate observation. The reshape command can accomplish this, as shown below.

reshape long faminc, i(famid) j(year)

(note:  j = 96 97 98)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        3   ->       9
Number of variables                   4   ->       3
j variable (3 values)                     ->   year
xij variables:
             faminc96 faminc97 faminc98   ->   faminc
-----------------------------------------------------------------------------

The list command shows that the data are now in long form, where each year is represented as its own observation.

list

         famid       year     faminc 
  1.         1         96      40000  
  2.         1         97      40500  
  3.         1         98      41000  
  4.         2         96      45000  
  5.         2         97      45400  
  6.         2         98      45800  
  7.         3         96      75000  
  8.         3         97      76000  
  9.         3         98      77000

Let’s look at the wide format and contrast it with the long format.

The reshape wide command puts the data back into wide format. We then list out the wide file.

reshape wide

(note:  j = 96 97 98)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                        9   ->       3
Number of variables                   3   ->       4
j variable (3 values)              year   ->   (dropped)
xij variables:
                                 faminc   ->   faminc96 faminc97 faminc98
-----------------------------------------------------------------------------

list

         famid   faminc96   faminc97   faminc98 
  1.         1      40000      40500      41000  
  2.         2      45000      45400      45800  
  3.         3      75000      76000      77000

The reshape long command puts the data back into long format. We then list out the long file.

reshape long

(note:  j = 96 97 98)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        3   ->       9
Number of variables                   4   ->       3
j variable (3 values)                     ->   year
xij variables:
             faminc96 faminc97 faminc98   ->   faminc
-----------------------------------------------------------------------------

list

         famid       year     faminc 
  1.         1         96      40000  
  2.         1         97      40500  
  3.         1         98      41000  
  4.         2         96      45000  
  5.         2         97      45400  
  6.         2         98      45800  
  7.         3         96      75000  
  8.         3         97      76000  
  9.         3         98      77000

Now let’s look at the pieces of the original reshape command.

reshape long faminc, i(famid) j(year)

long tells reshape that we want to go from wide to long
faminc tells Stata that the stem of the variable to be converted from wide to long is faminc
i(famid) option tells reshape that famid is the unique identifier for records in their wide format
j(year) tells reshape that the suffix of faminc (i.e., 96 97 98) should be placed in a variable called year

Example #2: Reshaping data wide to long

Consider the file containing the kids and their heights at 1 year of age (ht1) and at 2 years of age (ht2).

use https://stats.idre.ucla.edu/stat/stata/modules/kidshtwt, clear

list famid birth ht1 ht2

          famid      birth        ht1        ht2 
  1.         1          1        2.8        3.4  
  2.         1          2        2.9        3.8  
  3.         1          3        2.2        2.9  
  4.         2          1          2        3.2  
  5.         2          2        1.8        2.8  
  6.         2          3        1.9        2.4  
  7.         3          1        2.2        3.3  
  8.         3          2        2.3        3.4  
  9.         3          3        2.1        2.9

Lets reshape this data into a long format. The critical questions are:
Q: What is the stem of the variable going from wide to long.
A: The stem is ht
Q: What variable uniquely identifies an observation when it is in the wide form.
A: famid and birth together uniquely identify the wide observations.
Q: What do we want to call the variable which contains the suffix of ht, i.e., 1 and 2.
A: Lets call the suffix age.

With the answers to these questions, the reshape command will look like this.

reshape long ht, i(famid birth) j(age)

Let’s look at the wide data, and then the data reshaped to be long.

list famid birth ht1 ht2

         famid      birth        ht1        ht2 
  1.         1          1        2.8        3.4  
  2.         1          2        2.9        3.8  
  3.         1          3        2.2        2.9  
  4.         2          1          2        3.2  
  5.         2          2        1.8        2.8  
  6.         2          3        1.9        2.4  
  7.         3          1        2.2        3.3  
  8.         3          2        2.3        3.4  
  9.         3          3        2.1        2.9

reshape long ht, i(famid birth) j(age)

(note:  j = 1 2)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        9   ->      18
Number of variables                   7   ->       7
j variable (2 values)                     ->   age
xij variables:
                                ht1 ht2   ->   ht
-----------------------------------------------------------------------------

list famid birth age ht

         famid      birth        age         ht 
  1.         1          1          1        2.8  
  2.         1          1          2        3.4  
  3.         1          2          1        2.9  
  4.         1          2          2        3.8  
  5.         1          3          1        2.2  
  6.         1          3          2        2.9  
  7.         2          1          1          2  
  8.         2          1          2        3.2  
  9.         2          2          1        1.8  
 10.         2          2          2        2.8  
 11.         2          3          1        1.9  
 12.         2          3          2        2.4  
 13.         3          1          1        2.2  
 14.         3          1          2        3.3  
 15.         3          2          1        2.3  
 16.         3          2          2        3.4  
 17.         3          3          1        2.1  
 18.         3          3          2        2.9

Example #3: Reshaping data wide to long

The file with the kids heights at age 1 and age 2 also contains their weights at age 1 and age 2 (called wt1 and wt2).

use https://stats.idre.ucla.edu/stat/stata/modules/kidshtwt, clear

list famid birth ht1 ht2 wt1 wt2

         famid      birth        ht1        ht2        wt1        wt2 
  1.         1          1        2.8        3.4         19         28  
  2.         1          2        2.9        3.8         21         28  
  3.         1          3        2.2        2.9         20         23  
  4.         2          1          2        3.2         25         30  
  5.         2          2        1.8        2.8         20         33  
  6.         2          3        1.9        2.4         22         33  
  7.         3          1        2.2        3.3         22         28  
  8.         3          2        2.3        3.4         20         30  
  9.         3          3        2.1        2.9         22         31

Let’s reshape this data into a long format. This is basically the same as the previous command except that ht is replaced with ht wt.

reshape long ht wt, i(famid birth) j(age)

Let’s look at the wide data, and then the data reshaped to be long.

list famid birth ht1 ht2 wt1 wt2

         famid      birth        ht1        ht2        wt1        wt2 
  1.         1          1        2.8        3.4         19         28  
  2.         1          2        2.9        3.8         21         28  
  3.         1          3        2.2        2.9         20         23  
  4.         2          1          2        3.2         25         30  
  5.         2          2        1.8        2.8         20         33  
  6.         2          3        1.9        2.4         22         33  
  7.         3          1        2.2        3.3         22         28  
  8.         3          2        2.3        3.4         20         30  
  9.         3          3        2.1        2.9         22         31

reshape long ht wt, i(famid birth) j(age)

(note:  j = 1 2)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        9   ->      18
Number of variables                   7   ->       6
j variable (2 values)                     ->   age
xij variables:
                                ht1 ht2   ->   ht
                                wt1 wt2   ->   wt
-----------------------------------------------------------------------------

list famid birth age ht wt

          famid      birth        age         ht         wt 
  1.         1          1          1        2.8         19  
  2.         1          1          2        3.4         28  
  3.         1          2          1        2.9         21  
  4.         1          2          2        3.8         28  
  5.         1          3          1        2.2         20  
  6.         1          3          2        2.9         23  
  7.         2          1          1          2         25  
  8.         2          1          2        3.2         30  
  9.         2          2          1        1.8         20  
 10.         2          2          2        2.8         33  
 11.         2          3          1        1.9         22  
 12.         2          3          2        2.4         33  
 13.         3          1          1        2.2         22  
 14.         3          1          2        3.3         28  
 15.         3          2          1        2.3         20  
 16.         3          2          2        3.4         30  
 17.         3          3          1        2.1         22  
 18.         3          3          2        2.9         31

Example #4: Reshaping data wide to long with character suffixes

It also is possible to reshape a wide data file to be long when there are character suffixes. Look at the dadmomw file below.

use https://stats.idre.ucla.edu/stat/stata/modules/dadmomw, clear

list

         famid      named       incd      namem       incm 
  1.         1       Bill      30000       Bess      15000  
  2.         2        Art      22000        Amy      18000  
  3.         3       Paul      25000        Pat      50000

We would like to make name and inc into long formats but their suffixes are characters (d & m) instead of numbers. Stata can handle that as long as you use the string option to indicate that the suffix is a character. In the example below, the string option has been added at the end of the command.

reshape long name  inc, i(famid) j(dadmom) string

Let’s look at the data before and after reshaping.

list

         famid      named       incd      namem       incm 
  1.         1       Bill      30000       Bess      15000  
  2.         2        Art      22000        Amy      18000  
  3.         3       Paul      25000        Pat      50000

reshape long name inc, i(famid) j(dadmom) string

(note:  j = d m)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        3   ->       6
Number of variables                   5   ->       4
j variable (2 values)                     ->   dadmom
xij variables:
                            named namem   ->   name
                              incd incm   ->   inc
-----------------------------------------------------------------------------

list

         famid     dadmom       name        inc 
  1.         1          d       Bill      30000  
  2.         1          m       Bess      15000  
  3.         2          d        Art      22000  
  4.         2          m        Amy      18000  
  5.         3          d       Paul      25000  
  6.         3          m        Pat      50000

Summary reshaping data wide to long

 Wide format
          famid   faminc96   faminc97   faminc98 
  1.         1      40000      40500      41000  
  2.         2      45000      45400      45800  
  3.         3      75000      76000      77000  

reshape long faminc, i(famid) j(year)

 Long Format
         famid       year     faminc 
  1.         1         96      40000 
  2.         1         97      40500 
  3.         1         98      41000 
  4.         2         96      45000 
  5.         2         97      45400 
  6.         2         98      45800 
  7.         3         96      75000 
  8.         3         97      76000 
  9.         3         98      77000

The general syntax of reshape long can be expressed as…

reshape long stem-of-wide-vars, i(wide-id-var)  j(var-for-suffix)

where

stem-of-wide-vars  is the stem of the wide variables, e.g., faminc
wide-id-var        is the variable that uniquely identifies wide 
                   observations, e.g., famid
var-for-suffix     is the variable that will contain the suffix of 
                   the wide variables, e.g., year