There are several ways to reshape data from a long to a wide format in SAS. For example, you can reshape your data using proc transpose or reshaping the data in a data step. The following will illustrate how to reshape data from long to wide using the data step.
Example 1: Reshaping one variable
We will begin with a small data set with only one variable to be reshaped. We will use the variables year and faminc (for family income) to create three new variables: faminc96, faminc97 and faminc98. First, let’s look at the data set and use proc print to display it.
DATA long ; INPUT famid year faminc ; CARDS ; 1 96 40000 1 97 40500 1 98 41000 2 96 45000 2 97 45400 2 98 45800 3 96 75000 3 97 76000 3 98 77000 ; RUN ;PROC PRINT DATA=long ; RUN ;Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000
Now let’s look at the program. The first step in the reshaping process is sorting the data (using proc sort) on an identification variable (famid) and saving the sorted data set (longsort). Next we write a data step to do the actual reshaping. We will explain each of the statements in the data step in order.
PROC SORT DATA=long OUT=longsort ;
BY famid ;
RUN ;
DATA wide1 ;
SET longsort ;
BY famid ;
KEEP famid faminc96 -faminc98 ;
RETAIN faminc96 - faminc98 ;
ARRAY afaminc(96:98) faminc96 - faminc98 ;
IF first.famid THEN
DO;
DO i = 96 to 98 ;
afaminc( i ) = . ;
END;
END;
afaminc( year ) = faminc ;
IF last.famid THEN OUTPUT ;
RUN;
The new data set is named wide1 on the data statement. The old set, longsort, upon which the new data set will based, is named on the set statement. Please note that you must use the sorted version of the old set in order for the reshaping to work properly. The identification variable (famid) is used on the by statement. Please note that this must be the same identification variable on which the data were sorted. If the by statement is omitted, then the if-then-do statements (which occur later in the data step) will not work properly, and SAS will issue an error message.
Next, a keep statement is used to keep the desired variables in the new data set. We will keep the identification variable (famid) and the three variables we are creating, faminc96, faminc97 and faminc98. Any variable that is not listed on the keep statement will not be present in the new data set. A retain statement is used to tell SAS to retain the current values of the variables listed. If the retain statement is omitted, only the values for faminc98 are placed in the new data set, while all of the data for faminc96 and faminc97 will be missing.
An array statement is used to define the variables faminc96, faminc97 and faminc98. In our example, we have named the array afaminc. In parenthesis we have given the first and last value of the array separated by a colon (:). The new variable names are then listed. We can list the first and last new variable names separated by a dash because the names are numerically consecutive. This is especially convenient when there are a large number of variables being defined by the array.
We use an if-then-do statement to set the conditions for a do-loop. The by statement that we used above not only caused SAS to process the data in the groups defined by the variable (famid) given on the by statement, it also caused SAS to create two temporary variables: first.famid and last.famid. Temporary variables are variables that you can use during a data step but do not appear in the new data set. The value of first.famid is always zero except when SAS is processing the first row of data for a given value of famid, then first.famid is one. In other words, first.famid is an indicator variable that is one when it is true that SAS is processing the first row of data for a given value of famid, and zero when it is not. The variable last.famid is created with the same logic, except that it equals one when SAS is processing the last row of data for a given value of famid and zero otherwise. You can now see why the data needed to be sorted on famid before we began the data step.
In the do-loop, we set i (you can use any name that you like) equal to the range we used in the array statement. We then set the array with i as the index variable equal to missing, which in SAS is a dot (.). As SAS processes the data through this loop, the missing values will be replaced with the data. If some of the data are missing, the missing value will not be altered. Please note that you can set the array equal to any value and that value will appear in any place that there is missing data.
After ending the do-loop and the if-then-do statement, we set the array with year as the index variable equal to faminc, the variable in the old data set. (Please note that you must have an end statement for each do.) It is at this point that the data in the variable faminc are associated with the new variables.
Finally, we use an if-then statement to have SAS output the data to the new data set each time it encounters the last occurrence of each value of famid. If this statement is omitted, SAS will output the results of its processing after each cycle of the loop, and we will end up with nine records instead of three. At long last, it is time to look at the new data set with a proc print to ensure that the reshaping went as desired, and we can see that it did.
PROC PRINT DATA=wide1; RUN;Obs famid faminc96 faminc97 faminc98 1 1 40000 40500 41000 2 2 45000 45400 45800 3 3 75000 76000 77000
Example 2: Reshaping two variables
This example is very similar to the last example, except that in this example we will be reshaping two variables, faminc and spend.
DATA long2; INPUT famid year faminc spend ; CARDS; 1 96 40000 38000 1 97 40500 39000 1 98 41000 40000 2 96 45000 42000 2 97 45400 43000 2 98 45800 44000 3 96 75000 70000 3 97 76000 71000 3 98 77000 72000 ; RUN ;
We will modify the previous data step by adding the new variable to two statements that were present in the previous example and adding three new statements. In reshaping the data from long to wide, we will create six new variables: faminc96, faminc97, faminc98, spend96, spend97 and spend98. The six variables are listed on both the keep and the retain statements. Following the same logic as in the previous example, we include an array statement to define the variables spend96, spend97 and spend98. We have named the array aspend. We have also included this array in the do-loop. The third added statement sets aspend with year as the index variable equal to spend. As before, we run a proc print on the new data set to ensure that the reshaping was done correctly.
PROC SORT DATA=long2 OUT=longsrt2 ; BY famid ; RUN ; DATA wide2 ; SET longsrt2 ; BY famid ; KEEP famid faminc96-faminc98 spend96-spend98 ; RETAIN faminc96-faminc98 spend96-spend98 ; ARRAY afaminc(96:98) faminc96-faminc98 ; ARRAY aspend(96:98) spend96-spend98 ; IF first.famid THEN DO; DO i = 96 to 98 ; afaminc( i ) = 0 ; aspend( i ) = 0 ; END; END; afaminc( year ) = faminc ; aspend( year ) = spend ; IF last.famid THEN OUTPUT ; RUN; PROC PRINT DATA=wide2; RUN;Obs famid faminc96 faminc97 faminc98 spend96 spend97 spend98 1 1 40000 40500 41000 38000 39000 40000 2 2 45000 45400 45800 42000 43000 44000 3 3 75000 76000 77000 70000 71000 72000
Please note that you can reshape as many variables as you want in a single data step. All you need to do is list the variables on both the keep and retain statements and add an array statement, include the array in the do-loop and add a statement setting the array with the proper index variable equal to the desired variable in the old data set.
Example 3: Two variables that identify the wide record
In this example, we have two variables that, when taken together, identify the wide record. In the data set below, the variables famid and birth together uniquely identify each wide record. (Please note that the data shown below are in long format, so famid and birth do not uniquely identify each record.) We will reshape one variable, ht (for height) from long to wide format.
DATA long3; INPUT famid birth age ht ; CARDS; 1 1 1 2.8 1 1 2 3.4 1 2 1 2.9 1 2 2 3.8 1 3 1 2.2 1 3 2 2.9 2 1 1 2.0 2 1 2 3.2 2 2 1 1.8 2 2 2 2.8 2 3 1 1.9 2 3 2 2.4 3 1 1 2.2 3 1 2 3.3 3 2 1 2.3 3 2 2 3.4 3 3 1 2.1 3 3 2 2.9 ; RUN;
You will notice that the program used to reshape these data is very similar to the program used in example 1. Clearly, the variable names are different. The other important difference is that the data are sorted on both famid and birth in the proc sort, and both famid and birth are given in the by statement.
PROC SORT DATA=long3 OUT=longsrt3 ;
BY famid birth ;
RUN ;
DATA wide3 ;
SET longsrt3 ;
BY famid birth ;
KEEP famid ht1-ht2 ;
RETAIN ht1-ht2 ;
ARRAY aht(1:2) ht1-ht2 ;
IF first.birth THEN
DO;
DO i = 1 to 2 ;
aht( i ) = 0 ;
END;
END;
aht( age ) = ht ;
IF last.birth THEN OUTPUT ;
RUN;
We will run a proc print on the new data set to ensure that the reshaping went as expected.
PROC PRINT DATA=wide3; RUN;
Example 4: A more realistic example
This example is very much like the first example, except that the variable names are different. In this example, we will reshape the variable inc (for income). This is a more realistic example of reshaping data from long to wide in that there are many more records (300, instead of nine) and six time points (instead of three). Despite being a much larger data set, you will notice that the program to reshape it is almost identical to the program used to reshape the tiny data set in example 1.
data long4;
input id year inc ;
cards;
1 90 66483
1 91 69146
1 92 74643
1 93 79783
1 94 81710
1 95 86143
2 90 17510
2 91 17947
2 92 19484
2 93 20979
2 94 21268
2 95 22998
3 90 57947
3 91 62964
3 92 68717
3 93 70957
3 94 75198
3 95 75722
4 90 64831
4 91 71060
4 92 71918
4 93 72514
4 94 73100
4 95 74379
5 90 18904
5 91 19949
5 92 21335
5 93 22237
5 94 23829
5 95 23913
6 90 32057
6 91 34770
6 92 35834
6 93 37387
6 94 40899
6 95 42372
7 90 60551
7 91 64869
7 92 67983
7 93 70498
7 94 71253
7 95 75177
8 90 16553
8 91 18189
8 92 18349
8 93 19815
8 94 21739
8 95 22980
9 90 32611
9 91 33465
9 92 35961
9 93 36416
9 94 37183
9 95 40627
10 90 61379
10 91 66002
10 92 67936
10 93 70513
10 94 74405
10 95 76009
11 90 24065
11 91 24229
11 92 25709
11 93 26121
11 94 26617
11 95 28142
12 90 32975
12 91 36185
12 92 37601
12 93 41336
12 94 43399
12 95 43670
13 90 69548
13 91 71341
13 92 72455
13 93 76552
13 94 80538
13 95 85330
14 90 50274
14 91 53349
14 92 55900
14 93 59375
14 94 61216
14 95 63911
15 90 72011
15 91 73334
15 92 76248
15 93 77724
15 94 78638
15 95 80582
16 90 18911
16 91 20046
16 92 21343
16 93 21630
16 94 22330
16 95 23081
17 90 68841
17 91 75410
17 92 80806
17 93 81327
17 94 81571
17 95 86499
18 90 28099
18 91 30716
18 92 32986
18 93 36097
18 94 39124
18 95 39866
19 90 17302
19 91 18778
19 92 18872
19 93 19884
19 94 20665
19 95 21855
20 90 16291
20 91 16674
20 92 16770
20 93 17182
20 94 17979
20 95 18917
21 90 43244
21 91 46545
21 92 47633
21 93 50744
21 94 54734
21 95 59075
22 90 56393
22 91 59120
22 92 60801
22 93 61404
22 94 63111
22 95 69278
23 90 47347
23 91 49571
23 92 50101
23 93 51345
23 94 56463
23 95 56927
24 90 16076
24 91 17217
24 92 17296
24 93 17900
24 94 18171
24 95 18366
25 90 65906
25 91 69679
25 92 76131
25 93 77676
25 94 81980
25 95 85426
26 90 58586
26 91 61188
26 92 66542
26 93 69267
26 94 71063
26 95 74549
27 90 61674
27 91 66584
27 92 69185
27 93 75193
27 94 78647
27 95 81898
28 90 31673
28 91 31883
28 92 32774
28 93 34485
28 94 36929
28 95 39751
29 90 63412
29 91 67593
29 92 69911
29 93 73092
29 94 80105
29 95 81840
30 90 27684
30 91 28439
30 92 30861
30 93 31406
30 94 32960
30 95 35530
31 90 71873
31 91 76449
31 92 80848
31 93 88691
31 94 94149
31 95 97431
32 90 62177
32 91 63812
32 92 64235
32 93 65703
32 94 69985
32 95 71136
33 90 37684
33 91 38258
33 92 39208
33 93 39489
33 94 39745
33 95 41236
34 90 64013
34 91 66398
34 92 71877
34 93 75610
34 94 76395
34 95 79644
35 90 16011
35 91 16847
35 92 17746
35 93 19123
35 94 19183
35 95 19996
36 90 49215
36 91 52195
36 92 52343
36 93 56365
36 94 58752
36 95 59354
37 90 15774
37 91 16643
37 92 17605
37 93 18781
37 94 18996
37 95 19685
38 90 29106
38 91 31693
38 92 31852
38 93 34505
38 94 35806
38 95 36179
39 90 25147
39 91 26923
39 92 28785
39 93 30987
39 94 34036
39 95 34106
40 90 71978
40 91 79144
40 92 80453
40 93 86580
40 94 95164
40 95 96155
41 90 46166
41 91 47579
41 92 49455
41 93 53849
41 94 56630
41 95 57473
42 90 55810
42 91 59443
42 92 65291
42 93 66065
42 94 69009
42 95 74365
43 90 49642
43 91 50603
43 92 53917
43 93 54858
43 94 58470
43 95 59767
44 90 21348
44 91 22361
44 92 23412
44 93 24038
44 94 24774
44 95 25828
45 90 44361
45 91 48720
45 92 51356
45 93 54927
45 94 56670
45 95 58800
46 90 56509
46 91 60517
46 92 61532
46 93 65077
46 94 69594
46 95 73089
47 90 39097
47 91 40293
47 92 43237
47 93 44809
47 94 48782
47 95 53091
48 90 18685
48 91 19405
48 92 20165
48 93 20316
48 94 22197
48 95 23557
49 90 73103
49 91 76243
49 92 76778
49 93 82734
49 94 86279
49 95 86784
50 90 48129
50 91 49267
50 92 53799
50 93 58768
50 94 63011
50 95 66461
;
run;
PROC SORT DATA=long4 OUT=longsrt4 ;
BY id ;
RUN ;
DATA wide4 ;
SET longsrt4 ;
BY id ;
KEEP id inc90-inc95 ;
RETAIN inc90-inc95 ;
ARRAY ainc(90:95) inc90-inc95 ;
IF first.id THEN
DO;
DO i = 90 to 95 ;
ainc( i ) = 0 ;
END;
END;
ainc( year ) = inc ;
IF last.id THEN OUTPUT ;
RUN;
Because of the size of the data set, we will limit the proc print to the first five observations in the new data set to save space. Please note that the first five observations in the wide data set consist of the first 25 observations in the long data set. We will also run a proc means both data sets as well as our usual proc print to ensure that the reshaping was successful.
PROC PRINT DATA=wide4(obs=5) ; RUN;Obs id inc90 inc91 inc92 inc93 inc94 inc95 1 1 66483 69146 74643 79783 81710 86143 2 2 17510 17947 19484 20979 21268 22998 3 3 57947 62964 68717 70957 75198 75722 4 4 64831 71060 71918 72514 73100 74379 5 5 18904 19949 21335 22237 23829 23913 PROC PRINT DATA=long4(obs=25) ; RUN;Obs id year inc 1 1 90 66483 2 1 91 69146 3 1 92 74643 4 1 93 79783 5 1 94 81710 6 1 95 86143 7 2 90 17510 8 2 91 17947 9 2 92 19484 10 2 93 20979 11 2 94 21268 12 2 95 22998 13 3 90 57947 14 3 91 62964 15 3 92 68717 16 3 93 70957 17 3 94 75198 18 3 95 75722 19 4 90 64831 20 4 91 71060 21 4 92 71918 22 4 93 72514 23 4 94 73100 24 4 95 74379 25 5 90 18904 PROC MEANS DATA=wide4 ; VAR inc90-inc95 ; RUN;The MEANS Procedure Variable N Mean Std Dev Minimum Maximum ------------------------------------------------------------------------------ inc90 50 43899.32 19523.39 15774.00 73103.00 inc91 50 46380.70 20749.43 16643.00 79144.00 inc92 50 48519.58 21720.12 16770.00 80848.00 inc93 50 50842.28 22780.12 17182.00 88691.00 inc94 50 53289.02 23824.01 17979.00 95164.00 inc95 50 55379.00 24592.83 18366.00 97431.00 ------------------------------------------------------------------------------ PROC MEANS DATA=long4 ; CLASS year ; VAR inc ; RUN;The MEANS Procedure Analysis Variable : inc N year Obs N Mean Std Dev Minimum Maximum ------------------------------------------------------------------------------ 90 50 50 43899.32 19523.39 15774.00 73103.00 91 50 50 46380.70 20749.43 16643.00 79144.00 92 50 50 48519.58 21720.12 16770.00 80848.00 93 50 50 50842.28 22780.12 17182.00 88691.00 94 50 50 53289.02 23824.01 17979.00 95164.00 95 50 50 55379.00 24592.83 18366.00 97431.00 ------------------------------------------------------------------------------
Example 5: Reshaping with string (character) variables
This example is similar to example 2 except that we have a third variable to reshape, and this new variable, debt, is a string (i.e., character) variable. There are only two minor differences between adding a numeric variable to our “template” program and adding a string variable. The first is in the array statement for the string variable. Before listing the names of the variables to be created, you need to include a dollar sign ($) to tell SAS to create a string variable and the maximum number of characters allowed. In our example, we are allowing for three characters (since the maximum number of characters in “yes” or “no” is three). The second difference is that when setting the string variable equal to missing in the do-loop, you use open-quote close-quote to indicate a null string instead of setting it equal to a dot. A dot is a missing value only for a numeric variable; a null string is a missing value for a string variable.
data long5;
length debt $ 3;
input famid year faminc spend debt $ ;
cards;
1 96 40000 38000 yes
1 97 40500 39000 yes
1 98 41000 40000 no
2 96 45000 42000 yes
2 97 45400 43000 no
2 98 45800 44000 no
3 96 75000 70000 no
3 97 76000 71000 no
3 98 77000 72000 no
;
run;
PROC SORT DATA=long5 OUT=long5srt ;
BY famid ;
RUN;
DATA wide5 ;
SET long5srt ;
BY famid ;
KEEP famid faminc96-faminc98 spend96-spend98 debt96-debt98 ;
RETAIN faminc96-faminc98 spend96-spend98 debt96-debt98 ;
ARRAY afaminc(96:98) faminc96-faminc98 ;
ARRAY aspend(96:98) spend96-spend98 ;
ARRAY adebt(96:98) $ 3 debt96-debt98 ;
IF first.famid THEN
DO;
DO i = 96 to 98 ;
afaminc( i ) = 0 ;
aspend( i ) = 0 ;
adebt( i ) = " " ;
END;
END;
afaminc( year ) = faminc ;
aspend( year ) = spend ;
adebt( year ) = debt ;
IF last.famid THEN OUTPUT ;
RUN;
We will run a proc print on the new data set to ensure that the reshaping was successful.
PROC PRINT DATA=wide5 ;
RUN;
f f f
a a a s s s
m m m p p p d d d
f i i i e e e e e e
a n n n n n n b b b
O m c c c d d d t t t
b i 9 9 9 9 9 9 9 9 9
s d 6 7 8 6 7 8 6 7 8
1 1 40000 40500 41000 38000 39000 40000 yes yes no
2 2 45000 45400 45800 42000 43000 44000 yes no no
3 3 75000 76000 77000 70000 71000 72000 no no no
