Sometimes you need to reshape your data which is in a long format (like shown below).
long famid year faminc 1 96 40000 1 97 40500 1 98 41000 2 96 45000 2 97 45400 2 98 45800 3 96 75000 3 97 76000 3 98 77000
into a wide format (like shown below).
wide1
famid faminc96 faminc97 faminc98 1 40000 40500 41000 2 45000 45400 45800 3 75000 76000 77000
Below is an example of a SAS data step that reshapes data from a long to a wide format. In order to use this program, you need to be able to provide some key information about your data to be reshaped. Please note that this example (and all of the examples on this page) assume that the values of the time variable are equally spaced (ie, one observation per year or one observation per month). Let’s use the examples of the long and wide1 files above to illustrate.
You need to be able to supply 5 pieces of information:
1. What is the name of the long data file? LONG
2. What is the name we want for the wide data file? We want to call the long data file WIDE1
3. What is the name of the variable which uniquely identifies the wide observations. The wide observations are uniquely identified by the variable FAMID
4. What is the name of the variable in the long data file that will be used for the suffix of the wide variables (the 96 97 and 98 at the end of faminc). The 96, 97 and 98 come from the variable YEAR.
5. What is the name of the variable to be converted from long to wide. The variable to be converted from long to wide is FAMINC
* read in long data ; data long ; INPUT famid year faminc ; CARDS ; 1 96 40000 1 97 40500 1 98 41000 2 96 45000 2 97 45400 2 98 45800 3 96 75000 3 97 76000 3 98 77000 ; RUN ; PROC PRINT DATA=long ; RUN ; * 1. Do basic long to wide ; PROC SORT DATA=long OUT=longsort ; BY famid ; RUN ; DATA wide1 ; SET longsort ; BY famid ; KEEP famid faminc96 -faminc98 ; RETAIN faminc96 - faminc98 ; ARRAY afaminc(96:98) faminc96 - faminc98 ; IF first.famid THEN DO; DO i = 96 to 98 ; afaminc( i ) = 0 ; END; END; afaminc( year ) = faminc ; IF last.famid THEN OUTPUT ; RUN; * check with a basic PROC PRINT ; PROC PRINT DATA=wide1; RUN;
* INCLUDE PART OF THE OUTPUT ;
With only a few modifications, the data step above can reshape two (or more) variables. Please note that you will need to list all of the variables to be reshaped in the KEEP and RETAIN statements, as well as having to do an array and a do-loop for each variable to be reshaped.
* 2. To reshape data from long to wide format with 2 variables ; DATA long2; INPUT famid year faminc spend ; CARDS; 1 96 40000 38000 1 97 40500 39000 1 98 41000 40000 2 96 45000 42000 2 97 45400 43000 2 98 45800 44000 3 96 75000 70000 3 97 76000 71000 3 98 77000 72000 ; RUN ; PROC SORT DATA=long2 OUT=longsrt2 ; BY famid ; RUN ; DATA wide2 ; SET longsrt2 ; BY famid ; KEEP famid faminc96-faminc98 spend96-spend98 ; RETAIN faminc96-faminc98 spend96-spend98 ; ARRAY afaminc(96:98) faminc96-faminc98 ; ARRAY aspend(96:98) spend96-spend98 ; IF first.famid THEN DO; DO i = 96 to 98 ; afaminc( i ) = 0 ; aspend( i ) = 0 ; END; END; afaminc( year ) = faminc ; aspend( year ) = spend ; IF last.famid THEN OUTPUT ; RUN; * check with a PROC PRINT ; PROC PRINT DATA=wide2; RUN;
* INCLUDE PART OF THE OUTPUT ;
Sometimes, there is no variable in the data set that uniquely identifies each observation. Rather, two variables are necessary to uniquely identify each observation. In this situation, you will need to use a data step like the one shown below to reshape the data from long to wide.
* 3. Reshaping data from long to wide format with 2 variables that identify the wide record ; DATA long3; INPUT famid birth age ht ; CARDS; 1 1 1 2.8 1 1 2 3.4 1 2 1 2.9 1 2 2 3.8 1 3 1 2.2 1 3 2 2.9 2 1 1 2.0 2 1 2 3.2 2 2 1 1.8 2 2 2 2.8 2 3 1 1.9 2 3 2 2.4 3 1 1 2.2 3 1 2 3.3 3 2 1 2.3 3 2 2 3.4 3 3 1 2.1 3 3 2 2.9 ; RUN; PROC SORT DATA=long3 OUT=longsrt3 ; BY famid birth ; RUN ; DATA wide3 ; SET longsrt3 ; BY famid birth ; KEEP famid ht1-ht2 ; RETAIN ht1-ht2 ; ARRAY aht(1:2) ht1-ht2 ; IF first.birth THEN DO; DO i = 1 to 2 ; aht( i ) = 0 ; END; END; aht( age ) = ht ; IF last.birth THEN OUTPUT ; RUN; * check with a PROC PRINT ; PROC PRINT DATA=wide3; RUN;
The following example is a more realistic example that uses a data file having 300 records
in long format (50 wide records and 6 time points).
* 4. a more realistic example ; * this is a more realistic data file having ; * 300 long records (50 wide records and 6 time points); data long4; input id year inc ; cards; 1 90 66483 1 91 69146 1 92 74643 1 93 79783 1 94 81710 1 95 86143 2 90 17510 2 91 17947 2 92 19484 2 93 20979 2 94 21268 2 95 22998 3 90 57947 3 91 62964 3 92 68717 3 93 70957 3 94 75198 3 95 75722 4 90 64831 4 91 71060 4 92 71918 4 93 72514 4 94 73100 4 95 74379 5 90 18904 5 91 19949 5 92 21335 5 93 22237 5 94 23829 5 95 23913 6 90 32057 6 91 34770 6 92 35834 6 93 37387 6 94 40899 6 95 42372 7 90 60551 7 91 64869 7 92 67983 7 93 70498 7 94 71253 7 95 75177 8 90 16553 8 91 18189 8 92 18349 8 93 19815 8 94 21739 8 95 22980 9 90 32611 9 91 33465 9 92 35961 9 93 36416 9 94 37183 9 95 40627 10 90 61379 10 91 66002 10 92 67936 10 93 70513 10 94 74405 10 95 76009 11 90 24065 11 91 24229 11 92 25709 11 93 26121 11 94 26617 11 95 28142 12 90 32975 12 91 36185 12 92 37601 12 93 41336 12 94 43399 12 95 43670 13 90 69548 13 91 71341 13 92 72455 13 93 76552 13 94 80538 13 95 85330 14 90 50274 14 91 53349 14 92 55900 14 93 59375 14 94 61216 14 95 63911 15 90 72011 15 91 73334 15 92 76248 15 93 77724 15 94 78638 15 95 80582 16 90 18911 16 91 20046 16 92 21343 16 93 21630 16 94 22330 16 95 23081 17 90 68841 17 91 75410 17 92 80806 17 93 81327 17 94 81571 17 95 86499 18 90 28099 18 91 30716 18 92 32986 18 93 36097 18 94 39124 18 95 39866 19 90 17302 19 91 18778 19 92 18872 19 93 19884 19 94 20665 19 95 21855 20 90 16291 20 91 16674 20 92 16770 20 93 17182 20 94 17979 20 95 18917 21 90 43244 21 91 46545 21 92 47633 21 93 50744 21 94 54734 21 95 59075 22 90 56393 22 91 59120 22 92 60801 22 93 61404 22 94 63111 22 95 69278 23 90 47347 23 91 49571 23 92 50101 23 93 51345 23 94 56463 23 95 56927 24 90 16076 24 91 17217 24 92 17296 24 93 17900 24 94 18171 24 95 18366 25 90 65906 25 91 69679 25 92 76131 25 93 77676 25 94 81980 25 95 85426 26 90 58586 26 91 61188 26 92 66542 26 93 69267 26 94 71063 26 95 74549 27 90 61674 27 91 66584 27 92 69185 27 93 75193 27 94 78647 27 95 81898 28 90 31673 28 91 31883 28 92 32774 28 93 34485 28 94 36929 28 95 39751 29 90 63412 29 91 67593 29 92 69911 29 93 73092 29 94 80105 29 95 81840 30 90 27684 30 91 28439 30 92 30861 30 93 31406 30 94 32960 30 95 35530 31 90 71873 31 91 76449 31 92 80848 31 93 88691 31 94 94149 31 95 97431 32 90 62177 32 91 63812 32 92 64235 32 93 65703 32 94 69985 32 95 71136 33 90 37684 33 91 38258 33 92 39208 33 93 39489 33 94 39745 33 95 41236 34 90 64013 34 91 66398 34 92 71877 34 93 75610 34 94 76395 34 95 79644 35 90 16011 35 91 16847 35 92 17746 35 93 19123 35 94 19183 35 95 19996 36 90 49215 36 91 52195 36 92 52343 36 93 56365 36 94 58752 36 95 59354 37 90 15774 37 91 16643 37 92 17605 37 93 18781 37 94 18996 37 95 19685 38 90 29106 38 91 31693 38 92 31852 38 93 34505 38 94 35806 38 95 36179 39 90 25147 39 91 26923 39 92 28785 39 93 30987 39 94 34036 39 95 34106 40 90 71978 40 91 79144 40 92 80453 40 93 86580 40 94 95164 40 95 96155 41 90 46166 41 91 47579 41 92 49455 41 93 53849 41 94 56630 41 95 57473 42 90 55810 42 91 59443 42 92 65291 42 93 66065 42 94 69009 42 95 74365 43 90 49642 43 91 50603 43 92 53917 43 93 54858 43 94 58470 43 95 59767 44 90 21348 44 91 22361 44 92 23412 44 93 24038 44 94 24774 44 95 25828 45 90 44361 45 91 48720 45 92 51356 45 93 54927 45 94 56670 45 95 58800 46 90 56509 46 91 60517 46 92 61532 46 93 65077 46 94 69594 46 95 73089 47 90 39097 47 91 40293 47 92 43237 47 93 44809 47 94 48782 47 95 53091 48 90 18685 48 91 19405 48 92 20165 48 93 20316 48 94 22197 48 95 23557 49 90 73103 49 91 76243 49 92 76778 49 93 82734 49 94 86279 49 95 86784 50 90 48129 50 91 49267 50 92 53799 50 93 58768 50 94 63011 50 95 66461 ; run; PROC SORT DATA=long4 OUT=longsrt4 ; BY id ; RUN ; DATA wide4 ; SET longsrt4 ; BY id ; KEEP id inc90-inc95 ; RETAIN inc90-inc95 ; ARRAY ainc(90:95) inc90-inc95 ; IF first.id THEN DO; DO i = 90 to 95 ; ainc( i ) = 0 ; END; END; ainc( year ) = inc ; IF last.id THEN OUTPUT ; RUN; * verify with a PROC PRINT ; PROC PRINT DATA=wide4(obs=5) ; RUN; PROC PRINT DATA=long4(obs=25) ; RUN; * verify with PROC MEANS ; PROC MEANS DATA=wide4 ; VAR inc90-inc95 ; RUN; PROC MEANS DATA=long4 ; CLASS year ; VAR inc ; RUN;
The following example shows how to reshape multiple variables, some of which are numeric and other that are character (i.e., string) variables. Note that when you are identifying the character variables in the program, you need to include a dollar sign ($). Also, within the do-loop that associates the elements of the array with the year, you will have opening and closing quotes with nothing but a blank space within the quotes. This is different from how the elements of the array are associated with the year for the numeric variables.
* 5. This example shows how to handle character variables ; data long5; length debt $ 3; input famid year faminc spend debt $ ; cards; 1 96 40000 38000 yes 1 97 40500 39000 yes 1 98 41000 40000 no 2 96 45000 42000 yes 2 97 45400 43000 no 2 98 45800 44000 no 3 96 75000 70000 no 3 97 76000 71000 no 3 98 77000 72000 no ; run; PROC SORT DATA=long5 OUT=long5srt ; BY famid ; RUN; DATA wide5 ; SET long5srt ; BY famid ; KEEP famid faminc96-faminc98 spend96-spend98 debt96-debt98 ; RETAIN faminc96-faminc98 spend96-spend98 debt96-debt98 ; ARRAY afaminc(96:98) faminc96-faminc98 ; ARRAY aspend(96:98) spend96-spend98 ; ARRAY adebt(96:98) $ 3 debt96-debt98 ; IF first.famid THEN DO; DO i = 96 to 98 ; afaminc( i ) = 0 ; aspend( i ) = 0 ; adebt( i ) = " " ; END; END; afaminc( year ) = faminc ; aspend( year ) = spend ; adebt( year ) = debt ; IF last.famid THEN OUTPUT ; RUN; PROC PRINT DATA=wide5 ; RUN;