This seminar is designed to help you improve your SAS data management skills via the use of arrays.
Here is the program on which the seminar is based.
In this seminar we will cover the following ten topics:
Recoding variables Applying math computations to many variables simultaneously Computing new variables Collapsing over variables Identify patterns across variables using arrays Reshaping wide to long Understanding the functions first., last. and the retain statement Reshaping long to wide using arrays Comparisons across observations using arrays
First we run the SAS options so that we can get rid of the date, page number, centering and page break in the output.
options nodate nonumber nocenter formdlim="-";
Recoding variables
Inputting the faminc data set.
data faminc; input famid faminc1-faminc12 ; cards; 1 3281 3413 3114 2500 2700 3500 3114 3319 3514 1282 2434 2818 2 4042 3084 3108 3150 3800 3100 1531 2914 3819 4124 4274 4471 3 6015 6123 6113 6100 6100 6200 6186 6132 3123 4231 6039 6215 ; run;
Recoding manually using if-then.
data recode_manual; set faminc; if faminc1 < 3000 then faminc1=.; if faminc2 < 3000 then faminc2=.; if faminc3 < 3000 then faminc3=.; if faminc4 < 3000 then faminc4=.; if faminc5 < 3000 then faminc5=.; if faminc6 < 3000 then faminc6=.; if faminc7 < 3000 then faminc7=.; if faminc8 < 3000 then faminc8=.; if faminc9 < 3000 then faminc9=.; if faminc10 < 3000 then faminc10=.; if faminc11 < 3000 then faminc11=.; if faminc12 < 3000 then faminc12=.; run; /*heading option specifies horizontal (H) column headings/* proc print data=recode_manual noobs heading=H; var famid faminc1-faminc6; run; famid faminc1 faminc2 faminc3 faminc4 faminc5 faminc6 1 3281 3413 3114 . . 3500 2 4042 3084 3108 3150 3800 3100 3 6015 6123 6113 6100 6100 6200
Recoding with arrays using if-then. Note: In the code we use the square brackets around the subscript variable i. The choice between square brackets, curly brackets or parenthesis is completely arbitrary. We have decided to use the square brackets as a visual reminder that i is a subscript and not a part of a mathematical computation.
data recode_array; set faminc; array Afaminc(12) faminc1-faminc12; do i = 1 to 12; if Afaminc[i] < 3000 then Afaminc[i] = . ; end; drop i; run; proc print data=recode_array noobs heading=H; var famid faminc1-faminc6; run; famid faminc1 faminc2 faminc3 faminc4 faminc5 faminc6 1 3281 3413 3114 . . 3500 2 4042 3084 3108 3150 3800 3100 3 6015 6123 6113 6100 6100 6200
Applying the same math computation to many variables at a time
Reverse items on a -3 to +3 scale using array.
data score; input item1 item2 item3 item4; cards; -2 1 -3 0 -1 2 -2 1 0 -1 -3 -1 ; run; data score_array1; set score; array item(4) item1-item4; do i=1 to 4; item[i] = -1*item[i]; end; run; proc print data=score_array1; run; Obs item1 item2 item3 item4 i 1 2 -1 3 0 5 2 1 -2 2 -1 5 3 0 1 3 1 5
Computing new variables
Computing the tax income variables manually.
data tax_manual; set faminc; taxinc1 = faminc1 * .10 ; taxinc2 = faminc2 * .10 ; taxinc3 = faminc3 * .10 ; taxinc4 = faminc4 * .10 ; taxinc5 = faminc5 * .10 ; taxinc6 = faminc6 * .10 ; taxinc7 = faminc7 * .10 ; taxinc8 = faminc8 * .10 ; taxinc9 = faminc9 * .10 ; taxinc10= faminc10 * .10 ; taxinc11= faminc11 * .10 ; taxinc12= faminc12 * .10 ; run; proc print data=tax_manual noobs; var famid faminc1-faminc3 taxinc1-taxinc3; run; famid faminc1 faminc2 faminc3 taxinc1 taxinc2 taxinc3 1 3281 3413 3114 328.1 341.3 311.4 2 4042 3084 3108 404.2 308.4 310.8 3 6015 6123 6113 601.5 612.3 611.
Computing the same tax income variables using an array. We have to use two arrays because the first array, Afaminc, is the array for the existing variables (faminc1–faminc12); the second array, Ataxinc, is created as a “placeholder” where we will store the new variables (taxinc1–taxinc12).
data tax_array; set faminc; array Afaminc(12) faminc1-faminc12; /* existing variables */ array Ataxinc(12) taxinc1-taxinc12; /* new variables */ do month = 1 to 12; Ataxinc[month] = Afaminc[month]*0.1; end; run; proc print data=tax_array noobs; var famid faminc1-faminc3 taxinc1-taxinc3; run; famid faminc1 faminc2 faminc3 taxinc1 taxinc2 taxinc3 1 3281 3413 3114 328.1 341.3 311.4 2 4042 3084 3108 404.2 308.4 310.8 3 6015 6123 6113 601.5 612.3 611.
Collapsing over variables
Creating the total income per quarter variables manually.
data quarter_manual; set faminc; incq1 = faminc1 + faminc2 + faminc3; incq2 = faminc4 + faminc5 + faminc6; incq3 = faminc7 + faminc8 + faminc9; incq4 = faminc10 + faminc11 + faminc12; run; proc print data=quarter_manual; var incq1 faminc1-faminc3; run; Obs incq1 faminc1 faminc2 faminc3 1 9808 3281 3413 3114 2 10234 4042 3084 3108 3 18251 6015 6123 6113
Creating the total income per quarter variables using arrays.
data quarter_array; set faminc; array Afaminc(12) faminc1-faminc12; /*existing vars*/ array Aquarter(4) incq1-incq4; /* new vars */ do q = 1 to 4; Aquarter[q] = Afaminc[3*q-2] + Afaminc[3*q-1] + Afaminc[3*q]; end; run; /* For q=1: Aquarter[1] = Afaminc[3*1-2] + Afaminc[3*1-1] + Afaminc[3*1] = Afaminc[1] + Afaminc[2] + Afaminc[3] For q=2: Aquarter[2] = Afaminc[3*2-2] + Afaminc[3*2-1] + Afaminc[3*2] = Afaminc[4] + Afaminc[5] + Afaminc[6] */ proc print data=quarter_array nobs; var famid incq1 faminc1-faminc3; run; famid incq1 faminc1 faminc2 faminc3 1 9808 3281 3413 3114 2 10234 4042 3084 3108 3 18251 6015 6123 6113
Identify patterns across variables using arrays
In this section the objective is to identify the months in which income was less than half of previous month and store information in the dummy variables lowinc2–lowinc12 looping over months 2-12. Note that month 1 has no previous month! The variable ever indicates if income has ever been less than half of a previous month for any month. Note: The array “size” specified in the parenthesis is usually one number and it is understood by SAS that it is supposed to create an array where the index ranges from one to the number in the parenthesis. But we can specify any range for the index which suits our program. We are only interested in lowincome variables corresponding to months 2-12 and thus we indicate that the range for the index of array Alowinc should be 2 to 12.
data pattern; set faminc; length ever $ 4; array Afaminc(12) faminc1-faminc12; /* existing vars */ array Alowinc(2:12) lowinc2-lowinc12; /* new vars */ do m = 2 to 12; if Afaminc[m] < (Afaminc[m-1] / 2) then Alowinc[m] = 1; else Alowinc[m] = 0; end; sum_low = sum(of lowinc:); /*sums over all vars with lowinc as part of name*/ if sum_low > 0 then ever='Yes'; if sum_low = 0 then ever='No'; drop m sum_low; run; proc print data=pattern noobs heading=H; var famid faminc1-faminc6 lowinc2-lowinc6 ever; run; famid faminc1 faminc2 faminc3 faminc4 faminc5 faminc6 1 3281 3413 3114 2500 2700 3500 2 4042 3084 3108 3150 3800 3100 3 6015 6123 6113 6100 6100 6200 lowinc2 lowinc3 lowinc4 lowinc5 lowinc6 ever 0 0 0 0 0 Yes 0 0 0 0 0 Yes 0 0 0 0 0 No
Reshaping wide to long
Reshaping wide to long creating only one variable–manually. In the problem data set we show what happens when we forget to include the appropriate output statements in the data step.
data wide; input famid faminc96 faminc97 faminc98 ; cards; 1 40000 40500 41000 2 45000 45400 45800 3 75000 76000 77000 ; run; data long_manual; set wide; year=96; faminc=faminc96; output; year=97; faminc=faminc97; output; year=98; faminc=faminc98; output; run; proc print data=long_manual; var famid year faminc; run; Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000 data problem; set wide; year=96; faminc=faminc96; *output; year=97; faminc=faminc97; *output; year=98; faminc=faminc98; output; run; proc print data=problem; var famid year faminc; run; Obs famid year faminc 1 1 98 41000 2 2 98 45800 3 3 98 77000
Reshaping wide to long creating only one variable using arrays.
data long_array; set wide; array Afaminc(96:98) faminc96 - faminc98; do year = 96 to 98; faminc = Afaminc[year]; output; end; drop faminc96-faminc98; run; proc print data=long_array; run; Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000
Reshaping wide to long creating multiple variables (including string variables) using arrays.
data multi_wide; input famid faminc96 faminc97 faminc98 spend96 spend97 spend98 debt96 $ debt97 $ debt98 $ ; cards; 1 40000 40500 41000 38000 39000 40000 yes yes no 2 45000 45400 45800 42000 43000 44000 yes no no 3 75000 76000 77000 70000 71000 72000 no no no ; run; data multi_long; set multi_wide; length debt $ 3; array Afaminc(96:98) faminc96-faminc98; array Aspend(96:98) spend96-spend98; array Adebt(96:98) debt96-debt98; do year = 96 to 98; faminc = Afaminc[year]; spend = Aspend[year]; debt = Adebt[year]; output; end; drop faminc96-faminc98 spend96-spend98; run; proc print data=multi_long; var famid year faminc spend debt; run; Obs famid year faminc spend debt 1 1 96 40000 38000 yes 2 1 97 40500 39000 yes 3 1 98 41000 40000 no 4 2 96 45000 42000 yes 5 2 97 45400 43000 no 6 2 98 45800 44000 no 7 3 96 75000 70000 no 8 3 97 76000 71000 no 9 3 98 77000 72000 no
Reshaping wide to long in presence of character suffixes. In the above example we had numeric suffixes (96, 97 and 98). We can reshape even if we have character suffixes such as old, now and future.
data character; length name_old $ 24 name_now $ 24 name_future $ 24; input id name_old $ name_now $ name_future $ inc_old inc_now inc_future; cards; 1 Ramon Martin Martin_Sheen 23000 50000 700000 2 John Johnnie J_boy 10000 20000 600000 3 Mary_Cathleen Bo Bo_Derek 15000 40000 250000 ; run; proc print data=character; run; data character_array; set character; length name $ 24; array Aname(3) $ name_old name_now name_future; array Aincome(3) inc_old inc_now inc_future; do time = 1 to 3; name = Aname[time]; income = Aincome[time]; output; end; run; proc format; value t_format 1='old' 2='now' 3='future'; run; proc print data=character_array ; format time t_format.; var id time name income; run; Obs id time name income 1 1 old Ramon 23000 2 1 now Martin 50000 3 1 future Martin_Sheen 700000 4 2 old John 10000 5 2 now Johnnie 20000 6 2 future J_boy 600000 7 3 old Mary_Cathleen 15000 8 3 now Bo 40000 9 3 future Bo_Derek 250000
Understanding the functions first., last. and the retain statement
The previous section demonstrated how to reshape data sets from wide to long. Unfortunately, reshaping data sets from long to wide is more complex. In order to better understand how to use arrays to reshape from long to wide we will need to understand how the first. and last. functions work as well as understand how the retain statement works. The following are examples of the retain statement.
We would like to create a new variable called new_meas which contains the same values as measurement but with the missing values filled in. The new_meas variable should have a starting value of 0 and then change values every time measurement has a non-missing value.
data missings; input id measurement; cards; 1 . 1 2 3 . 2 3 3 4 2 . 3 . 1 . 3 5 3 6 ; run; data ex_retain; set missings; retain new_meas 0; if measurement ne . then new_meas = measurement; run; proc print data=ex_retain; run; Obs id measurement new_meas 1 1 . 0 2 1 2 2 3 3 . 2 4 2 3 3 5 3 4 4 6 2 . 4 7 3 . 4 8 1 . 4 9 3 5 5 10 3 6 6
Omitting the retain statement gives us the wrong new_meas, now it is just a copy of measurement.
data ex_retain; set missings; *retain new_meas 0; if measurement ne . then new_meas = measurement; run; proc print data=ex_retain; run; Obs id measurement new_meas 1 1 . . 2 1 2 2 3 3 . . 4 2 3 3 5 3 4 4 6 2 . . 7 3 . . 8 1 . . 9 3 5 5 10 3 6 6
In the next example we want to create a variable called new1 which contains the cumulative sum of the values in the variable measurement. Note that when measurement is missing the sum should remain unchanged.
data ex2_retain; set missings; retain new1 0; if measurement ne . then new1 = new1 + measurement; run; proc print data=ex2_retain; run; Obs id measurement new1 1 1 . 0 2 1 2 2 3 3 . 2 4 2 3 5 5 3 4 9 6 2 . 9 7 3 . 9 8 1 . 9 9 3 5 14 10 3 6 20
Omitting the retain statement gives us the wrong new1.
data ex2_retain; set missings; *retain new1 0; if measurement ne . then new1 = new1 + measurement; run; proc print data=ex2_retain; run; Obs id measurement new1 1 1 . . 2 1 2 . 3 3 . . 4 2 3 . 5 3 4 . 6 2 . . 7 3 . . 8 1 . . 9 3 5 . 10 3 6 .
Looking at the first. and last. functions. In the first example we create indicator variables, first and last. The variable first indicates the first observation for each person as indicated by id; the variable last indicates the last observation for each person. Note: When using first.var_name or last.var_name we must first sort the data set on the variable var_name. Moreover, in the data step we must always precede first.var_name or last.var_name with a by var_name statement.
proc sort data=missings out=sort_miss; by id; run; data ex1; set sort_miss; by id; if first.id then first=1; else first=0; if last.id then last=1; else last=0; run; proc print data=ex1; run; Obs id measurement first last 1 1 . 1 0 2 1 2 0 0 3 1 . 0 1 4 2 3 1 0 5 2 . 0 1 6 3 . 1 0 7 3 4 0 0 8 3 . 0 0 9 3 5 0 0 10 3 6 0 1
Combining the first. function with a retain statement to get a cumulative sum and count.
data kids; length kidname $ 4; input famid kidname birth_order wt; cards; 1 Beth 1 60 1 Barb 3 20 4 Sam 1 100 4 Stu 2 90 1 Bob 2 40 3 Pete 1 60 3 Phil 3 20 2 Andy 1 80 3 Pam 2 40 2 Al 2 50 2 Ann 3 20 ; run;
We will be using first.famid so we must sort the data set on famid.
proc sort data=kids out=sort_kids; by famid; run; data retain1; set sort_kids; retain sumwt count; /*carry over the value from previous obs to next obs*/ by famid; if first.famid then do; /*at 1st obs of each family set sumwt and count = 0*/ sumwt=0; count=0; end; sumwt = sumwt + wt; count = count + 1; meanwt = sumwt/count; run; proc print data=retain1; var famid kidname wt sumwt count meanwt; run; famid kidname wt sumwt count meanwt 1 Beth 60 60 1 60 1 Barb 20 80 2 40 1 Bob 40 120 3 40 2 Andy 80 80 1 80 2 Al 50 130 2 65 2 Ann 20 150 3 50 3 Pete 60 60 1 60 3 Phil 20 80 2 40 3 Pam 40 120 3 40 4 Sam 100 100 1 100 4 Stu 90 190 2 95
By adding an if last.famid statement to the program we output only the last observation per family which shows the final sumwt, count and meanwt for each family. Note: We do not need to resort the data since it is already sorted on famid.
data retain2; set retain1; by famid; if last.famid then output; /*output only the last obs for each family*/ run; proc print data=retain2; var famid sumwt meanwt; run; famid sumwt count meanwt 1 120 3 40 2 150 3 50 3 120 3 40 4 190 2 95
Reshaping long to wide using arrays
We will use the long_array data set created from the wide data set and we will reshape it back to the original wide format.
proc print data=long_array; run; Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000
We will be using first.famid so we must sort the data set on famid.
proc sort data=long_array out=long_sort; by famid; run; data wide_array; set long_sort; by famid; retain faminc96-faminc98; array Afaminc(96:98) faminc96-faminc98; if first.famid then do; do i = 96 to 98; Afaminc[i] = .; /*initializing to missing*/ end; end; Afaminc(year) = faminc; /*looping across values in the variable year*/ *if last.famid then output; /* outputs only the last obs in a family*/ drop year faminc i; run; proc print data=wide_array noobs; run; famid faminc96 faminc97 faminc98 1 40000 . . 1 40000 40500 . 1 40000 40500 41000 2 45000 . . 2 45000 45400 . 2 45000 45400 45800 3 75000 . . 3 75000 76000 . 3 75000 76000 77000 data wide_array; set long_sort; by famid; retain faminc96-faminc98; array Afaminc(96:98) faminc96-faminc98; if first.famid then do; do i = 96 to 98; Afaminc[i] = .; end; end; Afaminc(year) = faminc; /*looping across values in the variable year*/ if last.famid then output; /* outputs only the last obs in a family*/ drop year faminc i; run; proc print data=wide_array noobs; run; famid faminc96 faminc97 faminc98 1 40000 40500 41000 2 45000 45400 45800 3 75000 76000 77000
Comparisons across observations using arrays
A more subtle usage of arrays. One issue in SAS data management is that we cannot do comparisons across observations. One solution to this problem is to transpose the data from long to wide; then we can use the array to do the comparisons very easily. The goal is to compare each observation with the previous and the next observation. If they are the same then flag the observation.
data real_life; input person topicA; cards; 1 0 1 1 3 -1 1 0 2 0 1 1 2 -1 2 -1 3 0 3 1 4 0 1 1 4 1 4 0 2 -1 4 0 4 0 1 -1 ; run;
We need to number the observations within each person. We will be using first.person in the process of doing this, so we must first sort the data on person. Then we will create the count variable which will enumerates the observations within each person.
proc sort data=real_life out=sort_real; by person; run; data count_real; set sort_real; retain count; by person; if first.person then count = 0; count = count + 1; run; proc print data=count_real noobs; run; topic person A count 1 0 1 1 1 2 1 0 3 1 1 4 1 1 5 1 -1 6 2 0 1 2 -1 2 2 -1 3 2 -1 4 3 -1 1 3 0 2 3 1 3 4 0 1 4 1 2 4 0 3 4 0 4 4 0 5
We now convert the data set from long to wide. Note: We are using first.person and last.person but we do not need to resort the data since it is already sorted on person.
data wide_real; set count_real; array AtopicA(6) topicA_1-topicA_6; retain topicA_1-topicA_6; by person; if first.person then do; do i = 1 to 6; AtopicA[i] = .; end; end; AtopicA(count) = topicA; /*looping across values in the variable count*/ if last.person then output; /* outputs only the last obs per person */ run; proc print data=wide_real noobs; var person topicA_1-topicA_6; run; topic topic topic topic topic topic flag person A_1 A_2 A_3 A_4 A_5 A_6 A 1 0 1 0 1 1 -1 0 2 0 -1 -1 -1 . . 1 3 -1 0 1 . . . 0 4 0 1 0 0 0 . 1
Now, let’s find the people who have the same value for 3 observations in a row.
data three; set wide_real; array topic(6) topicA_1-topicA_6; do i = 2 to 5; if topic[i-1] ne . & topic[i] ne . & topic[i+1] ne . & topic[i]=topic[i-1] & topic[i]=topic[i+1] then flagA=1; end; if flagA=. then flagA=0; run; proc print data=three noobs; var person topicA_1-topicA_6 flagA; run; topic topic topic topic topic topic flag person A_1 A_2 A_3 A_4 A_5 A_6 A 1 0 1 0 1 1 -1 0 2 0 -1 -1 -1 . . 1 3 -1 0 1 . . . 0 4 0 1 0 0 0 . 1
This seminar is based on examples from the following SAS Learning Modules shown below.