This seminar is designed to help you improve your SAS data management skills via the use of arrays. According to SAS, "an array is a temporary grouping of SAS variables that are arranged in a particular order and identified by an array-name." In SAS, an array is created with the array statement within a data step. Arrays can be useful in very explicit ways, like creating or modifying a series of variables, and in more subtle ways, like reshaping a dataset or comparing across observations. The application of arrays will be the focus of this seminar.
In this seminar we will cover the following topics:
Recoding variables
Applying the same computation to many variables simultaneously
Computing new variables
Revisiting recoding variables – two variations of the
example on recoding
Identify patterns across variables using arrays
Reshaping wide to long
Understanding the functions first., last. and the retain statement
Reshaping long to wide using arrays
Comparisons across observations using arrays
Here is the program on which the seminar is based.
After completing this seminar, we hope that you will be able to recognize opportunities to use arrays in your data analysis and execute arrays properly. Before we start, let’s run some SAS options so that we can get rid of the date, page number, centering and page break in the output.
options nodate nonumber nocenter formdlim="-";
Recoding variables
Let’s say we have variables on family monthly income for twelve month and the missing values have been coded as (-999).
data faminc; input famid faminc1-faminc12 ; cards; 1 3281 3413 3114 2500 2700 3500 3114 -999 3514 1282 2434 2818 2 4042 3084 3108 3150 -999 3100 1531 2914 3819 4124 4274 4471 3 6015 6123 6113 -999 6100 6200 6186 6132 -999 4231 6039 6215 ; run;
We need to recode the twelve family income variables so that missing will be represented by a dot. Let’s see how we would do this in a data step one variable at a time.
data recode_missing; set faminc; if faminc1 = -999 then faminc1 = .; if faminc2 = -999 then faminc2 = .; if faminc3 = -999 then faminc3 = .; if faminc4 = -999 then faminc4 = .; if faminc5 = -999 then faminc5 = .; if faminc6 = -999 then faminc6 = .; if faminc7 = -999 then faminc7 = .; if faminc8 = -999 then faminc8 = .; if faminc9 = -999 then faminc9 = .; if faminc10 = -999 then faminc10 = .; if faminc11 = -999 then faminc11 = .; if faminc12 = -999 then faminc12 = .; run; proc print data = recode_missing heading= h noobs; run;famid faminc1 faminc2 faminc3 faminc4 faminc5 faminc61 3281 3413 3114 2500 2700 3500 2 4042 3084 3108 3150 . 3100 3 6015 6123 6113 . 6100 6200faminc7 faminc8 faminc9 faminc10 faminc11 faminc123114 . 3514 1282 2434 2818 1531 2914 3819 4124 4274 4471 6186 6132 . 4231 6039 6215
We are using many repetitive steps here, making this is a perfect situation for using arrays as shown below. Note: In the code we use the square brackets around the subscript variable i. The choice between square brackets, curly brackets or parenthesis is completely arbitrary. We have decided to use the square brackets as a visual reminder that i is a subscript and not a part of a mathematical computation.
data recode_missing; set faminc; array inc[12] faminc1 - faminc12; do i = 1 to 12; if inc[i]=-999 then inc[i]=.; end; drop i; run;
Within the array statement, we first named the array, inc, and indicated in brackets how many variables are in the array, then listed which dataset variables our array variables represent, faminc1-faminc12. The dataset variables can be existing variables or can be new variables added to the dataset with this step. After the array statement, we used a do loop to cycle through our array variables. Upon completing this data step, our dataset has the same set of variables it started with. The only new variable created in the data step was i, and i was dropped explicitly. The array variables, inc1-inc12, simply act as stand-ins for the variables they represent and are then discarded.
Other examples of recoding situations where arrays may be useful:
- If measures for some observations are expressed as proportions and others as percents, recoding all such variables into one consistent measure.
- If all negative values are considered invalid, replacing values less than zero with missing.
Applying the same math computation to many variables at a time
Now let’s say we have survey data with responses coded on a -3 to 3 scale and we wish to reverse code several items. We can do this using an array, cycling through the items to be reversed and multiplying the response values by (-1).
data score; input item1 item2 item3 item4; cards; -2 1 -3 0 -1 2 -2 1 0 -1 -3 -1 ; run; data score_array1; set score; array item(4) item1-item4; do i=1 to 4; item(i) = -1*item(i); end; drop i; run; proc print data=score_array1; run; Obs item1 item2 item3 item41 2 -1 3 0 2 1 -2 2 -1 3 0 1 3 1
In this example, we are only reversing four items. Completing this task without arrays would have been perfectly doable. However, if we wished to reverse 40 items, the non-array code would become much longer, going from 4 to 40 lines, and would be annoying to write. The array coding, on the other hand, would remain the same length and would require only minor changes.
Other examples of applying the same math computation to many variables:
- If variables are measured in absolute time and you are interested in relative time, subtracting the same starting time from all variables.
- If you wish to add random noise to a set of variables, adding in random numbers from standard normal distribution.
Computing new variables
Let’s come back to our family income data set. We are interested in creating a new variable for each month containing the amount of income tax based on that month’s income. We can compute the income tax variables manually.
data tax_manual; set recode_missing; taxinc1 = faminc1 * .10 ; taxinc2 = faminc2 * .10 ; taxinc3 = faminc3 * .10 ; taxinc4 = faminc4 * .10 ; taxinc5 = faminc5 * .10 ; taxinc6 = faminc6 * .10 ; taxinc7 = faminc7 * .10 ; taxinc8 = faminc8 * .10 ; taxinc9 = faminc9 * .10 ; taxinc10= faminc10 * .10 ; taxinc11= faminc11 * .10 ; taxinc12= faminc12 * .10 ; run; proc print data=tax_manual noobs heading=h; var famid faminc6-faminc12 taxinc6-taxinc12; run;famid faminc6 faminc7 faminc8 faminc9 faminc10 faminc11 faminc121 3500 3114 . 3514 1282 2434 2818 2 3100 1531 2914 3819 4124 4274 4471 3 6200 6186 6132 . 4231 6039 6215taxinc6 taxinc7 taxinc8 taxinc9 taxinc10 taxinc11 taxinc12350 311.4 . 351.4 128.2 243.4 281.8 310 153.1 291.4 381.9 412.4 427.4 447.1 620 618.6 613.2 . 423.1 603.9 621.5
Now we will compute the same income tax variables using array technique. We have to use two arrays because the first array, inc, is the array for the existing variables (faminc1–faminc12); the second array, tax, is created as a “placeholder” where we will store the new variables (taxinc1–taxinc12).
data tax_array; set recode_missing; array inc(12) faminc1-faminc12; /* existing variables */ array tax(12) taxinc1-taxinc12; /* new variables */ do month = 1 to 12; tax[month] = inc[month]*0.1; end; run; proc print data=tax_array noobs; var famid faminc1-faminc3 taxinc1-taxinc3; run;famid faminc1 faminc2 faminc3 taxinc1 taxinc2 taxinc3 1 3281 3413 3114 328.1 341.3 311.4 2 4042 3084 3108 404.2 308.4 310.8 3 6015 6123 6113 601.5 612.3 611.3
Other examples of computing new variables:
- If you wish to standardize a set of variables and retain both the standardized and unstandardized values.
- If you wish to add variables containing projected values of variables in the dataset.
Revisiting recoding variables – two variations of the example on recoding
SAS has some nice flexibility in terms of defining arrays. Here is the first variation of our recoding variable example:
data test; set faminc; array inc(*) faminc:; do i =1 to dim(inc); if inc(i) = -999 then inc(i)=.; end; drop i; run;
What are the differences between this segment of code and the code for our first example?
- inc(*) instead of inc(12);
- Using the prefix convention "faminc:" instead of faminc1 – faminc12;
- Using dim(inc) to dynamically determine the number of variables in the array;
Here is another variation.
data test; set faminc; array a(*) _numeric_; do i =1 to dim(a); if a(i) = -999 then a(i)=.; end; drop i; run;
Now the only difference between this data step and the previous one is the use of the SAS key word “_numeric_“. This works nicely if we need to make the same change to all numerical variables.
Identify patterns across variables using arrays
In this section the objective is to identify the number of missing values for each row. We are going to create a new variable named nmiss, which will be the number of missing values across variables faminc1 – faminc12
data mspatterns; set recode_missing; array inc(12) faminc1-faminc12; /* existing vars */ nmiss = 0; do i = 1 to 12; if inc(i) = . then nmiss = nmiss + 1; end; run; proc print data=mspatterns noobs heading=H; var famid faminc1-faminc12 nmiss; run;famid faminc1 faminc2 faminc3 faminc4 faminc5 faminc6 faminc7 1 3281 3413 3114 2500 2700 3500 3114 2 4042 3084 3108 3150 . 3100 1531 3 6015 6123 6113 . 6100 6200 6186 faminc8 faminc9 faminc10 faminc11 faminc12 nmiss . 3514 1282 2434 2818 1 2914 3819 4124 4274 4471 1 6132 . 4231 6039 6215 2
Other examples of computing new variables:
- If you have sales data for each month and you wish to know how many months had sales over a given amount.
- If you have survey data and you are interested in how many positive responses each respondent gave.
Some notes on the above uses of arrays
From the uses of arrays already presented, you have perhaps recognized times when you could have used arrays. Anytime you find yourself
- Copying and pasting lines of code in a data step
- Performing the same operation (like recoding missing values or reverse scaling responses) on multiple variables
- Creating a set of new variables from a set of existing variables using a consistent formulation
- Calculating a single variable based on conditions of a set of existing variables
you could likely be using arrays. Compared to repetitive statements, arrays are efficient in their coding and less subject to typos and syntactical errors. Admittedly, arrays can detract from the readability of SAS code, but careful commenting can fix this.
Reshaping wide to long
Reshaping wide to long creating only one variable–manually.
In the problem data set we show what happens when we forget to include the appropriate
output statements in the data step.
data wide; input famid faminc96 faminc97 faminc98 ; cards; 1 40000 40500 41000 2 45000 45400 45800 3 75000 76000 77000 ; run; data long_manual; set wide; year=96; faminc=faminc96; output; year=97; faminc=faminc97; output; year=98; faminc=faminc98; output; run; proc print data=long_manual; var famid year faminc; run; Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000 data problem; set wide; year=96; faminc=faminc96; *output; year=97; faminc=faminc97; *output; year=98; faminc=faminc98; output; run; proc print data=problem; var famid year faminc; run; Obs famid year faminc 1 1 98 41000 2 2 98 45800 3 3 98 77000
Reshaping wide to long creating only one variable using arrays.
data long_array; set wide; array Afaminc(96:98) faminc96 - faminc98; do year = 96 to 98; faminc = Afaminc[year]; output; end; drop faminc96-faminc98; run; proc print data=long_array; run; Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000
Reshaping wide to long creating multiple variables (including string variables) using arrays.
data multi_wide; input famid faminc96 faminc97 faminc98 spend96 spend97 spend98 debt96 $ debt97 $ debt98 $ ; cards; 1 40000 40500 41000 38000 39000 40000 yes yes no 2 45000 45400 45800 42000 43000 44000 yes no no 3 75000 76000 77000 70000 71000 72000 no no no ; run; data multi_long; set multi_wide; length debt $ 3; array Afaminc(96:98) faminc96-faminc98; array Aspend(96:98) spend96-spend98; array Adebt(96:98) debt96-debt98; do year = 96 to 98; faminc = Afaminc[year]; spend = Aspend[year]; debt = Adebt[year]; output; end; drop faminc96-faminc98 spend96-spend98; run; proc print data=multi_long; var famid year faminc spend debt; run; Obs famid year faminc spend debt 1 1 96 40000 38000 yes 2 1 97 40500 39000 yes 3 1 98 41000 40000 no 4 2 96 45000 42000 yes 5 2 97 45400 43000 no 6 2 98 45800 44000 no 7 3 96 75000 70000 no 8 3 97 76000 71000 no 9 3 98 77000 72000 no
Reshaping wide to long in presence of character suffixes. In the above example we had numeric suffixes (96, 97 and 98). We can reshape even if we have character suffixes such as old, now and future.
data character; length name_old name_now name_future $ 24; input id name_old $ name_now $ name_future $ inc_old inc_now inc_future; cards; 1 Ramon Martin Martin_Sheen 23000 50000 700000 2 John Johnnie J_boy 10000 20000 600000 3 Mary_Cathleen Bo Bo_Derek 15000 40000 250000 ; run; proc print data=character; run;Obs name_old name_now name_future id inc_old inc_now future 1 Ramon Martin Martin_Sheen 1 23000 50000 700000 2 John Johnnie J_boy 2 10000 20000 600000 3 Mary_Cathleen Bo Bo_Derek 3 15000 40000 250000 data character_array; set character; length name $ 24; array Aname(3) $ name_old name_now name_future; array Aincome(3) inc_old inc_now inc_future; do time = 1 to 3; name = Aname[time]; income = Aincome[time]; output; end; run; proc format; value t_format 1='old' 2='now' 3='future'; run; proc print data=character_array ; format time t_format.; var id time name income; run; Obs id time name income 1 1 old Ramon 23000 2 1 now Martin 50000 3 1 future Martin_Sheen 700000 4 2 old John 10000 5 2 now Johnnie 20000 6 2 future J_boy 600000 7 3 old Mary_Cathleen 15000 8 3 now Bo 40000 9 3 future Bo_Derek 250000
Understanding the functions first., last. and the retain statement
The previous section demonstrated how to reshape data sets from wide to long. Unfortunately, reshaping data sets from long to wide is more complex. In order to better understand how to use arrays to reshape from long to wide we will need to understand how the first. and last. functions work as well as understand how the retain statement works. The following are examples of the retain statement.
We would like to create a new variable called new_meas which contains the same values as measurement but with the missing values filled in. The new_meas variable should have a starting value of 0 and then change values every time measurement has a non-missing value.
data missings; input id measurement; cards; 1 . 1 2 3 . 2 3 3 4 2 . 3 . 1 . 3 5 3 6 ; run; data ex_retain; set missings; retain new_meas 0; if measurement ne . then new_meas = measurement; run; proc print data=ex_retain; run; Obs id measurement new_meas 1 1 . 0 2 1 2 2 3 3 . 2 4 2 3 3 5 3 4 4 6 2 . 4 7 3 . 4 8 1 . 4 9 3 5 5 10 3 6 6
Omitting the retain statement gives us the wrong new_meas, now it is just a copy of measurement.
data ex_retain; set missings; *retain new_meas 0; if measurement ne . then new_meas = measurement; run; proc print data=ex_retain; run; Obs id measurement new_meas 1 1 . . 2 1 2 2 3 3 . . 4 2 3 3 5 3 4 4 6 2 . . 7 3 . . 8 1 . . 9 3 5 5 10 3 6 6
In the next example we want to create a variable called new1 which contains the cumulative sum of the values in the variable measurement. Note that when measurement is missing the sum should remain unchanged.
data ex2_retain; set missings; retain new1 0; if measurement ne . then new1 = new1 + measurement; run; proc print data=ex2_retain; run; Obs id measurement new1 1 1 . 0 2 1 2 2 3 3 . 2 4 2 3 5 5 3 4 9 6 2 . 9 7 3 . 9 8 1 . 9 9 3 5 14 10 3 6 20
Omitting the retain statement gives us the wrong new1.
data ex2_retain; set missings; *retain new1 0; if measurement ne . then new1 = new1 + measurement; run; proc print data=ex2_retain; run; Obs id measurement new1 1 1 . . 2 1 2 . 3 3 . . 4 2 3 . 5 3 4 . 6 2 . . 7 3 . . 8 1 . . 9 3 5 . 10 3 6 .
Let’s look at the first. and last. variables. In the first example we create indicator variables, first and last. The variable first indicates the first observation for each person as indicated by id; the variable last indicates the last observation for each person. Note: When using first.var_name or last.var_name we must first sort the data set on the variable var_name. Moreover, in the data step we must always precede first.var_name or last.var_name with a by var_name statement.
proc sort data=missings out=sort_miss; by id; run; data ex1; set sort_miss; by id; if first.id then first=1; else first=0; if last.id then last=1; else last=0; run; proc print data=ex1; run; Obs id measurement first last 1 1 . 1 0 2 1 2 0 0 3 1 . 0 1 4 2 3 1 0 5 2 . 0 1 6 3 . 1 0 7 3 4 0 0 8 3 . 0 0 9 3 5 0 0 10 3 6 0 1
We can combine the first. function with a retain statement to get a cumulative sum and count.
data kids; length kidname $ 4; input famid kidname birth_order wt; cards; 1 Beth 1 60 1 Barb 3 20 4 Sam 1 100 4 Stu 2 90 1 Bob 2 40 3 Pete 1 60 3 Phil 3 20 2 Andy 1 80 3 Pam 2 40 2 Al 2 50 2 Ann 3 20 ; run;
We will be using first.famid so we must sort the data set on famid.
proc sort data=kids out=sort_kids; by famid; run; data retain1; set sort_kids; retain sumwt count; /*carry over the value from previous obs to next obs*/ by famid; if first.famid then do; /*at 1st obs of each family set sumwt and count = 0*/ sumwt=0; count=0; end; sumwt = sumwt + wt; count = count + 1; meanwt = sumwt/count; run; proc print data=retain1; var famid kidname wt sumwt count meanwt; run; famid kidname wt sumwt count meanwt 1 Beth 60 60 1 60 1 Barb 20 80 2 40 1 Bob 40 120 3 40 2 Andy 80 80 1 80 2 Al 50 130 2 65 2 Ann 20 150 3 50 3 Pete 60 60 1 60 3 Phil 20 80 2 40 3 Pam 40 120 3 40 4 Sam 100 100 1 100 4 Stu 90 190 2 95
By adding an if last.famid statement to the program we output only
the last observation per family which shows the final sumwt, count and
meanwt for
each family.
Note: We do not need to resort the data since it is already sorted on
famid.
data retain2; set retain1; by famid; if last.famid then output; /*output only the last obs for each family*/ run; proc print data=retain2; var famid sumwt meanwt; run; famid sumwt count meanwt 1 120 3 40 2 150 3 50 3 120 3 40 4 190 2 95
Reshaping long to wide using arrays
We will use the long_array data set created from the wide data set and we will reshape it back to the original wide format.
proc print data=long_array; run; Obs famid year faminc 1 1 96 40000 2 1 97 40500 3 1 98 41000 4 2 96 45000 5 2 97 45400 6 2 98 45800 7 3 96 75000 8 3 97 76000 9 3 98 77000
We will be using first.famid so we must sort the data set on famid.
proc sort data=long_array out=long_sort; by famid; run; data wide_array; set long_sort; by famid; retain faminc96-faminc98; array Afaminc(96:98) faminc96-faminc98; if first.famid then do; do i = 96 to 98; Afaminc[i] = .; /*initializing to missing*/ end; end; Afaminc(year) = faminc; /*looping across values in the variable year*/ *if last.famid then output; /* outputs only the last obs in a family*/ drop year faminc i; run; proc print data=wide_array noobs; run; famid faminc96 faminc97 faminc98 1 40000 . . 1 40000 40500 . 1 40000 40500 41000 2 45000 . . 2 45000 45400 . 2 45000 45400 45800 3 75000 . . 3 75000 76000 . 3 75000 76000 77000 data wide_array; set long_sort; by famid; retain faminc96-faminc98; array Afaminc(96:98) faminc96-faminc98; if first.famid then do; do i = 96 to 98; Afaminc[i] = .; end; end; Afaminc(year) = faminc; /*looping across values in the variable year*/ if last.famid then output; /* outputs only the last obs in a family*/ drop year faminc i; run; proc print data=wide_array noobs; run; famid faminc96 faminc97 faminc98 1 40000 40500 41000 2 45000 45400 45800 3 75000 76000 77000
Comparisons across observations using arrays
One difficulty of data management in SAS is that we cannot easily compare across observations. However, with our knowledge of arrays, we can transpose the data from long to wide; then we can use the array to do the comparisons very easily. This is a rather subtle use of arrays that can be extremely helpful.
We have up to six observations for four people. Our goal is to compare, for each person, a given observation with the previous and the next observations. If a person has three consecutive observations that are the same, we wish to flag the person.
data real_life; input person topicA; cards; 1 0 1 1 3 -1 1 0 2 0 1 1 2 -1 2 -1 3 0 3 1 4 0 1 1 4 1 4 0 2 -1 4 0 4 0 1 -1 ; run;
We need to number the observations within each person. We will be using first.person in the process of doing this, so we must first sort the data on person. Then we will create the count variable which will enumerates the observations within each person.
proc sort data=real_life out=sort_real; by person; run; data count_real; set sort_real; retain count; by person; if first.person then count = 0; count = count + 1; run; proc print data=count_real noobs; run; topic person A count 1 0 1 1 1 2 1 0 3 1 1 4 1 1 5 1 -1 6 2 0 1 2 -1 2 2 -1 3 2 -1 4 3 -1 1 3 0 2 3 1 3 4 0 1 4 1 2 4 0 3 4 0 4 4 0 5
We now convert the data set from long to wide.
Note: We are using first.person and
last.person but we do not need to resort the data since it is already
sorted on person.
data wide_real; set count_real; array AtopicA(6) topicA_1-topicA_6; retain topicA_1-topicA_6; by person; if first.person then do; do i = 1 to 6; AtopicA[i] = .; end; end; AtopicA(count) = topicA; /*looping across values in the variable count*/ if last.person then output; /* outputs only the last obs per person */ run; proc print data=wide_real noobs; var person topicA_1-topicA_6; run; topic topic topic topic topic topic flag person A_1 A_2 A_3 A_4 A_5 A_6 A 1 0 1 0 1 1 -1 0 2 0 -1 -1 -1 . . 1 3 -1 0 1 . . . 0 4 0 1 0 0 0 . 1
Now, let’s find the people who have the same value for 3 observations in a row.
data three; set wide_real; array topic(6) topicA_1-topicA_6; do i = 2 to 5; if topic[i-1] ne . & topic[i] ne . & topic[i+1] ne . & topic[i]=topic[i-1] & topic[i]=topic[i+1] then flagA=1; end; if flagA=. then flagA=0; run; proc print data=three noobs; var person topicA_1-topicA_6 flagA; run; topic topic topic topic topic topic flag person A_1 A_2 A_3 A_4 A_5 A_6 A 1 0 1 0 1 1 -1 0 2 0 -1 -1 -1 . . 1 3 -1 0 1 . . . 0 4 0 1 0 0 0 . 1
This seminar is based on examples from the following SAS Learning Modules shown below.
- SAS Documentation on Array Processing
- Using Arrays to Work Across Variables
- Reshaping Data from Wide to Long using the SAS Data Step
- Collapsing Across Observations in SAS using the Data Step I
- Collapsing Across Observations in SAS using the Data Step II
- Reshaping Data from Long to Wide using the SAS Data Step