When we need to create many variables that are transformations of existing variables, or to perforam a transforamation not easily done or available in a data step, we can use proc expand to create the transformed variables. This procedure is one of SAS’s time series procedures, but you don’t need time series data to use this procedure. In the examples below, we illustrate the use of proc expand to create new variables. With proc expand, it is easy to operate across observations — to create variables that are within-group transformations; this is much more difficult to do in a data step. We will use the hsb2 dataset in all of the examples on this page.
In our first example, we will take the negative of the variable socst. If you were creating this variable in a data step, you would multiply socst by -1. We will also create a variable called number_write, which will be the variable write with 10 added to each value. In a data step, you would simply add 10 to the variable write.
proc expand data = "d:datahsb2" out = test1; convert socst = neg_socst / transformout = (neg); convert write = number_write / transformout = (+ 10); run;proc print data = test1 (obs = 15); var socst neg_socst write number_write; run;neg_ number_ Obs socst socst write write 1 41 -41 44 54 2 56 -56 65 75 3 51 -51 50 60 4 31 -31 40 50 5 41 -41 41 51 6 48 -48 44 54 7 36 -36 31 41 8 51 -51 46 56 9 71 -71 65 75 10 41 -41 44 54 11 56 -56 65 75 12 51 -51 54 64 13 51 -51 49 59 14 51 -51 47 57 15 56 -56 57 67
In the examples above, a function wasn’t really necessary to create the new variables; they could have easily been created in a data step. Let’s create some other variables that might not be so easy to create without a function. First, we will create med_read, which is the centered moving median of the variable read. Next, the variable sum_science is the cumulative sum of the variable science. For both of these, our window will be 5. The variable ratio_sci2 is ratio of the current value to the lagged value. In our example, we will use a lag of 2. Finally, pct_sci is the percent difference between the current value and the lagged value. Because we did not specify a value in the parentheses with the pctdif option, the lag will be 1.
proc expand data = hsb2 out = test2; convert read = med_read / transformout = (cmovmed 5); convert science = sum_science / transformout = (cusum 5); convert science = ratio_sci2 / transformout = (ratio 2); convert science = pct_sci / transformout = (pctdif); run;proc print data = test2 (obs = 15); var read med_read science sum_science ratio_sci2 pct_sci; run;sum_ ratio_ Obs read med_read science science sci2 pct_sci 1 34 44.0 39 39 . . 2 63 45.5 63 63 . 61.5385 3 44 47.0 39 39 1.00000 -38.0952 4 47 47.0 45 45 0.71429 15.3846 5 47 47.0 40 40 1.02564 -11.1111 6 39 47.0 44 83 0.97778 10.0000 7 47 47.0 36 99 0.90000 -18.1818 8 28 47.0 44 83 1.00000 22.2222 9 65 52.0 58 103 1.61111 31.8182 10 52 57.0 55 95 1.25000 -5.1724 11 57 57.0 54 137 0.93103 -1.8182 12 60 52.0 50 149 0.90909 -7.4074 13 44 45.0 35 118 0.64815 -30.0000 14 41 44.0 39 142 0.78000 11.4286 15 45 44.0 31 126 0.88571 -20.5128
Now, let’s use a few functions within the levels of a categorical variable. We have a variable called race in our dataset that has four levels. First, we will create the variable socst_cusum, which will be the cumulative sum of the variable socst over five observations. As you can see in the proc print output, the first five observations for socst and socst_cusum have the same values. The new variable socst_sum is simply the sum of the values of the variable socst. The new variable rev_math is the values of the variable math in reverse order within each level of the variable race.
proc sort data = hsb2 out = hsb2_sorted; by race; run; proc expand data = hsb2_sorted out = test3; by race; convert socst = socst_cusum / transformout = (cusum 5); convert socst = socst_sum / transformout = (sum); convert math = rev_math /transformout = (reverse); run; proc print data = test3 (obs = 40); var race socst socst_cusum socst_sum math rev_math; run;socst_ socst_ Obs race socst cusum sum math rev_math 1 1 56 56 56 48 44 2 1 61 61 117 57 44 3 1 46 46 163 39 45 4 1 66 66 229 57 45 5 1 61 61 290 49 41 6 1 51 107 341 50 40 7 1 46 107 387 61 46 8 1 51 97 438 59 54 9 1 61 127 499 39 33 10 1 31 92 530 43 49 11 1 48 155 578 52 48 12 1 51 158 629 43 52 13 1 51 148 680 52 43 14 1 41 168 721 48 52 15 1 36 128 757 49 43 16 1 41 196 798 33 39 17 1 56 214 854 54 59 18 1 41 189 895 46 61 19 1 41 209 936 40 50 20 1 51 179 987 41 49 21 1 36 232 1023 45 57 22 1 46 260 1069 45 39 23 1 36 225 1105 44 57 24 1 42 251 1147 44 48 25 2 56 56 56 66 42 26 2 51 51 107 62 52 27 2 71 71 178 64 42 28 2 56 56 234 61 66 29 2 41 41 275 49 54 30 2 56 112 331 72 72 31 2 41 92 372 54 49 32 2 46 117 418 66 61 33 2 36 92 454 42 64 34 2 56 97 510 52 62 35 2 51 163 561 42 66 36 3 66 66 66 67 41 37 3 56 56 122 45 42 38 3 66 66 188 53 50 39 3 61 61 249 42 49 40 3 41 41 290 43 44
Let’s look at a few more useful functions, including lag, lead and reciprocal.
proc expand data = hsb2 out = test4; convert math = lag_math / transformout = (lag); convert math = lead_math /transformout = (lead); convert science = rec_science / transformout = (reciprocal); run;proc print data = test4 (obs = 15); var math lag_math lead_math science rec_science; run;lead_ rec_ Obs math lag_math math science science 1 40 . 48 39 0.025641 2 48 40 41 63 0.015873 3 41 48 43 39 0.025641 4 43 41 46 45 0.022222 5 46 43 52 40 0.025000 6 52 46 44 44 0.022727 7 44 52 43 36 0.027778 8 43 44 64 44 0.022727 9 64 43 49 58 0.017241 10 49 64 72 55 0.018182 11 72 49 50 54 0.018519 12 50 72 44 50 0.020000 13 44 50 40 35 0.028571 14 40 44 50 39 0.025641 15 50 40 43 31 0.032258
Another useful function in proc expand is the scale function. This function allows the user to create a new variable that is rescaled to the user-specified minimum and maximum. To create the new variable read_write, we set the scale of the variable write to the minimum and maximum of the variable read. To create read_25_80 and write_25_80, we set the minimum and maximum of both read and write to be 25 and 80, respectively.
proc expand data = hsb2_sorted out = test5; convert write = read_write / transformout = (scale (28 76)); convert read = read_25_80 / transformout = (scale (25 80)); convert write = write_25_80 / transformout = (scale (25 80)); run; proc print data = test5 (obs = 15); var read write read_write read_25_80 write_25_80; run;read_ read_ write_ Obs read write write 25_80 25_80 1 63 65 73.3333 65.1042 76.9444 2 60 52 56.0000 61.6667 57.0833 3 42 39 38.6667 41.0417 37.2222 4 73 61 68.0000 76.5625 70.8333 5 47 54 58.6667 46.7708 60.1389 6 60 54 58.6667 61.6667 60.1389 7 44 44 45.3333 43.3333 44.8611 8 57 54 58.6667 58.2292 60.1389 9 47 46 48.0000 46.7708 47.9167 10 47 40 40.0000 46.7708 38.7500 11 39 44 45.3333 37.6042 44.8611 12 28 46 48.0000 25.0000 47.9167 13 48 49 52.0000 47.9167 52.5000 14 47 57 62.6667 46.7708 64.7222 15 50 33 30.6667 50.2083 28.0556proc corr data = test5; var read write read_write read_25_80 write_25_80; run;Simple Statistics Variable N Mean Std Dev Sum Minimum Maximum Label read 200 52.23000 10.25294 10446 28.00000 76.00000 reading score write 200 52.77500 9.47859 10555 31.00000 67.00000 writing score read_write 200 57.03333 12.63811 11407 28.00000 76.00000 writing score read_25_80 200 52.76354 11.74816 10553 25.00000 80.00000 reading score write_25_80 200 58.26736 14.48117 11653 25.00000 80.00000 writing score Pearson Correlation Coefficients, N = 200 Prob > |r| under H0: Rho=0 read_ read_ write_ read write write 25_80 25_80 read 1.00000 0.59678 0.59678 1.00000 0.59678 reading score <.0001 <.0001 <.0001 <.0001 write 0.59678 1.00000 1.00000 0.59678 1.00000 writing score <.0001 <.0001 <.0001 <.0001 read_write 0.59678 1.00000 1.00000 0.59678 1.00000 writing score <.0001 <.0001 <.0001 <.0001 read_25_80 1.00000 0.59678 0.59678 1.00000 0.59678 reading score <.0001 <.0001 <.0001 <.0001 write_25_80 0.59678 1.00000 1.00000 0.59678 1.00000 writing score <.0001 <.0001 <.0001 <.0001
For a complete list of the functions available for creating variables on the convert statement of proc expand, please see the SAS documentation .