1. Creating and replacing variables in SAS
We will illustrate creating and replacing variables in SAS using a data file about 26 automobiles with their make, price, mpg, repair record in 1978 (rep78), and whether the car was foreign or domestic (foreign). The program below reads the data and creates a temporary data file called “auto“. Please note that there are two missing values for mpg in the data file (coded as a single period).
We will create two new variables to go along with the existing ones. First, we will create cost so that it gives us the price in thousands of dollars. Then we will create mpgpd which will stand for miles per gallon per thousand dollars. In each case, we just type the variable name, followed by an equal sign, followed by an expression for the value.
DATA auto; INPUT make $ price mpg rep78 foreign; cost = ROUND( price / 1000 ); mpgptd = mpg / cost; DATALINES; AMC 4099 22 3 0 AMC 4749 17 3 0 AMC 3799 22 3 0 Audi 9690 . 5 1 Audi 6295 23 3 1 BMW 9735 25 4 1 Buick 4816 20 3 0 Buick 7827 15 4 0 Buick 5788 18 3 0 Buick 4453 26 3 0 Buick 5189 20 3 0 Buick 10372 16 3 0 Buick 4082 19 3 0 Cad. 11385 14 3 0 Cad. 14500 14 2 0 Cad. 15906 21 3 0 Chev. 3299 29 3 0 Chev. 5705 16 4 0 Chev. 4504 . 3 0 Chev. 5104 22 2 0 Chev. 3667 24 2 0 Chev. 3955 19 3 0 Datsun 6229 23 4 1 Datsun 4589 35 5 1 Datsun 5079 24 4 1 Datsun 8129 21 4 1 ; RUN;PROC PRINT DATA=auto; RUN;
Here is the output of the proc print. You can compare the output to the original data.
1 AMC 4099 22 3 0 4 .005367163 2 AMC 4749 17 3 0 5 .003579701 3 AMC 3799 22 3 0 4 .005790998 4 Audi 9690 . 5 1 10 . 5 Audi 6295 23 3 1 6 .003653693 6 BMW 9735 25 4 1 10 .002568053 7 Buick 4816 20 3 0 5 .004152824 8 Buick 7827 15 4 0 8 .001916443 9 Buick 5788 18 3 0 6 .003109883 10 Buick 4453 26 3 0 4 .005838760 11 Buick 5189 20 3 0 5 .003854307 12 Buick 10372 16 3 0 10 .001542615 13 Buick 4082 19 3 0 4 .004654581 14 Cad. 11385 14 3 0 11 .001229688 15 Cad. 14500 14 2 0 15 .000965517 16 Cad. 15906 21 3 0 16 .001320257 17 Chev. 3299 29 3 0 3 .008790543 18 Chev. 5705 16 4 0 6 .002804557 19 Chev. 4504 . 3 0 5 . 20 Chev. 5104 22 2 0 5 .004310345 21 Chev. 3667 24 2 0 4 .006544860 22 Chev. 3955 19 3 0 4 .004804046 23 Datsun 6229 23 4 1 6 .003692406 24 Datsun 4589 35 5 1 5 .007626934 25 Datsun 5079 24 4 1 5 .004725340 26 Datsun 8129 21 4 1 8 .002583344
Note that cost is just a one or two-digit value. The vehicle that achieves the best mpgptd is the Chev. for observation 17 which gets 9+ miles per gallon for every thousand dollars in price. The Cad. in observation 14 has the worst mpgptd.
Also note that there are two missing values for mpgptd because of the missing values in mpg.
2. Recoding variables in SAS
The variable rep78 is coded 1 through 5 standing for poor, fair, average, good and excellent. We would like to change rep78 so that it has only three values, 1 through 3, standing for below average, average, and above average. We will do this by creating a new variable called repair and recoding the values of rep78 into it.
We will also create a new variable called himpg that is a dummy coding of mpg. All vehicles with better than 20 mpg will be coded 1 and those with 20 or less will be coded 0.
SAS does not have a recode command, so we will use a series of if-then/else commands in a data step to do the job. This data step creates a temporary data file called auto2.
DATA auto2; SET auto; repair = .; IF (rep78=1) or (rep78=2) THEN repair = 1; IF (rep78=3) THEN repair = 2; IF (rep78=4) or (rep78=5) THEN repair = 3; himpg = .; IF (mpg <= 20) THEN himpg = 0; IF (mpg > 20) THEN himpg = 1; RUN;
Note that we begin by setting repair and himpg to missing, just in case we make a mistake in the recoding. Proc freq will show us how the recoding worked.
PROC FREQ DATA=auto2; TABLES repair*rep78 repair*himpg / MISSING; RUN;TABLE OF REPAIR BY REP78 REPAIR REP78 Frequency| Percent | Row Pct | Col Pct | 2| 3| 4| 5| Total ---------+--------+--------+--------+--------+ 1 | 3 | 0 | 0 | 0 | 3 | 11.54 | 0.00 | 0.00 | 0.00 | 11.54 | 100.00 | 0.00 | 0.00 | 0.00 | | 100.00 | 0.00 | 0.00 | 0.00 | ---------+--------+--------+--------+--------+ 2 | 0 | 15 | 0 | 0 | 15 | 0.00 | 57.69 | 0.00 | 0.00 | 57.69 | 0.00 | 100.00 | 0.00 | 0.00 | | 0.00 | 100.00 | 0.00 | 0.00 | ---------+--------+--------+--------+--------+ 3 | 0 | 0 | 6 | 2 | 8 | 0.00 | 0.00 | 23.08 | 7.69 | 30.77 | 0.00 | 0.00 | 75.00 | 25.00 | | 0.00 | 0.00 | 100.00 | 100.00 | ---------+--------+--------+--------+--------+ Total 3 15 6 2 26 11.54 57.69 23.08 7.69 100.00 TABLE OF REPAIR BY HIMPG REPAIR HIMPG Frequency| Percent | Row Pct | Col Pct | 0| 1| Total ---------+--------+--------+ 1 | 1 | 2 | 3 | 3.85 | 7.69 | 11.54 | 33.33 | 66.67 | | 7.69 | 15.38 | ---------+--------+--------+ 2 | 9 | 6 | 15 | 34.62 | 23.08 | 57.69 | 60.00 | 40.00 | | 69.23 | 46.15 | ---------+--------+--------+ 3 | 3 | 5 | 8 | 11.54 | 19.23 | 30.77 | 37.50 | 62.50 | | 23.08 | 38.46 | ---------+--------+--------+ Total 13 13 26 50.00 50.00 100.00
Uh oh, there’s a problem with himpg. There are no missing values for himpg even though there were two missing values of mpg. SAS treats missing values (values coded with a . ) as the smallest number possible (i.e., negative infinity). When we recoded mpg we wrote
IF (mpg <= 20) THEN himpg = 0;
which converted all values of mpg that were 20 or less into a value of 0 for himpg. Since a missing value is also less than 20, the missing values got recoded to 0 as well. (It is unforeseen mistakes like this that make it so important to check every variable that you recode.) Let’s try recoding himpg again, being careful to properly treat missing values like this:
IF (. < mpg <= 20) THEN himpg = 0;
The complete program, with the fixed if statement, is shown below.
DATA auto2; SET auto; repair = .; IF (rep78=1) or (rep78=2) THEN repair = 1; IF (rep78=3) THEN repair = 2; IF (rep78=4) or (rep78=5) THEN repair = 3; himpg = .; IF (. < mpg <= 20) THEN himpg = 0; IF (mpg > 20) THEN himpg = 1; RUN;
Now let’s use proc freq again to check the recoding.
PROC FREQ DATA=auto2; TABLES repair*himpg / MISSING; RUN;TABLE OF REPAIR BY HIMPG REPAIR HIMPG Frequency| Percent | Row Pct | Col Pct | .| 0| 1| Total ---------+--------+--------+--------+ 1 | 0 | 1 | 2 | 3 | 0.00 | 3.85 | 7.69 | 11.54 | 0.00 | 33.33 | 66.67 | | 0.00 | 9.09 | 15.38 | ---------+--------+--------+--------+ 2 | 1 | 8 | 6 | 15 | 3.85 | 30.77 | 23.08 | 57.69 | 6.67 | 53.33 | 40.00 | | 50.00 | 72.73 | 46.15 | ---------+--------+--------+--------+ 3 | 1 | 2 | 5 | 8 | 3.85 | 7.69 | 19.23 | 30.77 | 12.50 | 25.00 | 62.50 | | 50.00 | 18.18 | 38.46 | ---------+--------+--------+--------+ Total 2 11 13 26 7.69 42.31 50.00 100.00
There, that’s better, this time there are two missing values for himpg.
3. Problems to look out for
Watch out for math errors, such as, division by zero and square root of a negative number.
4. Helpful hints and suggestions
- Set values to missing and then recode them.
- Use new variable names when you create or recode variables. Avoid constructions like this, total = total + sub1 + sub2; that reuse the variable name total.
- Use the missing option with proc freq to make sure all missing values are accounted for.
5. For more information
- For more information about missing data in SAS, see SAS Learning Module: Missing data in SAS .