1. Introduction
This module will explore missing data in SAS, focusing on numeric missing data. It will describe how to indicate missing data in your raw data files, how missing data are handled in SAS procedures, and how to handle missing data in a SAS data step. Suppose we did a reaction time study with six subjects, and the subjects reaction time was measured three times. The data file is shown below.
DATA times ; INPUT id trial1 trial2 trial3 ; CARDS ; 1 1.5 1.4 1.6 2 1.5 . 1.9 3 . 2.0 1.6 4 . . 2.2 5 2.1 2.3 2.2 6 1.8 2.0 1.9 ; RUN ; PROC PRINT DATA=times ; RUN ;
You might notice that some of the reaction times are coded using a single dot. For example, for subject 2, the second trial is coded just as a dot. Well, the person measuring response time for that trial did not measure the response time properly so the data for that trial was missing.
OBS ID TRIAL1 TRIAL2 TRIAL3 1 1 1.5 1.4 1.6 2 2 1.5 . 1.9 3 3 . 2.0 1.6 4 4 . . 2.2 5 5 2.1 2.3 2.2 6 6 1.8 2.0 1.9
In your raw data, missing data are generally coded using a single . to indicate a missing value. SAS recognizes a single . as a missing value and knows to interpret it as missing and handles it in special ways. Let’s examine how SAS handles missing data in procedures.
2. How SAS handles missing data in SAS procedures
As a general rule, SAS procedures that perform computations handle missing data by omitting the missing values. (We say procedures that perform computations to indicate that we are not addressing procedures like proc contents). The way that missing values are eliminated is not always the same among SAS procedures, so let’s us look at some examples. First, let’s do a proc means on our data file and see how SAS proc means handles the missing values.
PROC MEANS DATA=times ; VAR trial1 trial2 trial3 ; RUN ;
As you see in the output below, proc means computed the means using 4 observations for trial1 and trial2 and 6 observations for trial3. In short, proc means used all of the valid data and performed the computations on all of the available data.
Variable N Mean Std Dev Minimum Maximum ------------------------------------------------------------------- TRIAL1 4 1.7250000 0.2872281 1.5000000 2.1000000 TRIAL2 4 1.9250000 0.3774917 1.4000000 2.3000000 TRIAL3 6 1.9000000 0.2683282 1.6000000 2.2000000 -------------------------------------------------------------------
As you see below, proc freq likewise performed its computations using just the available data. Note that the percentages are computed based on just the total number of non-missing cases.
PROC FREQ DATA=times ; TABLES trial1 trial2 trial3 ; RUN ; Cumulative Cumulative TRIAL1 Frequency Percent Frequency Percent ---------------------------------------------------- 1.5 2 50.0 2 50.0 1.8 1 25.0 3 75.0 2.1 1 25.0 4 100.0 Frequency Missing = 2 Cumulative Cumulative TRIAL2 Frequency Percent Frequency Percent ---------------------------------------------------- 1.4 1 25.0 1 25.0 2 2 50.0 3 75.0 2.3 1 25.0 4 100.0 Frequency Missing = 2 Cumulative Cumulative TRIAL3 Frequency Percent Frequency Percent ---------------------------------------------------- 1.6 2 33.3 2 33.3 1.9 2 33.3 4 66.7 2.2 2 33.3 6 100.0
It is possible that you might want the percentages to be computed out of the total number of values, and even report the percentage missing right in the table itself. You can request this using the missing option on the tables statement of proc freq as shown below (just for trial1).
PROC FREQ DATA=times ; TABLES trial1 / MISSING ; RUN ;
As you see, now the percentages are computed out of the total number of observations, and the percentage missing are shown right in the table as well.
Cumulative Cumulative TRIAL1 Frequency Percent Frequency Percent ---------------------------------------------------- . 2 33.3 2 33.3 1.5 2 33.3 4 66.7 1.8 1 16.7 5 83.3 2.1 1 16.7 6 100.0
Let’s look at how proc corr handles missing data. We would expect that it would do the computations based on the available data, and omit the missing values. Here is an example program.
PROC CORR DATA=times ; VAR trial1 trial2 trial3 ; RUN ;
The output of this program is shown below. Note how the missing values were excluded. For each pair of variables, proc corr used the number of pairs that had valid data. For the pair formed by trial1 and trial2, there were 3 pairs with valid data. For the pairing of trial1 and trial3 there were 4 valid pairs, and likewise there were 4 valid pairs for trial2 and trial3. Since this used all of the valid pairs of data, this is often called pairwise deletion of missing data.
Correlation Analysis 3 'VAR' Variables: TRIAL1 TRIAL2 TRIAL3 Simple Statistics Variable N Mean Std Dev Sum Minimum Maximum TRIAL1 4 1.725000 0.287228 6.900000 1.500000 2.100000 TRIAL2 4 1.925000 0.377492 7.700000 1.400000 2.300000 TRIAL3 6 1.900000 0.268328 11.400000 1.600000 2.200000 Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / Number of Observations TRIAL1 TRIAL2 TRIAL3 TRIAL1 1.00000 0.98198 0.85280 0.0 0.1210 0.1472 4 3 4 TRIAL2 0.98198 1.00000 0.76089 0.1210 0.0 0.2391 3 4 4 TRIAL3 0.85280 0.76089 1.00000 0.1472 0.2391 0.0 4 4 6
It is possible to ask SAS to only perform the correlations on the observations that had complete data for all of the variables on the var statement. For example, you might want the correlations of the reaction times just for the observations that had non-missing data on all of the trials. This is called listwise deletion of missing data meaning that when any of the variables are missing, the entire observation is omitted from the analysis. You can request listwise deletion within proc corr with the nomiss option as illustrated below.
PROC CORR DATA=times NOMISS ; VAR trial1 trial2 trial3 ; RUN ;
As you see in the results below, the N for all the simple statistics is the same, 3, which corresponds to the number of cases with complete non-missing data for trial1 trial2 and trial3. Since the N is the same for all of the correlations (i.e., 3), the N is not displayed along with the correlations.
Correlation Analysis 3 'VAR' Variables: TRIAL1 TRIAL2 TRIAL3 Simple Statistics Variable N Mean Std Dev Sum Minimum Maximum TRIAL1 3 1.800000 0.300000 5.400000 1.500000 2.100000 TRIAL2 3 1.900000 0.458258 5.700000 1.400000 2.300000 TRIAL3 3 1.900000 0.300000 5.700000 1.600000 2.200000 Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 3 TRIAL1 TRIAL2 TRIAL3 TRIAL1 1.00000 0.98198 1.00000 0.0 0.1210 0.0001 TRIAL2 0.98198 1.00000 0.98198 0.1210 0.0 0.1210 TRIAL3 1.00000 0.98198 1.00000 0.0001 0.1210 0.0
3. Summary of how missing values are handled in SAS procedures
It is important to understand how SAS procedures handle missing data if you have missing data. To know how a procedure handles missing data, you should consult the SAS manual. Here is a brief overview of how some common SAS procedures handle missing data.
- – proc means
For each variable, the number of non-missing values are used- proc freq
By default, missing values are excluded and percentages are based on the number of non-missing values. If you use the missing option on the tables statement, the percentages are based on the total number of observations (non-missing and missing) and the percentage of missing values are reported in the table.- proc corr
By default, correlations are computed based on the number of pairs with non-missing data (pairwise deletion of missing data). The nomiss option can be used on the proc corr statement to request that correlations be computed only for observations that have non-missing data for all variables on the var statement (listwise deletion of missing data).- proc reg
If any of the variables on the model or var statement are missing, they are excluded from the analysis (i.e., listwise deletion of missing data)- proc factor
Missing values are deleted listwise, i.e., observations with missing values on any of the variables in the analysis are omitted from the analysis.- proc glm
The handling of missing values in proc glm can be complex to explain. If you have an analysis with just one variable on the left side of the model statement (just one outcome or dependent variable), observations are eliminated if any of the variables on the model statement are missing. Likewise, if you are performing a repeated measures ANOVA or a MANOVA, then observations are eliminated if any of the variables in the model statement are missing. For other situations, see the SAS/STAT manual about proc glm.- For other procedures, see the SAS manual for information on how missing data are handled.
4. Missing values in assignment statements
It is important to understand how missing values are handled in assignment statements. Consider the example shown below.
DATA times2 ; SET times ; avg = (trial1 + trial2 + trial3) / 3 ; RUN ; PROC PRINT DATA=times2 ; RUN ;
The proc print below illustrates how missing values are handled in assignment statements. The variable avg is based on the variables trial1 trial2 and trial3. If any of those variables were missing, the value for avg was set to missing. This meant that avg was missing for observations 2, 3 and 4.
OBS ID TRIAL1 TRIAL2 TRIAL3 AVG 1 1 1.5 1.4 1.6 1.5 2 2 1.5 . 1.9 . 3 3 . 2.0 1.6 . 4 4 . . 2.2 . 5 5 2.1 2.3 2.2 2.2 6 6 1.8 2.0 1.9 1.9
In fact, SAS included a NOTE: in the Log to let you know about the missing values that were created. The Log entry from this example is shown below.
222 DATA times2 ; 223 SET times ; 224 avg = (trial1 + trial2 + trial3) / 3 ; 225 RUN ; NOTE: Missing values were generated as a result of performing an operation on missing values. Each place is given by: (Number of times) at (Line):(Column). 3 at 224:17 3 at 224:26 3 at 224:36 NOTE: The data set WORK.TIMES2 has 6 observations and 5 variables.
This note tells us that three missing values were created in the program at line 224. This makes sense, we know that 3 missing values were created for avg and that avg is created on line 224.
As a general rule, computations involving missing values yield missing values. For example,
2 + 2 yields 4
2 + . yields .
2 / 2 yields 1
. / 2 yields .
2 * 3 yields 6
2 * . yields .
whenever you add, subtract, multiply, divide, etc., values that involve missing data, the result is missing.
In our reaction time experiment, the average reaction time avg is missing for three out of six cases. We could try just averaging the data for the non-missing trials by using the mean function as shown in the example below.
DATA times3 ; SET times ; avg = MEAN(trial1, trial2, trial3) ; RUN ; PROC PRINT DATA=times3 ; RUN ;
The results below show that avg now contains the average of the non-missing trials.
OBS ID TRIAL1 TRIAL2 TRIAL3 AVG 1 1 1.5 1.4 1.6 1.5 2 2 1.5 . 1.9 1.7 3 3 . 2.0 1.6 1.8 4 4 . . 2.2 2.2 5 5 2.1 2.3 2.2 2.2 6 6 1.8 2.0 1.9 1.9
Had there been a large number of trials, say 50 trials,
then it would be annoying to have to type
avg = mean(trial1, trial2, trial3 …. trial50)
Here is a shortcut you could use in this kind of situation
avg = mean(of trial1-trial50)
Also, if we wanted to get the sum of the times instead of the average, then we could just use the sum function instead of the mean function. The syntax of the sum function is just like the mean function, but it returns the sum of the non-missing values.
Finally, you can use the N function to determine the number of non-missing values in a list of variables, as illustrated below.
DATA times4 ; SET times ; n = N(trial1, trial2, trial3) ; RUN ; PROC PRINT DATA=times4 ; RUN ;
As you see below, observations 1, 5 and 6 had three valid values, observations 2 and 3 had two valid values, and observation 4 had only one valid value.
OBS ID TRIAL1 TRIAL2 TRIAL3 N 1 1 1.5 1.4 1.6 3 2 2 1.5 . 1.9 2 3 3 . 2.0 1.6 2 4 4 . . 2.2 1 5 5 2.1 2.3 2.2 3 6 6 1.8 2.0 1.9 3
You might feel uncomfortable with the variable avg for observation 4 since it is not really an average at all. We can use the variable n to create avg only when there are two or more valid values, but if the number of non-missing values is 1 or less, then make avg to be missing. This is illustrated below.
DATA times5 ; SET times ; n = N(trial1, trial2, trial3) ; IF n >= 2 THEN avg = MEAN(trial1, trial2, trial3) ; IF n <= 1 THEN avg=. ; RUN ; PROC PRINT DATA=times5 ; RUN ;
In the output below, you see that avg now contains the average reaction time for the non-missing values, except for observation 4 where the value is assigned to missing because it had only 1 valid observation.
OBS ID TRIAL1 TRIAL2 TRIAL3 N AVG 1 1 1.5 1.4 1.6 3 1.5 2 2 1.5 . 1.9 2 1.7 3 3 . 2.0 1.6 2 1.8 4 4 . . 2.2 1 . 5 5 2.1 2.3 2.2 3 2.2 6 6 1.8 2.0 1.9 3 1.9
5. Missing values in logical statements
It is important to understand how missing values are handled in logical statements. For example, say that you want to create a 0/1 value for trial1 that is 0 if it is 1.5 or less, and 1 if it is over 1.5. We show this below (incorrectly, as you will see).
DATA times2 ; SET times ; if (trial1 <= 1.5) then trial1a = 0; else trial1a = 1 ; RUN ; proc print data=times2; var id trial1 trial1a; run;
And as you can see in the output, the values for trial1a are wrong when id is 3 or 4, when trial1 is missing. This is because SAS treats a missing value as the smallest possible value (e.g., negative infinity) and that value is less than 1.5, so then the value for trial1a becomes 0.
Obs id trial1 trial1a 1 1 1.5 0 2 2 1.5 0 3 3 . 0 4 4 . 0 5 5 2.1 1 6 6 1.8 1
Instead, we will explicitly exclude missing values to make sure they are treated properly, as shown below.
DATA times2 ; SET times ; trial1a = .; if (trial1 <= 1.5) and (trial1 > .) then trial1a = 0; if (trial1 > 1.5) then trial1a = 1 ; RUN ; proc print data=times2; var id trial1 trial1a; run;
And now we get the results that we wish. The value for trial1a is only 0 when it is less than or equal to 1.5 and it is not missing. The value for trial1a is only 1 when it is over 1.5, as shown below.
Obs id trial1 trial1a 1 1 1.5 0 2 2 1.5 0 3 3 . . 4 4 . . 5 5 2.1 1 6 6 1.8 1
6. Problems to look out for
- When creating or recoding variables that involve missing values, always pay attention to the SAS log to detect when you are creating missing values.
7. For more information
- See Subsetting data in SAS for information about subsetting data with variables that are missing.
- See How do I specify types of missing values? for more information about using different missing data values.