We would like to thank Mike Zdeb of University at Albany School of Public Health who emailed us with his suggestions/examples to improve this page.
Sometimes, a data set may have "holes" in them, i.e., missing values and we may want to know the number of missing values of all the variables and the distribution of the missing values. We will use the following data set as our example data set.
data test; input landval improval totval salepric saltoapr city $6. season $8.; datalines; 30000 64831 94831 118500 1.25 A spring 30000 50765 80765 93900 . winter 46651 18573 65224 . 1.16 B 45990 91402 . 184000 1.34 C winter 42394 . 40575 168000 1.43 . 3351 51102 169000 1.12 D winter 63596 2182 65778 . 1.26 E spring 56658 53806 10464 255000 1.21 51428 72451 . . 1.18 F spring 93200 . 4321 422000 1.04 76125 78172 54297 290000 1.14 G winter . 61934 16294 237000 1.10 H spring 65376 34458 . 286500 1.43 winter 42400 . 57446 . . K 40800 92606 33406 168000 1.26 S ; run;
1. Number of missing values vs. number of non missing values in each variable
The first thing we are going to look at the variables that have a lot of missing values. For numerical variables, we use proc means with the options n and nmiss.
proc means data = test n nmiss; var _numeric_; run;
Variable N N Miss landval 13 2 improval 12 3 totval 12 3 salepric 11 4 saltoapr 13 2
For character variables, we can use proc freq to display the number of missing values in each variable.
proc freq data = test; tables city season ; run;
city Frequency Percent Cumulative Cumulative Frequency Percent A 1 10 1 10 B 1 10 2 20 C 1 10 3 30 D 1 10 4 40 E 1 10 5 50 F 1 10 6 60 G 1 10 7 70 H 1 10 8 80 K 1 10 9 90 S 1 10 10 100 Frequency Missing = 5 season Frequency Percent Cumulative Cumulative Frequency Percent spring 4 44.44 4 44.44 winter 5 55.56 9 100 Frequency Missing = 6
2. Number of missing values in each observation
We can also look at the number of missing values in each observation. For example, we can use SAS function cmiss to store the number of missing values from both numeric and character variables in each observation.
data test1; set test; miss_n = cmiss(of landval -- season); run; proc print data = test1; run;
Obs landval improval totval salepric saltoapr city season miss_n 1 30000 64831 94831 118500 1.25 A spring 0 2 30000 50765 80765 93900 . winter 2 3 46651 18573 65224 . 1.16 B 2 4 45990 91402 . 184000 1.34 C winter 1 5 42394 . 40575 168000 1.43 3 6 . 3351 51102 169000 1.12 D winter 1 7 63596 2182 65778 . 1.26 E spring 1 8 56658 53806 10464 255000 1.21 2 9 51428 72451 . . 1.18 F spring 2 10 93200 . 4321 422000 1.04 3 11 76125 78172 54297 290000 1.14 G winter 0 12 . 61934 16294 237000 1.1 H spring 1 13 65376 34458 . 286500 1.43 winter 2 14 42400 . 57446 . . K 4 15 40800 92606 33406 168000 1.26 S 1
3. Distribution of missing values
We can also look at the patterns of missing values. By default the MI procedure will output missing data patterns for the variables in the specified datasets. If no var statement is specified Proc MI will output a table for the all the variables in a dataset. The ods select statement tells SAS to only output the "Missing Data Patterns" table.
proc mi data=test; ods select misspattern; run;
Missing Data Patterns Group landval improval totval salepric saltoapr Freq Percent Group Means landval improval totval salepric saltoapr 1 X X X X X 4 26.67 50896 72354 48250 207875 1.215 2 X X X X . 1 6.67 30000 50765 80765 93900 . 3 X X X . X 2 13.33 55124 10378 65501 . 1.21 4 X X . X X 2 13.33 55683 62930 . 235250 1.385 5 X X . . X 1 6.67 51428 72451 . . 1.18 6 X . X X X 2 13.33 67797 . 22448 295000 1.235 7 X . X . . 1 6.67 42400 . 57446 . . 8 . X X X X 2 13.33 . 32643 33698 203000 1.11
You will notice that this report only contains information for numeric variables not character. Another approach to achieve the same output could be via making use of formats, which then allows for character variables to be included. The order of the patterns is different but the information is still the same.
proc format; value nm . = '.' other = 'X'; value $ch ' ' = '.'other = 'X'; run;proc freq data=test; table landval*improval*totval*salepric*saltoapr*city*season / list missing nocum; format _numeric_ nm. _character_ $ch.; run;
landval improval totval salepric saltoapr city season Frequency Percent . X X X X X X 2 13.33 X . X . . X . 1 6.67 X . X X X . . 2 13.33 X X . . X X X 1 6.67 X X . X X . X 1 6.67 X X . X X X X 1 6.67 X X X . X X . 1 6.67 X X X . X X X 1 6.67 X X X X . . X 1 6.67 X X X X X . . 1 6.67 X X X X X X . 1 6.67 X X X X X X X 2 13.33