Sometimes, a data set may have "holes" in them, i.e., missing values. Some statistical procedures such as regression analysis will not work as well, or at all on data set with missing values. The observations with missing values have to be either deleted or the missing values have to be substituted in order for a statistical procedure to produce meaningful results. Thus we may want to know the number of missing values and the distribution of those missing values so we have a better idea on what to do with the observations with missing values. Let’s look at the following data set.
LANDVAL IMPROVAL TOTVAL SALEPRIC SALTOAPR 30000 64831 94831 118500 1.25 30000 50765 80765 93900 . 46651 18573 65224 . 1.16 45990 91402 . 184000 1.34 42394 . 40575 168000 1.43 . 3351 51102 169000 1.12 63596 2182 65778 . 1.26 56658 53806 10464 255000 1.21 51428 72451 . . 1.18 93200 . 4321 422000 1.04 76125 78172 54297 290000 1.14 . 61934 16294 237000 1.10 65376 34458 . 286500 1.43 42400 . 57446 . . 40800 92606 33406 168000 1.26
1. Number of missing values vs. number of non missing values
The first thing we are going to look at is what the variables are that have a lot of missing values. We just use the command frequencies with option /format=notable.
FREQUENCIES VARIABLES=landval improval totval salepric saltoapr /FORMAT=NOTABLE /ORDER= ANALYSIS .
So we know the number of missing values in each variable. For instance, variable salepric has four and saltoapr has two missing values. This will help us to identify variables that may have a large number of missing values and perhaps we may want exclude those from analysis.
2. Number of missing values in each observation and its distribution
We can also look at the distribution of missing values across observations. For example we use command count to create a new variable cmiss counting the number of missing values across each observation. Looking at its frequency table we know that there are four observations with no missing values, nine observations with one missing values, one observation with two missing values and one observation with three missing values. If we are willing to substitute one missing value per observation, we will be able to reclaim nine observations back to get a valid data set that is 13/15=87% of the size of the original one.
COUNT cmiss = landval improval totval salepric saltoapr (MISSING). FREQUENCIES VARIABLES=cmiss /ORDER= ANALYSIS .
3. Distribution of missing values
We can also look at the patterns of missing values. We can recode each variable into a dummy variable such that 1 is missing and 0 is nonmissing. Then we use the aggregate command to compute the frequency for each pattern of missing data.
RECODE landval improval totval salepric saltoapr (MISSING=1) (ELSE=0) INTO land1 impr1 totv1 sale1 salt1 . EXECUTE . AGGREGATE /OUTFILE='AGGR.SAV' /BREAK=land1 impr1 totv1 sale1 salt1 /N_BREAK=N.File AGGR.SAV has the following variables and observations.LAND1 IMPR1 TOTV1 SALE1 SALT1 N_BREAK .00 .00 .00 .00 .00 4 .00 .00 .00 .00 1.00 1 .00 .00 .00 1.00 .00 2 .00 .00 1.00 .00 .00 2 .00 .00 1.00 1.00 .00 1 .00 1.00 .00 .00 .00 2 .00 1.00 .00 1.00 1.00 1 1.00 .00 .00 .00 .00 2
Now we see that there are four observations with no missing values, one observation with one missing value in variable saltoapr, two observations with missing value in variable salepric and one observation with missing value in both variable totval and salepric, etc. If we want to delete some observations from the original data set, we have a better idea now on which observation to delete, e.g. the observation corresponding to the 7th row above.