How can I see the number of missing values and patterns of missing values in my data file?

Sometimes, a data set may have "holes" in them, i.e., missing values. Some statistical procedures such as regression analysis will not work as well, or at all on data set with missing values. The observations with missing values have to be either deleted or the missing values have to be substituted in order for a statistical procedure to produce meaningful results. Thus we may want to know the number of missing values and the distribution of those missing values so we have a better idea on what to do with the observations with missing values. Let’s look at the following data set.

 LANDVAL  IMPROVAL    TOTVAL  SALEPRIC SALTOAPR

   30000     64831     94831    118500   1.25
 30000     50765     80765     93900    .
   46651     18573     65224         .   1.16
   45990     91402         .    184000   1.34
   42394         .     40575    168000   1.43
       .      3351     51102    169000   1.12
   63596      2182     65778         .   1.26
   56658     53806     10464    255000   1.21
   51428     72451         .         .   1.18
   93200         .      4321    422000   1.04
   76125     78172     54297    290000   1.14
       .     61934     16294    237000   1.10
   65376     34458         .    286500   1.43
  42400         .     57446         .    .
   40800     92606     33406    168000   1.26

1. Number of missing values vs. number of non missing values

The first thing we are going to look at is what the variables are that have a lot of missing values. We just use the command frequencies with option /format=notable.

FREQUENCIES VARIABLES=landval improval totval salepric saltoapr 
  /FORMAT=NOTABLE
  /ORDER= ANALYSIS .

So we know the number of missing values in each variable. For instance, variable salepric has four and saltoapr has two missing values. This will help us to identify variables that may have a large number of missing values and perhaps we may want exclude those from analysis.

2. Number of missing values in each observation and its distribution

We can also look at the distribution of missing values across observations. For example we use command count to create a new variable cmiss counting the number of missing values across each observation. Looking at its frequency table we know that there are four observations with no missing values, nine observations with one missing values, one observation with two missing values and one observation with three missing values. If we are willing to substitute one missing value per observation, we will be able to reclaim nine observations back to get a valid data set that is 13/15=87% of the size of the original one.

COUNT
  cmiss = landval improval totval salepric saltoapr  (MISSING).

FREQUENCIES VARIABLES=cmiss    
  /ORDER=  ANALYSIS .

3. Distribution of missing values

We can also look at the patterns of missing values. We can recode each variable into a dummy variable such that 1 is missing and 0 is nonmissing. Then we use the aggregate command to compute the frequency for each pattern of missing data.

RECODE
  landval improval totval salepric saltoapr
  (MISSING=1)  (ELSE=0)  INTO  land1  impr1  totv1  sale1  salt1 .
EXECUTE .
AGGREGATE
  /OUTFILE='AGGR.SAV'
  /BREAK=land1 impr1 totv1 sale1 salt1
  /N_BREAK=N.

   File AGGR.SAV has the following variables and observations.

   LAND1    IMPR1    TOTV1    SALE1    SALT1  N_BREAK

     .00      .00      .00      .00      .00       4
     .00      .00      .00      .00     1.00       1
     .00      .00      .00     1.00      .00       2
     .00      .00     1.00      .00      .00       2
     .00      .00     1.00     1.00      .00       1
     .00     1.00      .00      .00      .00       2
     .00     1.00      .00     1.00     1.00       1
    1.00      .00      .00      .00      .00       2

Now we see that there are four observations with no missing values, one observation with one missing value in variable saltoapr, two observations with missing value in variable salepric and one observation with missing value in both variable totval and salepric, etc. If we want to delete some observations from the original data set, we have a better idea now on which observation to delete, e.g. the observation corresponding to the 7th row above.