Many data sets have missing values. However, having lots of missing values can be problematic, as most statistical procedures (e.g., regression) will do a casewise deletion of cases with missing values. This means that the procedure works runs on only the cases with complete data, and that may be a fraction of the cases in the data set. Hence, finding out the number of missing values each variable has can be important. Let’s look at the following data set.
LANDVAL IMPROVAL TOTVAL SALEPRIC SALTOAPR 30000 64831 94831 118500 1.25 30000 50765 80765 93900 . 46651 18573 65224 . 1.16 45990 91402 . 184000 1.34 42394 . 40575 168000 1.43 . 3351 51102 169000 1.12 63596 2182 65778 . 1.26 56658 53806 10464 255000 1.21 51428 72451 . . 1.18 93200 . 4321 422000 1.04 76125 78172 54297 290000 1.14 . 61934 16294 237000 1.10 65376 34458 . 286500 1.43 42400 . 57446 . . 40800 92606 33406 168000 1.26
1. Number of missing values versus number of non-missing values
The first thing to do is find out how many missing values each variable has. We can use the frequencies command with the format=notable subcommand.
FREQUENCIES VARIABLES=landval improval totval salepric saltoapr /FORMAT=NOTABLE /ORDER= ANALYSIS .
![]()
Now we know the number of missing values in each variable: the variable salepric has four and saltoapr has two missing values. This will help us to identify variables that may have a large number of missing values, and perhaps we may want exclude those from analysis.
2. Number of missing values in each observation
We can also look at the distribution of missing values across observations. For example we use the count command to create a new variable called cmiss, which counts the number of missing values across each observation. Looking at its frequency table, we know that there are four observations with no missing values, nine observations with one missing value, one observation with two missing values and one observation with three missing values.
COUNT cmiss = landval improval totval salepric saltoapr (MISSING). FREQUENCIES VARIABLES=cmiss /ORDER= ANALYSIS .
![]()
3. Patterns of missing values
We can also look at the patterns of missing values. We can recode each variable into a dummy variable such that 1 is missing and 0 is non-missing. Then we use the aggregate command to compute the frequency for each pattern of missing data.
RECODE landval improval totval salepric saltoapr (MISSING=1) (ELSE=0) INTO land1 impr1 totv1 sale1 salt1 . AGGREGATE /OUTFILE='AGGR.SAV' /BREAK=land1 impr1 totv1 sale1 salt1 /N_BREAK=N.File AGGR.SAV has the following variables and observations.LAND1 IMPR1 TOTV1 SALE1 SALT1 N_BREAK .00 .00 .00 .00 .00 4 .00 .00 .00 .00 1.00 1 .00 .00 .00 1.00 .00 2 .00 .00 1.00 .00 .00 2 .00 .00 1.00 1.00 .00 1 .00 1.00 .00 .00 .00 2 .00 1.00 .00 1.00 1.00 1 1.00 .00 .00 .00 .00 2
Now we see that there are four observations with no missing values, one observation with one missing value in variable saltoapr, two observations with missing value in variable salepric and one observation with missing values in both variable totval and salepric, etc.