How can I see the number of missing values and patterns of missing values in my data file?

Sometimes, a data set may have “holes” in it, that is, missing values. Some statistical procedures such as regression analysis will not work as well, or at all, on a data set with missing values. The observations with missing values have to be deleted or the missing values have to be substituted in order for a statistical procedure to produce meaningful results. Most statistical programs (including SAS, SPSS, and Stata) will automatically remove such cases from any analysis you run (without deleting the cases from the dataset). This is why the “n” often varies from analysis to analysis, even if the dataset is the same. Different variables have different amounts of missing data and hence, changing the variables in a model changes the number of cases with complete data on all the variables in the model. Because the software drops cases with missing values for us, it is very easy to “forget” about missing data entirely. However, the presence of missing data can influence our results, especially when a dataset or even a single variable, has a high percentage of values missing. Thus it is always a good idea to check a dataset for missing data, and to think about how the missing data may influence our analyses. This page shows a few methods of looking at missing values in a dataset, this information can be used to make better informed decisions about how to handle the missing values.

Before we begin, we need some data with missing values, the code below inputs a small dataset into Stata, and then displays that data. In a small dataset, like the one below, it is very easy to look at the raw data and see where values are missing. However, when datasets are large, we need a more systematic way to examine our dataset for missing values. Below we show you some ways to do that, using the data below as an example.

clear
input landval  improval    totval  salepric saltoapr
30000     64831     94831    118500   1.25
30000     50765     80765     93900    .
46651     18573     65224         .   1.16
45990     91402         .    184000   1.34
42394         .     40575    168000   1.43
.      3351     51102    169000   1.12
63596      2182     65778         .   1.26
56658     53806     10464    255000   1.21
51428     72451         .         .   1.18
93200         .      4321    422000   1.04
76125     78172     54297    290000   1.14
.     61934     16294    237000   1.10
65376     34458         .    286500   1.43
42400         .     57446         .    .
40800     92606     33406    168000   1.26
end

list

     +---------------------------------------------------+
     | landval   improval   totval   salepric   saltoapr |
     |---------------------------------------------------|
  1. |   30000      64831    94831     118500       1.25 |
  2. |   30000      50765    80765      93900          . |
  3. |   46651      18573    65224          .       1.16 |
  4. |   45990      91402        .     184000       1.34 |
  5. |   42394          .    40575     168000       1.43 |
     |---------------------------------------------------|
  6. |       .       3351    51102     169000       1.12 |
  7. |   63596       2182    65778          .       1.26 |
  8. |   56658      53806    10464     255000       1.21 |
  9. |   51428      72451        .          .       1.18 |
 10. |   93200          .     4321     422000       1.04 |
     |---------------------------------------------------|
 11. |   76125      78172    54297     290000       1.14 |
 12. |       .      61934    16294     237000        1.1 |
 13. |   65376      34458        .     286500       1.43 |
 14. |   42400          .    57446          .          . |
 15. |   40800      92606    33406     168000       1.26 |
     +---------------------------------------------------+

1. Number of missing values vs. number of non missing values

The first thing we are going to do is determine which variables have a lot of missing values. We have created a small Stata program called mdesc that counts the number of missing values in both numeric and character variables. You can download mdesc from within Stata by typing search mdesc (see How can I use the search command to search for programs and get additional help? for more information about using search).

Then you can run mdesc for one or more variables as illustrated below.

mdesc

     Variable      Missing      Total     Missing/Total
    ------------------------------------------------------------
     landval           2          15        .133333
    improval           3          15             .2
      totval           3          15             .2
    salepric           4          15        .266667
    saltoapr           2          15        .133333

Now we know the number of missing values in each variable. For instance, variable salepric has four missing values and saltoapr has two missing values.

2. Obtaining the number of missing values per observation

We can also look at the distribution of missing values across observations. The code below creates a variable called nmis that gives the number of missing values for each observation. The function rmiss2() used here is an extension to the egen function rmiss(). It counts the number of missing values in the varlist. rmiss2() accepts both string and numeric variables. (Stata’s rmiss() only accepts numeric variables.) You can download rmiss2() over the internet from within Stata by typing search rmiss2 (see How can I use the search command to search for programs and get additional help? for more information about using search).

egen nmis=rmiss2(landval improval totval salepric saltoapr)

Below we tabulate the variable we just created. Looking at the frequency table we know that there are four observations with no missing values, nine observations with one missing values, one observation with two missing values and one observation with three missing values.

tab nmis

       nmis |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          4       26.67       26.67
          1 |          9       60.00       86.67
          2 |          1        6.67       93.33
          3 |          1        6.67      100.00
------------+-----------------------------------
      Total |         15      100.00

3. Patterns of missing values

We can also look at the patterns of missing values. You can download mvpatterns over the internet from within Stata by typing search mvpatterns (see How can I use the search command to search for programs and get additional help? for more information about using search). The command mvpatterns produces output for all variables in the dataset, for missing data patterns across a subset of variables, a variable list can be included, for example, mvpatterns landval improval totval.

The output produced by mvpatterns is shown below. The first table lists the variables, their storage type (type), the number of observations (obs), the number of missing values (mv), and the variable label if the variables has one. The second table contains the information on the pattern of missing values. The first block of columns in the output shows the patterns of missing data. Within the block, each variable is represented by a column, a “+” indicates that values of that variable are present in a given missing data pattern, a “.” indicates that they are missing. The columns follow the same order as the variable list in the first table, so that the first column in the output below represents landval, the second improval, and so on. The missing data patterns are listed in descending frequency, here the most common missing data pattern is complete data (“+++++”). The table also shows the number of missing values in that pattern (_mv), and the number of cases with that missing data pattern (_freq). Based on the information in the second table we know that there are four observations with no missing values, two cases missing on just the variable salepric, and one observation with missing values on improval, salepric and saltoapr.

mvpatterns
  
Variable     | type     obs   mv   variable label
-------------+-----------------------------------
landval      | float     13    2   
improval     | float     12    3   
totval       | float     12    3   
salepric     | float     11    4   
saltoapr     | float     13    2   
-------------------------------------------------

Patterns of missing values

  +------------------------+
  | _pattern   _mv   _freq |
  |------------------------|
  |    +++++     0       4 |
  |    +++.+     1       2 |
  |    ++.++     1       2 |
  |    +.+++     1       2 |
  |    .++++     1       2 |
  |------------------------|
  |    ++++.     1       1 |
  |    ++..+     2       1 |
  |    +.+..     3       1 |
  +------------------------+

4. When all the variables of interest are numeric

The examples above all work regardless of whether the variables of interest (i.e. the variables whose missing data patterns you want to examine) are numeric or string. When all of the variables you wish to check for missing values are numeric we can use a program called misschk to simplify the steps of examining the missing data in our dataset. (Note: numeric variables include those with value labels that are strings, as long as the actual values of the variables are stored as numbers.) You can download misschk from within Stata by typing search misschk (see How can I use the search command to search for programs and get additional help? for more information about using search).

Below is the command for misschk. We have listed all five of the variables in our dataset in the variable list after the misschk command. However, we could have just left the list of variables blank (i.e. used only misschk , gen(miss) instead), if we had, misschk would have run using all the variables in our dataset. The variable list is only necessary if we want to run misschk on only some of the variables in our dataset. The gen(miss) option tells misschk that we want it to create two new variables, both of which start with “miss”. These two variables will be named misspattern and missnumber. The variable misspattern indicates which of the missing data patterns each case follows. The variable missnumber indicates the number of missing values for each case.

misschk landval improval totval salepric saltoapr, gen(miss)

The output for misschk consists of three tables. The first table lists the number of missing values, as well as percent missing for each variable, this is similar to the table produced by mdesc in part 1 above. This table also contains a column labeled “#” which assigns each variable a number that is used to identify the variable later on in the output. The second table shows the distribution of missing values. The pattern of missingness is described using the variable numbers from the first table, and underscores (“_”). The numbers indicate which variables are missing in that pattern, the underscores represent non-missing observations. For example, from the second table we see that two cases have missing values on variable 1 (landval), but complete data on all other variables, and that one case is missing data on variables 2, 4, and 5. The bottom row shows that four cases are not missing any values at all (all underscores). This table shows the same information generated in part three above, but in a slightly different format. The missing data pattern for each case is described in the variable misspattern. Finally, the third table shows the distribution of the number of missing values per case. This is the same information discussed above in part 2. The number of variables each case is missing is also contained in the variable missnumber.

Variables examined for missing values

   #  Variable        # Missing   % Missing
--------------------------------------------
   1  landval               2        13.3
   2  improval              3        20.0
   3  totval                3        20.0
   4  salepric              4        26.7
   5  saltoapr              2        13.3

Missing for |
      which |
 variables? |      Freq.     Percent        Cum.
------------+-----------------------------------
      1____ |          2       13.33       13.33
      _2_45 |          1        6.67       20.00
      _2___ |          2       13.33       33.33
      __34_ |          1        6.67       40.00
      __3__ |          2       13.33       53.33
      ___4_ |          2       13.33       66.67
      ____5 |          1        6.67       73.33
      _____ |          4       26.67      100.00
------------+-----------------------------------
      Total |         15      100.00

Missing for |
   how many |
 variables? |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          4       26.67       26.67
          1 |          9       60.00       86.67
          2 |          1        6.67       93.33
          3 |          1        6.67      100.00
------------+-----------------------------------
      Total |         15      100.00