Sometimes, a data set may have “holes” in it, that is, missing values. Some statistical procedures such as regression analysis will not work as well, or at all, on a data set with missing values. The observations with missing values have to be deleted or the missing values have to be substituted in order for a statistical procedure to produce meaningful results. Most statistical programs (including SAS, SPSS, and Stata) will automatically remove such cases from any analysis you run (without deleting the cases from the dataset). This is why the “n” often varies from analysis to analysis, even if the dataset is the same. Different variables have different amounts of missing data and hence, changing the variables in a model changes the number of cases with complete data on all the variables in the model. Because the software drops cases with missing values for us, it is very easy to “forget” about missing data entirely. However, the presence of missing data can influence our results, especially when a dataset or even a single variable, has a high percentage of values missing. Thus it is always a good idea to check a dataset for missing data, and to think about how the missing data may influence our analyses. This page shows a few methods of looking at missing values in a dataset, this information can be used to make better informed decisions about how to handle the missing values.
Before we begin, we need some data with missing values, the code below inputs a small dataset into Stata, and then displays that data. In a small dataset, like the one below, it is very easy to look at the raw data and see where values are missing. However, when datasets are large, we need a more systematic way to examine our dataset for missing values. Below we show you some ways to do that, using the data below as an example.
clear input landval improval totval salepric saltoapr 30000 64831 94831 118500 1.25 30000 50765 80765 93900 . 46651 18573 65224 . 1.16 45990 91402 . 184000 1.34 42394 . 40575 168000 1.43 . 3351 51102 169000 1.12 63596 2182 65778 . 1.26 56658 53806 10464 255000 1.21 51428 72451 . . 1.18 93200 . 4321 422000 1.04 76125 78172 54297 290000 1.14 . 61934 16294 237000 1.10 65376 34458 . 286500 1.43 42400 . 57446 . . 40800 92606 33406 168000 1.26 end list +---------------------------------------------------+ | landval improval totval salepric saltoapr | |---------------------------------------------------| 1. | 30000 64831 94831 118500 1.25 | 2. | 30000 50765 80765 93900 . | 3. | 46651 18573 65224 . 1.16 | 4. | 45990 91402 . 184000 1.34 | 5. | 42394 . 40575 168000 1.43 | |---------------------------------------------------| 6. | . 3351 51102 169000 1.12 | 7. | 63596 2182 65778 . 1.26 | 8. | 56658 53806 10464 255000 1.21 | 9. | 51428 72451 . . 1.18 | 10. | 93200 . 4321 422000 1.04 | |---------------------------------------------------| 11. | 76125 78172 54297 290000 1.14 | 12. | . 61934 16294 237000 1.1 | 13. | 65376 34458 . 286500 1.43 | 14. | 42400 . 57446 . . | 15. | 40800 92606 33406 168000 1.26 | +---------------------------------------------------+
1. Number of missing values vs. number of non missing values
The first thing we are going to do is determine which variables have a lot of missing values. We have created a small Stata program called mdesc that counts the number of missing values in both numeric and character variables. You can download mdesc from within Stata by typing search mdesc (see How can I use the search command to search for programs and get additional help? for more information about using search).
Then you can run mdesc for one or more variables as illustrated below.
mdesc Variable Missing Total Missing/Total ------------------------------------------------------------ landval 2 15 .133333 improval 3 15 .2 totval 3 15 .2 salepric 4 15 .266667 saltoapr 2 15 .133333
Now we know the number of missing values in each variable. For instance, variable salepric has four missing values and saltoapr has two missing values.
2. Obtaining the number of missing values per observation
We can also look at the distribution of missing values across observations. The code below creates a variable called nmis that gives the number of missing values for each observation. The function rmiss2() used here is an extension to the egen function rmiss(). It counts the number of missing values in the varlist. rmiss2() accepts both string and numeric variables. (Stata’s rmiss() only accepts numeric variables.) You can download rmiss2() over the internet from within Stata by typing search rmiss2 (see How can I use the search command to search for programs and get additional help? for more information about using search).
egen nmis=rmiss2(landval improval totval salepric saltoapr)
Below we tabulate the variable we just created. Looking at the frequency table we know that there are four observations with no missing values, nine observations with one missing values, one observation with two missing values and one observation with three missing values.
tab nmis nmis | Freq. Percent Cum. ------------+----------------------------------- 0 | 4 26.67 26.67 1 | 9 60.00 86.67 2 | 1 6.67 93.33 3 | 1 6.67 100.00 ------------+----------------------------------- Total | 15 100.00
3. Patterns of missing values
We can also look at the patterns of missing values. You can download mvpatterns over the internet from within Stata by typing search mvpatterns (see How can I use the search command to search for programs and get additional help? for more information about using search). The command mvpatterns produces output for all variables in the dataset, for missing data patterns across a subset of variables, a variable list can be included, for example, mvpatterns landval improval totval.
The output produced by mvpatterns is shown below. The first table lists the variables, their storage type (type), the number of observations (obs), the number of missing values (mv), and the variable label if the variables has one. The second table contains the information on the pattern of missing values. The first block of columns in the output shows the patterns of missing data. Within the block, each variable is represented by a column, a “+” indicates that values of that variable are present in a given missing data pattern, a “.” indicates that they are missing. The columns follow the same order as the variable list in the first table, so that the first column in the output below represents landval, the second improval, and so on. The missing data patterns are listed in descending frequency, here the most common missing data pattern is complete data (“+++++”). The table also shows the number of missing values in that pattern (_mv), and the number of cases with that missing data pattern (_freq). Based on the information in the second table we know that there are four observations with no missing values, two cases missing on just the variable salepric, and one observation with missing values on improval, salepric and saltoapr.
mvpatterns Variable | type obs mv variable label -------------+----------------------------------- landval | float 13 2 improval | float 12 3 totval | float 12 3 salepric | float 11 4 saltoapr | float 13 2 ------------------------------------------------- Patterns of missing values +------------------------+ | _pattern _mv _freq | |------------------------| | +++++ 0 4 | | +++.+ 1 2 | | ++.++ 1 2 | | +.+++ 1 2 | | .++++ 1 2 | |------------------------| | ++++. 1 1 | | ++..+ 2 1 | | +.+.. 3 1 | +------------------------+
4. When all the variables of interest are numeric
The examples above all work regardless of whether the variables of interest (i.e. the variables whose missing data patterns you want to examine) are numeric or string. When all of the variables you wish to check for missing values are numeric we can use a program called misschk to simplify the steps of examining the missing data in our dataset. (Note: numeric variables include those with value labels that are strings, as long as the actual values of the variables are stored as numbers.) You can download misschk from within Stata by typing search misschk (see How can I use the search command to search for programs and get additional help? for more information about using search).
Below is the command for misschk. We have listed all five of the variables in our dataset in the variable list after the misschk command. However, we could have just left the list of variables blank (i.e. used only misschk , gen(miss) instead), if we had, misschk would have run using all the variables in our dataset. The variable list is only necessary if we want to run misschk on only some of the variables in our dataset. The gen(miss) option tells misschk that we want it to create two new variables, both of which start with “miss”. These two variables will be named misspattern and missnumber. The variable misspattern indicates which of the missing data patterns each case follows. The variable missnumber indicates the number of missing values for each case.
misschk landval improval totval salepric saltoapr, gen(miss)
The output for misschk consists of three tables. The first table lists the number of missing values, as well as percent missing for each variable, this is similar to the table produced by mdesc in part 1 above. This table also contains a column labeled “#” which assigns each variable a number that is used to identify the variable later on in the output. The second table shows the distribution of missing values. The pattern of missingness is described using the variable numbers from the first table, and underscores (“_”). The numbers indicate which variables are missing in that pattern, the underscores represent non-missing observations. For example, from the second table we see that two cases have missing values on variable 1 (landval), but complete data on all other variables, and that one case is missing data on variables 2, 4, and 5. The bottom row shows that four cases are not missing any values at all (all underscores). This table shows the same information generated in part three above, but in a slightly different format. The missing data pattern for each case is described in the variable misspattern. Finally, the third table shows the distribution of the number of missing values per case. This is the same information discussed above in part 2. The number of variables each case is missing is also contained in the variable missnumber.
Variables examined for missing values # Variable # Missing % Missing -------------------------------------------- 1 landval 2 13.3 2 improval 3 20.0 3 totval 3 20.0 4 salepric 4 26.7 5 saltoapr 2 13.3 Missing for | which | variables? | Freq. Percent Cum. ------------+----------------------------------- 1____ | 2 13.33 13.33 _2_45 | 1 6.67 20.00 _2___ | 2 13.33 33.33 __34_ | 1 6.67 40.00 __3__ | 2 13.33 53.33 ___4_ | 2 13.33 66.67 ____5 | 1 6.67 73.33 _____ | 4 26.67 100.00 ------------+----------------------------------- Total | 15 100.00 Missing for | how many | variables? | Freq. Percent Cum. ------------+----------------------------------- 0 | 4 26.67 26.67 1 | 9 60.00 86.67 2 | 1 6.67 93.33 3 | 1 6.67 100.00 ------------+----------------------------------- Total | 15 100.00