1. Introduction
This module will explore missing data in Stata, focusing on numeric missing data. It will describe how to indicate missing data in your raw data files, as well as how missing data are handled in Stata logical commands and assignment statements.
We will illustrate some of the missing data properties in Stata using data from a reaction time study with eight subjects indicated by the variable id , and the subjects reaction times were measured at three time points (trial1, trial2 and trial3). The input data file is shown below.
input id trial1 trial2 trial3 1 1.5 1.4 1.6 2 1.5 . 1.9 3 . 2.0 1.6 4 . . 2.2 5 1.9 2.1 2 6 1.8 2.0 1.9 7 . . . endlist
You might notice that some of the reaction times are coded using a single . as is the case for subject 2. The person measuring time for that trial did not measure the response time properly; therefore, the data point for the second trial is missing.
+-------------------------------+ | id trial1 trial2 trial3 | |-------------------------------| 1. | 1 1.5 1.4 1.6 | 2. | 2 1.5 . 1.9 | 3. | 3 . 2 1.6 | 4. | 4 . . 2.2 | 5. | 5 1.9 2.1 2 | |-------------------------------| 6. | 6 1.8 2 1.9 | 7. | 7 . . . | +-------------------------------+
2. How Stata handles missing data in Stata procedures
As a general rule, Stata commands that perform computations of any type handle missing data by omitting the row with the missing values. However, the way that missing values are omitted is not always consistent across commands, so let’s take a look at some examples.
First, let’s summarize our reaction time variables and see how Stata handles the missing values.
summarize trial1 trial2 trial3
As you see in the output below, summarize computed means using 4 observations for trial1 and trial2 and 6 observations for trial3. In short, the summarize command performed the computations on all the available data.
Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- trial1 | 4 1.675 .2061553 1.5 1.9 trial2 | 4 1.875 .3201562 1.4 2.1 trial3 | 6 1.866667 .233809 1.6 2.2
A second example shows how the tabulation or tab1 command handles missing data. Like summarize, tab1 uses just available data. Note that the percentages are computed based on the total number of non-missing cases.
tab1 trial1 trial2 trial3-> tabulation of trial1 trial1 | Freq. Percent Cum. ------------+----------------------------------- 1.5 | 2 50.00 50.00 1.8 | 1 25.00 75.00 1.9 | 1 25.00 100.00 ------------+----------------------------------- Total | 4 100.00 -> tabulation of trial2 trial2 | Freq. Percent Cum. ------------+----------------------------------- 1.4 | 1 25.00 25.00 2 | 2 50.00 75.00 2.1 | 1 25.00 100.00 ------------+----------------------------------- Total | 4 100.00 -> tabulation of trial3 trial3 | Freq. Percent Cum. ------------+----------------------------------- 1.6 | 2 33.33 33.33 1.9 | 2 33.33 66.67 2 | 1 16.67 83.33 2.2 | 1 16.67 100.00 ------------+----------------------------------- Total | 6 100.00
It is possible that you might want the percentages to be computed out of the total number of observations, and the percentage missing for each variable shown in the table. This can be achieved by including the missing option (which can be shortened to m) after the tabulation command.
tab1 trial1 trial2 trial3, m-> tabulation of trial1 trial1 | Freq. Percent Cum. ------------+----------------------------------- 1.5 | 2 28.57 28.57 1.8 | 1 14.29 42.86 1.9 | 1 14.29 57.14 . | 3 42.86 100.00 ------------+----------------------------------- Total | 7 100.00 -> tabulation of trial2 trial2 | Freq. Percent Cum. ------------+----------------------------------- 1.4 | 1 14.29 14.29 2 | 2 28.57 42.86 2.1 | 1 14.29 57.14 . | 3 42.86 100.00 ------------+----------------------------------- Total | 7 100.00 -> tabulation of trial3 trial3 | Freq. Percent Cum. ------------+----------------------------------- 1.6 | 2 28.57 28.57 1.9 | 2 28.57 57.14 2 | 1 14.29 71.43 2.2 | 1 14.29 85.71 . | 1 14.29 100.00 ------------+----------------------------------- Total | 7 100.00
Let’s look at how the correlate command handles missing data. We would expect that it would perform the computations based on the available data and omit the missing values. Here is an example command.
corr trial1 trial2 trial3
The output is show below. Note how the missing values were excluded. Stata will perform listwise deletion and only display correlation for observations that have non-missing values on all variables listed.
. corr trial1 trial2 trial3 (obs=3) | trial1 trial2 trial3 -------------+--------------------------- trial1 | 1.0000 trial2 | 0.9939 1.0000 trial3 | 1.0000 0.9939 1.0000
Stata also allows for pairwise deletion. Correlations are displayed for the observations that have non-missing values for each pair of variables. This can done using the pwcorr command. We use the obs option to display the number of observation used for each pair. As you can see, they differ depending on the amount of missing.
pwcorr trial1 trial2 trial3, obs| trial1 trial2 trial3 -------------+--------------------------- trial1 | 1.0000 | 3 | trial2 | 0.9939 1.0000 | 3 4 | trial3 | 0.7001 0.6439 1.0000 | 4 4 6
3. Summary of how missing values are handled in Stata procedures
- summarize For each variable, the number of non-missing values are used.
- tabulation By default, missing values are excluded and percentages are based on the number of non-missing values. If you use the missing option on the tab command, the percentages are based on the total number of observations (non-missing and missing) and the percentage of missing values are reported in the table.
- corr By default, correlations are computed based on the number of rows with non-missing data for the variables listed after the corr command (listwise deletion of missing data). The pwcorr command can be used to request that correlations be computed in a pairwise fashion, meaning that all of the available data for each pair of variables will be used to compute the correlation. This means that a different number of observations may be used in the calculation of the correlation coefficients for each pair of variables.
- reg If any of the variables listed after the reg command are missing, the observations missing that value(s) are excluded from the analysis (i.e., listwise deletion of missing data).
- For other procedures, see the Stata manual for information on how missing data are handled.
4. Missing values in assignment statements
It is important to understand how missing values are handled in assignment statements. Consider the example shown below.
gen sum1 = trial1 + trial2 + trial3
The list command below illustrates how missing values are handled in assignment statements. The variable sum1 is based on the variables trial1, trial2 and trial3. If the value of any of those variables were missing, the value for sum1 was set to missing. Therefore sum1 is missing for observations 2, 3, 4 and 7.
list+--------------------------------------+ | id trial1 trial2 trial3 sum1 | |--------------------------------------| 1. | 1 1.5 1.4 1.6 4.5 | 2. | 2 1.5 . 1.9 . | 3. | 3 . 2 1.6 . | 4. | 4 . . 2.2 . | 5. | 5 1.9 2.1 2 6 | |--------------------------------------| 6. | 6 1.8 2 1.9 5.7 | 7. | 7 . . . . | +--------------------------------------+
As a general rule, computations involving missing values yield missing values. For example,
2 + 2 yields 4 2 + . yields . 2 / 2 yields 1 . / 2 yields . 2 * 3 yields 6 2 * . yields .
Whenever you add, subtract, multiply, divide, etc., values that involve missing a missing value, the result is missing.
In our reaction time experiment, the total reaction time sum1 is missing for four out of seven cases. We could try totaling the data for the non-missing trials by using the rowtotal function as shown in the example below.
egen sum2 = rowtotal(trial1 trial2 trial3) list
The results below show that sum2 now contains the sum of the non-missing trials.
+---------------------------------------------+ | id trial1 trial2 trial3 sum1 sum2 | |---------------------------------------------| 1. | 1 1.5 1.4 1.6 4.5 4.5 | 2. | 2 1.5 . 1.9 . 3.4 | 3. | 3 . 2 1.6 . 3.6 | 4. | 4 . . 2.2 . 2.2 | 5. | 5 1.9 2.1 2 6 6 | |---------------------------------------------| 6. | 6 1.8 2 1.9 5.7 5.7 | 7. | 7 . . . . 0 | +---------------------------------------------+
Note that the rowtotal function treats missing as a zero value. When summing several variables it may not be reasonable to treat missing as zero if an observations is missing on all variables to be summed. The rowtotal function with the missing option will return a missing value if an observation is missing on all variables.
egen sum3 = rowtotal(trial1 trial2 trial3) , missing +----------------------------------------------------+ | id trial1 trial2 trial3 sum1 sum2 sum3 | |----------------------------------------------------| 1. | 1 1.5 1.4 1.6 4.5 4.5 4.5 | 2. | 2 1.5 . 1.9 . 3.4 3.4 | 3. | 3 . 2 1.6 . 3.6 3.6 | 4. | 4 . . 2.2 . 2.2 2.2 | 5. | 5 1.9 2.1 2 6 6 6 | |----------------------------------------------------| 6. | 6 1.8 2 1.9 5.7 5.7 5.7 | 7. | 7 . . . . 0 . | +----------------------------------------------------+
Other statements work similarly. For example, observe what happened when we try to create an average variable without using a function (as in the example below). If any of the variables trial1, trial2 or trial3 are missing, the value for avg1 is set to missing.
gen avg1 = (trial1 + trial2 + trial3)/3
Alternatively, the rowmean function averages the data for the non-missing trials in the same way as the rowtotal function.
egen avg2 = rowmean(trial1 trial2 trial3)
Note: Had there been large number of trials, say 50 trials, then it would be annoying to have to type avg=rowmean(trial1 trial2 trial3 trial4 …). Here is a shortcut you could use in this kind of situation:
egen avg3 = rowmean(trial1 - trial3) list+----------------------------------------------------+ | id trial1 trial2 trial3 avg1 avg2 avg3 | |----------------------------------------------------| 1. | 1 1.5 1.4 1.6 1.5 1.5 1.5 | 2. | 2 1.5 . 1.9 . 1.7 1.7 | 3. | 3 . 2 1.6 . 1.8 1.8 | 4. | 4 . . 2.2 . 2.2 2.2 | 5. | 5 1.9 2.1 2 2 2 2 | |----------------------------------------------------| 6. | 6 1.8 2 1.9 1.9 1.9 1.9 | 7. | 7 . . . . . . | +----------------------------------------------------+
Finally, you can use the rowmiss and rownomiss functions to determine the number of missing and the number of non-missing values, respectively, in a list of variables. This is illustrated below.
egen miss = rowmiss(trial1 - trial3) egen nomiss = rownonmiss(trial1 - trial3) list
For the variable nomiss, observations 1, 5 and 6 had three valid values, observations 2 and 3 had two valid values, observation 4 had only one valid value and observation 7 had no valid values. The variable miss shows the opposite; it provides a count of the number of missing values.
+-----------------------------------------------+ | id trial1 trial2 trial3 miss nomiss | |-----------------------------------------------| 1. | 1 1.5 1.4 1.6 0 3 | 2. | 2 1.5 . 1.9 1 2 | 3. | 3 . 2 1.6 1 2 | 4. | 4 . . 2.2 2 1 | 5. | 5 1.9 2.1 2 0 3 | |-----------------------------------------------| 6. | 6 1.8 2 1.9 0 3 | 7. | 7 . . . 3 0 | +-----------------------------------------------+
5. Missing values in logical statements
It is important to understand how missing values are handled in logical statements. For example, say that you want to create a 0/1 variable for trial2 that is 1 if it is 1.5 or less, and 0 if it is over 1.5. We show this below (incorrectly, as you will see).
gen newvar1 =(trial2 <1.5) list trial2 newvar1
It appears that something went wrong with our newly created variable newvar1! The observations with missing values for trial2 were assigned a zero for newvar1.
+------------------+ | trial2 newvar1 | |------------------| 1. | 1.4 1 | 2. | . 0 | 3. | 2 0 | 4. | . 0 | 5. | 2.1 0 | |------------------| 6. | 2 0 | 7. | . 0 | +------------------+
Let’s explore why this happened by looking at the frequency table of trial2.
As you can see in the output, missing values are at the listed after the highest value 2.1. This is because Stata treats a missing value as the largest possible value (e.g., positive infinity) and that value is greater than 2.1, so then the values for newvar1 become 0.
tab trial2, missingtrial2 | Freq. Percent Cum. ------------+----------------------------------- 1.4 | 1 14.29 14.29 2 | 2 28.57 42.86 2.1 | 1 14.29 57.14 . | 3 42.86 100.00 ------------+----------------------------------- Total | 7 100.00
Now that we understand how Stata treats missing values, we will explicitly exclude missing values to make sure they are treated properly, as shown below.
gen newvar2 =(trial2 <1.5) if trial2 !=. list trial2 newvar1 newvar2
As you can see in the Stata output below, the new variable newvar2 has missing values for observations that are also missing for trial2.
+----------------------------+ | trial2 newvar1 newvar2 | |----------------------------| 1. | 1.4 1 1 | 2. | . 0 . | 3. | 2 0 0 | 4. | . 0 . | 5. | 2.1 0 0 | |----------------------------| 6. | 2 0 0 | 7. | . 0 . | +----------------------------+
6. Missing values in logical statements
- When creating or recoding variables that involve missing values, always pay attention to whether the variable includes missing values.
7. For more information
- See the Stata FAQ: How can I recode missing values into different categories?
- See the Stata FAQ: Can I quickly see how many missing values a variable has? for more information on examining the number of missing and non-missing values for a particular variable or set of variables.