When performing data analysis, it is very common for a given model (e.g., a regression model) not to use all of the cases in the dataset. This can occur for a number of reasons, for example, because the if qualifier was used to tell Stata to perform the analysis on a subset of cases, or because some cases had missing values on some or all of the variables in the analysis. To allow you to identify the cases used in an analysis, most Stata estimation commands return the function e(sample), which takes on a value of one if the case was included in the analysis and zero otherwise (for more information see our Stata FAQ: How can I access information stored after I run a command in Stata (returned results)?). Below we show how this can be useful in two common situations. Many more situations exist; once you're aware of this function and how it works, you'll recognize them. The examples below use two different versions of the hsb2 dataset. Both versions contain information on 200 high school students, including their scores on a series of standardized tests and some demographic information.
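For instance, after any estimation command you can refer to e(sample) directly in an expression. The short sketch below (our own illustration, previewing the first example in this FAQ) counts the cases the model did and did not use:

use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear
quietly regress read math science if write>=50
count if e(sample)    // cases used by the model (128)
count if !e(sample)   // cases excluded from the model (72)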
When analyzing a subset of data
In this example we run a regression model predicting students' reading scores from their math and science scores. However, we use the if qualifier to indicate that we want to run our model only on those cases where the variable write is greater than or equal to 50. Below we see the output for this regression. Note that 128 observations were used in the analysis, rather than the full 200, because we restricted the sample using if.
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear
regress read math science if write>=50
      Source |       SS       df       MS              Number of obs =     128
-------------+------------------------------           F(  2,   125) =   43.61
       Model |  4595.51237     2  2297.75618           Prob > F      =  0.0000
    Residual |  6585.72982   125  52.6858386           R-squared     =  0.4110
-------------+------------------------------           Adj R-squared =  0.4016
       Total |  11181.2422   127  88.0412771           Root MSE      =  7.2585

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .4952647   .0898744     5.51   0.000     .3173921    .6731373
     science |   .2960287   .0942091     3.14   0.002     .1095772    .4824803
       _cons |    11.7755    4.86182     2.42   0.017     2.153353    21.39764
------------------------------------------------------------------------------
Once we have run our model, we can generate predicted values using the predict command. Below we generate a new variable, p1, which contains the predicted value for each case. When we use summarize to examine the predicted values, we see that the variable p1 has 200 observations, even though the model from which these predictions were made used only 128 observations. Predicted values were generated both for the 128 cases where write>=50 and for the 72 cases where write<50 (which were not used to estimate the model). Generally, we don't want to apply a model estimated on one sample (in our case, observations where write>=50) to a different sample (observations where write<50). This is particularly true in cases like this one, where we know there is a systematic difference between the samples.
predict p1
(option xb assumed; fitted values)
summarize p1
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          p1 |       200     53.1978    6.875587   40.55244   70.23441
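To see this directly, we can summarize p1 separately for the two groups (a quick check we are adding here, not part of the original output):

summarize p1 if write>=50   // the 128 cases used to estimate the model
summarize p1 if write<50    // the 72 out-of-sample cases, which also received predictions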
We can use e(sample) to generate predicted values only for those cases used to estimate the model. Below we use predict to generate a new variable, p2, that contains predicted values, but this time we add if e(sample)==1, which indicates that predicted values should be created only for cases used in the last model we ran. This time Stata tells us that 72 missing values were generated: there are 72 cases where write<50 in the dataset, and rather than predicted values, these cases were given missing values for p2. Summarizing the data again confirms that p2 contains predicted values only for the 128 cases used to estimate the model.
predict p2 if e(sample)==1
(option xb assumed; fitted values)
(72 missing values generated)
sum p2
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          p2 |       128    56.11719    6.015408   42.64159   70.23441
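Because e(sample) is always either zero or one, if e(sample)==1 can be abbreviated to if e(sample). The sketch below (our own addition; p3 is a hypothetical variable name) uses the shorthand and should confirm that the in-sample predictions in p1 and p2 are identical:

predict p3 if e(sample)        // equivalent to: predict p3 if e(sample)==1
assert p1 == p2 if e(sample)   // in-sample predictions in p1 and p2 match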
For model comparison
When we want to compare nested models, the models must be estimated on the same sample in order for the comparison to be valid. When a dataset contains missing values, adding predictor variables to a model often reduces the number of cases available for that model. In this example we fit a model where write predicts read, and compare its fit to that of a model that contains math and science as well as write as predictors. We will compare the two models using a likelihood ratio test (i.e., the command lrtest). Below we first run a regression model where the variable read is predicted by the variable write, and store the estimates from that model as m1 using the command estimates store m1.
use https://stats.idre.ucla.edu/stat/stata/faq/hsb2_mar, clear
regress read write
      Source |       SS       df       MS              Number of obs =     170
-------------+------------------------------           F(  1,   168) =   94.38
       Model |  6188.25135     1  6188.25135           Prob > F      =  0.0000
    Residual |  11014.7428   168   65.563945           R-squared     =  0.3597
-------------+------------------------------           Adj R-squared =  0.3559
       Total |  17202.9941   169  101.792865           Root MSE      =  8.0972

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .6496086   .0668652     9.72   0.000     .5176042    .7816129
       _cons |   17.65687   3.589724     4.92   0.000      10.5701    24.74365
------------------------------------------------------------------------------
estimates store m1
Below we run a second model where read is predicted by write, math, and science. We store the estimates from this model as m2.
reg read write math science
      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  3,   137) =   47.27
       Model |    7560.153     3    2520.051           Prob > F      =  0.0000
    Residual |  7304.15906   137  53.3150296           R-squared     =  0.5086
-------------+------------------------------           Adj R-squared =  0.4979
       Total |  14864.3121   140  106.173658           Root MSE      =  7.3017

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .2143165   .0915771     2.34   0.021     .0332291    .3954039
        math |   .3973615   .1020276     3.89   0.000     .1956088    .5991141
     science |   .3108218   .0905435     3.43   0.001     .1317781    .4898654
       _cons |   3.851736   4.091921     0.94   0.348    -4.239757    11.94323
------------------------------------------------------------------------------
estimates store m2
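At this point both sets of results are stored. As a small check (our own addition), estimates dir lists the stored estimation results:

estimates dir   // should list m1 and m2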
Now that we have estimated the two models and stored the results, we want to test whether the model that contains write, math, and science fits significantly better than the model that contains only write as a predictor. One way to do this is with a likelihood ratio test, which is what is done below with the command lrtest m1 m2. However, this command generates an error message: the two models were not estimated on the same cases, which they must be for the test to be valid. Looking at the error message and the output from our regressions, we see that the model using only write as a predictor was run on 170 cases, while the model that contained write, math, and science as predictors was run on 141 cases. The only difference between the two models is the addition of the variables math and science, indicating that the difference in sample size is due to missing data on math and science.
lrtest m1 m2
observations differ: 141 vs. 170
r(498);
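One way to see which variables are responsible for the lost cases (a diagnostic step we are adding here; misstable requires Stata 11 or later) is to summarize the missing values:

misstable summarize read write math science   // shows counts of missing values per variable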
So how do we make sure that the two models are run on the same cases? First, we run the model with write, math, and science as predictors, and store the estimates as m2. Then we use the generate command (gen) to create a new variable called sample that is equal to the function e(sample). In other words, the variable sample is equal to one if the case was included in the last analysis (i.e., the model we just ran) and zero otherwise.
regress read write math science
      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  3,   137) =   47.27
       Model |    7560.153     3    2520.051           Prob > F      =  0.0000
    Residual |  7304.15906   137  53.3150296           R-squared     =  0.5086
-------------+------------------------------           Adj R-squared =  0.4979
       Total |  14864.3121   140  106.173658           Root MSE      =  7.3017

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .2143165   .0915771     2.34   0.021     .0332291    .3954039
        math |   .3973615   .1020276     3.89   0.000     .1956088    .5991141
     science |   .3108218   .0905435     3.43   0.001     .1317781    .4898654
       _cons |   3.851736   4.091921     0.94   0.348    -4.239757    11.94323
------------------------------------------------------------------------------
estimates store m2
generate sample = e(sample)
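As a quick check (our own addition), tabulating the new variable should show 141 cases flagged as belonging to the estimation sample:

tabulate sample   // expect 141 ones and 59 zeros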
Now we can use the variable sample to run the model with only write as a predictor on the same 141 cases.
regress read write if sample==1
      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  1,   139) =   70.12
       Model |  4984.37291     1  4984.37291           Prob > F      =  0.0000
    Residual |  9879.93915   139  71.0786989           R-squared     =  0.3353
-------------+------------------------------           Adj R-squared =  0.3305
       Total |  14864.3121   140  106.173658           Root MSE      =  8.4308

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .6479233   .0773728     8.37   0.000     .4949436     .800903
       _cons |   17.43003     4.1396     4.21   0.000     9.245303    25.61475
------------------------------------------------------------------------------
estimates store m1
Now we can use the lrtest command again to test whether the model with write, math, and science as predictors fits significantly better than the model with just write as a predictor.
lrtest m1 m2

Likelihood-ratio test                                  LR chi2(2)  =     42.59
(Assumption: m1 nested in m2)                          Prob > chi2 =    0.0000
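As an aside, because these are nested linear models fit to the same cases, a Wald test of the added predictors gives an equivalent test directly after the full model (a brief alternative we are noting here, not part of the original write-up):

estimates restore m2   // make m2 the active estimation results again
test math science      // joint Wald test that both added coefficients are zero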
