When performing data analysis, it is very common for a given model (e.g., a regression model) not to use all of the cases in the dataset. This can occur for a number of reasons, for example, because the if qualifier was used to tell Stata to perform the analysis on a subset of cases, or because some cases had missing values on some or all of the variables in the analysis. To allow you to identify the cases used in an analysis, most Stata estimation commands return a function, e(sample), that takes on a value of one if the case was included in the analysis and zero otherwise (for more information see our Stata FAQ: How can I access information stored after I run a command in Stata (returned results)?). Below we show how this can be useful in two common situations. Many more situations exist; once you're aware of this function and how it works, you'll recognize them. The examples below use two different versions of the hsb2 dataset. Both versions contain information on 200 high school students, including their scores on a series of standardized tests and some demographic information.
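For instance, after any estimation command you can inspect this function directly. Here is a minimal sketch using Stata's bundled auto dataset (not one of the hsb2 files used below):

sysuse auto, clear
regress price mpg weight
count if e(sample)    // reports the number of cases used in the regression above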
When analyzing a subset of data
In this example we run a regression model predicting students' reading scores from their math and science scores. However, we use if to indicate that we want to run our model only on those cases where the variable write is greater than or equal to 50. Below we see the output for this regression. Note that 128 observations were used in the analysis, rather than the full 200, because we restricted the sample using if.
use https://stats.idre.ucla.edu/stat/stata/notes/hsb2, clear

regress read math science if write>=50

      Source |       SS       df       MS              Number of obs =     128
-------------+------------------------------           F(  2,   125) =   43.61
       Model |  4595.51237     2  2297.75618           Prob > F      =  0.0000
    Residual |  6585.72982   125  52.6858386           R-squared     =  0.4110
-------------+------------------------------           Adj R-squared =  0.4016
       Total |  11181.2422   127  88.0412771           Root MSE      =  7.2585

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .4952647   .0898744     5.51   0.000     .3173921    .6731373
     science |   .2960287   .0942091     3.14   0.002     .1095772    .4824803
       _cons |    11.7755    4.86182     2.42   0.017     2.153353    21.39764
------------------------------------------------------------------------------
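As a quick check, counting the cases that satisfy the restriction should reproduce the sample size reported above (this assumes write has no missing values in this version of hsb2; missing values would also satisfy write>=50, since Stata treats missing as larger than any number):

count if write >= 50    // should report 128, matching Number of obs above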
Once we have run our model, we can generate predicted values using the predict command. Below we generate a new variable, p1, which contains the predicted value for each case. When we use summarize to examine the predicted values, we see that the variable p1 has 200 observations, even though the model from which these predictions were made used only 128 observations. Predicted values were generated both for the 128 cases where write>=50 and for the 72 cases where write<50 (which were not used to estimate the model). Generally, we don't want to apply a model estimated on one sample (in our case, observations where write>=50) to a different sample (observations where write<50). This is particularly true in cases like this one, where we know there is a systematic difference between the samples.
predict p1
(option xb assumed; fitted values)

summarize p1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          p1 |       200     53.1978    6.875587   40.55244   70.23441
We can use e(sample) to generate predicted values only for those cases used to estimate the model. Below we use predict to generate a new variable, p2, that contains predicted values, but this time we add if e(sample)==1, which indicates that predicted values should be created only for cases used in the last model we ran. This time Stata tells us that 72 missing values were generated: the 72 cases where write<50 were given missing values for p2 rather than predicted values. Summarizing the data again shows that p2 contains predictions for only the 128 cases used to estimate the model.
predict p2 if e(sample)==1
(option xb assumed; fitted values)
(72 missing values generated)

sum p2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          p2 |       128    56.11719    6.015408   42.64159   70.23441
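If you want to confirm that the two prediction variables agree on the estimation sample, a quick check (e(sample) still refers to the regression, since predict and summarize do not replace stored estimation results):

summarize p1 if e(sample)          // should match the summary of p2 above
assert p1 == p2 if e(sample)       // exits with an error if any values differ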
For model comparison
When we want to compare nested models, the models must be estimated on the same sample for the comparison to be valid. When a dataset contains missing values, adding predictor variables to a model often reduces the number of cases available. In this example we fit a model in which write predicts read, and compare its fit to that of a model containing math and science as well as write as predictors. We will compare the two models using a likelihood ratio test (i.e., the command lrtest). Below we first run a regression model where the variable read is predicted by the variable write, and store the estimates from that model as m1 using the command estimates store m1.
use https://stats.idre.ucla.edu/stat/stata/faq/hsb2_mar, clear

regress read write

      Source |       SS       df       MS              Number of obs =     170
-------------+------------------------------           F(  1,   168) =   94.38
       Model |  6188.25135     1  6188.25135           Prob > F      =  0.0000
    Residual |  11014.7428   168   65.563945           R-squared     =  0.3597
-------------+------------------------------           Adj R-squared =  0.3559
       Total |  17202.9941   169  101.792865           Root MSE      =  8.0972

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .6496086   .0668652     9.72   0.000     .5176042    .7816129
       _cons |   17.65687   3.589724     4.92   0.000      10.5701    24.74365
------------------------------------------------------------------------------

estimates store m1
Below we run a second model where read is predicted by write, math, and science. We store the estimates from this model as m2.
reg read write math science

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  3,   137) =   47.27
       Model |    7560.153     3    2520.051           Prob > F      =  0.0000
    Residual |  7304.15906   137  53.3150296           R-squared     =  0.5086
-------------+------------------------------           Adj R-squared =  0.4979
       Total |  14864.3121   140  106.173658           Root MSE      =  7.3017

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .2143165   .0915771     2.34   0.021     .0332291    .3954039
        math |   .3973615   .1020276     3.89   0.000     .1956088    .5991141
     science |   .3108218   .0905435     3.43   0.001     .1317781    .4898654
       _cons |   3.851736   4.091921     0.94   0.348    -4.239757    11.94323
------------------------------------------------------------------------------

estimates store m2
Now that we have estimated the two models and stored the results, we want to test whether the model that contains write, math, and science fits significantly better than the model that contains only write as a predictor. One way to do this is with a likelihood ratio test, which is what we attempt below with the command lrtest m1 m2. However, this command generates an error message: the models were not estimated on the same number of cases, and for the test to be valid the two models must be run on the same cases. Looking at the error message and the output from our regressions, we see that the model using only write as a predictor was run on 170 cases, while the model containing write, math, and science was run on 141 cases. The only difference between the two models is the addition of the variables math and science, indicating that the difference in sample size is due to missing data on math and science.
lrtest m1 m2
observations differ: 141 vs. 170
r(498);
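If you want to see where the missing values are, Stata's missing-value tools can help. A short diagnostic sketch (misstable requires Stata 11 or later):

misstable summarize read write math science
count if !missing(read, write) & missing(math, science)   // cases lost when math and science are added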
So how do we make sure that the two models are estimated on the same cases? First, we run the model with write, math, and science as predictors and store the estimates as m2. Then we use the generate command (gen) to create a new variable called sample that is equal to the function e(sample). In other words, the variable sample is equal to one if the case was included in the last analysis (i.e., the model we just ran) and zero otherwise.
regress read write math science

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  3,   137) =   47.27
       Model |    7560.153     3    2520.051           Prob > F      =  0.0000
    Residual |  7304.15906   137  53.3150296           R-squared     =  0.5086
-------------+------------------------------           Adj R-squared =  0.4979
       Total |  14864.3121   140  106.173658           Root MSE      =  7.3017

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .2143165   .0915771     2.34   0.021     .0332291    .3954039
        math |   .3973615   .1020276     3.89   0.000     .1956088    .5991141
     science |   .3108218   .0905435     3.43   0.001     .1317781    .4898654
       _cons |   3.851736   4.091921     0.94   0.348    -4.239757    11.94323
------------------------------------------------------------------------------

estimates store m2

generate sample = e(sample)
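As a quick check, the new variable should flag exactly the cases used in the model above:

count if sample == 1    // should report 141, matching Number of obs above
tabulate sample         // shows how many cases fall in and out of the estimation sample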
Now we can use the variable sample to run the model with only write as a predictor on the same 141 cases:
regress read write if sample==1

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  1,   139) =   70.12
       Model |  4984.37291     1  4984.37291           Prob > F      =  0.0000
    Residual |  9879.93915   139  71.0786989           R-squared     =  0.3353
-------------+------------------------------           Adj R-squared =  0.3305
       Total |  14864.3121   140  106.173658           Root MSE      =  8.4308

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .6479233   .0773728     8.37   0.000     .4949436     .800903
       _cons |   17.43003     4.1396     4.21   0.000     9.245303    25.61475
------------------------------------------------------------------------------

estimates store m1
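As an aside, the same restriction can be written without creating a variable: the if qualifier is evaluated using the previous model's e(sample) before the new estimation replaces it, so you can fit the reduced model immediately after the full model. A minimal sketch:

regress read write math science      // the full model defines the sample
estimates store m2
regress read write if e(sample)      // reduced model on the same 141 cases
estimates store m1

Creating the sample variable, as we did above, has the advantage that the flag persists in the dataset and can be reused in later commands.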
Now we can use the lrtest command again to test whether the model with write, math, and science as predictors fits significantly better than the model with just write as a predictor.
lrtest m1 m2

Likelihood-ratio test                                 LR chi2(2)  =     42.59
(Assumption: m1 nested in m2)                         Prob > chi2 =    0.0000
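If you are curious about the arithmetic behind the statistic, lrtest computes twice the difference in log likelihoods, with degrees of freedom equal to the number of added parameters (here 2, for math and science). A sketch that reproduces the test by hand from the stored results:

estimates restore m1
scalar ll1 = e(ll)                                 // log likelihood of the reduced model
estimates restore m2
scalar ll2 = e(ll)                                 // log likelihood of the full model
display "LR chi2(2)  = " 2*(ll2 - ll1)             // should reproduce 42.59
display "Prob > chi2 = " chi2tail(2, 2*(ll2 - ll1))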