How do I write my own bootstrap program?

Stata has the convenient feature of having a bootstrap prefix command which can be seamlessly incorporated with estimation commands (e.g., logistic regression or OLS regression) and non-estimation commands (e.g., summarize). The bootstrap command automates the bootstrap process for the statistic of interest and computes relevant summary measures (i.e., bias and confidence intervals). As convenient as this command is, however, there are instances when the statistic you want to bootstrap does not work within the command. For such instances, you need to write your own bootstrap program.

This Stata FAQ shows how to write your own bootstrap program. For the first example, we match results from the bootstrap command with results from writing a bootstrap program. Ideally, this should reveal how simple it is to write your own bootstrap program. This is followed by an example in which the statistic you want to bootstrap does not work within the bootstrap command, and therefore, requires you to write your own bootstrap program.

Example 1

This example we use the bootstrap command and replicate the results by writing our own bootstrap program. We use the High School and Beyond dataset from which we are going to regress gender (female), math score (math), writing score (write) and socio-economic status (ses) on reading score (read) and bootstrap the root mean squared error (rmse). For the bootstrap we do 100 replications and specify the seed so that we can replicate the results.

use http://statistics.ats.ucla.edu/stat/stata/notes/hsb2, clear
(highschool and beyond (200 cases))

regress read female math write ses

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  4,   195) =   52.58
       Model |  10854.9318     4  2713.73294           Prob > F      =  0.0000
    Residual |  10064.4882   195  51.6127602           R-squared     =  0.5189
-------------+------------------------------           Adj R-squared =  0.5090
       Total |    20919.42   199  105.122714           Root MSE      =  7.1842

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -2.450171   1.101524    -2.22   0.027    -4.622602   -.2777409
        math |   .4565641   .0721114     6.33   0.000     .3143457    .5987825
       write |   .3793564   .0732728     5.18   0.000     .2348475    .5238653
         ses |   1.301982   .7400719     1.76   0.080    -.1575905    2.761555
       _cons |   6.833418   3.279371     2.08   0.038     .3658287    13.30101
------------------------------------------------------------------------------

bootstrap rmse=e(rmse), reps(100) seed(12345): regress read female math write ses
(running regress on estimation sample)

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

Linear regression                               Number of obs      =       200
                                                Replications       =       100

      command:  regress read female math write ses
         rmse:  e(rmse)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        rmse |   7.184202   .2594069    27.69   0.000     6.675774     7.69263
------------------------------------------------------------------------------

estat bootstrap, all

Linear regression                               Number of obs      =       200
                                                Replications       =       100

      command:  regress read female math write ses
         rmse:  e(rmse)

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
        rmse |   7.1842021  -.1006956   .25940687    6.675774    7.69263   (N)
             |                                       6.559784   7.636096   (P)
             |                                       6.778425   7.741319  (BC)
------------------------------------------------------------------------------
(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval

Writing our own bootstrap program requires four steps.

In the first step we obtain initial estimates and store the results in a matrix, say observe. In addition, we must also note the number of observations used in the analysis. This information will be used when we summarize the bootstrap results.
Second, we write a program which we will call myboot that samples the data with replacement and returns the statistic of interest. In this step, we start by preserving the data with the preserve command, then take a bootstrap sample with bsample. bsample samples the data in memory with replacement, which is the essential element of the bootstrap. From the bootstrap sample we run our regression model and output the statistic of interest with the return scalar command. Note that when we define the program, program define myboot, we specify the rclass option; without that option, we would not be able to output the bootstrapped statistic. myboot concludes with the restore command, which returns the data to the original state (prior to the bootstrapped sample).
In the third step, we use the simulate prefix command along with myboot, which collects the statistic from the bootstrapped sample. We specify the seed and number of replications at this step, which coincide with those from the example above.
Finally, we use the bstat command to summarize the results. We include the initial estimates, stored in the matrix observe, and the sample size with the stat( ) and n() options, respectively.

use https://stats.idre.ucla.edu/stat/stata/notes3/hsb2, clear
(highschool and beyond (200 cases))

*Step 1
quietly regress read female math write ses
matrix observe = e(rmse)

*Step 2
capture program drop myboot
program define myboot, rclass
 preserve 
  bsample
  regress read female math write ses
  return scalar rmse = e(rmse)
 restore
end

*Step 3
simulate rmse=r(rmse), reps(100) seed(12345): myboot

      command:  myboot
         rmse:  r(rmse)

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

*Step 4
bstat, stat(observe) n(200)

Bootstrap results                               Number of obs      =       200
                                                Replications       =       100

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        rmse |   7.184202   .2594069    27.69   0.000     6.675774     7.69263
------------------------------------------------------------------------------

estat bootstrap, all

Bootstrap results                               Number of obs      =       200
                                                Replications       =       100

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
        rmse |   7.1842021  -.1006956   .25940687    6.675774    7.69263   (N)
             |                                       6.559784   7.636096   (P)
             |                                       6.778425   7.741319  (BC)
------------------------------------------------------------------------------
(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval

The results from Step 4 match the results from the bootstrap command in the example above.

Example 2

In this example we write a bootstrap program where the usual bootstrap command does not accommodate the statistic we want to bootstrap. The reason why the bootstrap command does not accommodate all situations is because the bootstrap command requires a statistic that falls directly out of the “analysis” command. To see what statistics are accommodated, use either the ereturn list or return list command following the “analysis” command. The distinction between ereturn list or return list depends whether the “analysis” command is an estimation command or not.

Suppose we want to bootstrap the variance inflation factor (vif), which requires us to run regress and then estat vif. In such a situation, the statistic to bootstrap falls out from a post estimation command, which is not obtainable from regress and therefore not accommodated by the bootstrap command. Hence, we must write our own bootstrap program to get a bootstrap estimate of the vif.

use https://stats.idre.ucla.edu/stat/stata/notes3/hsb2, clear
(highschool and beyond (200 cases))

*Step 1
quietly regress read female math write ses
estat vif

    Variable |       VIF       1/VIF  
-------------+----------------------
       write |      1.86    0.537690
        math |      1.76    0.568278
      female |      1.17    0.857692
         ses |      1.11    0.902671
-------------+----------------------
    Mean VIF |      1.47

return list

scalars:
             r(vif_4) =  1.107823014259338
             r(vif_3) =  1.165920257568359
             r(vif_2) =  1.759701371192932
             r(vif_1) =  1.859809398651123

macros:
            r(name_4) : "ses"
            r(name_3) : "female"
            r(name_2) : "math"
            r(name_1) : "write"

matrix vif = ( r(vif_4), r(vif_3), r(vif_2), r(vif_1))
matrix list vif

vif[1,4]
           c1         c2         c3         c4
r1   1.107823  1.1659203  1.7597014  1.8598094

 
*Step 2
capture program drop myboot2
program define myboot2, rclass
 preserve 
  bsample
    regress read female math write ses
    estat vif
    return scalar vif_4 = r(vif_4)
    return scalar vif_3 = r(vif_3)
    return scalar vif_2 = r(vif_2)
    return scalar vif_1 = r(vif_1)
 restore
end

*Step 3
simulate vif_4=r(vif_4) vif_3=r(vif_3) vif_2=r(vif_2) vif_1=r(vif_1), ///
  reps(100) seed(12345): myboot2

      command:  myboot2
        vif_4:  r(vif_4)
        vif_3:  r(vif_3)
        vif_2:  r(vif_2)
        vif_1:  r(vif_1)

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

bstat, stat(vif) n(200)

Bootstrap results                               Number of obs      =       200
                                                Replications       =       100

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       vif_4 |   1.107823   .0344814    32.13   0.000     1.040241    1.175405
       vif_3 |    1.16592   .0524449    22.23   0.000      1.06313     1.26871
       vif_2 |   1.759701   .1349314    13.04   0.000     1.495241    2.024162
       vif_1 |   1.859809   .1467453    12.67   0.000     1.572194    2.147425
------------------------------------------------------------------------------

estat bootstrap, all

Bootstrap results                               Number of obs      =       200
                                                Replications       =       100

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
       vif_4 |    1.107823   .0127056   .03448139    1.040241   1.175405   (N)
             |                                       1.061917   1.197667   (P)
             |                                       1.058617   1.172653  (BC)
       vif_3 |   1.1659203   .0285308    .0524449     1.06313    1.26871   (N)
             |                                        1.10246    1.30328   (P)
             |                                        1.08448   1.255424  (BC)
       vif_2 |   1.7597014   .0305828   .13493139    1.495241   2.024162   (N)
             |                                       1.552449   2.081403   (P)
             |                                       1.510279   2.026165  (BC)
       vif_1 |   1.8598094   .0389828   .14674526    1.572194   2.147425   (N)
             |                                       1.665374   2.196174   (P)
             |                                       1.633619   2.160758  (BC)
------------------------------------------------------------------------------
(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval