Bootstrapping allows us to estimate statistics through repeated resampling of the data. On this page, we demonstrate several methods of bootstrapping a confidence interval about an R-squared statistic in SAS. We will be using the hsb2 dataset. We begin by running an OLS regression predicting read from female, math, write, and ses, and saving the R-squared value in a dataset called t0. The R-squared value from this regression is 0.5189.
ods output FitStatistics = t0;
proc reg data = hsb2;
  model read = female math write ses;
run;
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: read reading score

Number of Observations Read         200
Number of Observations Used         200

                             Analysis of Variance

                                    Sum of          Mean
Source                  DF        Squares        Square    F Value    Pr > F
Model                    4          10855    2713.73294      52.58    <.0001
Error                  195          10064      51.61276
Corrected Total        199          20919

Root MSE              7.18420    R-Square     0.5189
Dependent Mean       52.23000    Adj R-Sq     0.5090
Coeff Var            13.75493

                                Parameter Estimates

                                        Parameter       Standard
Variable     Label            DF         Estimate          Error    t Value    Pr > |t|
Intercept    Intercept         1          6.83342        3.27937       2.08      0.0385
female                         1         -2.45017        1.10152      -2.22      0.0273
math         math score        1          0.45656        0.07211       6.33      <.0001
write        writing score     1          0.37936        0.07327       5.18      <.0001
ses                            1          1.30198        0.74007       1.76      0.0801

* store the estimated r-square;
data _null_;
  set t0;
  if label2 = "R-Square" then call symput('r2bar', cvalue2);
run;
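The FitStatistics table produced by ODS stores the displayed statistics as character values (in cvalue2), which is why the data _null_ step above captures cvalue2 with call symput. As an optional sanity check, you could print t0 and echo the macro variable to the log; this small sketch is not part of the workflow on this page, just a way to confirm that r2bar was populated.

* optional check: inspect the ODS table and write the captured value to the log;
proc print data = t0;
run;

%put Original R-squared captured in r2bar: &r2bar;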
To bootstrap a confidence interval about this R-squared value, we first need to resample. This step involves sampling with replacement from our original dataset to generate new datasets of the same size as the original. For each of these samples, we will run the same regression as above and save the R-squared value. proc surveyselect allows us to do all of this resampling in one step.
Before carrying out this step, let's outline the assumptions we are making about our data when we use this method. We are assuming that the observations in our dataset are independent. We are also assuming that the statistic we are estimating is asymptotically normally distributed.
We indicate an output dataset, a seed, a sampling method, and the number of replicates. The sampling method indicated, urs, is unrestricted random sampling, or sampling with replacement. The samprate option indicates how large each sample should be relative to the input dataset; a samprate of 1 means that each sampled dataset should be the same size as the input dataset. In this example, we generate 500 samples of 200 observations each, so our output dataset bootsample will have 100,000 observations.
%let rep = 500;

proc surveyselect data = hsb2 out = bootsample seed = 1347
  method = urs samprate = 1 outhits rep = &rep;
run;

ods listing close;

The SURVEYSELECT Procedure

Selection Method    Unrestricted Random Sampling

Input Data Set                      HSB2
Random Number Seed                  1347
Sampling Rate                          1
Sample Size                          200
Expected Number of Hits                1
Sampling Weight                        1
Number of Replicates                 500
Total Sample Size                 100000
Output Data Set               BOOTSAMPLE
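If you want to see what unrestricted random sampling is doing, the same resampling can be written out by hand in a data step that draws 200 row numbers with replacement for each replicate. The sketch below is only for illustration (the output name bootsample2 is arbitrary, and the seed is reused from above); the rest of this page continues with the bootsample dataset created by proc surveyselect.

* illustrative sketch: sampling with replacement via a data step;
data bootsample2;
  do replicate = 1 to &rep;                * one pass per bootstrap replicate;
    do i = 1 to nobs;                      * draw as many rows as the original data;
      pt = ceil(ranuni(1347) * nobs);      * random row number between 1 and nobs;
      set hsb2 point = pt nobs = nobs;     * read that row by direct access;
      output;
    end;
  end;
  stop;                                    * stop is required when reading with point=;
run;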
With this dataset, we will now run our regression model, specifying by replicate so that the model will be run separately for each of the 500 sample datasets. After that, we use a data step to convert the R-squared values to numeric.
ods output FitStatistics = t (where = (label2 = "R-Square"));
proc reg data = bootsample;
  by replicate;
  model read = female math write ses;
run;
quit;
* converting character type to numeric type;
data t1;
  set t;
  r2 = cvalue2 + 0;
run;
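Before computing any intervals, it can be useful to look at the bootstrap distribution of R-squared itself, both to spot anything odd and to informally check the approximate normality assumed by the first method below. A minimal, optional sketch using proc sgplot:

* optional: visualize the bootstrap distribution of R-squared;
proc sgplot data = t1;
  histogram r2;
  density r2 / type = normal;   * overlay a normal curve for comparison;
run;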
Method 1: Normal Distribution Confidence Interval
We will first create a confidence interval using normal distribution theory. This approach assumes that the bootstrap R-squared values are approximately normally distributed, so we can generate a 95% confidence interval centered at our original R-squared estimate using critical values from a t-distribution with 499 degrees of freedom (the number of replicates minus one). We find the critical t values for our confidence interval and multiply them by the standard deviation of the R-squared values from our 500 replications, which serves as the bootstrap estimate of the standard error. The resulting interval is symmetric about the R-squared value from our original regression. The 95% confidence interval using this method is (0.432787, 0.605013). We have also calculated the bias in our original value of R-squared as the difference between that value and the mean of the 500 R-squared values in our bootstrap sample.
* creating confidence interval, normal distribution theory method;
* using the t-distribution;
%let alphalev = .05;

ods listing;
proc sql;
  select &r2bar as r2,
         mean(r2) - &r2bar as bias,
         std(r2) as std_err,
         &r2bar - tinv(1-&alphalev/2, &rep-1)*std(r2) as lb,
         &r2bar + tinv(1-&alphalev/2, &rep-1)*std(r2) as hb
  from t1;
quit;

      r2        bias     std_err          lb          hb
  0.5189    0.006616    0.043829    0.432787    0.605013
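As a quick check on this calculation, the critical value tinv(.975, 499) is about 1.965, so the interval is roughly 0.5189 ± 1.965 × 0.043829 = 0.5189 ± 0.0861, or about (0.4328, 0.6050), matching the lb and hb values above.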
Method 2: Percentile Confidence Interval
Another way to generate a bootstrap 95% confidence interval from the sample of 500 R-squared values is to look at the 2.5th and 97.5th percentiles of this distribution. This approach has some advantages over the normal approximation used above: the interval is not forced to be symmetric about the original R-squared estimate, and the method is unaffected by monotonic transformations of the estimated statistic. The first advantage is relevant because our original estimate is subject to bias. The second advantage matters less in this example than it would when the estimate is subject to a transformation; in that case, the bootstrap estimates that form the bounds of the interval can simply be transformed in the same way to create a bootstrap interval for the transformed estimate.
We can easily generate a percentile confidence interval in SAS using proc univariate after creating some macro variables for the percentiles of interest and using them in the output statement. We can see that the confidence interval from this method is (0.436, 0.6017). Since we have put the information of interest into a new dataset, pmethod, we have omitted the standard output from the proc univariate.
%let alphalev = .05;
%let a1 = %sysevalf(&alphalev/2*100);
%let a2 = %sysevalf((1 - &alphalev/2)*100);

* creating confidence interval, percentile method;
proc univariate data = t1 alpha = .05;
  var r2;
  output out = pmethod mean = r2hat
         pctlpts = &a1 &a2 pctlpre = p pctlname = _lb _ub;
run;

<... output omitted ...>

data t2;
  set pmethod;
  bias = r2hat - &r2bar;
  r2 = &r2bar;
run;

ods listing;
proc print data = t2;
  var r2 bias p_lb p_ub;
run;

Obs        r2        bias     p_lb      p_ub
  1    0.5189    .0066164    0.436    0.6017
Method 3: Bias-Corrected Confidence Interval
We can also correct for bias in calculating our confidence interval. In the previous methods, we calculated bias as the difference between the R-squared we observed in our initial regression and the mean of the 500 R-squared values from the bootstrap samples. The R-squared estimate from the initial regression is assumed to be an unbiased estimate of the true R-squared. If we wish to correct for this bias when calculating our confidence interval, we can go through the steps below, which are described by Cameron and Trivedi in Microeconometrics Using Stata.
We first calculate the proportion of the bootstrap R-squared values that are less than or equal to our original value. We will adjust the percentiles used to define our confidence interval based on how this proportion differs from 0.5. We then find the probit of this proportion (z0) and the probit of the proportion associated with our alpha level (zalpha). Next, we calculate from these values the two percentiles that will be used to find our confidence interval, p1 and p2. We then calculate our interval with proc univariate. From this method, our interval is (0.40575, 0.5936).
%let alphalev = .05;
%let alpha1 = %sysevalf(1 - &alphalev/2);
%put &alpha1;

proc sql;
  select sum(r2 <= &r2bar)/count(r2) into :z0bar
  from t1;
quit;

0.44

data _null_;
  z0 = probit(&z0bar);
  zalpha = probit(&alpha1);
  p1 = put(probnorm(2*z0 - zalpha)*100, 3.0);
  p2 = put(probnorm(2*z0 + zalpha)*100, 3.0);
  output;
  call symput('a1', p1);
  call symput('a2', p2);
run;

* creating confidence interval, bias-corrected method;
proc univariate data = t1 alpha = .05;
  var r2;
  output out = pmethod mean = r2hat
         pctlpts = &a1 &a2 pctlpre = p pctlname = _lb _ub;
run;

<... output omitted ...>

data t2;
  set pmethod;
  bias = r2hat - &r2bar;
  r2 = &r2bar;
run;

ods listing;
proc print data = t2;
  var r2 bias p_lb p_ub;
run;

Obs        r2        bias       p_lb      p_ub
  1    0.5189    .0066164    0.40575    0.5936
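To see what the bias correction did with these numbers: only 44% of the bootstrap R-squared values fell at or below the original estimate, so z0 = probit(0.44) is about -0.15 rather than 0. With zalpha = probit(0.975) of about 1.96, the adjusted percentiles work out to p1 = probnorm(2(-0.15) - 1.96)*100, roughly the 1st percentile, and p2 = probnorm(2(-0.15) + 1.96)*100, roughly the 95th percentile. The interval is therefore built from lower percentiles of the bootstrap distribution than the (2.5, 97.5) pair used in Method 2, which shifts it downward to account for the bootstrap distribution sitting slightly above the original estimate.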
References

Cameron, A. C., and Trivedi, P. K. Microeconometrics Using Stata. College Station, TX: Stata Press.