Bootstrapping allows for estimation of statistics through the repeated
resampling of data. In this page, we will demonstrate several methods of
bootstrapping a confidence interval about an R-squared statistic in SAS using the **hsb2** dataset.
We will begin by running an OLS regression, predicting **read**
with **female**, **math**, **write**, and **ses**, and saving the
R-squared value in a dataset called **t0**. The R-squared value in this
regression is 0.5189.

```
ods output FitStatistics = t0;
proc reg data = hsb2;
  model read = female math write ses;
run;
quit;
```

```
The REG Procedure
Model: MODEL1
Dependent Variable: read reading score

Number of Observations Read         200
Number of Observations Used         200

                        Analysis of Variance

                                 Sum of         Mean
Source                DF        Squares       Square   F Value   Pr > F
Model                  4          10855   2713.73294     52.58   <.0001
Error                195          10064     51.61276
Corrected Total      199          20919

Root MSE            7.18420    R-Square    0.5189
Dependent Mean     52.23000    Adj R-Sq    0.5090
Coeff Var          13.75493

                          Parameter Estimates

                                 Parameter    Standard
Variable    Label           DF    Estimate       Error   t Value   Pr > |t|
Intercept   Intercept        1     6.83342     3.27937      2.08     0.0385
female                       1    -2.45017     1.10152     -2.22     0.0273
math        math score       1     0.45656     0.07211      6.33     <.0001
write       writing score    1     0.37936     0.07327      5.18     <.0001
ses                          1     1.30198     0.74007      1.76     0.0801
```

```
* store the estimated r-square;
data _null_;
  set t0;
  if label2 = "R-Square" then call symput('r2bar', cvalue2);
run;
```

To bootstrap a confidence interval about this R-squared value, we will first
need to resample. This step involves sampling with replacement from our
original dataset to generate a new dataset the same size as our original
dataset. For each of these samples, we will be running the same regression
as above and saving the R-squared value. **proc** **surveyselect**
allows us to do this resampling in one step.

Before carrying out this step, let's outline the assumptions we are making about our data when we use this method. We are assuming that the observations in our dataset are independent. We are also assuming that the statistic we are estimating is asymptotically normally distributed.

We indicate an output
dataset, a seed, a sampling method, and the number of replicates. The
sampling method indicated, **urs**, is unrestricted random sampling, or
sampling with replacement. The **samprate** indicates how large each
sample should be relative to the input dataset. A **samprate** of 1
means that the sampled datasets should be of the same size as the input dataset.
So in this example, we will generate 500 datasets of 200 observations each, so our output dataset
**bootsample** will have 100,000 observations.

```
%let rep = 500;
proc surveyselect data = hsb2 out = bootsample seed = 1347
  method = urs samprate = 1 outhits rep = &rep;
run;

ods listing close;
```

```
The SURVEYSELECT Procedure

Selection Method    Unrestricted Random Sampling

Input Data Set                  HSB2
Random Number Seed              1347
Sampling Rate                      1
Sample Size                      200
Expected Number of Hits            1
Sampling Weight                    1
Number of Replicates             500
Total Sample Size             100000
Output Data Set           BOOTSAMPLE
```
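Conceptually, the resampling that **proc surveyselect** performs here can be sketched in a few lines of Python. This is a hypothetical illustration of unrestricted random sampling, not part of the SAS workflow; the stand-in data and variable names are assumptions.

```python
import random

random.seed(1347)  # analogous to the seed= option in proc surveyselect

data = list(range(200))  # stand-in for the 200 observations in hsb2
n_reps = 500             # analogous to rep = 500

# Unrestricted random sampling (method = urs, samprate = 1): each
# replicate is a sample of size len(data), drawn with replacement.
bootsample = [random.choices(data, k=len(data)) for _ in range(n_reps)]

total_rows = sum(len(rep) for rep in bootsample)
print(total_rows)  # 500 replicates x 200 observations = 100000 rows
```

Because sampling is with replacement, some observations appear several times within a replicate and others not at all, which is what makes each replicate a distinct dataset.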

With this dataset, we will now run our regression model, specifying **by
replicate** so that the model will be run separately for each of the 500
sample datasets. After that, we use a data step to convert the R-squared values
to numeric.

```
ods output FitStatistics = t (where = (label2 = "R-Square"));
proc reg data = bootsample;
  by replicate;
  model read = female math write ses;
run;
quit;
```

```
* converting character type to numeric type;
data t1;
  set t;
  r2 = cvalue2 + 0;
run;
```

## Method 1: Normal Distribution Confidence Interval

We will first create a confidence interval using normal distribution theory. This approach assumes that the bootstrap R-squared values follow a t distribution, so we can generate a 95% confidence interval about our original R-squared value based on quantiles from a t-distribution with 499 degrees of freedom. We find the critical t values for our confidence interval and multiply them by the standard deviation of the R-squared values from our 500 replications. The interval from this method is symmetric about the R-squared value from our original regression, and here it is (0.432787, 0.605013). We have also calculated the bias in our original value of R-squared as the difference between that value and the mean of the 500 R-squareds in our bootstrap sample.

```
* creating confidence interval, normal distribution theory method;
* using the t-distribution;
%let alphalev = .05;
ods listing;
proc sql;
  select &r2bar as r2,
         mean(r2) - &r2bar as bias,
         std(r2) as std_err,
         &r2bar - tinv(1-&alphalev/2, &rep-1)*std(r2) as lb,
         &r2bar + tinv(1-&alphalev/2, &rep-1)*std(r2) as hb
  from t1;
quit;
```

```
      r2      bias   std_err        lb        hb
  0.5189  0.006616  0.043829  0.432787  0.605013
```
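The arithmetic behind this interval can be sketched in Python. This is a hypothetical illustration, not part of the SAS workflow: the ten bootstrap values below stand in for the 500 R-squareds in dataset **t1**, and the standard normal quantile from `NormalDist` is used as a close approximation to SAS's `tinv()` with 499 degrees of freedom.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical bootstrap R-squared values; in the SAS example these
# come from the 500 replicates stored in dataset t1.
boot_r2 = [0.48, 0.51, 0.55, 0.50, 0.53, 0.47, 0.54, 0.52, 0.49, 0.56]

r2bar = 0.5189                 # R-squared from the original regression
bias = mean(boot_r2) - r2bar   # mean of bootstrap values minus original
se = stdev(boot_r2)            # bootstrap standard error

# With 499 df the t critical value (~1.965) is very close to the
# standard normal quantile, so NormalDist stands in for tinv() here.
crit = NormalDist().inv_cdf(1 - 0.05 / 2)

lb = r2bar - crit * se  # lower bound, symmetric about r2bar
hb = r2bar + crit * se  # upper bound
print(round(lb, 4), round(hb, 4))
```

Note that the interval is centered on the original estimate, not on the bootstrap mean, which is why the bias is reported separately rather than folded into the bounds.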

## Method 2: Percentile Confidence Interval

Another way to generate a bootstrap 95% confidence interval from the sample of 500 R-squared values is to look at the 2.5th and 97.5th percentiles of this distribution. This approach has some advantages over the normal approximation used above: the interval is not forced to be symmetric about the original estimate of R-squared, and the method is unaffected by monotonic transformations of the estimated statistic. The first advantage is relevant because our original estimate is subject to bias. The second advantage is less relevant in this example than in an instance where the estimate might be subject to a transformation, since the bootstrap estimates that form the bounds of the interval can simply be transformed in the same way to create the bootstrap interval of the transformed estimate.

We can easily generate a percentile confidence interval in SAS using **proc** **univariate**
after creating some macro variables for the percentiles of interest and using
them in the **output** statement. The confidence interval from this method is (0.436, 0.6017).
Since we have put the information of interest into a new dataset, **pmethod**,
we have omitted the standard output from **proc univariate**.

```
%let alphalev = .05;
%let a1 = %sysevalf(&alphalev/2*100);
%let a2 = %sysevalf((1 - &alphalev/2)*100);

* creating confidence interval, percentile method;
proc univariate data = t1 alpha = .05;
  var r2;
  output out = pmethod mean = r2hat
    pctlpts = &a1 &a2 pctlpre = p pctlname = _lb _ub;
run;
```

<...output omitted...>

```
data t2;
  set pmethod;
  bias = r2hat - &r2bar;
  r2 = &r2bar;
run;

ods listing;
proc print data = t2;
  var r2 bias p_lb p_ub;
run;
```

```
Obs      r2        bias     p_lb     p_ub
  1   0.5189   .0066164   0.436   0.6017
```
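The percentile method itself amounts to sorting the bootstrap estimates and reading off the empirical 2.5th and 97.5th percentiles. Here is a hypothetical Python sketch of that idea; the evenly spaced values below are a stand-in for the 500 bootstrap R-squareds in dataset **t1**, and the simple index-based percentile rule is one of several conventions (SAS's **proc univariate** uses its own default definition).

```python
# Stand-in for the 500 bootstrap R-squared values, already sorted.
boot_r2 = sorted(0.40 + 0.001 * i for i in range(500))

alpha = 0.05
lo_idx = int(len(boot_r2) * alpha / 2)            # 2.5th percentile position
hi_idx = int(len(boot_r2) * (1 - alpha / 2)) - 1  # 97.5th percentile position

p_lb = boot_r2[lo_idx]  # lower bound of the percentile interval
p_ub = boot_r2[hi_idx]  # upper bound of the percentile interval
print(p_lb, p_ub)
```

Because the bounds are order statistics of the bootstrap distribution, nothing in this construction requires the distribution to be symmetric or normal.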

## Method 3: Bias-Corrected Confidence Interval

We can also correct for bias in calculating our confidence interval. We have
calculated bias in the previous method as the difference between the R-squared
we observed in our initial regression and the mean of the 500 R-squared values
from the bootstrap samples. The R-squared estimate from the initial
regression is not assumed to be an unbiased
estimate of the true R-squared. If we wish to correct for this bias in calculating our confidence interval, we can go through the
steps below. These are described by Cameron and Trivedi in
*Microeconometrics Using Stata*.

We first calculate the proportion of the bootstrap R-squareds that are less
than our original value. We will adjust the percentiles used to define our
confidence interval based on how this proportion differs from 0.5. We then find the probit of this proportion (**z0**)
and the probit of 1 - alpha/2 (**zalpha**). Next, we
calculate the two percentiles that will be used to find our confidence interval,
**p1** and **p2**, from these values. We then calculate our interval
with **proc univariate**. From this method, our interval is (0.40575, 0.5936).

```
%let alphalev = .05;
%let alpha1 = %sysevalf(1 - &alphalev/2);
%put &alpha1;

proc sql;
  select sum(r2 <= &r2bar)/count(r2) into :z0bar
  from t1;
quit;
```

```
0.44
```

```
data _null_;
  z0 = probit(&z0bar);
  zalpha = probit(&alpha1);
  p1 = put(probnorm(2*z0 - zalpha)*100, 3.0);
  p2 = put(probnorm(2*z0 + zalpha)*100, 3.0);
  output;
  call symput('a1', p1);
  call symput('a2', p2);
run;

* creating confidence interval, bias-corrected method;
proc univariate data = t1 alpha = .05;
  var r2;
  output out = pmethod mean = r2hat
    pctlpts = &a1 &a2 pctlpre = p pctlname = _lb _ub;
run;
```

<...output omitted...>

```
data t2;
  set pmethod;
  bias = r2hat - &r2bar;
  r2 = &r2bar;
run;

ods listing;
proc print data = t2;
  var r2 bias p_lb p_ub;
run;
```

```
Obs      r2        bias      p_lb     p_ub
  1   0.5189   .0066164   0.40575   0.5936
```
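The percentile-adjustment step can be sketched in Python using the standard normal CDF and its inverse, which play the roles of SAS's `probnorm()` and `probit()`. This is a hypothetical illustration using the proportion 0.44 reported by **proc sql** above.

```python
from statistics import NormalDist

nd = NormalDist()

# Proportion of bootstrap R-squareds at or below the original estimate,
# as reported by proc sql in the SAS example.
prop = 0.44
alpha = 0.05

z0 = nd.inv_cdf(prop)              # SAS: probit(&z0bar)
zalpha = nd.inv_cdf(1 - alpha / 2) # SAS: probit(&alpha1)

# Adjusted percentiles for the bias-corrected interval; when prop < 0.5
# both percentiles shift downward, pulling the interval toward lower values.
p1 = nd.cdf(2 * z0 - zalpha) * 100  # lower percentile
p2 = nd.cdf(2 * z0 + zalpha) * 100  # upper percentile
print(round(p1), round(p2))
```

With 44% of the bootstrap values falling at or below the original estimate, the adjusted percentiles land near the 1st and 95th rather than the 2.5th and 97.5th, which is why this interval sits lower than the plain percentile interval.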

## References

Cameron, A. C. and Trivedi, P. K. *Microeconometrics Using Stata*. Stata Press: College Station, 2009.

Efron, B. and Tibshirani, R. J. *An Introduction to the Bootstrap*. Chapman and Hall: Boca Raton, 1998.

Cassell, D. L. *Don't Be Loopy: Re-Sampling and Simulation the SAS Way*. SAS Global Forum, 2007.