Bootstrapping allows for the estimation of statistics through the repeated resampling of data. On this page, we will demonstrate several methods of bootstrapping a confidence interval about an R-squared statistic in SAS. We will be using the hsb2 dataset. We will begin by running an OLS regression, predicting read with female, math, write, and ses, and saving the R-squared value in a dataset called t0. The R-squared value in this regression is 0.5189.
ods output FitStatistics = t0;
proc reg data = hsb2;
  model read = female math write ses;
run;
quit;

                             The REG Procedure
                               Model: MODEL1
                   Dependent Variable: read reading score

                   Number of Observations Read         200
                   Number of Observations Used         200

                            Analysis of Variance

                                    Sum of         Mean
Source                   DF        Squares       Square    F Value    Pr > F
Model                     4          10855   2713.73294      52.58    <.0001
Error                   195          10064     51.61276
Corrected Total         199          20919

Root MSE            7.18420    R-Square    0.5189
Dependent Mean     52.23000    Adj R-Sq    0.5090
Coeff Var          13.75493

                            Parameter Estimates

                                  Parameter     Standard
Variable    Label           DF     Estimate        Error   t Value   Pr > |t|
Intercept   Intercept        1      6.83342      3.27937      2.08     0.0385
female                       1     -2.45017      1.10152     -2.22     0.0273
math        math score       1      0.45656      0.07211      6.33     <.0001
write       writing score    1      0.37936      0.07327      5.18     <.0001
ses                          1      1.30198      0.74007      1.76     0.0801

* store the estimated r-square;
data _null_;
  set t0;
  if label2 = "R-Square" then call symput('r2bar', cvalue2);
run;
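For readers who want to check the R-squared arithmetic outside of SAS, here is a minimal sketch in Python using numpy. The data are synthetic stand-ins (the real hsb2 file is not reproduced here), so the resulting R-squared will not match the 0.5189 above; the point is only the computation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for hsb2: intercept, female, math, write, ses columns.
X = np.column_stack([np.ones(n),
                     rng.integers(0, 2, n),
                     rng.normal(52, 9, n),
                     rng.normal(53, 9, n),
                     rng.integers(1, 4, n)])
# Generate an outcome from assumed coefficients plus noise.
y = X @ np.array([7.0, -2.5, 0.46, 0.38, 1.3]) + rng.normal(0, 7, n)

# OLS fit and R-squared = 1 - SSE/SST.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(round(r2, 4))
```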
To bootstrap a confidence interval about this R-squared value, we will first need to resample. This step involves sampling with replacement from our original dataset to generate a new dataset the same size as our original dataset. For each of these samples, we will be running the same regression as above and saving the R-squared value. proc surveyselect allows us to do this resampling in one step.
Before carrying out this step, let's outline the assumptions we are making about our data when we use this method. We are assuming that the observations in our dataset are independent. We are also assuming that the statistic we are estimating is asymptotically normally distributed.
We indicate an output dataset, a seed, a sampling method, and the number of replicates. The sampling method indicated, urs, is unrestricted random sampling, or sampling with replacement. The samprate option indicates how large each sample should be relative to the input dataset: a samprate of 1 means that each sampled dataset should be the same size as the input dataset. So in this example, we will generate 500 datasets of 200 observations each, and our output dataset bootsample will have 100,000 observations.
%let rep = 500;
proc surveyselect data = hsb2 out = bootsample seed = 1347
  method = urs samprate = 1 outhits rep = &rep;
run;
ods listing close;

              The SURVEYSELECT Procedure

Selection Method    Unrestricted Random Sampling

Input Data Set                          HSB2
Random Number Seed                      1347
Sampling Rate                              1
Sample Size                              200
Expected Number of Hits                    1
Sampling Weight                            1
Number of Replicates                     500
Total Sample Size                     100000
Output Data Set                   BOOTSAMPLE
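The resampling step itself is easy to express in any language. Here is a minimal Python sketch with a hypothetical bootstrap_samples helper standing in for proc surveyselect under the same settings (500 replicates, each the size of the input data, sampled with replacement):

```python
import random

def bootstrap_samples(data, reps, seed=1347):
    """Draw `reps` resamples, each the size of `data`, with replacement.

    This mirrors proc surveyselect with method=urs, samprate=1, outhits.
    """
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(reps)]

# 200 placeholder observations stand in for the rows of hsb2.
samples = bootstrap_samples(list(range(200)), reps=500)
print(len(samples), len(samples[0]))  # 500 replicates of 200 rows each
```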
With this dataset, we will now run our regression model, specifying by replicate so that the model will be run separately for each of the 500 sample datasets. After that, we use a data step to convert the R-squared values to numeric.
ods output FitStatistics = t (where = (label2 = "R-Square"));
proc reg data = bootsample;
  by replicate;
  model read = female math write ses;
run;
quit;
* converting character type to numeric type;
data t1;
  set t;
  r2 = cvalue2 + 0;
run;
Method 1: Normal Distribution Confidence Interval
We will first create a confidence interval using normal distribution theory. This assumes that the R-squared values follow a t distribution, so we can generate a 95% confidence interval centered on our original R-squared estimate using critical values from a t distribution with 499 degrees of freedom. We find the critical t values for our confidence interval and multiply these by the standard deviation of the R-squared values that arose in our 500 replications. Our confidence interval using this method is symmetric about the R-squared value we saw in our original regression. We can see that the 95% confidence interval using this method is (0.432787, 0.605013). We have also calculated the bias in our original value of R-squared as the difference between that value and the mean of the 500 R-squareds in our bootstrap sample.
* creating confidence interval, normal distribution theory method;
* using the t-distribution;
%let alphalev = .05;
ods listing;
proc sql;
  select &r2bar as r2,
         mean(r2) - &r2bar as bias,
         std(r2) as std_err,
         &r2bar - tinv(1-&alphalev/2, &rep-1)*std(r2) as lb,
         &r2bar + tinv(1-&alphalev/2, &rep-1)*std(r2) as hb
  from t1;
quit;

      r2       bias    std_err         lb         hb
  0.5189   0.006616   0.043829   0.432787   0.605013
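The same arithmetic can be checked in plain Python. This sketch uses simulated placeholder values for the 500 bootstrap R-squareds (in practice they come from the replicate regressions, so the numbers below will not match the table above), and it approximates the t critical value for 499 degrees of freedom with the normal quantile:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
r2bar = 0.5189                       # R-squared from the original regression
# Placeholder bootstrap R-squareds; in practice these come from the 500 replicate fits.
r2s = [random.gauss(0.5255, 0.044) for _ in range(500)]

alphalev = 0.05
# The t quantile with 499 df is very close to the normal quantile; used as an approximation.
crit = NormalDist().inv_cdf(1 - alphalev / 2)
bias = mean(r2s) - r2bar
se = stdev(r2s)                      # sample standard deviation, like std() in proc sql
lb, ub = r2bar - crit * se, r2bar + crit * se
print(round(bias, 4), round(lb, 3), round(ub, 3))
```

By construction the interval is symmetric about the original estimate, which is the property noted in the text above.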
Method 2: Percentile Confidence Interval
Another way to generate a bootstrap 95% confidence interval from the sample of 500 R-squared values is to look at the 2.5th and 97.5th percentiles of this distribution. This approach has some advantages over the normal approximation used above: the interval is not forced to be symmetric about the original estimate of R-squared, and the method is unaffected by monotonic transformations of the estimated statistic. The first advantage is relevant because our original estimate is subject to bias. The second advantage is less relevant in this example than in an instance where the estimate might be subject to a transformation: the bootstrap estimates that form the bounds of the interval can be transformed in the same way to create the bootstrap interval of the transformed estimate.
We can easily generate a percentile confidence interval in SAS using proc univariate after creating some macro variables for the percentiles of interest and using them in the output statement. We can see that the confidence interval from this method is (0.436, 0.6017). Since we have put the information of interest into a new dataset, pmethod, we have omitted the standard output from the proc univariate.
%let alphalev = .05;
%let a1 = %sysevalf(&alphalev/2*100);
%let a2 = %sysevalf((1 - &alphalev/2)*100);

* creating confidence interval, percentile method;
proc univariate data = t1 alpha = .05;
  var r2;
  output out = pmethod mean = r2hat
    pctlpts = &a1 &a2 pctlpre = p pctlname = _lb _ub;
run;

<... output omitted ...>

data t2;
  set pmethod;
  bias = r2hat - &r2bar;
  r2 = &r2bar;
run;

ods listing;
proc print data = t2;
  var r2 bias p_lb p_ub;
run;

Obs        r2        bias      p_lb      p_ub
  1    0.5189    .0066164     0.436    0.6017
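A percentile interval is just a pair of order statistics, so it is simple to sketch in Python. With placeholder bootstrap values again, the stdlib statistics.quantiles with n=40 conveniently yields cut points at 2.5%, 5%, ..., 97.5%, so the first and last cut points are the bounds we want (note its interpolation rule differs slightly from proc univariate's default percentile definition):

```python
import random
from statistics import quantiles

random.seed(1)
# Placeholder bootstrap R-squareds; in practice these are the 500 replicate values.
r2s = [random.gauss(0.5255, 0.044) for _ in range(500)]

cuts = quantiles(r2s, n=40)   # 39 cut points at 2.5%, 5%, ..., 97.5%
lb, ub = cuts[0], cuts[-1]    # 2.5th and 97.5th percentiles
print(round(lb, 3), round(ub, 3))
```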
Method 3: Bias-Corrected Confidence Interval
We can also correct for bias in calculating our confidence interval. We have calculated bias in the previous method as the difference between the R-squared we observed in our initial regression and the mean of the 500 R-squared values from the bootstrap samples. In the methods above, the R-squared estimate from the initial regression was treated as an unbiased estimate of the true R-squared. If we wish to correct for this bias in calculating our confidence interval, we can go through the steps below. These are described by Cameron and Trivedi in Microeconometrics Using Stata.
We first calculate the proportion of the bootstrap R-squareds that are less than or equal to our original value. We will adjust the percentiles used to define our confidence interval based on how this proportion differs from 0.5. We then find the probit of this proportion (z0) and the probit of 1 - alphalev/2 (zalpha). Next, we calculate the two percentiles that will be used to find our confidence interval, p1 and p2, from these values. We then calculate our interval with proc univariate. From this method, our interval is (0.40575, 0.5936).
%let alphalev = .05;
%let alpha1 = %sysevalf(1 - &alphalev/2);
%put &alpha1;

proc sql;
  select sum(r2<=&r2bar)/count(r2) into :z0bar
  from t1;
quit;

0.44

data _null_;
  z0 = probit(&z0bar);
  zalpha = probit(&alpha1);
  p1 = put(probnorm(2*z0 - zalpha)*100, 3.0);
  p2 = put(probnorm(2*z0 + zalpha)*100, 3.0);
  output;
  call symput('a1', p1);
  call symput('a2', p2);
run;

* creating confidence interval, bias-corrected method;
proc univariate data = t1 alpha = .05;
  var r2;
  output out = pmethod mean = r2hat
    pctlpts = &a1 &a2 pctlpre = p pctlname = _lb _ub;
run;

<... output omitted ...>

data t2;
  set pmethod;
  bias = r2hat - &r2bar;
  r2 = &r2bar;
run;

ods listing;
proc print data = t2;
  var r2 bias p_lb p_ub;
run;

Obs        r2        bias       p_lb      p_ub
  1    0.5189    .0066164    0.40575    0.5936
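The percentile adjustment in the data step above can be mirrored in Python with the stdlib NormalDist, whose inv_cdf plays the role of probit and cdf the role of probnorm. The bootstrap values are again simulated placeholders, and the cut_at helper is a hypothetical convenience for looking up an arbitrary percentile:

```python
import random
from statistics import NormalDist, quantiles

random.seed(1)
r2bar = 0.5189                       # R-squared from the original regression
# Placeholder bootstrap R-squareds; in practice these are the 500 replicate values.
r2s = [random.gauss(0.5255, 0.044) for _ in range(500)]

nd = NormalDist()
alphalev = 0.05
# Probit of the proportion of bootstrap values at or below the original estimate.
z0 = nd.inv_cdf(sum(r <= r2bar for r in r2s) / len(r2s))
zalpha = nd.inv_cdf(1 - alphalev / 2)
p1 = nd.cdf(2 * z0 - zalpha) * 100   # adjusted lower percentile
p2 = nd.cdf(2 * z0 + zalpha) * 100   # adjusted upper percentile

cuts = quantiles(r2s, n=1000)        # cut points at 0.1%, 0.2%, ..., 99.9%
def cut_at(pct):
    """Nearest 0.1% cut point, clamped to the valid index range."""
    return cuts[min(len(cuts) - 1, max(0, round(pct * 10) - 1))]

lb, ub = cut_at(p1), cut_at(p2)
print(round(p1, 1), round(p2, 1), round(lb, 3), round(ub, 3))
```

When the proportion below the original estimate is under 0.5 (as in the 0.44 seen above), both adjusted percentiles shift downward, which is why the bias-corrected interval sits below the plain percentile interval.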