Chapter Outline
4.1 Robust Regression Methods
4.1.1 Regression with Robust Standard Errors
4.1.2 Using Proc Genmod for Clustered Data
4.1.3 Robust Regression
4.1.4 Quantile Regression
4.2 Constrained Linear Regression
4.3 Regression with Censored or Truncated Data
4.3.1 Regression with Censored Data
4.3.2 Regression with Truncated Data
4.4 Regression with Measurement Error
4.5 Multiple Equation Regression Models
4.5.1 Seemingly Unrelated Regression
4.5.2 Multivariate Regression
4.6 Summary
In this chapter we will explore a number of procedures that go beyond OLS. This chapter is a bit different from the others in that it covers a number of different concepts, some of which may be new to you. These extensions, beyond OLS, have much of the look and feel of OLS but will provide you with additional tools to work with linear models.
The topics will include robust regression methods, constrained linear regression, regression with censored and truncated data, regression with measurement error, and multiple equation models.
4.1 Robust Regression Methods
It seems to be a rare dataset that meets all of the assumptions underlying multiple regression. We know that failure to meet assumptions can lead to biased estimates of coefficients and especially biased estimates of the standard errors. This fact explains a lot of the activity in the development of robust regression methods.
The idea behind robust regression methods is to make adjustments in the estimates that take into account some of the flaws in the data itself. We are going to look at four robust methods: regression with robust standard errors, regression with clustered data, robust regression, and quantile regression.
Before we look at these approaches, let’s look at a standard OLS regression using the elementary school academic performance index (elemapi2) dataset. We will look at a model that predicts the year 2000 api score (api00) using the average class size in K through 3 (acs_k3), the average class size in grades 4 through 6 (acs_46), the percent of fully credentialed teachers (full), and the size of the school (enroll). First let’s look at the descriptive statistics for these variables. Note that acs_k3 and acs_46 have some missing values.
proc means data = "c:\sasreg\elemapi2" mean std max min;
  var api00 acs_k3 acs_46 full enroll;
run;
The MEANS Procedure

Variable            Mean         Std Dev         Minimum         Maximum
--------------------------------------------------------------------------
api00        647.6225000     142.2489610     369.0000000     940.0000000
acs_k3        19.1608040       1.3686933      14.0000000      25.0000000
acs_46        29.6851385       3.8407840      20.0000000      50.0000000
full          84.5500000      14.9497907      37.0000000     100.0000000
enroll       483.4650000     226.4483847     130.0000000         1570.00
--------------------------------------------------------------------------
Below we see the regression predicting api00 from acs_k3 acs_46 full and enroll. We see that all of the variables are significant except for acs_k3.
proc reg data = "c:\sasreg\elemapi2";
  model api00 = acs_k3 acs_46 full enroll;
run;
The REG Procedure Model: MODEL1 Dependent Variable: api00 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 4 3071909 767977 61.01 <.0001 Error 390 4909501 12588 Corrected Total 394 7981410 Root MSE 112.19832 R-Square 0.3849 Dependent Mean 648.65063 Adj R-Sq 0.3786 Coeff Var 17.29719 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 -5.20041 84.95492 -0.06 0.9512 acs_k3 1 6.95438 4.37110 1.59 0.1124 acs_46 1 5.96601 1.53105 3.90 0.0001 full 1 4.66822 0.41425 11.27 <.0001 enroll 1 -0.10599 0.02695 -3.93 <.0001
Since the regression procedure is interactive and we haven’t issued the quit command, we can test both of the class size variables, and we find the overall test of these two variables is significant.
test acs_k3 = acs_46 = 0; run;
Test 1 Results for Dependent Variable api00

                             Mean
Source            DF       Square    F Value    Pr > F
Numerator          2       139437      11.08    <.0001
Denominator      390        12588
Here is the residual versus fitted plot for this regression. Notice that the pattern of the residuals is not exactly as we would hope. The spread of the residuals is somewhat wider toward the middle right of the graph than at the left, where the variability of the residuals is somewhat smaller, suggesting some heteroscedasticity.
plot r.*p.; run;
Here is the index plot of Cook’s D for this regression. We see 4 points that are somewhat high in both their leverage and their residuals.
plot cookd.*obs.; run;
None of these results are dramatic problems, but the plot of residual vs. predicted value suggests that there might be some outliers and some possible heteroscedasticity and the index plot of Cook’s D shows some points in the upper right quadrant that could be influential. We might wish to use something other than OLS regression to estimate this model. In the next several sections we will look at some robust regression methods.
4.1.1 Regression with Robust Standard Errors
SAS proc reg includes an option called acov on the model statement for estimating the asymptotic (heteroscedasticity-consistent) covariance matrix of the estimates. The standard errors obtained from this matrix are considered more robust and can deal with a collection of minor concerns about failure to meet assumptions, such as minor departures from normality, heteroscedasticity, or some observations that exhibit large residuals, leverage, or influence.
With the acov option, the point estimates of the coefficients are exactly the same as in ordinary OLS, but the standard errors are calculated from the asymptotic covariance matrix. Here is the same regression as above using the acov option. We use the SAS ODS (Output Delivery System) to capture the parameter estimates along with the asymptotic covariance matrix, compute the robust standard errors in a data step, merge them with the parameter estimates using proc sql, and create the corresponding t-values and probabilities. Note the changes in the standard errors and t-tests (but no change in the coefficients). In this particular example, using robust standard errors did not change any of the conclusions from the original OLS regression. Note also that the robust standard errors include a sample size correction.
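The sample size correction used in the data step below multiplies each diagonal element of the asymptotic covariance matrix by n/(n-k) before the square root is taken, where n is the number of observations used in the regression (395) and k is the number of estimated parameters (5):

robust std err = sqrt( (n/(n-k)) * acov diagonal element ) = sqrt( (395/390) * acov diagonal element ).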
proc reg data = "c:\sasreg\elemapi2";
  model api00 = acs_k3 acs_46 full enroll / acov;
  ods output ACovEst = estcov;
  ods output ParameterEstimates = pest;
run;
quit;

data temp_dm;
  set estcov;
  drop model dependent;
  array a(5) intercept acs_k3 acs_46 full enroll;
  array b(5) std1-std5;
  b(_n_) = sqrt((395/390)*a(_n_));
  std = max(of std1-std5);
  keep variable std;
run;

proc sql;
  select pest.variable, estimate, stderr, tvalue, probt,
         std as robust_stderr,
         estimate/robust_stderr as tvalue_rb,
         (1 - probt(abs(estimate/robust_stderr), 394))*2 as probt_rb
  from pest, temp_dm
  where pest.variable = temp_dm.variable;
quit;
                                                           robust_
Variable     Estimate      StdErr    tValue     Probt      stderr    tvalue_rb    probt_rb
-------------------------------------------------------------------------------------------
Intercept    -5.20041    84.95492     -0.06    0.9512    86.66308     -0.06001     0.95218
acs_k3        6.95438     4.37110      1.59    0.1124    4.620599     1.505082    0.133104
acs_46        5.96601     1.53105      3.90    0.0001    1.573214     3.792246    0.000173
full          4.66822     0.41425     11.27    <.0001    0.414681     11.25737           0
enroll       -0.10599     0.02695     -3.93    <.0001    0.028015     -3.78331    0.000179
4.1.2 Using Proc Genmod for Clustered Data
As described in Chapter 2, OLS regression assumes that the residuals are independent. The elemapi2 dataset contains data on 400 schools that come from 37 school districts. It is very possible that the scores within each school district may not be independent, and this could lead to residuals that are not independent within districts. SAS proc genmod is used to model correlated data. We can use the class statement and the repeated statement to indicate that the observations are clustered into districts (based on dnum) and that the observations may be correlated within districts, but would be independent between districts.
proc genmod data = "c:\sasreg\elemapi2";
  class dnum;
  model api00 = acs_k3 acs_46 full enroll;
  repeated subject = dnum / type = ind;
run;
quit;
The GENMOD Procedure
              Analysis Of GEE Parameter Estimates
               Empirical Standard Error Estimates

                         Standard       95% Confidence
Parameter    Estimate       Error           Limits              Z    Pr > |Z|
Intercept     -5.2004    119.5172    -239.450     229.0490   -0.04      0.9653
acs_k3         6.9544      6.7726     -6.3196      20.2284    1.03      0.3045
acs_46         5.9660      2.4839      1.0976      10.8344    2.40      0.0163
full           4.6682      0.6904      3.3151       6.0213    6.76      <.0001
enroll        -0.1060      0.0421     -0.1886      -0.0234   -2.51      0.0119
As with the regression with robust standard errors, the estimates of the coefficients are the same as the OLS estimates, but the standard errors take into account that the observations within districts may not be independent. Even though the standard errors are larger in this analysis, the three variables that were significant in the OLS analysis remain significant here as well.
We notice that the standard error estimates given here differ from what Stata reports when regress is used with the cluster option. This is because Stata applies an additional finite-sample adjustment. We can do some SAS programming here to reproduce that adjustment. The adjusted variance is simply a constant times the variance obtained from the empirical standard error estimates, where the constant is

(N-1)/(N-k) * M/(M-1),

with N the number of observations, M the number of clusters (districts), and k the number of estimated parameters.
data em;
  set 'c:\sasreg\elemapi2';
run;
proc genmod data = em;
  class dnum;
  model api00 = acs_k3 acs_46 full enroll;
  repeated subject = dnum / type = ind covb;
  ods output geercov = gcov;
  ods output GEEEmpPEst = parms;
run;
quit;
proc sql;
  select count(dnum), count(distinct dnum) into :n, :m from em;
quit;
proc sql;
  select count(prm1) into :k from gcov;
quit;

data gcov_ad;
  set gcov;
  array all(*) _numeric_;
  do i = 1 to dim(all);
    all(i) = all(i)*((&n-1)/(&n-&k))*(&m/(&m-1));
    if i = _n_ then std_ad = sqrt(all(i));
  end;
  keep std_ad;
run;

data all;
  merge parms gcov_ad;
run;
proc print data = all noobs;
run;
Parm         Estimate      Stderr     LowerCL     UpperCL        Z     ProbZ     std_ad
Intercept     -5.2004    119.5172    -239.450    229.0490    -0.04    0.9653    121.778
acs_k3         6.9544      6.7726     -6.3196     20.2284     1.03    0.3045      6.901
acs_46         5.9660      2.4839      1.0976     10.8344     2.40    0.0163      2.531
full           4.6682      0.6904      3.3151      6.0213     6.76    <.0001      0.703
enroll        -0.1060      0.0421     -0.1886     -0.0234    -2.51    0.0119      0.043
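As a quick check of the adjustment: here the macro variables resolve to N = 400, M = 37, and k = 5, so the constant is (399/395)*(37/36), or about 1.038. Each variance is inflated by about 3.8 percent, and each standard error by sqrt(1.038), about 1.9 percent; for example, 119.5172 * 1.0189 is approximately 121.78, which matches std_ad for the intercept.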
4.1.3 Robust Regression
In SAS, we cannot simply run a single proc to perform robust regression using iteratively reweighted least squares. In order to perform a robust regression, we have to write our own macro. To this end, ATS has written a macro called /sas/webbooks/reg/chapter4/robust_hb.sas. This macro first uses Huber weighting and later switches to the biweight; this is why the macro is called robust_hb, where h and b stand for Huber and biweight, respectively. Robust regression assigns a weight to each observation, with higher weights given to better behaved observations. In fact, extremely deviant cases, those with Cook’s D greater than 1, can have their weights set to missing so that they are not included in the analysis at all.
The macro robust_hb.sas uses another macro called /sas/webbooks/reg/chapter4/mad.sas to compute the MAD (median absolute deviation) during the iteration process. We include both macros and then perform the robust regression analysis as shown below. Note that in this analysis both the coefficients and the standard errors differ from the original OLS regression.
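To give a sense of what the macro is doing, here is a minimal sketch of a single Huber-style reweighting step. It is an illustration only, not the ATS macro: it uses the residual standard deviation as a crude scale estimate (the macro uses the MAD instead), uses the usual Huber tuning constant 1.345, and performs only one reweighting pass rather than iterating to convergence and switching to biweights.

/* One Huber-style reweighting step (illustration only, not the ATS macro) */
proc reg data = "c:\sasreg\elemapi2" noprint;
  model api00 = acs_k3 acs_46 full enroll;
  output out = _step_ r = _resid_;          /* raw residuals from the OLS fit */
run;
quit;

proc means data = _step_ noprint;
  var _resid_;
  output out = _scale_ std = _sigma_;       /* crude scale; the macro uses MAD */
run;

data _step_;
  if _n_ = 1 then set _scale_(keep = _sigma_);
  set _step_;
  if _resid_ = . then _w_ = .;              /* missing predictors drop out */
  else do;
    _u_ = _resid_ / _sigma_;                /* standardized residual */
    if abs(_u_) <= 1.345 then _w_ = 1;      /* Huber weight function */
    else _w_ = 1.345 / abs(_u_);
  end;
run;

proc reg data = _step_;
  weight _w_;                               /* refit using the weights */
  model api00 = acs_k3 acs_46 full enroll;
run;
quit;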
%include 'c:\sasreg\mad.sas';
%include 'c:\sasreg\robust_hb.sas';
%robust_hb("c:\sasreg\elemapi2", api00, acs_k3 acs_46 full enroll, .01, 0.00005, 10);
The REG Procedure Model: MODEL1 Dependent Variable: api00 Weight: _w2_ Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 4 3053248 763312 73.19 <.0001 Error 390 4067269 10429 Corrected Total 394 7120516 Root MSE 102.12196 R-Square 0.4288 Dependent Mean 647.38090 Adj R-Sq 0.4229 Coeff Var 15.77463 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 -6.79858 80.45406 -0.08 0.9327 acs_k3 1 6.11031 4.15015 1.47 0.1417 acs_46 1 6.25588 1.45280 4.31 <.0001 full 1 4.79575 0.39212 12.23 <.0001 enroll 1 -0.10924 0.02558 -4.27 <.0001
If you compare the robust regression results (directly above) with the OLS results previously presented, you can see that the coefficients and standard errors are quite similar, and the t values and p values are also quite similar. Despite the minor problems that we found in the data when we performed the OLS analysis, the robust regression analysis yielded quite similar results suggesting that indeed these were minor problems. Had the results been substantially different, we would have wanted to further investigate the reasons why the OLS and robust regression results were different, and among the two results the robust regression results would probably be the more trustworthy.
Let’s look at the predicted (fitted) values (p), the residuals (r), and the leverage (hat) values (h). The macro robust_hb generates a final data set, _tempout_, containing the predicted values, raw residuals, and leverage values together with the original data.
Now, let’s check on the various predicted values and the weighting. First, we will sort by _w2_, the weight generated at the last iteration. Then we will look at the first 10 observations. Notice that the smallest weights are near one-half but quickly get into the .6 range. The first five values are missing due to missing values of the predictors.
proc sort data = _tempout_;
  by _w2_;
run;
proc print data = _tempout_ (obs=10);
  var snum api00 p r h _w2_;
run;
Obs    snum    api00          p           r            h       _w2_
  1     116      513          .           .            .          .
  2    3072      763          .           .            .          .
  3    3055      590          .           .            .          .
  4    4488      521          .           .            .          .
  5    4534      445          .           .            .          .
  6     637      447    733.426    -286.144     0.003657    0.55684
  7    5387      892    611.709     280.464     0.002278    0.57183
  8    2267      897    622.062     275.512     0.009887    0.58481
  9      65      903    631.930     271.731     0.010198    0.59466
 10    3759      585    842.664    -257.473     0.040009    0.63128
Now, let’s sort in descending order of _w2_ and look at the 10 observations with the highest weights. The weights for observations with snum 1678, 4486 and 1885 are all very close to one, since their residuals are fairly small. Note that the observations above with the lowest weights are also those with the largest residuals (residuals over 200 in absolute value), while the observations below with the highest weights have very small residuals (all less than 5 in absolute value).
proc sort data = _tempout_;
  by descending _w2_;
run;
proc print data = _tempout_ (obs=10);
  var snum api00 p r h _w2_;
run;
Obs    snum    api00          p           r            h       _w2_
  1    1678      497    497.065     0.18424     0.023137    1.00000
  2    4486      706    706.221     0.20151     0.013740    1.00000
  3    1885      605    606.066    -0.41421     0.013745    1.00000
  4    3535      705    703.929     1.16143     0.004634    0.99999
  5    3024      727    729.278    -2.01559     0.010113    0.99997
  6    3700      717    719.803    -2.43802     0.007317    0.99996
  7    1596      536    533.918     2.77985     0.012143    0.99995
  8     181      724    728.068    -3.53209     0.013665    0.99992
  9    1949      696    691.898     3.96736     0.020426    0.99990
 10    5917      735    731.141     4.03336     0.005831    0.99990
After using macro robust_hb.sas, we can use the dataset _tempout_ to create some graphs for regression diagnostic purposes. For example, we can create a graph of residuals versus fitted (predicted) with a line at zero. This plot looks much like the OLS plot, except that in the OLS all of the observations would be weighted equally, but as we saw above the observations with the greatest residuals are weighted less and hence have less influence on the results.
axis1 order = (-300 to 300 by 100) label=(a=90) minor=none; axis2 order = (300 to 900 by 300) minor=none; symbol v=star h=0.8 c=blue; proc gplot data = _tempout_; plot r*p = 1 /haxis=axis2 vaxis=axis1 vref=0; label r ="Residual"; label p ="Fitted Value"; run; quit;
4.1.4 Quantile Regression
Quantile regression in general, and median regression in particular, might be considered as an alternative to robust regression. SAS can perform quantile regression using a little bit of proc iml. Inside proc iml we call a subroutine named LAV, which performs a median regression in which the coefficients are estimated by minimizing the absolute deviations from the median. Of course, as an estimate of central tendency, the median is a resistant measure that is not as greatly affected by outliers as is the mean. It is not clear that median regression is a resistant estimation procedure; in fact, there is some evidence that it can be affected by high leverage values.
Here is what the quantile regression looks like using SAS proc iml. The first data step makes sure that the data set proc iml takes as input does not have any missing values. Inside proc iml we first set up the matrices needed for the regression computation and then call the LAV subroutine. After calling LAV we calculate the predicted values and residuals. Finally, we create a data set called _temp_ containing the dependent variable and all the predictors, plus the predicted values and residuals.
data elemapi2;
  set "c:\sasreg\elemapi2";
  cons = 1;
  if api00 ~= . & acs_k3 ~= . & acs_46 ~= . & full ~= . & enroll ~= .;
run;

proc iml;
  /* least absolute values */
  use elemapi2;
  read all;
  a = cons || acs_k3 || acs_46 || full || enroll;
  b = api00;
  opt = { . 3 0 1 };
  call lav(rc, xr, a, b, , opt);   /* print out the estimates */
  opt = { . 0 . 1 };
  call lav(rc, xr, a, b, , opt);   /* no printed output, but creates xr */
  pred = a*t(xr);
  resid = b - pred;
  create _temp_ var { api00 cons acs_k3 acs_46 full enroll pred resid};
  append;
quit;
LAV (L1) Estimation

Start with LS Solution
Start Iter: gamma=284.75134813 ActEqn=395

Iter   N Huber   Act Eqn   Rank      Gamma         L1(x)      F(Gamma)
  31        36         5      5     0.0000    36268.1049    36240.6335

Algorithm converged

Objective Function       L1(x)= 36268.104941
Number Search Directions=    68
Number Refactorizations =     2
Total Execution Time=    0.0200

Necessary and sufficient optimality condition satisfied.

L1 Solution with ASE

Est    17.1505029    1.2690656888    7.2240793844    5.3238408715    -0.124573396
ASE    123.531545    6.3559394265    2.2262729207     0.602359504    0.0391932684
The coefficient and standard error for acs_k3 are considerably different from the OLS results (the coefficients are about 1.3 versus 7.0, and the standard errors are about 6.4 versus 4.4). The coefficients and standard errors for the other variables are also different, but not as dramatically. Nevertheless, the quantile regression results indicate that, like the OLS results, all of the variables except acs_k3 are significant. Using the data set _temp_ we created above, we obtain the plot of residuals vs. predicted values shown below.
symbol v=star h=0.8 c=blue; axis1 order = (-300 to 300 by 100) label=(a=90) minor=none; axis2 order = (300 to 900 by 300) minor=none; proc gplot data = _temp_; plot resid*pred = 1 / vaxis=axis1 haxis=axis2 vref=0; label resid="Residual"; label pred = "Fitted Value"; run; quit;
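As an aside, more recent releases of SAS/STAT include proc quantreg, which fits median (and other quantile) regressions directly without proc iml. If it is available in your release, a minimal sketch of the same median regression might look like this.

proc quantreg data = "c:\sasreg\elemapi2";
  model api00 = acs_k3 acs_46 full enroll / quantile = 0.5;   /* median regression */
run;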
4.2 Constrained Linear Regression
Let’s begin this section by looking at a regression model using the hsb2 dataset. The hsb2 file is a sample of 200 cases from the High School and Beyond Study (Rock, Hilton, Pollack, Ekstrom & Goertz, 1985). It includes the following variables: id female race ses schtyp prog read write math science socst. The variables read write math science socst are the results of standardized tests on reading, writing, math, science and social studies (respectively), and the variable female is coded 1 if female, 0 if male.
Let’s start by doing an OLS regression where we predict socst score from read, write, math, science and female (gender).
proc reg data = "c:\sasreg\hsb2";
  model socst = read write math science female;
run;
The REG Procedure Model: MODEL1 Dependent Variable: socst Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 5 10949 2189.85150 35.44 <.0001 Error 194 11987 61.78834 Corrected Total 199 22936 Root MSE 7.86056 R-Square 0.4774 Dependent Mean 52.40500 Adj R-Sq 0.4639 Coeff Var 14.99963 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 7.33934 3.65024 2.01 0.0457 read 1 0.37840 0.08063 4.69 <.0001 write 1 0.38587 0.08893 4.34 <.0001 math 1 0.13033 0.08938 1.46 0.1464 science 1 -0.03339 0.08187 -0.41 0.6838 female 1 -0.35326 1.24537 -0.28 0.7770
Notice that the coefficients for read and write are very similar, which makes sense since they are both measures of language ability. Also, the coefficients for math and science are similar (in that they are both not significantly different from 0). Suppose that we have a theory that suggests that read and write should have equal coefficients. We can test the equality of these two coefficients using the test statement.
test read = write; run;
Test 1 Results for Dependent Variable socst

                             Mean
Source            DF       Square    F Value    Pr > F
Numerator          1      0.19057       0.00    0.9558
Denominator      194     61.78834
The test result indicates that there is no significant difference in the coefficients for the reading and writing scores. Since it appears that the coefficients for math and science are also equal, let’s test the equality of those as well.
test math = science; run;
Test 2 Results for Dependent Variable socst

                             Mean
Source            DF       Square    F Value    Pr > F
Numerator          1     89.63950       1.45    0.2299
Denominator      194     61.78834
Let’s now perform both of these tests together, simultaneously testing that the coefficient for read equals the coefficient for write and that the coefficient for math equals the coefficient for science, using the mtest statement.
mtest math - science, read - write; run;
Multivariate Test 1 Multivariate Statistics and Exact F Statistics S=1 M=0 N=96 Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.99257173 0.73 2 194 0.4852 Pillai's Trace 0.00742827 0.73 2 194 0.4852 Hotelling-Lawley Trace 0.00748386 0.73 2 194 0.4852 Roy's Greatest Root 0.00748386 0.73 2 194 0.4852
Note that this second test has 2 df, since it is testing both of the hypotheses listed, and this test is not significant, suggesting these pairs of coefficients are not significantly different from each other. We can also estimate regression models where we constrain coefficients to be equal to each other. For example, let’s begin on a limited scale and constrain read to equal write. Proc reg uses the restrict statement to accomplish this.
proc reg data = "c:\sasreg\hsb2";
  model socst = read write math science female;
  restrict read = write;
run;
The REG Procedure Model: MODEL1 Dependent Variable: socst NOTE: Restrictions have been applied to parameter estimates. Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 4 10949 2737.26674 44.53 <.0001 Error 195 11987 61.47245 Corrected Total 199 22936 Root MSE 7.84044 R-Square 0.4774 Dependent Mean 52.40500 Adj R-Sq 0.4667 Coeff Var 14.96124 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 7.35415 3.63118 2.03 0.0442 read 1 0.38185 0.05139 7.43 <.0001 write 1 0.38185 0.05139 7.43 <.0001 math 1 0.13030 0.08915 1.46 0.1454 science 1 -0.03328 0.08164 -0.41 0.6840 female 1 -0.32962 1.16736 -0.28 0.7780 RESTRICT -1 -25.51302 458.21611 -0.06 0.9558* * Probability computed using beta distribution.
Notice that the coefficients for read and write are identical, along with their standard errors, t-tests, etc. Also note that the degrees of freedom for the F test is four, not five, as in the OLS model. This is because only one coefficient is estimated for read and write; in effect they act as a single predictor equal to the sum of their values. Notice also that the Root MSE is slightly higher for the constrained model, but only slightly. This is because we have forced the model to estimate coefficients for read and write that are not as good at minimizing the Sum of Squares Error (the coefficients that would minimize the SSE are those from the unconstrained model).
Next, we will define a second constraint, setting math equal to science together with the first constraint we set before.
proc reg data = "c:\sasreg\hsb2";
  model socst = read write math science female;
  restrict read = write, math = science;
run;
The REG Procedure Model: MODEL1 Dependent Variable: socst NOTE: Restrictions have been applied to parameter estimates. Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 3 10860 3619.84965 58.75 <.0001 Error 196 12077 61.61554 Corrected Total 199 22936 Root MSE 7.84956 R-Square 0.4735 Dependent Mean 52.40500 Adj R-Sq 0.4654 Coeff Var 14.97864 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 7.50566 3.63323 2.07 0.0402 read 1 0.38604 0.05133 7.52 <.0001 write 1 0.38604 0.05133 7.52 <.0001 math 1 0.04281 0.05192 0.82 0.4107 science 1 0.04281 0.05192 0.82 0.4107 female 1 -0.20087 1.16383 -0.17 0.8631 RESTRICT -1 -15.36285 458.82638 -0.03 0.9734* RESTRICT -1 547.24384 454.01563 1.21 0.2290* * Probability computed using beta distribution.
Now the coefficient for read equals the coefficient for write, the coefficient for math equals the coefficient for science, and the degrees of freedom for the model have dropped to three. Again, the Root MSE is slightly larger than in the prior model, but only very slightly. If the population coefficients for read and write really are equal, and likewise for math and science, then these combined (constrained) estimates may be more stable and generalize better to other samples. So although these estimates may lead to a slightly higher standard error of prediction in this sample, they may generalize better to the population from which they came.
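One way to see what the restrict statement is doing: forcing two predictors to share a single coefficient is equivalent to regressing on their sum. Here is a minimal sketch of that equivalence (the variables rw and ms are new names created only for this illustration); the coefficient for rw should match the common read/write coefficient above (about .39), and the coefficient for ms the common math/science coefficient (about .04).

data hsb2c;
  set "c:\sasreg\hsb2";
  rw = read + write;        /* one shared coefficient for read and write   */
  ms = math + science;      /* one shared coefficient for math and science */
run;

proc reg data = hsb2c;
  model socst = rw ms female;
run;
quit;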
4.3 Regression with Censored or Truncated Data
Analyzing data that contain censored values or are truncated is common in many research disciplines. According to Hosmer and Lemeshow (1999), a censored value is one whose value is incomplete due to random factors for each subject. A truncated observation, on the other hand, is one which is incomplete due to a selection process in the design of the study.
We will begin by looking at analyzing data with censored values.
4.3.1 Regression with Censored Data
In this example we have a variable called acadindx which is a weighted combination of standardized test scores and academic grades. The maximum possible score on acadindx is 200 but it is clear that the 16 students who scored 200 are not exactly equal in their academic abilities. In other words, there is variability in academic ability that is not being accounted for when students score 200 on acadindx. The variable acadindx is said to be censored, in particular, it is right censored.
Let’s look at the example. We will begin by looking at a description of the data, some descriptive statistics, and correlations among the variables.
proc means data = "c:\sasreg\acadindx";
run;
The MEANS Procedure Variable N Mean Std Dev Minimum Maximum ------------------------------------------------------------------------------- id 200 100.5000000 57.8791845 1.0000000 200.0000000 female 200 0.5450000 0.4992205 0 1.0000000 reading 200 52.2300000 10.2529368 28.0000000 76.0000000 writing 200 52.7750000 9.4785860 31.0000000 67.0000000 acadindx 200 172.1850000 16.8173987 138.0000000 200.0000000 -------------------------------------------------------------------------------
proc means data = "c:\sasreg\acadindx" n;
  where acadindx = 200;
run;
The MEANS Procedure

Variable     N
---------------
id          16
female      16
reading     16
writing     16
acadindx    16
---------------
proc corr data = "c:\sasreg\acadindx" nosimple noprob;
  var acadindx female reading writing;
run;
The CORR Procedure 4 Variables: acadindx female reading writing Pearson Correlation Coefficients, N = 200 acadindx female reading writing acadindx 1.00000 -0.08210 0.71309 0.66256 female -0.08210 1.00000 -0.05308 0.25649 reading 0.71309 -0.05308 1.00000 0.59678 writing 0.66256 0.25649 0.59678 1.00000
Now, let’s run a standard OLS regression on the data and generate predicted scores in p1.
proc reg data = "c:\sasreg\acadindx";
  model acadindx = female reading writing;
  output out = reg1 p = p1;
run;
quit;
The REG Procedure Model: MODEL1 Dependent Variable: acadindx Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 3 34994 11665 107.40 <.0001 Error 196 21288 108.61160 Corrected Total 199 56282 Root MSE 10.42169 R-Square 0.6218 Dependent Mean 172.18500 Adj R-Sq 0.6160 Coeff Var 6.05261 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 96.11841 4.48956 21.41 <.0001 female 1 -5.83250 1.58821 -3.67 0.0003 reading 1 0.71842 0.09315 7.71 <.0001 writing 1 0.79057 0.10410 7.59 <.0001
Proc lifereg is one of the procedures in SAS that can be used for regression with censored data. Its syntax is similar to that of proc reg, with the addition of a variable indicating whether an observation is censored. We therefore first create a data set containing the censoring information.
data tobit_model;
  set "c:\sasreg\acadindx";
  censor = ( acadindx >= 200 );
run;
proc lifereg data = tobit_model;
  model acadindx*censor(1) = female reading writing / d = normal;
  output out = reg2 p = p2;
run;
The LIFEREG Procedure Model Information Data Set WORK.TOBIT_MODEL Dependent Variable acadindx Censoring Variable censor Censoring Value(s) 1 Number of Observations 200 Noncensored Values 184 Right Censored Values 16 Left Censored Values 0 Interval Censored Values 0 Name of Distribution Normal Log Likelihood -718.0636168 Algorithm converged. Type III Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq female 1 14.0654 0.0002 reading 1 60.8529 <.0001 writing 1 54.1655 <.0001 Analysis of Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 92.7378 4.8034 83.3233 102.1524 372.74 <.0001 female 1 -6.3473 1.6924 -9.6644 -3.0302 14.07 0.0002 reading 1 0.7777 0.0997 0.5823 0.9731 60.85 <.0001 writing 1 0.8111 0.1102 0.5951 1.0271 54.17 <.0001 Scale 1 10.9897 0.5817 9.9067 12.1912
Let’s merge the two data sets we created so that we can compare the predicted scores p1 and p2.
data compare; merge reg1 reg2; by id; run; proc means data = compare; var acadindx p1 p2; run;
The MEANS Procedure

Variable      N            Mean         Std Dev         Minimum         Maximum
---------------------------------------------------------------------------------
acadindx    200     172.1850000      16.8173987     138.0000000     200.0000000
p1          200     172.1850000      13.2608696     142.3820725     201.5311070
p2          200     172.7040275      14.0029221     141.2210940     203.8540576
---------------------------------------------------------------------------------
It shows that the censored regression model predicted values have a larger standard deviation and a greater range of values. When we look at a listing of p1 and p2 for all students who scored the maximum of 200 on acadindx, we see that in every case the censored regression model predicted value is greater than the OLS predicted value. These predictions represent an estimate of what the variability would be if the values of acadindx could exceed 200.
proc print data = compare; var acadindx p1 p2; where acadindx = 200; run;
Obs    acadindx         p1         p2
 32         200    179.175    179.620
 57         200    192.681    194.329
 68         200    201.531    203.854
 80         200    191.831    193.577
 82         200    188.154    189.563
 88         200    186.573    187.940
 95         200    195.997    198.176
100         200    186.933    188.108
132         200    197.578    199.798
136         200    189.459    191.144
143         200    191.185    192.833
157         200    191.614    193.477
161         200    180.251    181.008
169         200    182.275    183.367
174         200    191.614    193.477
200         200    187.662    189.421
Proc lifereg handles right-censored, left-censored, and interval-censored data. Other SAS procedures, such as proc qlim, can also perform censored regression analysis.
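For comparison, the same right-censored (tobit-type) model could also be sketched with proc qlim, the procedure used for truncated data in the next section; this assumes proc qlim is available in your release.

proc qlim data = "c:\sasreg\acadindx";
  model acadindx = female reading writing;
  endogenous acadindx ~ censored (ub = 200);   /* right censored at 200 */
run;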
4.3.2 Regression with Truncated Data
Truncated data occurs when some observations are not included in the analysis because of the value of the variable. We will illustrate analysis with truncation using the dataset, acadindx, that was used in the previous section.
Let’s imagine that in order to get into a special honors program, students need to score above 160 on acadindx. So we will drop all observations in which the value of acadindx is less than or equal to 160. Now, let’s estimate the same model that we used in the section on censored data, only this time we will pretend that a 200 for acadindx is not censored.
proc reg data = "c:\sasreg\acadindx";
  model acadindx = female reading writing;
  where acadindx > 160;
run;
quit;
The REG Procedure Model: MODEL1 Dependent Variable: acadindx Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 3 8074.79638 2691.59879 33.01 <.0001 Error 140 11416 81.54545 Corrected Total 143 19491 Root MSE 9.03025 R-Square 0.4143 Dependent Mean 180.42361 Adj R-Sq 0.4017 Coeff Var 5.00503 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 125.63547 5.89156 21.32 <.0001 female 1 -5.23849 1.61563 -3.24 0.0015 reading 1 0.44111 0.09635 4.58 <.0001 writing 1 0.58733 0.11508 5.10 <.0001
It is clear that the estimates of the coefficients are distorted due to the fact that 56 observations are no longer in the dataset. This amounts to a restriction of range on both the response variable and the predictor variables. For example, the coefficient for writing dropped from .79 to .58. What this means is that if our goal is to find the relation between acadindx and the predictor variables in the population, then the truncation of acadindx in our sample is going to lead to biased estimates. A better approach to analyzing these data is to use truncated regression. In SAS this can be accomplished using proc qlim, an experimental procedure first available in SAS version 8.1. Proc qlim (Qualitative and LImited dependent variable model) analyzes univariate (and multivariate) limited dependent variable models in which the dependent variables take discrete values or are observed only over a limited range of values.
data trunc_model;
  set "c:\sasreg\acadindx";
  y = .;
  if acadindx > 160 & acadindx ~= . then y = acadindx;
run;
proc qlim data = trunc_model;
  model y = female reading writing;
  endogenous y ~ truncated (lb = 160);
run;
The QLIM Procedure Summary Statistics of Continuous Responses N Obs N Obs Standard Lower Upper Lower Upper Variable Mean Error Type Bound Bound Bound Bound y 180.4236 11.674837 Truncated 160 Model Fit Summary Number of Endogenous Variables 1 Endogenous Variable y Number of Observations 144 Missing Values 56 Log Likelihood -510.00768 Maximum Absolute Gradient 3.87471E-6 Number of Iterations 14 AIC 1030 Schwarz Criterion 1045 Algorithm converged. Parameter Estimates Standard Approx Parameter Estimate Error t Value Pr > |t| Intercept 110.289206 8.673847 12.72 <.0001 female -6.099602 1.925245 -3.17 0.0015 reading 0.518179 0.116829 4.44 <.0001 writing 0.766164 0.152620 5.02 <.0001 _Sigma 9.803571 0.721646 13.59 <.0001
The coefficients from proc qlim are closer to the OLS results from the complete data; for example, the coefficient for writing is .77, which is close to the unrestricted OLS estimate of .79. However, the results are still somewhat different for the other variables; for example, the coefficient for reading is .52 in proc qlim as compared to .72 in the original OLS with the unrestricted data, though this is better than the OLS estimate of .44 with the restricted data. While proc qlim may improve the estimates on a restricted data file as compared to OLS, it is certainly no substitute for analyzing the complete unrestricted data file.
4.4 Regression with Measurement Error
As you will most likely recall, one of the assumptions of regression is that the predictor variables are measured without error. The problem is that measurement error in predictor variables leads to underestimation of the regression coefficients.
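To see why, suppose a single predictor x* is observed as x = x* + u, where the measurement error u is independent of x* and of the regression error. The OLS slope from regressing y on x then converges not to the true coefficient beta but to the attenuated value

beta * var(x*) / ( var(x*) + var(u) ),

so the larger the measurement error variance, the more the estimated coefficient is shrunk toward zero.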
This section is under development.
4.5 Multiple Equation Regression Models
If a dataset has enough variables, we may want to estimate more than one regression model. For example, we may want to predict y1 from x1 and also predict y2 from x2. Even though there are no variables in common, these two models are not independent of one another because the data come from the same subjects. This is an example of one type of multiple equation regression known as seemingly unrelated regression. We can estimate the coefficients and obtain standard errors that take into account the correlated errors in the two models. An important feature of multiple equation models is that we can test predictors across equations.
Another example of multiple equation regression is if we wished to predict y1, y2 and y3 from x1 and x2. This is a three equation system, known as multivariate regression, with the same predictor variables for each model. Again, we have the capability of testing coefficients across the different equations.
Multiple equation models are a powerful extension to our data analysis tool kit.
4.5.1 Seemingly Unrelated Regression
Let’s continue using the hsb2 data file to illustrate the use of seemingly unrelated regression. This time let’s look at two regression models.
science = math female write = read female
It is the case that the errors (residuals) from these two models would be correlated. This would be true even if the predictor female were not found in both models. The errors would be correlated because all of the values of the variables are collected on the same set of observations. This is a situation tailor-made for seemingly unrelated regression using proc syslin with the sur option. With proc syslin we can estimate both models simultaneously while accounting for the correlated errors, leading to efficient estimates of the coefficients and standard errors. The syntax is as follows.
proc syslin data = "c:\sasreg\hsb2" sur;
  science: model science = math female;
  write:   model write = read female;
run;
The first part of the output consists of the OLS estimate for each model. Here is the OLS estimate for the first model.
The SYSLIN Procedure Ordinary Least Squares Estimation Model SCIENCE Dependent Variable science Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 2 7993.550 3996.775 68.38 <.0001 Error 197 11513.95 58.44645 Corrected Total 199 19507.50 Root MSE 7.64503 R-Square 0.40977 Dependent Mean 51.85000 Adj R-Sq 0.40378 Coeff Var 14.74451 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 18.11813 3.167133 5.72 <.0001 math 1 0.663190 0.057872 11.46 <.0001 female 1 -2.16840 1.086043 -2.00 0.0472
And here is OLS estimate for the second model.
The SYSLIN Procedure Ordinary Least Squares Estimation Model WRITE Dependent Variable write Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 2 7856.321 3928.161 77.21 <.0001 Error 197 10022.55 50.87591 Corrected Total 199 17878.88 Root MSE 7.13273 R-Square 0.43942 Dependent Mean 52.77500 Adj R-Sq 0.43373 Coeff Var 13.51537 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 20.22837 2.713756 7.45 <.0001 read 1 0.565887 0.049385 11.46 <.0001 female 1 5.486894 1.014261 5.41 <.0001
Proc syslin with option sur also gives an estimate of the correlation between the errors of the two models. Here is the corresponding output.
The SYSLIN Procedure Seemingly Unrelated Regression Estimation Cross Model Covariance SCIENCE WRITE SCIENCE 58.4464 7.8908 WRITE 7.8908 50.8759 Cross Model Correlation SCIENCE WRITE SCIENCE 1.00000 0.14471 WRITE 0.14471 1.00000 Cross Model Inverse Correlation SCIENCE WRITE SCIENCE 1.02139 -0.14780 WRITE -0.14780 1.02139 Cross Model Inverse Covariance SCIENCE WRITE SCIENCE 0.017476 -.002710 WRITE -.002710 0.020076 System Weighted MSE 0.9981 Degrees of freedom 394 System Weighted R-Square 0.3875
Finally, we have the seemingly unrelated regression estimation for our models. Note that both the estimates of the coefficients and their standard errors are different from the OLS model estimates shown above.
The SYSLIN Procedure Seemingly Unrelated Regression Estimation Model SCIENCE Dependent Variable science Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 20.13265 3.149485 6.39 <.0001 math 1 0.625141 0.057528 10.87 <.0001 female 1 -2.18934 1.086038 -2.02 0.0452 Model WRITE Dependent Variable write Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 21.83439 2.698827 8.09 <.0001 read 1 0.535484 0.049091 10.91 <.0001 female 1 5.453748 1.014244 5.38 <.0001
Now that we have estimated our models let’s test the predictor variables. The test for female combines information from both models. The tests for math and read are actually equivalent to the t-tests above except that the results are displayed as F-tests.
proc syslin data = "c:\sasreg\hsb2" sur;
  science: model science = math female;
  write:   model write = read female;
  female:  stest science.female = write.female = 0;
  math:    stest science.math = 0;
  read:    stest write.read = 0;
run;
Test Results for Variable FEMALE

Num DF    Den DF    F Value    Pr > F
     2       394      18.48    0.0001

Test Results for Variable MATH

Num DF    Den DF    F Value    Pr > F
     1       394     118.31    0.0001

Test Results for Variable READ

Num DF    Den DF    F Value    Pr > F
     1       394     119.21    0.0001
Now, let’s estimate 3 models where we use the same predictors in each model as shown below.
read = female prog1 prog3 write = female prog1 prog3 math = female prog1 prog3
Here prog1 and prog3 are dummy variables for the variable prog. Let’s generate these variables before estimating our three models using proc syslin.
data hsb2;
  set "c:\sasreg\hsb2";
  prog1 = (prog = 1);
  prog3 = (prog = 3);
run;
proc syslin data = hsb2 sur;
  model1: model read  = female prog1 prog3;
  model2: model write = female prog1 prog3;
  model3: model math  = female prog1 prog3;
run;
The OLS regression estimates of our three models are as follows.
<some output omitted>
The SYSLIN Procedure Ordinary Least Squares Estimation Model MODEL1 Dependent Variable read Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 56.82950 1.170562 48.55 <.0001 female 1 -1.20858 1.327672 -0.91 0.3638 prog1 1 -6.42937 1.665893 -3.86 0.0002 prog3 1 -9.97687 1.606428 -6.21 <.0001 Model MODEL2 Dependent Variable write Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 53.62162 1.042019 51.46 <.0001 female 1 4.771211 1.181876 4.04 <.0001 prog1 1 -4.83293 1.482956 -3.26 0.0013 prog3 1 -9.43807 1.430021 -6.60 <.0001 Model MODEL3 Dependent Variable math Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 57.10551 1.036890 55.07 <.0001 female 1 -0.67377 1.176059 -0.57 0.5674 prog1 1 -6.72394 1.475657 -4.56 <.0001 prog3 1 -10.3217 1.422983 -7.25 <.0001
These regressions provide fine estimates of the coefficients and standard errors, but these results assume the residuals of each analysis are completely independent of the others. Also, if we wish to test female, we would have to do it three times and would not be able to combine the information from all three tests into a single overall test.
Now let’s see the output of the estimate using seemingly unrelated regression.
The SYSLIN Procedure Seemingly Unrelated Regression Estimation
Model MODEL1 Dependent Variable read
Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 56.82950 1.170562 48.55 <.0001 female 1 -1.20858 1.327672 -0.91 0.3638 prog1 1 -6.42937 1.665893 -3.86 0.0002 prog3 1 -9.97687 1.606428 -6.21 <.0001 Model MODEL2 Dependent Variable write Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 53.62162 1.042019 51.46 <.0001 female 1 4.771211 1.181876 4.04 <.0001 prog1 1 -4.83293 1.482956 -3.26 0.0013 prog3 1 -9.43807 1.430021 -6.60 <.0001 Model MODEL3 Dependent Variable math Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 57.10551 1.036890 55.07 <.0001 female 1 -0.67377 1.176059 -0.57 0.5674 prog1 1 -6.72394 1.475657 -4.56 <.0001 prog3 1 -10.3217 1.422983 -7.25 <.0001
Note that the coefficients, and in this case the standard errors as well, are the same in the OLS results and in the seemingly unrelated regression estimates. Because every equation contains exactly the same set of predictors, the SUR estimation offers little or no gain in efficiency over equation-by-equation OLS; its advantage here is that it allows us to test predictors across equations.
In addition to getting more appropriate standard errors, we can test the effects of the predictors across the equations. We can test the hypothesis that the coefficient for female is 0 for all three outcome variables, as shown below.
proc syslin data = hsb2 sur;
  model1: model read  = female prog1 prog3;
  model2: model write = female prog1 prog3;
  model3: model math  = female prog1 prog3;
  female: stest model1.female = model2.female = model3.female = 0;
run;
Test Results for Variable FEMALE

Num DF    Den DF    F Value    Pr > F
     3       588      11.63    0.0001
We can also test the hypothesis that the coefficient for female is 0 for just read and math.
proc syslin data = hsb2 sur; model1: model read = female prog1 prog3; model2: model write = female prog1 prog3; model3: model math = female prog1 prog3; f1: stest model1.female = model3.female = 0; run;
Test Results for Variable F1

Num DF    Den DF    F Value    Pr > F
     2       588       0.42    0.6599
We can also test the hypothesis that the coefficients for prog1 and prog3 are 0 for all three outcome variables, as shown below.
proc syslin data = hsb2 sur; model1: model read = female prog1 prog3; model2: model write = female prog1 prog3; model3: model math = female prog1 prog3; progs: stest model1.prog1 = model2.prog1 = model3.prog1 =0, model1.prog3 = model2.prog3 = model3.prog3 =0 ; run;
Test Results for Variable PROGS

Num DF    Den DF    F Value    Pr > F
     6       588      11.83    0.0001
4.5.2 Multivariate Regression
Let’s now use multivariate regression with proc reg to look at the same analysis that we saw in the proc syslin example above, estimating the following three models.
read = female prog1 prog3 write = female prog1 prog3 math = female prog1 prog3
Below we use proc reg to predict read, write and math from female, prog1 and prog3. The output gives a model summary and parameter estimates for each outcome variable, much like the proc syslin output above. Note, however, that the standard errors reported here are the ordinary OLS standard errors; unlike the proc syslin sur results, they make no adjustment for the correlations among the residuals of the three equations.
proc reg data =hsb2; model read write math = female prog1 prog3 ; run;
The REG Procedure [Some output omitted] Dependent Variable: read Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 56.82950 1.17056 48.55 <.0001 female 1 -1.20858 1.32767 -0.91 0.3638 prog1 1 -6.42937 1.66589 -3.86 0.0002 prog3 1 -9.97687 1.60643 -6.21 <.0001 Dependent Variable: write Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 53.62162 1.04202 51.46 <.0001 female 1 4.77121 1.18188 4.04 <.0001 prog1 1 -4.83293 1.48296 -3.26 0.0013 prog3 1 -9.43807 1.43002 -6.60 <.0001 Dependent Variable: math Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 57.10551 1.03689 55.07 <.0001 female 1 -0.67377 1.17606 -0.57 0.5674 prog1 1 -6.72394 1.47566 -4.56 <.0001 prog3 1 -10.32168 1.42298 -7.25 <.0001
Now, let’s test female. Note that female was statistically significant in only one of the three equations. Using the mtest statement after proc reg allows us to test female across all three equations simultaneously. And, guess what? It is significant. This is consistent with what we found using seemingly unrelated regression estimation.
female: mtest female=0; run;
Multivariate Test: female Multivariate Statistics and Exact F Statistics S=1 M=0.5 N=96 Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.84892448 11.51 3 194 <.0001 Pillai's Trace 0.15107552 11.51 3 194 <.0001 Hotelling-Lawley Trace 0.17796108 11.51 3 194 <.0001 Roy's Greatest Root 0.17796108 11.51 3 194 <.0001
We can also test prog1 and prog3, both separately and combined. Remember these are multivariate tests.
prog1: mtest prog1 = 0; run;
Multivariate Test: prog1 Multivariate Statistics and Exact F Statistics S=1 M=0.5 N=96 Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.89429287 7.64 3 194 <.0001 Pillai's Trace 0.10570713 7.64 3 194 <.0001 Hotelling-Lawley Trace 0.11820192 7.64 3 194 <.0001 Roy's Greatest Root 0.11820192 7.64 3 194 <.0001
prog3: mtest prog3 = 0; run;
Multivariate Test: prog3 Multivariate Statistics and Exact F Statistics S=1 M=0.5 N=96 Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.75267026 21.25 3 194 <.0001 Pillai's Trace 0.24732974 21.25 3 194 <.0001 Hotelling-Lawley Trace 0.32860304 21.25 3 194 <.0001 Roy's Greatest Root 0.32860304 21.25 3 194 <.0001
prog: mtest prog1 = prog3 =0; run; quit;
Multivariate Test: prog Multivariate Statistics and F Approximations S=2 M=0 N=96 Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.73294667 10.87 6 388 <.0001 Pillai's Trace 0.26859190 10.08 6 390 <.0001 Hotelling-Lawley Trace 0.36225660 11.68 6 256.9 <.0001 Roy's Greatest Root 0.35636617 23.16 3 195 <.0001 NOTE: F Statistic for Roy's Greatest Root is an upper bound. NOTE: F Statistic for Wilks' Lambda is exact.
Proc syslin with the sur option and proc reg both allow you to test multi-equation models while taking into account the fact that the equations are not independent. Proc syslin with the sur option allows you to get estimates for each equation that adjust for the non-independence of the equations, and it allows you to estimate equations which don’t necessarily have the same predictors. By contrast, proc reg is restricted to equations that have the same set of predictors, and the estimates it provides for the individual equations are the same as the OLS estimates. However, proc reg allows you to perform more traditional multivariate tests of predictors.
4.6 Summary
This chapter has covered a variety of topics that go beyond ordinary least squares regression, but there still remain a variety of topics we wish we could have covered, including the analysis of survey data, dealing with missing data, panel data analysis, and more. And, for the topics we did cover, we wish we could have gone into even more detail. One of our main goals for this chapter was to help you be aware of some of the techniques that are available in SAS for analyzing data that do not fit the assumptions of OLS regression and some of the remedies that are possible. If you are a member of the UCLA research community, and you have further questions, we invite you to use our consulting services to discuss issues specific to your data analysis.