**Chapter Outline**

**4.1 Robust Regression Methods
4.1.1 Regression with Robust Standard Errors
4.1.2 Using the Cluster Option
4.1.3 Robust Regression
4.1.4 Quantile Regression
4.2 Constrained Linear Regression
4.3 Regression with Censored or Truncated Data
4.3.1 Regression with Censored Data
4.3.2 Regression with Truncated Data
4.4 Regression with Measurement Error
4.5 Multiple Equation Regression Models
4.5.1 Seemingly Unrelated Regression
4.5.2 Multivariate Regression
4.6 Summary
4.7 Self assessment
4.8 For more information
**

In this chapter we will go into various commands that go beyond OLS. This chapter is a bit different from the others in that it covers a number of different concepts, some of which may be new to you. These extensions, beyond OLS, have much of the look and feel of OLS but will provide you with additional tools to work with linear models.

The topics will include robust regression methods, constrained linear regression, regression with censored and truncated data, regression with measurement error, and multiple equation models.

**4.1 Robust Regression Methods**

It seems to be a rare dataset that meets all of the assumptions underlying multiple regression. We know that failure to meet assumptions can lead to biased estimates of coefficients and especially biased estimates of the standard errors. This fact explains a lot of the activity in the development of robust regression methods.

The idea behind robust regression methods is to make adjustments in the estimates that
take into account some of the flaws in the data itself. We are going to look at three
approaches to robust regression: 1) regression with robust standard errors including the **cluster**
option, 2) robust regression using iteratively reweighted least squares, and 3) quantile
regression, more specifically, median regression.

Before we look at these approaches, let’s look at a standard OLS regression using the elementary school academic performance index (elemapi2.dta) dataset.

use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2

We will look at a model that predicts the api 2000 scores using the average class size
in K through 3 (**acs_k3**), average class size 4 through 6 (**acs_46**), the
percent of fully credentialed teachers (**full**), and the size of the school (**enroll**).
First let’s look at the descriptive statistics for these variables. Note the missing
values for **acs_k3** and **acs_k6**.

summarize api00 acs_k3 acs_46 full enrollVariable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- api00 | 400 647.6225 142.249 369 940 acs_k3 | 398 19.1608 1.368693 14 25 acs_46 | 397 29.68514 3.840784 20 50 full | 400 84.55 14.94979 37 100 enroll | 400 483.465 226.4484 130 1570

Below we see the regression predicting **api00** from **acs_k3**, ** acs_46
full **and** enroll**. We see that all of the variables are significant except for **acs_k3**.

regress api00 acs_k3 acs_46 full enrollSource | SS df MS Number of obs = 395 ---------+------------------------------ F( 4, 390) = 61.01 Model | 3071909.06 4 767977.265 Prob > F = 0.0000 Residual | 4909500.73 390 12588.4634 R-squared = 0.3849 ---------+------------------------------ Adj R-squared = 0.3786 Total | 7981409.79 394 20257.3852 Root MSE = 112.20 ------------------------------------------------------------------------------ api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- acs_k3 | 6.954381 4.371097 1.591 0.112 -1.63948 15.54824 acs_46 | 5.966015 1.531049 3.897 0.000 2.955873 8.976157 full | 4.668221 .4142537 11.269 0.000 3.853771 5.482671 enroll | -.1059909 .0269539 -3.932 0.000 -.1589841 -.0529977 _cons | -5.200407 84.95492 -0.061 0.951 -172.2273 161.8265 ------------------------------------------------------------------------------

We can use the **test** command to test both of the class size variables,
and we find the overall test of these two variables is significant.

test acs_k3 acs_46( 1) acs_k3 = 0.0 ( 2) acs_46 = 0.0 F( 2, 390) = 11.08 Prob > F = 0.0000

Here is the residual versus fitted plot for this regression. Notice that the pattern of the residuals is not exactly as we would hope. The spread of the residuals is somewhat wider toward the middle right of the graph than at the left, where the variability of the residuals is somewhat smaller, suggesting some heteroscedasticity.

rvfplot

Below we show the **avplots**. Although the plots are small, you can see some
points that are of concern. There is not a single extreme point (like we saw in chapter
2)
but a handful of points that stick out. For example, in the top right graph you can
see a handful of points that stick out from the rest. If this were just one or two
points, we might look for mistakes or for outliers, but we would be more reluctant to
consider such a large number of points as outliers.

avplots

Here is the** lvr2plot** for this regression. We see 4 points that are
somewhat high in both their leverage and their residuals.

lvr2plot

None of these results are dramatic problems, but the **rvfplot** suggests that there
might be some outliers and some possible heteroscedasticity; the **avplots** have some
observations that look to have high leverage, and the **lvr2plot** shows some
points in the upper right quadrant that could be influential. We might wish to use
something other than OLS regression to estimate this model. In the next several sections
we will look at some robust regression methods.

**4.1.1 Regression with Robust Standard Errors**

The Stata **regress** command includes a **robust** option for
estimating the standard errors using the Huber-White sandwich estimators. Such robust
standard errors can deal with a collection of minor concerns about failure to meet
assumptions, such as minor problems about normality, heteroscedasticity, or some
observations that exhibit large residuals, leverage or influence. For such minor problems,
the robust option may effectively deal with these concerns.

With the **robust** option, the point estimates of the coefficients are exactly the
same as in ordinary OLS, but the standard errors take into account issues concerning
heterogeneity and lack of normality. Here is the same regression as above using the **robust**
option. Note the changes in the standard errors and t-tests (but no change in the
coefficients). In this particular example, using robust standard errors did not change any
of the conclusions from the original OLS regression.

regress api00 acs_k3 acs_46 full enroll, robustRegression with robust standard errors Number of obs = 395 F( 4, 390) = 84.67 Prob > F = 0.0000 R-squared = 0.3849 Root MSE = 112.20 ------------------------------------------------------------------------------ | Robust api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- acs_k3 | 6.954381 4.620599 1.505 0.133 -2.130019 16.03878 acs_46 | 5.966015 1.573214 3.792 0.000 2.872973 9.059057 full | 4.668221 .4146813 11.257 0.000 3.852931 5.483512 enroll | -.1059909 .0280154 -3.783 0.000 -.1610711 -.0509108 _cons | -5.200407 86.66308 -0.060 0.952 -175.5857 165.1849 ------------------------------------------------------------------------------

4.1.2 Using the Cluster Option

As described in Chapter 2, OLS regression assumes that the residuals are independent.
The **elemapi2** dataset contains data on 400 schools that come from 37 school
districts. It is very possible that the scores within each school district may not be
independent, and this could lead to residuals that are not independent within districts.
We can use the **cluster** option to indicate that the observations
are clustered into districts (based on **dnum**) and that the observations
may be correlated within districts, but would be independent between districts.

By the way, if we did not know the number of districts, we could quickly find out how
many districts there are as shown below, by **quietly** tabulating **dnum**
and then displaying the macro **r(r)** which gives the numbers of rows in the
table, which is the number of school districts in our data.

quietly tabulate dnumdisplay r(r)37

Now, we can run regress with the **cluster** option. We do not need to include the
robust option since robust is implied with cluster. Note that the standard errors have
changed substantially, much more so, than the change caused by the **robust** option by
itself.

regress api00 acs_k3 acs_46 full enroll, cluster(dnum)Regression with robust standard errors Number of obs = 395 F( 4, 36) = 31.18 Prob > F = 0.0000 R-squared = 0.3849 Number of clusters (dnum) = 37 Root MSE = 112.20 ------------------------------------------------------------------------------ | Robust api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- acs_k3 | 6.954381 6.901117 1.008 0.320 -7.041734 20.9505 acs_46 | 5.966015 2.531075 2.357 0.024 .8327565 11.09927 full | 4.668221 .7034641 6.636 0.000 3.24153 6.094913 enroll | -.1059909 .0429478 -2.468 0.018 -.1930931 -.0188888 _cons | -5.200407 121.7856 -0.043 0.966 -252.193 241.7922 ------------------------------------------------------------------------------

As with the **robust** option, the estimate of the coefficients are the
same as the OLS estimates, but the standard errors take into account that the observations
within districts are non-independent. Even though the standard errors are larger in
this analysis, the three variables that were significant in the OLS analysis are
significant in this analysis as well. These standard errors are computed based on
aggregate scores for the 37 districts, since these district level scores should be
independent. If you have a very small number of clusters compared to your overall sample
size it is possible that the standard errors could be quite larger than the OLS results.
For example, if there were only 3 districts, the standard errors would be computed on the
aggregate scores for just 3 districts.

4.1.3 Robust Regression

The Stata **rreg** command performs a robust regression using iteratively reweighted
least squares, i.e., **rreg** assigns a weight to each observation with higher weights given to
better behaved observations. In fact, extremely deviant cases, those with Cook’s D greater than 1,
can have their weights set to missing so that they are not included in the analysis at all.

We will use **rreg** with the generate option so that we can
inspect the weights used to weight the observations. Note that in this analysis both the
coefficients and the standard errors differ from the original OLS regression. Below we
show the same analysis using robust regression using the **rreg** command.

rreg api00 acs_k3 acs_46 full enroll, gen(wt)Robust regression estimates Number of obs = 395 F( 4, 390) = 56.51 Prob > F = 0.0000 ------------------------------------------------------------------------------ api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- acs_k3 | 6.110881 4.658131 1.312 0.190 -3.047308 15.26907 acs_46 | 6.254708 1.631587 3.834 0.000 3.046901 9.462516 full | 4.796072 .4414563 10.864 0.000 3.92814 5.664004 enroll | -.1092586 .0287239 -3.804 0.000 -.1657316 -.0527855 _cons | -6.788183 90.5336 -0.075 0.940 -184.7832 171.2068 ------------------------------------------------------------------------------

If you compare the robust regression results (directly above) with the OLS results previously presented, you can see that the coefficients and standard errors are quite similar, and the t values and p values are also quite similar. Despite the minor problems that we found in the data when we performed the OLS analysis, the robust regression analysis yielded quite similar results suggesting that indeed these were minor problems. Had the results been substantially different, we would have wanted to further investigate the reasons why the OLS and robust regression results were different, and among the two results the robust regression results would probably be the more trustworthy.

Let’s calculate and look at the predicted (fitted) values (**p**), the
residuals (**r**), and the leverage (hat) values (**h**). Note
that we are including **if e(sample)** in the commands because **rreg** can generate
weights of missing and you wouldn’t want to have predicted values and residuals for those
observations.

predict p if e(sample)(option xb assumed; fitted values) (5 missing values generated)predict r if e(sample), resid(5 missing values generated)predict h if e(sample), hat(5 missing values generated)

Now, let’s check on the various predicted values and the weighting. First, we will sort
by **wt** then we will look at the first 15 observations. Notice that the smallest
weights are near one-half but quickly get into the .7 range.

sort wt list snum api00 p r h wt in 1/15snum api00 p r h wt 1. 637 447 733.1567 -286.1568 .0037645 .55612093 2. 5387 892 611.5344 280.4655 .0023925 .57126927 3. 2267 897 621.4881 275.5119 .010207 .58433963 4. 65 903 631.2718 271.7282 .0105486 .59425026 5. 3759 585 842.4838 -257.4838 .0414728 .63063771 6. 5926 469 715.2266 -246.2266 .0058346 .65892631 7. 1978 894 650.7816 243.2184 .0058116 .6665881 8. 3696 483 721.3105 -238.3105 .0052619 .67834344 9. 5222 940 707.648 232.352 .0041016 .69303069 10. 690 424 654.5795 -230.5795 .0094319 .69701005 11. 3785 459 687.3311 -228.3311 .0081474 .70245717 12. 2910 831 604.4401 226.56 .0536809 .70650365 13. 699 437 660.2588 -223.2588 .0059152 .71449402 14. 3070 479 698.1256 -219.1256 .0043322 .72399766 15. 1812 917 698.9828 218.0172 .0099871 .72670695

Now, let’s look at the last 10 observations. The weights for observations 391 to 395 are all very close to one. The values for observations 396 to the end are missing due to the missing predictors. Note that the observations above that have the lowest weights are also those with the largest residuals (residuals over 200) and the observations below with the highest weights have very low residuals (all less than 3).

list snum api00 p r h wt in -10/lsnum api00 p r h wt 391. 3024 727 729.0243 -2.024302 .0104834 .99997367 392. 3535 705 703.846 1.154008 .0048329 .99999207 393. 1885 605 605.427 -.4269809 .0144377 .99999843 394. 1678 497 496.8011 .1989256 .0243301 .99999956 395. 4486 706 705.8076 .192455 .0142448 .99999986 396. 4488 521 . . . . 397. 3072 763 . . . . 398. 3055 590 . . . . 399. 116 513 . . . . 400. 4534 445 . . . .

After using **rreg**, it is possible to generate predicted values, residuals and
leverage (hat), but most of the regression diagnostic commands are not available after **rreg**.
We will have to create some of them for ourselves. Here, of course, is the graph of
residuals versus fitted (predicted) with a line at zero. This plot looks much like the OLS
plot, except that in the OLS all of the observations would be weighted equally, but as we
saw above the observations with the greatest residuals are weighted less and hence have
less influence on the results.

graph r p, yline(0)

To get an **lvr2plot** we are going to have to go through several steps in order to
get the normalized squared residuals and the means of both the residuals and the leverage
(hat) values.

First, we generate the residual squared (**r2**) and then divide it by the
sum of the squared residuals. We then compute the mean of this value and save it as a
local macro called **rm** (which we will use for creating the
leverage vs. residual plot).

generate r2=r^2(5 missing values generated)sum r2Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- r2 | 395 12436.05 14677.98 .0370389 81885.7replace r2 = r2/r(sum)(395 real changes made)summarize r2Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- r2 | 395 .0025316 .002988 7.54e-09 .0166697local rm = r(mean)

Next we compute the mean of the leverage and save it as a local macro called **hm**.

summarize hVariable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- h | 395 .0126422 .0108228 .0023925 .0664077local hm = r(mean)

Now, we can plot the leverage against the residual squared as shown below. Comparing the plot below with the plot from the OLS regression, this plot is much better behaved. There are no longer points in the upper right quadrant of the graph.

graph h r2, yline(`hm') xline(`rm')

Let’s close out this analysis by deleting our temporary variables.

drop wt p r h r2

4.1.4 Quantile Regression

Quantile regression, in general, and median regression, in particular, might be
considered as an alternative to **rreg**. The Stata command **qreg** does quantile
regression. **qreg** without any options will actually do a median regression in which
the coefficients will be estimated by minimizing the absolute deviations from the median.
Of course, as an estimate of central tendency, the median is a resistant measure that is
not as greatly affected by outliers as is the mean. It is not clear that median regression
is a resistant estimation procedure, in fact, there is some evidence that it can be
affected by high leverage values.

Here is what the quantile regression looks like using Stata’s **qreg** command. The
coefficient and standard error for **acs_k3** are considerably different when
using **qreg** as compared to OLS using the **regress** command
(the coefficients are 1.2 vs 6.9 and the standard errors are 6.4 vs 4.3). The coefficients
and standard errors for the other variables are also different, but not as dramatically
different. Nevertheless, the **qreg** results indicate that, like the OLS
results, all of the variables except **acs_k3** are significant.

qreg api00 acs_k3 acs_46 full enrollMedian regression Number of obs = 395 Raw sum of deviations 48534 (about 643) Min sum of deviations 36268.11 Pseudo R2 = 0.2527 ------------------------------------------------------------------------------ api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- acs_k3 | 1.269065 6.470588 0.196 0.845 -11.45253 13.99066 acs_46 | 7.22408 2.228949 3.241 0.001 2.841821 11.60634 full | 5.323841 .6157333 8.646 0.000 4.113269 6.534413 enroll | -.1245734 .0397576 -3.133 0.002 -.2027395 -.0464073 _cons | 17.15049 125.4396 0.137 0.891 -229.4719 263.7729 ------------------------------------------------------------------------------

The **qreg** command has even fewer diagnostic options than **rreg** does. About
the only values we can obtain are the predicted values and the residuals.

predict p if e(sample)(option xb assumed; fitted values) (5 missing values generated)predict r if e(sample), r(5 missing values generated)graph r p, yline(0)

Stata has three additional commands that can do quantile regression.

**iqreg** estimates interquantile regressions, regressions of the difference in
quantiles. The estimated variance-covariance matrix of the estimators is obtained via
bootstrapping.

**sqreg** estimates simultaneous-quantile regression. It produces the same
coefficients as qreg for each quantile. **sqreg** obtains a bootstrapped
variance-covariance matrix of the estimators that includes between-quantiles blocks. Thus,
one can test and construct confidence intervals comparing coefficients describing
different quantiles.

**bsqreg** is the same as sqreg with one quantile. **sqreg** is, therefore,
faster than bsqreg.

**4.2 Constrained Linear Regression**

Let’s begin this section by looking at a regression model using the ** hsb2** dataset.
The **hsb2** file is a sample of 200 cases from the Highschool and Beyond
Study (Rock, Hilton, Pollack, Ekstrom & Goertz, 1985). It includes the
following variables: **id**, **female**, **race**, **ses**, **schtyp**,
**program**, **read**, **write**, **math**, **science** and **socst**.
The variables **read**, **write**, **math**, **science** and **socst**
are the results of standardized tests on reading, writing, math, science and
social studies (respectively), and the variable **female** is coded 1 if
female, 0 if male.

use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/hsb2

Let’s start by doing an OLS regression where we predict **socst** score
from **read**, **write**, **math**, **science**
and **female** (gender)

regress socst read write math science femaleSource | SS df MS Number of obs = 200 ---------+------------------------------ F( 5, 194) = 35.44 Model | 10949.2575 5 2189.8515 Prob > F = 0.0000 Residual | 11986.9375 194 61.7883375 R-squared = 0.4774 ---------+------------------------------ Adj R-squared = 0.4639 Total | 22936.195 199 115.257261 Root MSE = 7.8606 ------------------------------------------------------------------------------ socst | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | .3784046 .0806267 4.693 0.000 .2193872 .537422 write | .3858743 .0889283 4.339 0.000 .2104839 .5612646 math | .1303258 .0893767 1.458 0.146 -.045949 .3066006 science | -.0333925 .0818741 -0.408 0.684 -.1948702 .1280852 female | -.3532648 1.245372 -0.284 0.777 -2.809471 2.102941 _cons | 7.339342 3.650243 2.011 0.046 .1400864 14.5386 ------------------------------------------------------------------------------

Notice that the coefficients for **read** and **write** are very similar, which
makes sense since they are both measures of language ability. Also, the coefficients
for **math** and **science** are similar (in that they are both
not significantly different from 0). Suppose that we have a theory that suggests that **read**
and **write** should have equal coefficients, and that **math** and **science**
should have equal coefficients as well. We can test the equality
of the coefficients using the **test** command.

test read=write( 1) read - write = 0.0 F( 1, 194) = 0.00 Prob > F = 0.9558

We can also do this with the **testparm** command, which is especially
useful if you were testing whether 3 or more coefficients were equal.

testparm read write, equal( 1) - read + write = 0.0 F( 1, 194) = 0.00 Prob > F = 0.9558

Both of these results indicate that there is no significant difference in the
coefficients for the reading and writing scores. Since it appears that the coefficients
for **math** and **science** are also equal, let’s test the
equality of those as well (using the **testparm** command).

testparm math science, equal( 1) - math + science = 0.0 F( 1, 194) = 1.45 Prob > F = 0.2299

Let’s now perform both of these tests together, simultaneously testing that the
coefficient for **read** equals **write** and **math**
equals **science**. We do this using two **test**
commands, the second using the **accum** option to accumulate the first test
with the second test to test both of these hypotheses together.

test read=write( 1) read - write = 0.0 F( 1, 194) = 0.00 Prob > F = 0.9558test math=science, accum( 1) read - write = 0.0 ( 2) math - science = 0.0 F( 2, 194) = 0.73 Prob > F = 0.4852

Note this second test has 2 df, since it is testing both of the hypotheses listed, and
this test is not significant, suggesting these pairs of coefficients are not significantly
different from each other. We can estimate regression models where we constrain
coefficients to be equal to each other. For example, let’s begin on a limited scale
and constrain **read** to equal **write**. First, we will define a constraint and
then we will run the **cnsreg** command.

constraint define 1 read = write. cnsreg socst read write math science female, constraint(1)Constrained linear regression Number of obs = 200 F( 4, 195) = 44.53 Prob > F = 0.0000 Root MSE = 7.8404 ( 1) read - write = 0.0 ------------------------------------------------------------------------------ socst | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | .3818488 .0513899 7.430 0.000 .2804975 .4832002 write | .3818488 .0513899 7.430 0.000 .2804975 .4832002 math | .1303036 .0891471 1.462 0.145 -.0455126 .3061197 science | -.0332762 .0816379 -0.408 0.684 -.1942827 .1277303 female | -.3296237 1.167364 -0.282 0.778 -2.631904 1.972657 _cons | 7.354148 3.631175 2.025 0.044 .1927294 14.51557 ------------------------------------------------------------------------------

Notice that the coefficients for **read** and **write** are identical, along with
their standard errors, t-test, etc. Also note that the degrees of freedom for the F test
is four, not five, as in the OLS model. This is because only one coefficient is estimated
for **read** and **write**, estimated like a single variable equal to the sum of
their values. Notice also that the Root MSE is slightly higher for the constrained
model, but only slightly higher. This is because we have forced the model to
estimate the coefficients for **read** and **write** that are
not as good at minimizing the Sum of Squares Error (the coefficients that would minimize
the SSE would be the coefficients from the unconstrained model).

Next, we will define a second constraint, setting **math** equal to **science**.
We will also abbreviate the constraints option to **c**.

constraint define 2 math = science. cnsreg socst read write math science female, c(1 2)Constrained linear regression Number of obs = 200 F( 3, 196) = 58.75 Prob > F = 0.0000 Root MSE = 7.8496 ( 1) read - write = 0.0 ( 2) math - science = 0.0 ------------------------------------------------------------------------------ socst | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | .3860376 .0513322 7.520 0.000 .2848033 .4872719 write | .3860376 .0513322 7.520 0.000 .2848033 .4872719 math | .0428053 .0519238 0.824 0.411 -.0595958 .1452064 science | .0428053 .0519238 0.824 0.411 -.0595958 .1452064 female | -.200875 1.163831 -0.173 0.863 -2.496114 2.094364 _cons | 7.505658 3.633225 2.066 0.040 .3404249 14.67089 ------------------------------------------------------------------------------

Now the coefficients for **read** = **write** and **math** = **science**
and the degrees of freedom for the model has dropped to three. Again, the Root MSE
is slightly larger than in the prior model, but we should emphasize only very slightly
larger. If indeed the population coefficients for **read** = **write**
and **math** = **science**, then these combined (constrained) estimates
may be more stable and generalize better to other samples. So although these
estimates may lead to slightly higher standard error of prediction in this sample, they
may generalize better to the population from which they came.

4.3 Regression with Censored or Truncated Data

Analyzing data that contain censored values or are truncated is common in many research disciplines. According to Hosmer and Lemeshow (1999), a censored value is one whose value is incomplete due to random factors for each subject. A truncated observation, on the other hand, is one which is incomplete due to a selection process in the design of the study.

We will begin by looking at analyzing data with censored values.

**4.3.1 Regression with Censored Data**

In this example we have a variable called **acadindx** which is a weighted
combination of standardized test scores and academic grades. The maximum possible score on
**acadindx** is 200 but it is clear that the 16 students who scored 200 are not exactly
equal in their academic abilities. In other words, there is variability in academic
ability that is not being accounted for when students score 200 on acadindx. The variable **acadindx**
is said to be censored, in particular, it is right censored.

Let’s look at the example. We will begin by looking at a description of the data, some descriptive statistics, and correlations among the variables.

use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/acadindx(max possible on acadindx is 200)describeContains data from acadindx.dta obs: 200 max possible on acadindx is 200 vars: 5 19 Jan 2001 20:14 size: 4,800 (99.7% of memory free) ------------------------------------------------------------------------------- 1. id float %9.0g 2. female float %9.0g fl 3. reading float %9.0g 4. writing float %9.0g 5. acadindx float %9.0g academic index -------------------------------------------------------------------------------summarizeVariable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- id | 200 100.5 57.87918 1 200 female | 200 .545 .4992205 0 1 reading | 200 52.23 10.25294 28 76 writing | 200 52.775 9.478586 31 67 acadindx | 200 172.185 16.8174 138 200count if acadindx==20016corr acadindx female reading writing(obs=200) | acadindx female reading writing ---------+------------------------------------ acadindx | 1.0000 female | -0.0821 1.0000 reading | 0.7131 -0.0531 1.0000 writing | 0.6626 0.2565 0.5968 1.0000

Now, let’s run a standard OLS regression on the data and generate predicted scores in **p1**.

regress acadindx female reading writingSource | SS df MS Number of obs = 200 ---------+------------------------------ F( 3, 196) = 107.40 Model | 34994.282 3 11664.7607 Prob > F = 0.0000 Residual | 21287.873 196 108.611597 R-squared = 0.6218 ---------+------------------------------ Adj R-squared = 0.6160 Total | 56282.155 199 282.824899 Root MSE = 10.422 ------------------------------------------------------------------------------ acadindx | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- female | -5.832498 1.58821 -3.672 0.000 -8.964671 -2.700324 reading | .7184174 .0931493 7.713 0.000 .5347138 .902121 writing | .7905706 .1040996 7.594 0.000 .5852715 .9958696 _cons | 96.11841 4.489562 21.409 0.000 87.26436 104.9725 ------------------------------------------------------------------------------predict p1(option xb assumed; fitted values)

The **tobit** command is one of the commands that can be used for regression with
censored data. The syntax of the command is similar to regress with the addition of the **ul**
option to indicate that the right censored value is 200. We will follow the **tobit**
command by predicting **p2** containing the **tobit** predicted values.

tobit acadindx female reading writing, ul(200)Tobit estimates Number of obs = 200 LR chi2(3) = 190.39 Prob > chi2 = 0.0000 Log likelihood = -718.06362 Pseudo R2 = 0.1171 ------------------------------------------------------------------------------ acadindx | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- female | -6.347316 1.692441 -3.750 0.000 -9.684943 -3.009688 reading | .7776857 .0996928 7.801 0.000 .5810837 .9742877 writing | .8111221 .110211 7.360 0.000 .5937773 1.028467 _cons | 92.73782 4.803441 19.307 0.000 83.26506 102.2106 ---------+-------------------------------------------------------------------- _se | 10.98973 .5817477 (Ancillary parameter) ------------------------------------------------------------------------------ Obs. summary: 184 uncensored observations 16 right-censored observations at acadindx>=200predict p2(option xb assumed; fitted values)

Summarizing the **p1** and **p2** scores shows that the **tobit** predicted
values have a larger standard deviation and a greater range of values.

summarize acadindx p1 p2Variable | Obs Mean Std. Dev. Min Max ---------+----------------------------------------------------- acadindx | 200 172.185 16.8174 138 200 p1 | 200 172.185 13.26087 142.3821 201.5311 p2 | 200 172.704 14.00292 141.2211 203.8541

When we look at a listing of **p1** and **p2** for all students who scored the
maximum of 200 on acadindx, we see that in every case the **tobit** predicted value is
greater than the OLS predicted value. These predictions represent an estimate of what the
variability would be if the values of **acadindx** could exceed 200.

list p1 p2 if acadindx==200p1 p2 32. 179.175 179.62 57. 192.6806 194.3291 68. 201.5311 203.8541 80. 191.8309 193.577 82. 188.1537 189.5627 88. 186.5725 187.9405 95. 195.9971 198.1762 100. 186.9333 188.1076 132. 197.5782 199.7984 136. 189.4592 191.1436 143. 191.1846 192.8327 157. 191.6145 193.4767 161. 180.2511 181.0082 169. 182.275 183.3667 174. 191.6145 193.4767 200. 187.6616 189.4211

Here is the syntax diagram for tobit:

tobit depvar [indepvars] [weight] [if exp] [in range], ll[(#)] ul[(#)] [ level(#) offset(varname) maximize_options ]

You can declare both lower and upper censored values. The censored values are fixed in that the same lower and upper values apply to all observations.

There are two other commands in Stata that allow you more flexibility in doing regression with censored data.

**cnreg** estimates a model in which the censored values may vary from observation
to observation.

**intreg** estimates a model where the response variable for each observation is
either point data, interval data, left-censored data, or right-censored data.

**4.3.2 Regression with Truncated Data**

Truncated data occurs when some observations are not included in the analysis because
of the value of the variable. We will illustrate analysis with truncation using the
dataset, **acadindx**, that was used in the previous section. If **acadindx** is no
longer loaded in memory you can get it with the following use command.

use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/acadindx(max possible on acadindx is 200)

Let’s imagine that in order to get into a special honors program, students need to
score at least 160 on **acadindx**. So we will drop all observations in which the value
of **acadindx** is less than 160.

drop if acadindx <= 160(56 observations deleted)

Now, let’s estimate the same model that we used in the section on censored data, only
this time we will pretend that a 200 for **acadindx** is not censored.

regress acadindx female reading writingSource | SS df MS Number of obs = 144 -------------+------------------------------ F( 3, 140) = 33.01 Model | 8074.79638 3 2691.59879 Prob > F = 0.0000 Residual | 11416.3633 140 81.5454524 R-squared = 0.4143 -------------+------------------------------ Adj R-squared = 0.4017 Total | 19491.1597 143 136.301816 Root MSE = 9.0303 ------------------------------------------------------------------------------ acadindx | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- female | -5.238495 1.615632 -3.24 0.001 -8.432687 -2.044303 reading | .4411066 .0963504 4.58 0.000 .2506166 .6315965 writing | .5873287 .1150828 5.10 0.000 .3598037 .8148537 _cons | 125.6355 5.891559 21.32 0.000 113.9875 137.2834 ------------------------------------------------------------------------------

It is clear that the estimates of the coefficients are distorted due to the fact that
56 observations are no longer in the dataset. This amounts to restriction of range on both
the response variable and the predictor variables. For example, the coefficient for
writing dropped from .79 to .59. What this means is that if our goal is to find the
relation between **adadindx** and the predictor variables in the population, then the
truncation of **acadindx** in our sample is going to lead to biased estimates. A better
approach to analyzing these data is to use truncated regression. In Stata this can be
accomplished using the **truncreg** command where the **ll** option is used to
indicate the lower limit of **acadindx** scores used in the truncation.

truncreg acadindx female reading writing, ll(160)(note: 0 obs. truncated) Truncated regression Limit: lower = 160 Number of obs = 144 upper = +inf Wald chi2(3) = 77.87 Log likelihood = -510.00768 Prob > chi2 = 0.0000 ------------------------------------------------------------------------------ acadindx | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- eq1 | female | -6.099602 1.925245 -3.17 0.002 -9.873012 -2.326191 reading | .5181789 .1168288 4.44 0.000 .2891986 .7471592 writing | .7661636 .15262 5.02 0.000 .4670339 1.065293 _cons | 110.2892 8.673849 12.72 0.000 93.28877 127.2896 -------------+---------------------------------------------------------------- sigma | _cons | 9.803572 .721646 13.59 0.000 8.389172 11.21797 ------------------------------------------------------------------------------

The coefficients from the **truncreg** command are closer to the OLS results, for
example the coefficient for **writing** is .77 which is closer to the OLS
results of .79. However, the results are still somewhat different on the other
variables, for example the coefficient for reading is .52 in the **truncreg**
as compared to .72 in the original OLS with the unrestricted data, and better than the OLS
estimate of .47 with the restricted data. While **truncreg **may
improve the estimates on a restricted data file as compared to OLS, it is certainly no
substitute for analyzing the complete unrestricted data file.

**4.4 Regression with Measurement Error**

As you will most likely recall, one of the assumptions of regression is that the
predictor variables are measured without error. The problem is that measurement error in
predictor variables leads to under estimation of the regression coefficients. Stata’s **eivreg**
command takes measurement error into account when estimating the coefficients for the model.

Let’s look at a regression using the hsb2 dataset.

use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/hsb2 regress write read femaleSource | SS df MS Number of obs = 200 ---------+------------------------------ F( 2, 197) = 77.21 Model | 7856.32118 2 3928.16059 Prob > F = 0.0000 Residual | 10022.5538 197 50.8759077 R-squared = 0.4394 ---------+------------------------------ Adj R-squared = 0.4337 Total | 17878.875 199 89.843593 Root MSE = 7.1327 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | .5658869 .0493849 11.459 0.000 .468496 .6632778 female | 5.486894 1.014261 5.410 0.000 3.48669 7.487098 _cons | 20.22837 2.713756 7.454 0.000 14.87663 25.58011 ------------------------------------------------------------------------------

The predictor read is a standardized test score. Every test has measurement error. We
don’t know the exact reliability of **read**, but using .9 for the reliability would
probably not be far off. We will now estimate the same regression model with the Stata **eivreg**
command, which stands for errors-in-variables regression.

eivreg write read female, r(read .9)assumed errors-in-variables regression variable reliability ------------------------ Number of obs = 200 read 0.9000 F( 2, 197) = 83.41 * 1.0000 Prob > F = 0.0000 R-squared = 0.4811 Root MSE = 6.86268 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | .6289607 .0528111 11.910 0.000 .524813 .7331085 female | 5.555659 .9761838 5.691 0.000 3.630548 7.48077 _cons | 16.89655 2.880972 5.865 0.000 11.21504 22.57805

Note that the F-ratio and the R^{2} increased along with the regression
coefficient for **read**. Additionally, there is an increase in the standard error for
read.

Now, let’s try a model with ** read**, ** math** and ** socst** as predictors. First, we will run a
standard OLS regression.

regress write read math socst femaleSource | SS df MS Number of obs = 200 ---------+------------------------------ F( 4, 195) = 64.37 Model | 10173.7036 4 2543.42591 Prob > F = 0.0000 Residual | 7705.17137 195 39.5136993 R-squared = 0.5690 ---------+------------------------------ Adj R-squared = 0.5602 Total | 17878.875 199 89.843593 Root MSE = 6.286 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | .2065341 .0640006 3.227 0.001 .0803118 .3327563 math | .3322639 .0651838 5.097 0.000 .2037082 .4608195 socst | .2413236 .0547259 4.410 0.000 .133393 .3492542 female | 5.006263 .8993625 5.566 0.000 3.232537 6.77999 _cons | 9.120717 2.808367 3.248 0.001 3.582045 14.65939 ------------------------------------------------------------------------------

Now, let’s try to account for the measurement error by using the following
reliabilities: **read** – .9, **math** – .9, **socst** – .8.

eivreg write read math socst female, r(read .9 math .9 socst .8)assumed errors-in-variables regression variable reliability ------------------------ Number of obs = 200 read 0.9000 F( 4, 195) = 70.17 math 0.9000 Prob > F = 0.0000 socst 0.8000 R-squared = 0.6047 * 1.0000 Root MSE = 6.02062 ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | .1506668 .0936571 1.609 0.109 -.0340441 .3353776 math | .350551 .0850704 4.121 0.000 .1827747 .5183273 socst | .3327103 .0876869 3.794 0.000 .159774 .5056467 female | 4.852501 .8730646 5.558 0.000 3.13064 6.574363 _cons | 6.37062 2.868021 2.221 0.027 .7142973 12.02694 ------------------------------------------------------------------------------

Note that the overall F and R^{2} went up, but that the coefficient for read is
no longer statistically significant.

**4.5 Multiple Equation Regression Models**

If a dataset has enough variables we may want to estimate more than one regression model. For example, we may want to predict y1 from x1 and also predict y2 from x2. Even though there are no variables in common these two models are not independent of one another because the data come from the same subjects. This is an example of one type of multiple equation regression known as seemingly unrelated regression. We can estimate the coefficients and obtain standard errors taking into account the correlated errors in the two models. An important feature of multiple equation models is that we can test predictors across equations.

Another example of multiple equation regression is if we wished to predict y1, y2 and y3 from x1 and x2. This is a three equation system, known as multivariate regression, with the same predictor variables for each model. Again, we have the capability of testing coefficients across the different equations.

Multiple equation models are a powerful extension to our data analysis tool kit.

4.5.1 Seemingly Unrelated Regression

Let’s continue using the **hsb2** data file to illustrate the use of
seemingly unrelated regression. You can load it into memory again if it has been
cleared out.

use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/hsb2(highschool and beyond (200 cases))

This time let’s look at two regression models.

science = math female write = read female

It is the case that the errors (residuals) from these two models would be correlated. This
would be true even if the predictor female were not found in both models. The errors would
be correlated because all of the values of the variables are collected on the same set of
observations. This is a situation tailor made for seemingly unrelated regression using the
**sureg** command. Here is our first model using OLS.

regress science math female<some output omitted> ------------------------------------------------------------------------------ science | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- math | .6631901 .0578724 11.460 0.000 .549061 .7773191 female | -2.168396 1.086043 -1.997 0.047 -4.310159 -.026633 _cons | 18.11813 3.167133 5.721 0.000 11.8723 24.36397 ------------------------------------------------------------------------------

And here is our second model using OLS.

regress write read female<some output omitted> ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | .5658869 .0493849 11.459 0.000 .468496 .6632778 female | 5.486894 1.014261 5.410 0.000 3.48669 7.487098 _cons | 20.22837 2.713756 7.454 0.000 14.87663 25.58011 ------------------------------------------------------------------------------

With the **sureg** command we can estimate both models simultaneously while
accounting for the correlated errors at the same time, leading to efficient estimates of
the coefficients and standard errors. By including the **corr** option with **sureg**
we can also obtain an estimate of the correlation between the errors of the two models.
Note that both the estimates of the coefficients and their standard errors are different
from the OLS model estimates shown above. The bottom of the output provides a
Breusch-Pagan test of
whether the residuals from the two equations are independent (in this case, we
would say the residuals were not independent, p=0.0407).

sureg (science math female) (write read female), corrSeemingly unrelated regression ------------------------------------------------------------------ Equation Obs Parms RMSE "R-sq" Chi2 P ------------------------------------------------------------------ science 200 2 7.595793 0.4085 125.4142 0.0000 write 200 2 7.085844 0.4383 144.2683 0.0000 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- science | math | .6251409 .0570948 10.949 0.000 .5132373 .7370446 female | -2.189344 1.077862 -2.031 0.042 -4.301914 -.0767744 _cons | 20.13265 3.125775 6.441 0.000 14.00624 26.25905 ---------+-------------------------------------------------------------------- write | read | .5354838 .0487212 10.991 0.000 .4399919 .6309757 female | 5.453748 1.006609 5.418 0.000 3.48083 7.426665 _cons | 21.83439 2.67851 8.152 0.000 16.5846 27.08417 ------------------------------------------------------------------------------ Correlation matrix of residuals: science write science 1.0000 write 0.1447 1.0000 Breusch-Pagan test of independence: chi2(1) = 4.188, Pr = 0.0407

Now that we have estimated our models let’s test the predictor variables. The test for **female**
combines information from both models. The tests for **math** and **read** are
actually equivalent to the z-tests above except that the results are displayed as
chi-square tests.

test female( 1) [science]female = 0.0 ( 2) [write]female = 0.0 chi2( 2) = 37.45 Prob > chi2 = 0.0000test math( 1) [science]math = 0.0 chi2( 1) = 119.88 Prob > chi2 = 0.0000test read( 1) [write]read = 0.0 chi2( 1) = 120.80 Prob > chi2 = 0.0000

Now, let’s estimate 3 models where we use the same predictors in each model as shown below.

read = female prog1 prog3 write = female prog1 prog3 math = female prog1 prog3

If you no longer have the dummy variables for **prog**, you can recreate them using
the tabulate command.

tabulate prog, gen(prog)

Let’s first estimate these three models using 3 OLS regressions.

regress read female prog1 prog3<some output omitted> ------------------------------------------------------------------------------ read | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- female | -1.208582 1.327672 -0.910 0.364 -3.826939 1.409774 prog1 | -6.42937 1.665893 -3.859 0.000 -9.714746 -3.143993 prog3 | -9.976868 1.606428 -6.211 0.000 -13.14497 -6.808765 _cons | 56.8295 1.170562 48.549 0.000 54.52099 59.13802 ------------------------------------------------------------------------------regress write female prog1 prog3<some output omitted> ------------------------------------------------------------------------------ write | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- female | 4.771211 1.181876 4.037 0.000 2.440385 7.102037 prog1 | -4.832929 1.482956 -3.259 0.001 -7.757528 -1.908331 prog3 | -9.438071 1.430021 -6.600 0.000 -12.25827 -6.617868 _cons | 53.62162 1.042019 51.459 0.000 51.56661 55.67662 ------------------------------------------------------------------------------regress math female prog1 prog3<some output omitted> ------------------------------------------------------------------------------ math | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- female | -.6737673 1.176059 -0.573 0.567 -2.993122 1.645587 prog1 | -6.723945 1.475657 -4.557 0.000 -9.634149 -3.81374 prog3 | -10.32168 1.422983 -7.254 0.000 -13.128 -7.515352 _cons | 57.10551 1.03689 55.074 0.000 55.06062 59.1504 ------------------------------------------------------------------------------

These regressions provide fine estimates of the coefficients and standard errors but
these results assume the residuals of each analysis are completely independent of the
others. Also, if we wish to test **female**, we would have to do it three times and
would not be able to combine the information from all three tests into a single overall
test.

Now let’s use **sureg** to estimate the same models. Since all 3 models have
the same predictors, we can use the syntax as shown below which says that **read**,
**write** and **math** will each be predicted by **female**,
**prog1** and **prog3**. Note that the coefficients are identical
in the OLS results above and the **sureg** results below, however the
standard errors are different, only slightly, due to the correlation among the residuals
in the multiple equations.

sureg (read write math = female prog1 prog3), corrSeemingly unrelated regression ------------------------------------------------------------------ Equation Obs Parms RMSE "R-sq" Chi2 P ------------------------------------------------------------------ read 200 3 9.254765 0.1811 44.24114 0.0000 write 200 3 8.238468 0.2408 63.41908 0.0000 math 200 3 8.197921 0.2304 59.88479 0.0000 ------------------------------------------------------------------------------ | Coef. Std. Err. z P>|z| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | female | -1.208582 1.314328 -0.920 0.358 -3.784618 1.367454 prog1 | -6.42937 1.64915 -3.899 0.000 -9.661645 -3.197095 prog3 | -9.976868 1.590283 -6.274 0.000 -13.09377 -6.859971 _cons | 56.8295 1.158797 49.042 0.000 54.5583 59.1007 ---------+-------------------------------------------------------------------- write | female | 4.771211 1.169997 4.078 0.000 2.478058 7.064363 prog1 | -4.832929 1.468051 -3.292 0.001 -7.710257 -1.955602 prog3 | -9.438071 1.415648 -6.667 0.000 -12.21269 -6.663451 _cons | 53.62162 1.031546 51.982 0.000 51.59982 55.64341 ---------+-------------------------------------------------------------------- math | female | -.6737673 1.164239 -0.579 0.563 -2.955634 1.608099 prog1 | -6.723945 1.460826 -4.603 0.000 -9.587111 -3.860778 prog3 | -10.32168 1.408681 -7.327 0.000 -13.08264 -7.560711 _cons | 57.10551 1.026469 55.633 0.000 55.09367 59.11735 ------------------------------------------------------------------------------ Correlation matrix of residuals: read write math read 1.0000 write 0.5519 1.0000 math 0.5774 0.5577 1.0000 Breusch-Pagan test of independence: chi2(3) = 189.811, Pr = 0.0000

In addition to getting more appropriate standard errors, **sureg** allows
us to test the effects of the predictors across the equations. We can test the
hypothesis that the coefficient for **female** is 0 for all three outcome
variables, as shown below.

test female( 1) [read]female = 0.0 ( 2) [write]female = 0.0 ( 3) [math]female = 0.0 chi2( 3) = 35.59 Prob > chi2 = 0.0000

We can also test the hypothesis that the coefficient for **female** is 0
for just **read** and **math**. Note that **[read]female**
means the coefficient for **female** for the outcome variable **read**.

test [read]female [math]female( 1) [read]female = 0.0 ( 2) [math]female = 0.0 chi2( 2) = 0.85 Prob > chi2 = 0.6541

We can also test the hypothesis that the coefficients for **prog1 **and **prog3**
are 0 for all three outcome variables, as shown below.

test prog1 prog3( 1) [read]prog1 = 0.0 ( 2) [write]prog1 = 0.0 ( 3) [math]prog1 = 0.0 ( 4) [read]prog3 = 0.0 ( 5) [write]prog3 = 0.0 ( 6) [math]prog3 = 0.0 chi2( 6) = 72.45 Prob > chi2 = 0.0000

**4.5.2 Multivariate Regression**

Let’s now use multivariate regression using the **mvreg** command to look
at the same analysis that we saw in the **sureg** example above,
estimating the following 3 models.

read = female prog1 prog3 write = female prog1 prog3 math = female prog1 prog3

If you don’t have the **hsb2** data file in memory, you can use it below
and then create the dummy variables for **prog1** – **prog3**.

use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/hsb2tabulate prog, gen(prog)<output omitted>

Below we use **mvreg** to predict **read**, ** write** and **math**
from **female**, ** prog1** and **prog3**. Note that the top part of
the output is similar to the **sureg** output in that it gives an overall
summary of the model for each outcome variable, however the results are somewhat different
and the **sureg** uses a **Chi-Square** test for the overall fit
of the model, and **mvreg** uses an **F-test**. The lower part
of the output appears similar to the **sureg** output; however, when you
compare the standard errors you see that the results are not the same. These standard errors
correspond to the OLS standard errors, so these results below do not take into account the
correlations among the residuals (as do the **sureg** results).

mvreg read write math = female prog1 prog3Equation Obs Parms RMSE "R-sq" F P ------------------------------------------------------------------ read 200 4 9.348725 0.1811 14.45211 0.0000 write 200 4 8.32211 0.2408 20.7169 0.0000 math 200 4 8.281151 0.2304 19.56237 0.0000 ------------------------------------------------------------------------------ | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------------------- read | female | -1.208582 1.327672 -0.910 0.364 -3.826939 1.409774 prog1 | -6.42937 1.665893 -3.859 0.000 -9.714746 -3.143993 prog3 | -9.976868 1.606428 -6.211 0.000 -13.14497 -6.808765 _cons | 56.8295 1.170562 48.549 0.000 54.52099 59.13802 ---------+-------------------------------------------------------------------- write | female | 4.771211 1.181876 4.037 0.000 2.440385 7.102037 prog1 | -4.832929 1.482956 -3.259 0.001 -7.757528 -1.908331 prog3 | -9.438071 1.430021 -6.600 0.000 -12.25827 -6.617868 _cons | 53.62162 1.042019 51.459 0.000 51.56661 55.67662 ---------+-------------------------------------------------------------------- math | female | -.6737673 1.176059 -0.573 0.567 -2.993122 1.645587 prog1 | -6.723945 1.475657 -4.557 0.000 -9.634149 -3.81374 prog3 | -10.32168 1.422983 -7.254 0.000 -13.128 -7.515352 _cons | 57.10551 1.03689 55.074 0.000 55.06062 59.1504 ------------------------------------------------------------------------------

Now, let’s **test female**. Note, that **female** was statistically significant
in only one of the three equations. Using the test command after **mvreg** allows us to
test **female** across all three equations simultaneously. And, guess what? It is
significant. This is consistent with what we found using **sureg** (except
that **sureg** did this test using a **Chi-Square** test).

test female( 1) [read]female = 0.0 ( 2) [write]female = 0.0 ( 3) [math]female = 0.0 F( 3, 196) = 11.63 Prob > F = 0.0000

We can also test **prog1** and **prog3**, both separately and combined. Remember
these are multivariate tests.

test prog1( 1) [read]prog1 = 0.0 ( 2) [write]prog1 = 0.0 ( 3) [math]prog1 = 0.0 F( 3, 196) = 7.72 Prob > F = 0.0001test prog3( 1) [read]prog3 = 0.0 ( 2) [write]prog3 = 0.0 ( 3) [math]prog3 = 0.0 F( 3, 196) = 21.47 Prob > F = 0.0000test prog1 prog3( 1) [read]prog1 = 0.0 ( 2) [write]prog1 = 0.0 ( 3) [math]prog1 = 0.0 ( 4) [read]prog3 = 0.0 ( 5) [write]prog3 = 0.0 ( 6) [math]prog3 = 0.0 F( 6, 196) = 11.83 Prob > F = 0.0000

Many researchers familiar with traditional multivariate analysis may not recognize the
tests above. They don’t see Wilks’ Lambda, Pillai’s Trace or the Hotelling-Lawley Trace
statistics, statistics that they are familiar with. It is possible to obtain these
statistics using the **mvtest** command written by David E. Moore of the University of
Cincinnati. **mvtest** , which UCLA updated to work with Stata 6 and above,
can be downloaded over the internet like this.

net from https://stats.idre.ucla.edu/stat/stata/ado/analysis net install mvtest

Now that we have downloaded it, we can use it like this.

mvtest femaleMULTIVARIATE TESTS OF SIGNIFICANCE Multivariate Test Criteria and Exact F Statistics for the Hypothesis of no Overall "female" Effect(s) S=1 M=.5 N=96 Test Value F Num DF Den DF Pr > F Wilks' Lambda 0.84892448 11.5081 3 194.0000 0.0000 Pillai's Trace 0.15107552 11.5081 3 194.0000 0.0000 Hotelling-Lawley Trace 0.17796108 11.5081 3 194.0000 0.0000mvtest prog1 prog3MULTIVARIATE TESTS OF SIGNIFICANCE Multivariate Test Criteria and Exact F Statistics for the Hypothesis of no Overall "prog1 prog3" Effect(s) S=2 M=0 N=96 Test Value F Num DF Den DF Pr > F Wilks' Lambda 0.73294667 10.8676 6 388.0000 0.0000 Pillai's Trace 0.26859190 10.0834 6 390.0000 0.0000 Hotelling-Lawley Trace 0.36225660 11.6526 6 386.0000 0.0000

We will end with an **mvtest** including all of the predictor variables. This is an
overall multivariate test of the model.

mvtest female prog1 prog3MULTIVARIATE TESTS OF SIGNIFICANCE Multivariate Test Criteria and Exact F Statistics for the Hypothesis of no Overall "female prog1 prog3" Effect(s) S=3 M=-.5 N=96 Test Value F Num DF Den DF Pr > F Wilks' Lambda 0.62308940 11.2593 9 472.2956 0.0000 Pillai's Trace 0.41696769 10.5465 9 588.0000 0.0000 Hotelling-Lawley Trace 0.54062431 11.5734 9 578.0000 0.0000

The **sureg** and **mvreg** commands both allow you to test
multi-equation models while taking into account the fact that the equations are not
independent. The **sureg** command allows you to get estimates for each
equation which adjust for the non-independence of the equations, and it allows you to
estimate equations which don’t necessarily have the same predictors. By contrast, **mvreg**
is restricted to equations that have the same set of predictors, and the estimates it
provides for the individual equations are the same as the OLS estimates. However, **mvreg**
(especially when combined with **mvtest**) allows you to perform more
traditional multivariate tests of predictors.

4.6 Summary

This chapter has covered a variety of topics that go beyond ordinary least squares regression, but there still remain a variety of topics we wish we could have covered, including the analysis of survey data, dealing with missing data, panel data analysis, and more. And, for the topics we did cover, we wish we could have gone into even more detail. One of our main goals for this chapter was to help you be aware of some of the techniques that are available in Stata for analyzing data that do not fit the assumptions of OLS regression and some of the remedies that are possible. If you are a member of the UCLA research community, and you have further questions, we invite you to use our consulting services to discuss issues specific to your data analysis.

4.7 Self Assessment

1. Use the **crime** data file that was used in chapter 2 (use
https://stats.idre.ucla.edu/stat/stata/webbooks/reg/crime ) and look at a regression model
predicting **murder** from **pctmetro**, ** poverty**, ** pcths**
and **single** using OLS and make a **avplots** and a **lvr2plot**
following the regression. Are there any states that look worrisome? Repeat this analysis
using regression with robust standard errors and show **avplots**
for the analysis. Repeat the analysis using robust regression and make a
manually created **lvr2plot**. Also run the results using **qreg**.
Compare the results of the different analyses. Look at the weights from the
robust regression and comment on the weights.

2. Using the elemapi2 data file (use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2
) pretend that 550 is the lowest score that a school could achieve on **api00**,
i.e., create a new variable with the **api00** score and recode it
such that any score of 550 or below becomes 550. Use **meals**, **ell**
and **emer** to predict api scores using 1) OLS to predict the
original api score (before recoding) 2) OLS to predict the recoded score where
550 was the lowest value, and 3) using **tobit** to predict the
recoded api score indicating the lowest value is 550. Compare the results of
these analyses.

3. Using the elemapi2 data file (use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2
) pretend that only schools with api scores of 550 or higher were included in
the sample. Use **meals**, **ell** and **emer**
to predict api scores using 1) OLS to predict api from the full set of
observations, 2) OLS to predict api using just the observations with api scores
of 550 or higher, and 3) using **truncreg** to predict api using
just the observations where api is 550 or higher. Compare the results of these
analyses.

4. Using the **hsb2** data file (use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/hsb2
) predict **read** from **science**, ** socst**, ** math** and **write**.
Use the **testparm** and **test** commands to test
the equality of the coefficients for **science**, **socst**
and **math**. Use **cnsreg** to estimate a model where
these three parameters are equal.

5. Using the elemapi2 data file (use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2 ) consider the following 2 regression equations.

api00 = meals ell emer api99 = meals ell emer

Estimate the coefficients for these predictors in predicting **api00**
and **api99** taking into account the non-independence of the
schools. Test the overall contribution of each of the predictors in jointly
predicting api scores in these two years. Test whether the contribution of **emer**
is the same for **api00** and **api99**.

Click here for our answers to these self assessment questions.

**4.8 For more information**

- Stata Manuals
**[R] rreg****[R] qreg****[R] cnsreg****[R] tobit****[R] truncreg****[R] eivreg****[R] sureg****[R] mvreg****[U] 23 Estimation and post-estimation commands****[U] 29 Overview of model estimation in Stata**

- Web Links