In many ways, multivariate regression is similar to MANOVA: the hypotheses, the estimation methods, and the assumptions are all similar, and the multivariate test statistics are the same. The hypothesis tested by a multivariate regression is that there is a joint linear effect of the set of predictors on the set of responses. Hence, the null hypothesis is that all of the slope coefficients are simultaneously zero. Note that the “set” of predictors may contain no predictors or only one, but usually it contains more.
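In matrix form (standard notation, not specific to SAS), the model for n observations, p predictors, and q responses, together with this null hypothesis, can be written as

$$ \underbrace{Y}_{n \times q} \;=\; \underbrace{X}_{n \times (p+1)}\,\underbrace{B}_{(p+1) \times q} \;+\; \underbrace{E}_{n \times q}, \qquad H_0\colon B_1 = 0, $$

where $B_1$ is the $p \times q$ submatrix of slope coefficients, that is, all rows of $B$ except the intercept row.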
The basic assumptions of multivariate regression are 1) multivariate normality of the residuals, 2) homogeneous variances of the residuals conditional on the predictors, 3) a common covariance structure across observations, and 4) independent observations. Unfortunately, the first three assumptions are very difficult to test. Currently, many of the common statistical packages, such as SAS and SPSS, do not offer a test of multivariate normality. However, you can see if your data are close to being multivariate normal by creating some graphs, as in the sketch below. First, check whether the residuals for each dependent variable are normal by themselves; this is necessary, but not sufficient, for multivariate normality. Next, create scatterplots of the residuals. You want the points on the graph to form an ellipse (as opposed to a V-shape, a wedge, or some other shape). Remember that a circle is a special form of an ellipse. You would like the points to line up in a “flattened” ellipse, because the dependent variables are supposed to be correlated for MANOVA or multivariate regression to be the analysis of choice, although this is not necessary for multivariate normality.

Regarding the second assumption, homogeneity of variances, several tests are available, but most of them are very sensitive to nonnormality. Fortunately, the F statistic is fairly robust against violations of this assumption. As for the third assumption, the covariance matrices are rarely equal in practice. Monte Carlo studies have shown that keeping the number of observations (subjects) per group approximately equal is an effective way of ensuring that violations of this assumption are not too problematic. Regarding the independence of observations, there is no statistical test; rather, it is an issue of methodology. Care should be taken to ensure that the observations are independent, because even small intraclass correlations can cause serious problems. For example, suppose an experimenter had three groups with 30 subjects per group and a small dependence between the observations, say an intraclass correlation of .10. The actual alpha level would be .4917, rather than the nominal .05.
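The graphical checks described above can be done in SAS along the following lines. This is a minimal sketch, assuming the hsb2 data set and model used in the example below; the output data set resids and the residual names r_read and r_socst are illustrative names, not part of the original example.

proc reg data = "g:\SAS\hsb2" noprint;
  model read socst = write math science;
  * one residual variable per dependent variable, in model order;
  output out = resids r = r_read r_socst;
run;
quit;

* normality of each residual separately: necessary, but not
  sufficient, for multivariate normality;
proc univariate data = resids normal;
  var r_read r_socst;
  qqplot r_read r_socst / normal(mu = est sigma = est);
run;

* the scatterplot of the residuals should form an ellipse
  (requires SAS 9.2 or later for proc sgplot);
proc sgplot data = resids;
  scatter x = r_read y = r_socst;
run;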
If all of these assumptions are met, then the coefficients will be unbiased, the least-squares estimates will have minimum variance, and the relationships among the coefficients will reflect the relationships among the predictors. Furthermore, a multivariate hypothesis test will account for the relationship among the coefficients, whereas a univariate F test would not.
With all of this in mind, let’s try a multivariate multiple regression. We will use the hsb2 data set for our example, with read and socst as our dependent variables and write, math, and science as our independent variables. The proc reg statement is the same as it would be in a univariate regression, but the model statement is a little different: we now have two (we could have more) dependent variables listed before the equals sign. Also, we have included the mtest statement, which is used to test hypotheses in multivariate regression. If no equations are listed on the mtest statement, SAS tests the hypothesis that all coefficients except the intercept are zero. You can specify some options on the mtest statement, including canprint, which will print the canonical correlations for the hypothesis combinations and the dependent variable combinations. The details option will display the M matrix, and the print option will display the H and E matrices.
proc reg data = "g:\SAS\hsb2";
  model read socst = write math science;
  mtest / details print;
run;
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: read

Analysis of Variance

                           Sum of          Mean
Source             DF     Squares        Square    F Value    Pr > F
Model               3       11313    3771.09916      76.94    <.0001
Error             196  9606.12253      49.01083
Corrected Total   199       20919

Root MSE          7.00077    R-Square    0.5408
Dependent Mean   52.23000    Adj R-Sq    0.5338
Coeff Var        13.40374

Parameter Estimates

                    Parameter    Standard
Variable     DF      Estimate       Error    t Value    Pr > |t|
Intercept     1       4.36993     3.20878       1.36      0.1748
write         1       0.23767     0.06969       3.41      0.0008
math          1       0.37840     0.07463       5.07      <.0001
science       1       0.29693     0.06763       4.39      <.0001

The REG Procedure
Model: MODEL1
Dependent Variable: socst

Analysis of Variance

                           Sum of          Mean
Source             DF     Squares        Square    F Value    Pr > F
Model               3  9551.66620    3183.88873      46.62    <.0001
Error             196       13385      68.28841
Corrected Total   199       22936

Root MSE          8.26368    R-Square    0.4164
Dependent Mean   52.40500    Adj R-Sq    0.4075
Coeff Var        15.76888

Parameter Estimates

                    Parameter    Standard
Variable     DF      Estimate       Error    t Value    Pr > |t|
Intercept     1       8.86989     3.78763       2.34      0.0202
write         1       0.46567     0.08227       5.66      <.0001
math          1       0.27630     0.08810       3.14      0.0020
science       1       0.08512     0.07984       1.07      0.2877

The REG Procedure
Model: MODEL1
Multivariate Test 1

                  L Ginv(X'X) L'                              LB-cj
   0.0000991078  -0.000042904  -0.000028518    0.2376705687  0.4656741023
   -0.000042904  0.0001136529  -0.000044399    0.3784014963  0.2763008055
   -0.000028518  -0.000044399  0.0000933347    0.2969346843  0.0851168364

              Inv(L Ginv(X'X) L')                       Inv()(LB-cj)
      17878.875     10911.025      10653.25      11541.35     12247.225
      10911.025     17465.795      11642.35      12659.33     10897.755
       10653.25      11642.35       19507.5       12729.9       9838.15

Error Matrix (E)
   9606.1225306  3657.5503071
   3657.5503071  13384.528803

Hypothesis Matrix (H)
   11313.297469  9955.8196929
   9955.8196929  9551.6661967

Hypothesis + Error Matrix (T)
       20919.42      13613.37
       13613.37     22936.195

Eigenvectors
    0.004986    0.002488
   -0.007281    0.008053

Eigenvalues
    0.587507    0.051687

Multivariate Statistics and F Approximations
S=2    M=0    N=96.5

Statistic                        Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda               0.39117291      38.93         6       390    <.0001
Pillai's Trace              0.63919333      30.69         6       392    <.0001
Hotelling-Lawley Trace      1.47878554      47.94         6    258.23    <.0001
Roy's Greatest Root         1.42428180      93.05         3       196    <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.
Looking at the very bottom of the output, we can see that the overall model is statistically significant. The first half of the output gives the univariate results. Here we see that with the dependent variable read alone, the overall model is statistically significant, as is each of the predictors. When we look at the univariate results for socst, we see that the overall model is statistically significant, as are the predictors write and math, but not science. In other words, the multivariate tests tell us that the set of predictors accounts for a statistically significant portion of the variance in the set of dependent variables, and the univariate tests break this down for us so that we can see where the significant effects are.
Let’s run the same model again, but this time, we will specify some hypotheses to be tested on the mtest statement. In the first mtest statement, we will test the hypothesis that the parameter for write is the same for read and socst. In the second mtest statement, we will test the hypothesis that the parameter for science is the same for read and socst. You will notice that, as with test statements in other procs, we can use a label before the statement so that it is labeled in the output.
proc reg data = "g:\SAS\hsb2";
  model read socst = write math science;
  write:   mtest read - socst, write / details print;
  science: mtest read - socst, science / details print;
run;
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: read

Analysis of Variance

                           Sum of          Mean
Source             DF     Squares        Square    F Value    Pr > F
Model               3       11313    3771.09916      76.94    <.0001
Error             196  9606.12253      49.01083
Corrected Total   199       20919

Root MSE          7.00077    R-Square    0.5408
Dependent Mean   52.23000    Adj R-Sq    0.5338
Coeff Var        13.40374

Parameter Estimates

                    Parameter    Standard
Variable     DF      Estimate       Error    t Value    Pr > |t|
Intercept     1       4.36993     3.20878       1.36      0.1748
write         1       0.23767     0.06969       3.41      0.0008
math          1       0.37840     0.07463       5.07      <.0001
science       1       0.29693     0.06763       4.39      <.0001

The REG Procedure
Model: MODEL1
Dependent Variable: socst

Analysis of Variance

                           Sum of          Mean
Source             DF     Squares        Square    F Value    Pr > F
Model               3  9551.66620    3183.88873      46.62    <.0001
Error             196       13385      68.28841
Corrected Total   199       22936

Root MSE          8.26368    R-Square    0.4164
Dependent Mean   52.40500    Adj R-Sq    0.4075
Coeff Var        15.76888

Parameter Estimates

                    Parameter    Standard
Variable     DF      Estimate       Error    t Value    Pr > |t|
Intercept     1       8.86989     3.78763       2.34      0.0202
write         1       0.46567     0.08227       5.66      <.0001
math          1       0.27630     0.08810       3.14      0.0020
science       1       0.08512     0.07984       1.07      0.2877

The REG Procedure
Model: MODEL1
Multivariate Test: write

Multivariate Statistics and Exact F Statistics
S=1    M=-0.5    N=97

Statistic                        Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda               0.96762141       6.56         1       196    0.0112
Pillai's Trace              0.03237859       6.56         1       196    0.0112
Hotelling-Lawley Trace      0.03346205       6.56         1       196    0.0112
Roy's Greatest Root         0.03346205       6.56         1       196    0.0112

The REG Procedure
Model: MODEL1
Multivariate Test: science

Multivariate Statistics and Exact F Statistics
S=1    M=-0.5    N=97

Statistic                        Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda               0.97024627       6.01         1       196    0.0151
Pillai's Trace              0.02975373       6.01         1       196    0.0151
Hotelling-Lawley Trace      0.03066616       6.01         1       196    0.0151
Roy's Greatest Root         0.03066616       6.01         1       196    0.0151
For the dependent variable read, the predictors write, math, and science are all statistically significant. For the dependent variable socst, the predictors write and math are significant, but science is not. The last two pages of the output indicate that both hypotheses regarding the parameters were statistically significant (F = 6.56, p = 0.0112 and F = 6.01, p = 0.0151, respectively). Hence, based on the first test (which we labeled write), we would conclude that the parameter for write is not the same for read and socst. Likewise, the second test (which we labeled science) suggests that the parameter for science is not the same for read and socst.
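To make explicit what these two mtest statements are testing, here are the hypotheses restated in terms of the parameter estimates from the output above (this is just a restatement for clarity, using the reported coefficients):

$$ H_0\colon \beta^{(read)}_{write} = \beta^{(socst)}_{write}, \qquad \hat\beta^{(read)}_{write} - \hat\beta^{(socst)}_{write} = 0.23767 - 0.46567 = -0.228, $$

$$ H_0\colon \beta^{(read)}_{science} = \beta^{(socst)}_{science}, \qquad \hat\beta^{(read)}_{science} - \hat\beta^{(socst)}_{science} = 0.29693 - 0.08512 = 0.212. $$

The significant F statistics indicate that neither of these estimated differences is plausibly zero.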