Below is the diagram of a simple structural equation model. The dependent variable is a latent variable, Acad, with three observed indicators: math, science and socst. There are two additional observed variables: the independent variable female and a mediator variable read. (Note: variables in squares are observed, or manifest, variables; those in circles are latent. The small circles with ε are error terms, whose variances are the residual variances.)
We will analyze this model using the sem command with the hsbdemo dataset.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

sem (Acad->math science socst)(Acad<-read)(read<-female), ///
    mean(female) var(female)

Endogenous variables

Observed:     read
Measurement:  math science socst
Latent:       Acad

Exogenous variables

Observed:     female

Fitting target model:

Iteration 0:   log likelihood =  -6737.783  (not concave)
[output omitted]
Iteration 13:  log likelihood = -2949.3343

Structural equation model                       Number of obs      =       200
Estimation method  = ml
Log likelihood     = -2949.3343

 ( 1)  [math]Acad = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  read <-    |
      female |  -1.090231   1.450201    -0.75   0.452    -3.932572     1.75211
       _cons |   52.82418   1.070598    49.34   0.000     50.72584    54.92251
  -----------+----------------------------------------------------------------
  Acad <-    |
        read |    .620298   .0461021    13.45   0.000     .5299395    .7106565
-------------+----------------------------------------------------------------
Measurement  |
  math <-    |
        Acad |          1  (constrained)
       _cons |   20.24684   2.456312     8.24   0.000     15.43255    25.06112
  -----------+----------------------------------------------------------------
  science <- |
        Acad |   .9874351   .0908878    10.86   0.000     .8092984    1.165572
       _cons |   19.85891   2.737663     7.25   0.000     14.49319    25.22463
  -----------+----------------------------------------------------------------
  socst <-   |
        Acad |   .9940776   .1030617     9.65   0.000     .7920804    1.196075
       _cons |   20.19871   3.181537     6.35   0.000     13.96301     26.4344
-------------+----------------------------------------------------------------
Mean         |
      female |       .545   .0352119    15.48   0.000      .475986     .614014
-------------+----------------------------------------------------------------
Variance     |
      e.math |   31.77659   4.630872                       23.8814    42.28192
   e.science |   43.37236   5.487556                      33.84678    55.57874
     e.socst |    59.7846   7.011562                      47.50727    75.23477
      e.read |   104.3024   10.43024                      85.73812    126.8862
      e.Acad |   15.30659   3.693936                      9.538023    24.56398
      female |    .247975   .0247975                      .2038392    .3016672
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(5)   =     12.25, Prob > chi2 = 0.0315

estat gof

----------------------------------------------------------------------------
Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Likelihood ratio     |
          chi2_ms(5) |     12.251   model vs. saturated
            p > chi2 |      0.032
         chi2_bs(10) |    361.012   baseline vs. saturated
            p > chi2 |      0.000
----------------------------------------------------------------------------
The estat gof output refers to three different models: 1) the model (the one we just ran), 2) the saturated model, and 3) the baseline model. Before we discuss the saturated and baseline models, let’s look a little closer at the above model.
In the above model we estimated 15 parameters: 2 structural coefficients, 1 structural intercept, 2 measurement coefficients (loadings), 3 measurement intercepts, 6 variances and 1 mean. The log likelihood for our model was -2949.3343.
The saturated model
Now let’s move on to the saturated model. A saturated model perfectly reproduces all of the variances, covariances and means of the observed variables. Here is a simple way to produce a saturated model.
sem (<-read math science socst female)

Exogenous variables

Observed:  read math science socst female

Fitting target model:

Iteration 0:   log likelihood = -2943.2087
Iteration 1:   log likelihood = -2943.2087

Structural equation model                       Number of obs      =       200
Estimation method  = ml
Log likelihood     = -2943.2087

------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Mean         |
        read |      52.23   .7231774    72.22   0.000      50.8126     53.6474
        math |     52.645   .6607911    79.67   0.000     51.34987    53.94013
     science |      51.85   .6983463    74.25   0.000     50.48127    53.21873
       socst |     52.405    .757235    69.21   0.000     50.92085    53.88915
      female |       .545   .0352119    15.48   0.000      .475986     .614014
-------------+----------------------------------------------------------------
Variance     |
        read |   104.5971   10.45971                      85.98041    127.2447
        math |   87.32898   8.732897                      71.78574    106.2377
     science |    97.5375    9.75375                      80.17731    118.6566
       socst |    114.681    11.4681                       94.2695     139.512
      female |    .247975   .0247975                      .2038392    .3016672
-------------+----------------------------------------------------------------
Covariance   |
  read       |
        math |   63.29665   8.105808     7.81   0.000     47.40956    79.18374
     science |    63.6495   8.441978     7.54   0.000     47.10353    80.19547
       socst |   68.06685   9.118222     7.46   0.000     50.19546    85.93824
      female |    -.27035   .3606283    -0.75   0.453    -.9771685    .4364685
  -----------+----------------------------------------------------------------
  math       |
     science |   58.21175   7.715717     7.54   0.000     43.08922    73.33428
       socst |   54.48877   8.057294     6.76   0.000     38.69677    70.28078
      female |   -.136525   .3291963    -0.41   0.678    -.7817379    .5086879
  -----------+----------------------------------------------------------------
  science    |
       socst |   49.19075   8.247856     5.96   0.000     33.02525    65.35625
      female |    -.62825   .3505821    -1.79   0.073    -1.315378    .0588783
  -----------+----------------------------------------------------------------
  socst      |
      female |    .279275   .3775977     0.74   0.460     -.460803    1.019353
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 =      .
A saturated model has the best fit possible since it perfectly reproduces all of the variances, covariances and means. That’s why the saturated model above has a chi-square of zero with zero degrees of freedom. Since you can’t do any better than a saturated model, it becomes the standard for comparison with the models that you estimate.
For the saturated model we estimated 20 parameters; 5 variances, 10 covariances and 5 means. You can compute the number of parameters in a saturated model of k observed variables by the formula k*(k+1)/2 + k. In our example, it is 5*(5+1)/2 + 5 = 20. The log likelihood for this model is -2943.2087.
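As a quick check of the counting formula, the arithmetic can be reproduced with Stata's display command. This is only a sketch; the local macro name k is our own choice and is not part of the original analysis.

```stata
* Number of free parameters in a saturated model with k observed variables:
* k*(k+1)/2 variances and covariances, plus k means
local k = 5
display `k'*(`k' + 1)/2 + `k'
```

With k = 5 this displays 20, matching the 5 variances, 10 covariances and 5 means in the output above.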
To test how well our model compares to a saturated model, we compute chi-square as minus two times the difference in the log likelihoods: -2*(-2949.3343 - (-2943.2087)) = 12.2512. The degrees of freedom for this chi-square is the difference in the number of parameters estimated in the two models (20 - 15 = 5). Thus, our model fits significantly worse than a saturated model (p = .0315). But that’s not surprising, since our model was only for demonstration purposes.
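The same likelihood-ratio test can be reproduced from the stored log likelihoods. The do-file sketch below assumes the two sem commands shown earlier and uses e(ll), the log likelihood that sem stores after estimation; the scalar names ll_model, ll_sat and chi2_ms are our own.

```stata
* Fit the target model quietly and keep its log likelihood
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
quietly sem (Acad->math science socst)(Acad<-read)(read<-female), ///
    mean(female) var(female)
scalar ll_model = e(ll)

* Fit the saturated model and keep its log likelihood
quietly sem (<-read math science socst female)
scalar ll_sat = e(ll)

* Likelihood-ratio chi-square and its p-value on 20 - 15 = 5 df
scalar chi2_ms = -2*(ll_model - ll_sat)
display "chi2(5) = " chi2_ms
display "p       = " chi2tail(5, chi2_ms)
```

The displayed values should agree with the LR test line at the bottom of the sem output, up to rounding.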
The baseline model
So, that brings us to the baseline model. This is defined in the Stata [SEM] Structural Equation Modeling Reference Manual as a model which includes the means and variances of all observed variables plus the covariances of all observed exogenous variables. Since there is only one observed exogenous variable, female, in our model, there will be no covariances in our baseline model.
sem (<-read math science socst female), ///
    covstr(read math science socst female, diagonal)

Exogenous variables

Observed:  read math science socst female

Fitting target model:

Iteration 0:   log likelihood = -3123.7147
Iteration 1:   log likelihood = -3123.7147

Structural equation model                       Number of obs      =       200
Estimation method  = ml
Log likelihood     = -3123.7147

 ( 1)  [cov(read,math)]_cons = 0
 ( 2)  [cov(read,science)]_cons = 0
 ( 3)  [cov(read,socst)]_cons = 0
 ( 4)  [cov(read,female)]_cons = 0
 ( 5)  [cov(math,science)]_cons = 0
 ( 6)  [cov(math,socst)]_cons = 0
 ( 7)  [cov(math,female)]_cons = 0
 ( 8)  [cov(science,socst)]_cons = 0
 ( 9)  [cov(science,female)]_cons = 0
 (10)  [cov(socst,female)]_cons = 0
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Mean         |
        read |      52.23   .7231774    72.22   0.000      50.8126     53.6474
        math |     52.645   .6607911    79.67   0.000     51.34987    53.94013
     science |      51.85   .6983463    74.25   0.000     50.48127    53.21873
       socst |     52.405    .757235    69.21   0.000     50.92085    53.88915
      female |       .545   .0352119    15.48   0.000      .475986     .614014
-------------+----------------------------------------------------------------
Variance     |
        read |   104.5971   10.45971                      85.98041    127.2447
        math |   87.32898   8.732898                      71.78574    106.2377
     science |    97.5375    9.75375                      80.17731    118.6566
       socst |    114.681    11.4681                       94.2695     139.512
      female |    .247975   .0247975                      .2038392    .3016672
-------------+----------------------------------------------------------------
Covariance   |
  read       |
        math |          0  (constrained)
     science |          0  (constrained)
       socst |          0  (constrained)
      female |          0  (constrained)
  -----------+----------------------------------------------------------------
  math       |
     science |          0  (constrained)
       socst |          0  (constrained)
      female |          0  (constrained)
  -----------+----------------------------------------------------------------
  science    |
       socst |          0  (constrained)
      female |          0  (constrained)
  -----------+----------------------------------------------------------------
  socst      |
      female |          0  (constrained)
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(10)  =    361.01, Prob > chi2 = 0.0000
For the baseline model we estimated 10 parameters: 5 variances and 5 means. In comparing this model with the saturated model there was a difference of 10 degrees of freedom, 20 - 10 = 10. Again, we compute chi-square as minus two times the difference in the log likelihoods, -2*(-3123.7147 - (-2943.2087)) = 361.012.
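The baseline chi-square can be checked the same way with display. This sketch simply plugs in the two log likelihoods reported in the output above; chi2tail() is Stata's upper-tail chi-square probability function.

```stata
* Baseline vs. saturated: minus two times the difference in log likelihoods
display -2*(-3123.7147 - (-2943.2087))

* Upper-tail probability on 20 - 10 = 10 degrees of freedom
display chi2tail(10, 361.012)
```

The first line displays 361.012. The second is a very small probability, which Stata's output tables round to 0.0000.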
Although our model did not fit all that well compared to the saturated model, the fit of the baseline model compared to the saturated model is much worse, with chi2(10) = 361.012, p < 0.0001.
The two chi-square values from estat gof, one for our model versus the saturated model and one for the baseline model versus the saturated model, help us to understand how well our model fits the data.
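One common way the two chi-squares are combined is in incremental fit indices such as the CFI and TLI. The sketch below uses the usual textbook definitions of these indices, which are not part of the original walkthrough, and plugs in the values reported by estat gof above; the scalar names are our own.

```stata
* Chi-squares and degrees of freedom reported by estat gof above
scalar chi2_ms = 12.251     // model vs. saturated
scalar df_ms   = 5
scalar chi2_bs = 361.012    // baseline vs. saturated
scalar df_bs   = 10

* Comparative fit index (usual textbook definition)
scalar cfi = 1 - max(chi2_ms - df_ms, 0) / ///
    max(chi2_bs - df_bs, chi2_ms - df_ms, 0)

* Tucker-Lewis index (usual textbook definition)
scalar tli = ((chi2_bs/df_bs) - (chi2_ms/df_ms)) / ((chi2_bs/df_bs) - 1)

display "CFI = " %5.3f cfi
display "TLI = " %5.3f tli
```

Running estat gof, stats(all) after the original sem command reports these indices directly, so the hand computation is only meant to show where the baseline chi-square enters.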
The saturated model revisited
When we looked at the saturated model above we used a very simple model with only observed variables. Now we are going to try to come up with a saturated model that is more closely related to our original model. We will begin by looking at just the measurement part of our model. Here is the diagram.
Followed by the sem code.
sem (Acad->math science socst)

Endogenous variables

Measurement:  math science socst

Exogenous variables

Latent:       Acad

Fitting target model:

Iteration 0:   log likelihood = -2141.1294
Iteration 1:   log likelihood = -2141.1294

Structural equation model                       Number of obs      =       200
Estimation method  = ml
Log likelihood     = -2141.1294

 ( 1)  [math]Acad = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  math <-    |
        Acad |          1  (constrained)
       _cons |     52.645   .6607911    79.67   0.000     51.34987    53.94013
  -----------+----------------------------------------------------------------
  science <- |
        Acad |   .9027685   .1108342     8.15   0.000     .6855374        1.12
       _cons |      51.85   .6983463    74.25   0.000     50.48127    53.21873
  -----------+----------------------------------------------------------------
  socst <-   |
        Acad |   .8450313    .110572     7.64   0.000     .6283142    1.061748
       _cons |     52.405    .757235    69.21   0.000     50.92085    53.88915
-------------+----------------------------------------------------------------
Variance     |
      e.math |   22.84761   7.002427                      12.53029    41.66011
   e.science |   44.98577   7.024159                      33.12585    61.09185
     e.socst |   68.63626   8.333688                      54.10062    87.07729
        Acad |   64.48137   10.71715                      46.55433    89.31172
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 =      .
As you can see, the measurement model with three indicators is itself a saturated model. To be saturated it should have 3*4/2 + 3 = 9 parameters being estimated, which is the case: 2 loadings, 3 intercepts and 4 variances.
Now, let’s add read to our model like this.
sem (Acad->math science socst)(Acad<-read)(science<-read)(socst<-read), ///
    mean(read) difficult

Endogenous variables

Observed:     science socst
Measurement:  math
Latent:       Acad

Exogenous variables

Observed:     read

Fitting target model:

Iteration 0:   log likelihood =  -3698.205  (not concave)
[output omitted]
Iteration 28:  log likelihood = -2802.3352

Structural equation model                       Number of obs      =       200
Estimation method  = ml
Log likelihood     = -2802.3352

 ( 1)  [math]Acad = 1
-------------------------------------------------------------------------------
              |                 OIM
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
Structural    |
  science <-  |
         Acad |   .5843354   .3233183     1.81   0.071    -.0493567    1.218028
         read |   .2549117   .2019701     1.26   0.207    -.1409425    .6507659
        _cons |   20.06696   2.821788     7.11   0.000     14.53636    25.59757
  ------------+----------------------------------------------------------------
  socst <-    |
         Acad |   .3945619   .2262457     1.74   0.081    -.0488715    .8379953
         read |   .4119847    .148232     2.78   0.005     .1214554     .702514
        _cons |   18.41618   3.087163     5.97   0.000     12.36546    24.46691
  ------------+----------------------------------------------------------------
  Acad <-     |
         read |   .6051473     .04841    12.50   0.000     .5102655     .700029
--------------+----------------------------------------------------------------
Measurement   |
  math <-     |
         Acad |          1  (constrained)
        _cons |   21.03816    2.57647     8.17   0.000     15.98837    26.08795
--------------+----------------------------------------------------------------
    mean(read)|      52.23   .7231774    72.22   0.000      50.8126     53.6474
--------------+----------------------------------------------------------------
   var(e.math)|   15.32121   18.29777                      1.474762     159.171
var(e.science)|   47.29731   7.818624                      34.20787    65.39535
  var(e.socst)|   65.13928   7.105546                      52.60074    80.66665
   var(e.Acad)|   33.70397   18.81883                      11.28255    100.6827
     var(read)|   104.5971   10.45971                      85.98041    127.2447
-------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 =      .
This model has four observed variables. Thus, we should estimate 4*5/2 + 4 = 14 parameters. We achieved this by adding direct paths from read to science and to socst. We could have also achieved the same result by adding two covariances, say e.math*e.science and e.math*e.socst, to our model instead of the direct effects.
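The covariance alternative mentioned above might be specified along the following lines. This is an untested sketch: it replaces the two direct paths from read with the two error covariances named in the text, using sem's cov() option, and it has not been checked for convergence on these data. Other maximization options may be needed.

```stata
* Hypothetical alternative specification: error covariances instead of
* direct paths from read to science and socst
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
sem (Acad->math science socst)(Acad<-read), ///
    mean(read) cov(e.math*e.science e.math*e.socst) difficult
```

If it converges, the parameter count is again 14: 2 loadings, 3 intercepts, 1 path coefficient, 4 error variances, the mean and variance of read, and 2 error covariances.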
Finally, let’s add female to our model. We now have as many observed variables as our original model.
The above diagram translates to the following code.
sem (Acad -> math science socst)(Acad<-read)(Acad<-female)(read<-female) ///
    (math<- female read)(socst<-female read), ///
    mean(female) var(female)

Endogenous variables

Observed:     math socst read
Measurement:  science
Latent:       Acad

Exogenous variables

Observed:     female

Fitting target model:

Iteration 0:   log likelihood = -3111.6647  (not concave)
[output omitted]
Iteration 58:  log likelihood = -2943.2087

Structural equation model                       Number of obs      =       200
Estimation method  = ml
Log likelihood     = -2943.2087

 ( 1)  [math]Acad = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  math <-    |
        read |  -.3219013   .4660796    -0.69   0.490    -1.235401    .5915979
        Acad |          1  (constrained)
      female |   2.990358   2.137855     1.40   0.162    -1.199762    7.180477
       _cons |    20.9637   2.663851     7.87   0.000     15.74265    26.18475
  -----------+----------------------------------------------------------------
  socst <-   |
        read |    .250464   .1525691     1.64   0.101    -.0485661     .549494
        Acad |    .436786   .2230541     1.96   0.050    -.0003919    .8739639
      female |   3.099202    1.37288     2.26   0.024     .4084072    5.789998
       _cons |   17.16439   3.172945     5.41   0.000     10.94553    23.38325
  -----------+----------------------------------------------------------------
  read <-    |
      female |  -1.090232   1.450201    -0.75   0.452    -3.932574     1.75211
       _cons |   52.82418   1.070598    49.34   0.000     50.72584    54.92251
  -----------+----------------------------------------------------------------
  Acad <-    |
        read |   .9273316   .4666623     1.99   0.047     .0126903    1.841973
      female |  -2.880859   2.192737    -1.31   0.189    -7.178543    1.416826
-------------+----------------------------------------------------------------
Measurement  |
  science <- |
        Acad |   .6509763   .3226734     2.02   0.044      .018548    1.283405
       _cons |   21.34222   2.895969     7.37   0.000     15.66623    27.01822
-------------+----------------------------------------------------------------
Mean         |
      female |       .545   .0352119    15.48   0.000      .475986     .614014
-------------+----------------------------------------------------------------
Variance     |
      e.math |   18.69062   14.85898                      3.934787    88.78225
   e.science |   45.08212   7.703921                      32.25118    63.01778
     e.socst |   63.76156   6.970891                      51.46349    78.99847
      e.read |   104.3024   10.43024                      85.73812    126.8862
      e.Acad |   30.33159   15.41889                      11.19933     82.1483
      female |    .247975   .0247975                      .2038392    .3016672
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0)   =      0.00, Prob > chi2 =      .
This time there are five observed variables, which means that we need to estimate 5*6/2 + 5 = 20 parameters for a saturated model. We did this by adding direct paths from female to Acad, math, and socst and direct paths from read to math and socst. The log likelihood, -2943.2087, is the same as the one obtained with the simpler approach used earlier for the saturated model.
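To see that this model really does use all 20 parameters, the free parameters can be tallied by category from the output table. The sketch below does the addition in Stata; the local macro names are our own labels for the categories, and the counts are read off the table above.

```stata
* Free parameters in the five-variable model, read off the output table
local paths      = 7   // read and female -> math, socst, Acad; female -> read
local loadings   = 2   // Acad -> socst and science (Acad -> math is fixed at 1)
local intercepts = 4   // math, socst, read, science
local means      = 1   // female
local variances  = 6   // e.math e.science e.socst e.read e.Acad female

display `paths' + `loadings' + `intercepts' + `means' + `variances'
```

The total is 20, the same as 5*6/2 + 5, so the model has zero degrees of freedom.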
The baseline model revisited
We know that the baseline model estimates five means and five variances and no covariances, because there is only one observed exogenous variable, for a total of 10 parameters. We can get this from our original model by constraining all of the measurement coefficients (loadings) to be one and all of the path coefficients to be zero. Here is a diagram of the model.
And, here is one way to accomplish this.
sem (Acad ->math@1 science@1 socst@1) (read<-female@0), ///
    var(Acad@0 female) mean(female)

Endogenous variables

Observed:     read
Measurement:  math science socst

Exogenous variables

Observed:     female
Latent:       Acad

Fitting target model:

Iteration 0:   log likelihood = -3257.7854
[omitted output]
Iteration 4:   log likelihood = -3123.7147

Structural equation model                       Number of obs      =       200
Estimation method  = ml
Log likelihood     = -3123.7147

 ( 1)  [math]Acad = 1
 ( 2)  [science]Acad = 1
 ( 3)  [socst]Acad = 1
 ( 4)  [var(Acad)]_cons = 0
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural   |
  read <-    |
       _cons |      52.23   .7231774    72.22   0.000      50.8126     53.6474
-------------+----------------------------------------------------------------
Measurement  |
  math <-    |
        Acad |          1  (constrained)
       _cons |     52.645   .6607909    79.67   0.000     51.34987    53.94013
  -----------+----------------------------------------------------------------
  science <- |
        Acad |          1  (constrained)
       _cons |      51.85   .6983462    74.25   0.000     50.48127    53.21873
  -----------+----------------------------------------------------------------
  socst <-   |
        Acad |          1  (constrained)
       _cons |     52.405    .757235    69.21   0.000     50.92085    53.88915
-------------+----------------------------------------------------------------
Mean         |
      female |       .545   .0352119    15.48   0.000      .475986     .614014
-------------+----------------------------------------------------------------
Variance     |
      e.math |   87.32893   8.732889                      71.78572    106.2376
   e.science |   97.53748   9.753746                      80.17729    118.6565
     e.socst |    114.681    11.4681                       94.2695     139.512
      e.read |   104.5971   10.45971                      85.98041    127.2447
      female |    .247975   .0247975                      .2038392    .3016672
        Acad |          0  (constrained)
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(10)  =    361.01, Prob > chi2 = 0.0000
In this model the term (read<-female@0) estimates an intercept (mean) but no structural coefficient. There is no term predicting Acad from read, which is equivalent to setting that structural coefficient to zero. We added terms for the mean and variance of female. Finally, by convention, the variance of the latent variable is constrained to zero, which we did with var(Acad@0).