Let’s say that you have a dataset with a bunch of binary variables. Further, you believe that these binary variables reflect underlying, unobserved continuous variables. In that case you don’t want to compute your confirmatory factor analysis (CFA) directly on the binary variables. Instead, you will want to compute the CFA on tetrachoric correlations, which estimate the associations among these underlying continuous variables. We will demonstrate this by using data with five continuous variables and creating binary variables from them by dichotomizing each at 55, a point a little above their mean values.
Let’s begin by loading the hsbdemo.dta dataset and creating binary variables for read, write, math, science and socst.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
gen r=read<=55
gen w=write<=55
gen m=math<=55
gen s=science<=55
gen o=socst<=55
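Before moving on, it can be worth confirming where the cut point falls relative to each mean and how the resulting 0/1 variables are distributed. The following is a minimal sketch, assuming the variables created above are in memory. Note that with this coding a value of 1 means a score at or below 55.

```stata
* Means of the continuous scores, to compare against the cut point of 55
summarize read write math science socst

* One-way frequency tables for each binary variable (1 = score of 55 or lower)
tab1 r w m s o
```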
Now that we have the binary variables, let’s check out the correlations among the continuous versions of the variables and among the binary versions.
corr read write math science socst
(obs=200)

             |     read    write     math  science    socst
-------------+---------------------------------------------
        read |   1.0000
       write |   0.5968   1.0000
        math |   0.6623   0.6174   1.0000
     science |   0.6302   0.5704   0.6307   1.0000
       socst |   0.6215   0.6048   0.5445   0.4651   1.0000

corr r w m s o
(obs=200)

             |        r        w        m        s        o
-------------+---------------------------------------------
           r |   1.0000
           w |   0.4109   1.0000
           m |   0.4750   0.5029   1.0000
           s |   0.3846   0.4320   0.4750   1.0000
           o |   0.4057   0.3910   0.3676   0.3009   1.0000
As you can see, the correlations among the binary versions of the variables are much lower than those among the continuous versions. Pearson correlations computed on dichotomized variables tend to underestimate the relationship between the underlying continuous variables that give rise to the binary variables. What we need are the tetrachoric correlations, which we can obtain using the tetrachoric command.
tetrachoric r w m s o
(obs=200)

             |        r        w        m        s        o
-------------+---------------------------------------------
           r |   1.0000
           w |   0.6145   1.0000
           m |   0.6874   0.7176   1.0000
           s |   0.5790   0.6411   0.6874   1.0000
           o |   0.6148   0.5780   0.5556   0.4690   1.0000
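The tetrachoric correlations are much closer to the original correlations among the continuous variables than the Pearson correlations among the binary variables are. For read and write, for example, the tetrachoric correlation is 0.6145, compared with 0.5968 for the continuous variables and 0.4109 for the binary variables.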
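As a reminder of what is being estimated, the tetrachoric correlation is usually described with the model sketched below. The symbols (X*, Y*, the thresholds tau, rho, and the normal distribution functions Phi and Phi_2) are notation introduced here for illustration and do not appear in the Stata output. The sketch assumes the underlying variables are bivariate normal.

```latex
% Latent standard bivariate normal variables with correlation \rho
(X^*, Y^*) \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
                          \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)

% Observed binary variables, coded as in this example (1 = at or below the cut)
x = \mathbf{1}\{X^* \le \tau_x\}, \qquad y = \mathbf{1}\{Y^* \le \tau_y\}

% The margins determine the thresholds; the joint cell determines \rho
\Pr(x = 1) = \Phi(\tau_x), \qquad \Pr(y = 1) = \Phi(\tau_y), \qquad
\Pr(x = 1, y = 1) = \Phi_2(\tau_x, \tau_y; \rho)
```

Under this model, the tetrachoric correlation is the value of rho that is consistent with the observed 2x2 table of the two binary variables.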
For comparison purposes we will compute a CFA on the original continuous data.
sem (FC->read write math science socst)

Endogenous variables

Measurement:  read write math science socst

Exogenous variables

Latent:       FC

Fitting target model:

Iteration 0:   log likelihood = -3469.2622
[output omitted]
Iteration 3:   log likelihood = -3468.8093

Structural equation model                       Number of obs      =       200
Estimation method  = ml
Log likelihood     = -3468.8093

 ( 1)  [read]FC = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  read <-    |
          FC |          1  (constrained)
       _cons |      52.23   .7231774    72.22   0.000      50.8126     53.6474
  -----------+----------------------------------------------------------------
  write <-   |
          FC |   .8560521    .074653    11.47   0.000     .7097349    1.002369
       _cons |     52.775   .6685596    78.94   0.000     51.46465    54.08535
  -----------+----------------------------------------------------------------
  math <-    |
          FC |   .8915544   .0716702    12.44   0.000     .7510834    1.032025
       _cons |     52.645   .6607911    79.67   0.000     51.34987    53.94013
  -----------+----------------------------------------------------------------
  science <- |
          FC |   .8738626   .0766422    11.40   0.000     .7236467    1.024079
       _cons |      51.85   .6983463    74.25   0.000     50.48127    53.21873
  -----------+----------------------------------------------------------------
  socst <-   |
          FC |   .9068989   .0836274    10.84   0.000     .7429922    1.070806
       _cons |     52.405    .757235    69.21   0.000     50.92085    53.88915
-------------+----------------------------------------------------------------
Variance     |
      e.read |   33.32625   4.724035                      25.24223    43.99925
     e.write |    37.1653   4.627598                       29.1173    47.43775
      e.math |   30.67796   4.098463                      23.61072    39.86061
   e.science |   43.11252   5.199046                      34.03728    54.60745
     e.socst |   56.06316   6.566122                      44.56406    70.52942
          FC |   71.27085   10.46467                      53.44787    95.03717
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(5)   =     13.90, Prob > chi2 = 0.0163

sem (FB->r w m s o)
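Because the continuous model is fit to raw test scores and the tetrachoric model is fit to a correlation matrix, the unstandardized loadings from the two analyses are not on the same scale. One way to make them easier to compare is to refit the continuous model with standardized coefficients and request additional fit statistics. This is a sketch, assuming the original hsbdemo variables are still in memory.

```stata
* Refit the continuous-variable CFA and report standardized coefficients
sem (FC -> read write math science socst), standardized

* Additional goodness-of-fit statistics (RMSEA, CFI, TLI, SRMR, and others)
estat gof, stats(all)
```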
Next, we will create a summary statistics dataset (SSD) and compute the CFA on the tetrachoric correlations.
clear
ssd init r w m s o

  Summary statistics data initialized.  Next use, in any order,

      ssd set observations (required)
          It is best to do this first.

      ssd set means (optional)
          Default setting is 0.

      ssd set variances or ssd set sd (optional)
          Use this only if you have set or will set correlations and, even
          then, this is optional but highly recommended.  Default setting is 1.

      ssd set covariances or ssd set correlations (required)

ssd set obs 200

  (value set)

    Status:
                      observations:     set
                             means:   unset
                   variances or sd:   unset
       covariances or correlations:   unset (required to be set)

ssd set cor 1.0000 \ ///
            0.6145 1.0000 \ ///
            0.6874 0.7176 1.0000 \ ///
            0.5790 0.6411 0.6874 1.0000 \ ///
            0.6148 0.5780 0.5556 0.4690 1.0000

  (values set)

    Status:
                      observations:     set
                             means:   unset
                   variances or sd:   unset
       covariances or correlations:     set

sem (FT->r w m s o)

Endogenous variables

Measurement:  r w m s o

Exogenous variables

Latent:       FT

Fitting target model:

Iteration 0:   log likelihood = -1148.1182
Iteration 1:   log likelihood = -1147.4763
Iteration 2:   log likelihood = -1147.4673
Iteration 3:   log likelihood = -1147.4673

Structural equation model                       Number of obs      =       200
Estimation method  = ml
Log likelihood     = -1147.4673

 ( 1)  [r]FT = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  r <-       |
          FT |          1  (constrained)
  -----------+----------------------------------------------------------------
  w <-       |
          FT |   1.044911   .0857913    12.18   0.000     .8767629    1.213058
  -----------+----------------------------------------------------------------
  m <-       |
          FT |   1.112223   .0842935    13.19   0.000     .9470107    1.277435
  -----------+----------------------------------------------------------------
  s <-       |
          FT |   .9748353   .0866568    11.25   0.000      .804991     1.14468
  -----------+----------------------------------------------------------------
  o <-       |
          FT |   .8638298   .0865136     9.98   0.000     .6942662    1.033393
-------------+----------------------------------------------------------------
Variance     |
         e.r |   .3821465   .0471597                      .3000443    .4867145
         e.w |   .3258631   .0427959                      .2519104    .4215259
         e.m |   .2368758    .037462                      .1737412    .3229525
         e.s |    .412603   .0490659                      .3268205    .5209013
         e.o |   .5376875   .0600257                      .4320208    .6691991
          FT |   .6128535   .0959352                      .4509317    .8329186
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(5)   =     14.57, Prob > chi2 = 0.0124
You will note that the test of model fit versus a saturated model, chi2(5) = 14.57, is very close to the value obtained when we ran the CFA on the continuous variables, chi2(5) = 13.90.
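Typing the correlations by hand is error prone, so you may prefer to pass the matrix along directly. The sketch below assumes that tetrachoric leaves its correlation matrix in r(Rho) and that ssd set accepts a Stata matrix through the (stata) prefix. The matrix name R and the local macro n are names chosen here for illustration.

```stata
* Rebuild the binary variables from the original data
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
gen r = read    <= 55
gen w = write   <= 55
gen m = math    <= 55
gen s = science <= 55
gen o = socst   <= 55

* Copy the tetrachoric correlation matrix before another command overwrites r()
tetrachoric r w m s o
matrix R = r(Rho)

* Keep the sample size; this assumes every observation was used (obs=200 above)
local n = _N

* clear drops the data but leaves the matrix R in memory
clear
ssd init r w m s o
ssd set obs `n'
ssd set cor (stata) R

* Same one-factor model as above, now fit to the stored matrix
sem (FT -> r w m s o)
```

Because the stored matrix carries more decimal places than the four typed in by hand, the estimates may differ slightly in the later digits from those shown above.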