Let’s say that you have a dataset with a bunch of binary variables. Further, you believe that these binary variables reflect underlying, unobserved continuous variables. In that case you don’t want to compute your confirmatory factor analysis (CFA) directly on the binary variables. Instead, you will want to compute the CFA on tetrachoric correlations, which estimate the associations among the underlying continuous variables. We will demonstrate this using data with five continuous variables, creating binary variables from them by dichotomizing each at 55, a point a little above their mean values.
Let’s begin by loading the hsbdemo.dta dataset and creating binary variables for read, write, math, science and socst.
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
gen r = read <= 55
gen w = write <= 55
gen m = math <= 55
gen s = science <= 55
gen o = socst <= 55
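As a quick check on the dichotomization, the short sketch below summarizes the original scores and the new indicators. It assumes the variables created above are in memory. Note that each indicator is coded 1 when the score is 55 or less and 0 otherwise.

```stata
* Means of the continuous scores; the cutoff of 55 should sit a little above them
summarize read write math science socst

* The mean of each 0/1 indicator is the proportion of scores at or below 55
summarize r w m s o

* Cross-check one indicator: range of read within each level of r
tabstat read, by(r) statistics(min max)
```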
Now that we have the binary variables, let’s check out the correlations among the continuous versions of the variables and among the binary versions.
corr read write math science socst
(obs=200)
             |     read    write     math  science    socst
-------------+---------------------------------------------
        read |   1.0000
       write |   0.5968   1.0000
        math |   0.6623   0.6174   1.0000
     science |   0.6302   0.5704   0.6307   1.0000
       socst |   0.6215   0.6048   0.5445   0.4651   1.0000
corr r w m s o
(obs=200)
             |        r        w        m        s        o
-------------+---------------------------------------------
           r |   1.0000
           w |   0.4109   1.0000
           m |   0.4750   0.5029   1.0000
           s |   0.3846   0.4320   0.4750   1.0000
           o |   0.4057   0.3910   0.3676   0.3009   1.0000
As you can see, the correlations among the binary versions of the variables are much lower than those among the continuous versions. Pearson correlations computed on binary variables tend to underestimate the relationship between the underlying continuous variables that give rise to them. What we need are the tetrachoric correlations, which we can obtain using the tetrachoric command.
tetrachoric r w m s o
(obs=200)
             |        r        w        m        s        o
-------------+---------------------------------------------
           r |   1.0000
           w |   0.6145   1.0000
           m |   0.6874   0.7176   1.0000
           s |   0.5790   0.6411   0.6874   1.0000
           o |   0.6148   0.5780   0.5556   0.4690   1.0000
The tetrachoric correlations are much closer to the original correlations among the continuous variables than the Pearson correlations among the binary variables are.
For comparison purposes we will compute a CFA on the original continuous data.
sem (FC->read write math science socst)
Endogenous variables
Measurement: read write math science socst
Exogenous variables
Latent: FC
Fitting target model:
Iteration 0: log likelihood = -3469.2622
[output omitted]
Iteration 3: log likelihood = -3468.8093
Structural equation model Number of obs = 200
Estimation method = ml
Log likelihood = -3468.8093
( 1) [read]FC = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  read <-    |
          FC |          1  (constrained)
       _cons |      52.23   .7231774    72.22   0.000      50.8126     53.6474
  -----------+----------------------------------------------------------------
  write <-   |
          FC |   .8560521    .074653    11.47   0.000     .7097349    1.002369
       _cons |     52.775   .6685596    78.94   0.000     51.46465    54.08535
  -----------+----------------------------------------------------------------
  math <-    |
          FC |   .8915544   .0716702    12.44   0.000     .7510834    1.032025
       _cons |     52.645   .6607911    79.67   0.000     51.34987    53.94013
  -----------+----------------------------------------------------------------
  science <- |
          FC |   .8738626   .0766422    11.40   0.000     .7236467    1.024079
       _cons |      51.85   .6983463    74.25   0.000     50.48127    53.21873
  -----------+----------------------------------------------------------------
  socst <-   |
          FC |   .9068989   .0836274    10.84   0.000     .7429922    1.070806
       _cons |     52.405    .757235    69.21   0.000     50.92085    53.88915
-------------+----------------------------------------------------------------
Variance     |
      e.read |   33.32625   4.724035                      25.24223    43.99925
     e.write |    37.1653   4.627598                       29.1173    47.43775
      e.math |   30.67796   4.098463                      23.61072    39.86061
   e.science |   43.11252   5.199046                      34.03728    54.60745
     e.socst |   56.06316   6.566122                      44.56406    70.52942
          FC |   71.27085   10.46467                      53.44787    95.03717
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(5) = 13.90, Prob > chi2 = 0.0163
sem (FB->r w m s o)
Next, we will create a summary statistics dataset (SSD) from the tetrachoric correlations and compute the CFA on it.
clear
ssd init r w m s o
Summary statistics data initialized. Next use, in any order,
ssd set observations (required)
It is best to do this first.
ssd set means (optional)
Default setting is 0.
ssd set variances or ssd set sd (optional)
Use this only if you have set or will set correlations and, even then,
this is optional but highly recommended. Default setting is 1.
ssd set covariances or ssd set correlations (required)
ssd set obs 200
(value set)
Status:
observations: set
means: unset
variances or sd: unset
covariances or correlations: unset (required to be set)
ssd set cor 1.0000 \ ///
            0.6145 1.0000 \ ///
            0.6874 0.7176 1.0000 \ ///
            0.5790 0.6411 0.6874 1.0000 \ ///
            0.6148 0.5780 0.5556 0.4690 1.0000
(values set)
Status:
observations: set
means: unset
variances or sd: unset
covariances or correlations: set
sem (FT->r w m s o)
Endogenous variables
Measurement: r w m s o
Exogenous variables
Latent: FT
Fitting target model:
Iteration 0: log likelihood = -1148.1182
Iteration 1: log likelihood = -1147.4763
Iteration 2: log likelihood = -1147.4673
Iteration 3: log likelihood = -1147.4673
Structural equation model Number of obs = 200
Estimation method = ml
Log likelihood = -1147.4673
( 1) [r]FT = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  r <-       |
          FT |          1  (constrained)
  -----------+----------------------------------------------------------------
  w <-       |
          FT |   1.044911   .0857913    12.18   0.000     .8767629    1.213058
  -----------+----------------------------------------------------------------
  m <-       |
          FT |   1.112223   .0842935    13.19   0.000     .9470107    1.277435
  -----------+----------------------------------------------------------------
  s <-       |
          FT |   .9748353   .0866568    11.25   0.000      .804991     1.14468
  -----------+----------------------------------------------------------------
  o <-       |
          FT |   .8638298   .0865136     9.98   0.000     .6942662    1.033393
-------------+----------------------------------------------------------------
Variance     |
         e.r |   .3821465   .0471597                      .3000443    .4867145
         e.w |   .3258631   .0427959                      .2519104    .4215259
         e.m |   .2368758    .037462                      .1737412    .3229525
         e.s |    .412603   .0490659                      .3268205    .5209013
         e.o |   .5376875   .0600257                      .4320208    .6691991
          FT |   .6128535   .0959352                      .4509317    .8329186
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(5) = 14.57, Prob > chi2 = 0.0124
You will note that the test of model fit versus the saturated model, chi2(5) = 14.57, is very close to the value obtained when we ran the CFA on the continuous variables, chi2(5) = 13.90.
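Typing the correlations by hand is error prone and rounds them to four decimal places. As a sketch of an alternative, the matrix that tetrachoric leaves behind can be passed to ssd directly. This assumes that tetrachoric stores its results in r(Rho) and that ssd set accepts the (stata) matname form. The matrix name R is our own choice, and both assumptions are worth checking against the help files for your version of Stata.

```stata
* Recreate the binary indicators from the original data
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
gen r = read <= 55
gen w = write <= 55
gen m = math <= 55
gen s = science <= 55
gen o = socst <= 55

* Estimate the tetrachoric correlations and keep a copy of the stored matrix
tetrachoric r w m s o
matrix R = r(Rho)

* Build the summary statistics dataset from the saved matrix
* (clear drops the data but leaves Stata matrices in memory)
clear
ssd init r w m s o
ssd set observations 200
ssd set correlations (stata) R

* Fit the same one-factor CFA as above
sem (FT -> r w m s o)
```

Because R holds the correlations at full precision, the estimates may differ slightly in the last decimal places from those obtained with the rounded values typed in above.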
