How can I do CFA with binary variables?

Let’s say that you have a dataset with a bunch of binary variables, and you believe that these binary variables reflect underlying, unobserved continuous variables. In that case, you don’t want to compute your confirmatory factor analysis (CFA) directly on the binary variables. Instead, you will want to compute the CFA on the tetrachoric correlations, which estimate the associations among the underlying continuous variables. We will demonstrate this using data with five continuous variables, creating binary variables from them by dichotomizing each one at a point a little above its mean.

Let’s begin by loading the hsbdemo.dta dataset and creating binary versions of read, write, math, science and socst. Each binary variable equals 1 when the score is 55 or lower and 0 otherwise.

use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

gen r=read<=55
gen w=write<=55
gen m=math<=55
gen s=science<=55
gen o=socst<=55
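One detail worth keeping in mind: Stata treats a missing value as larger than any number, so an expression such as read<=55 evaluates to 0, not missing, when read is missing. The hsbdemo data have no missing values on these five variables (all 200 observations are used below), so the commands above are safe here. For other datasets, a more defensive version might look like the following sketch. The variable name r_chk is hypothetical and is used only for this illustration.

```stata
* Sketch only: r_chk is a made-up name, not part of the original example
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

* Code 1 for scores of 55 or lower, 0 for higher scores,
* and leave the result missing when read itself is missing
generate byte r_chk = (read <= 55) if !missing(read)

* Check the split; the missing option would reveal any missing codes
tabulate r_chk, missing

* Compare the cut point of 55 with the mean of the continuous variable
summarize read
```

Because all five variables are dichotomized in the same direction, the signs of the correlations among the binary variables match those among the continuous variables.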

Now that we have the binary variables, let’s check out the correlations among the continuous versions of the variables and among the binary versions.

corr read write math science socst

(obs=200)

             |     read    write     math  science    socst
-------------+---------------------------------------------
        read |   1.0000
       write |   0.5968   1.0000
        math |   0.6623   0.6174   1.0000
     science |   0.6302   0.5704   0.6307   1.0000
       socst |   0.6215   0.6048   0.5445   0.4651   1.0000

corr r w m s o
 
(obs=200)

             |        r        w        m        s        o
-------------+---------------------------------------------
           r |   1.0000
           w |   0.4109   1.0000
           m |   0.4750   0.5029   1.0000
           s |   0.3846   0.4320   0.4750   1.0000
           o |   0.4057   0.3910   0.3676   0.3009   1.0000

As you can see, the correlations among the binary versions of the variables are much lower than those among the continuous versions. Pearson correlations computed on dichotomized variables tend to underestimate the relationships among the underlying continuous variables that give rise to the binary variables. What we need are the tetrachoric correlations, which we can obtain using the tetrachoric command.

tetrachoric r w m s o

(obs=200)

             |        r        w        m        s        o
-------------+---------------------------------------------
           r |   1.0000 
           w |   0.6145   1.0000 
           m |   0.6874   0.7176   1.0000 
           s |   0.5790   0.6411   0.6874   1.0000 
           o |   0.6148   0.5780   0.5556   0.4690   1.0000

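The tetrachoric correlations are much closer to the original correlations among the continuous variables than the Pearson correlations among the binary variables are. For example, the tetrachoric correlation between r and w is 0.6145, compared with 0.5968 for read and write and only 0.4109 for the Pearson correlation between r and w.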

For comparison purposes, we will first compute a one-factor CFA on the original continuous data.

sem (FC->read write math science socst)

Endogenous variables

Measurement:  read write math science socst

Exogenous variables

Latent:       FC

Fitting target model:

Iteration 0:   log likelihood = -3469.2622  
[output omitted] 
Iteration 3:   log likelihood = -3468.8093  

Structural equation model                       Number of obs      =       200
Estimation method  = ml
Log likelihood     = -3468.8093

 ( 1)  [read]FC = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  read <-    |
          FC |          1  (constrained)
       _cons |      52.23   .7231774    72.22   0.000      50.8126     53.6474
  -----------+----------------------------------------------------------------
  write <-   |
          FC |   .8560521    .074653    11.47   0.000     .7097349    1.002369
       _cons |     52.775   .6685596    78.94   0.000     51.46465    54.08535
  -----------+----------------------------------------------------------------
  math <-    |
          FC |   .8915544   .0716702    12.44   0.000     .7510834    1.032025
       _cons |     52.645   .6607911    79.67   0.000     51.34987    53.94013
  -----------+----------------------------------------------------------------
  science <- |
          FC |   .8738626   .0766422    11.40   0.000     .7236467    1.024079
       _cons |      51.85   .6983463    74.25   0.000     50.48127    53.21873
  -----------+----------------------------------------------------------------
  socst <-   |
          FC |   .9068989   .0836274    10.84   0.000     .7429922    1.070806
       _cons |     52.405    .757235    69.21   0.000     50.92085    53.88915
-------------+----------------------------------------------------------------
Variance     |
      e.read |   33.32625   4.724035                      25.24223    43.99925
     e.write |    37.1653   4.627598                       29.1173    47.43775
      e.math |   30.67796   4.098463                      23.61072    39.86061
   e.science |   43.11252   5.199046                      34.03728    54.60745
     e.socst |   56.06316   6.566122                      44.56406    70.52942
          FC |   71.27085   10.46467                      53.44787    95.03717
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(5)   =     13.90, Prob > chi2 = 0.0163
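The coefficients above are in the metric of the raw test scores, with the loading for read fixed at 1, so they cannot be compared directly with loadings from a model fit to a correlation matrix. One way to put the results on a common footing is to look at the standardized solution. The following is a minimal sketch of how that might be done, replaying the model with the standardized option and requesting additional fit statistics with estat gof. It refits the same model shown above and adds nothing beyond these two postestimation steps.

```stata
* Refit the one-factor model on the continuous variables
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
sem (FC -> read write math science socst)

* Replay the results with standardized coefficients
sem, standardized

* Report additional goodness-of-fit statistics (RMSEA, CFI, TLI, SRMR, ...)
estat gof, stats(all)
```

The same two commands can be run after any other sem model, which makes the standardized loadings easy to line up side by side.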

Next, we will create a summary statistics data (SSD) dataset from the tetrachoric correlations and compute the CFA on it.

clear

ssd init r w m s o

Summary statistics data initialized.  Next use, in any order,

    ssd set observations (required)
        It is best to do this first.

    ssd set means (optional)
        Default setting is 0.

    ssd set variances or ssd set sd (optional)
        Use this only if you have set or will set correlations and, even then,
        this is optional but highly recommended.  Default setting is 1.

    ssd set covariances or ssd set correlations (required)

ssd set obs 200
  (value set)

    Status:
                       observations:    set
                              means:  unset
                    variances or sd:  unset
        covariances or correlations:  unset (required to be set)

ssd set cor  1.0000 \ ///
             0.6145   1.0000 \ ///
             0.6874   0.7176   1.0000 \ ///
             0.5790   0.6411   0.6874   1.0000 \ ///
             0.6148   0.5780   0.5556   0.4690   1.0000
  (values set)

    Status:
                       observations:    set
                              means:  unset
                    variances or sd:  unset
        covariances or correlations:    set

sem (FT->r w m s o)

Endogenous variables

Measurement:  r w m s o

Exogenous variables

Latent:       FT

Fitting target model:

Iteration 0:   log likelihood = -1148.1182  
Iteration 1:   log likelihood = -1147.4763  
Iteration 2:   log likelihood = -1147.4673  
Iteration 3:   log likelihood = -1147.4673  

Structural equation model                       Number of obs      =       200
Estimation method  = ml
Log likelihood     = -1147.4673

 ( 1)  [r]FT = 1
------------------------------------------------------------------------------
             |                 OIM
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Measurement  |
  r <-       |
          FT |          1  (constrained)
  -----------+----------------------------------------------------------------
  w <-       |
          FT |   1.044911   .0857913    12.18   0.000     .8767629    1.213058
  -----------+----------------------------------------------------------------
  m <-       |
          FT |   1.112223   .0842935    13.19   0.000     .9470107    1.277435
  -----------+----------------------------------------------------------------
  s <-       |
          FT |   .9748353   .0866568    11.25   0.000      .804991     1.14468
  -----------+----------------------------------------------------------------
  o <-       |
          FT |   .8638298   .0865136     9.98   0.000     .6942662    1.033393
-------------+----------------------------------------------------------------
Variance     |
         e.r |   .3821465   .0471597                      .3000443    .4867145
         e.w |   .3258631   .0427959                      .2519104    .4215259
         e.m |   .2368758    .037462                      .1737412    .3229525
         e.s |    .412603   .0490659                      .3268205    .5209013
         e.o |   .5376875   .0600257                      .4320208    .6691991
          FT |   .6128535   .0959352                      .4509317    .8329186
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(5)   =     14.57, Prob > chi2 = 0.0124

You will note that the test of model fit versus the saturated model, chi2(5) = 14.57, is very close to the value that was obtained when we ran the CFA on the continuous variables, chi2(5) = 13.90.
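Typing the correlations by hand is error prone and limits them to four decimal places. As an alternative, the matrix that tetrachoric leaves behind in r(Rho) can be handed to ssd directly. The sketch below assumes that ssd set correlations accepts a Stata matrix through the (stata) matname form described in the ssd documentation, and that the rows and columns of the matrix are in the same order as the variables named in ssd init; check help ssd for your version of Stata before relying on it. The matrix name R and the local macro n are names chosen for this illustration.

```stata
* Sketch only: R and n are names chosen for this illustration
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
gen r = read <= 55
gen w = write <= 55
gen m = math <= 55
gen s = science <= 55
gen o = socst <= 55

* Save the tetrachoric correlation matrix and the sample size
tetrachoric r w m s o
matrix R = r(Rho)
local n = r(N)

* clear drops the data but leaves matrices and local macros in place
clear
ssd init r w m s o
ssd set observations `n'

* Assumption: the (stata) matname form is accepted here; see help ssd
ssd set correlations (stata) R

sem (FT -> r w m s o)
```

Because r(Rho) holds the correlations at full precision, the estimates may differ slightly in the last decimal places from those shown above.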