Latent Class Analysis

Hypothetical Scenarios

Example 1

You are interested in studying drinking behavior among adults. Rather than conceptualizing drinking behavior as a continuous variable, you conceptualize it as forming distinct categories or typologies. For example, you think that people fall into one of three different types: abstainers, social drinkers and those who may have a problem with alcohol. Since you cannot directly measure what category someone falls into, this is a latent variable (a variable that cannot be directly measured). However, you do have a number of indicators that you believe are useful for categorizing people into these different categories. Using these indicators, you would like to:

Create a model that permits you to categorize these people into three different types of drinkers, hopefully fitting your conceptualization that there are abstainers, social drinkers and those who may have a problem with alcohol. Be able to categorize people as to what kind of drinker they are. Determine whether three latent classes is the right number of classes (i.e., are there only two types of drinkers or perhaps are there as many as four types of drinkers).

Example 2

High school students vary in their success in school. This might be indicated by the grades one gets, the number of absences one has, the number of truancies one has, and so forth. A traditional way to conceptualize this might be to view “degree of success in high school” as a latent variable (one that you cannot directly measure) that is normally distributed. However, you might conceptualize some students who are struggling and having trouble as forming a different category, perhaps a group you would call “at risk” (or in older days they would be called “juvenile delinquents”). Using indicators like grades, absences, truancies, tardies, suspensions, etc., you might try to identify latent class memberships based on high school success.

Data Description

Let’s pursue Example 1 from above. We have a hypothetical data file that we created that contains 9 fictional measures of drinking behavior. For each measure, the person would be asked whether the description applies to him/herself (yes or no). The 9 measures are

I like to drink I drink hard liquor I have drank in the morning I have drank at work I drink to get drunk I like the taste of alcohol I drink help me sleep Drinking interferes with my relationships I frequently visit bars

We have made up data for 1000 respondents and stored the data in a file called lca1, which is a Stata data file with the subject ID followed by the responses to the 9 questions, coded 1 for yes and 0 for no. The data for the first 10 observations look like this:

list id item1-item9 in 1/10

     +-----------------------------------------------------------------------------+
     |  id   item1   item2   item3   item4   item5   item6   item7   item8   item9 |
     |-----------------------------------------------------------------------------|
  1. | 300       0       0       0       0       0       0       0       0       0 |
  2. | 804       0       0       0       0       0       0       0       0       0 |
  3. | 949       0       0       0       0       0       0       0       0       0 |
  4. |  11       0       0       0       0       0       0       0       0       0 |
  5. | 166       0       0       0       0       0       0       0       0       0 |
     |-----------------------------------------------------------------------------|
  6. | 269       0       0       0       0       0       0       0       0       0 |
  7. | 437       0       0       0       0       0       0       0       0       0 |
  8. | 678       0       0       0       0       0       0       0       0       0 |
  9. | 379       0       0       0       0       0       0       0       0       0 |
 10. | 525       0       0       0       0       0       0       0       0       0 |
     +-----------------------------------------------------------------------------+

Some Strategies You Might Try

Before we show how you can analyze this with Latent Class Analysis, let’s consider some other methods that you might use:

Cluster Analysis – You could use cluster analysis for data like these. However, cluster analysis is not based on a statistical model. It can tell you how the cases are clustered into groups, but it does not provide information such as the probability that a given person is an alcoholic or abstainer. Also, cluster analysis would not provide information such as: given that someone said “yes” to drinking at work, what is the probability that they are an alcoholic. Factor Analysis – Because the term “latent variable” is used, you might be tempted to use factor analysis, since that is a technique used with latent variables. However, factor analysis is used for continuous and usually normally distributed latent variables, where this latent variable, e.g., alcoholism, is categorical.

Stata’s gsem is used to run a latent class analysis. After the command, the categorical predictor variables are listed. Because the variables in this example are numbered consecutively from 1 to 9, we can simply list the first variable name, item1, followed by a dash, and then the last variable name, item9. This is followed by an arrow pointing toward the predictors. The _cons is optional; the analysis will run if it is there are not. After closing the parentheses, a comma is given, indicating that options will follow. We list the family as bernoulli, the link as logit (because our predictors are binary), and then use the lclass option. In the parentheses, we give the name of the class, usually an upper-case C, and the number of classes we want.

gsem (item1-item9 <- _cons), family(bernoulli) link(logit) lclass(C 3)
Fitting class model:

Iteration 0:  (class) log likelihood = -1098.6113  
Iteration 1:  (class) log likelihood = -1098.6113  

Fitting outcome model:

Iteration 0:  (outcome) log likelihood = -3758.2924  
Iteration 1:  (outcome) log likelihood = -3646.4855  
Iteration 2:  (outcome) log likelihood = -3630.2169  
Iteration 3:  (outcome) log likelihood = -3626.9688  
Iteration 4:  (outcome) log likelihood = -3626.3104  
Iteration 5:  (outcome) log likelihood = -3626.1551  
Iteration 6:  (outcome) log likelihood = -3626.1263  
Iteration 7:  (outcome) log likelihood = -3626.1234  
Iteration 8:  (outcome) log likelihood = -3626.1228  
Iteration 9:  (outcome) log likelihood = -3626.1226  
Iteration 10: (outcome) log likelihood = -3626.1226  

Refining starting values:

Iteration 0:  (EM) log likelihood = -4931.4182
Iteration 1:  (EM) log likelihood = -4973.2876
Iteration 2:  (EM) log likelihood = -4980.0232
Iteration 3:  (EM) log likelihood = -4975.9769
Iteration 4:  (EM) log likelihood = -4968.4002
Iteration 5:  (EM) log likelihood = -4959.8284
Iteration 6:  (EM) log likelihood = -4951.1641
Iteration 7:  (EM) log likelihood = -4942.6958
Iteration 8:  (EM) log likelihood = -4934.4788
Iteration 9:  (EM) log likelihood = -4926.4841
Iteration 10: (EM) log likelihood = -4918.6582
Iteration 11: (EM) log likelihood = -4910.9457
Iteration 12: (EM) log likelihood = -4903.2977
Iteration 13: (EM) log likelihood = -4895.6746
Iteration 14: (EM) log likelihood = -4888.0454
Iteration 15: (EM) log likelihood = -4880.3882
Iteration 16: (EM) log likelihood =  -4872.689
Iteration 17: (EM) log likelihood = -4864.9418
Iteration 18: (EM) log likelihood = -4857.1472
Iteration 19: (EM) log likelihood = -4849.3128
Iteration 20: (EM) log likelihood = -4841.4511
note: EM algorithm reached maximum iterations.

Fitting full model:

Iteration 0:  Log likelihood = -4243.4772  (not concave)
Iteration 1:  Log likelihood = -4242.3749  (not concave)
Iteration 2:  Log likelihood = -4241.0733  (not concave)
Iteration 3:  Log likelihood = -4240.0798  (not concave)
Iteration 4:  Log likelihood = -4236.6898  (not concave)
Iteration 5:  Log likelihood =  -4236.656  (not concave)
Iteration 6:  Log likelihood = -4234.9869  
Iteration 7:  Log likelihood =  -4232.749  (not concave)
Iteration 8:  Log likelihood = -4232.2285  
Iteration 9:  Log likelihood = -4231.9255  
Iteration 10: Log likelihood = -4231.7821  
Iteration 11: Log likelihood = -4231.7411  (not concave)
Iteration 12: Log likelihood =  -4231.702  
Iteration 13: Log likelihood = -4231.6959  
Iteration 14: Log likelihood = -4231.6958  

Generalized structural equation model                    Number of obs = 1,000
Log likelihood = -4231.6958

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
1.C          |  (base outcome)
-------------+----------------------------------------------------------------
2.C          |
       _cons |   .4287798   .8185974     0.52   0.600    -1.175642    2.033201
-------------+----------------------------------------------------------------
3.C          |
       _cons |  -1.521675   .6923291    -2.20   0.028    -2.878615   -.1647353
------------------------------------------------------------------------------

Class:    1        

Response: item1    
Family:   Bernoulli
Link:     Logit    

Response: item2    
Family:   Bernoulli
Link:     Logit    

Response: item3    
Family:   Bernoulli
Link:     Logit    

Response: item4    
Family:   Bernoulli
Link:     Logit    

Response: item5    
Family:   Bernoulli
Link:     Logit    

Response: item6    
Family:   Bernoulli
Link:     Logit    

Response: item7    
Family:   Bernoulli
Link:     Logit    

Response: item8    
Family:   Bernoulli
Link:     Logit    

Response: item9    
Family:   Bernoulli
Link:     Logit    

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
item1        |
       _cons |  -.7899339   .9859737    -0.80   0.423    -2.722407    1.142539
-------------+----------------------------------------------------------------
item2        |
       _cons |  -1.628514   .3027822    -5.38   0.000    -2.221956   -1.035071
-------------+----------------------------------------------------------------
item3        |
       _cons |  -3.295288   .4608422    -7.15   0.000    -4.198523   -2.392054
-------------+----------------------------------------------------------------
item4        |
       _cons |  -2.823975   .3176524    -8.89   0.000    -3.446562   -2.201388
-------------+----------------------------------------------------------------
item5        |
       _cons |  -3.069835   .8750441    -3.51   0.000     -4.78489    -1.35478
-------------+----------------------------------------------------------------
item6        |
       _cons |  -1.496508   .3030282    -4.94   0.000    -2.090432   -.9025834
-------------+----------------------------------------------------------------
item7        |
       _cons |  -2.223255   .2496319    -8.91   0.000    -2.712524   -1.733985
-------------+----------------------------------------------------------------
item8        |
       _cons |  -2.091981   .2295699    -9.11   0.000    -2.541929   -1.642032
-------------+----------------------------------------------------------------
item9        |
       _cons |  -1.464306   .2636135    -5.55   0.000    -1.980979   -.9476332
------------------------------------------------------------------------------

Class:    2        

Response: item1    
Family:   Bernoulli
Link:     Logit    

Response: item2    
Family:   Bernoulli
Link:     Logit    

Response: item3    
Family:   Bernoulli
Link:     Logit    

Response: item4    
Family:   Bernoulli
Link:     Logit    

Response: item5    
Family:   Bernoulli
Link:     Logit    

Response: item6    
Family:   Bernoulli
Link:     Logit    

Response: item7    
Family:   Bernoulli
Link:     Logit    

Response: item8    
Family:   Bernoulli
Link:     Logit    

Response: item9    
Family:   Bernoulli
Link:     Logit    

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
item1        |
       _cons |   2.292889   1.053674     2.18   0.030     .2277263    4.358053
-------------+----------------------------------------------------------------
item2        |
       _cons |  -.6748749   .2540475    -2.66   0.008    -1.172799    -.176951
-------------+----------------------------------------------------------------
item3        |
       _cons |  -2.637822   .2810791    -9.38   0.000    -3.188727   -2.086917
-------------+----------------------------------------------------------------
item4        |
       _cons |  -2.658283   .3095767    -8.59   0.000    -3.265043   -2.051524
-------------+----------------------------------------------------------------
item5        |
       _cons |  -1.270617   .3102912    -4.09   0.000    -1.878776   -.6624574
-------------+----------------------------------------------------------------
item6        |
       _cons |   -.755464   .1985341    -3.81   0.000    -1.144584   -.3663442
-------------+----------------------------------------------------------------
item7        |
       _cons |  -2.062337   .2129544    -9.68   0.000     -2.47972   -1.644954
-------------+----------------------------------------------------------------
item8        |
       _cons |  -1.816183   .2258445    -8.04   0.000     -2.25883   -1.373536
-------------+----------------------------------------------------------------
item9        |
       _cons |  -.7314625   .1820354    -4.02   0.000    -1.088245   -.3746796
------------------------------------------------------------------------------

Class:    3        

Response: item1    
Family:   Bernoulli
Link:     Logit    

Response: item2    
Family:   Bernoulli
Link:     Logit    

Response: item3    
Family:   Bernoulli
Link:     Logit    

Response: item4    
Family:   Bernoulli
Link:     Logit    

Response: item5    
Family:   Bernoulli
Link:     Logit    

Response: item6    
Family:   Bernoulli
Link:     Logit    

Response: item7    
Family:   Bernoulli
Link:     Logit    

Response: item8    
Family:   Bernoulli
Link:     Logit    

Response: item9    
Family:   Bernoulli
Link:     Logit    

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
item1        |
       _cons |   2.487929   .5795831     4.29   0.000     1.351967    3.623891
-------------+----------------------------------------------------------------
item2        |
       _cons |   .1851186   .3154212     0.59   0.557    -.4330956    .8033328
-------------+----------------------------------------------------------------
item3        |
       _cons |  -.2965715   .3835435    -0.77   0.439    -1.048303      .45516
-------------+----------------------------------------------------------------
item4        |
       _cons |  -.3312536   .3358622    -0.99   0.324    -.9895315    .3270242
-------------+----------------------------------------------------------------
item5        |
       _cons |   1.182954   .5570521     2.12   0.034     .0911515    2.274756
-------------+----------------------------------------------------------------
item6        |
       _cons |  -.1160543   .3060531    -0.38   0.705    -.7159073    .4837988
-------------+----------------------------------------------------------------
item7        |
       _cons |   .0495348   .3718233     0.13   0.894    -.6792255    .7782951
-------------+----------------------------------------------------------------
item8        |
       _cons |   .4862054   .4581794     1.06   0.289    -.4118098    1.384221
-------------+----------------------------------------------------------------
item9        |
       _cons |  -.6241858   .3073837    -2.03   0.042    -1.226647   -.0217248
------------------------------------------------------------------------------

Although there is a lot of output, there is not too much you need to do with it. The model needs to be run so that we can then request the latent class probabilities and the latent class means.

Conditional Probabilities

We use the post-estimation command estat lcprob to get the latent class probabilities.

estat lcprob
Latent class marginal probabilities                      Number of obs = 1,000

--------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
           C |
          1  |    .363144   .1838146      .1072129    .7302799
          2  |   .5575651   .1750544      .2387489    .8350876
          3  |    .079291   .0242463      .0429849    .1417209
--------------------------------------------------------------

We see that the average probability of being in latent class 1 is approximately .36; the average probability of being in latent class 2 is approximately .56; and the average probability of being in latent class 3 is 0.08.

The post-estimation command estat lcmean gives the average proportion of endorsement (meaning selecting 1 rather than 0) for each item in each latent class.

estat lcmean
Latent class marginal means                              Number of obs = 1,000

--------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
1            |
       item1 |   .3121829   .2117129      .0616641    .7581455
       item2 |   .1640341   .0415196      .0977961    .2621021
       item3 |   .0357332   .0158789      .0147956    .0837806
       item4 |   .0560423   .0168043      .0308716     .099626
       item5 |   .0443688   .0371021      .0082858      .20509
       item6 |    .182947   .0452959      .1100302    .2885199
       item7 |   .0976816   .0220025      .0622384    .1500786
       item8 |   .1098787   .0224532      .0729706    .1621888
       item9 |   .1878096   .0402109      .1212145     .279361
-------------+------------------------------------------------
2            |
       item1 |   .9082864   .0877734      .5566868    .9873586
       item2 |   .3374061   .0567957      .2363495    .4558773
       item3 |   .0667436   .0175081      .0395922    .1103749
       item4 |   .0654803   .0189438      .0367901    .1138985
       item5 |   .2191517   .0530983      .1325295    .3401878
       item6 |   .3196319   .0431747      .2414798    .4094247
       item7 |   .1128118   .0213136      .0772922    .1617921
       item8 |   .1398925   .0271742      .0945905    .2020492
       item9 |   .3248739    .039926      .2519488    .4074107
-------------+------------------------------------------------
3            |
       item1 |   .9232913   .0410487       .794451    .9740146
       item2 |   .5461479   .0781836      .3933874    .6906869
       item3 |   .4263958    .093808      .2595511    .6118654
       item4 |   .4179356   .0817037      .2710046    .5810351
       item5 |   .7654785   .1000027      .5227721    .9067646
       item6 |    .471019   .0762562      .3282948    .6186445
       item7 |   .5123812   .0928988      .3364342    .6853126
       item8 |   .6192121   .1080334      .3984782     .799668
       item9 |   .3488301   .0698215      .2267688     .494569
--------------------------------------------------------------

Looking at item1 across the three classes, we see that it was endorsed by a little less than a third of those in latent class 1 and by more than 90% of those in latent classes 2 and 3. Looking at the pattern of responses for each item in each latent class should be helpful in determining the nature of each class. In our example, it seems that those in latent class 1 are those who are “social” drinkers; those in latent class 2 seem to be those who tend to abstain from alcohol, and those in latent class 3 may have a problem with alcohol.

The output above is useful, but it is not in a format that would be easily understood by most audiences. Let’s reformat the output to make it easier to read, as shown below. Each row represents a different item, and the three columns of numbers are the probabilities of answering “yes” to the item given that you belonged to that class. So, if you belong to latent class 1, you have a 90.8% probability of saying “yes, I like to drink”. By contrast, if you belong to latent class 2, you have a 31.2% chance of saying “yes, I like to drink”.

Class 1 Class 2 Class 3 Item Label
ITEM1 0.312 0.908 0.923 I like to drink
ITEM2 0.164 0.337 0.546 I drink hard liquor
ITEM3 0.036 0.067 0.426 I have drank in the morning
ITEM4 0.056 0.065 0.418 I have drank at work
ITEM5 0.044 0.219 0.765 I drink to get drunk
ITEM6 0.183 0.320 0.471 I like the taste of alcohol
ITEM7 0.098 0.113 0.512 I drink help me sleep
ITEM8 0.110 0.140 0.619 Drinking interferes with my relationships
ITEM9 0.188 0.325 0.349 I frequently visit bars

Looking at item1, those in latent class 2 and latent class3 really like to drink (with 90.8% and 92.3% saying yes) while those in latent class1 are not so fond of drinking (they have only a 31.2% probability of saying they like to drink). Jumping to item5, 76.5% of those in latent class 3 say they drink to get drunk, while 21.9% of those in latent class2 agreed to that, and only 4.4% of those in latent class1 say that.

Focusing just on latent class 3 (looking at that column), they really like to drink (92%), they drink hard liquor (54.6%), a pretty large number say they have drank in the morning and at work (42.6% and 41.8%), and well over half say drinking interferes with their relationships (61.9%).

It seems that those in latent class 1 are those who tend to abstain from alcohol; we were expecting to find a latent class like this. Not many of them like to drink (31.2%), few like the taste of alcohol (18.3%), few frequently visit bars (18.8%), and for the rest of the questions they rarely answered “yes”.

This leaves latent class 2; they seem fit the idea of the “social” drinker. They like to drink (90.8%), but they don’t drink hard liquor as often as Class 3 (33.7% versus 54.6%). They rarely drink in the morning or at work (6.7% and 6.5%) and rarely say that drinking interferes with their relationships (14%). They say they frequently visit bars similar to latent class 3 (32.5% versus 34.9%), but that might make sense. Both the social drinkers and those with a problem with alcohol are similar in how much they like to drink and how frequently they go to bars, but differ in key ways such as drinking at work, drinking in the morning, and the impact of drinking on their relationships.

We may also want to know how well this model fits these data, so we can use the post-estimation command estat lcgof.

estat lcgof
----------------------------------------------------------------------------
Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Likelihood ratio     |
        chi2_ms(482) |    319.955   model vs. saturated
            p > chi2 |      1.000
---------------------+------------------------------------------------------
Information criteria |
                 AIC |   8521.392   Akaike's information criterion
                 BIC |   8663.716   Bayesian information criterion
----------------------------------------------------------------------------

We fail to reject the null hypothesis that the fitted model fits as well as the saturated model.

Graphing the Results

All write-ups of latent class analyses contain tables of results, but graphs are useful as well. In Stata, we use the post-estimation command margins to create a table with particular content, and then use marginsplot to graph the contents of the table. You use must run the margins command before you run the marginsplot command. Let’s start with a basic graph, and then we will modify the graph to make it look nicer.

margins, predict(classpr class(1)) predict(classpr class(2)) predict(classpr class(3))
marginsplot

This graph is OK, but it could be better. We will add some options to the marginsplot command to improve the graph. Among other options, we will use the recast option, which changes the graph from a line graph to a bar graph.

marginsplot, recast(bar) xtitle("Latent Classes") ytitle("Probabilities of Belonging to a Class") xlabel(1 "Class 1" 2 "Class 2" 3 "Class 3") title("Predicted Migraine Latent Class Probabilities with 95% CI")

We can also make graphs showing the predicted probabilities for each of the items in our analysis.

margins, predict(outcome(item1) class(3)) predict(outcome(item2) class(3)) predict(outcome(item3) class(3)) predict(outcome(item4) class(3)) predict(outcome(item5) class(3)) predict(outcome(item6) class(3)) predict(outcome(item7) class(3)) predict(outcome(item8) class(3)) predict(outcome(item9) class(3)) 
marginsplot, recast(bar) xtitle("") ytitle("") xlabel(1 "item1" 2 "item2" 3 "item3" 4 "item4" 5 "item5" 6 "item6" 7 "item7" 8 "item8" 9 "item9", angle(45)) title("Predicted Probability of Behaviors For Class 3 with 95% CI")

Number of Classes

So far we have been assuming that we have chosen the right number of latent classes. Perhaps, however, there are only two types of drinkers, or perhaps there are four or more types of drinkers. So far we have liked the three class model, both based on our theoretical expectations and based on how interpretable our results have been. We can further assess whether we have chosen the right number of classes by running the analysis with different numbers of classes and then comparing the fit of the models. In our example, we may have wanted to compare a one-class, two-class, three-class and four-class model and then compare the results. However, a four-class model will not run with our example data, so we will run a one-class, two-class and three-class model and then compare the results with the estimates stats command. We use the quietly prefix before each gsem command to suppress the output.

quietly gsem (item1-item9 <- ), family(bernoulli) link(logit) lclass(C 1)
estimates store oneclass
quietly gsem (item1-item9 <- ), family(bernoulli) link(logit) lclass(C 2)
estimates store twoclass
quietly gsem (item1-item9 <- ), family(bernoulli) link(logit) lclass(C 3)
estimates store threeclass
* lower is better
estimates stats oneclass twoclass threeclass
Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
    oneclass |      1,000          .  -4348.878       9   8715.755   8759.925
    twoclass |      1,000          .  -4251.208      19   8540.416   8633.664
  threeclass |      1,000          .  -4231.696      29   8521.392   8663.716
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] IC note.

Both the AIC and BIC are lower for the three-class solution, so we think this is a good model.

Cautions, Flies in the Ointment

We have focused on a very simple example here just to get you started. Here are some problems to watch out for.

Have you specified the right number of latent classes? Perhaps you have specified too many classes (i.e., people largely fall into 2 classes) or you may have specified too few classes (i.e., people really fall into 4 or more classes).

Are some of your measures/indicators lousy? All of our measures were really useful in distinguishing what type of drinker the person was. However, say we had a measure that was “Do you like broccoli?”. This would be a poor indicator, and each type of drinker would probably answer in a similar way, so this question would be a good candidate to discard.

Having developed this model to identify the different types of drinkers, we might be interested in trying to predict why someone is an alcoholic, or why someone is an abstainer. For example, we might be interested in whether parental drinking predicts being an alcoholic. Such analyses are possible but not discussed here.

References

Masyn, Katherine E. Latent Class Analysis and Finite Mixture Modeling. (2013). In The Oxford Handbook of Quantitative Methods. Edited by Todd Little.