Hypothetical Scenarios
Example 1
You are interested in studying drinking behavior among adults. Rather than conceptualizing drinking behavior as a continuous variable, you conceptualize it as forming distinct categories or typologies. For example, you think that people fall into one of three different types: abstainers, social drinkers and those who may have a problem with alcohol. Since you cannot directly measure what category someone falls into, this is a latent variable (a variable that cannot be directly measured). However, you do have a number of indicators that you believe are useful for categorizing people into these different categories. Using these indicators, you would like to:
Create a model that permits you to categorize these people into three different types of drinkers, hopefully fitting your conceptualization that there are abstainers, social drinkers and those who may have a problem with alcohol.
Be able to categorize people as to what kind of drinker they are.
Determine whether three latent classes is the right number of classes (i.e., are there only two types of drinkers or perhaps are there as many as four types of drinkers).
Example 2
High school students vary in their success in school. This might be indicated by the grades one gets, the number of absences one has, the number of truancies one has, and so forth. A traditional way to conceptualize this might be to view “degree of success in high school” as a latent variable (one that you cannot directly measure) that is normally distributed. However, you might conceptualize some students who are struggling and having trouble as forming a different category, perhaps a group you would call “at risk” (or in older days they would be called “juvenile delinquents”). Using indicators like grades, absences, truancies, tardies, suspensions, etc., you might try to identify latent class memberships based on high school success.
Data Description
Let’s pursue Example 1 from above. We have a hypothetical data file that we created that contains 9 fictional measures of drinking behavior. For each measure, the person would be asked whether the description applies to him/herself (yes or no). The 9 measures are:
I like to drink
I drink hard liquor
I have drank in the morning
I have drank at work
I drink to get drunk
I like the taste of alcohol
I drink to help me sleep
Drinking interferes with my relationships
I frequently visit bars
We have made up data for 1000 respondents and stored the data in a file called lca1, which is a Stata data file with the subject ID followed by the responses to the 9 questions, coded 1 for yes and 0 for no. The data for the first 10 observations look like this:
list id item1-item9 in 1/10

     +-----------------------------------------------------------------------------+
     |  id   item1   item2   item3   item4   item5   item6   item7   item8   item9 |
     |-----------------------------------------------------------------------------|
  1. | 300       0       0       0       0       0       0       0       0       0 |
  2. | 804       0       0       0       0       0       0       0       0       0 |
  3. | 949       0       0       0       0       0       0       0       0       0 |
  4. |  11       0       0       0       0       0       0       0       0       0 |
  5. | 166       0       0       0       0       0       0       0       0       0 |
     |-----------------------------------------------------------------------------|
  6. | 269       0       0       0       0       0       0       0       0       0 |
  7. | 437       0       0       0       0       0       0       0       0       0 |
  8. | 678       0       0       0       0       0       0       0       0       0 |
  9. | 379       0       0       0       0       0       0       0       0       0 |
 10. | 525       0       0       0       0       0       0       0       0       0 |
     +-----------------------------------------------------------------------------+
Some Strategies You Might Try
Before we show how you can analyze this with Latent Class Analysis, let’s consider some other methods that you might use:
Cluster Analysis – You could use cluster analysis for data like these. However, cluster analysis is not based on a statistical model. It can tell you how the cases are clustered into groups, but it does not provide information such as the probability that a given person is an alcoholic or abstainer. Also, cluster analysis would not provide information such as: given that someone said “yes” to drinking at work, what is the probability that they are an alcoholic.
Factor Analysis – Because the term “latent variable” is used, you might be tempted to use factor analysis, since that is a technique used with latent variables. However, factor analysis is used for continuous and usually normally distributed latent variables, whereas the latent variable here, e.g., alcoholism, is categorical.
Latent Class Analysis
Stata’s gsem command is used to run a latent class analysis. After the command name, the categorical indicator variables are listed. Because the variables in this example are numbered consecutively from 1 to 9, we can simply list the first variable name, item1, followed by a dash, and then the last variable name, item9. This is followed by an arrow pointing toward the indicators. The _cons is optional; the analysis will run whether it is there or not. After the closing parenthesis, a comma is given, indicating that options will follow. We list the family as bernoulli and the link as logit (because our indicators are binary), and then use the lclass option. In the parentheses, we give the name of the latent class variable, usually an upper-case C, and the number of classes we want.
gsem (item1-item9 <- _cons), family(bernoulli) link(logit) lclass(C 3)

Fitting class model:

Iteration 0:  (class) log likelihood = -1098.6113
Iteration 1:  (class) log likelihood = -1098.6113

Fitting outcome model:

Iteration 0:  (outcome) log likelihood = -3758.2924
Iteration 1:  (outcome) log likelihood = -3646.4855
Iteration 2:  (outcome) log likelihood = -3630.2169
Iteration 3:  (outcome) log likelihood = -3626.9688
Iteration 4:  (outcome) log likelihood = -3626.3104
Iteration 5:  (outcome) log likelihood = -3626.1551
Iteration 6:  (outcome) log likelihood = -3626.1263
Iteration 7:  (outcome) log likelihood = -3626.1234
Iteration 8:  (outcome) log likelihood = -3626.1228
Iteration 9:  (outcome) log likelihood = -3626.1226
Iteration 10: (outcome) log likelihood = -3626.1226

Refining starting values:

Iteration 0:  (EM) log likelihood = -4931.4182
Iteration 1:  (EM) log likelihood = -4973.2876
Iteration 2:  (EM) log likelihood = -4980.0232
Iteration 3:  (EM) log likelihood = -4975.9769
Iteration 4:  (EM) log likelihood = -4968.4002
Iteration 5:  (EM) log likelihood = -4959.8284
Iteration 6:  (EM) log likelihood = -4951.1641
Iteration 7:  (EM) log likelihood = -4942.6958
Iteration 8:  (EM) log likelihood = -4934.4788
Iteration 9:  (EM) log likelihood = -4926.4841
Iteration 10: (EM) log likelihood = -4918.6582
Iteration 11: (EM) log likelihood = -4910.9457
Iteration 12: (EM) log likelihood = -4903.2977
Iteration 13: (EM) log likelihood = -4895.6746
Iteration 14: (EM) log likelihood = -4888.0454
Iteration 15: (EM) log likelihood = -4880.3882
Iteration 16: (EM) log likelihood =  -4872.689
Iteration 17: (EM) log likelihood = -4864.9418
Iteration 18: (EM) log likelihood = -4857.1472
Iteration 19: (EM) log likelihood = -4849.3128
Iteration 20: (EM) log likelihood = -4841.4511
note: EM algorithm reached maximum iterations.

Fitting full model:

Iteration 0:  Log likelihood = -4243.4772  (not concave)
Iteration 1:  Log likelihood = -4242.3749  (not concave)
Iteration 2:  Log likelihood = -4241.0733  (not concave)
Iteration 3:  Log likelihood = -4240.0798  (not concave)
Iteration 4:  Log likelihood = -4236.6898  (not concave)
Iteration 5:  Log likelihood =  -4236.656  (not concave)
Iteration 6:  Log likelihood = -4234.9869
Iteration 7:  Log likelihood =  -4232.749  (not concave)
Iteration 8:  Log likelihood = -4232.2285
Iteration 9:  Log likelihood = -4231.9255
Iteration 10: Log likelihood = -4231.7821
Iteration 11: Log likelihood = -4231.7411  (not concave)
Iteration 12: Log likelihood =  -4231.702
Iteration 13: Log likelihood = -4231.6959
Iteration 14: Log likelihood = -4231.6958

Generalized structural equation model                    Number of obs = 1,000
Log likelihood = -4231.6958

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
1.C          |  (base outcome)
-------------+----------------------------------------------------------------
2.C          |
       _cons |   .4287798   .8185974     0.52   0.600    -1.175642    2.033201
-------------+----------------------------------------------------------------
3.C          |
       _cons |  -1.521675   .6923291    -2.20   0.028    -2.878615   -.1647353
------------------------------------------------------------------------------

Class: 1

Response: item1    Family: Bernoulli    Link: Logit
Response: item2    Family: Bernoulli    Link: Logit
Response: item3    Family: Bernoulli    Link: Logit
Response: item4    Family: Bernoulli    Link: Logit
Response: item5    Family: Bernoulli    Link: Logit
Response: item6    Family: Bernoulli    Link: Logit
Response: item7    Family: Bernoulli    Link: Logit
Response: item8    Family: Bernoulli    Link: Logit
Response: item9    Family: Bernoulli    Link: Logit

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
item1        |
       _cons |  -.7899339   .9859737    -0.80   0.423    -2.722407    1.142539
-------------+----------------------------------------------------------------
item2        |
       _cons |  -1.628514   .3027822    -5.38   0.000    -2.221956   -1.035071
-------------+----------------------------------------------------------------
item3        |
       _cons |  -3.295288   .4608422    -7.15   0.000    -4.198523   -2.392054
-------------+----------------------------------------------------------------
item4        |
       _cons |  -2.823975   .3176524    -8.89   0.000    -3.446562   -2.201388
-------------+----------------------------------------------------------------
item5        |
       _cons |  -3.069835   .8750441    -3.51   0.000     -4.78489    -1.35478
-------------+----------------------------------------------------------------
item6        |
       _cons |  -1.496508   .3030282    -4.94   0.000    -2.090432   -.9025834
-------------+----------------------------------------------------------------
item7        |
       _cons |  -2.223255   .2496319    -8.91   0.000    -2.712524   -1.733985
-------------+----------------------------------------------------------------
item8        |
       _cons |  -2.091981   .2295699    -9.11   0.000    -2.541929   -1.642032
-------------+----------------------------------------------------------------
item9        |
       _cons |  -1.464306   .2636135    -5.55   0.000    -1.980979   -.9476332
------------------------------------------------------------------------------

Class: 2

Response: item1    Family: Bernoulli    Link: Logit
Response: item2    Family: Bernoulli    Link: Logit
Response: item3    Family: Bernoulli    Link: Logit
Response: item4    Family: Bernoulli    Link: Logit
Response: item5    Family: Bernoulli    Link: Logit
Response: item6    Family: Bernoulli    Link: Logit
Response: item7    Family: Bernoulli    Link: Logit
Response: item8    Family: Bernoulli    Link: Logit
Response: item9    Family: Bernoulli    Link: Logit

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
item1        |
       _cons |   2.292889   1.053674     2.18   0.030     .2277263    4.358053
-------------+----------------------------------------------------------------
item2        |
       _cons |  -.6748749   .2540475    -2.66   0.008    -1.172799    -.176951
-------------+----------------------------------------------------------------
item3        |
       _cons |  -2.637822   .2810791    -9.38   0.000    -3.188727   -2.086917
-------------+----------------------------------------------------------------
item4        |
       _cons |  -2.658283   .3095767    -8.59   0.000    -3.265043   -2.051524
-------------+----------------------------------------------------------------
item5        |
       _cons |  -1.270617   .3102912    -4.09   0.000    -1.878776   -.6624574
-------------+----------------------------------------------------------------
item6        |
       _cons |   -.755464   .1985341    -3.81   0.000    -1.144584   -.3663442
-------------+----------------------------------------------------------------
item7        |
       _cons |  -2.062337   .2129544    -9.68   0.000     -2.47972   -1.644954
-------------+----------------------------------------------------------------
item8        |
       _cons |  -1.816183   .2258445    -8.04   0.000     -2.25883   -1.373536
-------------+----------------------------------------------------------------
item9        |
       _cons |  -.7314625   .1820354    -4.02   0.000    -1.088245   -.3746796
------------------------------------------------------------------------------

Class: 3

Response: item1    Family: Bernoulli    Link: Logit
Response: item2    Family: Bernoulli    Link: Logit
Response: item3    Family: Bernoulli    Link: Logit
Response: item4    Family: Bernoulli    Link: Logit
Response: item5    Family: Bernoulli    Link: Logit
Response: item6    Family: Bernoulli    Link: Logit
Response: item7    Family: Bernoulli    Link: Logit
Response: item8    Family: Bernoulli    Link: Logit
Response: item9    Family: Bernoulli    Link: Logit

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
item1        |
       _cons |   2.487929   .5795831     4.29   0.000     1.351967    3.623891
-------------+----------------------------------------------------------------
item2        |
       _cons |   .1851186   .3154212     0.59   0.557    -.4330956    .8033328
-------------+----------------------------------------------------------------
item3        |
       _cons |  -.2965715   .3835435    -0.77   0.439    -1.048303      .45516
-------------+----------------------------------------------------------------
item4        |
       _cons |  -.3312536   .3358622    -0.99   0.324    -.9895315    .3270242
-------------+----------------------------------------------------------------
item5        |
       _cons |   1.182954   .5570521     2.12   0.034     .0911515    2.274756
-------------+----------------------------------------------------------------
item6        |
       _cons |  -.1160543   .3060531    -0.38   0.705    -.7159073    .4837988
-------------+----------------------------------------------------------------
item7        |
       _cons |   .0495348   .3718233     0.13   0.894    -.6792255    .7782951
-------------+----------------------------------------------------------------
item8        |
       _cons |   .4862054   .4581794     1.06   0.289    -.4118098    1.384221
-------------+----------------------------------------------------------------
item9        |
       _cons |  -.6241858   .3073837    -2.03   0.042    -1.226647   -.0217248
------------------------------------------------------------------------------
Although there is a lot of output, there is not too much you need to do with it. The model needs to be run so that we can then request the latent class probabilities and the latent class means.
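The log above includes the note that the EM algorithm reached its maximum iterations, and several full-model iterations are flagged as not concave. Latent class likelihoods can have more than one local maximum, so one check is to refit the model from several sets of random starting values and see whether the final log likelihood changes. The sketch below assumes that the startvalues() and emopts() options of gsem are available in your version of Stata; the number of draws, the seed, and the EM iteration limit are arbitrary illustrative choices, not values from this example.

```stata
* Refit the three-class model from 5 random class assignments (illustrative values),
* allowing up to 50 EM iterations before the full maximization begins
gsem (item1-item9 <- _cons), family(bernoulli) link(logit) lclass(C 3) ///
    startvalues(randomid, draws(5) seed(12345)) emopts(iterate(50))
```

A final log likelihood close to -4231.70 would suggest that the solution above is not a poor local maximum; a noticeably higher value would suggest that it is. The /// line continuation works in a do-file but not at the interactive command line.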
Conditional Probabilities
We use the post-estimation command estat lcprob to get the latent class probabilities.
estat lcprob

Latent class marginal probabilities             Number of obs     =      1,000

--------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
           C |
          1  |    .363144   .1838146      .1072129    .7302799
          2  |   .5575651   .1750544      .2387489    .8350876
          3  |    .079291   .0242463      .0429849    .1417209
--------------------------------------------------------------
We see that the average probability of being in latent class 1 is approximately .36; the average probability of being in latent class 2 is approximately .56; and the average probability of being in latent class 3 is 0.08.
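These class probabilities are a transformation of the 2.C and 3.C intercepts in the gsem output, with class 1 as the base outcome (intercept fixed at 0). As a quick check, the multinomial logit formula can be evaluated by hand; the scalar name d is only an illustrative choice.

```stata
* Denominator of the multinomial logit: exp(0) for the base class plus the two intercepts
scalar d = 1 + exp(.4287798) + exp(-1.521675)
* Probability of class 1, class 2, and class 3, in that order
display 1/d
display exp(.4287798)/d
display exp(-1.521675)/d
```

The three values should come out at approximately .363, .558 and .079, matching the estat lcprob margins.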
The post-estimation command estat lcmean gives the average proportion of endorsement (meaning selecting 1 rather than 0) for each item in each latent class.
estat lcmean

Latent class marginal means                     Number of obs     =      1,000

--------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.     [95% conf. interval]
-------------+------------------------------------------------
1            |
       item1 |   .3121829   .2117129      .0616641    .7581455
       item2 |   .1640341   .0415196      .0977961    .2621021
       item3 |   .0357332   .0158789      .0147956    .0837806
       item4 |   .0560423   .0168043      .0308716     .099626
       item5 |   .0443688   .0371021      .0082858      .20509
       item6 |    .182947   .0452959      .1100302    .2885199
       item7 |   .0976816   .0220025      .0622384    .1500786
       item8 |   .1098787   .0224532      .0729706    .1621888
       item9 |   .1878096   .0402109      .1212145     .279361
-------------+------------------------------------------------
2            |
       item1 |   .9082864   .0877734      .5566868    .9873586
       item2 |   .3374061   .0567957      .2363495    .4558773
       item3 |   .0667436   .0175081      .0395922    .1103749
       item4 |   .0654803   .0189438      .0367901    .1138985
       item5 |   .2191517   .0530983      .1325295    .3401878
       item6 |   .3196319   .0431747      .2414798    .4094247
       item7 |   .1128118   .0213136      .0772922    .1617921
       item8 |   .1398925   .0271742      .0945905    .2020492
       item9 |   .3248739    .039926      .2519488    .4074107
-------------+------------------------------------------------
3            |
       item1 |   .9232913   .0410487       .794451    .9740146
       item2 |   .5461479   .0781836      .3933874    .6906869
       item3 |   .4263958    .093808      .2595511    .6118654
       item4 |   .4179356   .0817037      .2710046    .5810351
       item5 |   .7654785   .1000027      .5227721    .9067646
       item6 |    .471019   .0762562      .3282948    .6186445
       item7 |   .5123812   .0928988      .3364342    .6853126
       item8 |   .6192121   .1080334      .3984782     .799668
       item9 |   .3488301   .0698215      .2267688     .494569
--------------------------------------------------------------
Looking at item1 across the three classes, we see that it was endorsed by a little less than a third of those in latent class 1 and by more than 90% of those in latent classes 2 and 3. Looking at the pattern of responses for each item in each latent class should be helpful in determining the nature of each class. In our example, it seems that those in latent class 1 tend to abstain from alcohol, those in latent class 2 are “social” drinkers, and those in latent class 3 may have a problem with alcohol.
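Each of these endorsement proportions is the inverse logit of the corresponding _cons coefficient in the gsem output. A minimal check for item1, using Stata's built-in invlogit() function:

```stata
* Inverse logit of the item1 intercept in class 1, class 2, and class 3
display invlogit(-.7899339)
display invlogit(2.292889)
display invlogit(2.487929)
```

The results should agree with the estat lcmean margins for item1 (about .312, .908 and .923).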
The estat lcmean output is useful, but it is not in a format that would be easily understood by most audiences. Let’s reformat the output to make it easier to read, as shown below. Each row represents a different item, and the three columns of numbers are the probabilities of answering “yes” to the item given that you belong to that class. So, if you belong to latent class 1, you have a 31.2% probability of saying “yes, I like to drink”. By contrast, if you belong to latent class 2, you have a 90.8% chance of saying “yes, I like to drink”.
Item     Class 1   Class 2   Class 3   Item Label
ITEM1      0.312     0.908     0.923   I like to drink
ITEM2      0.164     0.337     0.546   I drink hard liquor
ITEM3      0.036     0.067     0.426   I have drank in the morning
ITEM4      0.056     0.065     0.418   I have drank at work
ITEM5      0.044     0.219     0.765   I drink to get drunk
ITEM6      0.183     0.320     0.471   I like the taste of alcohol
ITEM7      0.098     0.113     0.512   I drink to help me sleep
ITEM8      0.110     0.140     0.619   Drinking interferes with my relationships
ITEM9      0.188     0.325     0.349   I frequently visit bars
Looking at item1, those in latent class 2 and latent class 3 really like to drink (with 90.8% and 92.3% saying yes), while those in latent class 1 are not so fond of drinking (they have only a 31.2% probability of saying they like to drink). Jumping to item5, 76.5% of those in latent class 3 say they drink to get drunk, while 21.9% of those in latent class 2 agreed to that, and only 4.4% of those in latent class 1 say that.
Focusing just on latent class 3 (looking at that column), they really like to drink (92%), they drink hard liquor (54.6%), a pretty large number say they have drank in the morning and at work (42.6% and 41.8%), and well over half say drinking interferes with their relationships (61.9%).
It seems that those in latent class 1 are those who tend to abstain from alcohol; we were expecting to find a latent class like this. Not many of them like to drink (31.2%), few like the taste of alcohol (18.3%), few frequently visit bars (18.8%), and for the rest of the questions they rarely answered “yes”.
This leaves latent class 2; they seem to fit the idea of the “social” drinker. They like to drink (90.8%), but they don’t drink hard liquor as often as Class 3 (33.7% versus 54.6%). They rarely drink in the morning or at work (6.7% and 6.5%) and rarely say that drinking interferes with their relationships (14%). They say they frequently visit bars at a rate similar to latent class 3 (32.5% versus 34.9%), but that might make sense. Both the social drinkers and those with a problem with alcohol are similar in how much they like to drink and how frequently they go to bars, but differ in key ways such as drinking at work, drinking in the morning, and the impact of drinking on their relationships.
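The conditional probabilities above run from class to response. To go the other way, for example to ask how likely it is that someone who says “yes” to drinking at work belongs to latent class 3, we can apply Bayes' rule to the estat lcprob and estat lcmean values. The same idea, applied to all nine responses at once, is what predict with the classposteriorpr option computes for each respondent. The sketch below assumes the three-class model is still the active estimation result; the names num, den, cpost, maxpost and predclass are illustrative.

```stata
* P(class 3 | yes to item4) = P(item4 | class 3) * P(class 3) / P(item4)
scalar num = .4179356*.079291
scalar den = .0560423*.363144 + .0654803*.5575651 + .4179356*.079291
display num/den

* Posterior class probabilities for every respondent, based on all nine items
predict cpost*, classposteriorpr

* Assign each respondent to the class with the largest posterior probability
egen maxpost = rowmax(cpost*)
generate predclass = .
forvalues k = 1/3 {
    replace predclass = `k' if cpost`k' == maxpost
}
tabulate predclass
```

The hand calculation gives roughly .37: a single “yes” to drinking at work raises the probability of class 3 from about .08 to about .37, and the posterior probabilities from predict sharpen this further by using the full response pattern.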
We may also want to know how well this model fits these data, so we can use the post-estimation command estat lcgof.
estat lcgof

----------------------------------------------------------------------------
Fit statistic        |      Value   Description
---------------------+------------------------------------------------------
Likelihood ratio     |
        chi2_ms(482) |    319.955   model vs. saturated
            p > chi2 |      1.000
---------------------+------------------------------------------------------
Information criteria |
                 AIC |   8521.392   Akaike's information criterion
                 BIC |   8663.716   Bayesian information criterion
----------------------------------------------------------------------------
We fail to reject the null hypothesis that the fitted model fits as well as the saturated model.
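The 482 degrees of freedom in this test can be verified by counting. Nine yes/no items give 2^9 = 512 possible response patterns, so the saturated model has 511 free cell probabilities, and the three-class model estimates 29 parameters. A short check of the arithmetic, using Stata's chi2tail() function for the upper-tail probability:

```stata
* Free parameters in the three-class model: 9 item intercepts per class plus 2 class intercepts
display 9*3 + (3 - 1)
* Degrees of freedom: 2^9 response patterns, minus 1, minus the 29 model parameters
display 2^9 - 1 - 29
* Upper-tail p-value for the likelihood-ratio chi-square reported by estat lcgof
display chi2tail(482, 319.955)
```

The first two lines return 29 and 482. With 512 possible patterns and only 1,000 respondents, many cells are empty or nearly so, and the chi-square approximation is generally less reliable in such sparse tables, so this test is best read alongside the information criteria.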
Graphing the Results
All write-ups of latent class analyses contain tables of results, but graphs are useful as well. In Stata, we use the post-estimation command margins to create a table with particular content, and then use marginsplot to graph the contents of the table. You must run the margins command before you run the marginsplot command. Let’s start with a basic graph, and then we will modify the graph to make it look nicer.
margins, predict(classpr class(1)) predict(classpr class(2)) predict(classpr class(3))
marginsplot
This graph is OK, but it could be better. We will add some options to the marginsplot command to improve the graph. Among other options, we will use the recast option, which changes the graph from a line graph to a bar graph.
marginsplot, recast(bar) xtitle("Latent Classes") ///
    ytitle("Probabilities of Belonging to a Class") ///
    xlabel(1 "Class 1" 2 "Class 2" 3 "Class 3") ///
    title("Predicted Latent Class Probabilities with 95% CI")
We can also make graphs showing the predicted probabilities for each of the items in our analysis.
margins, predict(outcome(item1) class(3)) ///
    predict(outcome(item2) class(3)) ///
    predict(outcome(item3) class(3)) ///
    predict(outcome(item4) class(3)) ///
    predict(outcome(item5) class(3)) ///
    predict(outcome(item6) class(3)) ///
    predict(outcome(item7) class(3)) ///
    predict(outcome(item8) class(3)) ///
    predict(outcome(item9) class(3))
marginsplot, recast(bar) xtitle("") ytitle("") ///
    xlabel(1 "item1" 2 "item2" 3 "item3" 4 "item4" 5 "item5" 6 "item6" 7 "item7" 8 "item8" 9 "item9", angle(45)) ///
    title("Predicted Probability of Behaviors For Class 3 with 95% CI")
Number of Classes
So far we have been assuming that we have chosen the right number of latent classes. Perhaps, however, there are only two types of drinkers, or perhaps there are four or more types of drinkers. So far we have liked the three-class model, both based on our theoretical expectations and based on how interpretable our results have been. We can further assess whether we have chosen the right number of classes by running the analysis with different numbers of classes and then comparing the fit of the models. Ideally, we would compare one-class, two-class, three-class and four-class models. However, a four-class model will not run with our example data, so we will run a one-class, two-class and three-class model and then compare the results with the estimates stats command. We use the quietly prefix before each gsem command to suppress the output.
quietly gsem (item1-item9 <- ), family(bernoulli) link(logit) lclass(C 1)
estimates store oneclass
quietly gsem (item1-item9 <- ), family(bernoulli) link(logit) lclass(C 2)
estimates store twoclass
quietly gsem (item1-item9 <- ), family(bernoulli) link(logit) lclass(C 3)
estimates store threeclass

* lower is better
estimates stats oneclass twoclass threeclass

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
    oneclass |      1,000          .  -4348.878       9   8715.755   8759.925
    twoclass |      1,000          .  -4251.208      19   8540.416   8633.664
  threeclass |      1,000          .  -4231.696      29   8521.392   8663.716
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] IC note.
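The AIC is lowest for the three-class solution (8521.392), but the BIC is lowest for the two-class solution (8633.664 versus 8663.716 for three classes). The two criteria therefore do not agree on a single best model. We retain the three-class model because it fits our theoretical expectations and its classes are readily interpretable, while noting that the BIC favors the more parsimonious two-class model.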
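The information criteria in the table can be reproduced from the log likelihoods and parameter counts, using AIC = -2*ll + 2*k and BIC = -2*ll + k*ln(N), where k is the df column and N = 1,000. A short check for the two-class and three-class models:

```stata
* Three-class model: ll = -4231.696, k = 29
display -2*(-4231.696) + 2*29
display -2*(-4231.696) + 29*ln(1000)

* Two-class model: ll = -4251.208, k = 19
display -2*(-4251.208) + 2*19
display -2*(-4251.208) + 19*ln(1000)
```

The results match the table up to rounding of the log likelihoods. Because ln(1000) is about 6.9, the BIC charges roughly 6.9 per parameter where the AIC charges 2, so the ten extra parameters of the three-class model are penalized much more heavily by the BIC.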
Cautions, Flies in the Ointment
We have focused on a very simple example here just to get you started. Here are some problems to watch out for.
Have you specified the right number of latent classes? Perhaps you have specified too many classes (i.e., people largely fall into 2 classes) or you may have specified too few classes (i.e., people really fall into 4 or more classes).
Are some of your measures/indicators lousy? All of our measures were really useful in distinguishing what type of drinker the person was. However, say we had a measure that was “Do you like broccoli?”. This would be a poor indicator, and each type of drinker would probably answer in a similar way, so this question would be a good candidate to discard.
Having developed this model to identify the different types of drinkers, we might be interested in trying to predict why someone is an alcoholic, or why someone is an abstainer. For example, we might be interested in whether parental drinking predicts being an alcoholic. Such analyses are possible but not discussed here.
References
Masyn, Katherine E. (2013). Latent Class Analysis and Finite Mixture Modeling. In The Oxford Handbook of Quantitative Methods. Edited by Todd Little.