These code fragments are examples that we are using to try and understand these techniques using Mplus. We ask that you treat them as works in progress that explore these techniques, rather than definitive answers as to how to analyze any particular kind of data.
Consider the Stata file hsb6, which has 600 observations with information about students, such as their reading, writing, math and other achievement scores. For each of the variables locus, concept, mot, and read through ss, we make a binary variable called hi___ (for example, hiread) that is 1 if the score is over the median and 0 if it is below the median. These are useful when we need a binary variable. Download the Mplus-ready dataset here as hsb6.csv.
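The steps used to build the hi___ variables and the csv file are not shown, so the following is only a sketch of one way they could be produced in Stata. It assumes that hsb6.dta is in the working directory, that its variables are stored in the order given in the Mplus Names statement, that there are no missing values, and that a score exactly at the median is coded 0.

```stata
* Sketch only: the file name, variable order, and tie rule are assumptions.
use hsb6, clear

* hi<var> = 1 if the score is above its median, 0 otherwise.
foreach v of varlist locus concept mot read-ss {
    quietly summarize `v', detail
    generate byte hi`v' = (`v' > r(p50)) if !missing(`v')
}

* Mplus reads a numeric file with no header row.
export delimited using hsb6.csv, novarnames nolabel replace
```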
Example 1
A latent class analysis with 2 classes, and continuous indicators
Here is the input file
Data:
  File is hsb6.csv;
Variable:
  Names are
    id gender race ses sch prog locus concept mot career read write math
    sci ss hilocus hiconcept himot hiread hiwrite himath hisci hiss;
  Usevariables are
    read write math sci ss;
  classes = c(2);
Analysis:
  Type = mixture;
SAVEDATA:
  file is lca_ex1.txt;
  save is cprob;
  format is free;
Here is the output.
Section 1
------------------------------------------------------------------------------
FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES
BASED ON THE ESTIMATED MODEL

    Latent
   Classes

       1        274.09081          0.45682
       2        325.90919          0.54318
- One way to view the second column (the proportions) is as the average probability of falling into Class 1 and into Class 2. As a result, the first column (the counts) is the average probability times 600.
- A second way to view these numbers is by taking each person's probability of falling into a class and summing them. If person #6 has a .8 estimated probability of being in Class 1 and .2 of being in Class 2, then that person contributes .8 to Class 1 and .2 to Class 2. This is why the counts are fractional.
- A third way of viewing this is that there is an underlying continuum of the latent variable with a threshold for being categorized as Class 1 or Class 2, and that threshold can be used to compute the probabilities of being in the classes; see Section 5.
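The first two views can be written as one identity. The symbols are our own notation rather than anything Mplus prints: $\hat{p}_{ik}$ is person $i$'s estimated probability of being in Class $k$, $\bar{p}_k$ is the average of those probabilities, and $N = 600$.

$$\hat{N}_k = \sum_{i=1}^{N} \hat{p}_{ik} = N \cdot \bar{p}_k, \qquad \hat{N}_1 = 600 \times 0.45682 \approx 274.09$$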
Section 2
------------------------------------------------------------------------------
FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES
BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

Class Counts and Proportions

    Latent
   Classes

       1              272          0.45333
       2              328          0.54667
This shows the count of people who fall into each class by taking their probability of membership in each class and assigning them to the class which they have the highest probability of falling into. Note the counts are exact whole numbers.
Section 3
------------------------------------------------------------------------------
Classification Probabilities for the Most Likely Latent Class Membership (Column)
by Latent Class (Row)

           1        2

    1   0.950    0.050
    2   0.036    0.964
This is related to the output in Section 1, but takes the probabilities of class membership and averages them by class; see the Stata portion below for more on this.
Section 4
------------------------------------------------------------------------------
MODEL RESULTS

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

Latent Class 1

 Means
    READ              43.783      0.642     68.151      0.000
    WRITE             45.068      0.730     61.737      0.000
    MATH              44.794      0.469     95.540      0.000
    SCI               44.446      0.740     60.050      0.000
    SS                45.574      0.658     69.237      0.000

 Variances
    READ              46.463      2.785     16.681      0.000
    WRITE             49.428      3.011     16.415      0.000
    MATH              46.634      3.133     14.884      0.000
    SCI               49.022      3.388     14.470      0.000
    SS                62.216      4.109     15.141      0.000

Latent Class 2

 Means
    READ              58.730      0.605     96.999      0.000
    WRITE             58.538      0.497    117.763      0.000
    MATH              57.782      0.687     84.119      0.000
    SCI               57.917      0.499    116.079      0.000
    SS                57.488      0.589     97.628      0.000

 Variances
    READ              46.463      2.785     16.681      0.000
    WRITE             49.428      3.011     16.415      0.000
    MATH              46.634      3.133     14.884      0.000
    SCI               49.022      3.388     14.470      0.000
    SS                62.216      4.109     15.141      0.000
This shows the mean of each score for the two classes. Class 1 is a low performing group, and Class 2 is a high performing group. Note that the variances reported for the two classes are identical in this output.
Section 5
------------------------------------------------------------------------------
Categorical Latent Variables

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

 Means
    C#1               -0.173      0.133     -1.298      0.194
This is the threshold for dividing the two classes. If you are below the threshold, you are in Class 1; if you are above it, you are in Class 2. We see the threshold is -0.173. We can convert this threshold to a probability as follows, letting $t_1$ be Threshold 1:
$$P(\mbox{Class}=1) = \dfrac{1}{1 + \exp(-t_1)} = \dfrac{1}{1 + \exp(0.173)} \approx 0.4569$$
(compare to Section 1 above).
$$P(\mbox{Class}=2) = 1 - \dfrac{1}{1 + \exp(-t_1)} = 1 - \dfrac{1}{1 + \exp(0.173)} \approx 0.5431$$
(compare to Section 1 above).
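This conversion can be checked directly in Stata. The sketch below uses the built-in invlogit() function, which returns exp(x)/(1 + exp(x)) and so equals 1/(1 + exp(-x)); the value -0.173 is the rounded estimate from Section 5, so the results agree with Section 1 only to about three or four decimals.

```stata
* Probability of Class 1 and Class 2 implied by the threshold C#1 = -0.173
display "P(Class 1) = " %6.4f invlogit(-0.173)
display "P(Class 2) = " %6.4f 1 - invlogit(-0.173)
```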
------------------------------------------------------------------------------
We now read the saved data file into Stata for comparison to the Mplus output.
infile read write math sci ss cprob1 cprob2 class using lca_ex1.txt
Below we show some observations from the middle of this file. Note that cprob1 is the probability of being in Class 1, cprob2 is the probability of being in Class 2, and class is the class membership based on the class with the highest probability.
list in 200/210

     +-------------------------------------------------------------+
     | read   write   math    sci     ss   cprob1   cprob2   class |
     |-------------------------------------------------------------|
200. | 46.9    52.1   42.5   47.7   60.5     .944     .056       1 |
201. | 46.9    51.5     57   49.8   40.6       .9       .1       1 |
202. | 46.9    52.8   49.3   53.1   35.6     .983     .017       1 |
203. | 46.9    43.7   41.9   41.7   35.6        1        0       1 |
204. | 46.9    61.9     53   52.6   60.5     .016     .984       2 |
     |-------------------------------------------------------------|
205. | 46.9    41.1   45.3   47.1   55.6     .998     .002       1 |
206. | 46.9    38.5   47.1   41.7   25.7        1        0       1 |
207. | 46.9    54.1   46.4   49.8   55.6     .827     .173       1 |
208. | 46.9    51.5   48.5   49.8   50.6     .934     .066       1 |
209. | 46.9    41.1   53.6   41.7   55.6     .995     .005       1 |
     |-------------------------------------------------------------|
210. | 46.9    61.9   46.2   60.7   45.6      .17      .83       2 |
     +-------------------------------------------------------------+
Note that if we tabulate class we see where the values from Section 2 of the output came from.
tab class

      class |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        272       45.33       45.33
          2 |        328       54.67      100.00
------------+-----------------------------------
      Total |        600      100.00
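To see that class is simply the class with the larger probability, we can rebuild it from cprob1 and cprob2. This is a sketch: the variable name modal is ours, and sending ties to Class 1 is an assumption.

```stata
* Assign each person to the class with the larger posterior probability.
* Ties go to Class 1 here (an assumption).
generate byte modal = cond(cprob1 >= cprob2, 1, 2)

* If the rule matches the saved variable, the off-diagonal cells are empty.
tabulate modal class
```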
Note that if we take the average of cprob1 and cprob2, we can relate these values to Column 2 of Section 1 of the output.
summ cprob1 cprob2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      cprob1 |       600    .4568233    .4664192          0          1
      cprob2 |       600    .5431767    .4664192          0          1
If we sum the probabilities, we can relate these to Column 1 of Section 1 of the output.
tabstat cprob1 cprob2, stat(sum)

   stats |    cprob1    cprob2
---------+--------------------
     sum |   274.094   325.906
------------------------------
If we average the probabilities by class, we can relate these values to Section 3 of the output. The averages are close to the Section 3 entries but do not match them exactly.
tabstat cprob1 cprob2, by(class)

Summary statistics: mean
  by categories of: class

   class |    cprob1    cprob2
---------+--------------------
       1 |  .9570699  .0429301
       2 |  .0419848  .9580152
---------+--------------------
   Total |  .4568233  .5431767
------------------------------
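One reading that does reproduce the Section 3 numbers is to condition on the latent class (the rows) rather than on the assigned class: for latent class k, take the sum of cprobk over the people assigned to class j and divide it by the sum of cprobk over everyone. With the figures shown above, 272 × .9570699 / 274.094 is about .950 and 328 × .0419848 / 274.094 is about .050, which is the first row of Section 3. The sketch below does this for both rows. That Mplus computes the table this way is our inference from the numbers, not something stated in the output.

```stata
* Row k of Section 3: share of latent class k's total probability that
* belongs to people assigned (by highest probability) to class j.
forvalues k = 1/2 {
    quietly summarize cprob`k'
    scalar tot = r(sum)
    forvalues j = 1/2 {
        quietly summarize cprob`k' if class == `j'
        display "latent class `k', assigned to `j': " %5.3f r(sum)/tot
    }
}
```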
Say that we get the mean of the reading, writing, math, science and social science scores and weight them by the probability of being in Class 1 and then again weighting by the probability of being in Class 2. Note the correspondence between these means and the means from Section 4 of the output.
tabstat read write math sci ss [aw=cprob1], stat(mean)

   stats |      read     write      math       sci        ss
---------+--------------------------------------------------
    mean |  43.78268  45.06829  44.79421  44.44601   45.5743
------------------------------------------------------------

tabstat read write math sci ss [aw=cprob2], stat(mean)

   stats |      read     write      math       sci        ss
---------+--------------------------------------------------
    mean |  58.73021  58.53821  57.78224  57.91736  57.48822
------------------------------------------------------------
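The weighted mean that tabstat reports with [aw=cprob1] is the sum of cprob1 × score divided by the sum of cprob1. As a sketch, here is that calculation done by hand for read in Class 1; the names w_read1 and num are ours. The result should agree with the 43.78268 shown above.

```stata
* Probability-weighted mean of read for Class 1:
* sum(cprob1 * read) / sum(cprob1)
generate double w_read1 = cprob1 * read
quietly summarize w_read1
scalar num = r(sum)
quietly summarize cprob1
display num / r(sum)
```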
Example 2
A latent class analysis with 3 classes, and continuous indicators
Here is the input file
Data:
  File is hsb6.csv;
Variable:
  Names are
    id gender race ses sch prog locus concept mot career read write math
    sci ss hilocus hiconcept himot hiread hiwrite himath hisci hiss;
  Usevariables are
    read write math sci ss;
  classes = c(3);
Analysis:
  Type = mixture;
SAVEDATA:
  file is lca_ex2.txt;
  save is cprob;
  format is free;
Here is the output
Section 1
------------------------------------------------------------------------------
FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES
BASED ON THE ESTIMATED MODEL

    Latent
   Classes

       1        194.55393          0.32426
       2        153.04798          0.25508
       3        252.39809          0.42066
- One way to view the second column (the proportions) is as the average probability of falling into each of the three classes. As a result, the first column (the counts) is the average probability times 600 (see Stata example below for comparison).
- A second way to view these numbers is by taking each person's probability of falling into a class and summing them. If person #6 has a .8 estimated probability of being in Class 1 and .2 of being in Class 2, then that person contributes .8 to Class 1 and .2 to Class 2. This is why the counts are fractional (see Stata example below for comparison).
- A third way of viewing this is that there is an underlying continuum of the latent variable with thresholds that separate the classes. With three classes there are two thresholds, C#1 and C#2, and they can be converted to the probabilities of being in each class; see Section 5.
Section 2
------------------------------------------------------------------------------
FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES
BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

Class Counts and Proportions

    Latent
   Classes

       1              197          0.32833
       2              154          0.25667
       3              249          0.41500
This shows the count of people who fall into each class by taking their probability of membership in each class and assigning them to the class which they have the highest probability of falling into. Note the counts are exact whole numbers.
Section 3
------------------------------------------------------------------------------
Classification Probabilities for the Most Likely Latent Class Membership (Column)
by Latent Class (Row)

           1        2        3

    1   0.952    0.000    0.048
    2   0.000    0.919    0.081
    3   0.047    0.053    0.900
This is related to the output in Section 1, but takes the probabilities of class membership and averages them by class, see Stata portion below for more on this.
Section 4
------------------------------------------------------------------------------
MODEL RESULTS

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

Latent Class 1

 Means
    READ              41.735      0.477     87.540      0.000
    WRITE             42.703      0.962     44.390      0.000
    MATH              43.178      0.516     83.648      0.000
    SCI               42.160      0.663     63.625      0.000
    SS                43.848      0.695     63.097      0.000

 Variances
    READ              32.997      2.820     11.699      0.000
    WRITE             42.369      3.775     11.223      0.000
    MATH              34.562      2.422     14.269      0.000
    SCI               38.395      2.714     14.146      0.000
    SS                53.884      3.850     13.996      0.000

Latent Class 2

 Means
    READ              63.644      0.948     67.117      0.000
    WRITE             61.193      0.453    135.170      0.000
    MATH              62.610      0.865     72.404      0.000
    SCI               61.648      0.667     92.451      0.000
    SS                61.232      0.758     80.759      0.000

 Variances
    READ              32.997      2.820     11.699      0.000
    WRITE             42.369      3.775     11.223      0.000
    MATH              34.562      2.422     14.269      0.000
    SCI               38.395      2.714     14.146      0.000
    SS                53.884      3.850     13.996      0.000

Latent Class 3

 Means
    READ              52.618      0.925     56.867      0.000
    WRITE             54.507      0.727     74.938      0.000
    MATH              52.008      0.835     62.319      0.000
    SCI               53.172      0.835     63.681      0.000
    SS                52.794      0.808     65.325      0.000

 Variances
    READ              32.997      2.820     11.699      0.000
    WRITE             42.369      3.775     11.223      0.000
    MATH              34.562      2.422     14.269      0.000
    SCI               38.395      2.714     14.146      0.000
    SS                53.884      3.850     13.996      0.000
This shows the mean of each score for the three classes. Class 1 is a low performing group, Class 2 is a high performing group, and Class 3 is a medium performing group.
Section 5
------------------------------------------------------------------------------
Categorical Latent Variables

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

 Means
    C#1               -0.260      0.130     -2.010      0.044
    C#2               -0.500      0.181     -2.766      0.006
These are the thresholds for dividing the three classes. Note that this is now like a multinomial logistic regression, where the thresholds divide three multinomial categories, with Class 3 being the reference category: C#1 is the threshold for being in Class 1 as compared to Class 3, and C#2 is the threshold for being in Class 2 as compared to Class 3. Let $t_1$ be threshold 1 (-0.260) and $t_2$ be threshold 2 (-0.500). For the comparison group, Class 3, the probability of being in that class is
$$P(\mbox{Class}=3) = \dfrac{1}{1 + \exp(t_1) + \exp(t_2)} = \dfrac{1}{1 + \exp(-0.260) + \exp(-0.500)} = 0.421$$
For Classes 1 and 2, the formula is a bit different since these are not the comparison class. For Class 1, the formula is
$$P(\mbox{Class}=1) = \dfrac{\exp(t_1)}{1 + \exp(t_1) + \exp(t_2)} = \dfrac{\exp(-0.260)}{1 + \exp(-0.260) + \exp(-0.500)} = 0.324$$
For Class 2, the formula is
$$P(\mbox{Class}=2) = \dfrac{\exp(t_2)}{1 + \exp(t_1) + \exp(t_2)} = \dfrac{\exp(-0.500)}{1 + \exp(-0.260) + \exp(-0.500)} = 0.255$$
These match the proportions in Section 1 above. With only two classes the same formula reduces to the one used in Example 1, since $\exp(t_1)/(1 + \exp(t_1)) = 1/(1 + \exp(-t_1))$.
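As a check against Section 1, the class proportions can be recovered from the two Section 5 estimates by treating them as multinomial log-odds with Class 3 as the reference (its value fixed at 0). The sketch below uses the rounded estimates, and the scalar names are ours. The three results should come out near 0.324, 0.255 and 0.421, in line with the Section 1 proportions.

```stata
* C#1 and C#2 as log-odds of Class 1 and Class 2 relative to Class 3
scalar t1 = -0.260
scalar t2 = -0.500
scalar denom = 1 + exp(t1) + exp(t2)

display "P(Class 1) = " %6.4f exp(t1)/denom
display "P(Class 2) = " %6.4f exp(t2)/denom
display "P(Class 3) = " %6.4f 1/denom
```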
------------------------------------------------------------------------------
We now read the saved data file into Stata for comparison to the Mplus output.
infile read write math sci ss cprob1 cprob2 cprob3 class using lca_ex2.txt
Below we show observations from the middle of this file. Note that cprob1 is the probability of being in Class 1 and cprob2 is the probability of being in Class 2, cprob3 is the probability of being in Class 3, and class is the class membership based on the class with the highest probability. Note that we don’t see any folks in Class 3 here, but there are members of Class 3.
list in 200/210

     +----------------------------------------------------------------------+
     | read   write   math    sci     ss   cprob1   cprob2   cprob3   class |
     |----------------------------------------------------------------------|
200. | 46.9    52.1   42.5   47.7   60.5     .133     .867        0       2 |
201. | 46.9    51.5     57   49.8   40.6     .062     .938        0       2 |
202. | 46.9    52.8   49.3   53.1   35.6     .228     .772        0       2 |
203. | 46.9    43.7   41.9   41.7   35.6     .998     .002        0       1 |
204. | 46.9    61.9     53   52.6   60.5        0     .996     .004       2 |
     |----------------------------------------------------------------------|
205. | 46.9    41.1   45.3   47.1   55.6     .812     .188        0       1 |
206. | 46.9    38.5   47.1   41.7   25.7        1        0        0       1 |
207. | 46.9    54.1   46.4   49.8   55.6     .039     .961        0       2 |
208. | 46.9    51.5   48.5   49.8   50.6       .1       .9        0       2 |
209. | 46.9    41.1   53.6   41.7   55.6     .709     .291        0       1 |
     |----------------------------------------------------------------------|
210. | 46.9    61.9   46.2   60.7   45.6     .001     .999        0       2 |
     +----------------------------------------------------------------------+
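Note that if we tabulate class we see where the values from Section 2 of the output came from. The labels for Classes 2 and 3 in this saved file are swapped relative to the output shown above (249 people are in Class 2 here, versus 249 in Class 3 above), so in the Stata comparisons that follow, Class 2 corresponds to Latent Class 3 of the output and Class 3 corresponds to Latent Class 2.
tab class

      class |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        197       32.83       32.83
          2 |        249       41.50       74.33
          3 |        154       25.67      100.00
------------+-----------------------------------
      Total |        600      100.00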
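The counts for Classes 2 and 3 in this tabulation (249 and 154) are the reverse of those in Section 2 of the output (154 and 249), so the two class labels appear to be switched in the saved file. If we want the labels to line up with the printed output, one option is to recode them, as in this sketch; the variable name class_mplus is ours.

```stata
* Swap labels 2 and 3 so the saved classes follow the printed output's numbering
recode class (2 = 3) (3 = 2), generate(class_mplus)
tabulate class_mplus
```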
Note that if we take the average of cprob1, cprob2, and cprob3, we can relate these values to Column 2 of Section 1 of the output.
summ cprob1 cprob2 cprob3

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      cprob1 |       600    .3242633     .440132          0          1
      cprob2 |       600    .4206317    .4326395          0       .999
      cprob3 |       600    .2550817    .3989861          0          1
If we sum the probabilities, we can relate these to Column 1 of Section 1 of the output.
tabstat cprob1 cprob2 cprob3, stat(sum)

   stats |    cprob1    cprob2    cprob3
---------+------------------------------
     sum |   194.558   252.379   153.049
----------------------------------------
If we average the probabilities by class, we can relate these values to Section 3 of the output. The averages are close to the Section 3 entries but do not match them exactly.
tabstat cprob1 cprob2 cprob3, by(class)

Summary statistics: mean
  by categories of: class

   class |    cprob1    cprob2    cprob3
---------+------------------------------
       1 |  .9401117  .0598883         0
       2 |  .0375743  .9123735   .049996
       3 |         0   .087013   .912987
---------+------------------------------
   Total |  .3242633  .4206317  .2550817
----------------------------------------
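As with two classes, the Section 3 entries can be reproduced by conditioning on the latent class instead of the assigned class: for latent class k, divide the sum of cprobk among people assigned to class j by the sum of cprobk over everyone. For example, 197 × .9401117 / 194.558 is about .952, the first entry of Section 3. The sketch below prints all nine values using the class labels as they appear in the saved file, where the counts for Classes 2 and 3 are reversed relative to the printed output. That Mplus builds the table this way is our inference from the numbers.

```stata
* Row k: share of latent class k's total probability that belongs to
* people assigned (by highest probability) to class j.
forvalues k = 1/3 {
    quietly summarize cprob`k'
    scalar tot = r(sum)
    forvalues j = 1/3 {
        quietly summarize cprob`k' if class == `j'
        display "latent class `k', assigned to `j': " %5.3f r(sum)/tot
    }
}
```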
Say that we get the mean of the reading, writing, math, science and social science scores and weight them by the probability of being in Class 1 and then again weighting by the probability of being in Class 2, and likewise for Class 3. Note the correspondence between these means and the means from Section 4 of the output.
tabstat read write math sci ss [aw=cprob1], stat(mean)

   stats |      read     write      math       sci        ss
---------+--------------------------------------------------
    mean |  41.73485  42.70297  43.17746  42.16013  43.84801
------------------------------------------------------------

tabstat read write math sci ss [aw=cprob2], stat(mean)

   stats |      read     write      math       sci        ss
---------+--------------------------------------------------
    mean |  52.61804  54.50678  52.00815  53.17197  52.79395
------------------------------------------------------------

tabstat read write math sci ss [aw=cprob3], stat(mean)

   stats |      read     write      math       sci        ss
---------+--------------------------------------------------
    mean |  63.64527  61.19303  62.61002   61.6482   61.2325
------------------------------------------------------------