Mplus Class Notes Analyzing Data: Latent Class and Other Mixture Models

Mixture models are measurement models that use observed variables as indicators of one or more nominal (i.e. categorical) latent variables. One way to think about mixture models is that one is attempting to identify subsets or "classes" of observations within the observed data. The latent variable (classes) is categorical, but the indicators may be either categorical or continuous. The term latent class analysis is often used to refer to a mixture model in which all of the observed indicator variables are categorical.

Mplus version 5.2 was used for these examples.

1.0 Latent class analysis

The examples on this page use a dataset with information on high school students’ academic histories. In the first example below, a 2 class model is estimated using four dichotomous variables as indicators (category 1 = no, category 2 = yes). The variables are whether the student had taken honors math (hm), honors English (he), or vocational classes (voc); and whether the student reported they were unlikely to go to college (nocol). The expected classes are academically oriented students (i.e. students who took honors classes, did not take vocational classes and reported they were likely to go to college), and students who are less academically oriented. The dataset for this example is https://stats.idre.ucla.edu/wp-content/uploads/2016/02/lca.dat.

The input file for this model is shown below. The usevariables option of the variable: command specifies which variables will be used in this analysis (necessary when not all of the variables in the dataset are used). The classes option identifies the name of the latent variable (in this case c), followed by the number of classes to be estimated in parentheses (in this case 2). Note that the class variable(s) can be assigned any valid variable name. The categorical option of the variable: command tells Mplus which variables are categorical. The type option of the analysis: command specifies the type of model to be estimated, in this case a mixture model.

TITLE:	A latent class analysis (LCA)
Data:	
   file is https://stats.idre.ucla.edu/wp-content/uploads/2016/02/lca.dat;
Variable:	
   names are hm he voc nocol ach9-ach12;
   usevariables are hm he voc nocol ;
   classes = c (2);
   categorical = hm he voc nocol ;
Analysis:	
   type = mixture;

The output for this model is shown below.

INPUT READING TERMINATED NORMALLY



A latent class analysis (LCA)

SUMMARY OF ANALYSIS

Number of groups                                                 1
Number of observations                                         500

Number of dependent variables                                    4
Number of independent variables                                  0
Number of continuous latent variables                            0
Number of categorical latent variables                           1

Observed dependent variables

  Binary and ordered categorical (ordinal)
   HM          HE          VOC         NOCOL

Categorical latent variables
   C


Estimator                                                      MLR
Information matrix                                        OBSERVED
Optimization Specifications for the Quasi-Newton Algorithm for
Continuous Outcomes
  Maximum number of iterations                                 100
  Convergence criterion                                  0.100D-05
Optimization Specifications for the EM Algorithm
  Maximum number of iterations                                 500
  Convergence criteria
    Loglikelihood change                                 0.100D-06
    Relative loglikelihood change                        0.100D-06
    Derivative                                           0.100D-05
Optimization Specifications for the M step of the EM Algorithm for
Categorical Latent variables
  Number of M step iterations                                    1
  M step convergence criterion                           0.100D-05
  Basis for M step termination                           ITERATION
Optimization Specifications for the M step of the EM Algorithm for
Censored, Binary or Ordered Categorical (Ordinal), Unordered
Categorical (Nominal) and Count Outcomes
  Number of M step iterations                                    1
  M step convergence criterion                           0.100D-05
  Basis for M step termination                           ITERATION
  Maximum value for logit thresholds                            15
  Minimum value for logit thresholds                           -15
  Minimum expected cell size for chi-square              0.100D-01
Optimization algorithm                                         EMA
Random Starts Specifications
  Number of initial stage random starts                         10
  Number of final stage optimizations                            2
  Number of initial stage iterations                            10
  Initial stage convergence criterion                    0.100D+01
  Random starts scale                                    0.500D+01
  Random seed for generating random starts                       0
Link                                                         LOGIT

Input data file(s)
  https://stats.idre.ucla.edu/wp-content/uploads/2016/02/lca.dat
Input data format  FREE


SUMMARY OF CATEGORICAL DATA PROPORTIONS

    HM
      Category 1    0.678
      Category 2    0.322
    HE
      Category 1    0.686
      Category 2    0.314
    VOC
      Category 1    0.322
      Category 2    0.678
    NOCOL
      Category 1    0.334
      Category 2    0.666


RANDOM STARTS RESULTS RANKED FROM THE BEST TO THE WORST LOGLIKELIHOOD VALUES

Final stage loglikelihood values at local maxima, seeds, and initial stage start numbers:

            -965.244  93468            3
            -965.244  939021           8



THE MODEL ESTIMATION TERMINATED NORMALLY



TESTS OF MODEL FIT

Loglikelihood

          H0 Value                        -965.244
          H0 Scaling Correction Factor       1.013
            for MLR

Information Criteria

          Number of Free Parameters              9
          Akaike (AIC)                    1948.488
          Bayesian (BIC)                  1986.420
          Sample-Size Adjusted BIC        1957.853
            (n* = (n + 2) / 24)

Chi-Square Test of Model Fit for the Binary and Ordered Categorical
(Ordinal) Outcomes

          Pearson Chi-Square

          Value                              6.287
          Degrees of Freedom                     6
          P-Value                           0.3918

          Likelihood Ratio Chi-Square

          Value                              5.605
          Degrees of Freedom                     6
          P-Value                           0.4688



FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES
BASED ON THE ESTIMATED MODEL

    Latent
   Classes

       1        136.38198          0.27276
       2        363.61802          0.72724


FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASS PATTERNS
BASED ON ESTIMATED POSTERIOR PROBABILITIES

    Latent
   Classes

       1        136.38170          0.27276
       2        363.61830          0.72724


CLASSIFICATION QUALITY

     Entropy                         0.904


CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

Class Counts and Proportions

    Latent
   Classes

       1              127          0.25400
       2              373          0.74600


Average Latent Class Probabilities for Most Likely Latent Class Membership (Row)
by Latent Class (Column)

           1        2

    1   0.986    0.014
    2   0.030    0.970


MODEL RESULTS

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

Latent Class 1

 Thresholds
    HM$1              -2.063      0.373     -5.536      0.000
    HE$1              -1.724      0.300     -5.755      0.000
    VOC$1              2.331      0.389      5.985      0.000
    NOCOL$1            2.078      0.320      6.490      0.000

Latent Class 2

 Thresholds
    HM$1               2.091      0.182     11.502      0.000
    HE$1               2.056      0.180     11.401      0.000
    VOC$1             -2.187      0.203    -10.760      0.000
    NOCOL$1           -1.937      0.183    -10.613      0.000

Categorical Latent Variables

 Means
    C#1               -0.981      0.116     -8.468      0.000


RESULTS IN PROBABILITY SCALE

Latent Class 1

 HM
    Category 1         0.113      0.037      3.025      0.002
    Category 2         0.887      0.037     23.799      0.000
 HE
    Category 1         0.151      0.038      3.934      0.000
    Category 2         0.849      0.038     22.056      0.000
 VOC
    Category 1         0.911      0.031     28.987      0.000
    Category 2         0.089      0.031      2.817      0.005
 NOCOL
    Category 1         0.889      0.032     28.072      0.000
    Category 2         0.111      0.032      3.514      0.000

Latent Class 2

 HM
    Category 1         0.890      0.018     50.016      0.000
    Category 2         0.110      0.018      6.181      0.000
 HE
    Category 1         0.887      0.018     48.873      0.000
    Category 2         0.113      0.018      6.256      0.000
 VOC
    Category 1         0.101      0.018      5.472      0.000
    Category 2         0.899      0.018     48.748      0.000
 NOCOL
    Category 1         0.126      0.020      6.267      0.000
    Category 2         0.874      0.020     43.498      0.000


LATENT CLASS ODDS RATIO RESULTS

Latent Class 1 Compared to Latent Class 2

 HM
    Category > 1      63.670     25.875      2.461      0.014
 HE
    Category > 1      43.795     14.941      2.931      0.003
 VOC
    Category > 1       0.011      0.005      2.351      0.019
 NOCOL
    Category > 1       0.018      0.007      2.768      0.006


QUALITY OF NUMERICAL RESULTS

     Condition Number for the Information Matrix              0.600E-01
       (ratio of smallest to largest eigenvalue)

Under FINAL CLASS COUNTS…, Mplus gives the final counts and proportions for the classes in several ways. First it gives the counts (i.e. the number of cases in each class) and proportions based on the estimated model, and then based on the posterior probabilities. These give the count and proportion of individuals estimated to be in each class in the model. Below that, Mplus gives the classification based on most likely class membership, an alternative method in which each individual is assigned entirely to the class for which their posterior probability is highest. Based on the estimated model and posterior probabilities, we see that about 27% of students belong to class 1, and about 73% belong to class 2. Based on most likely class membership, about 25% of students belong to class 1 and the remaining 75% to class 2.

Under MODEL RESULTS the thresholds for the classes are listed. Thresholds are on the logit scale and hence can be somewhat difficult to interpret. The same information is given on a more interpretable scale under RESULTS IN PROBABILITY SCALE. Here we see that the probability that an individual in class 1 will be in category 2 of the variable hm is .89. In other words, the estimated probability of a student in class 1 taking honors math is about .89.
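The link between the logit-scale thresholds and the probability-scale results can be checked by hand. The sketch below is not Mplus code; it is a small Python illustration, and the function names are made up for this example. It assumes the usual logit parameterization for a binary item, in which the probability of category 1 is the logistic function of the threshold, and it assumes that the last class is the reference class for the C#1 mean. The input values are the rounded estimates printed in the output above, so results agree with Mplus only up to rounding.

```python
import math


def category_probs(threshold):
    """Return (P(category 1), P(category 2)) for a binary item.

    Assumes P(category 1) = 1 / (1 + exp(-threshold)).
    """
    p_cat1 = 1.0 / (1.0 + math.exp(-threshold))
    return p_cat1, 1.0 - p_cat1


def odds_ratio(threshold_a, threshold_b):
    """Odds of category 2 in class a relative to class b."""
    return math.exp(threshold_b - threshold_a)


def class_proportions(mean_c1):
    """Class proportions in a two-class model from the C#1 mean.

    Assumes class 2 is the reference class (its logit is fixed at 0).
    """
    p_class1 = math.exp(mean_c1) / (1.0 + math.exp(mean_c1))
    return p_class1, 1.0 - p_class1


# HM$1 thresholds copied from MODEL RESULTS (class 1, class 2).
hm_class1, hm_class2 = -2.063, 2.091

p_cat1, p_cat2 = category_probs(hm_class1)
print(round(p_cat1, 3), round(p_cat2, 3))  # 0.113 0.887

# Close to the 63.670 reported under LATENT CLASS ODDS RATIO RESULTS;
# the small gap comes from using thresholds rounded to three decimals.
print(odds_ratio(hm_class1, hm_class2))

# Close to the 0.27276 / 0.72724 reported under FINAL CLASS COUNTS.
print(class_proportions(-0.981))
```

Applying category_probs to the other thresholds should reproduce the remaining rows under RESULTS IN PROBABILITY SCALE in the same way.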

2.0 Using both categorical and continuous indicator variables

Above we estimated a specific case of a mixture model, a latent class analysis, in which all of the indicators are categorical. In this example the model contains both categorical and continuous indicators. In addition to the four categorical variables used in the example above, this model includes four continuous variables: the student's score on a measure of academic achievement for each of the four years of high school (ach9–ach12). The achievement variables have been centered so that each has a mean of zero. The only difference between the input file for this model and the one for the LCA estimated above is that the usevariables option has been dropped, because all variables in the dataset are used in the model. In general, the only difference between the input file for a mixture model with all categorical indicators and the input for a model that includes continuous variables is the type of variables included.

Title: Categorical and continuous indicators
Data: 
   file is https://stats.idre.ucla.edu/wp-content/uploads/2016/02/lca.dat;
Variable:
   names are hm he voc nocol ach9-ach12;
   classes = c (2);
   categorical = hm he voc nocol ;
Analysis: 
   type = mixture;

Below is the output for this model.

*** WARNING in MODEL command
  All variables are uncorrelated with all other variables within class.
  Check that this is what is intended.
   1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS



Categorical and continuous indicators

SUMMARY OF ANALYSIS

Number of groups                                                 1
Number of observations                                         500

Number of dependent variables                                    8
Number of independent variables                                  0
Number of continuous latent variables                            0
Number of categorical latent variables                           1

Observed dependent variables

  Continuous
   ACH9        ACH10       ACH11       ACH12

  Binary and ordered categorical (ordinal)
   HM          HE          VOC         NOCOL

Categorical latent variables
   C


Estimator                                                      MLR
Information matrix                                        OBSERVED
Optimization Specifications for the Quasi-Newton Algorithm for
Continuous Outcomes
  Maximum number of iterations                                 100
  Convergence criterion                                  0.100D-05
Optimization Specifications for the EM Algorithm
  Maximum number of iterations                                 500
  Convergence criteria
    Loglikelihood change                                 0.100D-06
    Relative loglikelihood change                        0.100D-06
    Derivative                                           0.100D-05
Optimization Specifications for the M step of the EM Algorithm for
Categorical Latent variables
  Number of M step iterations                                    1
  M step convergence criterion                           0.100D-05
  Basis for M step termination                           ITERATION
Optimization Specifications for the M step of the EM Algorithm for
Censored, Binary or Ordered Categorical (Ordinal), Unordered
Categorical (Nominal) and Count Outcomes
  Number of M step iterations                                    1
  M step convergence criterion                           0.100D-05
  Basis for M step termination                           ITERATION
  Maximum value for logit thresholds                            15
  Minimum value for logit thresholds                           -15
  Minimum expected cell size for chi-square              0.100D-01
Optimization algorithm                                         EMA
Random Starts Specifications
  Number of initial stage random starts                         10
  Number of final stage optimizations                            2
  Number of initial stage iterations                            10
  Initial stage convergence criterion                    0.100D+01
  Random starts scale                                    0.500D+01
  Random seed for generating random starts                       0
Link                                                         LOGIT

Input data file(s)
  https://stats.idre.ucla.edu/wp-content/uploads/2016/02/lca.dat
Input data format  FREE


SUMMARY OF CATEGORICAL DATA PROPORTIONS

    HM
      Category 1    0.678
      Category 2    0.322
    HE
      Category 1    0.686
      Category 2    0.314
    VOC
      Category 1    0.322
      Category 2    0.678
    NOCOL
      Category 1    0.334
      Category 2    0.666


RANDOM STARTS RESULTS RANKED FROM THE BEST TO THE WORST LOGLIKELIHOOD VALUES

Final stage loglikelihood values at local maxima, seeds, and initial stage start numbers:

           -3842.353  unperturbed      0
           -3842.353  462953           7



THE MODEL ESTIMATION TERMINATED NORMALLY



TESTS OF MODEL FIT

Loglikelihood

          H0 Value                       -3842.353
          H0 Scaling Correction Factor       0.982
            for MLR

Information Criteria

          Number of Free Parameters             21
          Akaike (AIC)                    7726.706
          Bayesian (BIC)                  7815.213
          Sample-Size Adjusted BIC        7748.557
            (n* = (n + 2) / 24)

Chi-Square Test of Model Fit for the Binary and Ordered Categorical
(Ordinal) Outcomes

          Pearson Chi-Square

          Value                              7.628
          Degrees of Freedom                     6
          P-Value                           0.2666

          Likelihood Ratio Chi-Square

          Value                              6.974
          Degrees of Freedom                     6
          P-Value                           0.3233



FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES
BASED ON THE ESTIMATED MODEL

    Latent
   Classes

       1        367.56581          0.73513
       2        132.43419          0.26487


FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASS PATTERNS
BASED ON ESTIMATED POSTERIOR PROBABILITIES

    Latent
   Classes

       1        367.56581          0.73513
       2        132.43419          0.26487


CLASSIFICATION QUALITY

     Entropy                         0.998


CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

Class Counts and Proportions

    Latent
   Classes

       1              368          0.73600
       2              132          0.26400


Average Latent Class Probabilities for Most Likely Latent Class Membership (Row)
by Latent Class (Column)

           1        2

    1   0.999    0.001
    2   0.000    1.000


MODEL RESULTS

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

Latent Class 1

 Means
    ACH9              -2.058      0.055    -37.121      0.000
    ACH10             -2.061      0.051    -40.656      0.000
    ACH11             -0.987      0.055    -18.070      0.000
    ACH12             -0.990      0.052    -19.023      0.000

 Thresholds
    HM$1               2.021      0.162     12.453      0.000
    HE$1               2.075      0.166     12.521      0.000
    VOC$1             -2.075      0.166    -12.525      0.000
    NOCOL$1           -1.931      0.157    -12.280      0.000

 Variances
    ACH9               1.116      0.073     15.346      0.000
    ACH10              0.956      0.058     16.601      0.000
    ACH11              1.031      0.059     17.382      0.000
    ACH12              0.946      0.060     15.727      0.000

Latent Class 2

 Means
    ACH9               1.988      0.091     21.870      0.000
    ACH10              1.971      0.087     22.653      0.000
    ACH11              0.987      0.081     12.248      0.000
    ACH12              0.829      0.080     10.425      0.000

 Thresholds
    HM$1              -2.101      0.282     -7.440      0.000
    HE$1              -1.954      0.266     -7.354      0.000
    VOC$1              2.267      0.302      7.514      0.000
    NOCOL$1            2.306      0.303      7.617      0.000

 Variances
    ACH9               1.116      0.073     15.346      0.000
    ACH10              0.956      0.058     16.601      0.000
    ACH11              1.031      0.059     17.382      0.000
    ACH12              0.946      0.060     15.727      0.000

Categorical Latent Variables

 Means
      C#1              1.021      0.102     10.055      0.000


RESULTS IN PROBABILITY SCALE

Latent Class 1

 HM
    Category 1         0.883      0.017     52.665      0.000
    Category 2         0.117      0.017      6.977      0.000
 HE
    Category 1         0.888      0.016     54.096      0.000
    Category 2         0.112      0.016      6.792      0.000
 VOC
    Category 1         0.112      0.016      6.794      0.000
    Category 2         0.888      0.016     54.114      0.000
 NOCOL
    Category 1         0.127      0.017      7.283      0.000
    Category 2         0.873      0.017     50.207      0.000

Latent Class 2

 HM
    Category 1         0.109      0.027      3.974      0.000
    Category 2         0.891      0.027     32.487      0.000
 HE
    Category 1         0.124      0.029      4.296      0.000
    Category 2         0.876      0.029     30.323      0.000
 VOC
    Category 1         0.906      0.026     35.304      0.000
    Category 2         0.094      0.026      3.658      0.000
 NOCOL
    Category 1         0.909      0.025     36.451      0.000
    Category 2         0.091      0.025      3.632      0.000


LATENT CLASS ODDS RATIO RESULTS

Latent Class 1 Compared to Latent Class 2

 HM
    Category > 1       0.016      0.005      3.066      0.002
 HE
    Category > 1       0.018      0.006      3.188      0.001
 VOC
    Category > 1      76.870     26.448      2.906      0.004
 NOCOL
    Category > 1      69.181     23.612      2.930      0.003


QUALITY OF NUMERICAL RESULTS

     Condition Number for the Information Matrix              0.275E-01
       (ratio of smallest to largest eigenvalue)

Towards the top of the output is a message warning us that all of the variables are uncorrelated with each other within class. This "warning" does not imply a problem with the model. It is merely there to remind the user that the restriction exists; whether this restriction is appropriate must be determined by the user. In addition to the thresholds for the categorical items (which were included in the output for the previous example), the output for this model includes means and variances for the continuous indicators (i.e. ach9–ach12). The means for the academic achievement variables (ach9–ach12) are all lower in the first class than in the second class. Students in the first class are also less likely to have taken honors classes (hm and he), more likely to have taken vocational classes (voc), and more likely to say they don't intend to go to college (nocol). Although the order of the classes has reversed (i.e. the class we have called "academically oriented students" is class 2 in this model), the results of this model are consistent with the results from the model in the first example. The models in both examples are consistent with the hypothesis that there are two types of students, those who are academically oriented and those who are not. Note that by default, Mplus specifies the model so that the variances of the continuous class indicators (ach9–ach12) are equal across all classes (the Variances estimates listed for class 1 and class 2 are identical). This assumption may or may not be appropriate.
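One way to see the effect of the default equal-variance restriction is to count parameters. The following Python sketch is an illustration only (the function names are hypothetical, not part of Mplus). It assumes binary categorical indicators with one threshold per class, one mean per continuous indicator per class, no within-class covariances, and one class-proportion parameter per class except the last. The counts it produces match the "Number of Free Parameters" lines in the two outputs above, and the standard AIC and BIC formulas reproduce the reported information criteria from the loglikelihood.

```python
import math


def free_parameters(n_classes, n_binary, n_continuous, equal_variances=True):
    """Count free parameters for a mixture model with uncorrelated indicators."""
    thresholds = n_classes * n_binary        # one threshold per binary item per class
    means = n_classes * n_continuous         # one mean per continuous item per class
    if equal_variances:
        variances = n_continuous             # variances held equal across classes
    else:
        variances = n_classes * n_continuous # variances free in every class
    class_means = n_classes - 1              # last class is the reference
    return thresholds + means + variances + class_means


def aic(loglik, n_params):
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n_obs):
    return -2.0 * loglik + n_params * math.log(n_obs)


print(free_parameters(2, 4, 0))                         # 9  (first example)
print(free_parameters(2, 4, 4))                         # 21 (second example)
print(free_parameters(2, 4, 4, equal_variances=False))  # 25 if variances were freed

# Cross-check against the reported information criteria (n = 500).
print(aic(-965.244, 9), bic(-965.244, 9, 500))
print(aic(-3842.353, 21), bic(-3842.353, 21, 500))
```

Freeing the variances in a two-class model would add four parameters, one extra variance for each of ach9–ach12.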

3.0 Saving Class Assignments

In addition to the output file produced by Mplus, it is possible to save class membership information for each case in the dataset to a text file. This text file can later be used with Mplus or read into another statistical package. To do this the savedata: command is added to the input file. The file option gives the name of the file in which the class assignments should be saved (i.e. class.txt). Whenever the file option is used, all of the variables used in the analysis are saved in an external file. The save = cprob; option specifies that the class probabilities should be saved, in addition to the variables used in estimation. Additional variables that were not used in the analysis, but which you wish to include in the saved file, for example, an id variable, can be included by adding the auxiliary option (e.g. auxiliary = id;) to the variable: command.

Title:	Saving class probabilities
Data:	
   file is https://stats.idre.ucla.edu/wp-content/uploads/2016/02/lca.dat;
Variable:	
   names are hm he voc nocol ach9-ach12;
   usevariables are hm he voc nocol ;
   classes = c (2);
   categorical = hm he voc nocol ;
Analysis:	
   type = mixture;
Savedata:
  file is class.txt;
  save = cprob;

The output file for this model contains all of the information contained in the output for the model in the first example, plus additional output associated with the savedata: command. This additional output appears towards the end of the output file, and is shown below.

SAVEDATA INFORMATION

  Order and format of variables

    HM             F10.3
    HE             F10.3
    VOC            F10.3
    NOCOL          F10.3
    CPROB1         F10.3
    CPROB2         F10.3
    C              F10.3

  Save file
    class.txt

  Save file format
    7F10.3

  Save file record length    5000

The additional output associated with the savedata: command lists the variables in the order in which they appear in the saved dataset. Note that the four observed variables used in estimation are listed first, followed by three variables associated with the latent class assignment. The variables CPROB1 and CPROB2 give the posterior probability that each case is in class 1 or class 2, respectively. The variable C contains the class assignment based on these posterior probabilities (i.e. the most likely class). Below the list of variables, the name of the saved file and information on its format are shown.

The file class.txt is a text file that can be read by a large number of programs. The first few lines of this file are shown below. Based on the information in the output file, we know that the first four columns contain each student's values for the variables hm, he, voc, and nocol (in that order), the next two columns contain each student's predicted probability for each of the two classes, and the final column contains the student's class membership.

     1.000     1.000     0.000     1.000     0.963     0.037     1.000
     1.000     0.000     0.000     0.000     0.971     0.029     1.000
     0.000     0.000     1.000     1.000     0.000     1.000     2.000
     1.000     1.000     0.000     0.000     0.999     0.001     1.000
     1.000     1.000     0.000     0.000     0.999     0.001     1.000
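As an illustration of reading the saved file into another program, here is a minimal Python sketch. It is not part of the original example: the column names are taken from the SAVEDATA INFORMATION listing, the five records are the lines shown above pasted into a string, and the code assumes whitespace-separated numeric values with no missing-value flags. To read the real file, one would pass the contents of class.txt to the same function instead of the sample string. The sketch also shows the two ways of counting class membership: summing posterior probabilities versus counting most likely class assignments.

```python
# Column order as listed under SAVEDATA INFORMATION.
COLUMNS = ["hm", "he", "voc", "nocol", "cprob1", "cprob2", "c"]

# The first five records of class.txt, copied from the listing above.
SAMPLE = """\
     1.000     1.000     0.000     1.000     0.963     0.037     1.000
     1.000     0.000     0.000     0.000     0.971     0.029     1.000
     0.000     0.000     1.000     1.000     0.000     1.000     2.000
     1.000     1.000     0.000     0.000     0.999     0.001     1.000
     1.000     1.000     0.000     0.000     0.999     0.001     1.000
"""


def parse_records(text):
    """Parse whitespace-separated records into a list of dicts keyed by COLUMNS."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        values = [float(v) for v in line.split()]
        records.append(dict(zip(COLUMNS, values)))
    return records


records = parse_records(SAMPLE)

# Count based on posterior probabilities: sum CPROB1 over cases.
posterior_count_class1 = sum(r["cprob1"] for r in records)

# Count based on most likely class membership: cases with C equal to 1.
modal_count_class1 = sum(1 for r in records if r["c"] == 1.0)

# C should agree with whichever of CPROB1 / CPROB2 is larger.
c_matches_largest_prob = all(
    r["c"] == (1.0 if r["cprob1"] >= r["cprob2"] else 2.0) for r in records
)

print(posterior_count_class1, modal_count_class1, c_matches_largest_prob)
```

For these five records, four cases are assigned to class 1, while the posterior probabilities for class 1 sum to slightly less than four. The same distinction explains why the two sets of class counts in the main output differ.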

4.0 Plots

Plots based on the estimated model can also be requested by adding the plot: command to the input file. The type option specifies the type of plots desired; in this case, plot3 requests all plots available for this model. The series option gives the variables to be included in the plots. This list can contain either categorical or continuous variables (but not both at the same time). The list of variables in the series option is followed by (*), which requests the default scaling of the x-axis in the plots. For more information on scaling of the x-axis see the Mplus manual.

Title: Categorical and continuous indicators
Data: 
   file is https://stats.idre.ucla.edu/wp-content/uploads/2016/02/lca.dat;
Variable:
   names are hm he voc nocol ach9-ach12;
   classes = C (2);
   categorical = hm he voc nocol ;
Analysis: 
   type = mixture;
Plot:
  type = plot3;
  series = ach9-ach12(*);

From the Graph menu select View graphs. Because the variables we wish to plot are continuous, we select Estimated means; for categorical variables we would select Estimated probabilities. The options under View graphs are somewhat limited for this model. If you were to specify a model in which class membership was predicted by additional variables, a larger variety of graphs would be available.

Image lca_ex2

This graph, sometimes called a profile plot, shows graphically the latent class means given in the MODEL RESULTS section of the output for the second example. By default, the x-axis starts at zero and increases in units of one for each of the observed variables. In our example, this means that the mean for the variable ach9 is shown at 0, followed by ach10 at 1, etc.

The legend tells us that class 1 is shown in red, and class 2 in green. It also gives the proportion of cases in each class; in this case an estimated 74% of students are in class 1, and 26% are in class 2. This information can be found in the output under the heading FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES BASED ON THE ESTIMATED MODEL. Consistent with the means shown in the output for example 2, the plot shows that students in class 1 have lower average scores on all four of the achievement variables (ach9–ach12) than students in class 2.
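The points that make up such a profile plot can be rebuilt from the MODEL RESULTS table. The Python sketch below is only an illustration (it does not use Mplus or any plotting library, and the names are invented for this example). It pairs each estimated class mean from the second example with its default x-axis position, starting at 0 and increasing by 1, and then checks the ordering of the two profiles.

```python
# Continuous indicators, in the order given in the series option.
VARIABLES = ["ach9", "ach10", "ach11", "ach12"]

# Estimated class means copied from MODEL RESULTS for the second example.
CLASS_MEANS = {
    1: [-2.058, -2.061, -0.987, -0.990],
    2: [1.988, 1.971, 0.987, 0.829],
}


def profile_points(means):
    """Pair each mean with its default x position (0, 1, 2, ...)."""
    return list(enumerate(means))


profiles = {k: profile_points(m) for k, m in CLASS_MEANS.items()}

# Print one row per variable: x position, variable name, mean in each class.
for (x, m1), (_, m2), name in zip(profiles[1], profiles[2], VARIABLES):
    print(f"x={x}  {name:<6} class 1: {m1:7.3f}   class 2: {m2:7.3f}")

# True when class 1 lies below class 2 at every point of the profile.
class1_lower_everywhere = all(
    m1 < m2 for m1, m2 in zip(CLASS_MEANS[1], CLASS_MEANS[2])
)
print(class1_lower_everywhere)  # True
```

These (x, mean) pairs could be passed to any plotting tool to draw one line per class.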