How can I estimate a multiple group latent class model (knownclass)?

This page was created using Mplus version 5.2. The output and/or syntax may be different for other versions of Mplus.

Frequently, we wish to compare the structure of measurement models across groups (e.g. men and women). When the latent variable is categorical, the model is often referred to as a latent class analysis (LCA); more generally, these models are sometimes referred to as mixture models. Below we show how to estimate an LCA with either continuous or categorical class indicators (it is also possible to estimate a model with both categorical and continuous class indicators). We will start with a latent class model with continuous indicators, because these models have slightly simpler syntax. In Mplus, the knownclass option is used to estimate a latent class model with multiple groups. This option takes its name from the fact that membership in the grouping variable (e.g. gender) is known (i.e. observed).

In the examples below, group is the known or observed class, while c is the latent variable estimated using the observed items. There are three continuous observed items, named a1, a2, and a3. You can download the example dataset here: https://stats.idre.ucla.edu/wp-content/uploads/2016/02/mult_grp_lca_con.dat .

A single group latent class model

As a starting place, below we show the syntax for a single group latent class model. In this model, the continuous variables a1, a2, and a3 are used to form a latent variable c with two classes. The file option of the data: command gives the name of the file in which the dataset is stored. In the variable: command, the names option gives the names of the variables in the dataset. The usevariables option gives the names of the variables used to estimate the model. The classes option gives the name of the categorical latent variable (c), followed by the number of classes in parentheses, that is, (2) for a two class latent variable. In the analysis: command, the type = mixture option indicates that we wish to estimate a mixture model.

data:
    file = mult_grp_lca_con.dat;  ! local copy of the dataset downloaded above
variable:
    names = group a1 a2 a3;
    usevariables = a1 a2 a3;
    classes = c(2);
analysis:
    type = mixture;

Model allowing differences in item means across groups, fixing class probabilities and item variances across groups and classes

In this model, we add the observed grouping variable, group, in order to estimate a multiple group mixture model. The classes option of the variable: command now lists two categorical latent variables (c and g), each followed by its number of classes in parentheses. The knownclass option specifies that the classes of g are defined by the observed variable group. The observed values of group that define each class are listed in parentheses after the name of the class variable (i.e. g), so that group = 0 defines the first class of g (g=1) and group = 1 defines the second (g=2).

data:
   file = mult_grp_lca_con.dat;  ! local copy of the downloaded dataset
variable:
   names = group a1 a2 a3;
   usevariables = a1 a2 a3;
   classes = c(2) g(2);
   knownclass = g (group=0 group=1);
analysis:
   type = mixture;

Model allowing differences in item means and class probabilities across groups, with item variances fixed across groups and classes

In this model we use g (i.e. the grouping variable) to predict the probability of class membership in c, meaning that the probability of being in a given class is allowed to vary by the observed variable group. First, we have changed the classes option so that the known class variable (i.e. g) is listed first. This is necessary if we want to regress c on g to allow the class probabilities to vary by level of group. We have also added the model: command. In the overall section of the model (under %overall%), we have added c on g, a regression in which the known class variable (g) predicts the probability that a given case will be in each of the classes of the latent variable c. This allows the class probabilities for c to vary by g.

data:
   file = mult_grp_lca_con.dat;  ! local copy of the downloaded dataset
variable:
   names = group a1 a2 a3;
   usevariables = a1 a2 a3;
   classes = g(2) c(2);
   knownclass = g (group=0 group=1);
analysis:
   type = mixture;
model:
   %overall%
   c on g;

Another model allowing differences in item means and class probabilities across groups, fixing item variances across groups and classes

The model estimated in this example is identical to the previous model, but is specified using different syntax. In the model below we have explicitly listed the item means in the input file, so that we can fix or free individual parameters across groups, which allows us to test for differences between item means in the two groups. We can confirm that the two models are the same by comparing their log likelihoods, which will match if we have indeed run the same model.

One thing to note is that when we change to this syntax, the order of the classes may change. This change is substantively and mathematically unimportant (we are, after all, still running the same model), but it can be hard to keep track of which classes correspond across the groups if they are not in the same order. To avoid confusion, we can use the parameter estimates from the above model as starting values for the current model. We could use all of the item means, but this turns out to be unnecessary; a few starting values are usually sufficient to put the classes in the desired order.

Below is the Mplus output from the model immediately above this one (i.e. the model that is identical, but has different syntax). Here we show the item means for all three variables for group = 0 (g=1) and c=1 (Latent Class Pattern 1 1), followed by the item means for group = 1 (g=2) and c=1 (Latent Class Pattern 2 1).

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

Latent Class Pattern 1 1

 Means
    A1                 0.247      0.058      4.227      0.000
    A2                 2.142      0.091     23.617      0.000
    A3                -0.960      0.067    -14.223      0.000

<output omitted>

Latent Class Pattern 2 1

 Means
    A1                 0.978      0.064     15.260      0.000
    A2                 0.173      0.042      4.120      0.000
    A3                -1.541      0.065    -23.570      0.000

The item means for class 1 (of the latent variable c) in each group (g) are shown above. In the syntax below, we use these parameter estimates as starting values. Most of the syntax shown below is the same as that from the previous models; the new syntax all appears at the end of the input. Below the overall portion of the model (%overall%) we see specific statements for each combination of g and c. In this case, both g and c have two classes, so there are four sections of the model.

The first section, for g=1 (group=0) and c=1, is indicated by %g#1.c#1%. The observed (knownclass) variable must come first; if we used %c#1.g#1%, Mplus would issue an error message and the model would not run. Below this label we see the syntax describing the structure of the model, [a1*0.247 a2*2.142 a3*-0.960], which lists the means (indicated by the square brackets) of the items used to form the latent variable c (i.e. a1, a2, and a3). In addition to listing the parameters, we assign each parameter a starting value based on the above output. For example, a1*0.247 sets the starting value of the mean of a1 to 0.247 in class c=1 for group g=1. For c=2 in g=1 (under the label %g#1.c#2%), we specify the means of the items a1 through a3 but do not assign starting values, because the set of starting values above is sufficient to set the class ordering for g=1. For c=1 in g=2 (labeled %g#2.c#1%), we again include starting values, so that the classes for the second group (g=2) will be in the desired order.

data:
    file = mult_grp_lca_con.dat;  ! local copy of the downloaded dataset
variable:
    names = group a1 a2 a3;
    usevariables = a1 a2 a3;
    classes = g(2) c(2); 
    knownclass = g (group=0 group=1);
analysis:
    type = mixture;
model:
    %overall%
    c on g;

    %g#1.c#1%
    [a1*0.247 a2*2.142 a3*-0.960];

    %g#1.c#2%
    [a1 a2 a3];

    %g#2.c#1%
    [a1*0.978 a2*0.173 a3*-1.541];

    %g#2.c#2%
    [a1 a2 a3];

Testing for differences in parameter estimates

Once we have estimated a model in which the item means are allowed to vary across groups, we may want to test whether the differences in item means between the two groups are significant. One method of doing this is to use the model test: command to perform a Wald test. Below we test whether the mean of a1 in class 1 (c=1) differs across the two groups. To do this we have given each of the parameters in question a name. In Mplus, parameter names must appear at the end of a line and in parentheses. For example, [a1*0.247] (p1), used below, gives the mean of a1 for class 1 (c=1) in group 1 (g=1) the name p1. We have also given the mean of a1 for class 1 (c=1) in group 2 (g=2) a name, p2. Note that these names are arbitrary, except that they must begin with a letter and be enclosed in parentheses. Finally, at the bottom of the input, we use the model test: command, followed by the test we want to perform, in this case whether p1 = p2.

data:
    file = mult_grp_lca_con.dat;  ! local copy of the downloaded dataset
variable:
    names = group a1 a2 a3;
    usevariables = a1 a2 a3;
    classes = g(2) c(2); 
    knownclass = g (group=0 group=1);
analysis:
    type = mixture;
model:
    %overall%
    c on g;

    %g#1.c#1%
    [a1*0.247] (p1);
    [a2*2.142 a3*-0.960];

    %g#1.c#2%
    [a1 a2 a3];

    %g#2.c#1%
    [a1*0.978] (p2);
    [a2*0.173 a3*-1.541];

    %g#2.c#2%
    [a1 a2 a3];
    
    model test:
    p1 = p2;

The output from this model will have an additional section, shown below. In this case the test is significant (p < 0.0001), so we can reject the null hypothesis that p1 = p2. If we wanted to constrain these two parameters to equality (either because the difference was non-significant, or for other reasons), we could do so either by giving the two parameters the same name, or by replacing the parameter names with a number that is the same for all parameters that are to be constrained to equality. For example, we could replace both (p1) and (p2) above with (1).

Wald Test of Parameter Constraints

          Value                             71.900
          Degrees of Freedom                     1
          P-Value                           0.0000
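
The equality constraint described above can be sketched as follows. This is a minimal, untested sketch under stated assumptions: the dataset is assumed to be saved locally as mult_grp_lca_con.dat, the labels (p1) and (p2) are replaced with the shared number (1), and the model test: command is dropped because the two means are no longer separate parameters. The starting value for a1 is also omitted in both groups, on the assumption that the remaining starting values are enough to keep the classes in the same order.

data:
    file = mult_grp_lca_con.dat;   ! assumed local file name
variable:
    names = group a1 a2 a3;
    usevariables = a1 a2 a3;
    classes = g(2) c(2);
    knownclass = g (group=0 group=1);
analysis:
    type = mixture;
model:
    %overall%
    c on g;

    %g#1.c#1%
    [a1] (1);               ! shared label (1): mean of a1 held equal
    [a2*2.142 a3*-0.960];

    %g#1.c#2%
    [a1 a2 a3];

    %g#2.c#1%
    [a1] (1);               ! same label as in %g#1.c#1%
    [a2*0.173 a3*-1.541];

    %g#2.c#2%
    [a1 a2 a3];

Because this model is nested in the unconstrained one, its log likelihood should be no higher than that of the model with separate means for a1.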

Model allowing differences in item variances across groups, item means and class probabilities fixed across groups

In this model, we have modified the model: command so that the item means and class probabilities are fixed across groups (g), but the item variances are allowed to vary by group. Under model c: we describe the structure of the latent variable c. Under %c#1%, the model for class 1 of the latent variable c is defined by the means of the variables a1, a2, and a3, which is indicated by listing the variable names within square brackets (i.e. [a1 a2 a3]). Under %c#2% we describe the structure of c=2 in the same manner as c=1. Under model g: we explicitly show the variance structure of the model. Under %g#1% we see the names of the variables that make up the latent variable c (i.e. a1, a2, and a3). The name of a variable listed without other syntax (e.g. on, with, or square brackets) refers to the variance of that variable, so listing the variables separately under each class of g indicates that the variances of a1, a2, and a3 should be allowed to vary by group. This syntax is repeated for the second group (g=2), allowing the variances for this group to differ from those in the first.

data:
    file is mult_grp_lca_con.dat;  ! local copy of the downloaded dataset
variable:
    names are group a1 a2 a3;
    classes = g(2) c(2);
    knownclass = g (group=0 group=1);
analysis:
    type = mixture;
model:
model c:
    %c#1%
    [a1 a2 a3];
    %c#2%
    [a1 a2 a3];
model g:
    %g#1%
    a1 a2 a3;
    %g#2%
    a1 a2 a3;

Working with categorical observed variables

In the examples below, instead of the observed variables being continuous, as above, the observed items are categorical. As above, the variable group is the known or observed class variable. The latent variable c is estimated using the categorical observed items named i1, i2, and i3. The variables i1 and i2 are dichotomous, while the variable i3 is ordinal with three categories. You can download the example dataset here: https://stats.idre.ucla.edu/wp-content/uploads/2016/02/mult_grp_lca_cat.dat.

A single group latent class model

As a starting place, below we show the syntax for a single group latent class model. In this model, the categorical variables i1, i2, and i3 are used to form a latent variable c with two classes. Most of this input file is the same as the single group latent class model with continuous indicators. The file option of the data: command gives the name of the file in which the dataset is stored. In the variable: command, the names option gives the names of the variables in the dataset. The usevariables option gives the names of the variables used to estimate the model, because not all variables in the dataset are used in the model. The classes option gives the name of the categorical latent variable (c), followed by the number of classes in parentheses, that is, (2) for a two class latent variable. In the analysis: command, the type = mixture option indicates that we wish to estimate a mixture model. The difference between this model and the single group model with continuous indicators is that, in the variable: command, the categorical option lists the names of the categorical variables, in this case i1, i2, and i3.

data:
    file = mult_grp_lca_cat.dat;  ! local copy of the dataset downloaded above
variable:
    names = group i1 i2 i3;
    usevariables = i1 i2 i3;
    classes = c(2);
    categorical = i1 i2 i3;
analysis:
    type = mixture;

Model allowing differences in item thresholds across groups, fixing class probabilities across groups and classes

data:
    file is mult_grp_lca_cat.dat;  ! local copy of the downloaded dataset
variable:
    names = group i1 i2 i3;
    usevariables = i1 i2 i3;
    classes = c(2) g(2);
    knownclass = g (group=0 group=1);
    categorical = i1 i2 i3;
analysis:
    type = mixture;

Model allowing differences in item thresholds and class probabilities across groups

In this model we use g (i.e. the grouping variable) to predict the probability of class membership in c, meaning that the probability of being in a given class is allowed to vary by the observed variable group. First, we have changed the classes option so that the known class variable (i.e. g) is listed first. This is necessary if we want to regress c on g to allow the class probabilities to vary by level of group. We have also added the model: command. In the overall section of the model (under %overall%), we have included c on g, a regression in which the known class variable (g) predicts the probability that a given case will be in each of the classes of the latent variable c. This allows the class probabilities for c to vary by g.

data:
    file is mult_grp_lca_cat.dat;  ! local copy of the downloaded dataset
variable:
    names are group i1 i2 i3;
    usevariables are i1 i2 i3;
    classes = g(2) c(2); 
    knownclass = g (group=0 group=1);
    categorical = i1 i2 i3;
analysis:
    type = mixture;
model:
    %overall%
    c on g;

Another model allowing differences in item thresholds and class probabilities across groups

The model estimated in this example is identical to the previous model, but uses different input. In the input below we have explicitly listed the item thresholds, so that we can fix or free individual parameters across groups, which allows us to test for differences between item thresholds in the two groups. We can confirm that the two models are the same by comparing their log likelihoods, which will match if we have indeed run the same model.

One thing to note is that when we change to this syntax, the order of the classes may change. This change is substantively unimportant (we are, after all, still running the same model), but it can be hard to keep track of which classes correspond across the groups if they are not in the same order. To avoid confusion, we can use the parameter estimates from the above model as starting values for the current model. We could use all of the item thresholds, but a few starting values are usually sufficient to put the classes in the desired order.

Below is the Mplus output from the model immediately above this one (i.e. the model that is identical, but has different syntax). Here we show the item thresholds for all three variables for group = 0 (g=1) and c=1 (Latent Class Pattern 1 1), followed by the item thresholds for group = 1 (g=2) and c=1 (Latent Class Pattern 2 1). Note that there are four thresholds: one each for i1 and i2 (which have two categories), and two for i3, which has three ordered categories. Each threshold is denoted by the variable name, followed by a dollar sign and a number indicating the order of the threshold. For example, i3$1 is the first threshold for i3, and i3$2 is the second.

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

Latent Class Pattern 1 1

 Thresholds
    I1$1              -0.307      0.689     -0.446      0.656
    I2$1               1.976      3.089      0.640      0.522
    I3$1              -1.467      1.150     -1.276      0.202
    I3$2               1.097      0.304      3.604      0.000
    
<output omitted>

Latent Class Pattern 2 1

 Thresholds
    I1$1               1.501      1.453      1.033      0.301
    I2$1              -1.870      3.996     -0.468      0.640
    I3$1              -0.139      0.294     -0.474      0.635
    I3$2               4.358      9.937      0.439      0.661

The item thresholds for class 1 (of the latent variable c) in each group (g) are shown above. In the syntax below, we use these parameter estimates as starting values. Most of the syntax shown below is the same as that from the previous models; the new syntax all appears at the end of the input. Below the overall portion of the model (%overall%) we see the portions of the model: command for each combination of g and c. In this case, both g and c have two classes, so there are four sections of the model.

The first section, for g=1 (group=0) and c=1, is indicated by %g#1.c#1%. The observed (knownclass) variable must come first; if we used %c#1.g#1%, Mplus would issue an error message and the model would not run. Below this label we see the syntax describing the structure of the model, [i1$1*-0.307 i2$1*1.976 i3$1*-1.467 i3$2*1.097], which lists each of the item thresholds for the variables used to form the latent variable c (i.e. i1, i2, and i3). In addition to listing the parameters, we assign each parameter a starting value based on the above output. For example, i1$1*-0.307 sets the starting value of the threshold of i1 to -0.307 in class c=1 for group g=1. For c=2 in g=1 (under the label %g#1.c#2%), we specify the thresholds of the items i1 through i3 but do not assign starting values, because the set of starting values above is sufficient to set the class ordering for g=1. For c=1 in g=2 (labeled %g#2.c#1%), we again include starting values, so that the classes for the second group (g=2) will be in the desired order.

data:
    file = mult_grp_lca_cat.dat;  ! local copy of the downloaded dataset
variable:
    names = group i1 i2 i3;
    usevariables = i1 i2 i3;
    classes = g(2) c(2);
    knownclass = g (group=0 group=1);
    categorical = i1 i2 i3;
analysis:
    type = mixture;
model:
    %overall%
    c on g;
    
    %g#1.c#1%
    [i1$1*-0.307 i2$1*1.976 i3$1*-1.467 i3$2*1.097];

    %g#1.c#2%
    [i1$1 i2$1 i3$1 i3$2];

    %g#2.c#1%
    [i1$1*1.501 i2$1*-1.870 i3$1*-0.139 i3$2*4.358];

    %g#2.c#2%
    [i1$1 i2$1 i3$1 i3$2];

Testing for differences in parameter estimates

Once we have estimated a model in which the item thresholds are allowed to vary across groups, we may want to test whether the differences in item thresholds between the two groups are significant. One method of doing so is to use the model test: command to perform a Wald test. Below we test whether the first threshold of i1 in class 1 (c=1) differs across the two groups. To do this we have given each of the parameters in question a name. In Mplus, parameter names must appear at the end of a line and in parentheses. For example, [i1$1*-0.307] (p1), used below, gives the threshold of i1 for class 1 (c=1) in group 1 (g=1) the name p1. We have also given the threshold of i1 for class 1 (c=1) in group 2 (g=2) a name, p2. Note that these names are arbitrary, except that they must begin with a letter and be enclosed in parentheses. Finally, at the bottom of the input, we see the model test: command, followed by the test we want to perform, in this case whether p1 = p2.

data:
    file = mult_grp_lca_cat.dat;  ! local copy of the downloaded dataset
variable:
    names = group i1 i2 i3;
    usevariables = i1 i2 i3;
    classes = g(2) c(2);
    knownclass = g (group=0 group=1);
    categorical = i1 i2 i3;
analysis:
    type = mixture;
model:
    %overall%
    c on g;
    
    %g#1.c#1%
    [i1$1*-0.307] (p1);
    [i2$1*1.976 i3$1*-1.467 i3$2*1.097];

    %g#1.c#2%
    [i1$1 i2$1 i3$1 i3$2];

    %g#2.c#1%
    [i1$1*1.501] (p2);
    [i2$1*-1.870 i3$1*-0.139 i3$2*4.358];

    %g#2.c#2%
    [i1$1 i2$1 i3$1 i3$2];
    
    model test:
    p1 = p2;

The output from this model will have an additional section, shown below. In this case we fail to reject the null hypothesis that p1 = p2 (p = 0.2605). If we wanted to constrain these two parameters to equality (either because the difference was non-significant, or for other reasons), we could do so either by giving the two parameters the same name, or by replacing the parameter names with a number that is the same for all parameters that are to be constrained to equality. For example, we could replace both (p1) and (p2) above with (1).

Wald Test of Parameter Constraints

          Value                              1.266
          Degrees of Freedom                     1
          P-Value                           0.2605
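
The model test: command is not limited to a single restriction. As a hedged sketch, assuming that Mplus tests all restrictions listed under model test: jointly (so that two restrictions would give a Wald test with two degrees of freedom), the model: portion below tests whether both thresholds of i3 in class 1 (c=1) are equal across the two groups. The labels t1 through t4 are arbitrary names chosen for this illustration, the input has not been run, and the data:, variable:, and analysis: commands are assumed to be the same as in the example above.

model:
    %overall%
    c on g;

    %g#1.c#1%
    [i1$1*-0.307 i2$1*1.976];
    [i3$1*-1.467] (t1);     ! first threshold of i3, group 1
    [i3$2*1.097] (t2);      ! second threshold of i3, group 1

    %g#1.c#2%
    [i1$1 i2$1 i3$1 i3$2];

    %g#2.c#1%
    [i1$1*1.501 i2$1*-1.870];
    [i3$1*-0.139] (t3);     ! first threshold of i3, group 2
    [i3$2*4.358] (t4);      ! second threshold of i3, group 2

    %g#2.c#2%
    [i1$1 i2$1 i3$1 i3$2];

model test:
    t1 = t3;                ! both restrictions are assumed to be
    t2 = t4;                ! tested together in one Wald test

Each labeled parameter is placed on its own line so that its name appears at the end of the line, as required for parameter names.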