Note: This example was done using Mplus version 6.12.
Logistic regression, also called a logit model, is used to model dichotomous outcome variables. In the logit model the log odds of the outcome is modeled as a linear combination of the predictor variables.
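As a minimal sketch of what "log odds as a linear combination" means, the Python snippet below converts between probabilities and log odds and builds a linear predictor. The function names are ours, chosen for illustration; they are not part of Mplus.

```python
import math

def logit(p):
    """Log odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Probability that corresponds to a log-odds value eta."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(intercept, coefs, values):
    """Intercept plus the sum of coefficient * predictor value."""
    return intercept + sum(b * x for b, x in zip(coefs, values))

# A probability of 0.5 corresponds to log odds of 0 (even odds).
even_odds = logit(0.5)
```

In a logit model, `linear_predictor(...)` gives the modeled log odds, and `inv_logit` maps that value back to a probability between 0 and 1.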
Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics and potential follow-up analyses.
Examples
Example 1: Suppose that we are interested in the factors
that influence whether a political candidate wins an election. The
outcome (response) variable is binary (0/1); win or lose.
The predictor variables of interest are the amount of money spent on the campaign, the
amount of time spent campaigning negatively, and whether the candidate is an
incumbent.
Example 2: A researcher is interested in how variables, such as GRE (Graduate Record Exam scores),
GPA (grade
point average) and prestige of the undergraduate institution, affect admission into graduate
school. The outcome variable, admit/don’t admit, is binary.
Description of the data
For our data analysis below, we are going to expand on Example 2 about getting into graduate school. We have generated hypothetical data, which can be obtained by clicking on https://stats.idre.ucla.edu/wp-content/uploads/2016/02/binary.dat. You can store this anywhere you like, but our examples will assume it has been stored in c:\data. (Note that the names of variables should NOT be included at the top of the data file. Instead, the variables are named as part of the variable command.) You may want to do your descriptive statistics in a general-purpose statistics package, such as SAS, Stata or SPSS, because the options for obtaining descriptive statistics are limited in Mplus. Even if you choose to run descriptive statistics in another package, it is a good idea to run a model with type=basic before you do anything else, just to make sure the dataset is being read correctly.
This dataset has data on 400 cases. There is a binary response (outcome, dependent) variable called admit and there are three predictor variables: gre, gpa, and rank. We will treat the variables gre and gpa as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest. The dataset also contains four dummy variables, one for each level of rank, named rank1 to rank4. For example, rank1 is equal to 1 when rank=1, and 0 otherwise. Let's start by running a model with type=basic.
    Data:
      File is c:\data\binary.dat ;
    Variable:
      Names are admit gre gpa rank rank1 rank2 rank3 rank4;
    Analysis:
      Type = basic ;
As we mentioned above, you will want to look at this carefully to be sure that the dataset was read into Mplus correctly. You will want to make sure that you have the correct number of observations, and that the variables all have means that are close to those from the descriptive statistics generated in a general-purpose statistical package. If there are missing values for some or all of the variables, the descriptive statistics generated by Mplus will not match those from a general-purpose statistical package exactly, because by default, Mplus versions 5.0 and later use maximum likelihood based procedures for handling missing values. The main point of running this model is to make sure that the data is being read correctly by Mplus. If the number of cases and variables is correct, and the means are reasonable, then it is probably safe to proceed.
    <output omitted>

    SUMMARY OF ANALYSIS

    Number of groups                                 1
    Number of observations                         400

    <output omitted>

    SAMPLE STATISTICS

         Means
            ADMIT       GRE         GPA         RANK        RANK1
            ________    ________    ________    ________    ________
     1      0.318       587.700     3.390       2.485       0.152

         Means
            RANK2       RANK3       RANK4
            ________    ________    ________
     1      0.378       0.302       0.168
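If you do not have another statistics package at hand, the same sanity check can be sketched in a few lines of Python. The snippet below is an illustration under assumptions: it assumes the data file has no header row and that fields are separated by commas or whitespace, and the two sample rows are made up for the example rather than taken from the real file. The helper names are ours.

```python
NAMES = ["admit", "gre", "gpa", "rank", "rank1", "rank2", "rank3", "rank4"]

def read_rows(lines, names=NAMES):
    """Parse free-format rows (comma or whitespace separated, no header)."""
    rows = []
    for line in lines:
        fields = line.replace(",", " ").split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(names):
            raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
        rows.append(dict(zip(names, map(float, fields))))
    return rows

def column_means(rows):
    """Mean of every variable, to compare with the SAMPLE STATISTICS block."""
    if not rows:
        raise ValueError("no data rows")
    return {name: sum(r[name] for r in rows) / len(rows) for name in rows[0]}

def dummies_match_rank(rows):
    """True if rank1..rank4 equal 1 exactly when rank has that value."""
    return all(r[f"rank{k}"] == float(r["rank"] == k)
               for r in rows for k in (1, 2, 3, 4))

# Two made-up rows, only to show the call pattern.
sample = ["0,380,3.61,3,0,0,1,0", "1 660 3.67 1 1 0 0 0"]
rows = read_rows(sample)
means = column_means(rows)
```

With the real file you would pass an open file object instead of `sample`. The number of rows should be 400, the means should be close to those printed by Mplus, and the means of rank1 through rank4 should sum to 1 because every case belongs to exactly one rank.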
Analysis methods you might consider
Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.
- Logistic regression, the focus of this page.
- Probit regression. Probit analysis will produce results similar to
logistic regression. The choice of probit versus logit depends largely on
individual preferences.
- OLS regression. When used with a binary response variable, this model is known
as a linear probability model and can be used as a way to
describe conditional probabilities. However, the errors (i.e., residuals) from the linear probability model violate the homoskedasticity and
normality of errors assumptions of OLS
regression, resulting in invalid standard errors and hypothesis tests. For
a more thorough discussion of these and other problems with the linear
probability model, see Long (1997, pp. 38-40).
- Two-group discriminant function analysis. A multivariate method for dichotomous outcome variables.
- Hotelling’s T². The 0/1 outcome is turned into the
grouping variable, and the former predictors are turned into outcome
variables. This will produce an overall test of significance but will not
give individual coefficients for each variable, and it is unclear to what
extent each "predictor" is adjusted for the impact of the other
"predictors."
Using the logit model
The Mplus input file for a logistic regression model is shown below. Because the data file contains variables that are not used in the model, the usevariables subcommand is used to list the variables that appear in the model (i.e., admit, gre, gpa, rank1, rank2, and rank3). Note that because Mplus uses the names subcommand to determine the order of variables in the data file, the number and order of variables in the names subcommand should not be changed unless the data file is also changed. The categorical subcommand is used to identify binary and ordinal outcome variables. Only the categorical outcome variable (i.e., admit) is included in the categorical subcommand. Categorical predictor variables should be included as a series of dummy variables (e.g., rank1, rank2, and rank3). Under analysis we have specified estimator=ml, which requests a logit model rather than the default probit model. Finally, in the model command we specify that the outcome (i.e., admit) should be regressed on the predictor variables (i.e., gre, gpa, rank1, rank2, and rank3).
    Data:
      File is c:\data\binary.dat ;
    Variable:
      ! names must match the number and order of columns in the data file
      names = admit gre gpa rank rank1 rank2 rank3 rank4;
      ! only the variables that appear in the model
      usevariables = admit gre gpa rank1 rank2 rank3;
      ! the binary outcome
      categorical = admit;
    Analysis:
      estimator = ml;
    Model:
      admit on gre gpa rank1 rank2 rank3;
    SUMMARY OF ANALYSIS

    Number of groups                                                 1
    Number of observations                                         400

    Number of dependent variables                                    1
    Number of independent variables                                  5
    Number of continuous latent variables                            0

    Observed dependent variables

      Binary and ordered categorical (ordinal)
       ADMIT

    Observed independent variables
       GRE         GPA         RANK1       RANK2       RANK3

    Estimator                                                       ML
    Information matrix                                        OBSERVED
    Optimization Specifications for the Quasi-Newton Algorithm for
    Continuous Outcomes
      Maximum number of iterations                                 100
      Convergence criterion                                  0.100D-05
    Optimization Specifications for the EM Algorithm
      Maximum number of iterations                                 500
      Convergence criteria
        Loglikelihood change                                 0.100D-02
        Relative loglikelihood change                        0.100D-05
        Derivative                                           0.100D-02
    Optimization Specifications for the M step of the EM Algorithm for
    Categorical Latent variables
      Number of M step iterations                                    1
      M step convergence criterion                           0.100D-02
      Basis for M step termination                           ITERATION
    Optimization Specifications for the M step of the EM Algorithm for
    Censored, Binary or Ordered Categorical (Ordinal), Unordered
    Categorical (Nominal) and Count Outcomes
      Number of M step iterations                                    1
      M step convergence criterion                           0.100D-02
      Basis for M step termination                           ITERATION
      Maximum value for logit thresholds                            15
      Minimum value for logit thresholds                           -15
      Minimum expected cell size for chi-square              0.100D-01
    Optimization algorithm                                         EMA
    Integration Specifications
      Type                                                    STANDARD
      Number of integration points                                  15
      Dimensions of numerical integration                            0
      Adaptive quadrature                                           ON
    Link                                                         LOGIT
    Cholesky                                                       OFF
- At the top of the output we see that 400 observations were used.
- From the output we see that the model includes one binary dependent (i.e., outcome) variable and 5 independent (predictor) variables.
- The analysis summary is followed by a block of technical information about the model. We
won’t discuss most of this information, but we will note two things:
  - The estimator, given on the first line of the block, is listed as ML, which is what we intended.
  - The link function, given on the second to the last line of the block, is listed as logit, which is also what we intended.
    Input data file(s)
      C:\data\binary.dat
    Input data format  FREE

    SUMMARY OF CATEGORICAL DATA PROPORTIONS

        ADMIT
          Category 1    0.683
          Category 2    0.317

    THE MODEL ESTIMATION TERMINATED NORMALLY

    TESTS OF MODEL FIT

    Loglikelihood

              H0 Value                        -229.259

    Information Criteria

              Number of Free Parameters              6
              Akaike (AIC)                     470.517
              Bayesian (BIC)                   494.466
              Sample-Size Adjusted BIC         475.428
                (n* = (n + 2) / 24)
- Several measures of model fit are included in the output. The log likelihood (-229.259) can be used in comparisons of nested models, but we won’t show an example of that here.
- The Akaike information criterion (AIC) and the Bayesian information criterion (BIC, sometimes also called the Schwarz criterion) can also be used to compare models. Lower values indicate a better trade-off between fit and the number of parameters.
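The information criteria can be reproduced from the log likelihood, the number of free parameters, and the sample size. The Python sketch below uses the standard AIC and BIC formulas and the n* = (n + 2) / 24 adjustment printed in the output; the function name is ours. Because the log likelihood is rounded to three decimals in the output, the results agree with Mplus only to about the third decimal place.

```python
import math

def information_criteria(loglik, n_params, n_obs):
    """Return (AIC, BIC, sample-size adjusted BIC) from a log likelihood."""
    deviance = -2.0 * loglik
    aic = deviance + 2.0 * n_params
    bic = deviance + n_params * math.log(n_obs)
    n_star = (n_obs + 2.0) / 24.0          # adjustment shown in the Mplus output
    adj_bic = deviance + n_params * math.log(n_star)
    return aic, bic, adj_bic

# Values taken from the output above: H0 loglikelihood, free parameters, cases.
aic, bic, adj_bic = information_criteria(-229.259, 6, 400)
print(round(aic, 2), round(bic, 2), round(adj_bic, 2))
```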
    MODEL RESULTS

                                                        Two-Tailed
                        Estimate       S.E.  Est./S.E.    P-Value

     ADMIT      ON
        GRE                0.002      0.001      2.070      0.038
        GPA                0.804      0.332      2.423      0.015
        RANK1              1.551      0.418      3.713      0.000
        RANK2              0.876      0.367      2.389      0.017
        RANK3              0.211      0.393      0.538      0.591

     Thresholds
        ADMIT$1            5.541      1.138      4.869      0.000
- The section titled MODEL RESULTS includes the coefficients (labeled Estimate),
their standard errors, the ratio of each estimate to its standard error (i.e., the z-score,
labeled Est./S.E.), and the associated p-values. Both gre and gpa
are statistically significant, as are the terms for rank=1 and rank=2 (versus
the omitted category rank=4). The logistic regression coefficients give the change in the log odds of the
outcome for a one unit increase in the predictor variable.
- For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.
- For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.
- The coefficients for the categories of rank have a slightly different interpretation. For example, having attended an undergraduate institution with a rank of 1, versus an institution with a rank of 4 (the omitted category), increases the log odds of admission by 1.551.
- Below the coefficients for each of the predictor variables, under the heading Thresholds, is the threshold for the model (sometimes also called a cutpoint). Mplus reports a threshold in place of the intercept. The two are the same except that they have opposite signs (so the intercept for this model would be -5.541). For more information on the differences between intercepts and thresholds, please see http://www.stata.com/support/faqs/stat/oprobit.html.
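To see how the threshold and coefficients combine, the Python sketch below computes a predicted probability of admission for one hypothetical applicant (gre = 600, gpa = 3.5, rank = 2; these values are our own illustration, not a case from the data). It uses the rounded estimates printed above with the threshold's sign flipped to give the intercept. Because the gre coefficient is rounded to 0.002 in the output, the result is only approximate.

```python
import math

# Intercept is the negative of the reported threshold ADMIT$1 = 5.541.
INTERCEPT = -5.541
COEFS = {"gre": 0.002, "gpa": 0.804, "rank1": 1.551, "rank2": 0.876, "rank3": 0.211}

def predicted_probability(applicant):
    """Return (log odds, probability) for a dict of predictor values."""
    eta = INTERCEPT + sum(COEFS[name] * applicant.get(name, 0.0) for name in COEFS)
    return eta, 1.0 / (1.0 + math.exp(-eta))

# Hypothetical applicant from a rank 2 institution (rank4 is the omitted category).
applicant = {"gre": 600, "gpa": 3.5, "rank1": 0, "rank2": 1, "rank3": 0}
log_odds, prob = predicted_probability(applicant)
```

For this applicant the log odds are about -0.65, which corresponds to a predicted probability of admission of roughly 0.34.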
    LOGISTIC REGRESSION ODDS RATIO RESULTS

     ADMIT      ON
        GRE                1.002
        GPA                2.235
        RANK1              4.718
        RANK2              2.401
        RANK3              1.235
- Mplus also gives the model results as odds ratios. An odds ratio is the exponentiated coefficient, and can be interpreted as the multiplicative change in the odds for a one unit change in the predictor variable. For example, for a one unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.24. For more information on interpreting odds ratios see our FAQ page: How do I interpret odds ratios in logistic regression?
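The relationship between coefficients and odds ratios can be checked by exponentiating the printed estimates, as in the Python sketch below. Because the estimates are rounded to three decimals in the output, the exponentiated values can differ from the odds ratios Mplus reports in the third decimal place (for example, for rank1).

```python
import math

# Estimates and odds ratios as printed in the Mplus output above.
estimates = {"gre": 0.002, "gpa": 0.804, "rank1": 1.551, "rank2": 0.876, "rank3": 0.211}
reported_or = {"gre": 1.002, "gpa": 2.235, "rank1": 4.718, "rank2": 2.401, "rank3": 1.235}

# An odds ratio is exp(coefficient).
odds_ratios = {name: math.exp(b) for name, b in estimates.items()}

for name, value in odds_ratios.items():
    print(f"{name:6s} exp(b) = {value:.3f}   reported = {reported_or[name]:.3f}")
```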
We can also test that the coefficients for rank1, rank2, and rank3 are all equal to zero using the model test command. This type of test can also be described as an overall test for the effect of rank. There are multiple ways to test this type of hypothesis; the model test command requests a Wald test. The Mplus input file shown below is similar to the first model, except that the coefficients for rank1, rank2, and rank3 are assigned the names r1, r2, and r3, respectively. In the model test command, these coefficient names (i.e., r1, r2 and r3) are used to test jointly that all three coefficients are equal to 0.
    Data:
      File is C:\data\binary.dat ;
    Variable:
      names = admit gre gpa rank rank1 rank2 rank3 rank4;
      categorical = admit;
      usevariables = admit gre gpa rank1 rank2 rank3;
    Analysis:
      estimator = ML;
    Model:
      ! labels in parentheses name the coefficients for use in model test
      admit on gre gpa
        rank1 (r1)
        rank2 (r2)
        rank3 (r3);
    Model test:
      r1 = 0;
      r2 = 0;
      r3 = 0;
The majority of the output from this model is the same as the first model, so we will only show part of the output generated by the model test command.
    TESTS OF MODEL FIT

    Wald Test of Parameter Constraints

              Value                             20.895
              Degrees of Freedom                     3
              P-Value                           0.0001

    Loglikelihood

              H0 Value                        -229.259
The portion of the output associated with the model test command is labeled “Wald Test of Parameter Constraints” and appears under the heading TESTS OF MODEL FIT just before the likelihood for the entire model is printed. The test statistic is 20.895, with three degrees of freedom (one for each of the parameters tested), with an associated p-value of 0.0001. This indicates that the overall effect of rank is statistically significant.
We can also use the model test command to make pairwise comparisons among the terms for rank. The Mplus input below tests the hypothesis that the coefficient for rank2 (i.e., rank=2) is equal to the coefficient for rank3 (i.e., rank=3).
    Data:
      File is C:\data\binary.dat ;
    Variable:
      names = admit gre gpa rank rank1 rank2 rank3 rank4;
      categorical = admit;
      usevariables = admit gre gpa rank1 rank2 rank3;
    Analysis:
      estimator = ml;
    Model:
      admit on gre gpa
        rank1 (r1)
        rank2 (r2)
        rank3 (r3);
    Model test:
      r2 = r3;
Below is the output associated with the model test command (as before, most of the model output is omitted).
    MODEL FIT INFORMATION

    Wald Test of Parameter Constraints

              Value                              5.505
              Degrees of Freedom                     1
              P-Value                           0.0190
The test statistic and associated p-value indicate that the coefficient for rank2 (i.e., rank=2) is significantly different from the coefficient for rank3 (rank=3).
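The p-values for the two Wald tests above come from the upper tail of a chi-square distribution with the stated degrees of freedom. As a rough check, the Python sketch below evaluates that tail using closed-form expressions that hold for 1 and 3 degrees of freedom only; the function name is ours, and a general-purpose routine from a statistics library would be the usual choice for other degrees of freedom.

```python
import math

def chi2_upper_tail(x, df):
    """Upper-tail probability of a chi-square variate, for df = 1 or df = 3 only."""
    if x < 0:
        raise ValueError("test statistic must be non-negative")
    base = math.erfc(math.sqrt(x / 2.0))   # exact tail for df = 1
    if df == 1:
        return base
    if df == 3:
        # exact tail for df = 3: df = 1 tail plus a correction term
        return base + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    raise ValueError("this sketch only covers df = 1 and df = 3")

p_overall = chi2_upper_tail(20.895, 3)   # joint test of r1 = r2 = r3 = 0
p_pairwise = chi2_upper_tail(5.505, 1)   # test of r2 = r3
print(f"{p_overall:.4f} {p_pairwise:.4f}")
```

Both values agree with the p-values reported by Mplus after rounding to four decimal places.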
Things to consider
- Empty cells or small cells: You should check for empty or small
cells by doing a crosstab between categorical predictors and the outcome variable. If a cell has very few cases (a small cell), the model may become unstable or it might not run at all.
- Separation or quasi-separation (also called perfect prediction): This is a condition in which the outcome does not vary at some levels of the independent variables. See our page FAQ: What is complete or quasi-complete separation in logistic/probit regression and how do we deal with them? for information on models with perfect prediction.
- Sample size: Both logit and probit models require more cases than OLS regression because they use maximum likelihood estimation techniques. It is sometimes possible to estimate models for binary outcomes in datasets with only a small number of cases using exact logistic regression (for example, with the exlogistic command in Stata). For more information see our data analysis example for exact logistic regression. It is also important to keep in mind that when the outcome is rare, even if the overall dataset is large, it can be difficult to estimate a logit model.
- Pseudo-R-squared: Many different measures of pseudo-R-squared exist. They all attempt to provide information similar to that provided by R-squared in OLS regression; however, none of them can be interpreted exactly as R-squared in OLS regression is interpreted. For a discussion of various pseudo-R-squareds see Long and Freese (2006) or our FAQ page What are pseudo R-squareds?
- Diagnostics: The diagnostics for logistic regression are different from those for OLS regression. For a discussion of model diagnostics for logistic regression, see Hosmer and Lemeshow (2000, Chapter 5). Note that diagnostics done for logistic regression are similar to those done for probit regression.
- Clustered data: Sometimes observations are clustered into groups (e.g., people within families, students within classrooms). In such cases, you may want to consider using either a multilevel model or the cluster option of the variable command.
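The check for empty or small cells described in the first item above can be sketched in Python as a simple crosstab. The data values below are made up for illustration, the helper names are ours, and the default cutoff of 5 cases is an arbitrary assumption rather than a rule from this page.

```python
from collections import Counter

def crosstab(predictor, outcome):
    """Count cases in each (predictor level, outcome level) cell."""
    return Counter(zip(predictor, outcome))

def small_cells(predictor, outcome, min_count=5):
    """List (predictor level, outcome level, count) for cells below min_count."""
    counts = crosstab(predictor, outcome)
    return [(p, o, counts[(p, o)])
            for p in sorted(set(predictor))
            for o in sorted(set(outcome))
            if counts[(p, o)] < min_count]

# Made-up values of rank and admit for eight cases.
rank = [1, 1, 2, 2, 2, 3, 4, 4]
admit = [1, 1, 0, 1, 0, 0, 0, 0]

# min_count=1 flags only the empty cells.
empty = small_cells(rank, admit, min_count=1)
```

In this toy example, rank 1 has no non-admitted cases and ranks 3 and 4 have no admitted cases, which is the kind of pattern that can lead to separation or an unstable model.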
References
Hosmer, D. & Lemeshow, S. (2000). Applied Logistic Regression (Second Edition). New York: John Wiley & Sons, Inc.
Long, J. Scott (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
See also
- Mplus Annotated Output: Logit Regression