Mplus version 8 was used for these examples. All the files for this portion of this seminar can be downloaded here.
Mplus has a rich collection of regression models, including ordinary least squares (OLS) regression, probit regression, logistic regression, ordered probit and logit regressions, multinomial probit and logit regressions, Poisson regression, negative binomial regression, zero-inflated Poisson and negative binomial regressions, censored regression, and censored-inflated regression.
The keyword for regression models is on, as in response variable regressed on predictor1, predictor2, etc. In context, a regression command looks like this:
response_var on var1 var2;
For most of the examples we will be using the hsbdemo.dat dataset. It contains a nice collection of continuous, binary, ordered, categorical and count variables. You can download the data by clicking here. In this example we will boldface the line that specifies the regression analysis.
Ordinary least squares (OLS) regression
In our first example we will use a standardized test, write, as the response variable and the continuous variables read and math as predictors along with the binary predictor female. We begin by showing the input file which we called hsbreg.inp.
title: OLS regression
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are write female read math;
model:
  write on female read math;
Next, we will take a look at the output file, hsbreg.out.
INPUT READING TERMINATED NORMALLY

OLS regression

SUMMARY OF ANALYSIS

Number of groups                                                 1
Number of observations                                         200

Number of dependent variables                                    1
Number of independent variables                                  3
Number of continuous latent variables                            0

Observed dependent variables

  Continuous
   WRITE

Observed independent variables
   FEMALE      READ        MATH

Estimator                                                       ML
Information matrix                                        OBSERVED
Maximum number of iterations                                  1000
Convergence criterion                                    0.500D-04
Maximum number of steepest descent iterations                   20

Input data file(s)
  hsbdemo.dat

Input data format  FREE

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Chi-Square Test of Model Fit

          Value                              0.000
          Degrees of Freedom                     0
          P-Value                           0.0000

Chi-Square Test of Model Fit for the Baseline Model

          Value                            149.335
          Degrees of Freedom                     3
          P-Value                           0.0000

CFI/TLI

          CFI                                1.000
          TLI                                1.000

Loglikelihood

          H0 Value                       -2224.303
          H1 Value                       -2224.303

Information Criteria

          Number of Free Parameters              5
          Akaike (AIC)                    4458.607
          Bayesian (BIC)                  4475.098
          Sample-Size Adjusted BIC        4459.258
            (n* = (n + 2) / 24)

RMSEA (Root Mean Square Error Of Approximation)

          Estimate                           0.000
          90 Percent C.I.                    0.000  0.000
          Probability RMSEA <= .05           0.000

SRMR (Standardized Root Mean Square Residual)

          Value                              0.000

MODEL RESULTS

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

 WRITE    ON
    FEMALE             5.443      0.926      5.881      0.000
    READ               0.325      0.060      5.409      0.000
    MATH               0.397      0.066      6.047      0.000

 Intercepts
    WRITE             11.896      2.834      4.197      0.000

 Residual Variances
    WRITE             42.367      4.237     10.000      0.000
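As a reading aid, the estimates under MODEL RESULTS can be written out as a fitted equation. This only restates the reported coefficients; the example student (female = 1, read = 50, math = 50) is hypothetical and chosen purely for illustration.

```latex
\begin{align*}
\widehat{\text{write}} &= 11.896 + 5.443\,\text{female} + 0.325\,\text{read} + 0.397\,\text{math} \\
\widehat{\text{write}}\,\big|_{\text{female}=1,\ \text{read}=50,\ \text{math}=50}
  &= 11.896 + 5.443 + 0.325(50) + 0.397(50) = 53.439
\end{align*}
```

The Est./S.E. column is simply the estimate divided by its standard error. For female, 5.443 / 0.926 is roughly 5.88, which matches the reported 5.881 up to rounding of the displayed values.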
Regression with missing data
For our next example we will use a dataset, hsbmis2.dat, that has observations with missing data. You can download the dataset by clicking here. Starting with Mplus 5, the default analysis type allows for analysis of missing data by full information maximum likelihood (FIML). The FIML approach uses all of the available information in the data and yields unbiased parameter estimates as long as the data are missing at random (MAR) or missing completely at random (MCAR). It is worth noting that this missing data approach is available for all of the different regression models, not just for OLS regression. By default, however, Mplus does not allow missing values on exogenous variables (x-variables); cases with a missing value on any x-variable are dropped, and Mplus issues a warning such as the following:
*** WARNING
  Data set contains cases with missing on x-variables.
  These cases were not included in the analysis.
  Number of cases with missing on x-variables:  124
   1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS
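The input that produced this warning is not shown. A plausible sketch, assuming it matches the missing-data input given later in this section except that the x-variable means are not mentioned in the model command, would look like this (the title text is made up for illustration):

```
title: Multiple regression with missing data (x-variable means not estimated)
data:
  file is hsbmis2.dat;
variable:
  names are id female race ses hises prog academic
            read write math science socst hon;
  usevariables are write read female math;
  missing are all (-9999);
model:
  ! Assumed specification: only the regression statement is given,
  ! so cases with missing values on female, read or math are dropped.
  write on female read math;
```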
To force Mplus to use all observations, we can estimate the means of the x-variables. Mplus then treats them as endogenous variables, and their missing values are handled by FIML along with the rest of the model. This is what the [female read math]; statement in the model command below does. Note that the number of observations is now back up to 200, instead of the 76 (200 - 124 = 76) we would have had if we had not estimated the means of the x-variables.
title: Multiple regression with missing data
data:
  file is hsbmis2.dat;
variable:
  names are id female race ses hises prog academic
            read write math science socst hon;
  usevariables are write read female math;
  missing are all (-9999);
model:
  write on female read math;
  [female read math];

INPUT READING TERMINATED NORMALLY

multiple regression with missing data

SUMMARY OF ANALYSIS

Number of groups                                                 1
Number of observations                                         200

Number of dependent variables                                    1
Number of independent variables                                  3
Number of continuous latent variables                            0

Observed dependent variables

  Continuous
   WRITE

Observed independent variables
   READ        FEMALE      MATH

Estimator                                                       ML
Information matrix                                        OBSERVED
Maximum number of iterations                                  1000
Convergence criterion                                    0.500D-04
Maximum number of steepest descent iterations                   20
Maximum number of iterations for H1                           2000
Convergence criterion for H1                             0.100D-03

Input data file(s)
  hsbmis2.dat

Input data format  FREE

SUMMARY OF DATA

     Number of missing data patterns             8

COVARIANCE COVERAGE OF DATA

Minimum covariance coverage value   0.100

     PROPORTION OF DATA PRESENT

           Covariance Coverage
              WRITE         READ          FEMALE        MATH
              ________      ________      ________      ________
 WRITE          1.000
 READ           0.815         0.815
 FEMALE         0.675         0.550         0.675
 MATH           0.715         0.575         0.475         0.715

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Chi-Square Test of Model Fit

          Value                              0.000
          Degrees of Freedom                     0
          P-Value                           0.0000

Chi-Square Test of Model Fit for the Baseline Model

          Value                            125.057
          Degrees of Freedom                     3
          P-Value                           0.0000

CFI/TLI

          CFI                                1.000
          TLI                                1.000

Loglikelihood

          H0 Value                       -1871.900
          H1 Value                       -1871.900

Information Criteria

          Number of Free Parameters              5
          Akaike (AIC)                    3753.800
          Bayesian (BIC)                  3770.292
          Sample-Size Adjusted BIC        3754.451
            (n* = (n + 2) / 24)

RMSEA (Root Mean Square Error Of Approximation)

          Estimate                           0.000
          90 Percent C.I.                    0.000  0.000
          Probability RMSEA <= .05           0.000

SRMR (Standardized Root Mean Square Residual)

          Value                              0.000

MODEL RESULTS

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

 WRITE    ON
    FEMALE             5.435      1.121      4.847      0.000
    READ               0.298      0.072      4.168      0.000
    MATH               0.401      0.077      5.236      0.000

 Intercepts
    WRITE             12.950      2.951      4.388      0.000

 Residual Variances
    WRITE             41.622      4.716      8.825      0.000
Near the beginning of the output there is a covariance coverage table that shows the proportion of data present for each variable (on the diagonal) and for each pair of variables (off the diagonal) in the model. The model results near the bottom show estimates and standard errors that are close to those from the first model with complete data. See our Annotated Output: Ordinary Least Squares Regression page for more detailed interpretations of each model parameter.
Probit regression
We will begin with a probit regression model. Mplus treats this as a probit model because we declare that honors is a categorical variable. Mplus recognizes that honors has two levels. Note that Mplus uses a mean- and variance-adjusted weighted least squares estimator, WLSMV (as indicated in the output below).
title: Probit regression
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are honors female read math;
  categorical = honors;
model:
  honors on female read math;

INPUT READING TERMINATED NORMALLY

probit regression

SUMMARY OF ANALYSIS

Number of groups                                                 1
Number of observations                                         200

Number of dependent variables                                    1
Number of independent variables                                  3
Number of continuous latent variables                            0

Observed dependent variables

  Binary and ordered categorical (ordinal)
   HONORS

Observed independent variables
   FEMALE      READ        MATH

Estimator                                                    WLSMV
Maximum number of iterations                                  1000
Convergence criterion                                    0.500D-04
Maximum number of steepest descent iterations                   20
Parameterization                                             DELTA

Input data file(s)
  hsbdemo.dat

Input data format  FREE

SUMMARY OF CATEGORICAL DATA PROPORTIONS

    HONORS
      Category 1    0.735
      Category 2    0.265

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

Chi-Square Test of Model Fit

          Value                              0.000*
          Degrees of Freedom                     0**
          P-Value                           0.0000

*   The chi-square value for MLM, MLMV, MLR, ULSMV, WLSM and WLSMV cannot be used
    for chi-square difference tests.  MLM, MLR and WLSM chi-square difference
    testing is described in the Mplus Technical Appendices at www.statmodel.com.
    See chi-square difference testing in the index of the Mplus User's Guide.

**  The degrees of freedom for MLMV, ULSMV and WLSMV are estimated according to
    a formula given in the Mplus Technical Appendices at www.statmodel.com.
    See degrees of freedom in the index of the Mplus User's Guide.

Chi-Square Test of Model Fit for the Baseline Model

          Value                             35.149
          Degrees of Freedom                     3
          P-Value                           0.0000

CFI/TLI

          CFI                                1.000
          TLI                                1.000

Number of Free Parameters                        4

RMSEA (Root Mean Square Error Of Approximation)

          Estimate                           0.000

WRMR (Weighted Root Mean Square Residual)

          Value                              0.000

MODEL RESULTS

                                                    Two-Tailed
                    Estimate       S.E.  Est./S.E.    P-Value

 HONORS   ON
    FEMALE             0.682      0.256      2.661      0.008
    READ               0.047      0.017      2.745      0.006
    MATH               0.074      0.016      4.532      0.000

 Thresholds
    HONORS$1           7.663      1.149      6.671      0.000

R-SQUARE

    Observed                   Residual
    Variable        Estimate   Variance

    HONORS             0.553      1.000
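To connect these estimates to predicted probabilities, the probit model can be sketched as below. This assumes the usual Mplus convention that the threshold HONORS$1 enters the linear predictor with a negative sign, and it relies on the residual variance of 1.000 reported in the R-SQUARE table. Here honors = 1 denotes the second category, and the example student (female = 1, read = 60, math = 60) is hypothetical.

```latex
\begin{align*}
\Pr(\text{honors} = 1 \mid x)
  &= \Phi\bigl(-7.663 + 0.682\,\text{female} + 0.047\,\text{read} + 0.074\,\text{math}\bigr) \\
\Pr(\text{honors} = 1 \mid \text{female}=1,\ \text{read}=60,\ \text{math}=60)
  &= \Phi\bigl(-7.663 + 0.682 + 0.047(60) + 0.074(60)\bigr)
   = \Phi(0.279) \approx 0.61
\end{align*}
```

Here \Phi is the standard normal cumulative distribution function.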
For information on interpreting the results of probit models, please visit Annotated Output: Probit Regression.
Logistic regression
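Next we have a logistic regression model. The difference between this model and the probit model is that we specify that maximum likelihood is to be used as the estimator. With the ML estimator, Mplus uses the logit link by default, which is why the link = logit line in the input below is commented out. For the rest of this section we will present only the input files for each of the models.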
title: Logistic regression
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are honors female read math;
  categorical = honors;
analysis:
  estimator = ML;
  ! link = logit;
model:
  honors on female read math;
For information on interpreting the results of logistic models, please visit Annotated Output: Logit Regression.
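The commented-out link = logit line in the logistic input hints that the link function can be set explicitly. As a sketch, assuming the LINK option of the ANALYSIS command accepts probit as described in the Mplus User's Guide, a probit model could be estimated by maximum likelihood rather than by WLSMV as follows (the title text is made up for illustration):

```
title: Probit regression by maximum likelihood
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are honors female read math;
  categorical = honors;
analysis:
  estimator = ML;
  link = probit;   ! assumed option value; replaces the default logit link
model:
  honors on female read math;
```

The coefficients from such a run would be on the probit scale, so they should be compared with the WLSMV probit estimates rather than with the logistic ones.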
Ordered probit regression
For this next model we use an ordered response variable, ses, which takes on the values 1, 2 and 3. Other than the ordered response variable itself, the setup is identical to the binary probit model.
title: Ordered probit regression
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are ses female read math;
  categorical = ses;
model:
  ses on female read math;
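For orientation, the model behind this input can be sketched as a cumulative probit. The symbols below are generic notation rather than output from this analysis; with three categories there are two thresholds, which Mplus is expected to label SES$1 and SES$2.

```latex
\[
\Pr(\text{ses} \le k \mid x)
  = \Phi\bigl(\tau_k - (\beta_1\,\text{female} + \beta_2\,\text{read} + \beta_3\,\text{math})\bigr),
\qquad k = 1, 2, \quad \tau_1 < \tau_2
\]
```

The ordered logistic model in the next section has the same form, with the standard normal distribution function \Phi replaced by the logistic distribution function.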
Ordered logistic regression
For the ordered logit model we again use the maximum likelihood estimator.
title: Ordered logistic regression
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are ses female read math;
  categorical = ses;
analysis:
  estimator = ML;
model:
  ses on female read math;
Multinomial logistic regression
For the multinomial logit model we use the variable prog, which indicates the type of high school program, where 1 is general, 2 is academic and 3 is vocational. We again use the maximum likelihood estimator but declare prog to be a nominal variable.
title: Multinomial logistic regression
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are prog female read math;
  nominal = prog;
analysis:
  estimator = ML;
model:
  prog on female read math;
For information on interpreting the results of multinomial logistic models, please visit Annotated Output: Multinomial Logistic Regression.
Poisson regression
The first model in this section is a Poisson regression model using awards as the count response variable. Notice the (p) for Poisson on the count statement. Count data often use exposure variables to indicate the number of times the event could have happened. You can incorporate exposure into your model by using the exposure() option.
title: Poisson regression
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are awards female read math;
  count = awards (p);
model:
  awards on female read math;
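As a reminder of what the coefficients in this model mean, the Poisson regression can be sketched with a log link. The symbols are generic notation, not estimates from this analysis.

```latex
\begin{align*}
\log E(\text{awards} \mid x) &= \beta_0 + \beta_1\,\text{female} + \beta_2\,\text{read} + \beta_3\,\text{math} \\
E(\text{awards} \mid x) &= \exp\bigl(\beta_0 + \beta_1\,\text{female} + \beta_2\,\text{read} + \beta_3\,\text{math}\bigr)
\end{align*}
```

Under this form, exp(\beta_j) is the multiplicative change in the expected number of awards for a one-unit increase in the corresponding predictor, holding the others fixed.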
For information on interpreting the results of Poisson models, please visit Annotated Output: Poisson Regression.
Negative binomial regression
The next model in this section is a negative binomial regression model. Negative binomial models are useful when there is overdispersion in the data. Notice the (nb) for negative binomial on the count statement. By default, Mplus uses maximum likelihood with robust standard errors (MLR), so robust standard errors are given in the output. To obtain conventional maximum likelihood standard errors instead, include the analysis: estimator = ml; block. Count data often use exposure variables to indicate the number of times the event could have happened. You can incorporate exposure into your model by using the exposure() option.
title: Negative binomial regression
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are awards female read math;
  count = awards (nb);
model:
  awards on female read math;
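For completeness, here is a sketch of the same input with the analysis block described above added, so that conventional maximum likelihood standard errors are reported instead of the robust ones. Only the analysis command and the title text differ from the input above.

```
title: Negative binomial regression with ML standard errors
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are awards female read math;
  count = awards (nb);
analysis:
  estimator = ml;   ! conventional ML standard errors instead of MLR
model:
  awards on female read math;
```

The point estimates should be the same under both estimators; only the standard errors, and therefore the test statistics and p-values, are expected to change.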
Zero-inflated Poisson regression
The next model is a zero-inflated Poisson regression model. Zero-inflated models are useful when there is a second mechanism generating zeros, such that there would be many more zeros than would be expected from the count model alone. The zero-inflated models are examples of multiple equation models. In this case, there is one equation for the count model, awards on female read math, and a second equation for estimating the excess zeros, awards#1 on female read math; this second equation is a logit model. Notice the (pi) for zero-inflated Poisson on the count statement.
Although we are using the same predictors in both equations, this is not necessary. The output (not shown here) contains a set of parameter estimates for each equation. Thus, the estimate for female of 0.214 is for the count equation, and the estimate of -4.029 is for the excess zero equation.
title: Zero-inflated poisson regression
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are awards female read math;
  count = awards (pi);
model:
  awards on female read math;
  awards#1 on female read math;
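Since the two equations need not share predictors, a variant could use a smaller set in the excess-zero equation. The sketch below is illustrative only: the choice of math as the sole predictor of excess zeros is hypothetical and not a recommendation from this seminar, and the title text is made up.

```
title: Zero-inflated poisson regression, different predictors per equation
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are awards female read math;
  count = awards (pi);
model:
  awards on female read math;   ! count equation
  awards#1 on math;             ! excess-zero (logit) equation, hypothetical choice
```

Every variable on the usevariables list still appears somewhere in the model, so the variable command does not need to change.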
For information on interpreting the results of zero-inflated Poisson models, please visit Annotated Output: Zero-inflated Poisson Regression.
Zero-inflated negative binomial regression
The final model in this section is a zero-inflated negative binomial regression model. The setup for this model parallels that of the zero-inflated Poisson model above. Notice the (nbi) for zero-inflated negative binomial on the count statement.
title: Zero-inflated negative binomial regression
data:
  file is hsbdemo.dat;
variable:
  names are id female ses schtyp prog read write math
            science socst honors awards cid;
  usevariables are awards female read math;
  count = awards (nbi);
model:
  awards on female read math;
  awards#1 on female read math;
Mplus can also run zero-truncated negative binomial models and negative binomial hurdle models. Examples of these models are beyond the scope of this seminar.