Linear regression, also called OLS (ordinary least squares) regression, is used to model continuous outcome variables. In the OLS regression model, the outcome is modeled as a linear combination of the predictor variables.
Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process that researchers are expected to carry out. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics, or potential follow-up analyses.
Examples of linear regression
Example 1: A researcher is interested in how scores on a math and a science test are associated with scores on a writing test. The outcome variable is the score on the writing test.
Example 2: A research team is interested in motivating people to eat more vegetables by showing subjects videos of simple ways to prepare vegetables for dinner. The outcome variable is the number of ounces of vegetables consumed at dinner over one week.
Example 3: Researchers are interested in the effect of light on sleep quality. They randomly assign subjects to different light conditions and measure sleep quality for one month. The average sleep quality score is the outcome variable.
Description of the data
For our data analysis below, we are going to expand on Example 1 about the association between test scores. We have generated a hypothetical data set, hsb2, which can be obtained from our website. We can obtain descriptive statistics for each of the variables that we will use in our linear regression model. The Mplus program used to obtain descriptive statistics is shown below.
title: descriptive statistics for outcome and predictor variables
data:
file is "c:\mydata\hsb2.dat";
variable:
names are id female race ses schtyp prog read write math science socst;
usevariables are science math female socst read;
missing are all (-9999); ! this statement is not strictly necessary because there are no missing data
analysis:
type = basic;
Some of the output has been omitted to save space.
INPUT READING TERMINATED NORMALLY
descriptive statistics for outcome and predictor variables
SUMMARY OF ANALYSIS
Number of groups 1
Number of observations 200
Number of dependent variables 5
Number of independent variables 0
Number of continuous latent variables 0
Observed dependent variables
Continuous
SCIENCE MATH FEMALE SOCST READ
Input data format FREE
SUMMARY OF DATA
Number of missing data patterns 1
SUMMARY OF MISSING DATA PATTERNS
MISSING DATA PATTERNS (x = not missing)
1
SCIENCE x
MATH x
FEMALE x
SOCST x
READ x
MISSING DATA PATTERN FREQUENCIES
Pattern Frequency
1 200
PROPORTION OF DATA PRESENT
Covariance Coverage
SCIENCE MATH FEMALE SOCST READ
________ ________ ________ ________ ________
SCIENCE 1.000
MATH 1.000 1.000
FEMALE 1.000 1.000 1.000
SOCST 1.000 1.000 1.000 1.000
READ 1.000 1.000 1.000 1.000 1.000
RESULTS FOR BASIC ANALYSIS
ESTIMATED SAMPLE STATISTICS
Means
SCIENCE MATH FEMALE SOCST READ
________ ________ ________ ________ ________
51.850 52.645 0.545 52.405 52.230
Covariances
SCIENCE MATH FEMALE SOCST READ
________ ________ ________ ________ ________
SCIENCE 97.537
MATH 58.212 87.329
FEMALE -0.628 -0.137 0.248
SOCST 49.191 54.489 0.279 114.681
READ 63.650 63.297 -0.270 68.067 104.597
Correlations
SCIENCE MATH FEMALE SOCST READ
________ ________ ________ ________ ________
SCIENCE 1.000
MATH 0.631 1.000
FEMALE -0.128 -0.029 1.000
SOCST 0.465 0.544 0.052 1.000
READ 0.630 0.662 -0.053 0.621 1.000
MAXIMUM LOG-LIKELIHOOD VALUE FOR THE UNRESTRICTED (H1) MODEL IS -2943.209
UNIVARIATE SAMPLE STATISTICS
UNIVARIATE HIGHER-ORDER MOMENT DESCRIPTIVE STATISTICS
Variable/ Mean/ Skewness/ Minimum/ % with Percentiles
Sample Size Variance Kurtosis Maximum Min/Max 20%/60% 40%/80% Median
SCIENCE 51.850 -0.187 26.000 0.50% 42.000 50.000 53.000
200.000 97.537 -0.572 74.000 0.50% 55.000 61.000
MATH 52.645 0.284 33.000 0.50% 43.000 49.000 52.000
200.000 87.329 -0.663 75.000 1.00% 55.000 61.000
FEMALE 0.545 -0.181 0.000 45.50% 0.000 0.000 1.000
200.000 0.248 -1.967 1.000 54.50% 1.000 1.000
SOCST 52.405 -0.379 26.000 1.50% 41.000 51.000 52.000
200.000 114.681 -0.541 71.000 4.00% 56.000 61.000
READ 52.230 0.195 28.000 0.50% 44.000 47.000 50.000
200.000 104.597 -0.637 76.000 1.00% 55.000 63.000
The means and variances of each variable should be compared with those produced by your favorite general-purpose statistical package (e.g., SAS, Stata, SPSS, or R). This will help ensure that the data were read into Mplus correctly.
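As a quick illustration of such a cross-check, the sketch below computes a mean, both variance flavors, and a correlation in plain Python. The scores are made-up stand-ins, not values from hsb2.dat. One point worth knowing when comparing: Mplus's estimated sample statistics are maximum-likelihood estimates whose variance divisor is n, while most general-purpose packages divide by n − 1, so their variances will be slightly larger.

```python
import statistics

# Hypothetical stand-in scores; the real values would be read from hsb2.dat.
science = [47, 63, 58, 53, 53, 63, 39, 58]
math_sc = [57, 65, 60, 54, 52, 64, 44, 57]
n = len(science)

# statistics.pvariance divides by n, matching Mplus's ML-based sample
# statistics; statistics.variance divides by n - 1, as most packages do.
mean_sci = statistics.mean(science)
mean_math = statistics.mean(math_sc)
pvar_sci = statistics.pvariance(science)   # divide-by-n variance
svar_sci = statistics.variance(science)    # divide-by-(n-1) variance

# A correlation is just a covariance rescaled by the two standard deviations,
# which is how the Correlations block above relates to the Covariances block.
cov = sum((x - mean_math) * (y - mean_sci)
          for x, y in zip(math_sc, science)) / n
corr = cov / (statistics.pstdev(math_sc) * statistics.pstdev(science))
print(mean_sci, pvar_sci, svar_sci, corr)
```

For example, applying the same rescaling to the Mplus output above, the science-math correlation of 0.631 is the covariance 58.212 divided by the product of the two standard deviations, sqrt(97.537) × sqrt(87.329).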
Analysis methods you might consider
Below is a list of some analysis methods you may have encountered.
- Linear regression, the focus of this page.
- ANCOVA: ANCOVA will give the same results as linear regression, except with a different parameterization. Linear regression will use dummy coding for categorical predictors, while ANCOVA will use effect coding.
- Robust regression: Robust regression is a type of linear regression used when the assumption of homogeneity of variance may be violated.
Linear regression
Below we show the Mplus input to estimate a linear regression model.
title: linear regression with a continuous observed
dependent variable with four predictor variables
data:
file is "C:\mydata\hsb2.dat";
variable:
names are id female race ses schtyp prog read write math science socst;
usevariables are science math female socst read;
missing are all (-9999);
model:
science on math female socst read;
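The model statement science on math female socst read; regresses science on the four predictors. For intuition about what is being estimated, the one-predictor analogue has a simple closed form (slope = Sxy / Sxx, intercept = ȳ − slope·x̄), sketched below on toy data; this is only a conceptual illustration, as Mplus fits the full four-predictor model by maximum likelihood.

```python
def simple_ols(x, y):
    """Closed-form OLS for one predictor: slope = Sxy / Sxx."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

# Toy data generated as y = 3 + 2x, so OLS recovers those coefficients.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 7.0, 9.0, 11.0, 13.0]
print(simple_ols(x, y))  # (3.0, 2.0)
```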
INPUT READING TERMINATED NORMALLY
linear regression with a continuous observed
dependent variable with four predictor variables
SUMMARY OF ANALYSIS
Number of groups 1
Number of observations 200
Number of dependent variables 1
Number of independent variables 4
Number of continuous latent variables 0
Observed dependent variables
Continuous
SCIENCE
Observed independent variables
MATH FEMALE SOCST READ
<some output omitted>
THE MODEL ESTIMATION TERMINATED NORMALLY
MODEL FIT INFORMATION
Number of Free Parameters 6
<some output omitted>
MODEL RESULTS
Two-Tailed
Estimate S.E. Est./S.E. P-Value
SCIENCE ON
MATH 0.389 0.073 5.319 0.000
FEMALE -2.010 1.010 -1.990 0.047
SOCST 0.050 0.061 0.811 0.417
READ 0.335 0.072 4.666 0.000
Intercepts
SCIENCE 12.326 3.153 3.909 0.000
Residual Variances
SCIENCE 49.820 4.982 10.000 0.000
QUALITY OF NUMERICAL RESULTS
Condition Number for the Information Matrix 0.286E-04
(ratio of smallest to largest eigenvalue)
- At the top of the output we see that all 200 observations in our data set were used in the analysis. We also see that we have one dependent (AKA outcome) variable and four independent (AKA predictor) variables.
- In the Model Results section we see the coefficients, their standard errors, the z-statistic (Est./S.E.), and the associated two-tailed p-values. Both math and read are statistically significant.
- For every one-unit increase in math, the expected value of science increases by 0.389, holding the other predictors constant.
- For every one-unit increase in read, the expected value of science increases by 0.335, holding the other predictors constant.
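The interpretation above can be made concrete by writing out the fitted equation from the Model Results table. The sketch below uses the reported estimates (rounded to three decimals in the output) with hypothetical predictor values, and confirms that raising math by one unit raises the predicted science score by exactly the math coefficient.

```python
# Coefficients taken from the Model Results output above (rounded).
def predict_science(math, female, socst, read):
    return 12.326 + 0.389 * math - 2.010 * female + 0.050 * socst + 0.335 * read

# Hypothetical predictor values for illustration.
base = predict_science(math=52, female=1, socst=52, read=52)
plus = predict_science(math=53, female=1, socst=52, read=52)
print(round(plus - base, 3))  # 0.389, the math coefficient
```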
Things to consider
- The outcome variable in a linear regression is assumed to be continuous and should have a reasonable range of values. There is no assumption that the marginal distribution of the outcome is normal; the normality assumption concerns the residuals.
- The assumptions of linear regression should be checked.
- Clustered data: Sometimes observations are clustered into groups (e.g., people within families, students within classrooms). In such cases, you may want to see our page on non-independence within clusters.
References
- Regression with Graphics: A Second Course in Statistics by Lawrence C. Hamilton
- Regression Analysis: A Constructive Critique by Richard A. Berk
See also
- Mplus Annotated Output: Linear regression
