Linear regression, also called OLS (ordinary least squares) regression, is used to model continuous outcome variables. In the OLS regression model, the outcome is modeled as a linear combination of the predictor variables.
Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process that researchers are expected to carry out. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics, or potential follow-up analyses.
Examples of linear regression
Example 1: A researcher is interested in how scores on a math and a science test are associated with scores on a writing test. The outcome variable is the score on the writing test.
Example 2: A research team is interested in motivating people to eat more vegetables by showing subjects videos of simple ways to prepare vegetables for dinner. The outcome variable is the number of ounces of vegetables consumed at dinner over one week.
Example 3: Researchers are interested in the effect of light on sleep quality. They randomly assign subjects to different light conditions and measure sleep quality for one month. The average sleep quality score is the outcome variable.
Description of the data
For our data analysis below, we are going to expand on Example 1 about the association between test scores. We have generated a hypothetical data set, hsb2, which can be obtained from our website. We can obtain descriptive statistics for each of the variables that we will use in our linear regression model. The Mplus program used to obtain descriptive statistics is shown below.
title: descriptive statistics for outcome and predictor variables
data:
file is "c:\mydata\hsb2.dat";
variable:
names are id female race ses schtyp prog read write math science socst;
usevariables are science math female socst read;
missing are all (-9999); ! this statement is not strictly necessary because there are no missing data
analysis:
type = basic;
Some of the output has been omitted to save space.
INPUT READING TERMINATED NORMALLY
descriptive statistics for outcome and predictor variables
SUMMARY OF ANALYSIS
Number of groups 1
Number of observations 200
Number of dependent variables 5
Number of independent variables 0
Number of continuous latent variables 0
Observed dependent variables
Continuous
SCIENCE MATH FEMALE SOCST READ
Input data format FREE
SUMMARY OF DATA
Number of missing data patterns 1
SUMMARY OF MISSING DATA PATTERNS
MISSING DATA PATTERNS (x = not missing)
1
SCIENCE x
MATH x
FEMALE x
SOCST x
READ x
MISSING DATA PATTERN FREQUENCIES
Pattern Frequency
1 200
PROPORTION OF DATA PRESENT
Covariance Coverage
SCIENCE MATH FEMALE SOCST READ
________ ________ ________ ________ ________
SCIENCE 1.000
MATH 1.000 1.000
FEMALE 1.000 1.000 1.000
SOCST 1.000 1.000 1.000 1.000
READ 1.000 1.000 1.000 1.000 1.000
RESULTS FOR BASIC ANALYSIS
ESTIMATED SAMPLE STATISTICS
Means
SCIENCE MATH FEMALE SOCST READ
________ ________ ________ ________ ________
51.850 52.645 0.545 52.405 52.230
Covariances
SCIENCE MATH FEMALE SOCST READ
________ ________ ________ ________ ________
SCIENCE 97.537
MATH 58.212 87.329
FEMALE -0.628 -0.137 0.248
SOCST 49.191 54.489 0.279 114.681
READ 63.650 63.297 -0.270 68.067 104.597
Correlations
SCIENCE MATH FEMALE SOCST READ
________ ________ ________ ________ ________
SCIENCE 1.000
MATH 0.631 1.000
FEMALE -0.128 -0.029 1.000
SOCST 0.465 0.544 0.052 1.000
READ 0.630 0.662 -0.053 0.621 1.000
MAXIMUM LOG-LIKELIHOOD VALUE FOR THE UNRESTRICTED (H1) MODEL IS -2943.209
UNIVARIATE SAMPLE STATISTICS
UNIVARIATE HIGHER-ORDER MOMENT DESCRIPTIVE STATISTICS
Variable/ Mean/ Skewness/ Minimum/ % with Percentiles
Sample Size Variance Kurtosis Maximum Min/Max 20%/60% 40%/80% Median
SCIENCE 51.850 -0.187 26.000 0.50% 42.000 50.000 53.000
200.000 97.537 -0.572 74.000 0.50% 55.000 61.000
MATH 52.645 0.284 33.000 0.50% 43.000 49.000 52.000
200.000 87.329 -0.663 75.000 1.00% 55.000 61.000
FEMALE 0.545 -0.181 0.000 45.50% 0.000 0.000 1.000
200.000 0.248 -1.967 1.000 54.50% 1.000 1.000
SOCST 52.405 -0.379 26.000 1.50% 41.000 51.000 52.000
200.000 114.681 -0.541 71.000 4.00% 56.000 61.000
READ 52.230 0.195 28.000 0.50% 44.000 47.000 50.000
200.000 104.597 -0.637 76.000 1.00% 55.000 63.000
The means and variances of each variable should be compared with those produced by your favorite general-purpose statistical package (e.g., SAS, Stata, SPSS, or R). This will help ensure that the data were read into Mplus correctly.
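As a quick illustration of such a cross-check, the sketch below computes a mean, both variance flavors, and a correlation in plain Python. The scores are made-up stand-ins, not values from hsb2.dat. One point worth knowing when comparing: Mplus's estimated sample statistics are maximum-likelihood estimates whose variance divisor is n, while most general-purpose packages divide by n − 1, so their variances will be slightly larger.

```python
import statistics

# Hypothetical stand-in scores; the real values would be read from hsb2.dat.
science = [47, 63, 58, 53, 53, 63, 39, 58]
math_sc = [57, 65, 60, 54, 52, 64, 44, 57]
n = len(science)

# statistics.pvariance divides by n, matching Mplus's ML-based sample
# statistics; statistics.variance divides by n - 1, as most packages do.
mean_sci = statistics.mean(science)
mean_math = statistics.mean(math_sc)
pvar_sci = statistics.pvariance(science)   # divide-by-n variance
svar_sci = statistics.variance(science)    # divide-by-(n-1) variance

# A correlation is just a covariance rescaled by the two standard deviations,
# which is how the Correlations block above relates to the Covariances block.
cov = sum((x - mean_math) * (y - mean_sci)
          for x, y in zip(math_sc, science)) / n
corr = cov / (statistics.pstdev(math_sc) * statistics.pstdev(science))
print(mean_sci, pvar_sci, svar_sci, corr)
```

For example, applying the same rescaling to the Mplus output above, the science-math correlation of 0.631 is the covariance 58.212 divided by the product of the two standard deviations, sqrt(97.537) × sqrt(87.329).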
Analysis methods you might consider
Below is a list of some analysis methods you may have encountered.
- Linear regression, the focus of this page.
- ANCOVA: ANCOVA will give the same results as linear regression, except with a different parameterization. Linear regression will use dummy coding for categorical predictors, while ANCOVA will use effect coding.
- Robust regression: Robust regression is a type of linear regression used when the assumption of homogeneity of variance may be violated.
Linear regression
Below we show the Mplus input to estimate a linear regression model.
title: linear regression with a continuous observed
dependent variable with four predictor variables
data:
file is "C:\mydata\hsb2.dat";
variable:
names are id female race ses schtyp prog read write math science socst;
usevariables are science math female socst read;
missing are all (-9999);
model:
science on math female socst read;
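The model statement science on math female socst read; regresses science on the four predictors. For intuition about what is being estimated, the one-predictor analogue has a simple closed form (slope = Sxy / Sxx, intercept = ȳ − slope·x̄), sketched below on toy data; this is only a conceptual illustration, as Mplus fits the full four-predictor model by maximum likelihood.

```python
def simple_ols(x, y):
    """Closed-form OLS for one predictor: slope = Sxy / Sxx."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

# Toy data generated as y = 3 + 2x, so OLS recovers those coefficients.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 7.0, 9.0, 11.0, 13.0]
print(simple_ols(x, y))  # (3.0, 2.0)
```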
INPUT READING TERMINATED NORMALLY
linear regression with a continuous observed
dependent variable with four predictor variables
SUMMARY OF ANALYSIS
Number of groups 1
Number of observations 200
Number of dependent variables 1
Number of independent variables 4
Number of continuous latent variables 0
Observed dependent variables
Continuous
SCIENCE
Observed independent variables
MATH FEMALE SOCST READ
<some output omitted>
THE MODEL ESTIMATION TERMINATED NORMALLY
MODEL FIT INFORMATION
Number of Free Parameters 6
<some output omitted>
MODEL RESULTS
Two-Tailed
Estimate S.E. Est./S.E. P-Value
SCIENCE ON
MATH 0.389 0.073 5.319 0.000
FEMALE -2.010 1.010 -1.990 0.047
SOCST 0.050 0.061 0.811 0.417
READ 0.335 0.072 4.666 0.000
Intercepts
SCIENCE 12.326 3.153 3.909 0.000
Residual Variances
SCIENCE 49.820 4.982 10.000 0.000
QUALITY OF NUMERICAL RESULTS
Condition Number for the Information Matrix 0.286E-04
(ratio of smallest to largest eigenvalue)
- At the top of the output we see that all 200 observations in our data set were used in the analysis. We also see that we have one dependent (AKA outcome) variable and four independent (AKA predictor) variables.
- In the Model Results section we see the coefficients, their standard errors, the z-statistic (Est./S.E.), and the associated two-tailed p-values. Both math and read are statistically significant.
- For every one-unit increase in math, the expected value of science increases by 0.389, holding the other predictors constant.
- For every one-unit increase in read, the expected value of science increases by 0.335, holding the other predictors constant.
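The interpretation above can be made concrete by writing out the fitted equation from the Model Results table. The sketch below uses the reported estimates (rounded to three decimals in the output) with hypothetical predictor values, and confirms that raising math by one unit raises the predicted science score by exactly the math coefficient.

```python
# Coefficients taken from the Model Results output above (rounded).
def predict_science(math, female, socst, read):
    return 12.326 + 0.389 * math - 2.010 * female + 0.050 * socst + 0.335 * read

# Hypothetical predictor values for illustration.
base = predict_science(math=52, female=1, socst=52, read=52)
plus = predict_science(math=53, female=1, socst=52, read=52)
print(round(plus - base, 3))  # 0.389, the math coefficient
```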
Things to consider
- The outcome variable in a linear regression is assumed to be continuous and should have a reasonable range of values. There is no assumption that the marginal distribution of the outcome is normal; the normality assumption concerns the residuals.
- The assumptions of linear regression should be checked.
- Clustered data: Sometimes observations are clustered into groups (e.g., people within families, students within classrooms). In such cases, you may want to see our page on non-independence within clusters.
References
- Regression with Graphics: A Second Course in Statistics by Lawrence C. Hamilton
- Regression Analysis: A Constructive Critique by Richard A. Berk
See also
- Mplus Annotated Output: Linear regression
