Linear regression, also called OLS (ordinary least squares) regression, is used to model continuous outcome variables. In the OLS regression model, the outcome is modeled as a linear combination of the predictor variables.
Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics and potential follow-up analyses.
Examples of linear regression
Example 1: A researcher is interested in how scores on a math and a science test are associated with scores on a writing test. The outcome variable is the score on the writing test.
Example 2: A research team is interested in motivating people to eat more vegetables by showing subjects videos of simple ways to prepare vegetables for dinner. The outcome variable is the number of ounces of vegetables consumed for dinner for one week.
Example 3: Researchers are interested in the effect of light on sleep quality. They randomly assign subjects to different light conditions and measure sleep quality for one month. The average sleep quality score is the outcome variable.
Description of the data
For our data analysis below, we are going to expand on Example 1 about the association between test scores. We have generated a hypothetical dataset called hsb2.
We can obtain descriptive statistics for each of the variables that we will use in our linear regression model. Although the variable female is binary (coded 0 and 1), we can still use it with proc means; because it is coded 0/1, its mean is the proportion of students who are female.
data hsb2;
  set "C:\mydata\hsb2";
run;

proc means data = hsb2;
  var science math female socst read;
run;

The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
science    200      51.8500000       9.9008908      26.0000000      74.0000000
math       200      52.6450000       9.3684478      33.0000000      75.0000000
female     200       0.5450000       0.4992205               0       1.0000000
socst      200      52.4050000      10.7357935      26.0000000      71.0000000
read       200      52.2300000      10.2529368      28.0000000      76.0000000
-------------------------------------------------------------------------------
We can use the proc freq command to see the number of males and females.
proc freq data = hsb2;
  tables female;
run;

The FREQ Procedure

                                       Cumulative    Cumulative
female    Frequency     Percent     Frequency       Percent
---------------------------------------------------------------
     0          91       45.50             91         45.50
     1         109       54.50            200        100.00
Analysis methods you might consider
Below is a list of some analysis methods you may have encountered.
- Linear regression, the focus of this page.
- ANCOVA: ANCOVA will give the same results as linear regression, except with a different parameterization. Linear regression will use dummy coding for categorical predictors, while ANCOVA will use effect coding.
- Robust regression: Robust regression is a type of linear regression used when the assumption of homogeneity of variance may be violated (a sketch is given after this list).
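For comparison, robust regression is available in SAS through proc robustreg. Below is a minimal sketch using M estimation with the same variables we model later on this page; it is offered only as an illustration of the alternative method, not as part of the main analysis.

proc robustreg data = hsb2 method = m;
  /* M estimation downweights observations with large residuals */
  model science = math female socst read;
run;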
Linear regression
Below we use proc reg to estimate a linear regression model.
proc reg data = hsb2;
  model science = math female socst read;
run;

The REG Procedure
Model: MODEL1
Dependent Variable: science

Number of Observations Read         200
Number of Observations Used         200

Analysis of Variance

                                Sum of           Mean
Source             DF          Squares         Square    F Value    Pr > F
Model               4       9543.72074     2385.93019      46.69    <.0001
Error             195       9963.77926       51.09630
Corrected Total   199            19508

Root MSE            7.14817    R-Square    0.4892
Dependent Mean     51.85000    Adj R-Sq    0.4788
Coeff Var          13.78624

Parameter Estimates

                        Parameter      Standard
Variable       DF        Estimate         Error    t Value    Pr > |t|
Intercept       1        12.32529       3.19356       3.86      0.0002
math            1         0.38931       0.07412       5.25      <.0001
female          1        -2.00976       1.02272      -1.97      0.0508
socst           1         0.04984       0.06223       0.80      0.4241
read            1         0.33530       0.07278       4.61      <.0001
- At the top of the output we see that all 200 observations in our data set were used in the analysis (fewer observations would have been used if any of our variables had missing values).
- In the Analysis of Variance section, the test of the overall model, with an F value of 46.69 and a p-value of <.0001, tells us that our model as a whole fits significantly better than an empty or null model (i.e., a model with no predictors).
- We see the R-squared value of 0.4892, which indicates the proportion of variance in the outcome accounted for by the model.
- We also see the adjusted R-squared value of 0.4788, which adjusts the R-squared value for the number of predictors relative to the number of observations.
- In the table of Parameter Estimates, we see the coefficients (parameter estimates), their standard errors, the t-statistic (t Value), and the associated p-values (Pr > |t|). Both math and read are statistically significant. (A sketch showing how to obtain confidence intervals for these estimates follows this list.)
- For every one unit increase in math, the expected value of science increases by 0.389, holding the other predictors constant.
- For every one unit increase in read, the expected value of science increases by 0.335, holding the other predictors constant.
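If you would also like confidence intervals for the coefficients, proc reg can print them via the clb option on the model statement. A minimal sketch (clb requests 95% limits by default; the alpha= option on the model statement changes the level):

proc reg data = hsb2;
  /* clb adds confidence limits for the parameter estimates */
  model science = math female socst read / clb;
run;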
We could also have used proc glm. If you have categorical predictor variables, you may want to use proc glm because it has a class statement (proc reg does not). On the class statement, we use the ref option to set the reference group to the first (i.e., lowest numbered) category. Because a class statement is used, we add the solution option to the model statement so that the table of parameter estimates is printed.
proc glm data = hsb2;
  class female (ref = first);
  model science = math female socst read / solution;
run;
quit;

The GLM Procedure

Dependent Variable: science

                                    Sum of
Source              DF             Squares     Mean Square    F Value    Pr > F
Model                4          9543.72074      2385.93019      46.69    <.0001
Error              195          9963.77926        51.09630
Corrected Total    199         19507.50000

R-Square     Coeff Var      Root MSE    science Mean
0.489233      13.78624      7.148168        51.85000

Source      DF       Type I SS     Mean Square    F Value    Pr > F
math         1     7760.557909     7760.557909     151.88    <.0001
female       1      232.992042      232.992042       4.56    0.0340
socst        1      465.629457      465.629457       9.11    0.0029
read         1     1084.541335     1084.541335      21.23    <.0001

Source      DF     Type III SS     Mean Square    F Value    Pr > F
math         1     1409.484280     1409.484280      27.58    <.0001
female       1      197.318846      197.318846       3.86    0.0508
socst        1       32.778803       32.778803       0.64    0.4241
read         1     1084.541335     1084.541335      21.23    <.0001

                                   Standard
Parameter        Estimate             Error    t Value    Pr > |t|
Intercept     12.32528915        3.19355732       3.86      0.0002
math           0.38931018        0.07412426       5.25      <.0001
female        -2.00976465        1.02271744      -1.97      0.0508
socst          0.04984429        0.06223197       0.80      0.4241
read           0.33529980        0.07277882       4.61      <.0001
The variable female could be used on a class statement, but it does not have to be. Predictor variables that have more than two levels, or that are not coded 0 and 1, should be put on the class statement.
By default, proc glm gives both the Type I (sequential) and Type III (marginal) Sums of Squares. You can ignore the Type I Sums of Squares and interpret the Type III Sums of Squares, which do not depend on the order in which the predictors are listed on the model statement.
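One benefit of putting female on the class statement is that proc glm can then report adjusted (least-squares) means of science for each level of female via the lsmeans statement. A minimal sketch, reusing the model above:

proc glm data = hsb2;
  class female (ref = first);
  model science = math female socst read / solution;
  lsmeans female / pdiff;  /* adjusted means and the p-value for their difference */
run;
quit;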
Things to consider
- The outcome variable in a linear regression is assumed to be continuous. It should have a reasonable range of values. There is no assumption that the distribution of the outcome itself is normal; the normality assumption concerns the residuals.
- The assumptions of linear regression should be checked. Please see SAS Web Book: Linear Regression for information on the assumptions of linear regression and how to assess these assumptions in SAS.
- Clustered data: Sometimes observations are clustered into groups (e.g., people within families, students within classrooms). In such situations, a multilevel model may be more appropriate (a sketch is given after this list).
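As an illustration of the multilevel approach, a random-intercept model could be fit with proc mixed. This is a sketch only; the clustering variable school_id is hypothetical and does not exist in the hsb2 dataset.

proc mixed data = hsb2;
  class school_id;                         /* hypothetical clustering variable */
  model science = math female socst read / solution;
  random intercept / subject = school_id;  /* random intercept for each cluster */
run;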
See also
- SAS Annotated Output: Linear regression
- SAS Web Book: Linear regression
- Seminar: Regression with SAS