Linear regression, also called OLS (ordinary least squares) regression, is used to model continuous outcome variables. In the OLS regression model, the outcome is modeled as a linear combination of the predictor variables.
Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics and potential follow-up analyses.
Examples of linear regression
Example 1: A researcher is interested in how scores on a math and a science test are associated with scores on a writing test. The outcome variable is the score on the writing test.
Example 2: A research team is interested in motivating people to eat more vegetables by showing subjects videos of simple ways to prepare vegetables for dinner. The outcome variable is the number of ounces of vegetables consumed for dinner for one week.
Example 3: Researchers are interested in the effect of light on sleep quality. They randomly assign subjects to different light conditions and measure sleep quality for one month. The average sleep quality score is the outcome variable.
Description of the data
For our data analysis below, we are going to expand on Example 1 about the association between test scores. We have generated a hypothetical dataset called hsb2.
We can obtain descriptive statistics for each of the variables that we will use in our linear regression model. Although the variable female is binary (coded 0 and 1), we can still use it with proc means; because it is coded 0/1, its mean is the proportion of students who are female.
data hsb2;
  set "C:\mydata\hsb2";
run;

proc means data = hsb2;
  var science math female socst read;
run;

The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
science    200      51.8500000       9.9008908      26.0000000      74.0000000
math       200      52.6450000       9.3684478      33.0000000      75.0000000
female     200       0.5450000       0.4992205               0       1.0000000
socst      200      52.4050000      10.7357935      26.0000000      71.0000000
read       200      52.2300000      10.2529368      28.0000000      76.0000000
-------------------------------------------------------------------------------
We can use the proc freq command to see the number of males and females.
proc freq data = hsb2;
  tables female;
run;

The FREQ Procedure

                                       Cumulative    Cumulative
female    Frequency     Percent     Frequency       Percent
---------------------------------------------------------------
     0          91       45.50             91         45.50
     1         109       54.50            200        100.00
Analysis methods you might consider
Below is a list of some analysis methods you may have encountered.
- Linear regression, the focus of this page.
- ANCOVA: ANCOVA will give the same results as linear regression, except with a different parameterization. Linear regression will use dummy coding for categorical predictors, while ANCOVA will use effect coding.
- Robust regression: Robust regression is a type of linear regression used when the assumption of homogeneity of variance may be violated (a sketch is given after this list).
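For comparison, robust regression is available in SAS through proc robustreg. Below is a minimal sketch using M estimation with the same variables we model later on this page; it is offered only as an illustration of the alternative method, not as part of the main analysis.

proc robustreg data = hsb2 method = m;
  /* M estimation downweights observations with large residuals */
  model science = math female socst read;
run;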
Linear regression
Below we use proc reg to estimate a linear regression model.
proc reg data = hsb2;
  model science = math female socst read;
run;

The REG Procedure
Model: MODEL1
Dependent Variable: science

Number of Observations Read         200
Number of Observations Used         200

Analysis of Variance

                                Sum of           Mean
Source             DF          Squares         Square    F Value    Pr > F
Model               4       9543.72074     2385.93019      46.69    <.0001
Error             195       9963.77926       51.09630
Corrected Total   199            19508

Root MSE            7.14817    R-Square    0.4892
Dependent Mean     51.85000    Adj R-Sq    0.4788
Coeff Var          13.78624

Parameter Estimates

                        Parameter      Standard
Variable       DF        Estimate         Error    t Value    Pr > |t|
Intercept       1        12.32529       3.19356       3.86      0.0002
math            1         0.38931       0.07412       5.25      <.0001
female          1        -2.00976       1.02272      -1.97      0.0508
socst           1         0.04984       0.06223       0.80      0.4241
read            1         0.33530       0.07278       4.61      <.0001
- At the top of the output we see that all 200 observations in our data set were used in the analysis (fewer observations would have been used if any of our variables had missing values).
- In the Analysis of Variance section, the test of the overall model, with an F value of 46.69 and a p-value of <.0001, tells us that our model as a whole fits significantly better than an empty or null model (i.e., a model with no predictors).
- We see the R-squared value of 0.4892, which indicates the proportion of variance in the outcome accounted for by the model.
- We also see the adjusted R-squared value of 0.4788, which adjusts the R-squared value for the number of predictors relative to the number of observations.
- In the table of Parameter Estimates, we see the coefficients (parameter estimates), their standard errors, the t-statistic (t Value), and the associated p-values (Pr > |t|). Both math and read are statistically significant. (A sketch showing how to obtain confidence intervals for these estimates follows this list.)
- For every one unit increase in math, the expected value of science increases by 0.389, holding the other predictors constant.
- For every one unit increase in read, the expected value of science increases by 0.335, holding the other predictors constant.
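If you would also like confidence intervals for the coefficients, proc reg can print them via the clb option on the model statement. A minimal sketch (clb requests 95% limits by default; the alpha= option on the model statement changes the level):

proc reg data = hsb2;
  /* clb adds confidence limits for the parameter estimates */
  model science = math female socst read / clb;
run;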
We could also have used proc glm. If you have categorical predictor variables, you may want to use proc glm because it has a class statement (proc reg does not). On the class statement, we use the ref option to set the reference group to the first (i.e., lowest numbered) category. Because a class statement is used, we add the solution option to the model statement so that the table of parameter estimates is printed.
proc glm data = hsb2;
  class female (ref = first);
  model science = math female socst read / solution;
run;
quit;

The GLM Procedure

Dependent Variable: science

                                    Sum of
Source              DF             Squares     Mean Square    F Value    Pr > F
Model                4          9543.72074      2385.93019      46.69    <.0001
Error              195          9963.77926        51.09630
Corrected Total    199         19507.50000

R-Square     Coeff Var      Root MSE    science Mean
0.489233      13.78624      7.148168        51.85000

Source      DF       Type I SS     Mean Square    F Value    Pr > F
math         1     7760.557909     7760.557909     151.88    <.0001
female       1      232.992042      232.992042       4.56    0.0340
socst        1      465.629457      465.629457       9.11    0.0029
read         1     1084.541335     1084.541335      21.23    <.0001

Source      DF     Type III SS     Mean Square    F Value    Pr > F
math         1     1409.484280     1409.484280      27.58    <.0001
female       1      197.318846      197.318846       3.86    0.0508
socst        1       32.778803       32.778803       0.64    0.4241
read         1     1084.541335     1084.541335      21.23    <.0001

                                   Standard
Parameter        Estimate             Error    t Value    Pr > |t|
Intercept     12.32528915        3.19355732       3.86      0.0002
math           0.38931018        0.07412426       5.25      <.0001
female        -2.00976465        1.02271744      -1.97      0.0508
socst          0.04984429        0.06223197       0.80      0.4241
read           0.33529980        0.07277882       4.61      <.0001
The variable female could be used on a class statement, but it does not have to be. Predictor variables that have more than two levels, or that are not coded 0 and 1, should be put on the class statement.
By default, proc glm gives both the Type I (sequential) and Type III (marginal) Sums of Squares. You can ignore the Type I Sums of Squares and interpret the Type III Sums of Squares, which do not depend on the order in which the predictors are listed on the model statement.
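One benefit of putting female on the class statement is that proc glm can then report adjusted (least-squares) means of science for each level of female via the lsmeans statement. A minimal sketch, reusing the model above:

proc glm data = hsb2;
  class female (ref = first);
  model science = math female socst read / solution;
  lsmeans female / pdiff;  /* adjusted means and the p-value for their difference */
run;
quit;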
Things to consider
- The outcome variable in a linear regression is assumed to be continuous. It should have a reasonable range of values. There is no assumption that the distribution of the outcome itself is normal; the normality assumption concerns the residuals.
- The assumptions of linear regression should be checked. Please see SAS Web Book: Linear Regression for information on the assumptions of linear regression and how to assess these assumptions in SAS.
- Clustered data: Sometimes observations are clustered into groups (e.g., people within families, students within classrooms). In such situations, a multilevel model may be more appropriate (a sketch is given after this list).
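As an illustration of the multilevel approach, a random-intercept model could be fit with proc mixed. This is a sketch only; the clustering variable school_id is hypothetical and does not exist in the hsb2 dataset.

proc mixed data = hsb2;
  class school_id;                         /* hypothetical clustering variable */
  model science = math female socst read / solution;
  random intercept / subject = school_id;  /* random intercept for each cluster */
run;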
See also
- SAS Annotated Output: Linear regression
- SAS Web Book: Linear regression
- Seminar: Regression with SAS