Linear regression, also called OLS (ordinary least squares) regression, is used to model continuous outcome variables. In the OLS regression model, the outcome is modeled as a linear combination of the predictor variables.
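For example, with an outcome y and predictors x1 through xk, the model can be written as

y = b0 + b1*x1 + b2*x2 + ... + bk*xk + e

where b0 is the intercept, b1 through bk are the regression coefficients, and e is an error term assumed to have mean zero.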
Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process that researchers are expected to carry out. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics, or potential follow-up analyses.
Examples of linear regression
Example 1: A researcher is interested in how scores on math, social studies and reading tests, along with gender, are associated with scores on a science test. The outcome variable is the score on the science test.
Example 2: A research team is interested in motivating people to eat more vegetables by showing subjects videos of simple ways to prepare vegetables for dinner. The outcome variable is the number of ounces of vegetables consumed at dinner over one week.
Example 3: Researchers are interested in the effect of light on sleep quality. They randomly assign subjects to different light conditions and measure sleep quality for one month. The average sleep quality score is the outcome variable.
Description of the data
For our data analysis below, we are going to expand on Example 1 about the association between test scores. We have generated hypothetical data, hsb2, which can be obtained from our website.
We can obtain descriptive statistics for each of the variables that we will use in our linear regression model. Below we see the minimum, first quartile, median, mean, third quartile and maximum for each variable in the dataset.
hsb2 <- read.csv("c:/mydata/hsb2.csv")
summary(hsb2)

       id              female           read            math          science     
 Min.   :  1.00   Min.   :0.000   Min.   :28.00   Min.   :33.00   Min.   :26.00  
 1st Qu.: 50.75   1st Qu.:0.000   1st Qu.:44.00   1st Qu.:45.00   1st Qu.:44.00  
 Median :100.50   Median :1.000   Median :50.00   Median :52.00   Median :53.00  
 Mean   :100.50   Mean   :0.545   Mean   :52.23   Mean   :52.65   Mean   :51.85  
 3rd Qu.:150.25   3rd Qu.:1.000   3rd Qu.:60.00   3rd Qu.:59.00   3rd Qu.:58.00  
 Max.   :200.00   Max.   :1.000   Max.   :76.00   Max.   :75.00   Max.   :74.00  
     socst      
 Min.   :26.00  
 1st Qu.:46.00  
 Median :52.00  
 Mean   :52.41  
 3rd Qu.:61.00  
 Max.   :71.00
We can use the xtabs function to see the numbers of males and females.
xtabs(~female, data = hsb2)

female
  0   1 
 91 109
Now let’s run the linear regression. We will use the lm function, which is built into R. We will also use the summary function so that we can see the standard errors, t-values and p-values for each coefficient.
summary(lm(science ~ math + female + socst + read, data = hsb2))

Call:
lm(formula = science ~ math + female + socst + read, data = hsb2)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.6706  -4.5764  -0.3237   4.5006  21.8563 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.32529    3.19356   3.859 0.000154 ***
math         0.38931    0.07412   5.252 3.92e-07 ***
female      -2.00976    1.02272  -1.965 0.050820 .  
socst        0.04984    0.06223   0.801 0.424139    
read         0.33530    0.07278   4.607 7.36e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.148 on 195 degrees of freedom
Multiple R-squared:  0.4892,	Adjusted R-squared:  0.4788 
F-statistic: 46.69 on 4 and 195 DF,  p-value: < 2.2e-16
In the coefficient table we see each estimate along with its standard error, t-statistic and p-value. Both math and read are statistically significant predictors of science; female is marginally significant (p = 0.051), and socst is not significant.
For every one-unit increase in math, the expected value of science increases by 0.389, holding the other predictors constant. For a one-unit increase in read, the expected value of science increases by 0.335, holding the other predictors constant.
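If we also want interval estimates for these coefficients, we can store the fitted model and pass it to the confint function. Below is a minimal sketch that refits the same model as above:

m <- lm(science ~ math + female + socst + read, data = hsb2)

# 95% confidence intervals for the regression coefficients
confint(m)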
Things to consider
- The outcome variable in a linear regression is assumed to be continuous. It should have a reasonable range of values. There is no assumption that the marginal distribution of the outcome is normal; in OLS regression the normality assumption applies to the residuals.
- The assumptions of linear regression (linearity, independence of the errors, constant error variance, and normality of the residuals) should be checked; see the diagnostic-plot sketch after this list.
- Clustered data: Sometimes observations are clustered into groups (e.g., people within families, students within classrooms). In such cases, you may want to see our page on non-independence within clusters.
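A quick way to examine several of these assumptions is R’s built-in plot method for fitted lm objects, which produces the standard diagnostic plots. A minimal sketch, refitting the model from above:

m <- lm(science ~ math + female + socst + read, data = hsb2)

# Residuals vs. fitted (linearity, constant variance), normal Q-Q plot of
# the residuals, scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))  # show all four plots in one window
plot(m)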