Introduction to Regression in R

Office of Advanced Research Computing (OARC), Statistical Methods and Data Analytics

1 Simple Regression

1.1 Introduction

Here are some examples of relationships between independent variables and dependent variables:

  1. Employee efficiency may be related to years of training, educational background, and age of the employee.
  2. The amount of time until a headache clears, when taking a pain killer, may be related to the dosage, age, and gender.
  3. The number of votes for a presidential candidate in each city may be related to the candidate’s gender, the city’s average income, and the state.
  4. In a study of schools in California, student academic performance for each school may be related to the poverty level of students in the school, the number of students enrolled in the school, the percentage of teachers with full credentials in the school, or the number of students in the class.

1.2 Linear function

\(X\): The predictor. \(x_1, x_2, \dots, x_n\) are \(n\) observations from \(X\).

\(Y\): The response. \(y_1, y_2, \dots, y_n\) are \(n\) observations from \(Y\).

Each index represents one case or unit of observation in the data.

If we plot these points against the X and Y axes, our goal is to fit the best line to the data.

A linear function can be determined by the intercept and the slope of the line.

The figure below shows a plot of the linear function

\[y = f(x) = 1 + 0.5 \times x\]

The intercept is equal to 1 and the slope of the line is equal to 0.5.
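A plot like this can be reproduced with base R; here is a minimal sketch (the range of x values is arbitrary):

#plot of the linear function y = 1 + 0.5 * x
x <- seq(0, 10, by = 0.5)
y <- 1 + 0.5 * x
plot(x, y, type = "l", xlab = "x", ylab = "y")  #draw the line
abline(h = 0, v = 0, lty = 2)                   #add reference lines at zero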

1.3 Linear regression as a statistical model

In real life situations, we usually cannot explain \(Y\) as an exact function of \(X\). There is always some error or noise in the dependent variable that cannot be explained perfectly by the relationship between the independent and dependent variables. We call this the error term in the model. We represent the formula for a simple linear model as:

\[Y = \beta_0 + \beta_1 X + \epsilon\]

Where \(\beta_0\) is the intercept, \(\beta_1\) is the slope of the model, and \(\epsilon\) is the error term. Since the error term is assumed to have mean zero, the expected value of \(Y\) given \(X = x\) is a linear function of \(x\):

\[ E(Y \mid X = x) = \beta_0 + \beta_1 x\]

For example, if we want to study the relationship between api00, a school-wide measure of academic performance, and the number of students enrolled in a school, the data might look like this:

index   api00   enroll
1       693     247
2       570     463
3       395     546
...     ...     ...
n       478     520

1.4 Ordinary least squares

The ordinary least squares (OLS) method estimates the intercept and slope of the regression line under the following assumptions:

1 - Errors have mean equal to zero.

2 - Errors are independent.

3 - Errors have equal variances (homoscedasticity).

4 - Errors follow a normal distribution.

5 - The relationship of the dependent variable with the independent variables is linear.

The fitted regression line is

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]

and the residual for observation \(i\) is the difference between the observed and fitted values,

\[ e_i = y_i - \hat{y_i}\]

OLS chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the sum of squared residuals:

\[ \min \sum_{i = 1}^{n} (y_i - \hat{y_i}) ^ 2\]
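For simple regression this minimization has a well-known closed-form solution: the slope estimate is the sample covariance of x and y divided by the sample variance of x, and the intercept follows from the means. A minimal sketch in R, using small hypothetical data vectors:

#closed-form OLS estimates for a simple regression (hypothetical data)
x <- c(1, 2, 3, 4, 5)            #hypothetical predictor values
y <- c(1.4, 2.1, 2.4, 3.2, 3.4)  #hypothetical response values
b1 <- cov(x, y) / var(x)         #slope: sample covariance of x and y over variance of x
b0 <- mean(y) - b1 * mean(x)     #intercept: the fitted line passes through the means
c(b0, b1)                        #same estimates as coef(lm(y ~ x))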

*Note:

+ Two main issues when the outcome is not continuous are that we may violate homogeneity of variance and that the predicted values might fall outside the range of possible values for the outcome (e.g., negative values for a count outcome).

1.5 Decomposing the sum of squares and ANOVA

Regression can be seen as a model to explain variability of the response variable using the predictor variable(s). We will show how we can decompose the total variation in the response into explained and unexplained variation.

Total Variation

If only the intercept is used to model the response variable (slope restricted to be zero), the regression line will be a horizontal line at the mean of the response variable.

In our example this will be the mean of api00 (performance score in 2000), which is the line \(y = \bar{y} = 647.6225\)

From this intercept model, the residuals for each data point \(i\) will be \(y_i - \bar{y}\).

The sum of these squared residuals, \(\sum_{i = 1}^{n} (y_i - \bar{y})^2\), is the Total Sum of Squares (TSS), representing the total variation in the outcome.


Decomposing into Unexplained and Explained Variation

When we add a predictor to the model, we can decompose the total variation into variation explained by the predictor and unexplained variation.

We can decompose the residual from the intercept model, \(y_i - \bar{y}\), into two parts using the fitted value, \(\hat{y_i}\), from the regression of api00 on the predictor enroll.

This can be written as:

\[ y_i - \bar{y} = (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y}) \]

Where \(\bar{y}\) is the mean of response, \(\hat{y_i}\) is the fitted value from the regression model that includes the predictor variable enroll and \(y_i\) is the observed response.

For example, if school \(i\) has enrollment=\(810\), then we can decompose the deviation of the api score for this school from the mean api score (the residual):

\[ \text{api00}_i - \text{mean(api00)} = (\text{api00}_i - \hat{\text{api00}_i}) + (\hat{\text{api00}_i} -\text{mean(api00)})\]

The graph below shows the decomposition of the residual into two parts.

It can be shown that the following equation holds:

\[ \sum_{i = 1}^{n} (y_i - \bar{y})^2 = \sum_{i = 1}^{n}(y_i - \hat{y_i})^2 + \sum_{i = 1}^{n}(\hat{y_i} - \bar{y})^2 \]

\[\text{Total variation of the outcome} = \text{unexplained variation} + \text{variation explained by the model} \]

The left hand side of this equation is called the Total Sum of Squares (TSS). The first term on the right hand side is called the Residual Sum of Squares (RSS) and the second term is called the Sum of Squares Regression (SSreg). Thus, we have:

\[TSS = RSS + SSreg\]

The ratio of SSreg to TSS is called \(R^2\). \(R^2\) is the proportion of the total variation that can be explained by the model.

\[R^2 = \frac{SSreg}{TSS} = 1 - \frac{RSS}{TSS}\]

*In some texts \(RSS\) is also denoted \(SSE\), the sum of squared errors.

2 Running the regression model in RStudio

2.1 A quick overview of RStudio environment

Before we begin, let’s take a look at the RStudio environment. RStudio is an integrated development environment (IDE) designed to make R easier to use. The RStudio interface has four panes: the source editor, the console, the environment/history pane, and the files/plots/packages/help pane.

RStudio Panes

2.2 Loading the data into RStudio

Now let’s read our data set into the R workspace. The data set used in this portion of the seminar is downloadable from the link below:

elemapi2v2

You can download the data, save it on your local computer, and load it from there, or you can read it directly from the web URL. The data is in .csv format, and we use the read.csv function to read and load it in R. The resulting output will be a data.frame object that will appear in the Environment tab. By clicking on the data.frame object in the environment pane we can view it as a spreadsheet.

The following code is used to read and load the data into the R workspace, first from the URL address and then from the local computer:

#reading the data file in  R
d <- read.csv(
"https://stats.idre.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv")
#We can also read the data from the local computer
#we can set the working directory by setwd()
#d <- read.csv("c:/.../elemapi2v2.csv",
#                header = TRUE, sep = ",")

An object’s type can be determined by the function class. To get the names of variables in a data.frame we use names, and to get the number of rows and columns of a data.frame we use dim.

#class of object d, returns data.frame
class(d)  
## [1] "data.frame"
#returns names of variables (columns)
names(d) 
##  [1] "snum"     "dnum"     "api00"    "api99"    "growth"   "meals"   
##  [7] "ell"      "yr_rnd"   "mobility" "acs_k3"   "acs_46"   "not_hsg" 
## [13] "hsg"      "some_col" "col_grad" "grad_sch" "avg_ed"   "full"    
## [19] "emer"     "enroll"   "mealcat"  "collcat"  "abv_hsg"  "lgenroll"
 #returns number of rows and columns of data.frame
dim(d)   
## [1] 400  24

As we mentioned, the output of read.csv is a data.frame object. A data.frame is a list of variables with the same number of rows and with unique row names. In the data.frame d we have 24 variables and 400 observations.

2.3 Exploring the data

As researchers, we first need to make sure that the cleaned data hold plausible values. Descriptive analysis will help us to understand our data better and investigate whether there are any problems in the data. In this workshop we skip this part, but in a real study descriptive analysis is an important part of a good data analysis.

Here we take a look at the structure of the data and summary statistics. We show only the first 4 variables.

#returns structure of variables in the data frame d
str(d)      
#Summary statistics of variables in d
summary(d)  
## 'data.frame':    400 obs. of  4 variables:
##  $ snum : int  906 889 887 876 888 4284 4271 2910 2899 2887 ...
##  $ dnum : int  41 41 41 41 41 98 98 108 108 108 ...
##  $ api00: int  693 570 546 571 478 858 918 831 860 737 ...
##  $ api99: int  600 501 472 487 425 844 864 791 838 703 ...
##       snum           dnum           api00           api99      
##  Min.   :  58   Min.   : 41.0   Min.   :369.0   Min.   :333.0  
##  1st Qu.:1720   1st Qu.:395.0   1st Qu.:523.8   1st Qu.:484.8  
##  Median :3008   Median :401.0   Median :643.0   Median :602.0  
##  Mean   :2867   Mean   :457.7   Mean   :647.6   Mean   :610.2  
##  3rd Qu.:4198   3rd Qu.:630.0   3rd Qu.:762.2   3rd Qu.:731.2  
##  Max.   :6072   Max.   :796.0   Max.   :940.0   Max.   :917.0
Variable name  Description
snum           school number
dnum           district number
api00          api 2000, Academic Performance Index in 2000
api99          api 1999, Academic Performance Index in 1999
growth         growth from 1999 to 2000
meals          percentage of students who get free meals
ell            english language learners
yr_rnd         year round school
mobility       percentage 1st year in school
acs_k3         avg class size k-3
acs_46         avg class size 4-6
not_hsg        parent not hsg
hsg            parent hsg
some_col       parent some college
col_grad       parent college grad
grad_sch       parent grad school
avg_ed         avg parent ed
full           percentage of teachers with full credential
emer           percentage of teachers with emergency credential
enroll         number of students
mealcat        percentage free meals in 3 categories
...            ...

2.4 lm function

We use the lm() function to fit linear models in R. To see its R documentation:

help(lm)   #shows R Documentation for function lm()

2.5 Our first linear model in R

Now that we have reviewed the linear model, let’s use R and run a simple regression model.

#Simple regression model of DV api00 and IV enroll
m1 <- lm(api00 ~ enroll, data = d)
class(m1)
## [1] "lm"
print(m1)    #Prints coefficients of the model
## 
## Call:
## lm(formula = api00 ~ enroll, data = d)
## 
## Coefficients:
## (Intercept)       enroll  
##    744.2514      -0.1999
+    `api00 ~ enroll` is interpreted as outcome `api00` is modeled by the predictor `enroll`.

\[\hat{\text{api00}} = \text{744.2514 - 0.1999 enroll}\]
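Using the estimated coefficients we can compute a fitted value by hand; a minimal sketch for a hypothetical school with 500 students enrolled (this matches the output of predict() shown in section 2.7):

#fitted api00 for a hypothetical enrollment of 500, from the estimated coefficients
coef(m1)[1] + coef(m1)[2] * 500   #744.2514 - 0.1999 * 500, about 644.3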

2.6 Summary of the lm object

To observe the result of the lm object in more detail, we use the summary() function:

#Prints summary of the model
summary(m1)    
## 
## Call:
## lm(formula = api00 ~ enroll, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -285.50 -112.55   -6.70   95.06  389.15 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 744.25141   15.93308  46.711  < 2e-16 ***
## enroll       -0.19987    0.02985  -6.695 7.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135 on 398 degrees of freedom
## Multiple R-squared:  0.1012, Adjusted R-squared:  0.09898 
## F-statistic: 44.83 on 1 and 398 DF,  p-value: 7.339e-11
#correlation coefficient of api00 and enroll
r <- cor(d$api00, d$enroll)
#this is equal to r-squared in simple regression 
r ^ 2                      
## [1] 0.1012335

2.7 Components of lm-class objects and post-estimation for regression models in R

ls(m1)        #list of components in object class lm
##  [1] "assign"        "call"          "coefficients"  "df.residual"  
##  [5] "effects"       "fitted.values" "model"         "qr"           
##  [9] "rank"          "residuals"     "terms"         "xlevels"
m1$coefficients        #returns coefficients of the model
## (Intercept)      enroll 
## 744.2514142  -0.1998674
m1$fitted.values[1:10] #a vector of fitted values here we show the first 10
##        1        2        3        4        5        6        7        8 
## 694.8842 651.7128 665.3038 660.7068 640.3203 675.6969 683.6916 441.8520 
##        9       10 
## 612.3389 671.8994

We can store the extracted components in a new object.

#a vector of residuals
residuals <- m1$resid  
#coefficients() extracts the model coefficients
coefficients(m1)
## (Intercept)      enroll 
## 744.2514142  -0.1998674

To get the confidence intervals for the coefficients of the model, we use confint()

confint(m1) # returns a matrix of Confidence Interval for coefficients
##                   2.5 %      97.5 %
## (Intercept) 712.9279024 775.5749259
## enroll       -0.2585532  -0.1411817
anova(m1)  #anova table
## Analysis of Variance Table
## 
## Response: api00
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## enroll      1  817326  817326  44.829 7.339e-11 ***
## Residuals 398 7256346   18232                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the ANOVA table we have the sum of squares of the regression model in the row named “enroll” and the sum of squares of the residuals, with their degrees of freedom, in the row below. The table also shows the F-statistic that we have seen before in the summary of the model. Mean squares are sums of squares divided by their degrees of freedom.
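These sums of squares can also be reproduced directly from the fitted model, tying the ANOVA table back to the decomposition in section 1.5; a minimal sketch:

#reproduce the sums of squares in the ANOVA table from the decomposition TSS = RSS + SSreg
rss   <- sum(residuals(m1)^2)                  #residual sum of squares (7256346)
ssreg <- sum((fitted(m1) - mean(d$api00))^2)   #sum of squares regression (817326)
tss   <- sum((d$api00 - mean(d$api00))^2)      #total sum of squares
c(RSS = rss, SSreg = ssreg, R.squared = ssreg / tss)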

To get predicted values of the outcome for new values of the predictor, we use the predict() function with a newdata argument.

#first we need to create a new data.frame including all predictors of the model
new.data <- data.frame(enroll = c(500, 600, 700))
predict(m1, newdata = new.data)
##        1        2        3 
## 644.3177 624.3309 604.3442
#Optional: if you would like a nicely formatted table of the regression results
#install.packages("sjPlot")
library(sjPlot)
tab_model(m1)
                  api00
Predictors        Estimates   CI                p
(Intercept)       744.25      712.93 – 775.57   <0.001
enroll            -0.20       -0.26 – -0.14     <0.001
Observations      400
R2 / R2 adjusted  0.101 / 0.099

2.8 Plotting a scatter plot and the regression line

In addition to getting the regression table and statistics, it can be useful to see a scatter plot of the predicted and outcome variables with the regression line plotted.

plot(api00 ~ enroll, data = d) #scatter plot of api00 vs. enroll
abline(m1, col = "blue")       #add regression line to the scatter plot

As you can see, some of the points appear to be outliers. If you add text with labels = d$snum to the scatter plot, you can see the school number for each point. This allows us to see, for example, that one of the outliers is school 2910.

#adding labels(school number) to the scatter plot
plot(api00 ~ enroll, data = d) 
text(d$enroll, d$api00+20, labels = d$snum, cex = .7)
abline(m1, col = "blue")

##We can also use identify() to click on the points
# plot(api00 ~ enroll, data = d)
# abline(m1, col = "blue")
# identify(d$enroll,  d$api00,labels = d$snum, col = "red")

#Optional if someone likes to use ggplot  
 library(ggplot2)
 ggplot(d, aes(x = enroll, y = api00)) + 
   geom_point() +
   stat_smooth(method = "lm", col = "red")

2.9 Exercise 1

In the data set elemapi2v2 the variable full is the percentage of teachers with a full credential for each school. Run the regression model of api00 on full. Use this model to predict the mean of api00 for full equal to \(25\%\), \(50\%\), \(75\%\), and \(90\%\).

Is the predicted value for full \(= 25\%\) valid?

3 Regression with categorical variable as predictor

3.1 Factor and dummy variable

In R, categorical predictors should be stored as factor variables so that lm() creates the appropriate dummy variables for them. The variables mealcat and yr_rnd are stored as integers, so we convert them to factors.

#class of mealcat is integer
class(d$mealcat) 
## [1] "integer"
#we transform variable meancat to a factor variable 
d$mealcat_F <- factor(d$mealcat)
#now the class of mealcat_F is factor
str(d$mealcat_F)
##  Factor w/ 3 levels "1","2","3": 2 3 3 3 3 1 1 1 1 1 ...
#class of yr_rnd is integer
class(d$yr_rnd) 
## [1] "integer"
#we transform variable yr_rnd to a factor variable 
d$yr_rnd_F <- factor(d$yr_rnd)
#levels of factor variable yr_rnd_F
levels(d$yr_rnd_F)
## [1] "0" "1"
#changing the levels to "Yes" and "No"
levels(d$yr_rnd_F) <- c("No", "Yes")

Let’s look at the frequency tables of the variables yr_rnd_F and mealcat_F

table(d$yr_rnd_F)    #frequency table for yr_rnd_F
## 
##  No Yes 
## 308  92
table(d$mealcat_F)   #frequency table for mealcat_F
## 
##   1   2   3 
## 131 132 137

3.2 Regression of api00 on yr_rnd a two level categorical variable

#regression of api00 with yr_rnd_F
m2 <- lm(api00 ~ yr_rnd_F, data = d)
#summary of the model
summary(m2)
## 
## Call:
## lm(formula = api00 ~ yr_rnd_F, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -273.539  -95.662    0.967  103.341  297.967 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   684.54       7.14   95.88   <2e-16 ***
## yr_rnd_FYes  -160.51      14.89  -10.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 125.3 on 398 degrees of freedom
## Multiple R-squared:  0.226,  Adjusted R-squared:  0.2241 
## F-statistic: 116.2 on 1 and 398 DF,  p-value: < 2.2e-16
#the first 6 rows of the model matrix
model.matrix(m2)[1:6,]
##   (Intercept) yr_rnd_FYes
## 1           1           0
## 2           1           0
## 3           1           0
## 4           1           0
## 5           1           0
## 6           1           0

Let’s write out the regression equation that this model implies.

\[\hat{\text{api00}} = 684.539 + (-160.5064) \times \text{yr_rnd_FYes}\]

As we have discussed before, the dummy variable yr_rnd_FYes takes the value 1 if yr_rnd_F is “Yes” and 0 if yr_rnd_F is “No”.

If a school is not a year-round school (i.e. yr_rnd_F is “No”, so the dummy yr_rnd_FYes is 0), the regression equation simplifies to

api00 = 684.539    + (-160.5064) x 0 
api00 = 684.539

This indicates the mean of api00 given the school is not a year-round school is \(684.539\).

If a school is a year-round school, the regression equation would simplify to

api00 = 684.539   +  (-160.5064) x 1 
api00 = 524.0326

This indicates the mean of api00 given the school is a year-round school is \(524.0326\).

In general, when we have a two-level categorical variable, the intercept is the expected outcome for the reference group and the coefficient of the dummy for the other category is the difference in expected outcome between that group and the reference group.
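We can verify this from the fitted coefficients of m2; a minimal sketch (these values match the group means computed with aggregate() in section 3.4):

#the intercept of m2 is the expected api00 for the reference group (not year-round)
coef(m2)[1]
#intercept + coefficient of the dummy is the expected api00 for year-round schools
coef(m2)[1] + coef(m2)[2]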

#scatter plot api00 against yr_rnd
plot(api00 ~ yr_rnd, data = d)
abline(m2)  # add regression line to the plot

3.3 Regression of api00 on mealcat, a three level categorical variable

In this example we would like to examine the relationship between the amount of poverty and api scores. We don’t have a direct measure of poverty, but we can use mealcat as a proxy for it. We have created a factor variable for mealcat named mealcat_F, and we can use it to model the regression of api00 on mealcat_F.

#regression model of api00 against categorical variable mealcat_F with 3 levels
m3 <- lm(api00 ~ mealcat_F, data = d)
summary(m3)
## 
## Call:
## lm(formula = api00 ~ mealcat_F, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -253.394  -47.883    0.282   52.282  185.620 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  805.718      6.169  130.60   <2e-16 ***
## mealcat_F2  -166.324      8.708  -19.10   <2e-16 ***
## mealcat_F3  -301.338      8.629  -34.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.61 on 397 degrees of freedom
## Multiple R-squared:  0.7548, Adjusted R-squared:  0.7536 
## F-statistic: 611.1 on 2 and 397 DF,  p-value: < 2.2e-16
#aggregate the mean of api00 for each category in mealcat_F
aggregate(api00 ~ mealcat_F, FUN=mean, data = d)
##   mealcat_F    api00
## 1         1 805.7176
## 2         2 639.3939
## 3         3 504.3796
#relevel factor mealcat_F and make group "3" as the reference group
d$mealcat_F <- relevel(d$mealcat_F, ref = "3")
m4 <- lm(api00 ~ mealcat_F, data = d)
summary(m4)
## 
## Call:
## lm(formula = api00 ~ mealcat_F, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -253.394  -47.883    0.282   52.282  185.620 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  504.380      6.033   83.61   <2e-16 ***
## mealcat_F1   301.338      8.629   34.92   <2e-16 ***
## mealcat_F2   135.014      8.612   15.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.61 on 397 degrees of freedom
## Multiple R-squared:  0.7548, Adjusted R-squared:  0.7536 
## F-statistic: 611.1 on 2 and 397 DF,  p-value: < 2.2e-16

3.4 Regression analysis with a single dummy variable and t-test (Optional)

#plot of api00 against yr_rnd_F (a box plot, since yr_rnd_F is a factor)
plot(api00 ~ yr_rnd_F, data = d)

# mean of api00 when yr_rnd_F is "No": the school is not year-round
mean(d$api00[d$yr_rnd_F == "No"])
## [1] 684.539
# mean of api00 when yr_rnd_F is "Yes": the school is year-round
mean(d$api00[d$yr_rnd_F == "Yes"])
## [1] 524.0326
#using aggregate to find the mean for each group of year school round
aggregate(api00 ~ yr_rnd_F, FUN=mean, data = d)
##   yr_rnd_F    api00
## 1       No 684.5390
## 2      Yes 524.0326
#t test for equality of mean of api00 for two group of year round and not year round 
#with equal variance assumption
t.test(api00 ~ yr_rnd_F, data = d, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  api00 by yr_rnd_F
## t = 10.782, df = 398, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  131.2390 189.7737
## sample estimates:
##  mean in group No mean in group Yes 
##          684.5390          524.0326

Since a two-sample t-test with equal variances is equivalent to a one-way ANOVA with two groups, we can get the same results using the anova command as well.

#anova table
anova(m2)
## Analysis of Variance Table
## 
## Response: api00
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## yr_rnd_F    1 1825001 1825001  116.24 < 2.2e-16 ***
## Residuals 398 6248671   15700                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we square the t-value from the t-test, we get the same value as the F-value from the anova.

#square of t value from the t-test is the same as F value from anova
10.7815 ^ 2
## [1] 116.2407

3.5 Exercise 2

In the data set elemapi2v2 the variable collcat is a categorized version of the percentage of students whose parents attended at least some college, for each school. Run the regression model of api00 on collcat as a categorical variable. What is the expected average api00 score for a school with collcat in the 3rd category?

4 Multiple Regression

4.1 Adding more predictors to a simple regression model

Now, let’s look at an example of multiple regression, in which we have one outcome (dependent) variable and multiple predictors.

The percentage of variability of api00 explained by the variable enroll was only 10.12%. In order to explain more variation, we can add more predictors. In R we can do this by adding variables with + to the formula of our lm() function. We add meals, the percentage of students who get free meals, as an indicator of socioeconomic status, and full, the percentage of teachers with full credentials, to our model.

#multiple regression model of DV api00 and IVs enroll, meals, and full
m5 <- lm(api00 ~  enroll + meals + full, data = d)
summary(m5)   #summary of model m5
## 
## Call:
## lm(formula = api00 ~ enroll + meals + full, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -181.721  -40.802    1.129   39.983  158.774 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 801.82983   26.42660  30.342  < 2e-16 ***
## enroll       -0.05146    0.01384  -3.719 0.000229 ***
## meals        -3.65973    0.10880 -33.639  < 2e-16 ***
## full          1.08109    0.23945   4.515 8.37e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.73 on 396 degrees of freedom
## Multiple R-squared:  0.8308, Adjusted R-squared:  0.8295 
## F-statistic: 648.2 on 3 and 396 DF,  p-value: < 2.2e-16
anova(m5)     #anova table of model m5 
## Analysis of Variance Table
## 
## Response: api00
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## enroll      1  817326  817326  236.947 < 2.2e-16 ***
## meals       1 5820066 5820066 1687.263 < 2.2e-16 ***
## full        1   70313   70313   20.384 8.369e-06 ***
## Residuals 396 1365967    3449                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For example, the variable enroll reduces the residual sum of squares by 817326 relative to the intercept-only model. By adding the variable meals we reduce the residual sum of squares by an additional 5820066, and by adding the variable full we reduce the error by a further 70313. Finally, we have 1365967 left as unexplained error. The total sum of squares (TSS) is the sum of all of these sums of squares added together. To get the total sum of squares of the variable api00 we can multiply its variance by \((n-1)\).

sum(anova(m5)$Sum)          #sum of RSS and SSreg
## [1] 8073672
(400 - 1) * var(d$api00)    #Total sum of squares 
## [1] 8073672

4.2 Standardized regression coefficients

Some researchers are interested in comparing the relative strength of the various predictors within the model. Since each variable is typically measured in different units, we cannot compare coefficients to one another directly. To address this problem we use standardized regression coefficients, which can be obtained by transforming the outcome and predictor variables to their standardized scores, also called z-scores, before running the regression.

In R we use function scale() to do this for each variable.

#Standardized regression model
m5.sd <- lm(scale(api00) ~  scale(enroll) + scale(meals) + scale(full), data = d)
summary(m5.sd) #coefficients are standardized
## 
## Call:
## lm(formula = scale(api00) ~ scale(enroll) + scale(meals) + scale(full), 
##     data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27749 -0.28683  0.00793  0.28108  1.11617 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.454e-16  2.064e-02   0.000 1.000000    
## scale(enroll) -8.191e-02  2.203e-02  -3.719 0.000229 ***
## scale(meals)  -8.210e-01  2.441e-02 -33.639  < 2e-16 ***
## scale(full)    1.136e-01  2.517e-02   4.515 8.37e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4129 on 396 degrees of freedom
## Multiple R-squared:  0.8308, Adjusted R-squared:  0.8295 
## F-statistic: 648.2 on 3 and 396 DF,  p-value: < 2.2e-16

In the standardized regression coefficients summary we see that the intercept is zero and all t-statistics (p-values) for the other coefficients are exactly the same as the original model.

Because the coefficients are all in the same standardized units, standard deviations, you can compare them to assess the relative strength of each of the predictors. In this example, meals has the largest standardized (Beta) coefficient in absolute value, -0.821.

Thus, a one standard deviation increase in meals leads to a 0.821 standard deviation decrease in predicted api00, with the other variables held constant.
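Equivalently, a standardized coefficient can be obtained by rescaling the unstandardized coefficient by the ratio of the standard deviations of the predictor and the outcome; a minimal sketch for meals, assuming models m5 and m5.sd from above:

#standardized coefficient of meals = unstandardized coefficient * sd(meals) / sd(api00)
coef(m5)["meals"] * sd(d$meals) / sd(d$api00)   #about -0.821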

5 Regression model with two way interaction

5.1 Regression model with interaction between two continuous predictors

#multiple regression model of DV api00 and IVs enroll, meals, and enroll:meals
m6 <- lm(api00 ~  enroll + meals + enroll:meals , data = d)
summary(m6)   #summary of model m6
## 
## Call:
## lm(formula = api00 ~ enroll + meals + enroll:meals, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -211.186  -38.834   -1.137   38.997  163.713 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.833e+02  1.665e+01  53.034   <2e-16 ***
## enroll        8.835e-04  3.362e-02   0.026   0.9790    
## meals        -3.425e+00  2.344e-01 -14.614   <2e-16 ***
## enroll:meals -9.537e-04  4.292e-04  -2.222   0.0269 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.85 on 396 degrees of freedom
## Multiple R-squared:  0.8243, Adjusted R-squared:  0.823 
## F-statistic: 619.3 on 3 and 396 DF,  p-value: < 2.2e-16
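With an interaction between two continuous predictors, the slope of enroll is no longer a single number; it changes with the value of meals. A minimal sketch computing this slope at a few hypothetical values of meals from the coefficients of m6:

#slope of enroll at a few hypothetical values of meals
b <- coef(m6)
b["enroll"] + b["enroll:meals"] * c(25, 50, 75)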

5.2 Regression model with interaction between two categorical predictors

#regression model of DV api00 on yr_rnd_F, mealcat_F, and their interaction
m7 <- lm(api00 ~  yr_rnd_F * mealcat_F , data = d)
summary(m7)   #summary of model m7
## 
## Call:
## lm(formula = api00 ~ yr_rnd_F * mealcat_F, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -207.533  -50.764   -1.843   48.874  179.000 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             521.493      8.414  61.978  < 2e-16 ***
## yr_rnd_FYes             -33.493     11.771  -2.845  0.00467 ** 
## mealcat_F1              288.193     10.443  27.597  < 2e-16 ***
## mealcat_F2              123.781     10.552  11.731  < 2e-16 ***
## yr_rnd_FYes:mealcat_F1  -40.764     29.231  -1.395  0.16394    
## yr_rnd_FYes:mealcat_F2  -18.248     22.256  -0.820  0.41278    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.87 on 394 degrees of freedom
## Multiple R-squared:  0.7685, Adjusted R-squared:  0.7656 
## F-statistic: 261.6 on 5 and 394 DF,  p-value: < 2.2e-16

The expected mean of api00 for a year-round school with mealcat equal to “2” is:

\[521.493 + (-33.493) + 123.781 + (-18.248) = 593.533\]

The expected mean of api00 for a school that is not year-round with mealcat equal to “2” is:

\[521.493 + 123.781 = 645.274\]

Thus, the expected difference in api00 between a year-round school and a school that is not year-round, when the meals category equals “2”, is:

\[593.533 - 645.274 = -51.741\]

The difference in expected api00 between a year-round school and a school that is not year-round, when the meals category is at level “3” (the reference level), is the coefficient of yr_rnd_FYes, \(-33.493\).

If we calculate the difference between the above differences we get:

\[-51.741 - (-33.493) = -18.248\]

which is the coefficient for yr_rnd_FYes:mealcat_F2.
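All of the expected cell means implied by m7 can be obtained at once with predict() on a grid of the two factors; a minimal sketch:

#expected api00 for every combination of yr_rnd_F and mealcat_F implied by m7
cells <- expand.grid(yr_rnd_F = levels(d$yr_rnd_F),
                     mealcat_F = levels(d$mealcat_F),
                     stringsAsFactors = FALSE)
cells$predicted_api00 <- predict(m7, newdata = cells)
cells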

#running the model without interaction terms
m0 <- lm(api00 ~  yr_rnd_F + mealcat_F , data = d)
#run F-test using anova function
anova(m0, m7)
## Analysis of Variance Table
## 
## Model 1: api00 ~ yr_rnd_F + mealcat_F
## Model 2: api00 ~ yr_rnd_F * mealcat_F
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1    396 1879528                           
## 2    394 1868944  2     10584 1.1156 0.3288

The F-test shows that we do not have enough evidence to reject the base model m0; therefore, the effect of yr_rnd likely does not change across the levels of mealcat.

5.3 Regression model with interaction between a continuous predictor and a categorical predictor

#regression model of DV api00 on yr_rnd_F, enroll, and their interaction
m8 <- lm(api00 ~  yr_rnd_F * enroll , data = d)
summary(m8)   #summary of model m8
## 
## Call:
## lm(formula = api00 ~ yr_rnd_F * enroll, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -274.043  -94.781    0.417   97.666  309.573 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        682.068556  18.797436  36.285   <2e-16 ***
## yr_rnd_FYes        -74.858236  48.224281  -1.552   0.1214    
## enroll               0.006021   0.042396   0.142   0.8871    
## yr_rnd_FYes:enroll  -0.120218   0.072075  -1.668   0.0961 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 125 on 396 degrees of freedom
## Multiple R-squared:  0.2335, Adjusted R-squared:  0.2277 
## F-statistic: 40.21 on 3 and 396 DF,  p-value: < 2.2e-16

5.4 Plotting the interaction between a continuous predictor and a categorical predictor

We use the predict() function on new data to plot this interaction.

#First get the range of continuous predictor
range(d$enroll)
## [1]  130 1570
#make a sequence of number of enroll from range of enroll 130-1570 increment by 5
new.enroll <- seq(130, 1570, 5)
#Because we have two levels of yr_rnd_F we repeat enroll two times.
new.yr_rnd <- rep(levels(d$yr_rnd_F), each = length(new.enroll))
#create new data
new.data.plot <- data.frame(enroll = rep(new.enroll, times = 2),  yr_rnd_F = new.yr_rnd)
#predict api00
new.data.plot$predicted_api00 <- predict(m8, newdata = new.data.plot)  
#I use ggplot to plot
library(ggplot2)
ggplot(new.data.plot, aes(x = enroll, y = predicted_api00, colour = yr_rnd_F)) +
  geom_line(lwd=1.2) 

5.5 Summary

6 Regression Diagnostics

6.1 Introduction

In the previous part, we learned how to do ordinary linear regression with R. Without verifying that the data have met the assumptions underlying OLS regression, results of regression analysis may be misleading.

Here we will explore how you can use R to check how well your data meet the assumptions of OLS regression. In particular, we will consider the assumptions and issues covered in the subsections below: unusual and influential data, homoscedasticity, linearity, independence of errors, normality of residuals, multicollinearity, and model specification.

Additionally, there are issues that can arise during the analysis that, while strictly speaking not assumptions of regression, are nonetheless of great concern to data analysts.

Many graphical methods and numerical tests have been developed over the years for regression diagnostics.

R has many of these methods in the stats package, which is already installed and loaded in R. There are some other tools in different packages that we can use by installing and loading those packages in our R environment.

install.packages("car")

# install packages for Regression Diagnostics

#install.packages("car") 
#install.packages("alr4")
#install.packages("faraway")

#loading packages into working environment
library(car)
library(alr4)
library(faraway)

6.2 Unusual and Influential data

A single observation that is substantially different from all other observations can make a large difference in the results of your regression analysis. If a single observation (or small group of observations) substantially changes your results, you would want to know about this and investigate further. There are three ways that an observation can be unusual: it can be an outlier (an unusual outcome value given its predictor values), it can have high leverage (unusual predictor values), or it can be influential (removing it substantially changes the coefficient estimates).

We use our last multiple regression model, in which we predicted api00 from the independent variables enroll, meals, and full, and store it here as m2.

#multiple regression model of DV api00 and IVs enroll, meals, and full
m2 <- lm(api00 ~  enroll + meals + full, data = d)

Let’s make individual scatter plots of api00 against enroll, meals, and full so we can get a better view of these relationships.

scatterplotMatrix(~ api00 + enroll + meals + full, data = d)

The graphs of api00 with the other variables show some potential problems. In every plot, we see one or more data points that are far away from the rest of the data points.

We will go over some graphical methods to identify influential points and potential outliers.

We are not going to discuss and justify underlying methods and formulas here.

Studentized residuals can be used to identify outliers. In R, the rstandard() function computes (internally) Studentized residuals, also called standardized residuals.

res.std <- rstandard(m2) #studentized residuals stored in vector res.std 
#plot Standardized residual in y axis. X axis will be the index or row names
plot(res.std, ylab="Standardized Residual", ylim=c(-3.5,3.5))
#add horizontal lines 3 and -3 to identify extreme values
abline(h =c(-3,0,3), lty = 2)
#find out which data point is outside of 3 standard deviation cut-off
#index is row numbers of those point
index <- which(res.std > 3 | res.std < -3)
#add School name next to points that have extreme value
text(index-20, res.std[index] , labels = d$snum[index])

#print row number of values that are out of bounds
print(index)
## 226 
## 226
#print school names of points that are out of bounds
print(d$snum[index])
## [1] 211

We should pay attention to studentized residuals that exceed +2 or -2, get more concerned about residuals that exceed +2.5 or -2.5, and be even more concerned about residuals that exceed +3 or -3. These results show that school number 211 is the most worrisome observation.

Note: To get a t-statistic we use the externally Studentized residuals, which exclude each point in turn when estimating the standard deviation of the residuals. In R we can use the function rstudent(model) to get the externally Studentized residuals.

We can run a statistical test for outliers using the function outlierTest(m2) in the package "car".

#Bonferroni p-values for testing outliers
outlierTest(m2)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferroni p
## 226 -3.151186            0.00175      0.70001

This is a test of the null hypothesis that we have no outlier against the alternative that we have at least one outlier.

Now let’s look at the leverages to identify observations that have potentially great influence on the regression coefficient estimates.

One way to find points with high leverage is to use a half-normal plot from the package "faraway". First we need to compute the diagonal of the hat matrix as a measure of leverage.

#a vector containing the diagonal of the 'hat' matrix
h <- influence(m2)$hat
#half normal plot of leverage from package faraway
halfnorm(influence(m2)$hat, ylab = "leverage")

Cook’s distance is a measure of influence. A point with a large Cook’s distance is considered a highly influential point.

A cutoff value for Cook’s distance can be calculated as

\[ \frac{4}{(n - p - 1)}\]

where \(n\) is the sample size and \(p\) is the number of predictors.

We can plot Cook’s distance using the following code:

#the cutoff value for Cook's distance
cutoff <- 4/((nrow(d)-length(m2$coefficients)-2))
plot(m2, which = 4, cook.levels = cutoff)

We can use the influencePlot() function in the package "car" to identify influential points. It plots Studentized residuals against leverage, with circle size proportional to Cook’s distance.

#cook's distance, studentized residuals, and leverage in the same plot
influencePlot(m2, main="Influence Plot", 
              sub="Circle size is proportional to Cook's Distance" )

##         StudRes        Hat        CookD
## 8    0.18718812 0.08016299 7.652779e-04
## 93   2.76307269 0.02940688 5.687488e-02
## 210  0.03127861 0.06083329 1.588292e-05
## 226 -3.15118603 0.01417076 3.489753e-02
## 346 -2.83932062 0.00412967 8.211170e-03

Finally, the infIndexPlot() function gives us a series of index plots that we need to investigate influential points.

#4 diagnostic plots to identify influential points
infIndexPlot(m2)

6.3 Checking Homoscedasticity

One of the main assumptions for the ordinary least squares regression is the homogeneity of variance of the residuals. If the model is well-fitted, there should be no pattern to the residuals plotted against the fitted values. If the variance of the residuals is non-constant then the residual variance is said to be “heteroscedastic.” There are graphical and non-graphical methods for detecting heteroscedasticity. A commonly used graphical method is to plot the residuals versus fitted (predicted) values.

#residual vs. fitted value plot for Homoscedasticity
plot(m2$resid ~ m2$fitted.values)
#add horizontal line at 0
abline(h = 0, lty = 2)
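As a non-graphical check, the car package provides a score test for non-constant error variance; a minimal sketch:

#score test of constant error variance (from package "car")
car::ncvTest(m2)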

6.4 Checking Linearity

To check linearity, residuals should be plotted against the fitted values as well as against the other predictors. If any of these plots show systematic shapes, then the linear model is not appropriate and some nonlinear terms may need to be added. In the package car, the function residualPlots() produces those plots. It also gives a test of the linear model against a model that adds a quadratic term for each variable (a test for curvature).

#residual vs. fitted value and all predictors plus test for curvature
residualPlots(m2)

##            Test stat Pr(>|Test stat|)
## enroll        0.0111           0.9911
## meals        -0.6238           0.5331
## full          1.1565           0.2482
## Tukey test   -0.8411           0.4003

6.5 Issues of Independence

A simple visual check is to plot the residuals against a time variable. Here the data are not a time series, so we use the school number as an ordering variable for the data.

#plotting the trend of residuals and school number
plot(m2$resid ~ d$snum)   #residual plot vs. school id 
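When the observations have a meaningful order, a common numerical check for serial correlation of the residuals is the Durbin-Watson test from the car package; a minimal sketch:

#Durbin-Watson test for autocorrelation of residuals (meaningful when observations are ordered)
car::durbinWatsonTest(m2)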

6.6 Checking Normality of Residuals

Many researchers believe that multiple regression requires normality. This is not the case. Normality of residuals is only required for valid hypothesis testing; that is, the normality assumption assures that the p-values for the t-tests and F-test will be valid. Normality is not required in order to obtain unbiased estimates of the regression coefficients. OLS regression merely requires that the residuals (errors) be identically and independently distributed. Furthermore, there is no assumption or requirement that the predictor variables be normally distributed. If this were the case, then we would not be able to use dummy-coded variables in our models.

Furthermore, because of large-sample theory, if we have a large enough sample size we do not even need the residuals to be normally distributed.

However, for small sample sizes the normality assumption is required.

To check normality we use a normal quantile-quantile (QQ) plot of the residuals.

qqnorm(m2$resid)  #Normal Quantile to Quantile plot
qqline(m2$resid)  
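A common numerical check is the Shapiro-Wilk test applied to the residuals; a minimal sketch (with n = 400, even minor departures from normality may produce a small p-value):

#Shapiro-Wilk test of normality applied to the residuals
shapiro.test(m2$resid)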

6.7 Checking for Multicollinearity

When there is a perfect linear relationship among the predictors, the estimates for a regression model cannot be uniquely computed. The term collinearity implies that two variables are near perfect linear combinations of one another. When more than two variables are involved it is often called multicollinearity, although the two terms are often used interchangeably.

The primary concern is that as the degree of multicollinearity increases, the regression model estimates of the coefficients become unstable and the standard errors for the coefficients can get wildly inflated.

VIF, variance inflation factor, is used to measure the degree of multicollinearity. As a rule of thumb, a variable whose VIF values are greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is used by many researchers to check on the degree of collinearity. A tolerance value lower than 0.1 is comparable to a VIF of 10. It means that the variable could be considered as a linear combination of other independent variables.

In R we can use the function vif from the package car or faraway. Since there are two packages in our environment that include this function, we use car::vif() to tell R that we want to run the function from the package car.

car::vif(m2)  #variance inflation factor
##   enroll    meals     full 
## 1.135733 1.394279 1.482305
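Since tolerance is defined as 1/VIF, it can be computed directly from the same output; a minimal sketch:

#tolerance is the reciprocal of VIF
1 / car::vif(m2)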

6.8 Model Specification

A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model. If relevant variables are omitted from the model, the common variance they share with included variables may be wrongly attributed to those variables, and the error term is inflated. On the other hand, if irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to them. Model specification errors can substantially affect the estimate of regression coefficients.

There are many methods and statistical tests to check whether or not a variable is relevant to the model. To visualize the effect of each variable in the model we can use an added-variable plot, also called a partial-regression plot. The added-variable plot is a scatter plot of the residuals from a model that excludes one predictor from the full model against the residuals from a model that regresses the excluded predictor on the remaining predictors. The slope of the simple regression between those two sets of residuals is the same as the coefficient of the excluded variable in the full model. If the slope is not close to zero, we conclude that the variable is relevant to the model.

The added-variable plot is also useful for detecting influential points.

avPlots(m2) #added variable plots from package "car"