Statistical Methods and Data Analytics

Why do we care about linear regression?

Linear regression allows us to:

  • understand and quantify the relationships between an outcome and one or more predictors
  • estimate the unique effect of a predictor while accounting for the influence of other variables
  • predict outcomes for future observations
  • understand other regression models, which build on the linear regression model

The typical goal is to estimate population parameters describing these relationships, as well as their uncertainties, from a sample.

Simple linear regression model

In a simple linear regression model, we quantify the relationship between one outcome and one predictor.

##            x           y
## 1  0.9819694  0.49317417
## 2  0.4687150  0.70972185
## 3 -0.1079713 -0.83172816
## 4 -0.2128782  0.09893859
## 5  1.1580985  1.60159397
## 6  1.2923548 -0.36389939
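
The rows above are the first six observations of an example dataset. As a minimal sketch, data with this structure could be simulated in R as follows; the object name dat, the sample size, and the generating parameters are illustrative assumptions, not the values behind the output above:

# simulate a predictor and an outcome related linearly, plus random noise
# (seed, n, intercept, and slope are all assumptions for illustration)
set.seed(1)
n <- 50
x <- rnorm(n)                    # predictor drawn from a standard normal
y <- 0.5 + 0.8 * x + rnorm(n)    # outcome: linear in x with normal error
dat <- data.frame(x, y)          # collect into a data frame
head(dat)                        # print the first six rows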

x: predictor, independent variable (IV), regressor

y: outcome, dependent variable (DV), response

Linear regression assumes a linear relationship between \(x\) and \(y\).

The simple linear regression model:

\[y = \beta_0 + \beta_1 x + \epsilon\]

  • \(\beta_0\) is the intercept, the predicted \(y\) value when \(x=0\)
  • \(\beta_1\) is the slope, the predicted change in \(y\) for a one-unit increase in \(x\)
  • \(\epsilon\) is the error term, the deviation of the observed \(y\) value from the predicted \(y\)
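
For example, with hypothetical values \(\beta_0 = 1\) and \(\beta_1 = 2\), the model predicts

\[y = 1 + 2x\]

so at \(x = 3\) the predicted outcome is \(1 + 2 \times 3 = 7\), and each additional unit of \(x\) increases the prediction by 2.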

Ordinary Least Squares (OLS)

The Ordinary Least Squares (OLS) estimator is a method (a formula) for finding the best-fitting line relating \(x\) to \(y\).

The best-fitting line minimizes the sum of the squared errors, where an error is the difference between an observed and a predicted value of \(y\).
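
Formally, OLS chooses the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimize the sum of squared errors:

\[\min_{\hat{\beta}_0,\, \hat{\beta}_1} \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2\]

For simple linear regression, this minimization has a closed-form solution:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

In R, the lm() function computes the OLS estimates. A minimal sketch, assuming the simulated data frame dat from above:

# fit the simple linear regression of y on x by OLS
m <- lm(y ~ x, data = dat)
coef(m)    # OLS estimates of the intercept and slope

# the slope estimate reproduced from the closed-form solution
with(dat, sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2))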