.

Why do we care about linear regression?

Linear regression allows us to:

understand and quantify the relationships between an outcome and one or more predictors
estimate the unique effect of a predictor while accounting for the influence of other variables
predict outcomes for future observations
understand other regression models, which build on the linear regression model

The typical goal is to estimate population parameters describing these relationships, as well as their uncertainties, from a sample.

Simple linear regression model

A simple linear regression measures the relationship between one outcome and one predictor.

           x           y
1  0.9819694  0.49317417
2  0.4687150  0.70972185
3 -0.1079713 -0.83172816
4 -0.2128782  0.09893859
5  1.1580985  1.60159397
6  1.2923548 -0.36389939

x: predictor, independent variable (IV), regressor

y: outcome, dependent variable (DV), response

Linear regression assumes linear relationship between x and y.

The simple linear regression model:

\[y = \beta_0 + \beta_1 x + \epsilon\]

\(\beta_0\) is the intercept , the predicted \(y\) value when \(x=0\)
\(\beta_1\) is the slope , the predicted change in \(y\) for a one-unit increase in \(x\)
\(\epsilon\) is the error term , the deviation of the predicted \(y\) value from the observed \(y\)

Ordinary Least Squares (OLS)

The Ordinary Least Squares (OLS) estimator (formula or method) finds the best-fitting line relating \(x\) to \(y\).

The best-fit line minimizes the sum of the squared errors, or differences between observed and predicted values of \(y\).

The OLS estimator has several desirable statistical properties:

Consistency: As sample size increases, estimates converge to true population value
Unbiasedness: If we estimate a parameter over many samples, the mean of those estimates will equal the true population value
Efficiency: Of all unbiased estimators, OLS has the smallest variance (standard errors)

Linear regression in `R`

Dataset: Capuchin monkey weight

Alt text

A simulated dataset of 100 observations of capuchin monkey weights

# load dataset
d <- read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2024/01/capuchins100.csv")

Variables

weight, in kg
sex, male or female
social_rank, low, medium or high
kin, count of close kin within group
group_size, number of monkeys in group
forest_type, dry or rain

Weight is the outcome, and we expect each of the variables to influence weight.

NOTE: These data are simulated and do not represent real monkeys

The data

Explore the model variables before running a regression.

skim() from the skimr package produces quick summaries of variables in a data frame.

library(skimr)
skim(d)

Data summary
Name	d
Number of rows	100
Number of columns	7
_______________________
Column type frequency:
character	3
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
sex	1	4	6	2
social_rank	1	3	6	3
forest_type	1	3	4	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ID	1	475215.21	317074.62	1024.00	184206.00	514714.50	784331.00	987254.0	▇▅▅▅▇
group_size	1	9.77	3.27	5.00	7.00	9.50	13.00	15.0	▇▆▅▅▅
kin	1	1.73	1.47	0.00	0.00	2.00	3.00	4.0	▇▅▅▅▅
weight	1	2.94	0.59	1.58	2.52	3.01	3.34	4.3	▂▇▇▇▂

Alt text

Histograms and frequency plots give quick summaries of variable distributions and may reveal extreme or erroneous values.

Linear regression with one continuous predictor: Kin

Use lm() to run a linear regression in R, specifying the model with formula outcome ~ predictor.

Below we model how weight is predicted by kin, and we expect that increasing number of close kin will predict increasing weight.

# Simple linear regression where weight is modeled as a function of kin
mK <- lm(weight ~ kin, data=d)

# object mK class
class(mK)

[1] "lm"

mK


Call:
lm(formula = weight ~ kin, data = d)

Coefficients:
(Intercept)          kin  
     2.7220       0.1232

# names() function lists the elements that are stored inside the object mK
names(mK)

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"

Objects returned by lm() are complex, so we use other functions on the object to extract model estimates and produce model predictions.

coef(): model coefficients
summary(): coefficients, standard errors, p-values, model F-test, etc.
predict(): model predictions
residuals(): various kinds of residuals
confint(): confidence intervals for coefficients
plot(): diagnostic plots

# some useful extractor functions
mK$coefficients # extracts model coefficients
coef(mK) # extracts model coefficients

# extractor function summary() 
summary(mK)


Call:
lm(formula = weight ~ kin, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.26348 -0.36073 -0.01462  0.39490  1.37212 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.72196    0.08736  31.160  < 2e-16 ***
kin          0.12316    0.03857   3.193  0.00189 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5638 on 98 degrees of freedom
Multiple R-squared:  0.09424,   Adjusted R-squared:  0.085 
F-statistic:  10.2 on 1 and 98 DF,  p-value: 0.001892

`summary()` Output explained

## [...]
## Call:
## lm(formula = weight ~ kin, data = d)
## [...]

Model formula.

## [...]
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26348 -0.36073 -0.01462  0.39490  1.37212 
## [...]

Summary of the distribution of the residuals.

## [...]
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.72196    0.08736  31.160  < 2e-16 ***
## kin          0.12316    0.03857   3.193  0.00189 ** 
## ---
## [...]

Estimated coefficients, standard errors, t-values, and p-values for the hypothesis test that a coefficient equals to zero.

## [...]
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [...]

Asterisks provide a quick visual shortcut to p-values.

## [...]
## Residual standard error: 0.5638 on 98 degrees of freedom
## [...]

Residual standard error measures the average amount that observed responses deviate from the fitted regression line.

Degrees of freedom (\(df\)) represent the number of independent observations (\(N\)) in the data that are available to estimate a parameter.

In the case of linear regression:

\(p =\) the number of predictors \(+ 1\)

\(df = N - p\)

`summary()` Output explained

\(F\)-statistic

## [...]
## F-statistic:  10.2 on 1 and 98 DF,  p-value: 0.001892
## [...]

Indicates whether the overall model is statistically significant
Provides evidence that at least one independent variable is related to the dependent variable, but not which one

`summary()` Output explained

Multiple \(R^2\)

## [...]
## Multiple R-squared:  0.09424,    Adjusted R-squared:  0.085 
## [...]

The proportion of variance in the dependent variable explained by the independent variables.

\[R^2 = 1 - \frac{SSR}{SST}\]

SSR - the sum of squared residuals \(\sum_{i=1}^n(y_i - \hat{y_i})^2\).
SST - the total sum of squares \(\sum_{i=1}^n(y_i - \bar{y_i})\).
Ranges from 0 to 1
Increases as more variables are included in the model

Adjusted \(R^2\)

Adjusted \(R^2\) penalizes adding complexity (predictors) to the model to prevent overfitting.

\[1 – \left( (1 – R^2) \cdot \frac{N – 1}{ N – k – 1} \right)\]

\(N\) - number of observations
\(k\) - number of predictors

Interpreting model coefficients

## 
## Call:
## lm(formula = weight ~ kin, data = d)
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.72196    0.08736  31.160  < 2e-16 ***
## kin          0.12316    0.03857   3.193  0.00189 ** 
## ---
## [...]

Given our model and our data, the best-fitting regression equation is:

\[ \hat{weight} = 2.72196 + 0.12316 * kin\]

(Intercept) 2.72196 is the predicted mean weight for a monkey with no close kin members in their group (i.e., kin = 0).

The coefficient 0.12316 for kin means that for one unit increase in kin, we can expect a 0.12316 unit increase in weight.

Standard errors and confidence intervals

Standard errors and confidence intervals quantify the uncertainty in the estimates.

# Use extractor function confint() to compute confidence intervals
conf_intervals <- confint(mK)
conf_intervals

                 2.5 %   97.5 %
(Intercept) 2.54860910 2.895316
kin         0.04661907 0.199694

# Combine summary() output with confint() output
sumCI <- summary(mK)
sumCI$coefficients <- cbind(sumCI$coefficients[,c(1:2)], conf_intervals)
sumCI


Call:
lm(formula = weight ~ kin, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.26348 -0.36073 -0.01462  0.39490  1.37212 

Coefficients:
            Estimate Std. Error   2.5 % 97.5 %
(Intercept)  2.72196    0.08736 2.54861  2.895
kin          0.12316    0.03857 0.04662  0.200

Residual standard error: 0.5638 on 98 degrees of freedom
Multiple R-squared:  0.09424,   Adjusted R-squared:  0.085 
F-statistic:  10.2 on 1 and 98 DF,  p-value: 0.001892

Std. Error is a measure of variability of an estimate over repeated samples (i.e. the standard deviation of the sampling distribution).
2.5% and 97.5% provide the lower and upper end, respectively, of the range of values of a parameter that are reasonably compatible with the data.

A 95% confidence interval implies that if we were to repeat the experiment or data collection process many times, each time constructing a 95% confidence interval, we would expect the true parameter to fall within the interval about 95% of the time.

The 95% confidence interval for the kin parameter estimate indicates that the data is consistent with a coefficient range from 0.04662 to 0.2.

t statistic and p-values

NA [...]
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.72196    0.08736  31.160  < 2e-16 ***
kin          0.12316    0.03857   3.193  0.00189 ** 
---
NA [...]

t value and Pr(>|t|) are used in testing the null hypothesis that a parameter is 0 in the population
t value is the test statistic, calculated by dividing Estimate by its Std. Error, and is assumed to have a \(t\)-distribution under the null hypothesis with \(df\) degrees of freedom
Pr(>|t|) is the p-value, the probability of observing a t value equal or greater than the observed t value if the null hypothesis were true
The p-value 0.00189 suggests that if the null hypothesis were true, we would expect to see t values for the kin parameter as or more extreme than 3.193 in about 2 out of 1000 samples, an unlikely result that suggests that the null hypothesis should be rejected

Assessing the fit of the model

One way to assess the fit of the model is to plot model predictions to ensure they look reasonable against the observed data.

First, calculate the predicted outcome across a range of predictor values observed in the data.

The predict() function takes a model object and produces:

model predictions for the observed data by default
model predictions for specific values of the predictors if a new data frame of predictor values is specified

# Create a dataframe with predictor values
new_data <- data.frame(kin=0:4)

# Calculate predictions of the outcome (weight) for each value of the predictor (kin)
predictions <- predict(mK, newdata=new_data, interval="confidence")

# Prepare the data frame for plotting
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("kin","weight","lwr","upr")
new_data

  kin   weight      lwr      upr
1   0 2.721963 2.548609 2.895316
2   1 2.845119 2.720056 2.970182
3   2 2.968276 2.854495 3.082057
4   3 3.091432 2.943218 3.239646
5   4 3.214589 3.007938 3.421240

Then make a plot of your data and model predictions.

# A simple plot for model mK
ggplot(d, aes(kin, weight)) +
  geom_point() + # plots the observed data
  geom_line(data = new_data, aes(kin, weight)) + # plots the regression line
  geom_ribbon(data = new_data, aes(x = kin, ymin = lwr, ymax = upr), alpha=0.2) # plots the CI

# A plot for model mK with some custom adjustments
ggplot(d, aes(kin, weight)) +
  geom_point() +
  geom_ribbon(data = new_data, aes(x = kin, ymin = lwr, ymax = upr), fill = "orange2", alpha = 0.3) +
  geom_line(data = new_data, aes(kin, weight), color = "orange4", linewidth = 1) +
  labs(title = "Linear regression model: weight ~ kin" ,x = "Number of close kin", y = "Weight")

Exercise 1

Capuchin monkeys live in social groups that can range from just a few individuals up to 40. As group size increases, competition for food is expected to rise. Therefore, capuchins living in larger groups should have a lower average weight compared to those in smaller groups.

Run a linear regression modeling capuchin monkey weight as a function of their group_size.
Interpret the output of summary() run on the object returned by lm()
Calculate the weight predictions for capuchin monkeys living in groups ranging from 5 to 15 individuals (i.e., 5,6,…,14,15).

Exercise 1 Solution: Weight and Group size

# fit the model
mG <- lm( weight ~ group_size, data=d)

# Extract model summary
sum_mG <- summary(mG)
# Confidence intervals
ci_mG <- confint(mG)
# Add confidence intervals to model summary
sum_mG <- cbind(sum_mG$coefficients,ci_mG)
round(sum_mG,4) #rounding to fit the screen

            Estimate Std. Error t value Pr(>|t|)   2.5 % 97.5 %
(Intercept)   3.1261     0.1861 16.7939   0.0000  2.7567 3.4955
group_size   -0.0196     0.0181 -1.0821   0.2819 -0.0554 0.0163

As group size increases by one individual, on average, the weight is expected to decrease by 0.0196 kg (95% CI = [-0.0554, 0.0163])

# Calculate predicted weight
new_data <- data.frame(group_size=5:15)
predictions <- predict(mG, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("group_size","weight","lwr","upr")
new_data

   group_size   weight      lwr      upr
1           5 3.028313 2.821120 3.235507
2           6 3.008756 2.830030 3.187481
3           7 2.989198 2.835805 3.142591
4           8 2.969640 2.836641 3.102640
5           9 2.950083 2.829994 3.070171
6          10 2.930525 2.813365 3.047685
7          11 2.910968 2.786049 3.035887
8          12 2.891410 2.749791 3.033029
9          13 2.871852 2.707292 3.036412
10         14 2.852295 2.660783 3.043807
11         15 2.832737 2.611724 3.053751

# Plot: weight ~ group_size
pmG <- ggplot(d, aes(group_size, weight)) +
    geom_point() +
    geom_ribbon(data=new_data, aes(x=group_size, ymin=lwr, ymax=upr), fill="skyblue3", alpha=0.25) +
    geom_line(data=new_data, aes(group_size, weight), color="deepskyblue4", linewidth=1) +
    labs(x = "Group size", y = "Weight") 
pmG

Regression diagnostics

Linear regression assumptions

Inferences made from a regression model depend on several assumptions.

L: Linear relationship between the mean response (Y) and the explanatory variable (X)
I: The errors are independent
N: The responses are normally distributed at each level of X
E: The variance of the responses is equal for all levels of X

* From “Beyond Multiple Linear Regression” by Paul Roback and Julie Legler

Violation of an assumption threatens the validity of the estimates and inferences made from them, so it’s important to assess the plausibility of these assumptions with regression diagnostics.

Linearity assumption

Violation of linearity results in erroneous inference regarding the form of the relationship between the predictor and outcome as well as bad predictions.

Diagnostic plot: residuals (y-axis) vs fitted outcome values (x-axis).

No systematic trend among the residuals suggests linearity is met. A smoothed curve can help to identify a trend.

Using plot() on the lm object will produce diagnostic plots, the first of which is the residuals vs fitted values plot.

# which = 1 plots residuals vs.fitted values plot 
plot(mG, which=1)

A trend in the residuals suggests a nonlinear relationship between variables a predictor and the outcome.

If linearity assumption might be violated, consider:

Adding nonlinear terms to model (polynomial, splines), see R package “splines”
Applying non-linear transformations to either predictor or outcome (e.g. logarithm)

Equal variance assumption or Homoscedasticity

Violation of of homoscedasticity of the errors results in inaccurate standard errors, leading to inaccurate test statistics, confidence intervals, and p-values.

Diagnostic plot: residuals (y-axis) vs fitted outcome values (x-axis) again.

Random scatter about the zero line with equal spread across the range of fitted values suggests homoscedasticity.

# Residual vs. fitted value plot for Homoscedasticity
plot(mG, which=1)

One commonly observed form of heteroscedasticity is increasing variance of the residuals with increasing fitted values:

If homoscedasticity might be violated, consider:

Variable transformation (e.g., log transformation of the outcome)
Using robust (heteroscedastic-consistent or “sandwich”) standard errors, see R package “sandwich”
Modeling heteroscedasticity in a mixed model, see R package “nlme”

Independence assumption

Violation of independence threatens the accuracy of confidence intervals and p-values.

Two common design features that result in dependencies in the data:

sampling in clusters
repeated measurements of the same subjects

Diagnostic plot: residuals (y-axis) against a clustering index (e.g., schoolID) or order variable (e.g., time).

If we had sampled several monkeys from the same group, we might see systematic differences in the residuals across groups, suggesting non-independence.

# Plot residuals by group
boxplot(m$residuals ~ group, data = dat, xlab = "Group", ylab = "Residuals")

If independence might be violated, consider:

Model dependency in the data (e.g. mixed/multilevel models, fixed effect models), see R package “lme4”
cluster robust standard errors

Normality assumption

Violation of normality of errors threatens the accuracy of confidence intervals and p-values, but becomes less impactful with large sample sizes.

Diagnostic plots: histogram and q-q plot of residuals.

# A simple histogram of mG residuals
hist(mG$residuals, breaks=20)

# Normal Quantile to Quantile plot
qqnorm(mG$residuals)
qqline(mG$residuals)

Deviation from the diagonal line in the q-q plot suggests that errors are not normally distributed.

If normality might be violated, consider:

Transforming outcome (e.g. logarithm)
Generalized linear models, see R funtion glm()

Influential observations

Influential observations change model estimates substantially when excluded from the data during estimation.

Ideally, inferences are robust to inclusion or exclusion of any particular observation.
Influential observations are typically outliers in two senses
- extreme values on predictors, known as high leverage
- large residuals, far from the fitted regression line
Influence diagnostic measures often quantify how much model coefficients change when a particular observation is excluded
- dfbeta measures how a specific coefficient changes when deleting an observation
- Cook’s distance measures how much all coefficients change when deleting an observation

influence.measures() can be run on an lm object, and returns several influence measures including dfbetas for each coefficient and cook’s distance.

If any of the influence measures are unusually large, they are flagged with an * in the infl column.

# influence measures for each observation
influence <- influence.measures(mG)
influence

Influence measures of
     lm(formula = weight ~ group_size, data = d) :

       dfb.1_  dfb.grp_     dffit cov.r   cook.d    hat inf
1    0.064279 -9.12e-02 -0.115102 1.039 6.66e-03 0.0269    
2   -0.042063  5.60e-02  0.065945 1.056 2.19e-03 0.0358    
3   -0.050624  3.31e-02 -0.069269 1.026 2.41e-03 0.0130    
4   -0.045000  6.38e-02  0.080580 1.044 3.27e-03 0.0269    
5   -0.000932 -2.64e-04 -0.003745 1.031 7.08e-06 0.0100    
6    0.002997  8.48e-04  0.012048 1.031 7.33e-05 0.0100    
7    0.146781 -2.08e-01 -0.262837 0.997 3.40e-02 0.0269    
8    0.004419 -5.88e-03 -0.006928 1.059 2.42e-05 0.0358    
9    0.018768  5.31e-03  0.075436 1.019 2.86e-03 0.0100    
10  -0.128697  1.05e-01 -0.139193 1.028 9.71e-03 0.0234    
11  -0.014705  2.34e-02  0.033182 1.040 5.56e-04 0.0198    
12  -0.138223  9.03e-02 -0.189134 0.978 1.76e-02 0.0130    
13   0.092738 -7.59e-02  0.100301 1.036 5.06e-03 0.0234    
14  -0.188054  1.23e-01 -0.257319 0.934 3.18e-02 0.0130   *
15   0.022861 -3.24e-02 -0.040937 1.048 8.46e-04 0.0269    
16   0.322569 -2.77e-01  0.335389 0.982 5.49e-02 0.0314    
17  -0.073702  9.81e-02  0.115545 1.051 6.72e-03 0.0358    
18   0.007616 -6.89e-02 -0.195137 0.966 1.86e-02 0.0114    
19  -0.024989  3.97e-02  0.056389 1.038 1.60e-03 0.0198    
20   0.181734 -1.19e-01  0.248670 0.941 2.98e-02 0.0130    
21  -0.137603  1.95e-01  0.246402 1.003 3.00e-02 0.0269    
22  -0.044047  1.93e-02 -0.083730 1.018 3.52e-03 0.0106    
23  -0.142638  1.22e-01 -0.148307 1.039 1.10e-02 0.0314    
24   0.147436 -1.21e-01  0.159461 1.023 1.27e-02 0.0234    
25   0.040424 -1.77e-02  0.076843 1.020 2.97e-03 0.0106    
26  -0.028385  2.44e-02 -0.029513 1.053 4.40e-04 0.0314    
27  -0.021543  9.42e-03 -0.040952 1.028 8.46e-04 0.0106    
28   0.005072 -1.04e-02 -0.018466 1.035 1.72e-04 0.0147    
29   0.059900  1.70e-02  0.240757 0.920 2.77e-02 0.0100   *
30  -0.081929  5.35e-02 -0.112105 1.014 6.29e-03 0.0130    
31  -0.325557  2.80e-01 -0.338496 0.981 5.58e-02 0.0314    
32   0.002045 -1.85e-02 -0.052401 1.027 1.38e-03 0.0114    
33  -0.096149  1.28e-01  0.150737 1.045 1.14e-02 0.0358    
34   0.043374 -6.89e-02 -0.097875 1.031 4.82e-03 0.0198    
35  -0.082761  6.27e-02 -0.096743 1.027 4.70e-03 0.0172    
36   0.004527 -4.10e-02 -0.115994 1.008 6.72e-03 0.0114    
37   0.007169  2.03e-03  0.028813 1.029 4.19e-04 0.0100    
38  -0.013310  2.74e-02  0.048454 1.033 1.18e-03 0.0147    
39   0.142358 -2.26e-01 -0.321238 0.940 4.95e-02 0.0198    
40   0.043315 -5.76e-02 -0.067907 1.056 2.33e-03 0.0358    
41   0.067758 -5.82e-02  0.070451 1.051 2.50e-03 0.0314    
42   0.101768 -7.71e-02  0.118961 1.022 7.09e-03 0.0172    
43   0.156962 -2.09e-01 -0.246076 1.024 3.01e-02 0.0358    
44  -0.127022  1.04e-01 -0.137381 1.028 9.46e-03 0.0234    
45   0.056355 -3.68e-02  0.077112 1.025 2.99e-03 0.0130    
46   0.020019 -8.75e-03  0.038053 1.029 7.30e-04 0.0106    
47  -0.079020  5.98e-02 -0.092370 1.028 4.29e-03 0.0172    
48  -0.025194  5.18e-02  0.091720 1.024 4.23e-03 0.0147    
49   0.015033 -6.57e-03  0.028577 1.030 4.12e-04 0.0106    
50  -0.095191  6.22e-02 -0.130252 1.007 8.46e-03 0.0130    
51   0.155556 -1.27e-01  0.168242 1.020 1.41e-02 0.0234    
52   0.229604 -1.88e-01  0.248329 0.992 3.03e-02 0.0234    
53  -0.008436  1.73e-02  0.030712 1.035 4.76e-04 0.0147    
54   0.248756 -3.31e-01 -0.389987 0.974 7.37e-02 0.0358    
55   0.087150 -1.16e-01 -0.136630 1.048 9.38e-03 0.0358    
56  -0.013531 -3.83e-03 -0.054385 1.025 1.49e-03 0.0100    
57  -0.124298  1.02e-01 -0.134435 1.029 9.06e-03 0.0234    
58   0.067430 -5.79e-02  0.070110 1.051 2.48e-03 0.0314    
59   0.088296 -7.58e-02  0.091805 1.048 4.25e-03 0.0314    
60  -0.062258  9.89e-02  0.140489 1.021 9.87e-03 0.0198    
61  -0.099371  1.41e-01  0.177941 1.025 1.58e-02 0.0269    
62  -0.000386  3.49e-03  0.009893 1.032 4.94e-05 0.0114    
63   0.045589 -2.98e-02  0.062381 1.028 1.96e-03 0.0130    
64  -0.036066  5.73e-02  0.081385 1.034 3.33e-03 0.0198    
65   0.007180 -4.69e-03  0.009825 1.034 4.88e-05 0.0130    
66   0.000106  3.01e-05  0.000428 1.031 9.25e-08 0.0100    
67   0.130755 -9.90e-02  0.152845 1.011 1.16e-02 0.0172    
68  -0.043225  8.89e-02  0.157364 1.001 1.23e-02 0.0147    
69   0.217732 -1.78e-01  0.235489 0.997 2.74e-02 0.0234    
70   0.051350 -4.20e-02  0.055537 1.042 1.56e-03 0.0234    
71  -0.048752  2.13e-02 -0.092673 1.015 4.30e-03 0.0106    
72   0.036982 -5.88e-02 -0.083452 1.034 3.51e-03 0.0198    
73  -0.043075  2.81e-02 -0.058940 1.028 1.75e-03 0.0130    
74  -0.041674  1.82e-02 -0.079219 1.019 3.15e-03 0.0106    
75   0.054770 -8.70e-02 -0.123591 1.025 7.66e-03 0.0198    
76  -0.042930  8.83e-02  0.156290 1.002 1.21e-02 0.0147    
77  -0.050080  6.66e-02  0.078512 1.055 3.11e-03 0.0358    
78   0.202715 -2.88e-01 -0.362994 0.953 6.34e-02 0.0269    
79  -0.130979  1.74e-01  0.205342 1.034 2.11e-02 0.0358    
80  -0.183652  1.50e-01 -0.198630 1.011 1.96e-02 0.0234    
81  -0.066635  2.91e-02 -0.126666 1.000 7.98e-03 0.0106    
82  -0.107103  9.20e-02 -0.111360 1.046 6.24e-03 0.0314    
83   0.039181 -1.71e-02  0.074479 1.021 2.79e-03 0.0106    
84  -0.008382  1.72e-02  0.030515 1.035 4.70e-04 0.0147    
85  -0.007141  1.01e-02  0.012787 1.049 8.26e-05 0.0269    
86   0.022704  6.43e-03  0.091256 1.014 4.17e-03 0.0100    
87   0.012744  3.61e-03  0.051223 1.026 1.32e-03 0.0100    
88   0.009216 -6.02e-03  0.012611 1.034 8.03e-05 0.0130    
89   0.194797 -1.59e-01  0.210683 1.006 2.20e-02 0.0234    
90   0.028126  7.96e-03  0.113049 1.005 6.37e-03 0.0100    
91  -0.071197  5.83e-02 -0.077004 1.040 2.99e-03 0.0234    
92  -0.219223  2.92e-01  0.343687 0.992 5.78e-02 0.0358    
93  -0.020134  1.65e-02 -0.021776 1.045 2.39e-04 0.0234    
94  -0.046526  3.52e-02 -0.054386 1.035 1.49e-03 0.0172    
95   0.063330 -5.44e-02  0.065847 1.051 2.19e-03 0.0314    
96  -0.086577  6.55e-02 -0.101203 1.026 5.14e-03 0.0172    
97   0.042475 -2.77e-02  0.058120 1.029 1.70e-03 0.0130    
98   0.004160  1.18e-03  0.016719 1.031 1.41e-04 0.0100    
99  -0.264185  2.16e-01 -0.285730 0.975 3.98e-02 0.0234    
100  0.025815  7.31e-03  0.103760 1.009 5.38e-03 0.0100

Running summary() on the object returned by influence.measures() will produce the table of diagnostics only for those observations flagged as influential (with an *).

# observations 14 and 29 are influential
summary(influence.measures(mG))

Potentially influential observations of
     lm(formula = weight ~ group_size, data = d) :

   dfb.1_ dfb.grp_ dffit cov.r   cook.d hat  
14 -0.19   0.12    -0.26  0.93_*  0.03   0.01
29  0.06   0.02     0.24  0.92_*  0.03   0.01

Rather than rushing to delete outliers, consider re-running the model without these observations to assess the robustness of the results to inclusion of these influential points (i.e. a sensitivity analysis).

# rerun modeling excluding influential observations
mG_no_infl <- lm( weight ~ group_size, data=d[-c(14,29),])


# compare coefficients
cbind(mG=coef(mG), mG_no_infl=coef(mG_no_infl))

                     mG  mG_no_infl
(Intercept)  3.12610110  3.14926223
group_size  -0.01955759 -0.02200743

We can see that the estimates change slightly between the two models, but not enough to change our overall inferences (standard errors and p-values are also similar).

R packages for more diagnostics:

car
performance

Categorical predictors

Categorical predictors: factor and dummy variables

Categorical predictors

typically have a fixed number of possible values
values denote membership to different classifications
for a predictor with \(k\) categories, \(k-1\) coefficients will be estimated

Dummy variables

The simplest categorical variable is a dummy variable, where 0 and 1 denote not belonging and belonging to a category, respectively.

table(d$sex)


female   male 
    52     48

# Create dummy variable for male
d$male <- if_else(d$sex == "female", 0, 1)

table(d$male)


 0  1 
52 48

In regression models, categorical variables with \(k\) categories will be represented by \(k - 1\) dummy variables.

The omitted category becomes the reference group.

For example, the social_rank variable has 3 levels: low, medium, and high.

table(d$social_rank)


  high    low medium 
    24     29     47

We can create dummy variables for medium and high, which will make low the reference.

social_rank	medium	high
low	0	0
medium	1	0
high	0	1

d$rank_medium <- if_else(d$social_rank=="medium", 1, 0)
d$rank_high <- if_else(d$social_rank=="high", 1, 0)

# check dummy codes
# rank_medium = 0 & rank_high = 0 implies monkey has "low" social rank
table(d$social_rank, d$rank_medium)

        
          0  1
  high   24  0
  low    29  0
  medium  0 47

table(d$social_rank, d$rank_high)

        
          0  1
  high    0 24
  low    29  0
  medium 47  0

Factor variables

Categorical variables in R can be represented using factors.

Factors are integer variables with labels for each value.

Function factor() converts both character and integer variables into factors

# Create a factor variable for social_rank variable
d$social_rank_f <- factor(d$social_rank)
levels(d$social_rank_f)

[1] "high"   "low"    "medium"

By default, categories will be ordered alphabetically, but we can specify a particular order with levels=.

# Create a factor variable with custom level order
d$social_rank_f <- factor(d$social_rank, levels=c("low","medium","high"))

levels(d$social_rank_f)

[1] "low"    "medium" "high"

To convert a variable with integer values to a factor, use labels= argument

d$male <- factor(d$male, labels=c("female","male"))

d$male

  [1] male   female male   female male   female male   male   female female
 [11] male   female female female female male   female female female female
 [21] female female female male   female female female female male   male  
 [31] female female male   female female male   female female female male  
 [41] female male   male   female female female male   male   male   female
 [51] female male   female female male   male   female female male   male  
 [61] female male   male   male   female female male   female male   male  
 [71] male   male   male   female male   male   male   male   female female
 [81] female female male   male   male   male   male   male   male   male  
 [91] female female male   female female female male   male   female male  
Levels: female male

Conveniently, when entering a factor into lm() as a predictor, R will automatically create dummy variables for each level of factor except for the first, which will thus be the reference group.

Linear regression with one categorical predictor: Sex

Below we model weight predicted by sex, with the expectation that male monkeys are heavier than female monkeys.

# Simple linear regression where weight is modeled as a function of individual's sex
# male 
mS <- lm( weight ~ male, data=d)

# Extract model summary
sum_mS <- summary(mS)
sum_mS


Call:
lm(formula = weight ~ male, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.44358 -0.36511  0.00375  0.41178  1.38766 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.82878    0.08068  35.061   <2e-16 ***
malemale     0.22134    0.11645   1.901   0.0603 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5818 on 98 degrees of freedom
Multiple R-squared:  0.03555,   Adjusted R-squared:  0.02571 
F-statistic: 3.613 on 1 and 98 DF,  p-value: 0.06028

# Add confidence intervals
ci_mS <- confint(mS)
sum_mS <- cbind(sum_mS$coefficients, ci_mS)
round(sum_mS, 4) # rounding

            Estimate Std. Error t value Pr(>|t|)   2.5 % 97.5 %
(Intercept)   2.8288     0.0807 35.0611   0.0000  2.6687 2.9889
malemale      0.2213     0.1165  1.9007   0.0603 -0.0098 0.4524

A female capuchin (male=0) is expected to weigh 2.8288 kg.

Males are expected to weigh 0.2213 kg more than females on average. However, the 95% confidence interval, [-0.0098, 0.4524], suggests uncertainty in this estimate, with values near zero or even slightly negative values also consistent with the data.

# Calculate predicted weight for males and females
new_data <- data.frame(male=factor(c("female","male"), levels=c("female","male")))
predictions <- predict(mS, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("male","weight","lwr","upr")
new_data

    male   weight      lwr      upr
1 female 2.828778 2.668669 2.988888
2   male 3.050122 2.883475 3.216770

# Plot: weight ~ sex
ggplot(d, aes(male, weight)) +
  geom_point(color="grey32",size=0.75) +
  geom_errorbar(data = new_data, aes(x=male, ymin=lwr, ymax=upr, color=male),
                width=0.15,linewidth=1.5) +
  geom_point(data = new_data, aes(male, weight, color=male), size = 3) +
  scale_fill_manual(values = c("male"="lightcyan2","female"="orange2" ), name="Sex") +
  scale_color_manual(values = c("male"="deepskyblue3","female"="orange2" ), name="Sex") +  
  labs(x = "Sex", y = "Weight") +
  theme_bw()

Linear regression with one categorical predictor: Social rank

Higher ranking individuals should weigh more because rank translates into better access to resources.

# Simple linear regression where weight is modeled as a function of individual's social rank
mR <- lm( weight ~ social_rank_f, data=d)

# Extract model summary
sum_mR <- summary(mR)
sum_mR


Call:
lm(formula = weight ~ social_rank_f, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.56992 -0.33489  0.02641  0.40106  1.04668 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)           2.7261     0.1044  26.105  < 2e-16 ***
social_rank_fmedium   0.1763     0.1328   1.327  0.18752    
social_rank_fhigh     0.5254     0.1552   3.386  0.00103 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5624 on 97 degrees of freedom
Multiple R-squared:  0.1082,    Adjusted R-squared:  0.08977 
F-statistic: 5.882 on 2 and 97 DF,  p-value: 0.003881

Above, we see that R entered dummy variables representing medium and high in the regression model, and the coefficients represent comparisons to the reference group, low.

# Add confidence intervals
ci_mR <- confint(mR)
sum_mR <- cbind(sum_mR$coefficients, ci_mR)
round(sum_mR, 4) # rounding

                    Estimate Std. Error t value Pr(>|t|)   2.5 % 97.5 %
(Intercept)           2.7261     0.1044 26.1054   0.0000  2.5188 2.9333
social_rank_fmedium   0.1763     0.1328  1.3273   0.1875 -0.0873 0.4398
social_rank_fhigh     0.5254     0.1552  3.3859   0.0010  0.2174 0.8334

A low ranking capuchin (reference group) is predicted to weigh 2.7261 kg.

Medium ranking capuchin are expected to weigh 0.1763 kg more than low-ranking capuchin on average. However, the uncertainty surrounding the estimate (95% CI = [-0.0873, 0.4398] ) makes clear inference difficult.

High-ranking capuchin are expected to weigh 0.5254 kg more than low-ranking capuchin, although considerably smaller and larger effects are also consistent with the data (95% CI = [0.2174, 0.8334] )

# Calculate predictions of weight for low, medium and high ranking capuchins
new_data <- data.frame(social_rank_f=factor(c("low","medium","high"), levels=c("low","medium","high")))
predictions <- predict(mR, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("social_rank_f","weight","lwr","upr")
new_data

  social_rank_f   weight      lwr      upr
1           low 2.726080 2.518824 2.933337
2        medium 2.902337 2.739535 3.065138
3          high 3.251508 3.023683 3.479333

# Plot: weight ~ social_rank
ggplot(d, aes(social_rank_f, weight)) +
  geom_point(color="grey32", size=0.75) +
  geom_errorbar(data = new_data, aes(x=social_rank_f, ymin=lwr, ymax=upr), 
                width=0.2, color="orange2",linewidth=1.5) +
  geom_point(data = new_data, aes(social_rank_f, weight), color = "orange2", size = 3) +
  labs(x = "Social Rank", y = "Weight") +
  theme_bw()

Exercise 2

Some capuchin monkeys live in a rainforest, while others live in a dry tropical forest. These two habitats differ the amount of rain: the dry tropical forest experiences drought about half of the year, while the rainforest does not. As a result, food is more abundant in the rainforest, and therefore, we expect that monkeys who live in the rainforest will generally be heavier.

Create a forest_type_f factor variable with two levels: “dry” and “rain”.
Run a linear regression modeling capuchin monkey weight as a function of their forest_type_f.
Calculate weight predictions for an average capuchin residing in a rain forest and in a dry tropical forest.
Assess the assumptions of homoscedasticity and normality of the residuals using plots.

Exercise 2: Weight and Forest type

d$forest_type_f <- factor(d$forest_type, levels=c("dry","rain"))
# fit the model
mF <- lm( weight ~ forest_type_f, data=d)

# Extract model summary
sum_mF <- summary(mF)
# Confidence intervals
ci_mF <- confint(mF)
# Add confidence intervals to model summary
sum_mF <- cbind(sum_mF$coefficients,ci_mF)
round(sum_mF,4) #rounding to fit the screen

                  Estimate Std. Error t value Pr(>|t|)  2.5 % 97.5 %
(Intercept)         2.8334     0.0701 40.3988   0.0000 2.6942 2.9726
forest_type_frain   0.3080     0.1221  2.5226   0.0133 0.0657 0.5503

An average capuchin living in a dry forest is predicted to weigh 2.8334 kg.

Capuchins living in a rainforest are expected to weigh 0.308 kg more than those living in a dry forest, although very small differences are also consistent with the data (95% CI = [0.0657, 0.5503] ).

new_data <- data.frame(forest_type_f=factor(c("dry","rain"), levels=c("dry","rain")))
predictions <- predict(mF, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("forest_type_f","weight","lwr","upr")
new_data

  forest_type_f   weight      lwr      upr
1           dry 2.833389 2.694207 2.972570
2          rain 3.141373 2.943055 3.339691

ggplot(d, aes(forest_type_f, weight)) +
  geom_point(color="grey32", size=0.75) +
  geom_errorbar(data = new_data, aes(x=forest_type_f, ymin=lwr, ymax=upr, color=forest_type_f), 
                width=0.2, linewidth=1.5) +
  geom_point(data = new_data, aes(forest_type_f, weight, color=forest_type_f), size = 3) +
  scale_fill_manual(values = c("dry"="orange3","rain"="skyblue3" ), name="Forest type") +
  scale_color_manual(values = c("dry"="orange3","rain"="skyblue3" ), name="Forest type") +  
  labs(x = "Forest type", y = "Weight") +
  theme_bw()

# residual vs. fitted plot for homoscedasticity
plot(mF, which = 1)

# qqplot for normality residuals
qqnorm(residuals(mF))
qqline(residuals(mF))

Multiple regression

Adding another predictor to a simple linear regression model

Multiple regression allows us to estimate the unique effects of many predictors on the outcome simultaneously.

# Multiple linear regression where weight is modeled as a function of individual's sex and kin
mSK <- lm( weight ~ male + kin, data=d)

# Extract model summary
sum_mSK <- summary(mSK)
sum_mSK


Call:
lm(formula = weight ~ male + kin, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.31410 -0.25358  0.02095  0.24475  1.26425 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.05494    0.13132  15.648  < 2e-16 ***
malemale     0.77489    0.12569   6.165 1.62e-08 ***
kin          0.29372    0.04296   6.838 7.19e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4804 on 97 degrees of freedom
Multiple R-squared:  0.3492,    Adjusted R-squared:  0.3358 
F-statistic: 26.03 on 2 and 97 DF,  p-value: 8.941e-10

# Add confidence intervals
ci_mSK <- confint(mSK)
sum_mSK <- cbind(sum_mSK$coefficients, ci_mSK)
round(sum_mSK, 4) # rounding

            Estimate Std. Error t value Pr(>|t|)  2.5 % 97.5 %
(Intercept)   2.0549     0.1313 15.6480        0 1.7943 2.3156
malemale      0.7749     0.1257  6.1649        0 0.5254 1.0244
kin           0.2937     0.0430  6.8377        0 0.2085 0.3790

If two predictors are correlated (and are also both correlated with the outcome), their coefficients will change from models where they are entered alone.

In model mS, we found that males were expected to weigh 0.2213 kg more than females on average (95% CI = [-0.0098, 0.4524]) when sex was modeled alone.

Now, we see that, after controlling for the effect of kin, males are expected to weigh 0.7749 kg more than females (95% CI = [0.5254, 1.0244]).

Males generally have fewer kin within group than females (i.e. being male and number of kin are correlated), and having fewer close kin will generally lead to lower weights. So, when modeled alone, the effect of being male is smaller because it includes the effect of having fewer kin as well.

Omitted variable bias occurs when we do not include important predictors in regression models that are correlated with other predictors, resulting in coefficients that represent the combined effects of several predictors. Collecting data on all important explanatory predictors is crucial, therefore.

# Calculate predicted weight for males and females 
### <b>
new_data <- expand.grid(male=factor(c("female", "male"), levels=c("female","male")),
                       kin = 0:4)
### </b>
predictions <- predict(mSK, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("male","kin","weight","lwr","upr")
new_data

     male kin   weight      lwr      upr
1  female   0 2.054942 1.794303 2.315581
2    male   0 2.829833 2.678091 2.981576
3  female   1 2.348661 2.156562 2.540759
4    male   1 3.123552 2.984299 3.262805
5  female   2 2.642380 2.499524 2.785236
6    male   2 3.417271 3.243219 3.591324
7  female   3 2.936099 2.800265 3.071933
8    male   3 3.710990 3.474909 3.947071
9  female   4 3.229818 3.053661 3.405974
10   male   4 4.004709 3.695337 4.314081

pSK <- ggplot(d, aes(kin, weight)) +
  geom_point() +
  geom_ribbon(data = new_data, aes(x = kin, ymin = lwr, ymax = upr, fill=male), alpha = 0.5) +  
  geom_line(data = new_data, aes(kin, weight, color=male), linewidth = 1) +
  scale_fill_manual(values = c("male"="lightcyan2","female"="orange2" ), name="Sex") +
  scale_color_manual(values = c("male"="deepskyblue4","female"="orange4" ), name="Sex") +
  labs(x = "Number of close kin", y = "Weight") +
  theme_bw()
pSK

Standardizing and centering predictors

For some questions, it may be useful to put some or all of the predictors on a standardized scale:

to understand the magnitude of effect of a predictor whose scaling is not commonly known
to compare the magnitude of effects of predictors with different natural scalings

Predictors can be standardized by dividing values by their estimated standard deviation. The resulting standardized predictor is measured in standard deviation units.

The scale() function standardizes and centers variables (subtracts the mean so that the mean is now zero).

# Multiple linear regression with standardized predictors
mGK <- lm( weight ~ scale(group_size) + scale(kin), data=d)

# Extract model summary
sum_mGK <- summary(mGK)
sum_mGK


Call:
lm(formula = weight ~ scale(group_size) + scale(kin), data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.20813 -0.39458 -0.00674  0.39713  1.33433 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.93502    0.05638  52.061  < 2e-16 ***
scale(group_size) -0.05721    0.05670  -1.009  0.31548    
scale(kin)         0.17876    0.05670   3.153  0.00215 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5638 on 97 degrees of freedom
Multiple R-squared:  0.1036,    Adjusted R-squared:  0.08517 
F-statistic: 5.608 on 2 and 97 DF,  p-value: 0.004957

# Add confidence intervals
ci_mGK <- confint(mGK)
sum_mGK <- cbind(sum_mGK$coefficients, ci_mGK)
round(sum_mGK, 4) # rounding

                  Estimate Std. Error t value Pr(>|t|)   2.5 % 97.5 %
(Intercept)         2.9350     0.0564 52.0606   0.0000  2.8231 3.0469
scale(group_size)  -0.0572     0.0567 -1.0090   0.3155 -0.1698 0.0553
scale(kin)          0.1788     0.0567  3.1526   0.0022  0.0662 0.2913

Coefficients estimated for standardized predictors are interpreted as the expected change in the outcome per one-standard-deviation increase in the predictor.

For a one-standard-deviation increase in kin, we expect weight to increase by 0.1788 kg.

Centering predictors changes the interpretation of the intercept to be the expected outcome when the centered predictors are at their mean (which is now zero).

The expected weight for a capuchin coming from a group of average size and with an average number of kin to be 2.935 kg.

Interactions

Interaction between two continuous predictors: Kin and Group size

Interaction terms in regression model allow the effect of one predictor to vary with levels of another predictor.

Without an interaction, the effect of one predictor is modeled to be the same across all levels of the other predictor (i.e. a main effect)

A variable representing the interaction of two variables is formed by taking the product of those two variables.

Generally, the two component variables and the interaction variable are all entered into the regression together.

In R, we use the * operator between predictors to request that both the interaction and the two component variables be added to lm() model

Below, we model the interaction of number of kin and group size. Kin number should have greater effect in big groups, because competition increases with group size and coalitionary support from kin matters more.

# Multiple linear regression with interaction between two continuous variables
mKG <- lm( weight ~ kin + group_size + kin*group_size, data=d)

# Extract model summary
sum_mKG <- summary(mKG)
sum_mKG


Call:
lm(formula = weight ~ kin + group_size + kin * group_size, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.17226 -0.37270 -0.00907  0.39110  1.33113 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.35099    0.28257  11.859   <2e-16 ***
kin            -0.12715    0.12073  -1.053   0.2949    
group_size     -0.06214    0.02669  -2.328   0.0220 *  
kin:group_size  0.02459    0.01133   2.170   0.0324 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5533 on 96 degrees of freedom
Multiple R-squared:  0.1456,    Adjusted R-squared:  0.1189 
F-statistic: 5.452 on 3 and 96 DF,  p-value: 0.001668

# Add confidence intervals
ci_mKG <- confint(mKG)
sum_mKG <- cbind(sum_mKG$coefficients, ci_mKG)
round(sum_mKG, 4) # rounding

               Estimate Std. Error t value Pr(>|t|)   2.5 %  97.5 %
(Intercept)      3.3510     0.2826 11.8591   0.0000  2.7901  3.9119
kin             -0.1272     0.1207 -1.0531   0.2949 -0.3668  0.1125
group_size      -0.0621     0.0267 -2.3283   0.0220 -0.1151 -0.0092
kin:group_size   0.0246     0.0113  2.1704   0.0324  0.0021  0.0471

The interpretation of the intercept is the same as before: the expected weight for a capuchin with no close kin and from a group size of zero is 3.351 kg.

The interpretation of coefficients for variables involved in an interaction is different from a main effects model. Namely, these are the simple or conditional effects when the other variable is zero.

For each additional kin, we expect weight to decrease by 0.1272 kg when group size is zero, although the sign of this effect is not clear from the data (95% CI = [-0.3668, 0.1125]).
For each additional group member, we expect weight to decrease by 0.0621 kg when number of close kin is zero, (95% CI = [-0.1151, -0.0092]).

The interaction coefficient expresses the change in a variable’s effect when the interacting variable is increased by one-unit:

For each additional kin, the effect of group size is expected to increase by 0.0246 kg/monkey, (95% CI = [0.002, 0.047]).
For each additional group member, the effect of kin is expected to increase by 0.0246 kg/kin.

The effect of kin at any particular group size can be calculated by adding the coefficient for kin, 0.1272, to the interaction coefficient, 0.0246, multiplied by group size.

\[\begin{aligned} kin.effect_{(group\_size=g)} &= - 0.1272 + 0.0246*g \\ kin.effect_{(group\_size=2)} &= - 0.1272 + 0.0246*10 \\ &= 0.1188 \end{aligned}\]

Interactions can be difficult to understand from just examining the coefficients, so graphs of model predictions can greatly help.

predict the outcome across a range of values of predictors involved in the interaction
one predictor should be graphed along the x-axis
the other predictor should be used to group lines (using different colors or line patterns) or to create separate panels

# Calculate predictions
new_data <- expand.grid(group_size=c(5,10,15), kin = c(0,2,4))
predictions <- predict(mKG, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("group_size","kin","weight","lwr","upr")
new_data

  group_size kin   weight      lwr      upr
1          5   0 3.040281 2.721467 3.359095
2         10   0 2.729567 2.559185 2.899950
3         15   0 2.418854 2.107790 2.729918
4          5   2 3.031873 2.835853 3.227893
5         10   2 2.967053 2.855005 3.079101
6         15   2 2.902233 2.691468 3.112999
7          5   4 3.023465 2.667883 3.379046
8         10   4 3.204539 3.001180 3.407898
9         15   4 3.385612 3.019975 3.751250

### <b>
# Subset dataset
dsub <- d[d$group_size%in%c(5,10,15) & d$kin%in%c(0,2,4),]
### </b>
# Plot
pKxG1 <- ggplot(dsub, aes(kin, weight)) +
  geom_point() +
  geom_ribbon(data=new_data, aes(x=kin, ymin=lwr, ymax=upr, fill=factor(group_size)), alpha=0.25) +
  geom_line(data=new_data, aes(kin, weight, color=factor(group_size)), linewidth = 1) +
  facet_grid(. ~ group_size, scales="free") +
  scale_fill_manual(values=c("5"="skyblue","10"="deepskyblue3" ,"15"="deepskyblue4"), 
                    name="Group size") +
  scale_color_manual(values = c("5"="skyblue","10"="deepskyblue3","15"="deepskyblue4") ,
                    name="Group size") +
  labs(x = "Number of Kin", y = "Weight") +
  theme_bw()

pKxG1

The impact of the number of kin depends on group size. In small groups (n=5), there is essentially no effect of kinship on an individual’s weight. However, in large groups (n=15), individuals are expected to weigh more with each additional kin member.

### <b>
# Subset dataset
dsub <- d[d$group_size%in%c(5,10,15) & d$kin%in%c(0,2,4),]
### </b>
pKG2 <- ggplot(dsub, aes(group_size, weight)) +
  geom_point() +
  geom_ribbon(data = new_data, aes(x = group_size, ymin = lwr, ymax = upr, fill=factor(kin)), alpha = 0.25) +
  geom_line(data = new_data, aes(group_size, weight, color=factor(kin)), linewidth = 1) +
  scale_fill_manual(values = c("0"="orange2","2"="salmon1" ,"4"="orange4"),name="Number of Kin" ) +
  scale_color_manual(values = c("0"="orange2","2"="salmon1","4"="orange4" ), name="Number of Kin") +
  labs(x = "Group size", y = "Weight") +
  theme_bw()

pKG2

The effect of group size on weight depends on the number of kin that an individual has. Individuals with no kin members in their group are expected to weigh less as the group size increases. Individuals who have four kin members in the group are expected to weigh more as the group size increases.

Interaction between two categorical predictors: Sex and Social rank

Higher ranking individuals should weigh more than lower ranking individuals. The effect should be stronger for males in comparison to females.

# Multiple linear regression with interaction between two continuous variables
mSR <- lm( weight ~ male*social_rank_f, data=d)

# Extract model summary
sum_mSR <- summary(mSR)
sum_mSR


Call:
lm(formula = weight ~ male * social_rank_f, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3207 -0.3713 -0.0326  0.3559  1.7507 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    2.7724     0.1356  20.453   <2e-16 ***
malemale                      -0.1034     0.2025  -0.511   0.6107    
social_rank_fmedium            0.1299     0.1660   0.782   0.4359    
social_rank_fhigh             -0.3067     0.3031  -1.012   0.3142    
malemale:social_rank_fmedium   0.1034     0.2641   0.392   0.6963    
malemale:social_rank_fhigh     1.0463     0.3594   2.911   0.0045 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5422 on 94 degrees of freedom
Multiple R-squared:  0.1966,    Adjusted R-squared:  0.1538 
F-statistic: 4.599 on 5 and 94 DF,  p-value: 0.0008492

# Add confidence intervals
ci_mSR <- confint(mSR)
sum_mSR <- cbind(sum_mSR$coefficients, ci_mSR)
round(sum_mSR, 3) # rounding

                             Estimate Std. Error t value Pr(>|t|)  2.5 % 97.5 %
(Intercept)                     2.772      0.136  20.453    0.000  2.503  3.042
malemale                       -0.103      0.202  -0.511    0.611 -0.505  0.299
social_rank_fmedium             0.130      0.166   0.782    0.436 -0.200  0.460
social_rank_fhigh              -0.307      0.303  -1.012    0.314 -0.908  0.295
malemale:social_rank_fmedium    0.103      0.264   0.392    0.696 -0.421  0.628
malemale:social_rank_fhigh      1.046      0.359   2.911    0.004  0.333  1.760

The expected weight for a low ranking capuchin female is 2.772 kg.
Low ranking males are expected to weigh 0.103 kg less than low ranking females, (95% CI = [-0.505, 0.299]).
For females, we expect medium ranking monkeys to weigh 0.13 kg more than low ranking monkeys, (95% CI = [-0.2, 0.46]).
For females, we expect high ranking monkeys to weigh 0.307 kg less than low ranking monkeys, (95% CI = [-0.908, 0.295]).

The interaction coefficients:

Compared to females, the difference between medium-ranking and low-ranking capuchins is increased by 0.103 kg for males, (95% CI = [-0.421, 0.628])
Compared to females, the difference between high-ranking and low-ranking capuchins is increased by 1.046 kg for males, (95% CI = [0.333, 1.76]).
- OR,
Compared to low-ranking monkeys, the difference between males and females is increased by 0.103 kg for medium-ranking monkeys, (95% CI = [-0.421, 0.628]).
Compared to low-ranking monkeys, the difference between males and females is increased by 1.046 kg for high-ranking monkeys, (95% CI = [0.333, 1.76]).

# Calculate predictions
new_data <- expand.grid(male=factor(c("female","male"), level=c("female","male")),
            social_rank_f = factor(c("low","medium","high"), level=c("low","medium","high")))
predictions <- predict(mSR, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("male","social_rank_f","weight","lwr","upr")
new_data

    male social_rank_f   weight      lwr      upr
1 female           low 2.772431 2.503290 3.041572
2   male           low 2.669033 2.370449 2.967618
3 female        medium 2.902331 2.712019 3.092642
4   male        medium 2.902350 2.624383 3.180317
5 female          high 2.465748 1.927467 3.004029
6   male          high 3.408660 3.167933 3.649386

pSR <- ggplot(d, aes(social_rank_f, weight)) +
  geom_point(color="grey32",size=0.75) +
  geom_errorbar(data = new_data, aes(x = social_rank_f, ymin = lwr, ymax = upr, color=male),  
                width=0.2, linewidth=1) +  
  geom_point(data = new_data, aes(social_rank_f, weight, color=male), size = 2) +
  scale_fill_manual(values = c("male"="skyblue3","female"="sandybrown" ), name="Sex") +
  scale_color_manual(values = c("male"="skyblue3","female"="sandybrown" ), name="Sex") +
  labs(x = "Social rank", y = "Weight") +
  theme_bw()

pSR

There seems to be no difference in expected weights for low ranking and medium ranking males and females. High ranking males are expected to weigh more than high ranking females.

Interaction between a continuous predictor and a categorical predictor: Sex and Kin

The presence of kin should have greater effect on female weight compared to male weight.

# Multiple linear regression with interaction between a continuous var and a categorical var
mSxK <- lm( weight ~ male*kin, data=d)

# Extract model summary
sum_mSxK <- summary(mSxK)
sum_mSxK


Call:
lm(formula = weight ~ male * kin, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.36506 -0.26226  0.03128  0.27863  1.12248 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.81203    0.14638  12.379  < 2e-16 ***
malemale      1.15958    0.16941   6.845 7.20e-10 ***
kin           0.38592    0.05004   7.713 1.15e-11 ***
malemale:kin -0.28123    0.08739  -3.218  0.00176 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4588 on 96 degrees of freedom
Multiple R-squared:  0.4126,    Adjusted R-squared:  0.3942 
F-statistic: 22.48 on 3 and 96 DF,  p-value: 4.163e-11

# Add confidence intervals
ci_mSxK <- confint(mSxK)
sum_mSxK <- cbind(summary(mSxK)$coefficients, ci_mSxK)
round(sum_mSxK, 3) # rounding

             Estimate Std. Error t value Pr(>|t|)  2.5 % 97.5 %
(Intercept)     1.812      0.146  12.379    0.000  1.521  2.103
malemale        1.160      0.169   6.845    0.000  0.823  1.496
kin             0.386      0.050   7.713    0.000  0.287  0.485
malemale:kin   -0.281      0.087  -3.218    0.002 -0.455 -0.108

The expected weight for a a female capuchin with no kin is 1.812 kg.
For capuchins with no kin, we expect males to weigh 1.16 kg more than females, (95% CI = [0.823, 1.496]).
For females, we expect each additional kin to increase a monkey’s weight by 0.386 kg, (95% CI = [0.287, 0.485]).

The interaction coefficient:

For each additional kin, we expect males to weigh 0.281 kg less in comparison to females, (95% CI = [-0.455, -0.108]).
- OR,
Compared to females, we expect each additional kin to decrease a monkey’s weight by 0.281 kg for males, (95% CI = [-0.455, -0.108]).

# Calculate predictions
new_data <- expand.grid(male=factor(c("female","male"), level=c("female","male")),kin = 0:4)
predictions <- predict(mSxK, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("male","kin","weight","lwr","upr")
new_data

     male kin   weight      lwr      upr
1  female   0 1.812028 1.521472 2.102584
2    male   0 2.971605 2.802334 3.140876
3  female   1 2.197948 1.992263 2.403632
4    male   1 3.076295 2.940134 3.212456
5  female   2 2.583868 2.442730 2.725006
6    male   2 3.180985 2.959901 3.402068
7  female   3 2.969787 2.838395 3.101180
8    male   3 3.285675 2.939746 3.631603
9  female   4 3.355707 3.170402 3.541013
10   male   4 3.390364 2.909838 3.870891

# Plot: weight ~ sex*kin
  ### <b>
pSxK <- ggplot(d, aes(kin, weight, color=male)) +
  geom_point() +
  ### </b>
  geom_ribbon(data=new_data, aes(x=kin, ymin=lwr, ymax=upr, fill=male), alpha = 0.5, color=NA ) +  
  geom_line(data=new_data, aes(kin, weight, color=male), linewidth = 1) +
  scale_fill_manual(values = c("male"="lightcyan2","female"="orange2" ), name="Sex") +
  scale_color_manual(values = c("male"="deepskyblue4","female"="orange4" ), name="Sex") +
  ### <b>
  geom_hline(yintercept=mean(d$weight[d$male=="female"]), linetype="dashed", color = "orange2") +
  geom_hline(yintercept=mean(d$weight[d$male=="male"]), linetype="dashed", color="deepskyblue3") +
  ### </b>
  labs(x = "Number of Kin", y = "Weight") +
  theme_bw()

pSxK

The presence of kin increases weight, with stronger effect expected in females compared to males.

True model

The data is simulated.

Generate predictor variables (sex, rank, kin, group size, forest type)
Specify relationships among predictors
Assign coefficient values (effect sizes + interactions)
Simulate normally distributed random error
Compute outcome (weight) as linear combination of predictors + interactions + error

\[ \begin{aligned} \text{weight}_i &= \beta_0 + \beta_S \,\text{male}_i + \beta_{Rm}\,\text{social_rank_medium}_i + \beta_{Rh}\,\text{social_rank_high}_i \\ &\quad + \beta_G\,\text{group_size}_i + \beta_K\,\text{kin}_i + \beta_F\,\text{rain_forest}_i \\ &\quad + \beta_{KG}\,(\text{kin}_i \times \text{group_size}_i) \\ &\quad + \beta_{SRm}\,(\text{male}_i \times \text{social_rank_medium}_i) \\ &\quad + \beta_{SRh}\,(\text{male}_i \times \text{social_rank_high}_i) \\ &\quad + \beta_{SK}\,(\text{male}_i \times \text{kin}_i) \\ &\quad + \epsilon_i \end{aligned} \]

b0 <- 2.5 # INTERCEPT
bS <- 0.8 # SEX
bG <- -0.1 # GROUP SIZE
bK <- 0.1 # KIN
bRm <- 0.1 # SOCIAL RANK MEDIUM
bRh <- 0.3 # SOCIAL RANK HIGH
bF <- 0.5 # FORETS TYPE
bKG <- 0.03 # KIN x GROUP SIZE
bSRm <- 0.3 # SEX x RANK MEDIUM
bSRh <- 0.5 # SEX x RANK HIGH
bSK <- -0.2 # SEX x KIN

# Multiple linear regression with interaction between two continuous variables
mTRUE <- lm( weight ~ male + group_size + kin + social_rank_f + forest_type_f  +
              kin * group_size +
              male * social_rank_f +
              male * kin, data=d )

All the models we fit were underspecified:

A TRUE model might still not fully recover the estimates

Samples are random draws, leading to variable estimates and varying levels of certainty.

.

Why do we care about linear regression?

Simple linear regression model

Ordinary Least Squares (OLS)

Linear regression in R

Dataset: Capuchin monkey weight

The data

Linear regression with one continuous predictor: Kin

summary() Output explained

summary() Output explained

\(F\)-statistic

summary() Output explained

Multiple \(R^2\)

Adjusted \(R^2\)

Interpreting model coefficients

Standard errors and confidence intervals

t statistic and p-values

Assessing the fit of the model

Exercise 1

Exercise 1 Solution: Weight and Group size

Regression diagnostics

Linear regression assumptions

Linearity assumption

Equal variance assumption or Homoscedasticity

Independence assumption

Normality assumption

Influential observations

Categorical predictors

Categorical predictors: factor and dummy variables

Categorical predictors

Dummy variables

Factor variables

Linear regression with one categorical predictor: Sex

Linear regression with one categorical predictor: Social rank

Exercise 2

Exercise 2: Weight and Forest type

Multiple regression

Adding another predictor to a simple linear regression model

Standardizing and centering predictors

Interactions

Interaction between two continuous predictors: Kin and Group size

Interaction between two categorical predictors: Sex and Social rank

Interaction between a continuous predictor and a categorical predictor: Sex and Kin

True model

Linear regression in `R`

`summary()` Output explained

`summary()` Output explained

`summary()` Output explained