x y
1 0.9819694 0.49317417
2 0.4687150 0.70972185
3 -0.1079713 -0.83172816
4 -0.2128782 0.09893859
5 1.1580985 1.60159397
6 1.2923548 -0.36389939
Linear regression allows us to:
The typical goal is to estimate population parameters describing these relationships, as well as their uncertainties, from a sample.
A simple linear regression measures the relationship between one outcome and one predictor.
x y
1 0.9819694 0.49317417
2 0.4687150 0.70972185
3 -0.1079713 -0.83172816
4 -0.2128782 0.09893859
5 1.1580985 1.60159397
6 1.2923548 -0.36389939
x: predictor, independent variable (IV), regressor
y: outcome, dependent variable (DV), response
Linear regression assumes linear relationship between x and y.
The simple linear regression model:
\[y = \beta_0 + \beta_1 x + \epsilon\]
The Ordinary Least Squares (OLS) estimator (formula or method) finds the best-fitting line relating \(x\) to \(y\).
The best-fit line minimizes the sum of the squared errors, or differences between observed and predicted values of \(y\).
The OLS estimator has several desirable statistical properties:
RA simulated dataset of 100 observations of capuchin monkey weights
Variables
Weight is the outcome, and we expect each of the variables to influence weight.
NOTE: These data are simulated and do not represent real monkeys
Explore the model variables before running a regression.
skim() from the skimr package produces quick summaries of variables in a data frame.
| Name | d |
| Number of rows | 100 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| social_rank | 0 | 1 | 3 | 6 | 0 | 3 | 0 |
| forest_type | 0 | 1 | 3 | 4 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 475215.21 | 317074.62 | 1024.00 | 184206.00 | 514714.50 | 784331.00 | 987254.0 | ▇▅▅▅▇ |
| group_size | 0 | 1 | 9.77 | 3.27 | 5.00 | 7.00 | 9.50 | 13.00 | 15.0 | ▇▆▅▅▅ |
| kin | 0 | 1 | 1.73 | 1.47 | 0.00 | 0.00 | 2.00 | 3.00 | 4.0 | ▇▅▅▅▅ |
| weight | 0 | 1 | 2.94 | 0.59 | 1.58 | 2.52 | 3.01 | 3.34 | 4.3 | ▂▇▇▇▂ |
Histograms and frequency plots give quick summaries of variable distributions and may reveal extreme or erroneous values.
Use lm() to run a linear regression in R, specifying the model with formula outcome ~ predictor.
Below we model how weight is predicted by kin, and we expect that increasing number of close kin will predict increasing weight.
[1] "lm"
Call:
lm(formula = weight ~ kin, data = d)
Coefficients:
(Intercept) kin
2.7220 0.1232
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
Objects returned by lm() are complex, so we use other functions on the object to extract model estimates and produce model predictions.
coef(): model coefficientssummary(): coefficients, standard errors, p-values, model F-test, etc.predict(): model predictionsresiduals(): various kinds of residualsconfint(): confidence intervals for coefficientsplot(): diagnostic plots
Call:
lm(formula = weight ~ kin, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.26348 -0.36073 -0.01462 0.39490 1.37212
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.72196 0.08736 31.160 < 2e-16 ***
kin 0.12316 0.03857 3.193 0.00189 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5638 on 98 degrees of freedom
Multiple R-squared: 0.09424, Adjusted R-squared: 0.085
F-statistic: 10.2 on 1 and 98 DF, p-value: 0.001892
summary() Output explained## [...]
## Call:
## lm(formula = weight ~ kin, data = d)
## [...]
Model formula.
## [...]
## Residuals:
## Min 1Q Median 3Q Max
## -1.26348 -0.36073 -0.01462 0.39490 1.37212
## [...]
Summary of the distribution of the residuals.
## [...]
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.72196 0.08736 31.160 < 2e-16 ***
## kin 0.12316 0.03857 3.193 0.00189 **
## ---
## [...]
Estimated coefficients, standard errors, t-values, and p-values for the hypothesis test that a coefficient equals to zero.
## [...]
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [...]
Asterisks provide a quick visual shortcut to p-values.
## [...]
## Residual standard error: 0.5638 on 98 degrees of freedom
## [...]
Residual standard error measures the average amount that observed responses deviate from the fitted regression line.
Degrees of freedom (\(df\)) represent the number of independent observations (\(N\)) in the data that are available to estimate a parameter.
In the case of linear regression:
\(p =\) the number of predictors \(+ 1\)
\(df = N - p\)
summary() Output explained## [...]
## F-statistic: 10.2 on 1 and 98 DF, p-value: 0.001892
## [...]
Indicates whether the overall model is statistically significant
Provides evidence that at least one independent variable is related to the dependent variable, but not which one
summary() Output explained## [...]
## Multiple R-squared: 0.09424, Adjusted R-squared: 0.085
## [...]
The proportion of variance in the dependent variable explained by the independent variables.
\[R^2 = 1 - \frac{SSR}{SST}\]
SSR - the sum of squared residuals \(\sum_{i=1}^n(y_i - \hat{y_i})^2\).
SST - the total sum of squares \(\sum_{i=1}^n(y_i - \bar{y_i})\).
Ranges from 0 to 1
Increases as more variables are included in the model
Adjusted \(R^2\) penalizes adding complexity (predictors) to the model to prevent overfitting.
\[1 – \left( (1 – R^2) \cdot \frac{N – 1}{ N – k – 1} \right)\]
##
## Call:
## lm(formula = weight ~ kin, data = d)
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.72196 0.08736 31.160 < 2e-16 ***
## kin 0.12316 0.03857 3.193 0.00189 **
## ---
## [...]
Given our model and our data, the best-fitting regression equation is:
\[ \hat{weight} = 2.72196 + 0.12316 * kin\]
(Intercept) 2.72196 is the predicted mean weight for a monkey with no close kin members in their group (i.e., kin = 0).
The coefficient 0.12316 for kin means that for one unit increase in kin, we can expect a 0.12316 unit increase in weight.
Standard errors and confidence intervals quantify the uncertainty in the estimates.
# Use extractor function confint() to compute confidence intervals
conf_intervals <- confint(mK)
conf_intervals 2.5 % 97.5 %
(Intercept) 2.54860910 2.895316
kin 0.04661907 0.199694
# Combine summary() output with confint() output
sumCI <- summary(mK)
sumCI$coefficients <- cbind(sumCI$coefficients[,c(1:2)], conf_intervals)
sumCI
Call:
lm(formula = weight ~ kin, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.26348 -0.36073 -0.01462 0.39490 1.37212
Coefficients:
Estimate Std. Error 2.5 % 97.5 %
(Intercept) 2.72196 0.08736 2.54861 2.895
kin 0.12316 0.03857 0.04662 0.200
Residual standard error: 0.5638 on 98 degrees of freedom
Multiple R-squared: 0.09424, Adjusted R-squared: 0.085
F-statistic: 10.2 on 1 and 98 DF, p-value: 0.001892
Std. Error is a measure of variability of an estimate over repeated samples (i.e. the standard deviation of the sampling distribution).
2.5% and 97.5% provide the lower and upper end, respectively, of the range of values of a parameter that are reasonably compatible with the data.
A 95% confidence interval implies that if we were to repeat the experiment or data collection process many times, each time constructing a 95% confidence interval, we would expect the true parameter to fall within the interval about 95% of the time.
The 95% confidence interval for the kin parameter estimate indicates that the data is consistent with a coefficient range from 0.04662 to 0.2.
NA [...]
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.72196 0.08736 31.160 < 2e-16 ***
kin 0.12316 0.03857 3.193 0.00189 **
---
NA [...]
t value and Pr(>|t|) are used in testing the null hypothesis that a parameter is 0 in the population
t value is the test statistic, calculated by dividing Estimate by its Std. Error, and is assumed to have a \(t\)-distribution under the null hypothesis with \(df\) degrees of freedom
Pr(>|t|) is the p-value, the probability of observing a t value equal or greater than the observed t value if the null hypothesis were true
The p-value 0.00189 suggests that if the null hypothesis were true, we would expect to see t values for the kin parameter as or more extreme than 3.193 in about 2 out of 1000 samples, an unlikely result that suggests that the null hypothesis should be rejected
One way to assess the fit of the model is to plot model predictions to ensure they look reasonable against the observed data.
First, calculate the predicted outcome across a range of predictor values observed in the data.
The predict() function takes a model object and produces:
# Create a dataframe with predictor values
new_data <- data.frame(kin=0:4)
# Calculate predictions of the outcome (weight) for each value of the predictor (kin)
predictions <- predict(mK, newdata=new_data, interval="confidence")
# Prepare the data frame for plotting
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("kin","weight","lwr","upr")
new_data kin weight lwr upr
1 0 2.721963 2.548609 2.895316
2 1 2.845119 2.720056 2.970182
3 2 2.968276 2.854495 3.082057
4 3 3.091432 2.943218 3.239646
5 4 3.214589 3.007938 3.421240
Then make a plot of your data and model predictions.
# A simple plot for model mK
ggplot(d, aes(kin, weight)) +
geom_point() + # plots the observed data
geom_line(data = new_data, aes(kin, weight)) + # plots the regression line
geom_ribbon(data = new_data, aes(x = kin, ymin = lwr, ymax = upr), alpha=0.2) # plots the CI# A plot for model mK with some custom adjustments
ggplot(d, aes(kin, weight)) +
geom_point() +
geom_ribbon(data = new_data, aes(x = kin, ymin = lwr, ymax = upr), fill = "orange2", alpha = 0.3) +
geom_line(data = new_data, aes(kin, weight), color = "orange4", linewidth = 1) +
labs(title = "Linear regression model: weight ~ kin" ,x = "Number of close kin", y = "Weight") Capuchin monkeys live in social groups that can range from just a few individuals up to 40. As group size increases, competition for food is expected to rise. Therefore, capuchins living in larger groups should have a lower average weight compared to those in smaller groups.
weight as a function of their group_size.summary() run on the object returned by lm()# fit the model
mG <- lm( weight ~ group_size, data=d)
# Extract model summary
sum_mG <- summary(mG)
# Confidence intervals
ci_mG <- confint(mG)
# Add confidence intervals to model summary
sum_mG <- cbind(sum_mG$coefficients,ci_mG)
round(sum_mG,4) #rounding to fit the screen Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 3.1261 0.1861 16.7939 0.0000 2.7567 3.4955
group_size -0.0196 0.0181 -1.0821 0.2819 -0.0554 0.0163
As group size increases by one individual, on average, the weight is expected to decrease by 0.0196 kg (95% CI = [-0.0554, 0.0163])
# Calculate predicted weight
new_data <- data.frame(group_size=5:15)
predictions <- predict(mG, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("group_size","weight","lwr","upr")
new_data group_size weight lwr upr
1 5 3.028313 2.821120 3.235507
2 6 3.008756 2.830030 3.187481
3 7 2.989198 2.835805 3.142591
4 8 2.969640 2.836641 3.102640
5 9 2.950083 2.829994 3.070171
6 10 2.930525 2.813365 3.047685
7 11 2.910968 2.786049 3.035887
8 12 2.891410 2.749791 3.033029
9 13 2.871852 2.707292 3.036412
10 14 2.852295 2.660783 3.043807
11 15 2.832737 2.611724 3.053751
# Plot: weight ~ group_size
pmG <- ggplot(d, aes(group_size, weight)) +
geom_point() +
geom_ribbon(data=new_data, aes(x=group_size, ymin=lwr, ymax=upr), fill="skyblue3", alpha=0.25) +
geom_line(data=new_data, aes(group_size, weight), color="deepskyblue4", linewidth=1) +
labs(x = "Group size", y = "Weight")
pmGInferences made from a regression model depend on several assumptions.
* From “Beyond Multiple Linear Regression” by Paul Roback and Julie Legler
Violation of an assumption threatens the validity of the estimates and inferences made from them, so it’s important to assess the plausibility of these assumptions with regression diagnostics.
Violation of linearity results in erroneous inference regarding the form of the relationship between the predictor and outcome as well as bad predictions.
Diagnostic plot: residuals (y-axis) vs fitted outcome values (x-axis).
No systematic trend among the residuals suggests linearity is met. A smoothed curve can help to identify a trend.
Using plot() on the lm object will produce diagnostic plots, the first of which is the residuals vs fitted values plot.
A trend in the residuals suggests a nonlinear relationship between variables a predictor and the outcome.
If linearity assumption might be violated, consider:
Violation of of homoscedasticity of the errors results in inaccurate standard errors, leading to inaccurate test statistics, confidence intervals, and p-values.
Diagnostic plot: residuals (y-axis) vs fitted outcome values (x-axis) again.
Random scatter about the zero line with equal spread across the range of fitted values suggests homoscedasticity.
One commonly observed form of heteroscedasticity is increasing variance of the residuals with increasing fitted values:
If homoscedasticity might be violated, consider:
Violation of independence threatens the accuracy of confidence intervals and p-values.
Two common design features that result in dependencies in the data:
Diagnostic plot: residuals (y-axis) against a clustering index (e.g., schoolID) or order variable (e.g., time).
If we had sampled several monkeys from the same group, we might see systematic differences in the residuals across groups, suggesting non-independence.
# Plot residuals by group
boxplot(m$residuals ~ group, data = dat, xlab = "Group", ylab = "Residuals")If independence might be violated, consider:
Violation of normality of errors threatens the accuracy of confidence intervals and p-values, but becomes less impactful with large sample sizes.
Diagnostic plots: histogram and q-q plot of residuals.
Deviation from the diagonal line in the q-q plot suggests that errors are not normally distributed.
If normality might be violated, consider:
glm()Influential observations change model estimates substantially when excluded from the data during estimation.
influence.measures() can be run on an lm object, and returns several influence measures including dfbetas for each coefficient and cook’s distance.
If any of the influence measures are unusually large, they are flagged with an * in the infl column.
Influence measures of
lm(formula = weight ~ group_size, data = d) :
dfb.1_ dfb.grp_ dffit cov.r cook.d hat inf
1 0.064279 -9.12e-02 -0.115102 1.039 6.66e-03 0.0269
2 -0.042063 5.60e-02 0.065945 1.056 2.19e-03 0.0358
3 -0.050624 3.31e-02 -0.069269 1.026 2.41e-03 0.0130
4 -0.045000 6.38e-02 0.080580 1.044 3.27e-03 0.0269
5 -0.000932 -2.64e-04 -0.003745 1.031 7.08e-06 0.0100
6 0.002997 8.48e-04 0.012048 1.031 7.33e-05 0.0100
7 0.146781 -2.08e-01 -0.262837 0.997 3.40e-02 0.0269
8 0.004419 -5.88e-03 -0.006928 1.059 2.42e-05 0.0358
9 0.018768 5.31e-03 0.075436 1.019 2.86e-03 0.0100
10 -0.128697 1.05e-01 -0.139193 1.028 9.71e-03 0.0234
11 -0.014705 2.34e-02 0.033182 1.040 5.56e-04 0.0198
12 -0.138223 9.03e-02 -0.189134 0.978 1.76e-02 0.0130
13 0.092738 -7.59e-02 0.100301 1.036 5.06e-03 0.0234
14 -0.188054 1.23e-01 -0.257319 0.934 3.18e-02 0.0130 *
15 0.022861 -3.24e-02 -0.040937 1.048 8.46e-04 0.0269
16 0.322569 -2.77e-01 0.335389 0.982 5.49e-02 0.0314
17 -0.073702 9.81e-02 0.115545 1.051 6.72e-03 0.0358
18 0.007616 -6.89e-02 -0.195137 0.966 1.86e-02 0.0114
19 -0.024989 3.97e-02 0.056389 1.038 1.60e-03 0.0198
20 0.181734 -1.19e-01 0.248670 0.941 2.98e-02 0.0130
21 -0.137603 1.95e-01 0.246402 1.003 3.00e-02 0.0269
22 -0.044047 1.93e-02 -0.083730 1.018 3.52e-03 0.0106
23 -0.142638 1.22e-01 -0.148307 1.039 1.10e-02 0.0314
24 0.147436 -1.21e-01 0.159461 1.023 1.27e-02 0.0234
25 0.040424 -1.77e-02 0.076843 1.020 2.97e-03 0.0106
26 -0.028385 2.44e-02 -0.029513 1.053 4.40e-04 0.0314
27 -0.021543 9.42e-03 -0.040952 1.028 8.46e-04 0.0106
28 0.005072 -1.04e-02 -0.018466 1.035 1.72e-04 0.0147
29 0.059900 1.70e-02 0.240757 0.920 2.77e-02 0.0100 *
30 -0.081929 5.35e-02 -0.112105 1.014 6.29e-03 0.0130
31 -0.325557 2.80e-01 -0.338496 0.981 5.58e-02 0.0314
32 0.002045 -1.85e-02 -0.052401 1.027 1.38e-03 0.0114
33 -0.096149 1.28e-01 0.150737 1.045 1.14e-02 0.0358
34 0.043374 -6.89e-02 -0.097875 1.031 4.82e-03 0.0198
35 -0.082761 6.27e-02 -0.096743 1.027 4.70e-03 0.0172
36 0.004527 -4.10e-02 -0.115994 1.008 6.72e-03 0.0114
37 0.007169 2.03e-03 0.028813 1.029 4.19e-04 0.0100
38 -0.013310 2.74e-02 0.048454 1.033 1.18e-03 0.0147
39 0.142358 -2.26e-01 -0.321238 0.940 4.95e-02 0.0198
40 0.043315 -5.76e-02 -0.067907 1.056 2.33e-03 0.0358
41 0.067758 -5.82e-02 0.070451 1.051 2.50e-03 0.0314
42 0.101768 -7.71e-02 0.118961 1.022 7.09e-03 0.0172
43 0.156962 -2.09e-01 -0.246076 1.024 3.01e-02 0.0358
44 -0.127022 1.04e-01 -0.137381 1.028 9.46e-03 0.0234
45 0.056355 -3.68e-02 0.077112 1.025 2.99e-03 0.0130
46 0.020019 -8.75e-03 0.038053 1.029 7.30e-04 0.0106
47 -0.079020 5.98e-02 -0.092370 1.028 4.29e-03 0.0172
48 -0.025194 5.18e-02 0.091720 1.024 4.23e-03 0.0147
49 0.015033 -6.57e-03 0.028577 1.030 4.12e-04 0.0106
50 -0.095191 6.22e-02 -0.130252 1.007 8.46e-03 0.0130
51 0.155556 -1.27e-01 0.168242 1.020 1.41e-02 0.0234
52 0.229604 -1.88e-01 0.248329 0.992 3.03e-02 0.0234
53 -0.008436 1.73e-02 0.030712 1.035 4.76e-04 0.0147
54 0.248756 -3.31e-01 -0.389987 0.974 7.37e-02 0.0358
55 0.087150 -1.16e-01 -0.136630 1.048 9.38e-03 0.0358
56 -0.013531 -3.83e-03 -0.054385 1.025 1.49e-03 0.0100
57 -0.124298 1.02e-01 -0.134435 1.029 9.06e-03 0.0234
58 0.067430 -5.79e-02 0.070110 1.051 2.48e-03 0.0314
59 0.088296 -7.58e-02 0.091805 1.048 4.25e-03 0.0314
60 -0.062258 9.89e-02 0.140489 1.021 9.87e-03 0.0198
61 -0.099371 1.41e-01 0.177941 1.025 1.58e-02 0.0269
62 -0.000386 3.49e-03 0.009893 1.032 4.94e-05 0.0114
63 0.045589 -2.98e-02 0.062381 1.028 1.96e-03 0.0130
64 -0.036066 5.73e-02 0.081385 1.034 3.33e-03 0.0198
65 0.007180 -4.69e-03 0.009825 1.034 4.88e-05 0.0130
66 0.000106 3.01e-05 0.000428 1.031 9.25e-08 0.0100
67 0.130755 -9.90e-02 0.152845 1.011 1.16e-02 0.0172
68 -0.043225 8.89e-02 0.157364 1.001 1.23e-02 0.0147
69 0.217732 -1.78e-01 0.235489 0.997 2.74e-02 0.0234
70 0.051350 -4.20e-02 0.055537 1.042 1.56e-03 0.0234
71 -0.048752 2.13e-02 -0.092673 1.015 4.30e-03 0.0106
72 0.036982 -5.88e-02 -0.083452 1.034 3.51e-03 0.0198
73 -0.043075 2.81e-02 -0.058940 1.028 1.75e-03 0.0130
74 -0.041674 1.82e-02 -0.079219 1.019 3.15e-03 0.0106
75 0.054770 -8.70e-02 -0.123591 1.025 7.66e-03 0.0198
76 -0.042930 8.83e-02 0.156290 1.002 1.21e-02 0.0147
77 -0.050080 6.66e-02 0.078512 1.055 3.11e-03 0.0358
78 0.202715 -2.88e-01 -0.362994 0.953 6.34e-02 0.0269
79 -0.130979 1.74e-01 0.205342 1.034 2.11e-02 0.0358
80 -0.183652 1.50e-01 -0.198630 1.011 1.96e-02 0.0234
81 -0.066635 2.91e-02 -0.126666 1.000 7.98e-03 0.0106
82 -0.107103 9.20e-02 -0.111360 1.046 6.24e-03 0.0314
83 0.039181 -1.71e-02 0.074479 1.021 2.79e-03 0.0106
84 -0.008382 1.72e-02 0.030515 1.035 4.70e-04 0.0147
85 -0.007141 1.01e-02 0.012787 1.049 8.26e-05 0.0269
86 0.022704 6.43e-03 0.091256 1.014 4.17e-03 0.0100
87 0.012744 3.61e-03 0.051223 1.026 1.32e-03 0.0100
88 0.009216 -6.02e-03 0.012611 1.034 8.03e-05 0.0130
89 0.194797 -1.59e-01 0.210683 1.006 2.20e-02 0.0234
90 0.028126 7.96e-03 0.113049 1.005 6.37e-03 0.0100
91 -0.071197 5.83e-02 -0.077004 1.040 2.99e-03 0.0234
92 -0.219223 2.92e-01 0.343687 0.992 5.78e-02 0.0358
93 -0.020134 1.65e-02 -0.021776 1.045 2.39e-04 0.0234
94 -0.046526 3.52e-02 -0.054386 1.035 1.49e-03 0.0172
95 0.063330 -5.44e-02 0.065847 1.051 2.19e-03 0.0314
96 -0.086577 6.55e-02 -0.101203 1.026 5.14e-03 0.0172
97 0.042475 -2.77e-02 0.058120 1.029 1.70e-03 0.0130
98 0.004160 1.18e-03 0.016719 1.031 1.41e-04 0.0100
99 -0.264185 2.16e-01 -0.285730 0.975 3.98e-02 0.0234
100 0.025815 7.31e-03 0.103760 1.009 5.38e-03 0.0100
Running summary() on the object returned by influence.measures() will produce the table of diagnostics only for those observations flagged as influential (with an *).
Potentially influential observations of
lm(formula = weight ~ group_size, data = d) :
dfb.1_ dfb.grp_ dffit cov.r cook.d hat
14 -0.19 0.12 -0.26 0.93_* 0.03 0.01
29 0.06 0.02 0.24 0.92_* 0.03 0.01
Rather than rushing to delete outliers, consider re-running the model without these observations to assess the robustness of the results to inclusion of these influential points (i.e. a sensitivity analysis).
# rerun modeling excluding influential observations
mG_no_infl <- lm( weight ~ group_size, data=d[-c(14,29),])
# compare coefficients
cbind(mG=coef(mG), mG_no_infl=coef(mG_no_infl)) mG mG_no_infl
(Intercept) 3.12610110 3.14926223
group_size -0.01955759 -0.02200743
We can see that the estimates change slightly between the two models, but not enough to change our overall inferences (standard errors and p-values are also similar).
R packages for more diagnostics:
The simplest categorical variable is a dummy variable, where 0 and 1 denote not belonging and belonging to a category, respectively.
female male
52 48
0 1
52 48
In regression models, categorical variables with \(k\) categories will be represented by \(k - 1\) dummy variables.
The omitted category becomes the reference group.
For example, the social_rank variable has 3 levels: low, medium, and high.
We can create dummy variables for medium and high, which will make low the reference.
| social_rank | medium | high |
|---|---|---|
| low | 0 | 0 |
| medium | 1 | 0 |
| high | 0 | 1 |
d$rank_medium <- if_else(d$social_rank=="medium", 1, 0)
d$rank_high <- if_else(d$social_rank=="high", 1, 0)
# check dummy codes
# rank_medium = 0 & rank_high = 0 implies monkey has "low" social rank
table(d$social_rank, d$rank_medium)
0 1
high 24 0
low 29 0
medium 0 47
0 1
high 0 24
low 29 0
medium 47 0
Categorical variables in R can be represented using factors.
Factors are integer variables with labels for each value.
Function factor() converts both character and integer variables into factors
# Create a factor variable for social_rank variable
d$social_rank_f <- factor(d$social_rank)
levels(d$social_rank_f)[1] "high" "low" "medium"
By default, categories will be ordered alphabetically, but we can specify a particular order with levels=.
# Create a factor variable with custom level order
d$social_rank_f <- factor(d$social_rank, levels=c("low","medium","high"))
levels(d$social_rank_f)[1] "low" "medium" "high"
To convert a variable with integer values to a factor, use labels= argument
[1] male female male female male female male male female female
[11] male female female female female male female female female female
[21] female female female male female female female female male male
[31] female female male female female male female female female male
[41] female male male female female female male male male female
[51] female male female female male male female female male male
[61] female male male male female female male female male male
[71] male male male female male male male male female female
[81] female female male male male male male male male male
[91] female female male female female female male male female male
Levels: female male
Conveniently, when entering a factor into lm() as a predictor, R will automatically create dummy variables for each level of factor except for the first, which will thus be the reference group.
Below we model weight predicted by sex, with the expectation that male monkeys are heavier than female monkeys.
# Simple linear regression where weight is modeled as a function of individual's sex
# male
mS <- lm( weight ~ male, data=d)
# Extract model summary
sum_mS <- summary(mS)
sum_mS
Call:
lm(formula = weight ~ male, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.44358 -0.36511 0.00375 0.41178 1.38766
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.82878 0.08068 35.061 <2e-16 ***
malemale 0.22134 0.11645 1.901 0.0603 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5818 on 98 degrees of freedom
Multiple R-squared: 0.03555, Adjusted R-squared: 0.02571
F-statistic: 3.613 on 1 and 98 DF, p-value: 0.06028
# Add confidence intervals
ci_mS <- confint(mS)
sum_mS <- cbind(sum_mS$coefficients, ci_mS)
round(sum_mS, 4) # rounding Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 2.8288 0.0807 35.0611 0.0000 2.6687 2.9889
malemale 0.2213 0.1165 1.9007 0.0603 -0.0098 0.4524
A female capuchin (male=0) is expected to weigh 2.8288 kg.
Males are expected to weigh 0.2213 kg more than females on average. However, the 95% confidence interval, [-0.0098, 0.4524], suggests uncertainty in this estimate, with values near zero or even slightly negative values also consistent with the data.
# Calculate predicted weight for males and females
new_data <- data.frame(male=factor(c("female","male"), levels=c("female","male")))
predictions <- predict(mS, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("male","weight","lwr","upr")
new_data male weight lwr upr
1 female 2.828778 2.668669 2.988888
2 male 3.050122 2.883475 3.216770
# Plot: weight ~ sex
ggplot(d, aes(male, weight)) +
geom_point(color="grey32",size=0.75) +
geom_errorbar(data = new_data, aes(x=male, ymin=lwr, ymax=upr, color=male),
width=0.15,linewidth=1.5) +
geom_point(data = new_data, aes(male, weight, color=male), size = 3) +
scale_fill_manual(values = c("male"="lightcyan2","female"="orange2" ), name="Sex") +
scale_color_manual(values = c("male"="deepskyblue3","female"="orange2" ), name="Sex") +
labs(x = "Sex", y = "Weight") +
theme_bw()Higher ranking individuals should weigh more because rank translates into better access to resources.
# Simple linear regression where weight is modeled as a function of individual's social rank
mR <- lm( weight ~ social_rank_f, data=d)
# Extract model summary
sum_mR <- summary(mR)
sum_mR
Call:
lm(formula = weight ~ social_rank_f, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.56992 -0.33489 0.02641 0.40106 1.04668
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.7261 0.1044 26.105 < 2e-16 ***
social_rank_fmedium 0.1763 0.1328 1.327 0.18752
social_rank_fhigh 0.5254 0.1552 3.386 0.00103 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5624 on 97 degrees of freedom
Multiple R-squared: 0.1082, Adjusted R-squared: 0.08977
F-statistic: 5.882 on 2 and 97 DF, p-value: 0.003881
Above, we see that R entered dummy variables representing medium and high in the regression model, and the coefficients represent comparisons to the reference group, low.
# Add confidence intervals
ci_mR <- confint(mR)
sum_mR <- cbind(sum_mR$coefficients, ci_mR)
round(sum_mR, 4) # rounding Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 2.7261 0.1044 26.1054 0.0000 2.5188 2.9333
social_rank_fmedium 0.1763 0.1328 1.3273 0.1875 -0.0873 0.4398
social_rank_fhigh 0.5254 0.1552 3.3859 0.0010 0.2174 0.8334
A low ranking capuchin (reference group) is predicted to weigh 2.7261 kg.
Medium ranking capuchin are expected to weigh 0.1763 kg more than low-ranking capuchin on average. However, the uncertainty surrounding the estimate (95% CI = [-0.0873, 0.4398] ) makes clear inference difficult.
High-ranking capuchin are expected to weigh 0.5254 kg more than low-ranking capuchin, although considerably smaller and larger effects are also consistent with the data (95% CI = [0.2174, 0.8334] )
# Calculate predictions of weight for low, medium and high ranking capuchins
new_data <- data.frame(social_rank_f=factor(c("low","medium","high"), levels=c("low","medium","high")))
predictions <- predict(mR, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("social_rank_f","weight","lwr","upr")
new_data social_rank_f weight lwr upr
1 low 2.726080 2.518824 2.933337
2 medium 2.902337 2.739535 3.065138
3 high 3.251508 3.023683 3.479333
# Plot: weight ~ social_rank
ggplot(d, aes(social_rank_f, weight)) +
geom_point(color="grey32", size=0.75) +
geom_errorbar(data = new_data, aes(x=social_rank_f, ymin=lwr, ymax=upr),
width=0.2, color="orange2",linewidth=1.5) +
geom_point(data = new_data, aes(social_rank_f, weight), color = "orange2", size = 3) +
labs(x = "Social Rank", y = "Weight") +
theme_bw()Some capuchin monkeys live in a rainforest, while others live in a dry tropical forest. These two habitats differ the amount of rain: the dry tropical forest experiences drought about half of the year, while the rainforest does not. As a result, food is more abundant in the rainforest, and therefore, we expect that monkeys who live in the rainforest will generally be heavier.
forest_type_f factor variable with two levels: “dry” and “rain”.weight as a function of their forest_type_f.d$forest_type_f <- factor(d$forest_type, levels=c("dry","rain"))
# fit the model
mF <- lm( weight ~ forest_type_f, data=d)
# Extract model summary
sum_mF <- summary(mF)
# Confidence intervals
ci_mF <- confint(mF)
# Add confidence intervals to model summary
sum_mF <- cbind(sum_mF$coefficients,ci_mF)
round(sum_mF,4) #rounding to fit the screen Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 2.8334 0.0701 40.3988 0.0000 2.6942 2.9726
forest_type_frain 0.3080 0.1221 2.5226 0.0133 0.0657 0.5503
An average capuchin living in a dry forest is predicted to weigh 2.8334 kg.
Capuchins living in a rainforest are expected to weigh 0.308 kg more than those living in a dry forest, although very small differences are also consistent with the data (95% CI = [0.0657, 0.5503] ).
new_data <- data.frame(forest_type_f=factor(c("dry","rain"), levels=c("dry","rain")))
predictions <- predict(mF, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("forest_type_f","weight","lwr","upr")
new_data forest_type_f weight lwr upr
1 dry 2.833389 2.694207 2.972570
2 rain 3.141373 2.943055 3.339691
ggplot(d, aes(forest_type_f, weight)) +
geom_point(color="grey32", size=0.75) +
geom_errorbar(data = new_data, aes(x=forest_type_f, ymin=lwr, ymax=upr, color=forest_type_f),
width=0.2, linewidth=1.5) +
geom_point(data = new_data, aes(forest_type_f, weight, color=forest_type_f), size = 3) +
scale_fill_manual(values = c("dry"="orange3","rain"="skyblue3" ), name="Forest type") +
scale_color_manual(values = c("dry"="orange3","rain"="skyblue3" ), name="Forest type") +
labs(x = "Forest type", y = "Weight") +
theme_bw()Multiple regression allows us to estimate the unique effects of many predictors on the outcome simultaneously.
# Multiple linear regression where weight is modeled as a function of individual's sex and kin
mSK <- lm( weight ~ male + kin, data=d)
# Extract model summary
sum_mSK <- summary(mSK)
sum_mSK
Call:
lm(formula = weight ~ male + kin, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.31410 -0.25358 0.02095 0.24475 1.26425
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.05494 0.13132 15.648 < 2e-16 ***
malemale 0.77489 0.12569 6.165 1.62e-08 ***
kin 0.29372 0.04296 6.838 7.19e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4804 on 97 degrees of freedom
Multiple R-squared: 0.3492, Adjusted R-squared: 0.3358
F-statistic: 26.03 on 2 and 97 DF, p-value: 8.941e-10
# Add confidence intervals
ci_mSK <- confint(mSK)
sum_mSK <- cbind(sum_mSK$coefficients, ci_mSK)
round(sum_mSK, 4) # rounding Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 2.0549 0.1313 15.6480 0 1.7943 2.3156
malemale 0.7749 0.1257 6.1649 0 0.5254 1.0244
kin 0.2937 0.0430 6.8377 0 0.2085 0.3790
If two predictors are correlated (and are also both correlated with the outcome), their coefficients will change from models where they are entered alone.
In model mS, we found that males were expected to weigh 0.2213 kg more than females on average (95% CI = [-0.0098, 0.4524]) when sex was modeled alone.
Now, we see that, after controlling for the effect of kin, males are expected to weigh 0.7749 kg more than females (95% CI = [0.5254, 1.0244]).
Males generally have fewer kin within group than females (i.e. being male and number of kin are correlated), and having fewer close kin will generally lead to lower weights. So, when modeled alone, the effect of being male is smaller because it includes the effect of having fewer kin as well.
Omitted variable bias occurs when we do not include important predictors in regression models that are correlated with other predictors, resulting in coefficients that represent the combined effects of several predictors. Collecting data on all important explanatory predictors is crucial, therefore.
# Calculate predicted weight for males and females
### <b>
new_data <- expand.grid(male=factor(c("female", "male"), levels=c("female","male")),
kin = 0:4)
### </b>
predictions <- predict(mSK, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("male","kin","weight","lwr","upr")
new_data male kin weight lwr upr
1 female 0 2.054942 1.794303 2.315581
2 male 0 2.829833 2.678091 2.981576
3 female 1 2.348661 2.156562 2.540759
4 male 1 3.123552 2.984299 3.262805
5 female 2 2.642380 2.499524 2.785236
6 male 2 3.417271 3.243219 3.591324
7 female 3 2.936099 2.800265 3.071933
8 male 3 3.710990 3.474909 3.947071
9 female 4 3.229818 3.053661 3.405974
10 male 4 4.004709 3.695337 4.314081
pSK <- ggplot(d, aes(kin, weight)) +
geom_point() +
geom_ribbon(data = new_data, aes(x = kin, ymin = lwr, ymax = upr, fill=male), alpha = 0.5) +
geom_line(data = new_data, aes(kin, weight, color=male), linewidth = 1) +
scale_fill_manual(values = c("male"="lightcyan2","female"="orange2" ), name="Sex") +
scale_color_manual(values = c("male"="deepskyblue4","female"="orange4" ), name="Sex") +
labs(x = "Number of close kin", y = "Weight") +
theme_bw()
pSKFor some questions, it may be useful to put some or all of the predictors on a standardized scale:
Predictors can be standardized by dividing values by their estimated standard deviation. The resulting standardized predictor is measured in standard deviation units.
The scale() function standardizes and centers variables (subtracts the mean so that the mean is now zero).
# Multiple linear regression with standardized predictors
mGK <- lm( weight ~ scale(group_size) + scale(kin), data=d)
# Extract model summary
sum_mGK <- summary(mGK)
sum_mGK
Call:
lm(formula = weight ~ scale(group_size) + scale(kin), data = d)
Residuals:
Min 1Q Median 3Q Max
-1.20813 -0.39458 -0.00674 0.39713 1.33433
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.93502 0.05638 52.061 < 2e-16 ***
scale(group_size) -0.05721 0.05670 -1.009 0.31548
scale(kin) 0.17876 0.05670 3.153 0.00215 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5638 on 97 degrees of freedom
Multiple R-squared: 0.1036, Adjusted R-squared: 0.08517
F-statistic: 5.608 on 2 and 97 DF, p-value: 0.004957
# Add confidence intervals
ci_mGK <- confint(mGK)
sum_mGK <- cbind(sum_mGK$coefficients, ci_mGK)
round(sum_mGK, 4) # rounding Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 2.9350 0.0564 52.0606 0.0000 2.8231 3.0469
scale(group_size) -0.0572 0.0567 -1.0090 0.3155 -0.1698 0.0553
scale(kin) 0.1788 0.0567 3.1526 0.0022 0.0662 0.2913
Coefficients estimated for standardized predictors are interpreted as the expected change in the outcome per one-standard-deviation increase in the predictor.
Centering predictors changes the interpretation of the intercept to be the expected outcome when the centered predictors are at their mean (which is now zero).
Interaction terms in regression model allow the effect of one predictor to vary with levels of another predictor.
A variable representing the interaction of two variables is formed by taking the product of those two variables.
Generally, the two component variables and the interaction variable are all entered into the regression together.
* operator between predictors to request that both the interaction and the two component variables be added to lm() modelBelow, we model the interaction of number of kin and group size. Kin number should have greater effect in big groups, because competition increases with group size and coalitionary support from kin matters more.
# Multiple linear regression with interaction between two continuous variables
mKG <- lm( weight ~ kin + group_size + kin*group_size, data=d)
# Extract model summary
sum_mKG <- summary(mKG)
sum_mKG
Call:
lm(formula = weight ~ kin + group_size + kin * group_size, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.17226 -0.37270 -0.00907 0.39110 1.33113
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.35099 0.28257 11.859 <2e-16 ***
kin -0.12715 0.12073 -1.053 0.2949
group_size -0.06214 0.02669 -2.328 0.0220 *
kin:group_size 0.02459 0.01133 2.170 0.0324 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5533 on 96 degrees of freedom
Multiple R-squared: 0.1456, Adjusted R-squared: 0.1189
F-statistic: 5.452 on 3 and 96 DF, p-value: 0.001668
# Add confidence intervals
ci_mKG <- confint(mKG)
sum_mKG <- cbind(sum_mKG$coefficients, ci_mKG)
round(sum_mKG, 4) # rounding Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 3.3510 0.2826 11.8591 0.0000 2.7901 3.9119
kin -0.1272 0.1207 -1.0531 0.2949 -0.3668 0.1125
group_size -0.0621 0.0267 -2.3283 0.0220 -0.1151 -0.0092
kin:group_size 0.0246 0.0113 2.1704 0.0324 0.0021 0.0471
The interpretation of the intercept is the same as before: the expected weight for a capuchin with no close kin and from a group size of zero is 3.351 kg.
The interpretation of coefficients for variables involved in an interaction is different from a main effects model. Namely, these are the simple or conditional effects when the other variable is zero.
The interaction coefficient expresses the change in a variable’s effect when the interacting variable is increased by one-unit:
The effect of kin at any particular group size can be calculated by adding the coefficient for kin, 0.1272, to the interaction coefficient, 0.0246, multiplied by group size.
\[\begin{aligned} kin.effect_{(group\_size=g)} &= - 0.1272 + 0.0246*g \\ kin.effect_{(group\_size=2)} &= - 0.1272 + 0.0246*10 \\ &= 0.1188 \end{aligned}\]
Interactions can be difficult to understand from just examining the coefficients, so graphs of model predictions can greatly help.
# Calculate predictions
new_data <- expand.grid(group_size=c(5,10,15), kin = c(0,2,4))
predictions <- predict(mKG, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("group_size","kin","weight","lwr","upr")
new_data group_size kin weight lwr upr
1 5 0 3.040281 2.721467 3.359095
2 10 0 2.729567 2.559185 2.899950
3 15 0 2.418854 2.107790 2.729918
4 5 2 3.031873 2.835853 3.227893
5 10 2 2.967053 2.855005 3.079101
6 15 2 2.902233 2.691468 3.112999
7 5 4 3.023465 2.667883 3.379046
8 10 4 3.204539 3.001180 3.407898
9 15 4 3.385612 3.019975 3.751250
### <b>
# Subset dataset
dsub <- d[d$group_size%in%c(5,10,15) & d$kin%in%c(0,2,4),]
### </b>
# Plot
pKxG1 <- ggplot(dsub, aes(kin, weight)) +
geom_point() +
geom_ribbon(data=new_data, aes(x=kin, ymin=lwr, ymax=upr, fill=factor(group_size)), alpha=0.25) +
geom_line(data=new_data, aes(kin, weight, color=factor(group_size)), linewidth = 1) +
facet_grid(. ~ group_size, scales="free") +
scale_fill_manual(values=c("5"="skyblue","10"="deepskyblue3" ,"15"="deepskyblue4"),
name="Group size") +
scale_color_manual(values = c("5"="skyblue","10"="deepskyblue3","15"="deepskyblue4") ,
name="Group size") +
labs(x = "Number of Kin", y = "Weight") +
theme_bw()
pKxG1The impact of the number of kin depends on group size. In small groups (n=5), there is essentially no effect of kinship on an individual’s weight. However, in large groups (n=15), individuals are expected to weigh more with each additional kin member.
### <b>
# Subset dataset
dsub <- d[d$group_size%in%c(5,10,15) & d$kin%in%c(0,2,4),]
### </b>
pKG2 <- ggplot(dsub, aes(group_size, weight)) +
geom_point() +
geom_ribbon(data = new_data, aes(x = group_size, ymin = lwr, ymax = upr, fill=factor(kin)), alpha = 0.25) +
geom_line(data = new_data, aes(group_size, weight, color=factor(kin)), linewidth = 1) +
scale_fill_manual(values = c("0"="orange2","2"="salmon1" ,"4"="orange4"),name="Number of Kin" ) +
scale_color_manual(values = c("0"="orange2","2"="salmon1","4"="orange4" ), name="Number of Kin") +
labs(x = "Group size", y = "Weight") +
theme_bw()
pKG2The effect of group size on weight depends on the number of kin that an individual has. Individuals with no kin members in their group are expected to weigh less as the group size increases. Individuals who have four kin members in the group are expected to weigh more as the group size increases.
Higher ranking individuals should weigh more than lower ranking individuals. The effect should be stronger for males in comparison to females.
# Multiple linear regression with interaction between two continuous variables
mSR <- lm( weight ~ male*social_rank_f, data=d)
# Extract model summary
sum_mSR <- summary(mSR)
sum_mSR
Call:
lm(formula = weight ~ male * social_rank_f, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.3207 -0.3713 -0.0326 0.3559 1.7507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.7724 0.1356 20.453 <2e-16 ***
malemale -0.1034 0.2025 -0.511 0.6107
social_rank_fmedium 0.1299 0.1660 0.782 0.4359
social_rank_fhigh -0.3067 0.3031 -1.012 0.3142
malemale:social_rank_fmedium 0.1034 0.2641 0.392 0.6963
malemale:social_rank_fhigh 1.0463 0.3594 2.911 0.0045 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5422 on 94 degrees of freedom
Multiple R-squared: 0.1966, Adjusted R-squared: 0.1538
F-statistic: 4.599 on 5 and 94 DF, p-value: 0.0008492
# Add confidence intervals
ci_mSR <- confint(mSR)
sum_mSR <- cbind(sum_mSR$coefficients, ci_mSR)
round(sum_mSR, 3) # rounding Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 2.772 0.136 20.453 0.000 2.503 3.042
malemale -0.103 0.202 -0.511 0.611 -0.505 0.299
social_rank_fmedium 0.130 0.166 0.782 0.436 -0.200 0.460
social_rank_fhigh -0.307 0.303 -1.012 0.314 -0.908 0.295
malemale:social_rank_fmedium 0.103 0.264 0.392 0.696 -0.421 0.628
malemale:social_rank_fhigh 1.046 0.359 2.911 0.004 0.333 1.760
The expected weight for a low ranking capuchin female is 2.772 kg.
Low ranking males are expected to weigh 0.103 kg less than low ranking females, (95% CI = [-0.505, 0.299]).
For females, we expect medium ranking monkeys to weigh 0.13 kg more than low ranking monkeys, (95% CI = [-0.2, 0.46]).
For females, we expect high ranking monkeys to weigh 0.307 kg less than low ranking monkeys, (95% CI = [-0.908, 0.295]).
The interaction coefficients:
Compared to females, the difference between medium-ranking and low-ranking capuchins is increased by 0.103 kg for males, (95% CI = [-0.421, 0.628])
Compared to females, the difference between high-ranking and low-ranking capuchins is increased by 1.046 kg for males, (95% CI = [0.333, 1.76]).
Compared to low-ranking monkeys, the difference between males and females is increased by 0.103 kg for medium-ranking monkeys, (95% CI = [-0.421, 0.628]).
Compared to low-ranking monkeys, the difference between males and females is increased by 1.046 kg for high-ranking monkeys, (95% CI = [0.333, 1.76]).
# Calculate predictions
new_data <- expand.grid(male=factor(c("female","male"), level=c("female","male")),
social_rank_f = factor(c("low","medium","high"), level=c("low","medium","high")))
predictions <- predict(mSR, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("male","social_rank_f","weight","lwr","upr")
new_data male social_rank_f weight lwr upr
1 female low 2.772431 2.503290 3.041572
2 male low 2.669033 2.370449 2.967618
3 female medium 2.902331 2.712019 3.092642
4 male medium 2.902350 2.624383 3.180317
5 female high 2.465748 1.927467 3.004029
6 male high 3.408660 3.167933 3.649386
pSR <- ggplot(d, aes(social_rank_f, weight)) +
geom_point(color="grey32",size=0.75) +
geom_errorbar(data = new_data, aes(x = social_rank_f, ymin = lwr, ymax = upr, color=male),
width=0.2, linewidth=1) +
geom_point(data = new_data, aes(social_rank_f, weight, color=male), size = 2) +
scale_fill_manual(values = c("male"="skyblue3","female"="sandybrown" ), name="Sex") +
scale_color_manual(values = c("male"="skyblue3","female"="sandybrown" ), name="Sex") +
labs(x = "Social rank", y = "Weight") +
theme_bw()
pSRThere seems to be no difference in expected weights for low ranking and medium ranking males and females. High ranking males are expected to weigh more than high ranking females.
The presence of kin should have greater effect on female weight compared to male weight.
# Multiple linear regression with interaction between a continuous var and a categorical var
mSxK <- lm( weight ~ male*kin, data=d)
# Extract model summary
sum_mSxK <- summary(mSxK)
sum_mSxK
Call:
lm(formula = weight ~ male * kin, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.36506 -0.26226 0.03128 0.27863 1.12248
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.81203 0.14638 12.379 < 2e-16 ***
malemale 1.15958 0.16941 6.845 7.20e-10 ***
kin 0.38592 0.05004 7.713 1.15e-11 ***
malemale:kin -0.28123 0.08739 -3.218 0.00176 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4588 on 96 degrees of freedom
Multiple R-squared: 0.4126, Adjusted R-squared: 0.3942
F-statistic: 22.48 on 3 and 96 DF, p-value: 4.163e-11
# Add confidence intervals
ci_mSxK <- confint(mSxK)
sum_mSxK <- cbind(summary(mSxK)$coefficients, ci_mSxK)
round(sum_mSxK, 3) # rounding Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 1.812 0.146 12.379 0.000 1.521 2.103
malemale 1.160 0.169 6.845 0.000 0.823 1.496
kin 0.386 0.050 7.713 0.000 0.287 0.485
malemale:kin -0.281 0.087 -3.218 0.002 -0.455 -0.108
The expected weight for a a female capuchin with no kin is 1.812 kg.
For capuchins with no kin, we expect males to weigh 1.16 kg more than females, (95% CI = [0.823, 1.496]).
For females, we expect each additional kin to increase a monkey’s weight by 0.386 kg, (95% CI = [0.287, 0.485]).
The interaction coefficient:
For each additional kin, we expect males to weigh 0.281 kg less in comparison to females, (95% CI = [-0.455, -0.108]).
Compared to females, we expect each additional kin to decrease a monkey’s weight by 0.281 kg for males, (95% CI = [-0.455, -0.108]).
# Calculate predictions
new_data <- expand.grid(male=factor(c("female","male"), level=c("female","male")),kin = 0:4)
predictions <- predict(mSxK, newdata=new_data, interval="confidence")
new_data <- cbind(new_data, as.data.frame(predictions))
names(new_data) <- c("male","kin","weight","lwr","upr")
new_data male kin weight lwr upr
1 female 0 1.812028 1.521472 2.102584
2 male 0 2.971605 2.802334 3.140876
3 female 1 2.197948 1.992263 2.403632
4 male 1 3.076295 2.940134 3.212456
5 female 2 2.583868 2.442730 2.725006
6 male 2 3.180985 2.959901 3.402068
7 female 3 2.969787 2.838395 3.101180
8 male 3 3.285675 2.939746 3.631603
9 female 4 3.355707 3.170402 3.541013
10 male 4 3.390364 2.909838 3.870891
# Plot: weight ~ sex*kin
### <b>
pSxK <- ggplot(d, aes(kin, weight, color=male)) +
geom_point() +
### </b>
geom_ribbon(data=new_data, aes(x=kin, ymin=lwr, ymax=upr, fill=male), alpha = 0.5, color=NA ) +
geom_line(data=new_data, aes(kin, weight, color=male), linewidth = 1) +
scale_fill_manual(values = c("male"="lightcyan2","female"="orange2" ), name="Sex") +
scale_color_manual(values = c("male"="deepskyblue4","female"="orange4" ), name="Sex") +
### <b>
geom_hline(yintercept=mean(d$weight[d$male=="female"]), linetype="dashed", color = "orange2") +
geom_hline(yintercept=mean(d$weight[d$male=="male"]), linetype="dashed", color="deepskyblue3") +
### </b>
labs(x = "Number of Kin", y = "Weight") +
theme_bw()
pSxKThe presence of kin increases weight, with stronger effect expected in females compared to males.
The data is simulated.
\[ \begin{aligned} \text{weight}_i &= \beta_0 + \beta_S \,\text{male}_i + \beta_{Rm}\,\text{social_rank_medium}_i + \beta_{Rh}\,\text{social_rank_high}_i \\ &\quad + \beta_G\,\text{group_size}_i + \beta_K\,\text{kin}_i + \beta_F\,\text{rain_forest}_i \\ &\quad + \beta_{KG}\,(\text{kin}_i \times \text{group_size}_i) \\ &\quad + \beta_{SRm}\,(\text{male}_i \times \text{social_rank_medium}_i) \\ &\quad + \beta_{SRh}\,(\text{male}_i \times \text{social_rank_high}_i) \\ &\quad + \beta_{SK}\,(\text{male}_i \times \text{kin}_i) \\ &\quad + \epsilon_i \end{aligned} \]