This workshop will teach you how to analyze and visualize interactions in regression models in R both using the emmeans package and with base R coding.
Topics discussed in the workshop:
Review of linear regression
interpreting coefficients
dummy variables for categorical predictors
main effects models
Introduction to the emmeans package
estimated marginal means
estimating effects
visualizing effects
Models with interactions
why we model interactions
product terms
general interpretation of coefficients in models with interactions
Analyzing, intepreting, and visualizing two-way interactions with emmeans
interactions of two continuous variables
interactions of two categorical variables
interactions of a categorical variable and a continuous variable
interactions in generalized linear models (logistic regression)
Do-it-yourself (DIY) analysis and visualizations of interactions
Estimating simple effects through re-centering and changing reference groups
Predictions
Graphing predictions to visualize effects
Workshop packages
This workshop requires the emmeans and ggplot2 packages. This workshop is interactive, so if you wish to participate, please load both packages into R now
library(emmeans)library(ggplot2)
Workshop data set
The workshop data set contains data from an experiment of mice being fed 3 different diets.
Variables
weight: weight in grams, the dependent variable
age: age in weeks, from 2 to 16 weeks in increments of 2
diet: diet type, one of high-“fat”, high-“starch”, or “control”
sex: male or female
litter: number of pups in litter, from 2 to 10
hyper: 0=no hypertension, 1=hypertension
Note: the data are artificial. Do not make any inferences regarding mouse weights from these data.
# read in datamouse <-read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/03/mouse.csv")# display first few rowshead(mouse)
age sex diet litter weight hyper
1 2 male control 2 14.76288 0
2 2 female fat 4 24.18650 0
3 2 male starch 6 21.14797 0
4 2 female control 8 21.60721 0
5 2 male fat 10 20.22701 0
6 2 female starch 2 16.17364 0
Review of regression
Regression with a continuous predictor
First, we review the interpretation of the coefficients of a regression model with a single continuous predictor.
In general, a regression coefficient for a predictor will be interpreted as the expected change in the dependent variable for a one-unit change in the predictor.
Below we model weight predicted by age:
\[{weight}_i = b_0 + b_{age}age_i + \epsilon_i\]
We use the lm() function to run a linear regression model.
# simple linear regressionm_age <-lm(weight ~ age, data=mouse)# regression tablsummary(m_age)
Call:
lm(formula = weight ~ age, data = mouse)
Residuals:
Min 1Q Median 3Q Max
-21.7664 -3.4502 -0.0978 3.3996 17.5530
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.86225 0.34484 57.60 <2e-16 ***
age 0.53043 0.03414 15.54 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.42 on 1198 degrees of freedom
Multiple R-squared: 0.1677, Adjusted R-squared: 0.167
F-statistic: 241.3 on 1 and 1198 DF, p-value: < 2.2e-16
Interpretation of coefficients:
\(\hat{b}_0=19.86\), the intercept is the expected outcome when all of the predictors equal zero. Thus, this is the expected weight for a mouse at week 0.
\(\hat{b}_{age}=0.53\), the slope of age. Mice are expected to gain .53 grams of weight per week.
Review of dummy variables
In regression models, categorical variables are typically represented by one or more dummy or indicator variables:
0 means not belonging to the category
1 means belonging to the category
A variable with \(k\) categories can be represented by \(k-1\) dummy variables.
For example, we can represent sex (2 categories) with a single dummy variable that equals 0 for males and equals 1 for females.
sex
female
male
0
female
1
female
1
male
0
For a 3-level categorical variable, we will need two dummies:
diet
fat
starch
fat
1
0
starch
0
1
starch
0
1
control
0
0
fat
1
0
We do not need a third dummy variable for control, because being fat=0 and starch=0 implies membership to the control group.
The omitted category is often referred to as the reference group in regression models.
Factors in R
Factors can be used to represent categorical variables in R.
Essentially, factors are integer variables with labels for each integer value.
We use the factor() function to convert a character variable (or integer variable) to a factor.
To see the distinct categories of the factor in the order of their integer representation, use levels() on the factor variable.
# character vectorincome <-c("low", "mid", "high", "high", "low", "mid")# convert it to factorf_income <-factor(income)# distinct categories, in order (first value is 1, second values is 2, ...)levels(f_income)
[1] "high" "low" "mid"
The factor() function will use alphabetical ordering to order the categories by default.
If we want to specify the order, we use the levels= argument in factor():
# specifying the order of levelsf_income <-factor(income, levels=c("low", "mid", "high"))levels(f_income)
[1] "low" "mid" "high"
Although we will rarely need to use them, we can see the underlying integer values by converting the factor to a numeric vector:
# underlying integer values for factoras.numeric(f_income)
[1] 1 2 3 3 1 2
If we are starting with an integer variable that we wish to convert to a factor, we use labels= in factor() to apply the text labels which will then become the levels themselves:
# 0=low, 1=mid, 2=highnum_income <-c(0, 1, 2, 2, 0, 1)f_num_income <-factor(num_income, labels=c("low", "mid", "high"))# labels become the levelslevels(f_num_income)
[1] "low" "mid" "high"
# the variable now displays the labels/levelsf_num_income
[1] low mid high high low mid
Levels: low mid high
Important: In regression models, R will automatically create dummy variables for each level of factor except for the first, which will thus be the reference group.
Let’s make sex and diet into factors.
# convert mouse vars to factorsmouse$sex <-factor(mouse$sex, levels=c("male", "female")) # male is referencemouse$diet <-factor(mouse$diet, levels=c("control", "fat", "starch")) # control is referencestr(mouse)
Categorical variables with \(k\) categories are represented in regression models by \(k-1\) dummy variables. We cannot add all \(k\) dummy variables to the model due to perfect collinearity.
The dummy variable omitted from the model represents the reference group, and the coefficients will represent differences from this reference group.
Let’s interpret the coefficients for a model where diet is predicting weight:
# main effects modelm_me <-lm(weight ~ sex + diet + age, data=mouse)summary(m_me)
Call:
lm(formula = weight ~ sex + diet + age, data = mouse)
Residuals:
Min 1Q Median 3Q Max
-20.082 -3.242 -0.042 3.274 15.236
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.17791 0.40983 44.354 < 2e-16 ***
sexfemale -1.60246 0.29242 -5.480 5.18e-08 ***
dietfat 4.00950 0.35814 11.195 < 2e-16 ***
dietstarch 3.44720 0.35814 9.625 < 2e-16 ***
age 0.53043 0.03191 16.625 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.065 on 1195 degrees of freedom
Multiple R-squared: 0.275, Adjusted R-squared: 0.2726
F-statistic: 113.3 on 4 and 1195 DF, p-value: < 2.2e-16
Interpretation of regression coefficients:
\(\hat{b}_0=18.18\), the predicted weight for a male mouse on the control diet at age=0.
\(\hat{b}_{fem}=-1.60\), females are expected to weigh 1.6 grams less than males, holding diet and age constant
\(\hat{b}_{fat}=4.00\), mice on the high-fat diet are expected to weigh 4 grams more than mice on the control diet, holding sex and age constant
\(\hat{b}_{sta}=3.45\), mice on the high-starch diet are expected to weigh 3.45 grams more than mice on the control diet, holding sex and age constant
\(\hat{b}_{age}=0.53\), mice are expected to gain 0.53 grams per week, holding sex and diet constant
Exercise 1
Load the auction data set from our website:
# exercise data setauction <-read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/03/auction.csv")
These data describe how much participants were willing to bid on items in a blind auction (all bidders make a single bid simultaneously), depending on characteristics of the item and how long the auction would last:
bid: the amount of the bid in dollars
hours: the number of hours before the auction ends
quality: the quality of the item, as judged by the participant based on pictures and a description of the item; ranges from 0 (lowest) to 5 (highest)
rarity: how hard to find the item is, one of “common”, “rare”, “unique”
budget: the participants budget, categorized as “small” or “large”
Currently, rarity and budget are character variables. Convert them both to factors, with the ordering of the levels for rarity and budget as (common, rare, unique) and (small, large), respectively.
Run a main effects linear regression modeling bid (the outcome) predicted by hours and rarity.
Interpret the coefficients
The emmeans package
Overview of the emmeans package
The emmeans package can:
compute estimated marginal means (EMMs), which are predicted outcomes at specific values of predictors, possibly averaged over values of other predictors
compute contrasts between EMMs, which can be useful to estimate effects among levels of a categorical variable (that may be involved in an interaction)
or calculate slopes (linear trends) for continuous variables
generate graphs of model estimates and predictions, including interactions
The procedure for estimating effects (and contrasts) will differ depending on whether the variable whose effects we seek is categorical or continuous.
For effects of a categorical variable:
use emmeans() to estimate EMMs across values of predictors (possibly involved in an interaction) and store these in an object of class emmGrid
use contrast() on the emmGrid object (created by emmeans()) to estimate effects and differences between effects
use emmip() on the emmGrid object to graph the interaction
For effects (slopes) of a continuous variable we skip the emmeans() step and instead:
use emtrends() on the object created by the regression function (e.g. lm object) to estimate slopes of the continuous variable
use emmip() on the regression object to graph the interaction
We can add sex to specs= to get EMMs at each crossing of diet and sex (and at the mean age):
# EMMs for each crossing of diet and sexemm_me2 <-emmeans(m_me, specs=c("diet", "sex"))emm_me2
diet sex emmean SE df lower.CL upper.CL
control male 23.0 0.292 1195 22.4 23.5
fat male 27.0 0.292 1195 26.4 27.5
starch male 26.4 0.292 1195 25.8 27.0
control female 21.3 0.292 1195 20.8 21.9
fat female 25.4 0.292 1195 24.8 25.9
starch female 24.8 0.292 1195 24.2 25.4
Confidence level used: 0.95
Once we have created an emmeans object (class=emmGrid), we next specify it in contrast() to estimate differences between the EMMs or in emmip() to graph the interaction.
A quick note about averaging over categorical covariates
Variables not specified in specs= are averaged over. For categorical variables, the default is to give equal weight to each category. This typically makes sense for experimental data where there is counterbalancing of categorical covariates.
In observational data, we instead often want the EMMs to reflect the proportional representation of the categories in the population. For example, races are not equally represented in the US population, and if we want our EMMs to reflect the proportional representation of race in the US, we can specify weights=proportional in emmeans:
# when averaging over sex for these EMMs, use the frequencies of sex # in the data to weightemm_me <-emmeans(m_me, specs=c("diet"), weights="proportional") # not run, because all factors are balanced in the data
Note: The method of averaging over covariates will typically only affect the estimation of EMMs themselves and should not affect estimation of simple effects and interaction contrasts (because those covariate effects are differenced out).
Pairwise comparisons among factor levels with emmeans::contrast()
Although we get estimates of differences between the high fat diet and control as well as high starch diet and control, we do not get a direct estimate of the difference between high fat and high starch.
We can easily obtain all pairwise comparisons among levels of factor by running contrast() on the emmGrid object.
Important arguments for contrast():
the first argument to contrast() is the emmGrid object produced by emmeans()
method="pairwise" requests contrasts among all levels of a variable with each other
the default is method="eff", which requests contrasts of all levels with the average over all levels
simple="diet" requests contrasts among all levels of diet
# effects of dietcontrast(emm_me, method="pairwise", simple="diet")
contrast estimate SE df t.ratio p.value
control - fat -4.010 0.358 1195 -11.195 <0.0001
control - starch -3.447 0.358 1195 -9.625 <0.0001
fat - starch 0.562 0.358 1195 1.570 0.2591
Results are averaged over the levels of: sex
P value adjustment: tukey method for comparing a family of 3 estimates
Multiple comparison adjustments for emmeans
In the output of contrast() estimating the effects of diet, a note at the bottom of the output indicates that p-values have been adjusted using Tukey’s method.
contrast estimate SE df t.ratio p.value
control - fat -4.010 0.358 1195 -11.195 <0.0001
control - starch -3.447 0.358 1195 -9.625 <0.0001
fat - starch 0.562 0.358 1195 1.570 0.2591
Results are averaged over the levels of: sex
P value adjustment: tukey method for comparing a family of 3 estimates
The emmeans package will by default apply adjustments that vary by the method= used.
An adjust= option can be added to many functions from the emmeans package to specify which method of adjustment to use.
For example, we can request no p-value adjustment with adjust="none"
# no p-value adjustmentcontrast(emm_me, method="pairwise", simple="diet", adjust="none")
contrast estimate SE df t.ratio p.value
control - fat -4.010 0.358 1195 -11.195 <0.0001
control - starch -3.447 0.358 1195 -9.625 <0.0001
fat - starch 0.562 0.358 1195 1.570 0.1167
Results are averaged over the levels of: sex
See ?summary.emmGrid for a list of the many adjustments that are possible.
A discussion of multiple hypothesis adjustments is beyond the scope of this workshop, and henceforth we will not re-specify the adjustment.
Estimating slopes with emmeans::emtrends()
To estimate slopes of continuous variables, we skip the emmeans() step and use emtrends(), specifying:
the first argument must be a fitted model object, not an emmGrid object
specs=, a name of a variable (or vector of names of variables) across whose levels the slopes will be estimated
e.g. if a factor variable is specified, slopes will be calculated at each level of the factor
var=, the variable whose slopes we want to estimate
Below we request estimates of the slope of age for each diet group:
# slopes of age across dietsemtrends(m_me, specs="diet", var="age")
diet age.trend SE df lower.CL upper.CL
control 0.53 0.0319 1195 0.468 0.593
fat 0.53 0.0319 1195 0.468 0.593
starch 0.53 0.0319 1195 0.468 0.593
Results are averaged over the levels of: sex
Confidence level used: 0.95
The slope estimates are the same because this is a main effects model, where the effects of variables are constrained to be the same across levels of other variables.
Graphing model predictions with emmeans::emmip()
The emmip() function is designed to create interaction plots, but can be used to plot predictions from models without interactions.
Graphs of model predictions across values of the predictors are a common way to visualize predictor effects.
Important arguments for emmip() (see ?emmip for many more):
the first argument is
the emmGrid object if we wish to visualize the effects of a categorical variable
the regression model object if we wish to visualize the effects of a continuous variable
formula=, the second argument is a formula of the form break.variable ~ x.variable, resulting in x.variable being mapped to the x-axis, and separate lines/colors being produced across levels of break.variable.
Note: if an object created by emmeans() is used as the first argument (i.e. an emmGrid object), then only those variables specified in specs= in emmeans() can be specified in this formula
CIs=, requests confidence intervals and is FALSE by default.
xlab=, ylab=, tlab=, labels for the x-axis, y-axis, and moderator variable
Generally, the variable whose effect we wish to visualize should be mapped to the x-axis.
For example, let’s visualize the effects of diet for each sex (we will need to use the emm_me2 in which both sex and diet were specified in specs=):
# graphs of diet effect for each sex# diet is on the x-axis# different lines/colors by sex# 95% confidence intervalsemmip(emm_me2, sex ~ diet, CIs=TRUE)
The graph above is actually a plot of the EMMs estimated for each combination of diet and sex in the object emm_me2, and it allows us to visualize the differences between the expected weights for each diet and sex.
Graphing slopes with emmeans::emmip()
We can also visualize slopes with emmip(), using the following procedure:
use emmip() directly by specifying the regression model object as the first argument, not an emmGrid object
i.e., skip the emmeans() step
specify the formula with the break variable on the left side of ~ and the slope variable on the right side
use at= to specify the range of values to plot for the slope variable inside of list()
Here we visualize the slopes of age for each diet group:
# model object is first argument to visualize slopes# age on x-axis# separate lines by diet# at= specifies we want estimates at ages 2, 9, and 16# 95% CIsemmip(m_me, diet ~ age, at=list(age=c(2,9,16)), CIs=TRUE)
For more confidence interval estimates, specify more values inside of the range of the slope variable, such as each integer between 2 and 16 for age…
# now we estimate at each integer between 2 and 16 for ageemmip(m_me, diet ~ age, at=list(age=2:16), CIs=TRUE)
… or each number between 2 and 16 in increments of 0.2:
# now we estimate at each integer between 2 and 16 for age# we also change the label for the y-axisemmip(m_me, diet ~ age, at=list(age=seq(2,16,0.2)), CIs=TRUE,ylab="predicted weight (g)")
Customizing emmip() plots with ggplot2
Conveniently, we can use ggplot2 functions and syntax to customize the plot:
# using ggplot2 code to modify emmip graphemmip(m_me, diet ~ age, at=list(age=seq(2,16,0.2)), CIs=TRUE) +theme_classic() +scale_color_manual(values=c("orange", "brown", "purple")) +labs(x="diet", y="predicted weight (g)") # can also use emmip xlab=, ylab=
If you’d like to build the plot from scratch using the emmeans estimates, you can save the data used to build the emmip() plot by specifying plotit=FALSE and saving the result to an object:
# saving emmip plot data to data setplotdata <-emmip(m_me, diet ~ age, at=list(age=seq(2,16,0.2)),CIs=TRUE,plotit=FALSE)head(plotdata)
diet age yvar SE df LCL UCL tvar xvar
control 2.0 18.4 0.338 1195 17.8 19.1 control 2.0
fat 2.0 22.4 0.338 1195 21.8 23.1 fat 2.0
starch 2.0 21.9 0.338 1195 21.2 22.5 starch 2.0
control 2.2 18.5 0.333 1195 17.9 19.2 control 2.2
fat 2.2 22.6 0.333 1195 21.9 23.2 fat 2.2
starch 2.2 22.0 0.333 1195 21.3 22.6 starch 2.2
Results are averaged over the levels of: sex
Confidence level used: 0.95
Then we can use ggplot2 functions to plot the data ourselves. Here we change the CI style from geom_pointrange to geom_ribbon:
# plotting with CI bandsggplot(plotdata, aes(x=age, y=yvar, color=diet, group=diet)) +geom_line() +geom_ribbon(aes(ymin=LCL, ymax=UCL, fill=diet), alpha=.5)
Exercise 2
Using the main effects model of bid regressed on hours and rarity (from the auction data) in Exercise 1:
Estimate the slopes of hours for each rarity.
Plot the slopes of hours for each rarity (from 2 to 9 hours).
Interactions and moderation models
Why model interactions?
An interaction allows the effect of a variable to depend on the value of another variable.
Main effects models assume the effect of a variable to be the same across values of another variable, whereas interactions relax this assumption.
If we enter the interaction of variable \(X\) and variable \(Z\) into a regression model, we can answer the following questions:
What are the effects of \(X\) at different values of \(Z\)?
The effects of \(X\) at specific levels of \(Z\) are often called simple effects or conditional effects
How different are the effects of \(X\) at different values of \(Z\)?
The same questions with the roles of \(X\) and \(Z\) reversed can also be answered.
The dependency of a variable’s effect on another variable is often called moderation or effect modification.
Product terms
An interaction between variables \(X\) and \(Z\) can be modeled with a product term (variable), \(XZ=X \times Z\).
Although we can create these variables ourselves and add them to the regression model, R provides a convenient syntax for interactions in regression models that does not require the product term to be in the data set.
We don’t usually need to create these variables ourselves because R provides a convenient syntax for regression model formulas to model interactions using only the component variables:
X:Z enters the product term \(XZ\) into the model
X*Z enters both of the lower order terms, \(X\) and \(Z\), as well as the product term \(XZ\) into the model
If either X or Z (or both) are factors, R will multiple each dummy variable representing that variable by the other variable.
The following two model specifications are equivalent.
# equivalent interaction notation in Rm_int1 <-lm(weight ~ diet + sex + diet:sex, data=mouse)m_int2 <-lm(weight ~ diet*sex, data=mouse)
Generally, we should include the lower order terms \(X\) and \(Z\) in a model with their interaction \(XZ\), because omitting them constrains certain simple effects to zero, which can be result in a poorly fitting model.
General interpretation of coefficients in interaction models
For a model that includes \(X\), \(Z\), and their interaction \(XZ\) predicting an outcome \(Y\):
\[Y_i = b_0 + b_XX_i + b_ZZ_i + b_{X \times Z}XZ_i + \epsilon_i\] the general interpretation of the coefficients will be:
\(b_0\), the intercept is again the expected outcome \(Y_i\) when all predictors equal zero, or \(X=0\), \(Z=0\), and \(XZ=0\)
\(b_X\), the simple effect of \(X\) when \(Z=0\)
commonly misinterpreted as the main effect of \(X\)
\(b_Z\), the simple effect of \(Z\) when \(X=0\)
\(b_{X \times Z}\), the change in the effect of \(X\) per unit-increase in \(Z\) (and vice-versa)
interactions allow the effect of \(X\) to vary with \(Z\), and the effect of \(Z\) to vary with \(X\)
The coefficient for the interaction is a difference between two simple effects. For example, \(b_{X \times Z}\) equals the difference in the simple effect of \(X\) when \(Z=1\) and the simple effect of \(X\) when \(Z=0\).
In short:
coefficients for lower order terms are interpreted as simple effects when the moderator is zero
interaction coefficients are interpreted as differences between simple effects
interaction coefficients will have 2 interpretations, reflecting that each variable in the interaction moderates the effect of the other
Interaction of two continuous variables
Regression model with an interaction of two continuous variables
Let’s interpret the coefficients of a regression model that includes:
We include sex in the model to discuss how to handle other covariates when visualizing an interaction.
# interaction of age and litterm_agelitter <-lm(weight ~ age*litter + sex, data=mouse)summary(m_agelitter)
Call:
lm(formula = weight ~ age * litter + sex, data = mouse)
Residuals:
Min 1Q Median 3Q Max
-22.5676 -3.5492 -0.0922 3.3112 16.0442
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.51989 0.81014 25.329 < 2e-16 ***
age 0.67363 0.07876 8.553 < 2e-16 ***
litter 0.02393 0.11991 0.200 0.8418
sexfemale -1.60246 0.30778 -5.207 2.26e-07 ***
age:litter -0.02387 0.01187 -2.010 0.0446 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.331 on 1195 degrees of freedom
Multiple R-squared: 0.1969, Adjusted R-squared: 0.1942
F-statistic: 73.24 on 4 and 1195 DF, p-value: < 2.2e-16
Interpretation:
\(\hat{b}_0=20.52\): a male mouse at 0 weeks from a litter of 0 pups is expected to weight 20.52 grams
\(\hat{b}_{age}=0.67\): a mouse from a litter of 0 pups is expected to gain 0.67 grams per week, holding sex constant
\(\hat{b}_{lit}=0.024\): at 0 weeks, mice are expected to weight 0.024 grams more per additional pup in the litter, holding sex constant
\(\hat{b}_{a \times l}=-0.024\): for each additional pup in the litter, the expected weekly increase in weight (the age effect) decreases by 0.024, holding sex constant
or, for each additional week, the expected change in weight for each additional pup (the litter effect) decreases by 0.024, holding sex constant
As we can see above,
the lower order terms are interpreted as simple effects of a variable when the interacting variable=0, (e.g. \(\hat{b}_{age}\) is the simple effect (slope) of the age when litter=0)
interaction terms are interpreted as a difference between two simple effects, (\(\hat{b}_{a \times l}\) is the difference between the simple effects of age for two litter values that differ by 1 (e.g. litter=1 vs litter=0))
With these coefficients, we can estimate the simple slope (effect) of age at different litter sizes (litter is the moderator here).
Simple effects not directly estimated by the regression will be calculated by adding the lower order term to the interaction term multiplied by the value of the moderator.
\[age.effect_{litter} = \hat{b}_{age} + \hat{b}_{a \times l}litter\] For example, the \(age\) effect when \(litter=0\) is:
At 8 weeks, for each additional pup in the birth litter, a mouse is expected to weigh 0.17 grams less.
Using emmeans::emtrends() to estimate simple slopes
Because we wish to estimate the simple effects (slopes) of a continuous variable, we will use emtrends().
Below, we will estimate the simple slopes of age at different values of litter. For this problem, litter is the moderator. When the moderator is continuous, we will have to choose values at which we wish to estimate simple effects/slopes of the other variable. Common choices are:
substantively meaningful values, such as ages 18 and 21 for US citizens, or clinical cutoffs (e.g. for blood pressure and hypertension)
minimum and maximum observed values
mean, mean+sd, mean-sd
Let’s estimate the effects of age at litter sizes 2, 6, and 10 (min, mean, and max).
Here, we will use the following arguments:
the first argument is the regression model object
specs=, the name of the moderator, "litter"
var=, the name of the variable whose simple effects we wish to estimate, "age"
at=, a list() containing the values of the moderator (litter) at which to estimate the simple effects
# simples slopes of age at litter=2,6,10age_by_litter <-emtrends(m_agelitter, # modelspecs="litter", # moderatorvar="age", # simple effects we wantat=list(litter=c(2, 6, 10))) # moderator valuesage_by_litter
litter age.trend SE df lower.CL upper.CL
2 0.626 0.0582 1195 0.512 0.740
6 0.530 0.0336 1195 0.465 0.596
10 0.435 0.0582 1195 0.321 0.549
Results are averaged over the levels of: sex
Confidence level used: 0.95
We can see the decreasing slope of age as litter increases.
If you need p-values for the age slopes, put the result of emtrends() inside of test():
# p-values for age slopestest(age_by_litter)
litter age.trend SE df t.ratio p.value
2 0.626 0.0582 1195 10.761 <0.0001
6 0.530 0.0336 1195 15.795 <0.0001
10 0.435 0.0582 1195 7.478 <0.0001
Results are averaged over the levels of: sex
Visualizing the interaction of two continuous variables with emmeans::emmip()
Now we will graph the slopes of age at specific values of litter using emmip():
to graph slopes, we use the regression model object as the first argument
litter is the break variable, and age is the x-variable, so the formula is litter ~ age
we specify values for both litter and age inside of a list() as an argument for at=
# graph of age slopes by litter valuesemmip(m_agelitter, litter ~ age, at=list(age=2:16, litter=c(2,6,10)),CIs=TRUE)
We see that mice from larger litters grow slower over time.
Exercise 3
Using the auction dataset:
Run a linear regression modeling bid predicted by hours, quality, and their interaction.
Estimate the simple slopes of hours at each quality from 0 to 5.
Graph the simple slopes of hours at each quality from 0 to 5.
Interaction of two categorical variables
Modeling the interaction of two categorical variables
Now we will model the interaction of 2 categorical variables in a model that includes:
diet
sex
interaction of diet and sex
age
The diet variable is represented by \(2\) dummy variables, while the sex variable is represented by \(1\) dummy variable, so the interaction will be represented by \(2 \times 1 = 2\) variables.
# interaction of diet and sexm_dietsex <-lm(weight ~ diet*sex + age, data=mouse)summary(m_dietsex)
Call:
lm(formula = weight ~ diet * sex + age, data = mouse)
Residuals:
Min 1Q Median 3Q Max
-19.8664 -3.2698 -0.0475 3.2184 15.4413
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.962272 0.458595 39.168 < 2e-16 ***
dietfat 4.011671 0.505993 7.928 5.07e-15 ***
dietstarch 4.091955 0.505993 8.087 1.49e-15 ***
sexfemale -1.171179 0.505993 -2.315 0.0208 *
age 0.530425 0.031875 16.641 < 2e-16 ***
dietfat:sexfemale -0.004334 0.715583 -0.006 0.9952
dietstarch:sexfemale -1.289503 0.715583 -1.802 0.0718 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.06 on 1193 degrees of freedom
Multiple R-squared: 0.2777, Adjusted R-squared: 0.274
F-statistic: 76.43 on 6 and 1193 DF, p-value: < 2.2e-16
Interpretation:
\(\hat{b}_0 = 17.96\): the intercept is the expected weight when all predictors equal 0, so a male on the control diet at week 0 is expected to weight 17.96 grams, \(\widehat{weight}_{con,mal,age=0}=17.96\)
\(\hat{b}_{fat}=4.01\): males on the high fat diet are expected to weigh 4.01 grams more than males on the control diet (i.e. the effect of the high fat diet when sex=0), holding age constant, \(\widehat{weight}_{fat,mal,age} - \widehat{weight}_{con,mal,age} = 4.01\)
\(\hat{b}_{starch}=4.09\): males on the high starch diet are expected to weigh 4.09 grams more than males on the control diet, holding age constant, \(\widehat{weight}_{sta,mal,age} - \widehat{weight}_{con,mal,age} = 4.09\)
\(\hat{b}_{fem}=-1.17\): females on the control diet are expected to weigh 1.17 grams less than males on the control diet (i.e. the sex effect when fat=0 and starch=0), holding age constant, \(\widehat{weight}_{con,fem,age} - \widehat{weight}_{con,mal,age} = -1.17\)
\(\hat{b}_{age}=0.53\): mice are expected to gain 0.53 grams per week, holding diet and sex constant, \(\widehat{weight}_{diet,sex,age} - \widehat{weight}_{diet,sex,age-1} = 0.53\)
\(\hat{b}_{f \times f}=-0.004\): the difference between high fat females and control females is .004 grams less than the difference between high fat males and control males (diet fat effect), holding age constant, \((\widehat{weight}_{fat,fem,age} - \widehat{weight}_{con,fem,age}) - (\widehat{weight}_{fat,mal,age} - \widehat{weight}_{con,mal,age})=-0.004\)
or, the difference between high fat females and high fat males is .004 grams less than the difference between control females and control males (sex effect), holding age constant, \((\widehat{weight}_{fat,fem,age} - \widehat{weight}_{fat,mal,age}) - (\widehat{weight}_{con,fem,age} - \widehat{weight}_{con,mal,age})=-0.004\)
\(\hat{b}_{s \times f}=-1.29\): the difference between high starch females and control females is 1.29 grams less than the difference between high starch males and control males (diet starch effect), holding age constant
or, the difference between high starch females and high starch males is 1.29 grams less than the difference between control females and control males (sex effect), holding age constant
Questions not directly answered by the coefficients:
how different are males on the high fat and high starch diets?
what are the simple effects of diet for females?
is (high fat - high starch) difference different between males and females?
These questions can be answered by functions in the emmeans pacakge (or by re-parameterizing the model).
Simple effects analysis with emmeans::contrast()
Two of the regression coefficients from the model with diet and sex interacted could be interpreted as the simple effects of diet for males.
We can estimate all of these simple effects easily with contrast() from emmeans.
The first step is creating an emmGrid object from the emmeans() function, and we should specify both variables involved in the interaction in specs=.
# EMMs for diet*sex interactionemm_dietsex <-emmeans(m_dietsex, specs=c("diet", "sex"))emm_dietsex
diet sex emmean SE df lower.CL upper.CL
control male 22.7 0.358 1193 22.0 23.4
fat male 26.7 0.358 1193 26.0 27.4
starch male 26.8 0.358 1193 26.1 27.5
control female 21.6 0.358 1193 20.9 22.3
fat female 25.6 0.358 1193 24.9 26.3
starch female 24.4 0.358 1193 23.7 25.1
Confidence level used: 0.95
Then we specify the emmGrid object as the first argument to contrast, as well as:
method="pairwise" for all pairwise comparisons of diet
simple="diet" requests the simple effects of diet (for each sex)
# simple effects of diet by sexcontrast(emm_dietsex, method="pairwise", simple="diet")
sex = male:
contrast estimate SE df t.ratio p.value
control - fat -4.0117 0.506 1193 -7.928 <0.0001
control - starch -4.0920 0.506 1193 -8.087 <0.0001
fat - starch -0.0803 0.506 1193 -0.159 0.9862
sex = female:
contrast estimate SE df t.ratio p.value
control - fat -4.0073 0.506 1193 -7.920 <0.0001
control - starch -2.8025 0.506 1193 -5.539 <0.0001
fat - starch 1.2049 0.506 1193 2.381 0.0458
P value adjustment: tukey method for comparing a family of 3 estimates
Above we see the estimated simple effects of diet for each sex.
Although the control - fat contrast looks similar between the sexes, the other two contrasts do not:
the interaction coefficient \(\hat{b}_{f \times f}=-0.004\) is the difference between the two control - fat contrasts (\(-4.0117 - -4.0073 = -0.004\))
the interaction coefficient \(\hat{b}_{s \times f}=-1.29\) is the difference between the two control - strach contrasts
the difference between the two fat - starch contrasts is not directly estimated in the model
For confidence intervals, we can put the contrast() inside of confint():
# 95% CI for simple effects of diet by sexconfint(contrast(emm_dietsex, method="pairwise", simple="diet"))
sex = male:
contrast estimate SE df lower.CL upper.CL
control - fat -4.0117 0.506 1193 -5.1991 -2.82
control - starch -4.0920 0.506 1193 -5.2793 -2.90
fat - starch -0.0803 0.506 1193 -1.2677 1.11
sex = female:
contrast estimate SE df lower.CL upper.CL
control - fat -4.0073 0.506 1193 -5.1947 -2.82
control - starch -2.8025 0.506 1193 -3.9898 -1.62
fat - starch 1.2049 0.506 1193 0.0175 2.39
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 3 estimates
If we are instead interested in simple sex effects for each diet, we simply specify simple="sex" instead:
# simple effects of sex by dietcontrast(emm_dietsex, method="pairwise", simple="sex")
diet = control:
contrast estimate SE df t.ratio p.value
male - female 1.17 0.506 1193 2.315 0.0208
diet = fat:
contrast estimate SE df t.ratio p.value
male - female 1.18 0.506 1193 2.323 0.0203
diet = starch:
contrast estimate SE df t.ratio p.value
male - female 2.46 0.506 1193 4.863 <0.0001
Interaction contrasts using contrast()
One of the interaction contrasts (difference between simple effects) is not directly estimated in the regression model: the difference between (fat - starch) for males and (fat - starch) for females (or, equivalently, the difference between (male - female) for the fat diet vs (male - female) for the starch diet.)
We can easily get all interaction contrasts with contrast(), by specifying interaction=TRUE instead of simple= (but keeping method="pairwise"):
diet_pairwise sex_pairwise estimate SE df t.ratio p.value
control - fat male - female -0.00433 0.716 1193 -0.006 0.9952
control - starch male - female -1.28950 0.716 1193 -1.802 0.0718
fat - starch male - female -1.28517 0.716 1193 -1.796 0.0728
The final contrast with estimate \(-1.28517\) is the one interaction not directly estimated by the regression model we ran.
Visualizing interactions of two categorical variables with emmeans::emmip()
To visualize the simple effects of diet across levels of sex (the moderator here), we specify in emmip():
the emmGrid object as the first argument
sex is the break variable (moderators) and diet is the x.variable, so the formula is sex ~ diet
# diet mapped to x-axis, separate lines/colors by sex, with 95% CIsemmip(emm_dietsex, sex ~ diet, CI=TRUE)
The lack of parallelism between the lines suggests an interaction. Specifically, it seems the starch diet affects weight (vs control or vs fat) differently between the sexes.
If we instead wish to visualize the simple effects of sex across diet, we reverse the formula:
# sex mapped to x-axis, separate lines/colors by diet, with 95% CIsemmip(emm_dietsex, diet ~ sex, CI=TRUE)
Exercise 4
Using the auction data set
Run a linear regression modeling bid predicted by rarity, budget, and their interaction.
Estimate the simple effects of rarity across budgets.
Graph the simple effects of rarity across budgets.
Interaction of a categorical and continuous variable
Regression model with an interaction of a categorical and continuous variable
We will discuss how to interpret and visualize the interaction of a categorical and continuous variable by modeling weight regressed on:
# interaction of diet and agem_dietage <-lm(weight ~ diet*age + litter, data=mouse)summary(m_dietage)
Call:
lm(formula = weight ~ diet * age + litter, data = mouse)
Residuals:
Min 1Q Median 3Q Max
-18.5675 -3.3093 -0.0853 3.1501 15.1545
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.66186 0.63530 32.523 < 2e-16 ***
dietfat 0.23384 0.78533 0.298 0.765939
dietstarch 0.80315 0.78533 1.023 0.306664
age 0.29266 0.05498 5.323 1.22e-07 ***
litter -0.19088 0.05143 -3.711 0.000216 ***
dietfat:age 0.41952 0.07776 5.395 8.25e-08 ***
dietstarch:age 0.29378 0.07776 3.778 0.000166 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.039 on 1193 degrees of freedom
Multiple R-squared: 0.2835, Adjusted R-squared: 0.2799
F-statistic: 78.68 on 6 and 1193 DF, p-value: < 2.2e-16
Interpretation of coefficients:
\(\hat{b}_0=20.66\): mice on the control diet at 0 weeks from litters with 0 pups are expected to weigh 20.66 grams
\(\hat{b}_{fat}=0.23\): high-fat mice are expected to weight .23 grams more than control mice at 0 weeks (age=0), holding litter constant
\(\hat{b}_{starch}=0.80\): high-starch mice are expected to weight .80 grams more than control mice at 0 weeks, holding litter constant
\(\hat{b}_{age}=0.29\): mice on the control diet (the reference group, when fat=0 and starch=0) are expected to gain 0.29 grams per week (i.e. the simple slope of age), holding litter constant
\(\hat{b}_{lit}=-0.19\): mice are expected to weigh .19 grams less per additional pup in the litter, holding diet and age constant
\(\hat{b}_{f \times a}=0.42\): for high-fat mice, the amount of weight gain per week (age effect) is .42 grams/week more than for control mice
or, the difference between high-fat and control mice (diet fat effect) increases by .42 grams/week
\(\hat{b}_{s \times a}=0.29\): for high-starch mice, the amount of weight gain per week (age effect) is .29 g/wk more than for control mice
or, the difference between high-starch and control mice (diet strach effect) increases by .29 grams/week
Some quantities not directly estimated by this regression model:
age slope for high-fat and high-starch mice
difference between those 2 age slopes
Simple slopes analysis using emmeans::emtrends()
To estimate simple slopes, we again skip the step of estimating EMMs with emmeans() and use emtrends() directly, specifying:
the first argument is the regression model object
specs=, the moderator(s)
var=, the variable whose simple slopes we want to estimate
# simple slopes of age by dietemtrends(m_dietage, specs=c("diet"), var="age")
Then, to estimate the simple effects of diet, we use contrast() just as before, with method="pairwise" and simple="diet":
# simple effects of diet by agecontrast(emm_dietage, method="pairwise", simple="diet")
age = 2:
contrast estimate SE df t.ratio p.value
control - fat -1.073 0.651 1193 -1.649 0.2256
control - starch -1.391 0.651 1193 -2.138 0.0828
fat - starch -0.318 0.651 1193 -0.489 0.8768
age = 9:
contrast estimate SE df t.ratio p.value
control - fat -4.010 0.356 1193 -11.252 <0.0001
control - starch -3.447 0.356 1193 -9.674 <0.0001
fat - starch 0.562 0.356 1193 1.578 0.2556
age = 16:
contrast estimate SE df t.ratio p.value
control - fat -6.946 0.651 1193 -10.677 <0.0001
control - starch -5.504 0.651 1193 -8.460 <0.0001
fat - starch 1.442 0.651 1193 2.217 0.0687
P value adjustment: tukey method for comparing a family of 3 estimates
Visualizing a categorical by continuous interaction using emmeans:emmip()
To graph the simple slopes of age by diet, we again skip the emmeans() step and proceed immediately to using emmip().
the first argument is the lm model object
for the formula, we want age on the x-axis and separate lines/colors by diet, so we specify diet ~ age
we must specify a list of ages to use with the at= argument, so we first specify the integers between 2 and 16
# slopes of age from 2 to 16 weeks for each dietemmip(m_dietage, diet ~ age, at=list(age=2:16), CIs=TRUE)
If instead we wish to plot simple effects of diet at specific ages we can use either the emmGrid object we created before (which has the ages specified)…
# diet effects at ages 2 6 and 9 (specified in emmeans)emmip(emm_dietage, age ~ diet)
…or use the lm model object and specify the ages as above.
# not run, same as aboveemmip(m_dietage, age ~ diet, at=list(age=c(2,9,16)))
Exercise 5
Using the auction dataset
Run a linear regression modeling bid predicted by hours, rarity, and their interaction.
Analyze the interaction both through estimation and graphing.
Interactions in generalized linear models
Considerations for interpreting interactions in generalized linear models
The methods we have used thus far will continue to work for generalized linear models (GLMs), such as logistic and Poisson regression models.
GLMs use a link function to transform the mean parameter of the assumed distribution of the outcome with limited range (e.g between 0 and 1) typically to a variable that can take on any value, to allow modeling linear relationships with the predictors.
in logistic regression, the logit link function transforms probabilities (range \([0,1]\)) to log-odds (range \((-\infty, \infty)\))
in Poisson regression, the log link function transforms mean counts (or rates) (range \([0,\infty]\)) to log-counts (range \((-\infty, \infty)\))
The predictors will have a linear relationship with the transformed mean (e.g. the log-odds), but a non-linear (i.e. curved) relationship with the untransformed mean.
For example, here we show in a logistic regression that age has a linear relationship with the log-odds of hypertension.
# logistic regressionm_logistic <-glm(hyper ~ age, data=mouse, family="binomial")# log-odds of hypertension vs ageemmip(m_logistic, ~ age, at=list(age=2:16))
…but a curved relationship with the probability (mean) of hypertension:
# probability of hypertension vs ageemmip(m_logistic, ~ age, at=list(age=2:16), type="response")
If we extend the range of age to beyond the observed data, we see the typical S-shape relationship between continuous predictors and probability of the outcome in logistic regression:
# probability of hypertension vs extended ages; notice the S-shapeemmip(m_logistic, ~ age, at=list(age=0:50), type="response")
The advantage of plotting the predictor against the transformed mean is that the non-parallelism associated with an interaction will be more obvious. On the other hand, most people will have a harder time interpreting on the transformed scale (i.e. how to interpret changes in log-odds).
Additionally, effect estimates from GLMs are often expressed on two scales. In logistic regression:
raw coefficients are intepreted as differences in log-odds of the outcome
exponentiated coefficients are interpreted as odds ratios (multiplicative effects)
Thus, we can express simple effects in multiple ways.
We will use a logistic regression model to discuss analysis and visualization of interactions in GLMs.
Logistic regression model
Here we model the probability of hypertension (0/1) in a logistic regression that includes as predictors:
# logistic regression with interaction of diet and agem_hyper <-glm(hyper ~ diet*age + litter, data=mouse, family="binomial")summary(m_hyper)
Call:
glm(formula = hyper ~ diet * age + litter, family = "binomial",
data = mouse)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.86236 0.64515 -4.437 9.13e-06 ***
dietfat -0.42293 0.74247 -0.570 0.56893
dietstarch -0.53860 0.79490 -0.678 0.49804
age -0.03290 0.06442 -0.511 0.60954
litter -0.05673 0.03607 -1.573 0.11578
dietfat:age 0.23874 0.07312 3.265 0.00109 **
dietstarch:age 0.17518 0.07697 2.276 0.02284 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 788.95 on 1199 degrees of freedom
Residual deviance: 676.02 on 1193 degrees of freedom
AIC: 690.02
Number of Fisher Scoring iterations: 6
Interpretation:
\(\hat{b}_0=-2.86\), the expected log-odds of hypertension for control mouse at week 0 from a litter of 0 pups is \(-2.86\)
\(\hat{b}_{fat}=-0.42\), at week 0, high-fat mice are expected to have a decrease of 0.42 log-odds of hypertension compared to control mice, holding litter constant
\(\hat{b}_{sta}=-0.54\), at week 0, high-starch mice are expected to have a decrease of -0.54 log-odds of hypertension compared to control mice, holding litter constant
\(\hat{b}_{age}=-0.033\), control mice are expected have a decrease of -0.033 log-odds of hypertension per week, holding litter constant
\(\hat{b}_{lit}=-0.057\), for each additional pup in the litter, the log-odds of hypertension are expected to decrease by -0.057, holding diet and age constant
\(\hat{b}_{f \times a}=0.24\), the change in log-odds per week (age effect) is 0.24 more for high-fat mice than control mice
or, the difference in log-odds between high-fat and control mice (diet fat effect) increases by 0.24 per week
\(\hat{b}_{s \times a}=0.18\), the change in log-odds per week (age effect) is 0.24 more for high-starch mice than control mice
or, the difference in log-odds between high-starch and control mice (diet fat effect) increases by 0.18 per week
Odds Ratios
Coefficients in logistic regression are often exponentiated so that they be interpreted as odds ratios, except:
the exponentiated intercept, which is interpreted as the expected odds when all predictors are zero
exponentiated interaction coefficients, which are interpreted as ratios of odds ratios (see below)
# exponentiating coefficients for odds ratiosexp(coef(m_hyper))
\(exp(\hat{b}_0)=0.057\), the expected odds of hypertension for control mouse at week 0 from a litter of 0 pups is 0.057
\(exp(\hat{b}_{fat})=0.66\), at week 0, high-fat mice are expected to have a 34% decrease in the odds of hypertension compared to control mice, holding litter constant
\(exp(\hat{b}_{sta})=0.58\), at week 0, high-starch mice are expected to have a 42% decrease in the odds of hypertension compared to control mice, holding litter constant
\(exp(\hat{b}_{age})=0.97\), control mice are expected have a 3% decrease in the odds of hypertension per week, holding litter constant
\(exp(\hat{b}_{lit})=0.94\), for each additional pup in the litter, the odds of hypertension are expected to decrease 6%, holding diet and age constant
\(exp(\hat{b}_{f \times a})=1.27\), the odds ratio for age (age effect) is 27% larger for high-fat mice than control mice
or, the odds ratio comparing high-fat and control mice (diet fat effect) increases by 27% per week
\(exp(\hat{b}_{s \times a})=1.19\), the odds ratio for age (age effect) is 19% larger for high-starch mice than control mice
or, the odds ratio comparing high-starch and control mice (diet starch effect) increases by 19% per week
Exponentiated logistic regression interaction coefficients have a challenging interpretation – they are ratios of odds ratios. For example, if we compare the odds ratio representing the age effect (i.e. how much the odds of hypertension change when age increases by 1 week) between high fat and control mice, that ratio is \(exp(\hat{b}_{f \times a})=1.27\). In other words, the increasing risk of hypertension with age is 27% larger for high fat mice than control mice.
Estimating logistic regression simple effects with contrast()
The procedure to estimate simple effects across levels of a moderator are the same as before.
For simple effects of a categorical variable, we begin by estimating EMMs with emmeans(). Here we estimate the simple effects of diet at ages 2, 9, and 16.
By default, these will be expressed in log-odds (logits).
# EMMs for diet*age expressed as log-oddsemm_hyper_logodds <-emmeans(m_hyper, specs=c("diet", "age"),at=list(age=c(2,9,16)))emm_hyper_logodds
diet age emmean SE df asymp.LCL asymp.UCL
control 2 -3.269 0.504 Inf -4.256 -2.2811
fat 2 -3.214 0.355 Inf -3.909 -2.5185
starch 2 -3.457 0.428 Inf -4.297 -2.6170
control 9 -3.499 0.297 Inf -4.080 -2.9174
fat 9 -1.773 0.162 Inf -2.090 -1.4563
starch 9 -2.461 0.200 Inf -2.854 -2.0679
control 16 -3.729 0.573 Inf -4.853 -2.6052
fat 16 -0.332 0.209 Inf -0.742 0.0774
starch 16 -1.465 0.266 Inf -1.986 -0.9438
Results are given on the logit (not the response) scale.
Confidence level used: 0.95
If we instead want the EMMs expressed as probabilities, we can add type="response":
# EMMs for diet*age expressed as probabilitiesemm_hyper_prob <-emmeans(m_hyper, specs=c("diet", "age"),at=list(age=c(2,9,16)),type="response")emm_hyper_prob
Then, to estimate the simple effects we use contrast(). With the emmGrid object where EMMs are expressed in log-odds, the simple effects will be interpreted as differences in log-odds.
# using EMMs expressed as log-odds, effects will be differences in log-oddscontrast(emm_hyper_logodds, method="pairwise", simple="diet")
age = 2:
contrast estimate SE df z.ratio p.value
control - fat -0.0546 0.616 Inf -0.089 0.9957
control - starch 0.1882 0.661 Inf 0.285 0.9563
fat - starch 0.2428 0.556 Inf 0.437 0.9002
age = 9:
contrast estimate SE df z.ratio p.value
control - fat -1.7258 0.337 Inf -5.116 <0.0001
control - starch -1.0380 0.357 Inf -2.904 0.0103
fat - starch 0.6877 0.257 Inf 2.677 0.0203
age = 16:
contrast estimate SE df z.ratio p.value
control - fat -3.3970 0.610 Inf -5.566 <0.0001
control - starch -2.2643 0.632 Inf -3.584 0.0010
fat - starch 1.1326 0.338 Inf 3.351 0.0023
Results are given on the log odds ratio (not the response) scale.
P value adjustment: tukey method for comparing a family of 3 estimates
If we instead use contrast() on the emmGrid object where the EMMs are expressed as probabilities, the simple effects will be expressed as odds ratios:
# using EMMs expressed as probabilities, effects will be odds ratioscontrast(emm_hyper_prob, method="pairwise", simple="diet")
age = 2:
contrast odds.ratio SE df null z.ratio p.value
control / fat 0.9469 0.5830 Inf 1 -0.089 0.9957
control / starch 1.2071 0.7980 Inf 1 0.285 0.9563
fat / starch 1.2748 0.7090 Inf 1 0.437 0.9002
age = 9:
contrast odds.ratio SE df null z.ratio p.value
control / fat 0.1780 0.0601 Inf 1 -5.116 <0.0001
control / starch 0.3541 0.1270 Inf 1 -2.904 0.0103
fat / starch 1.9892 0.5110 Inf 1 2.677 0.0203
age = 16:
contrast odds.ratio SE df null z.ratio p.value
control / fat 0.0335 0.0204 Inf 1 -5.566 <0.0001
control / starch 0.1039 0.0656 Inf 1 -3.584 0.0010
fat / starch 3.1039 1.0500 Inf 1 3.351 0.0023
P value adjustment: tukey method for comparing a family of 3 estimates
Tests are performed on the log odds ratio scale
Estimating logistic regression simple slopes with emtrends()
To estimate the simple slopes of age across diets, we use emtrends(), which by default will be expressed as changes in log-odds per week:
# slopes of age by diet in log-oddsemtrends(m_hyper, specs="diet", var="age")
No option in emtrends() will transform these simple slopes to odds ratios, but we can do it manually:
# extract the table of estimatesemt <-summary(emtrends(m_hyper, specs="diet", var="age"))# exponentiate the estimate and the confidence limitsemt$OR <-exp(emt$age.trend)emt$OR_LCL <-exp(emt$asymp.LCL)emt$OR_UCL <-exp(emt$asymp.UCL)# display the results, removing the unexponentiated columns and SEemt
Odds ratios are commonly displayed in plots that resemble Forest plots (often used for meta-analyses), where the odds ratio estimate is displayed with its confidence interval.
Using plot() on a contrast() object will produce a Forest-style plot of the simple effects.
Below we use the emmGrid object where the contrasts are expressed as odds ratios to plot odds ratios.
# Forest-style plot of odds ratiosplot(contrast(emm_hyper_prob, method="pairwise", simple="diet"))
A reference line at x=1 is often added to signify “no effect”:
# Forest-style plot of odds ratiosplot(contrast(emm_hyper_prob, method="pairwise", simple="diet")) +geom_vline(xintercept=1, linetype=2) # dashed vertical line at x=1
Interaction plots on the log-odds scale
Plotting the interaction on the log-odds scale will make the interaction more apparent (assuming it’s large enough), but log-odds units are hard to interpret.
Here we use emmip() to plot simple effects of diet at weeks 2, 9, and 16. We use the emmGrid object where EMMs are estimated in log odds.
# using EMMs expressed as log odds# diet on x-axis, separate lines/colors by age# predicted log-odds of hypertensionemmip(emm_hyper_logodds, age ~ diet, CIs=TRUE)
We can see that at age=2, the risk of hypertension appears quite similar across diets, but they diverge more as age increases.
We can also plot the simple slopes of age by diet using emmip():
# age on x-axis, separate lines by diet# predicted log-odds of hypertensionemmip(m_hyper, diet ~ age, at=list(age=2:16), CIs=TRUE)
We see the increasing risk of hypertension with age on the two experimental diets.
Interaction plots on the probability scale
Plotting the interaction on the probability scale will generally make the interaction less apparent because it is difficult to judge the parallelism of curves, but probability units are much more natural to interpret for most researchers.
To change the outcome to probabilities, we simply add type="response" to emmip() specifications we used before:
# diet on x-axis, separate lines/colors by age# predicted probabilities of hypertension emmip(emm_hyper_logodds, age ~ diet, CIs=TRUE, type="response")
# age on x-axis, separate lines by diet# predicted probabilities of hypertensionemmip(m_hyper, diet ~ age, at=list(age=2:16), CIs=TRUE, type="response")
With predicted probability curves like above, it can be difficult to judge whether the curves look different because the underlying age effect is different or simply because of consistent diet effects across ages.
Do it yourself (DIY) analysis and visualization of interactions
Overview of DIY analysis of interactions
This section looks at methods for analyzing interactions with base R coding and visualizing interactions with the ggplot2 package.
One advantage of learning to analyze interactions without emmeans is that these methods will work for regression models and packages not supported by emmeans.
On the other hand, doing it yourself requires that you have a solid grasp of how to interpret regression coefficients.
DIY Estimating simple effects and slopes in R
As we have discussed throughout this workshop, the coefficient for the lower order term of \(X\) when it is interacted with \(Z\) represents effect of \(X\) when \(Z=0\). We change the interpretation of this coefficient by changing the meaning of \(Z=0\):
for a continuous variable \(Z\), we will recenter it to change the meaning of 0
for a categorical variable \(Z\), we will change the omitted, reference group
Re-centering a continuous moderator
Imagine we have a variable \(Z^* = Z - 5\). If we model:
\(b_X\) is interpreted as the effect of \(X\) when \(Z^*=0\). However, when \(Z^*=0\), \(Z=5\), so we can also interpret \(b_X\) as the effect of \(X\) when \(Z=5\).
Thus, we can estimate a coefficient that represents the effect of \(X\) at any value of \(Z\) by re-centering \(Z\).
Let’s revisit our model with the interaction of diet and age:
# diet*age model againm_dietage <-lm(weight ~ diet*age + litter, data=mouse)summary(m_dietage)
Call:
lm(formula = weight ~ diet * age + litter, data = mouse)
Residuals:
Min 1Q Median 3Q Max
-18.5675 -3.3093 -0.0853 3.1501 15.1545
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.66186 0.63530 32.523 < 2e-16 ***
dietfat 0.23384 0.78533 0.298 0.765939
dietstarch 0.80315 0.78533 1.023 0.306664
age 0.29266 0.05498 5.323 1.22e-07 ***
litter -0.19088 0.05143 -3.711 0.000216 ***
dietfat:age 0.41952 0.07776 5.395 8.25e-08 ***
dietstarch:age 0.29378 0.07776 3.778 0.000166 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.039 on 1193 degrees of freedom
Multiple R-squared: 0.2835, Adjusted R-squared: 0.2799
F-statistic: 78.68 on 6 and 1193 DF, p-value: < 2.2e-16
The coefficients for diet above, \(b_{fat}\) and \(b_{sta}\), are interpreted as the differences from the control diet when age=0.
If we instead want to estimate diet effects at other ages, we can re-center age, such as at ages 9 and 16 weeks.
Although we could add various re-centered versions a variable to a data set and enter them into regression models to estimate these simple effects…
Those lower order coefficients for diet, \(\hat{b}_{fat}=4.01\) and \(\hat{b}_{sta}=3.45\) are now interpreted as differences from the control group when age = 9.
They are the same as the diet simple effects at age=9 we estimated with contrast() from the emmeans package:
# compare to emmeans::contrast()contrast(emm_dietage, method="pairwise", simple="diet")
age = 2:
contrast estimate SE df t.ratio p.value
control - fat -1.073 0.651 1193 -1.649 0.2256
control - starch -1.391 0.651 1193 -2.138 0.0828
fat - starch -0.318 0.651 1193 -0.489 0.8768
age = 9:
contrast estimate SE df t.ratio p.value
control - fat -4.010 0.356 1193 -11.252 <0.0001
control - starch -3.447 0.356 1193 -9.674 <0.0001
fat - starch 0.562 0.356 1193 1.578 0.2556
age = 16:
contrast estimate SE df t.ratio p.value
control - fat -6.946 0.651 1193 -10.677 <0.0001
control - starch -5.504 0.651 1193 -8.460 <0.0001
fat - starch 1.442 0.651 1193 2.217 0.0687
P value adjustment: tukey method for comparing a family of 3 estimates
Let’s compare the coefficients from the model with uncentered age to the model with age centered at 9:
uncentered age
age centered at 9
term
estimate
std.error
p.value
term
estimate
std.error
p.value
(Intercept)
20.662
0.635
0.000
(Intercept)
23.296
0.398
0
dietfat
0.234
0.785
0.766
dietfat
4.010
0.356
0
dietstarch
0.803
0.785
0.307
dietstarch
3.447
0.356
0
age
0.293
0.055
0.000
I(age - 9)
0.293
0.055
0
litter
-0.191
0.051
0.000
litter
-0.191
0.051
0
dietfat:age
0.420
0.078
0.000
dietfat:I(age - 9)
0.420
0.078
0
dietstarch:age
0.294
0.078
0.000
dietstarch:I(age - 9)
0.294
0.078
0
Above we can see that only the intercept, \(\hat{b}_0\), and the two diet simple effects, \(\hat{b}_{fat}\) and \(\hat{b}_{sta}\) have changed after re-centering.
Importantly, replacing a variable with a re-centered version of it will not change the overall fit of the model:
# log-likelihoods of 2 modelsc(uncentered=logLik(m_dietage), centered.at.9=logLik(m_dietage9))
uncentered centered.at.9
-3639.956 -3639.956
Mini-exercise: Recreate the estimates of the simple effects of diet at age=2 and age=16 shown in the output of contrast() above.
Re-centering a categorical moderator
Let’s revisit our model with the interaction of diet and uncentered age one more time:
# diet*age model one more timem_dietage <-lm(weight ~ diet*age + litter, data=mouse)summary(m_dietage)
Call:
lm(formula = weight ~ diet * age + litter, data = mouse)
Residuals:
Min 1Q Median 3Q Max
-18.5675 -3.3093 -0.0853 3.1501 15.1545
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.66186 0.63530 32.523 < 2e-16 ***
dietfat 0.23384 0.78533 0.298 0.765939
dietstarch 0.80315 0.78533 1.023 0.306664
age 0.29266 0.05498 5.323 1.22e-07 ***
litter -0.19088 0.05143 -3.711 0.000216 ***
dietfat:age 0.41952 0.07776 5.395 8.25e-08 ***
dietstarch:age 0.29378 0.07776 3.778 0.000166 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.039 on 1193 degrees of freedom
Multiple R-squared: 0.2835, Adjusted R-squared: 0.2799
F-statistic: 78.68 on 6 and 1193 DF, p-value: < 2.2e-16
The lower order coefficient for age, \(\hat{b}_{age}=0.29\), is interpreted as the slope of age for the control group (i.e. fat=0 and starch=0).
We might be interested in the slope of age for the high-fat and high-starch groups. We can get those directly estimated by the model if we change the reference groups for diet.
Although we can create a new variable with a changed reference group using factor(), we can instead change reference groups of an existing factor variable temporarily for a regression model using relevel(), which has a ref= argument that specifies the reference group.
Here we change the reference group of diet to the high-fat group:
# changing reference group of diet from control to fatm_dietfatage <-lm(weight ~relevel(diet, ref="fat")*age + litter, data=mouse)# age slope is now for high-fat groupsummary(m_dietfatage)
Call:
lm(formula = weight ~ relevel(diet, ref = "fat") * age + litter,
data = mouse)
Residuals:
Min 1Q Median 3Q Max
-18.5675 -3.3093 -0.0853 3.1501 15.1545
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.89570 0.63530 32.891 < 2e-16 ***
relevel(diet, ref = "fat")control -0.23384 0.78533 -0.298 0.765939
relevel(diet, ref = "fat")starch 0.56931 0.78533 0.725 0.468640
age 0.71218 0.05498 12.952 < 2e-16 ***
litter -0.19088 0.05143 -3.711 0.000216 ***
relevel(diet, ref = "fat")control:age -0.41952 0.07776 -5.395 8.25e-08 ***
relevel(diet, ref = "fat")starch:age -0.12573 0.07776 -1.617 0.106150
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.039 on 1193 degrees of freedom
Multiple R-squared: 0.2835, Adjusted R-squared: 0.2799
F-statistic: 78.68 on 6 and 1193 DF, p-value: < 2.2e-16
Several estimates have changed here, because all coefficients that involve diet are now interpreted as differences from the fat group (rather than differences from the control group as in our original model).
The \(\hat{b}_{age}=0.71\) coefficient is now interpreted as the slope of age for the high-fat group.
The final coefficient, \(\hat{b}_{s \times a}=-0.126\) is interpreted as the difference in age slopes between the high-starch mice and high-fat mice.
This estimate matches the slope of age for the high-fat group estimated by the emtrends() function:
# compare to emmeans::emtrendsemtrends(m_dietage, specs="diet", var="age")
Mini-exercise: Estimate the simple slope of age for the high-starch group by changing the reference group of diet.
DIY Prediction after regression
As we’ve seen, graphs of predictors vs predicted values are a common way to visualize regression effects.
The predict() function can be used on regression model objects to produce predictions for the current data or for new data.
Typically, to visualize an effect using model predictions, we will create a new data set of predictor values where:
the values of the predictors whose effects we wish to visualize are varied
for the variable’s whose effects we wish to visualize, specify the range of values to appear on the x-axis
for the moderator, specify values at which to evaluate and graph the simple effects
the values of predictors whose effects we do not wish to visualize are fixed at constant values (e.g means or reference groups)
The function expand.grid() creates data frames where the values of the variables are fully crossed. Below, we create a dataset that crosses each of the 3 diet groups, with two ages, 2 and 16, and the mean of litter (litter=6), which should result in \(3 \times 2 \times 1 = 6\) rows of data.
# expand.grid() crosses levels of variablessmalldata <-expand.grid(diet=levels(mouse$diet), age=c(2, 16), litter=mean(mouse$litter))smalldata
diet age litter
1 control 2 6
2 fat 2 6
3 starch 2 6
4 control 16 6
5 fat 16 6
6 starch 16 6
Confidence intervals in graphs will generally be more accurate if we estimate more predictions, so we create a data set that includes all observed ages from 2 to 16 weeks:
# data we will use to plot predictions with more agespreddata <-expand.grid(diet=levels(mouse$diet), age=2:16, litter=mean(mouse$litter)) head(preddata, n=10)
diet age litter
1 control 2 6
2 fat 2 6
3 starch 2 6
4 control 3 6
5 fat 3 6
6 starch 3 6
7 control 4 6
8 fat 4 6
9 starch 4 6
10 control 5 6
Now, we can use predict() on the model object and add the predictions to the plotdata data set for plotting. Adding interval="confidence" requests confidence intervals as well.
# just predictions for each new observation in plotdatafit <-predict(m_dietage, newdata=preddata)head(fit)
As we see above, with interval="confidence", predict() produces 3 new variables, fit, lwr, and upr, which are the predicted (mean) value, lower limit of the 95% CI for the predicted mean, and upper limit of the 95% CI, respectively.
We can use cbind() to add all 3 columns to plotdata.
# adding predicted values and CIs to data setplotdata <-cbind(preddata, predict(m_dietage, newdata=preddata, interval="confidence"))head(plotdata)
DIY Graphing of predictions to visualize regression effects
We use the ggplot2 package to visualize our predictions.
Quick explanation of ggplot() syntax:
Use the ggplot() function to specify
the data set used for graphing
which variables are mapped to aesthetics (visual aspects of the graph), inside of aes()
Add geom functions to produce shapes on the graphs
e.g. geom_line() for line graphs, geom_point() for scatterplots
geom functions inherit aesthetic mappings from the ggplot() function
additional aesthetics specific to the geom can be specified in aes() inside the geom function
Below, we produce a graph of the plotdata data set with:
age mapped to the x-axis
fit, the predicted value mapped to the y-axis
diet mapped to the color of the lines produced by geom_line()
# ggplot age slopes by dietggplot(plotdata, aes(x=age, y=fit, color=diet)) +geom_line()
We can add the 95% confidence intervals to our plot by mapping the lwr and upr columns in plotdata to ymin= and ymax= in geom_errorbar().
# ggplot age slopes by diet add error barsggplot(plotdata, aes(x=age, y=fit, color=diet)) +geom_line() +geom_errorbar(aes(ymin=lwr, ymax=upr), width=.2) # used width to narrow the error bars
Or in geom_pointrange(), which the emmeans package uses for confidence intervals…
# ggplot age slopes by diet with pointrange CIsggplot(plotdata, aes(x=age, y=fit, color=diet)) +geom_line() +geom_pointrange(aes(ymin=lwr, ymax=upr))
…or using geom_ribbon() to produce confidence bands.
fill= colors the insides of the band by a variable (color= controls the border of the band).
alpha= makes the bands transparent. Use a value between 0 and 1 (0 is completely transparent).
# ggplot age slopes by diet with CI bandsggplot(plotdata, aes(x=age, y=fit, color=diet)) +geom_line() +geom_ribbon(aes(ymin=lwr, ymax=upr, fill=diet), alpha=.5)
To graph the simple effects of diet at different ages, we need to subset plotdata to only those ages at which we wish to plot the diet effects:
# subset plot data to only ages 2, 9, 16plotdata.age.2.9.16<-subset(plotdata, plotdata$age==2| plotdata$age==9| plotdata$age==16) # only ages 2 and 16# ggplot diet effects across agesggplot(plotdata.age.2.9.16, aes(x=diet, y=fit, color=age)) +geom_point() +geom_errorbar(aes(ymin=lwr, ymax=upr), width=.2)
If we want distinctive colors for each age, put age inside of factor() in the ggplot() specification:
# change color variable (age) to factor for distinct colorsggplot(plotdata.age.2.9.16, aes(x=diet, y=fit, color=factor(age))) +geom_point() +geom_errorbar(aes(ymin=lwr, ymax=upr), width=.2)
Extending to three-way interactions
The methods we have discussed so far extend in a straightforward manner to interactions of 3 variables.
When using emmeans to analyze the interaction, generally we will need to specify one more variable in specs= than for an interaction of two variables.
Here we model a 3-way interaction of diet, sex, and age. Models with 3-way interactions can be difficult to interpret because of the number of coefficients estimated:
We can estimate the simple effects of diet for each sex and at ages 2, 9, and 16 using the same method as before:
# EMMs for diet by sex at ages=2,9,16emm_3 <-emmeans(m_3, specs=c("diet", "sex", "age"),at=list(age=c(2,9,16)))# simple effects of diet for each sex and ages=2,9,16contrast(emm_3, method="pairwise", simple="diet")
sex = male, age = 2:
contrast estimate SE df t.ratio p.value
control - fat -0.7200 0.907 1188 -0.794 0.7069
control - starch -0.6084 0.907 1188 -0.671 0.7806
fat - starch 0.1116 0.907 1188 0.123 0.9917
sex = female, age = 2:
contrast estimate SE df t.ratio p.value
control - fat -1.4257 0.907 1188 -1.572 0.2583
control - starch -2.1730 0.907 1188 -2.396 0.0441
fat - starch -0.7473 0.907 1188 -0.824 0.6883
sex = male, age = 9:
contrast estimate SE df t.ratio p.value
control - fat -4.0117 0.497 1188 -8.074 <0.0001
control - starch -4.0920 0.497 1188 -8.236 <0.0001
fat - starch -0.0803 0.497 1188 -0.162 0.9857
sex = female, age = 9:
contrast estimate SE df t.ratio p.value
control - fat -4.0073 0.497 1188 -8.066 <0.0001
control - starch -2.8025 0.497 1188 -5.641 <0.0001
fat - starch 1.2049 0.497 1188 2.425 0.0409
sex = male, age = 16:
contrast estimate SE df t.ratio p.value
control - fat -7.3033 0.907 1188 -8.051 <0.0001
control - starch -7.5755 0.907 1188 -8.351 <0.0001
fat - starch -0.2722 0.907 1188 -0.300 0.9516
sex = female, age = 16:
contrast estimate SE df t.ratio p.value
control - fat -6.5890 0.907 1188 -7.264 <0.0001
control - starch -3.4319 0.907 1188 -3.783 0.0005
fat - starch 3.1571 0.907 1188 3.480 0.0015
P value adjustment: tukey method for comparing a family of 3 estimates
To graph a three-way interaction with emmip(), we use a formula involving three variables: break.variable ~ x.variable | panel.variable, such that separate graph panels are created for each level of panel.variable:
# diet on x-axis, separate lines/colors by sex, separate panels by ageemmip(emm_3, sex ~ diet | age, CIs=TRUE)
As we see above, a three-way interaction implies that the interaction of two variables varies with a third variable. Here, we see that the interaction of sex and diet appears different across the 3 ages.
We can also estimate simple slopes of age in the same way as before with emtrends():
# simple slopes of age by diet and sexemtrends(m_3, specs=c("diet", "sex"), var="age")
diet sex age.trend SE df lower.CL upper.CL
control male 0.308 0.0767 1188 0.157 0.458
fat male 0.778 0.0767 1188 0.627 0.928
starch male 0.805 0.0767 1188 0.655 0.956
control female 0.278 0.0767 1188 0.127 0.428
fat female 0.646 0.0767 1188 0.496 0.797
starch female 0.368 0.0767 1188 0.217 0.518
Confidence level used: 0.95
And graph the slopes with separate panels by the third variable (here sex):
# age on x-axis, separate lines/colors by diet, panel by sexemmip(m_3, diet ~ age | sex, at=list(age=2:16), CIs=TRUE)
Mini-exercise: try estimating and graphing the diet effects using DIY methods.