Analysis and visualization of interactions in R

OARC Statistical Methods and Data Analytics

Workshop outline

This workshop will teach you how to analyze and visualize interactions in regression models in R both using the emmeans package and with base R coding.

Topics discussed in the workshop:

Workshop packages

This workshop requires the emmeans and ggplot2 packages. This workshop is interactive, so if you wish to participate, please load both packages into R now

library(emmeans)
library(ggplot2)

Workshop data set

The workshop data set contains data from an experiment of mice being fed 3 different diets.

Variables

Note: the data are artificial. Do not make any inferences regarding mouse weights from these data.

# read in data
mouse <- read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/03/mouse.csv")
# display first few rows
head(mouse)
##   age    sex    diet litter   weight hyper
## 1   2   male control      2 14.76288     0
## 2   2 female     fat      4 24.18650     0
## 3   2   male  starch      6 21.14797     0
## 4   2 female control      8 21.60721     0
## 5   2   male     fat     10 20.22701     0
## 6   2 female  starch      2 16.17364     0

Review of regression

Regression with a continuous predictor

First, we review the interpretation of the coefficients of a regression model with a single continuous predictor.

In general, a regression coefficient for a predictor will be interpreted as the expected change in the dependent variable for a one-unit change in the predictor.

Below we model weight predicted by age:

\[{weight}_i = b_0 + b_{age}age_i + \epsilon_i\]

We use the lm() function to run a linear regression model.

# simple linear regression
m_age <- lm(weight ~ age, data=mouse)
# regression tabl
summary(m_age)
## 
## Call:
## lm(formula = weight ~ age, data = mouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.7664  -3.4502  -0.0978   3.3996  17.5530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.86225    0.34484   57.60   <2e-16 ***
## age          0.53043    0.03414   15.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.42 on 1198 degrees of freedom
## Multiple R-squared:  0.1677, Adjusted R-squared:  0.167 
## F-statistic: 241.3 on 1 and 1198 DF,  p-value: < 2.2e-16

Interpretation of coefficients:

Review of dummy variables

In regression models, categorical variables are typically represented by one or more dummy or indicator variables:

A variable with \(k\) categories can be represented by \(k-1\) dummy variables.

For example, we can represent sex (2 categories) with a single dummy variable that equals 0 for males and equals 1 for females.

sex female
male 0
female 1
female 1
male 0

For a 3-level categorical variable, we will need two dummies:

diet fat starch
fat 1 0
starch 0 1
starch 0 1
control 0 0
fat 1 0

We do not need a third dummy variable for control, because being fat=0 and starch=0 implies membership to the control group.

The omitted category is often referred to as the reference group in regression models.

Factors in R

Factors can be used to represent categorical variables in R.

Essentially, factors are integer variables with labels for each integer value.

We use the factor() function to convert a character variable (or integer variable) to a factor.

To see the distinct categories of the factor in the order of their integer representation, use levels() on the factor variable.

# character vector
income <- c("low", "mid", "high", "high", "low", "mid")
# convert it to factor
f_income <- factor(income)
# distinct categories, in order (first value is 1, second values is 2, ...)
levels(f_income)
## [1] "high" "low"  "mid"

The factor() function will use alphabetical ordering to order the categories by default.

If we want to specify the order, we use the levels= argument in factor():

# specifying the order of levels
f_income <- factor(income, levels=c("low", "mid", "high"))
levels(f_income)
## [1] "low"  "mid"  "high"

Although we will rarely need to use them, we can see the underlying integer values by converting the factor to a numeric vector:

# underlying integer values for factor
as.numeric(f_income)
## [1] 1 2 3 3 1 2

If we are starting with an integer variable that we wish to convert to a factor, we use labels= in factor() to apply the text labels which will then become the levels themselves:

# 0=low, 1=mid, 2=high
num_income <- c(0, 1, 2, 2, 0, 1)
f_num_income <- factor(num_income, labels=c("low", "mid", "high"))
# labels become the levels
levels(f_num_income)
## [1] "low"  "mid"  "high"
# the variable now displays the labels/levels
f_num_income
## [1] low  mid  high high low  mid 
## Levels: low mid high

Important: In regression models, R will automatically create dummy variables for each level of factor except for the first, which will thus be the reference group.

Let’s make sex and diet into factors.

# convert mouse vars to factors
mouse$sex <- factor(mouse$sex, levels=c("male", "female")) # male is reference
mouse$diet <- factor(mouse$diet, levels=c("control", "fat", "starch")) # control is reference
str(mouse)
## 'data.frame':    1200 obs. of  6 variables:
##  $ age   : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ sex   : Factor w/ 2 levels "male","female": 1 2 1 2 1 2 1 2 1 2 ...
##  $ diet  : Factor w/ 3 levels "control","fat",..: 1 2 3 1 2 3 1 2 3 1 ...
##  $ litter: int  2 4 6 8 10 2 4 6 8 10 ...
##  $ weight: num  14.8 24.2 21.1 21.6 20.2 ...
##  $ hyper : int  0 0 0 0 0 0 0 0 0 0 ...

Categorical variables in regression models

Categorical variables with \(k\) categories are represented in regression models by \(k-1\) dummy variables. We cannot add all \(k\) dummy variables to the model due to perfect collinearity.

The dummy variable omitted from the model represents the reference group, and the coefficients will represent differences from this reference group.

Let’s interpret the coefficients for a model where diet is predicting weight:

\[weight_i = b_0 + b_{fat}(diet_i=fat) + b_{sta}(diet_i=starch) + \epsilon_i\]

# regression with a categorical predictor
m_diet <- lm(weight ~ diet, data=mouse)
summary(m_diet)
## 
## Call:
## lm(formula = weight ~ diet, data = mouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.690  -3.925  -0.164   3.715  18.681 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.1505     0.2836  78.099   <2e-16 ***
## dietfat       4.0095     0.4011   9.996   <2e-16 ***
## dietstarch    3.4472     0.4011   8.594   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.672 on 1197 degrees of freedom
## Multiple R-squared:  0.08916,    Adjusted R-squared:  0.08764 
## F-statistic: 58.58 on 2 and 1197 DF,  p-value: < 2.2e-16

Coefficients:

Main effects model with continuous and categorical

Let’s run a model where weight is the dependent variable, and diet, sex, and age are independent variables.

A model without interactions is sometimes called a main effects model, in which the variables’ effects do not depend on other variables.

Reference groups for our categorical variables (first level of factor):

\[weight_i = b_0 + b_{fem}(sex_i=female) + b_{fat}(diet_i=fat) + b_{sta}(diet_i=starch) + b_{age}age_i + \epsilon_i\]

# main effects model
m_me <- lm(weight ~ sex + diet + age, data=mouse)
summary(m_me)
## 
## Call:
## lm(formula = weight ~ sex + diet + age, data = mouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.082  -3.242  -0.042   3.274  15.236 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.17791    0.40983  44.354  < 2e-16 ***
## sexfemale   -1.60246    0.29242  -5.480 5.18e-08 ***
## dietfat      4.00950    0.35814  11.195  < 2e-16 ***
## dietstarch   3.44720    0.35814   9.625  < 2e-16 ***
## age          0.53043    0.03191  16.625  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.065 on 1195 degrees of freedom
## Multiple R-squared:  0.275,  Adjusted R-squared:  0.2726 
## F-statistic: 113.3 on 4 and 1195 DF,  p-value: < 2.2e-16

Interpretation of regression coefficients:

Exercise 1

Load the auction data set from our website:

# exercise data set
auction <- read.csv("https://stats.oarc.ucla.edu/wp-content/uploads/2023/03/auction.csv")

These data describe how much participants were willing to bid on items in a blind auction (all bidders make a single bid simultaneously), depending on characteristics of the item and how long the auction would last:


  1. Currently, rarity and budget are character variables. Convert them both to factors, with the ordering of the levels for rarity and budget as (common, rare, unique) and (small, large), respectively.
  2. Run a main effects linear regression modeling bid (the outcome) predicted by hours and rarity.
  3. Interpret the coefficients

The emmeans package

Overview of the emmeans package

The emmeans package can:


The procedure for estimating effects (and contrasts) will differ depending on whether the variable whose effects we seek is categorical or continuous.

For effects of a categorical variable:


For effects (slopes) of a continuous variable we skip the emmeans() step and instead:


One of the strengths of the emmeans package is that it supports regression models from many packages.

The emmeans() function

The emmeans() function calculates EMMs. The required arguments are:

For example, below we use emmeans() to calculate EMMs across levels of diet based on our main effects model.

# EMMs for each diet level, averaging across levels of sex and at the mean age
emm_me <- emmeans(m_me, specs=c("diet")) 
emm_me # display the EMMs
##  diet    emmean    SE   df lower.CL upper.CL
##  control   22.2 0.253 1195     21.7     22.6
##  fat       26.2 0.253 1195     25.7     26.7
##  starch    25.6 0.253 1195     25.1     26.1
## 
## Results are averaged over the levels of: sex 
## Confidence level used: 0.95
# class of object created by emmeans
class(emm_me)
## [1] "emmGrid"
## attr(,"package")
## [1] "emmeans"

\[\begin{align}\widehat{weight}_{emm.con} &= b_0 + b_{fat}(0) + b_{sta}(0) + b_{fem}(0.5) + b_{age}(9) \\ &= 18.18 -1.6(0.5) + 0.53(9) \\ &= 22.2 \end{align}\]

\[\begin{align}\widehat{weight}_{emm.fat} &= b_0 + b_{fat}(1) + b_{sta}(0) + b_{fem}(0.5) + b_{age}(9) \\ &= 18.18 + 4(1) -1.6(0.5) + 0.53(9) \\ &= 26.2 \end{align}\]

We can add sex to specs= to get EMMs at each crossing of diet and sex (and at the mean age):

# EMMs for each crossing of diet and sex
emm_me2 <- emmeans(m_me, specs=c("diet", "sex"))
emm_me2
##  diet    sex    emmean    SE   df lower.CL upper.CL
##  control male     23.0 0.292 1195     22.4     23.5
##  fat     male     27.0 0.292 1195     26.4     27.5
##  starch  male     26.4 0.292 1195     25.8     27.0
##  control female   21.3 0.292 1195     20.8     21.9
##  fat     female   25.4 0.292 1195     24.8     25.9
##  starch  female   24.8 0.292 1195     24.2     25.4
## 
## Confidence level used: 0.95

Once we have created an emmeans object (class=emmGrid), we next specify it in contrast() to estimate differences between the EMMs or in emmip() to graph the interaction.

A quick note about averaging over categorical covariates

Variables not specified in specs= are averaged over. For categorical variables, the default is to give equal weight to each category. This typically makes sense for experimental data where there is counterbalancing of categorical covariates.

In observational data, we instead often want the EMMs to reflect the proportional representation of the categories in the population. For example, races are not equally represented in the US population, and if we want our EMMs to reflect the proportional representation of race in the US, we can specify weights=proportional in emmeans:

# when averaging over sex for these EMMs, use the frequencies of sex 
#  in the data to weight
emm_me <- emmeans(m_me, specs=c("diet"), weights="proportional") 
# not run, because all factors are balanced in the data

Note: The method of averaging over covariates will typically only affect the estimation of EMMs themselves and should not affect estimation of simple effects and interaction contrasts (because those covariate effects are differenced out).

Pairwise comparisons among factor levels with emmeans::contrast()

Although we get estimates of differences between the high fat diet and control as well as high starch diet and control, we do not get a direct estimate of the difference between high fat and high starch.

We can easily obtain all pairwise comparisons among levels of factor by running contrast() on the emmGrid object.

Important arguments for contrast():

# effects of diet
contrast(emm_me, method="pairwise", simple="diet")
##  contrast         estimate    SE   df t.ratio p.value
##  control - fat      -4.010 0.358 1195 -11.195  <.0001
##  control - starch   -3.447 0.358 1195  -9.625  <.0001
##  fat - starch        0.562 0.358 1195   1.570  0.2591
## 
## Results are averaged over the levels of: sex 
## P value adjustment: tukey method for comparing a family of 3 estimates

Multiple comparison adjustments for emmeans

In the output of contrast() estimating the effects of diet, a note at the bottom of the output indicates that p-values have been adjusted using Tukey’s method.

# notice p-val adjustment
contrast(emm_me, method="pairwise", simple="diet")
##  contrast         estimate    SE   df t.ratio p.value
##  control - fat      -4.010 0.358 1195 -11.195  <.0001
##  control - starch   -3.447 0.358 1195  -9.625  <.0001
##  fat - starch        0.562 0.358 1195   1.570  0.2591
## 
## Results are averaged over the levels of: sex 
## P value adjustment: tukey method for comparing a family of 3 estimates

The emmeans package will by default apply adjustments that vary by the method= used.

An adjust= option can be added to many functions from the emmeans package to specify which method of adjustment to use.

For example, we can request no p-value adjustment with adjust="none"

# no p-value adjustment
contrast(emm_me, method="pairwise", simple="diet", adjust="none")
##  contrast         estimate    SE   df t.ratio p.value
##  control - fat      -4.010 0.358 1195 -11.195  <.0001
##  control - starch   -3.447 0.358 1195  -9.625  <.0001
##  fat - starch        0.562 0.358 1195   1.570  0.1167
## 
## Results are averaged over the levels of: sex

See ?summary.emmGrid for a list of the many adjustments that are possible.

A discussion of multiple hypothesis adjustments is beyond the scope of this workshop, and henceforth we will not re-specify the adjustment.

Estimating slopes with emmeans::emtrends()

To estimate slopes of continuous variables, we skip the emmeans() step and use emtrends(), specifying:

Below we request estimates of the slope of age for each diet group:

# slopes of age across diets
emtrends(m_me, specs="diet", var="age")
##  diet    age.trend     SE   df lower.CL upper.CL
##  control      0.53 0.0319 1195    0.468    0.593
##  fat          0.53 0.0319 1195    0.468    0.593
##  starch       0.53 0.0319 1195    0.468    0.593
## 
## Results are averaged over the levels of: sex 
## Confidence level used: 0.95

The slope estimates are the same because this is a main effects model, where the effects of variables are constrained to be the same across levels of other variables.

Graphing model predictions with emmeans::emmip()

The emmip() function is designed to create interaction plots, but can be used to plot predictions from models without interactions.

Graphs of model predictions across values of the predictors are a common way to visualize predictor effects.

Important arguments for emmip() (see ?emmip for many more):


Generally, the variable whose effect we wish to visualize should be mapped to the x-axis.

For example, let’s visualize the effects of diet for each sex (we will need to use the emm_me2 in which both sex and diet were specified in specs=):

# graphs of diet effect for each sex
#   diet is on the x-axis
#   different lines/colors by sex
#   95% confidence intervals
emmip(emm_me2, sex ~ diet, CIs=TRUE)

The graph above is actually a plot of the EMMs estimated for each combination of diet and sex in the object emm_me2, and it allows us to visualize the differences between the expected weights for each diet and sex.

Graphing slopes with emmeans::emmip()

We can also visualize slopes with emmip(), using the following procedure:

Here we visualize the slopes of age for each diet group:

# model object is first argument to visualize slopes
# age on x-axis
# separate lines by diet
# at= specifies we want estimates at ages 2, 9, and 16
# 95% CIs
emmip(m_me, diet ~ age, 
      at=list(age=c(2,9,16)), 
      CIs=TRUE)

For more confidence interval estimates, specify more values inside of the range of the slope variable, such as each integer between 2 and 16 for age…

# now we estimate at each integer between 2 and 16 for age
emmip(m_me, diet ~ age, 
      at=list(age=2:16), 
      CIs=TRUE)

… or each number between 2 and 16 in increments of 0.2:

# now we estimate at each integer between 2 and 16 for age
#  we also change the label for the y-axis
emmip(m_me, diet ~ age, 
      at=list(age=seq(2,16,0.2)), 
      CIs=TRUE,
      ylab="predicted weight (g)")

Customizing emmip() plots with ggplot2

Conveniently, we can use ggplot2 functions and syntax to customize the plot:

# using ggplot2 code to modify emmip graph
emmip(m_me, diet ~ age, at=list(age=seq(2,16,0.2)), CIs=TRUE) +
  theme_classic() +
  scale_color_manual(values=c("orange", "brown", "purple")) +
  labs(x="diet", y="predicted weight (g)") # can also use emmip xlab=, ylab=

If you’d like to build the plot from scratch using the emmeans estimates, you can save the data used to build the emmip() plot by specifying plotit=FALSE and saving the result to an object:

# saving emmip plot data to data set
plotdata <- emmip(m_me, diet ~ age, 
                  at=list(age=seq(2,16,0.2)),
                  CIs=TRUE,
                  plotit=FALSE)
head(plotdata)
##  diet    age yvar    SE   df  LCL  UCL tvar    xvar
##  control 2.0 18.4 0.338 1195 17.8 19.1 control  2.0
##  fat     2.0 22.4 0.338 1195 21.8 23.1 fat      2.0
##  starch  2.0 21.9 0.338 1195 21.2 22.5 starch   2.0
##  control 2.2 18.5 0.333 1195 17.9 19.2 control  2.2
##  fat     2.2 22.6 0.333 1195 21.9 23.2 fat      2.2
##  starch  2.2 22.0 0.333 1195 21.3 22.6 starch   2.2
## 
## Results are averaged over the levels of: sex 
## Confidence level used: 0.95

Then we can use ggplot2 functions to plot the data ourselves. Here we change the CI style from geom_pointrange to geom_ribbon:

# plotting with CI bands
ggplot(plotdata, aes(x=age, y=yvar, color=diet, group=diet)) +
  geom_line() +
  geom_ribbon(aes(ymin=LCL, ymax=UCL, fill=diet), alpha=.5)

Exercise 2

Using the main effects model of bid regressed on hours and rarity (from the auction data) in Exercise 1:

  1. Estimate the slopes of hours for each rarity.
  2. Plot the slopes of hours for each rarity (from 2 to 9 hours).

Interactions and moderation models

Why model interactions?

An interaction allows the effect of a variable to depend on the value of another variable.

Main effects models assume the effect of a variable to be the same across values of another variable, whereas interactions relax this assumption.

If we enter the interaction of variable \(X\) and variable \(Z\) into a regression model, we can answer the following questions:


The same questions with the roles of \(X\) and \(Z\) reversed can also be answered.

The dependency of a variable’s effect on another variable is often called moderation or effect modification.

Product terms

An interaction between variables \(X\) and \(Z\) can be modeled with a product term (variable), \(XZ=X \times Z\).

Although we can create these variables ourselves and add them to the regression model, R provides a convenient syntax for interactions in regression models that does not require the product term to be in the data set.

We don’t usually need to create these variables ourselves because R provides a convenient syntax for regression model formulas to model interactions using only the component variables:

If either X or Z (or both) are factors, R will multiple each dummy variable representing that variable by the other variable.

The following two model specifications are equivalent.

# equivalent interaction notation in R
m_int1 <- lm(weight ~ diet + sex + diet:sex, data=mouse)
m_int2 <- lm(weight ~ diet*sex, data=mouse)

Generally, we should include the lower order terms \(X\) and \(Z\) in a model with their interaction \(XZ\), because omitting them constrains certain simple effects to zero, which can be result in a poorly fitting model.

General interpretation of coefficients in interaction models

For a model that includes \(X\), \(Z\), and their interaction \(XZ\) predicting an outcome \(Y\):

\[Y_i = b_0 + b_XX_i + b_ZZ_i + b_{X \times Z}XZ_i + \epsilon_i\] the general interpretation of the coefficients will be:


The coefficient for the interaction is a difference between two simple effects. For example, \(b_{X \times Z}\) equals the difference in the simple effect of \(X\) when \(Z=1\) and the simple effect of \(X\) when \(Z=0\).

In short:

Interaction of two continuous variables

Regression model with an interaction of two continuous variables

Let’s interpret the coefficients of a regression model that includes:

\[weight_i = b_0 + b_{age}age_i + b_{lit}litter_i + b_{a \times l}(age_i)(litter_i) + b_{sex}(sex_i=female) + \epsilon_i\]

We include sex in the model to discuss how to handle other covariates when visualizing an interaction.

# interaction of age and litter
m_agelitter <- lm(weight ~ age*litter + sex, data=mouse)
summary(m_agelitter)
## 
## Call:
## lm(formula = weight ~ age * litter + sex, data = mouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.5676  -3.5492  -0.0922   3.3112  16.0442 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.51989    0.81014  25.329  < 2e-16 ***
## age          0.67363    0.07876   8.553  < 2e-16 ***
## litter       0.02393    0.11991   0.200   0.8418    
## sexfemale   -1.60246    0.30778  -5.207 2.26e-07 ***
## age:litter  -0.02387    0.01187  -2.010   0.0446 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.331 on 1195 degrees of freedom
## Multiple R-squared:  0.1969, Adjusted R-squared:  0.1942 
## F-statistic: 73.24 on 4 and 1195 DF,  p-value: < 2.2e-16

Interpretation:


As we can see above,


With these coefficients, we can estimate the simple slope (effect) of age at different litter sizes (litter is the moderator here).

Simple effects not directly estimated by the regression will be calculated by adding the lower order term to the interaction term multiplied by the value of the moderator.

\[age.effect_{litter} = \hat{b}_{age} + \hat{b}_{a \times l}litter\] For example, the \(age\) effect when \(litter=0\) is:

\[\begin{align} age.effect_{0} &= \hat{b}_{age} + \hat{b}_{a \times l}0 \\ &= 0.67 - 0.024(0) \\ &= 0.67 \\ \end{align}\]

The \(age\) effect when \(litter=2\) is:

\[\begin{align} age.effect_{2} &= \hat{b}_{age} + \hat{b}_{a \times l}2 \\ &= 0.67 - 0.024(2) \\ &= 0.63 \end{align}\]

The \(age\) effect when \(litter=10\) is:

\[\begin{align} age.effect_{10} &= \hat{b}_{age} + \hat{b}_{a \times l}10 \\ &= 0.67 - 0.024(10) \\ &= 0.43 \end{align}\]

The age effect, the increase in weight per week, decreases as litter size increases. Litter appears to moderate the effect of age.

We can similarly estimate the effect of litter at various ages.

\[litter.effect_{age} = \hat{b}_{lit} + \hat{b}_{a \times l}age\] The \(litter\) effect when \(age=8\) is:

\[\begin{align} litter.effect_{8} &= \hat{b}_{lit} + \hat{b}_{a \times l}8 \\ &= 0.024 - 0.024(8) \\ &= -0.17 \end{align}\]

At 8 weeks, for each additional pup in the birth litter, a mouse is expected to weigh 0.17 grams less.

Using emmeans::emtrends() to estimate simple slopes

Because we wish to estimate the simple effects (slopes) of a continuous variable, we will use emtrends().

Below, we will estimate the simple slopes of age at different values of litter. For this problem, litter is the moderator. When the moderator is continuous, we will have to choose values at which we wish to estimate simple effects/slopes of the other variable. Common choices are:


Let’s estimate the effects of age at litter sizes 2, 6, and 10 (min, mean, and max).

Here, we will use the following arguments:

# simples slopes of age at litter=2,6,10
age_by_litter <- emtrends(m_agelitter, # model
                          specs="litter", # moderator
                          var="age",  # simple effects we want
                          at=list(litter=c(2, 6, 10))) # moderator values
age_by_litter
##  litter age.trend     SE   df lower.CL upper.CL
##       2     0.626 0.0582 1195    0.512    0.740
##       6     0.530 0.0336 1195    0.465    0.596
##      10     0.435 0.0582 1195    0.321    0.549
## 
## Results are averaged over the levels of: sex 
## Confidence level used: 0.95

We can see the decreasing slope of age as litter increases.

If you need p-values for the age slopes, put the result of emtrends() inside of test():

# p-values for age slopes
test(age_by_litter)
##  litter age.trend     SE   df t.ratio p.value
##       2     0.626 0.0582 1195  10.761  <.0001
##       6     0.530 0.0336 1195  15.795  <.0001
##      10     0.435 0.0582 1195   7.478  <.0001
## 
## Results are averaged over the levels of: sex

Visualizing the interaction of two continuous variables with emmeans::emmip()

Now we will graph the slopes of age at specific values of litter using emmip():

# graph of age slopes by litter values
emmip(m_agelitter, litter ~ age, 
      at=list(age=2:16, litter=c(2,6,10)),
      CIs=TRUE)

We see that mice from larger litters grow slower over time.

Exercise 3

Using the auction dataset:

  1. Run a linear regression modeling bid predicted by hours, quality, and their interaction.
  2. Estimate the simple slopes of hours at each quality from 0 to 5.
  3. Graph the simple slopes of hours at each quality from 0 to 5.

Interaction of two categorical variables

Modeling the interaction of two categorical variables

Now we will model the interaction of 2 categorical variables in a model that includes:

The diet variable is represented by \(2\) dummy variables, while the sex variable is represented by \(1\) dummy variable, so the interaction will be represented by \(2 \times 1 = 2\) variables.

The regression model:

\[\begin{align}weight_i = &b_0 + b_{fat}(diet_i=fat) + b_{sta}(diet_i=starch) + b_{fem}(sex_i=female) + \\ &b_{f \times f}(diet_i=fat)(sex_i=female) + b_{s \times f}(diet_i=starch)(sex_i=female) + b_{age}age_i + \epsilon_i \end{align}\]

# interaction of diet and sex
m_dietsex <- lm(weight ~ diet*sex + age, data=mouse)
summary(m_dietsex)
## 
## Call:
## lm(formula = weight ~ diet * sex + age, data = mouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8664  -3.2698  -0.0475   3.2184  15.4413 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          17.962272   0.458595  39.168  < 2e-16 ***
## dietfat               4.011671   0.505993   7.928 5.07e-15 ***
## dietstarch            4.091955   0.505993   8.087 1.49e-15 ***
## sexfemale            -1.171179   0.505993  -2.315   0.0208 *  
## age                   0.530425   0.031875  16.641  < 2e-16 ***
## dietfat:sexfemale    -0.004334   0.715583  -0.006   0.9952    
## dietstarch:sexfemale -1.289503   0.715583  -1.802   0.0718 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.06 on 1193 degrees of freedom
## Multiple R-squared:  0.2777, Adjusted R-squared:  0.274 
## F-statistic: 76.43 on 6 and 1193 DF,  p-value: < 2.2e-16

Interpretation:


Questions not directly answered by the coefficients:

These questions can be answered by functions in the emmeans pacakge (or by re-parameterizing the model).

Simple effects analysis with emmeans::contrast()

Two of the regression coefficients from the model with diet and sex interacted could be interpreted as the simple effects of diet for males.

We can estimate all of these simple effects easily with contrast() from emmeans.

The first step is creating an emmGrid object from the emmeans() function, and we should specify both variables involved in the interaction in specs=.

# EMMs for diet*sex interaction
emm_dietsex <- emmeans(m_dietsex, specs=c("diet", "sex"))
emm_dietsex
##  diet    sex    emmean    SE   df lower.CL upper.CL
##  control male     22.7 0.358 1193     22.0     23.4
##  fat     male     26.7 0.358 1193     26.0     27.4
##  starch  male     26.8 0.358 1193     26.1     27.5
##  control female   21.6 0.358 1193     20.9     22.3
##  fat     female   25.6 0.358 1193     24.9     26.3
##  starch  female   24.4 0.358 1193     23.7     25.1
## 
## Confidence level used: 0.95

Then we specify the emmGrid object as the first argument to contrast, as well as:

# simple effects of diet by sex
contrast(emm_dietsex, method="pairwise", simple="diet")
## sex = male:
##  contrast         estimate    SE   df t.ratio p.value
##  control - fat     -4.0117 0.506 1193  -7.928  <.0001
##  control - starch  -4.0920 0.506 1193  -8.087  <.0001
##  fat - starch      -0.0803 0.506 1193  -0.159  0.9862
## 
## sex = female:
##  contrast         estimate    SE   df t.ratio p.value
##  control - fat     -4.0073 0.506 1193  -7.920  <.0001
##  control - starch  -2.8025 0.506 1193  -5.539  <.0001
##  fat - starch       1.2049 0.506 1193   2.381  0.0458
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Above we see the estimated simple effects of diet for each sex.

Although the control - fat contrast looks similar between the sexes, the other two contrasts do not:

For confidence intervals, we can put the contrast() inside of confint():

# 95% CI for simple effects of diet by sex
confint(contrast(emm_dietsex, method="pairwise", simple="diet"))
## sex = male:
##  contrast         estimate    SE   df lower.CL upper.CL
##  control - fat     -4.0117 0.506 1193  -5.1991    -2.82
##  control - starch  -4.0920 0.506 1193  -5.2793    -2.90
##  fat - starch      -0.0803 0.506 1193  -1.2677     1.11
## 
## sex = female:
##  contrast         estimate    SE   df lower.CL upper.CL
##  control - fat     -4.0073 0.506 1193  -5.1947    -2.82
##  control - starch  -2.8025 0.506 1193  -3.9898    -1.62
##  fat - starch       1.2049 0.506 1193   0.0175     2.39
## 
## Confidence level used: 0.95 
## Conf-level adjustment: tukey method for comparing a family of 3 estimates

If we are instead interested in simple sex effects for each diet, we simply specify simple="sex" instead:

# simple effects of sex by diet
contrast(emm_dietsex, method="pairwise", simple="sex")
## diet = control:
##  contrast      estimate    SE   df t.ratio p.value
##  male - female     1.17 0.506 1193   2.315  0.0208
## 
## diet = fat:
##  contrast      estimate    SE   df t.ratio p.value
##  male - female     1.18 0.506 1193   2.323  0.0203
## 
## diet = starch:
##  contrast      estimate    SE   df t.ratio p.value
##  male - female     2.46 0.506 1193   4.863  <.0001

Interaction contrasts using contrast()

One of the interaction contrasts (difference between simple effects) is not directly estimated in the regression model: the difference between (fat - starch) for males and (fat - starch) for females (or, equivalently, the difference between (male - female) for the fat diet vs (male - female) for the starch diet.)

We can easily get all interaction contrasts with contrast(), by specifying interaction=TRUE instead of simple= (but keeping method="pairwise"):

# interaction contrasts
contrast(emm_dietsex, method="pairwise", interaction=TRUE)
##  diet_pairwise    sex_pairwise  estimate    SE   df t.ratio p.value
##  control - fat    male - female -0.00433 0.716 1193  -0.006  0.9952
##  control - starch male - female -1.28950 0.716 1193  -1.802  0.0718
##  fat - starch     male - female -1.28517 0.716 1193  -1.796  0.0728

The final contrast with estimate \(-1.28517\) is the one interaction not directly estimated by the regression model we ran.

Visualizing interactions of two categorical variables with emmeans::emmip()

To visualize the simple effects of diet across levels of sex (the moderator here), we specify in emmip():

# diet mapped to x-axis, separate lines/colors by sex, with 95% CIs
emmip(emm_dietsex, sex ~ diet, CI=TRUE) 

The lack of parallelism between the lines suggests an interaction. Specifically, it seems the starch diet affects weight (vs control or vs fat) differently between the sexes.

If we instead wish to visualize the simple effects of sex across diet, we reverse the formula:

# sex mapped to x-axis, separate lines/colors by diet, with 95% CIs
emmip(emm_dietsex, diet ~ sex, CI=TRUE) 

Exercise 4

Using the auction data set

  1. Run a linear regression modeling bid predicted by rarity, budget, and their interaction.
  2. Estimate the simple effects of rarity across budgets.
  3. Graph the simple effects of rarity across budgets.

Interaction of a categorical and continuous variable

Regression model with an interaction of a categorical and continuous variable

We will discuss how to interpret and visualize the interaction of a categorical and continuous variable by modeling weight regressed on:

\[\begin{align}weight_i = &b_0 + b_{fat}(diet_i=fat) + b_{sta}(diet_i=starch) + b_{age}age_i + \\ &b_{f \times a}(diet_i=fat)(age_i) + b_{s \times a}(diet_i=starch)(age_i) + b_{lit}litter_i\end{align}\]

# interaction of diet and age
m_dietage <- lm(weight ~ diet*age + litter, data=mouse)
summary(m_dietage)
## 
## Call:
## lm(formula = weight ~ diet * age + litter, data = mouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5675  -3.3093  -0.0853   3.1501  15.1545 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    20.66186    0.63530  32.523  < 2e-16 ***
## dietfat         0.23384    0.78533   0.298 0.765939    
## dietstarch      0.80315    0.78533   1.023 0.306664    
## age             0.29266    0.05498   5.323 1.22e-07 ***
## litter         -0.19088    0.05143  -3.711 0.000216 ***
## dietfat:age     0.41952    0.07776   5.395 8.25e-08 ***
## dietstarch:age  0.29378    0.07776   3.778 0.000166 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.039 on 1193 degrees of freedom
## Multiple R-squared:  0.2835, Adjusted R-squared:  0.2799 
## F-statistic: 78.68 on 6 and 1193 DF,  p-value: < 2.2e-16

Interpretation of coefficients:


Some quantities not directly estimated by this regression model:

Simple slopes analysis using emmeans::emtrends()

To estimate simple slopes, we again skip the step of estimating EMMs with emmeans() and use emtrends() directly, specifying:

# simple slopes of age by diet
emtrends(m_dietage, specs=c("diet"), var="age")
##  diet    age.trend    SE   df lower.CL upper.CL
##  control     0.293 0.055 1193    0.185    0.401
##  fat         0.712 0.055 1193    0.604    0.820
##  starch      0.586 0.055 1193    0.479    0.694
## 
## Confidence level used: 0.95

We can see that mice on the control diet grow considerably slower than those on the other two diets.

For p-values, place the emtrends() within test():

# p-values of slopes of age by diet
test(emtrends(m_dietage, specs=c("diet"), var="age"))
##  diet    age.trend    SE   df t.ratio p.value
##  control     0.293 0.055 1193   5.323  <.0001
##  fat         0.712 0.055 1193  12.952  <.0001
##  starch      0.586 0.055 1193  10.666  <.0001

Simple effects analysis using emmeans::contrast()

We may also be interested in the simple effects of diets at specific ages.

We return to our procedure for estimating simple effects by first using emmeans() to obtain the EMMs.

# EMMs for each diet at ages 2, 9, and 16
emm_dietage <- emmeans(m_dietage, specs=c("diet", "age"), at=list(age=c(2,9,16)))
emm_dietage
##  diet    age emmean    SE   df lower.CL upper.CL
##  control   2   20.1 0.460 1193     19.2     21.0
##  fat       2   21.2 0.460 1193     20.3     22.1
##  starch    2   21.5 0.460 1193     20.6     22.4
##  control   9   22.2 0.252 1193     21.7     22.6
##  fat       9   26.2 0.252 1193     25.7     26.7
##  starch    9   25.6 0.252 1193     25.1     26.1
##  control  16   24.2 0.460 1193     23.3     25.1
##  fat      16   31.1 0.460 1193     30.2     32.0
##  starch   16   29.7 0.460 1193     28.8     30.6
## 
## Confidence level used: 0.95

Then, to estimate the simple effects of diet, we use contrast() just as before, with method="pairwise" and simple="diet":

# simple effects of diet by age
contrast(emm_dietage, method="pairwise", simple="diet")
## age =  2:
##  contrast         estimate    SE   df t.ratio p.value
##  control - fat      -1.073 0.651 1193  -1.649  0.2256
##  control - starch   -1.391 0.651 1193  -2.138  0.0828
##  fat - starch       -0.318 0.651 1193  -0.489  0.8768
## 
## age =  9:
##  contrast         estimate    SE   df t.ratio p.value
##  control - fat      -4.010 0.356 1193 -11.252  <.0001
##  control - starch   -3.447 0.356 1193  -9.674  <.0001
##  fat - starch        0.562 0.356 1193   1.578  0.2556
## 
## age = 16:
##  contrast         estimate    SE   df t.ratio p.value
##  control - fat      -6.946 0.651 1193 -10.677  <.0001
##  control - starch   -5.504 0.651 1193  -8.460  <.0001
##  fat - starch        1.442 0.651 1193   2.217  0.0687
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Visualizing a categorical by continuous interaction using emmeans:emmip()

To graph the simple slopes of age by diet, we again skip the emmeans() step and proceed immediately to using emmip().

# slopes of age from 2 to 16 weeks for each diet
emmip(m_dietage, diet ~ age, at=list(age=2:16), CIs=TRUE) 

If instead we wish to plot simple effects of diet at specific ages we can use either the emmGrid object we created before (which has the ages specified)…

# diet effects at ages 2 6 and 9 (specified in emmeans)
emmip(emm_dietage, age ~ diet)

…or use the lm model object and specify the ages as above.

# not run, same as above
emmip(m_dietage, age ~ diet, at=list(age=c(2,9,16)))

Exercise 5

Using the auction dataset

  1. Run a linear regression modeling bid predicted by hours, rarity, and their interaction.
  2. Analyze the interaction both through estimation and graphing.

Interactions in generalized linear models

Considerations for interpreting interactions in generalized linear models

The methods we have used thus far will continue to work for generalized linear models (GLMs), such as logistic and Poisson regression models.

GLMs use a link function to transform an outcome variable with limited range (e.g between 0 and 1) typically to a variable that can take on any value, to allow modeling linear relationships with the predictors.

The predictors will have a linear relationship with the transformed outcome (e.g. the log-odds of the outcome), but a non-linear (i.e. curved) relationship with the untransformed outcome.

For example, here we show in a logistic regression that age has a linear relationship with the log-odds of hypertension.

# logistic regression
m_logistic <- glm(hyper ~ age, data=mouse, family="binomial")
# log-odds of hypertension vs age
emmip(m_logistic, ~ age, at=list(age=2:16))

…but a curved relationship with the probability of hypertension:

# probability of hypertension vs age
emmip(m_logistic, ~ age, at=list(age=2:16), type="response")

If we extend the range of age to beyond the observed data, we see the typical S-shape relationship between continuous predictors and probability of the outcome in logistic regression:

# probability of hypertension vs extended ages; notice the S-shape
emmip(m_logistic, ~ age, at=list(age=0:50), type="response")

The advantage of plotting the predictor against the transformed outcome is that the non-parallelism associated with an interaction will be more obvious. On the other hand, most people will have a harder time interpreting on the transformed scale (i.e. how to interpret changes in log odds).

Additionally, effect estimates from GLMs are often expressed on two scales. In logistic regression:

Thus, we can express simple effects in multiple ways.

We will use a logistic regression model to discuss analysis and visualization of interactions in GLMs.

Logistic regression model

Here we model the probability of hypertension (0/1) in a logistic regression that includes as predictors:

\[\begin{align}log\left(\frac{Pr(hyper_i=1)}{(1-Pr(hyper_i=1))}\right) = &b_0 + b_{fat}(diet_i=fat) + b_{sta}(diet_i=starch) + b_{age}age_i + \\ &b_{f \times a}(diet_i=fat)(age_i) + b_{s \times a}(diet_i=starch)(age_i) + b_{lit}litter_i\end{align}\]

# logistic regression with interaction of diet and age
m_hyper <- glm(hyper ~ diet*age + litter, data=mouse, family="binomial")
summary(m_hyper)
## 
## Call:
## glm(formula = hyper ~ diet * age + litter, family = "binomial", 
##     data = mouse)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.86236    0.64515  -4.437 9.13e-06 ***
## dietfat        -0.42293    0.74247  -0.570  0.56893    
## dietstarch     -0.53860    0.79490  -0.678  0.49804    
## age            -0.03290    0.06442  -0.511  0.60954    
## litter         -0.05673    0.03607  -1.573  0.11578    
## dietfat:age     0.23874    0.07312   3.265  0.00109 ** 
## dietstarch:age  0.17518    0.07697   2.276  0.02284 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 788.95  on 1199  degrees of freedom
## Residual deviance: 676.02  on 1193  degrees of freedom
## AIC: 690.02
## 
## Number of Fisher Scoring iterations: 6

Interpretation:

Odds Ratios

Coefficients in logistic regression are often exponentiated so that they be interpreted as odds ratios, except:

# exponentiating coefficients for odds ratios
exp(coef(m_hyper))
##    (Intercept)        dietfat     dietstarch            age         litter 
##      0.0571339      0.6551253      0.5835626      0.9676350      0.9448537 
##    dietfat:age dietstarch:age 
##      1.2696526      1.1914637

Interpretation:

Exponentiated logistic regression interaction coefficients have a challenging interpretation – they are ratios of odds ratios. For example, if we compare the odds ratio representing the age effect (i.e. how much the odds of hypertension change when age increases by 1 week) between high fat and control mice, that ratio is \(exp(\hat{b}_{f \times a})=1.27\). In other words, the increasing risk of hypertension with age is 26% larger for high fat mice than control mice.

Estimating logistic regression simple effects with contrast()

The procedure to estimate simple effects across levels of a moderator are the same as before.

For simple effects of a categorical variable, we begin by estimating EMMs with emmeans(). Here we estimate the simple effects of diet at ages 2, 9, and 16.

By default, these will be expressed in log-odds (logits).

# EMMs for diet*age expressed as log-odds
emm_hyper_logodds <- emmeans(m_hyper, specs=c("diet", "age"),
                             at=list(age=c(2,9,16)))
emm_hyper_logodds
##  diet    age emmean    SE  df asymp.LCL asymp.UCL
##  control   2 -3.269 0.504 Inf    -4.256   -2.2811
##  fat       2 -3.214 0.355 Inf    -3.909   -2.5185
##  starch    2 -3.457 0.428 Inf    -4.297   -2.6170
##  control   9 -3.499 0.297 Inf    -4.080   -2.9174
##  fat       9 -1.773 0.162 Inf    -2.090   -1.4563
##  starch    9 -2.461 0.200 Inf    -2.854   -2.0679
##  control  16 -3.729 0.573 Inf    -4.853   -2.6052
##  fat      16 -0.332 0.209 Inf    -0.742    0.0774
##  starch   16 -1.465 0.266 Inf    -1.986   -0.9438
## 
## Results are given on the logit (not the response) scale. 
## Confidence level used: 0.95

If we instead want the EMMs expressed as probabilities, we can add type="response":

# EMMs for diet*age expressed as probabilities
emm_hyper_prob <- emmeans(m_hyper, specs=c("diet", "age"),
                          at=list(age=c(2,9,16)),
                          type="response")
emm_hyper_prob
##  diet    age   prob      SE  df asymp.LCL asymp.UCL
##  control   2 0.0367 0.01780 Inf   0.01398    0.0927
##  fat       2 0.0386 0.01318 Inf   0.01966    0.0746
##  starch    2 0.0306 0.01270 Inf   0.01343    0.0681
##  control   9 0.0293 0.00845 Inf   0.01662    0.0513
##  fat       9 0.1452 0.02006 Inf   0.11009    0.1890
##  starch    9 0.0787 0.01453 Inf   0.05449    0.1123
##  control  16 0.0235 0.01313 Inf   0.00774    0.0688
##  fat      16 0.4177 0.05083 Inf   0.32262    0.5193
##  starch   16 0.1877 0.04054 Inf   0.12070    0.2801
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale

Then, to estimate the simple effects we use contrast(). With the emmGrid object where EMMs are expressed in log-odds, the simple effects will be interpreted as differences in log-odds.

# using EMMs expressed as log-odds, effects will be differences in log-odds
contrast(emm_hyper_logodds, method="pairwise", simple="diet")
## age =  2:
##  contrast         estimate    SE  df z.ratio p.value
##  control - fat     -0.0546 0.616 Inf  -0.089  0.9957
##  control - starch   0.1882 0.661 Inf   0.285  0.9563
##  fat - starch       0.2428 0.556 Inf   0.437  0.9002
## 
## age =  9:
##  contrast         estimate    SE  df z.ratio p.value
##  control - fat     -1.7258 0.337 Inf  -5.116  <.0001
##  control - starch  -1.0380 0.357 Inf  -2.904  0.0103
##  fat - starch       0.6877 0.257 Inf   2.677  0.0203
## 
## age = 16:
##  contrast         estimate    SE  df z.ratio p.value
##  control - fat     -3.3970 0.610 Inf  -5.566  <.0001
##  control - starch  -2.2643 0.632 Inf  -3.584  0.0010
##  fat - starch       1.1326 0.338 Inf   3.351  0.0023
## 
## Results are given on the log odds ratio (not the response) scale. 
## P value adjustment: tukey method for comparing a family of 3 estimates

If we instead use contrast() on the emmGrid object where the EMMs are expressed as probabilities, the simple effects will be expressed as odds ratios:

# using EMMs expressed as probabilities, effects will be odds ratios
contrast(emm_hyper_prob, method="pairwise", simple="diet")
## age = 2:
##  contrast         odds.ratio     SE  df null z.ratio p.value
##  control / fat        0.9469 0.5831 Inf    1  -0.089  0.9957
##  control / starch     1.2071 0.7979 Inf    1   0.285  0.9563
##  fat / starch         1.2748 0.7085 Inf    1   0.437  0.9002
## 
## age = 9:
##  contrast         odds.ratio     SE  df null z.ratio p.value
##  control / fat        0.1780 0.0601 Inf    1  -5.116  <.0001
##  control / starch     0.3541 0.1266 Inf    1  -2.904  0.0103
##  fat / starch         1.9892 0.5111 Inf    1   2.677  0.0203
## 
## age = 16:
##  contrast         odds.ratio     SE  df null z.ratio p.value
##  control / fat        0.0335 0.0204 Inf    1  -5.566  <.0001
##  control / starch     0.1039 0.0656 Inf    1  -3.584  0.0010
##  fat / starch         3.1039 1.0492 Inf    1   3.351  0.0023
## 
## P value adjustment: tukey method for comparing a family of 3 estimates 
## Tests are performed on the log odds ratio scale

Estimating logistic regression simple slopes with emtrends()

To estimate the simple slopes of age across diets, we use emtrends(), which by default will be expressed as changes in log-odds per week:

# slopes of age by diet in log-odds
emtrends(m_hyper, specs="diet", var="age")
##  diet    age.trend     SE  df asymp.LCL asymp.UCL
##  control   -0.0329 0.0644 Inf   -0.1592    0.0934
##  fat        0.2058 0.0346 Inf    0.1380    0.2737
##  starch     0.1423 0.0421 Inf    0.0597    0.2248
## 
## Confidence level used: 0.95

No option in emtrends() will transform these simple slopes to odds ratios, but we can do it manually:

# extract the table of estimates
emt <- summary(emtrends(m_hyper, specs="diet", var="age"))
# exponentiate the estimate and the confidence limits
emt$OR <- exp(emt$age.trend)
emt$OR_LCL <- exp(emt$asymp.LCL)
emt$OR_UCL <- exp(emt$asymp.UCL)
# display the results, removing the unexponentiated columns and SE
emt
##  diet    age.trend     SE  df asymp.LCL asymp.UCL    OR OR_LCL OR_UCL
##  control   -0.0329 0.0644 Inf   -0.1592    0.0934 0.968  0.853   1.10
##  fat        0.2058 0.0346 Inf    0.1380    0.2737 1.229  1.148   1.31
##  starch     0.1423 0.0421 Inf    0.0597    0.2248 1.153  1.062   1.25
## 
## Confidence level used: 0.95

Forest-style plots of odds ratios

Odds ratios are commonly displayed in plots that resemble Forest plots (often used for meta-analyses), where the odds ratio estimate is displayed with its confidence interval.

Using plot() on a contrast() object will produce a Forest-style plot of the simple effects.

Below we use the emmGrid object where the contrasts are expressed as odds ratios to plot odds ratios.

# Forest-style plot of odds ratios
plot(contrast(emm_hyper_prob, method="pairwise", simple="diet")) 

A reference line at x=1 is often added to signify “no effect”:

# Forest-style plot of odds ratios
plot(contrast(emm_hyper_prob, method="pairwise", simple="diet")) +
  geom_vline(xintercept=1, linetype=2) # dashed vertical line at x=1

Interaction plots on the log-odds scale

Plotting the interaction on the log-odds scale will make the interaction more apparent (assuming it’s large enough), but log-odds units are hard to interpret.

Here we use emmip() to plot simple effects of diet at weeks 2, 9, and 16. We use the emmGrid object where EMMs are estimated in log odds.

# using EMMs expressed as log odds
#  diet on x-axis, separate lines/colors by age
#  predicted log-odds of hypertension
emmip(emm_hyper_logodds, age ~ diet, CIs=TRUE)

We can see that at age=2, the risk of hypertension appears quite similar across diets, but they diverge more as age increases.

We can also plot the simple slopes of age by diet using emmip():

# age on x-axis, separate lines by diet
#  predicted log-odds of hypertension
emmip(m_hyper, diet ~ age, at=list(age=2:16), CIs=TRUE)

We see the increasing risk of hypertension with age on the two experimental diets.

Interaction plots on the probability scale

Plotting the interaction on the log-odds scale will generally make the interaction less apparent because it is difficult to judge the parallelism of curves, but probability units are much more natural to interpret for most researchers.

To change the outcome to probabilities, we simply add type="response" to emmip() specifications we used before:

# diet on x-axis, separate lines/colors by age
#  predicted probabilities of hypertension 
emmip(emm_hyper_logodds, age ~ diet, CIs=TRUE, type="response")

# age on x-axis, separate lines by diet
#  predicted probabilities of hypertension
emmip(m_hyper, diet ~ age, at=list(age=2:16), CIs=TRUE, type="response")

With predicted probability curves like above, it can be difficult to judge whether the curves look different because the underlying age effect is different or simply because of consistent diet effects across ages.

Do it yourself (DIY) analysis and visualization of interactions

Overview of DIY analysis of interactions

This section looks at methods for analyzing interactions with base R coding and visualizing interactions with the ggplot2 package.

One advantage of learning to analyze interactions without emmeans is that these methods will work for regression models and packages not supported by emmeans.

On the other hand, doing it yourself requires that you have a solid grasp of how to interpret regression coefficients.

DIY Estimating simple effects and slopes in R

As we have discussed throughout this workshop, the coefficient for the lower order term of \(X\) when it is interacted with \(Z\) represents effect of \(X\) when \(Z=0\). We change the interpretation of this coefficient by changing the meaning of \(Z=0\):


Re-centering a continuous moderator

Imagine we have a variable \(Z^* = Z - 5\). If we model:

\[Y_i = b_0 + b_XX + b_{Z^*}Z^* + b_{X \times Z^*}XZ^* + \epsilon_i\]

\(b_X\) is interpreted as the effect of \(X\) when \(Z^*=0\). However, when \(Z^*=0\), \(Z=5\), so we can also interpret \(b_X\) as the effect of \(X\) when \(Z=5\).

Thus, we can estimate a coefficient that represents the effect of \(X\) at any value of \(Z\) by re-centering \(Z\).

Let’s revisit our model with the interaction of diet and age:

# diet*age model again
m_dietage <- lm(weight ~ diet*age + litter, data=mouse)
summary(m_dietage)
## 
## Call:
## lm(formula = weight ~ diet * age + litter, data = mouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5675  -3.3093  -0.0853   3.1501  15.1545 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    20.66186    0.63530  32.523  < 2e-16 ***
## dietfat         0.23384    0.78533   0.298 0.765939    
## dietstarch      0.80315    0.78533   1.023 0.306664    
## age             0.29266    0.05498   5.323 1.22e-07 ***
## litter         -0.19088    0.05143  -3.711 0.000216 ***
## dietfat:age     0.41952    0.07776   5.395 8.25e-08 ***
## dietstarch:age  0.29378    0.07776   3.778 0.000166 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.039 on 1193 degrees of freedom
## Multiple R-squared:  0.2835, Adjusted R-squared:  0.2799 
## F-statistic: 78.68 on 6 and 1193 DF,  p-value: < 2.2e-16

The coefficients for diet above, \(b_{fat}\) and \(b_{sta}\), are interpreted as the differences from the control diet when age=0.

If we instead want to estimate diet effects at other ages, we can re-center age, such as at ages 9 and 16 weeks.

Although we could add various re-centered versions a variable to a data set and enter them into regression models to estimate these simple effects…

# not run
mouse$age9 <- mouse$age - 9
m_dietage9 <- lm(weight ~ diet*age9 + litter, data=mouse)

R provides a syntax shortcut that obviates the need for these variables in the data.

The I() function can be specified in regression formulas to allow operators like +, -, and ^ to be used as arithmetic operators.

Here, we estimate the effect of diet when age is 9 by specifying I(age-9), which R interprets as “age variable minus 9”

# re-centering age using I()
m_dietage9 <- lm(weight ~ diet*I(age-9) + litter, data=mouse)
# simple diet effects are now at age=9
summary(m_dietage9)
## 
## Call:
## lm(formula = weight ~ diet * I(age - 9) + litter, data = mouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5675  -3.3093  -0.0853   3.1501  15.1545 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           23.29578    0.39840  58.474  < 2e-16 ***
## dietfat                4.00950    0.35634  11.252  < 2e-16 ***
## dietstarch             3.44720    0.35634   9.674  < 2e-16 ***
## I(age - 9)             0.29266    0.05498   5.323 1.22e-07 ***
## litter                -0.19088    0.05143  -3.711 0.000216 ***
## dietfat:I(age - 9)     0.41952    0.07776   5.395 8.25e-08 ***
## dietstarch:I(age - 9)  0.29378    0.07776   3.778 0.000166 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.039 on 1193 degrees of freedom
## Multiple R-squared:  0.2835, Adjusted R-squared:  0.2799 
## F-statistic: 78.68 on 6 and 1193 DF,  p-value: < 2.2e-16

Those lower order coefficients for diet, \(\hat{b}_{fat}=4.01\) and \(\hat{b}_{sta}=3.45\) are now interpreted as differences from the control group when age = 9.

They are the same as the diet simple effects at age=9 we estimated with contrast() from the emmeans package:

# compare to emmeans::contrast()
contrast(emm_dietage, method="pairwise", simple="diet")
## age =  2:
##  contrast         estimate    SE   df t.ratio p.value
##  control - fat      -1.073 0.651 1193  -1.649  0.2256
##  control - starch   -1.391 0.651 1193  -2.138  0.0828
##  fat - starch       -0.318 0.651 1193  -0.489  0.8768
## 
## age =  9:
##  contrast         estimate    SE   df t.ratio p.value
##  control - fat      -4.010 0.356 1193 -11.252  <.0001
##  control - starch   -3.447 0.356 1193  -9.674  <.0001
##  fat - starch        0.562 0.356 1193   1.578  0.2556
## 
## age = 16:
##  contrast         estimate    SE   df t.ratio p.value
##  control - fat      -6.946 0.651 1193 -10.677  <.0001
##  control - starch   -5.504 0.651 1193  -8.460  <.0001
##  fat - starch        1.442 0.651 1193   2.217  0.0687
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

Let’s compare the coefficients from the model with uncentered age to the model with age centered at 9:

uncentered age
age centered at 9
term estimate std.error p.value term estimate std.error p.value
(Intercept) 20.662 0.635 0.000 (Intercept) 23.296 0.398 0
dietfat 0.234 0.785 0.766 dietfat 4.010 0.356 0
dietstarch 0.803 0.785 0.307 dietstarch 3.447 0.356 0
age 0.293 0.055 0.000 I(age - 9) 0.293 0.055 0
litter -0.191 0.051 0.000 litter -0.191 0.051 0
dietfat:age 0.420 0.078 0.000 dietfat:I(age - 9) 0.420 0.078 0
dietstarch:age 0.294 0.078 0.000 dietstarch:I(age - 9) 0.294 0.078 0

Above we can see that only the intercept, \(\hat{b}_0\), and the two diet simple effects, \(\hat{b}_{fat}\) and \(\hat{b}_{sta}\) have changed after re-centering.

Importantly, replacing a variable with a re-centered version of it will not change the overall fit of the model:

# log-likelihoods of 2 models
c(uncentered=logLik(m_dietage), centered.at.9=logLik(m_dietage9))
##    uncentered centered.at.9 
##     -3639.956     -3639.956

Mini-exercise: Recreate the estimates of the simple effects of diet at age=2 and age=16 shown in the output of contrast() above.


Re-centering a categorical moderator

Let’s revisit our model with the interaction of diet and uncentered age one more time:

# diet*age model one more time
m_dietage <- lm(weight ~ diet*age + litter, data=mouse)
summary(m_dietage)
## 
## Call:
## lm(formula = weight ~ diet * age + litter, data = mouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5675  -3.3093  -0.0853   3.1501  15.1545 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    20.66186    0.63530  32.523  < 2e-16 ***
## dietfat         0.23384    0.78533   0.298 0.765939    
## dietstarch      0.80315    0.78533   1.023 0.306664    
## age             0.29266    0.05498   5.323 1.22e-07 ***
## litter         -0.19088    0.05143  -3.711 0.000216 ***
## dietfat:age     0.41952    0.07776   5.395 8.25e-08 ***
## dietstarch:age  0.29378    0.07776   3.778 0.000166 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.039 on 1193 degrees of freedom
## Multiple R-squared:  0.2835, Adjusted R-squared:  0.2799 
## F-statistic: 78.68 on 6 and 1193 DF,  p-value: < 2.2e-16

The lower order coefficient for age, \(\hat{b}_{age}=0.29\), is interpreted as the slope of age for the control group (i.e. fat=0 and starch=0).

We might be interested in the slope of age for the high-fat and high-starch groups. We can get those directly estimated by the model if we change the reference groups for diet.

Although we can create a new variable with a changed reference group using factor(), we can instead change reference groups of an existing factor variable temporarily for a regression model using relevel(), which has a ref= argument that specifies the reference group.

Here we change the reference group of diet to the high-fat group:

# changing reference group of diet from control to fat
m_dietfatage <- lm(weight ~ relevel(diet, ref="fat")*age + litter, data=mouse)
# age slope is now for high-fat group
summary(m_dietfatage)
## 
## Call:
## lm(formula = weight ~ relevel(diet, ref = "fat") * age + litter, 
##     data = mouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5675  -3.3093  -0.0853   3.1501  15.1545 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           20.89570    0.63530  32.891  < 2e-16 ***
## relevel(diet, ref = "fat")control     -0.23384    0.78533  -0.298 0.765939    
## relevel(diet, ref = "fat")starch       0.56931    0.78533   0.725 0.468640    
## age                                    0.71218    0.05498  12.952  < 2e-16 ***
## litter                                -0.19088    0.05143  -3.711 0.000216 ***
## relevel(diet, ref = "fat")control:age -0.41952    0.07776  -5.395 8.25e-08 ***
## relevel(diet, ref = "fat")starch:age  -0.12573    0.07776  -1.617 0.106150    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.039 on 1193 degrees of freedom
## Multiple R-squared:  0.2835, Adjusted R-squared:  0.2799 
## F-statistic: 78.68 on 6 and 1193 DF,  p-value: < 2.2e-16

Several estimates have changed here, because all coefficients that involve diet are now interpreted as differences from the fat group (rather than differences from the control group as in our original model).

The \(\hat{b}_{age}=0.71\) coefficient is now interpreted as the slope of age for the high-fat group.

The final coefficient, \(\hat{b}_{s \times a}=-0.126\) is interpreted as the difference in age slopes between the high-starch mice and high-fat mice.

This estimate matches the slope of age for the high-fat group estimated by the emtrends() function:

# compare to emmeans::emtrends
emtrends(m_dietage, specs="diet", var="age")
##  diet    age.trend    SE   df lower.CL upper.CL
##  control     0.293 0.055 1193    0.185    0.401
##  fat         0.712 0.055 1193    0.604    0.820
##  starch      0.586 0.055 1193    0.479    0.694
## 
## Confidence level used: 0.95

The fit of the model with fat as the reference group is the same as the model with control as the reference group: