OARC Statistical Methods and Data Analytics
Regression explains variation in an outcome (dependent variable) using one or more predictors (independent variables).
A regression coefficient quantifies the strength of the relationship between outcome and predictor (effect size), but in units specific to the predictor, so comparison to other predictors’ coefficients is not straightforward.
Standardized effect sizes, like Cohen’s \(d\) and \(R^2\), are unit-less and thus permit direct comparison between predictors measured in different units, even across different studies.
Multilevel regression models are commonly used to model data that are hierarchically structured, in which observations collected within aggregate units are not independent.
However, no set of standardized effect size measures has yet been widely adopted by multilevel modelers.
This workshop discusses a new set of \(R^2\) measures introduced by Rights and Sterba in their 2019 article “Quantifying Explained Variance in Multilevel Models: An Integrative Framework for Defining R-Squared Measures”.
The authors have themselves said that using their new \(R^2\) measures requires a solid understanding of multilevel models (Shaw et al., 2023).
To this end, we will discuss:
- multilevel models and how their outcome variation can be decomposed and explained
- the r2mlm package, which implements the Rights and Sterba \(R^2\) measures

How is variation measured?
Imagine we have an outcome, \(y_i\), measured for units \(i=1,...,N\).
One common way to quantify variation is to measure deviations from the mean, \(y_i - \bar{y}\), where \(\bar{y}\) is the mean of \(y_i\).
Three common measures of variation based on these deviations:
- the sum of squared deviations, \(\sum_{i=1}^N{(y_i - \bar{y})^2}\)
- the variance, (approximately) the average squared deviation
- the standard deviation, the square root of the variance
Linear regression uses a set of predictors to explain the variation in the outcome.
Below we model outcome \(y_i\) to be predicted by a single predictor, \(x_i\):
\[y_i = \beta_0 + \beta_1x_i + \epsilon_i\]
where
- \(\beta_0\) is the intercept, the expected value of \(y_i\) when \(x_i = 0\)
- \(\beta_1\) is the slope, the expected change in \(y_i\) for a one-unit increase in \(x_i\)
- \(\epsilon_i\) is the error, the deviation of \(y_i\) from its model-predicted value
The errors \(\epsilon_i\) are assumed to be normally distributed with a zero mean and variance \(\sigma_{\epsilon}^2\):
\[\epsilon_i \sim N(0, \sigma_{\epsilon}^2)\]
\(\sigma_{\epsilon}^2\), the error variance, quantifies how much \(y_i\) variation is unexplained by the model.
As relevant predictors are added to the model, \(\sigma_{\epsilon}^2\) should decrease.
After fitting a model, we can estimate the predicted outcome for each observation, \(\hat{y}_i\), based on the estimated coefficients.
In the intercept-only model, all predictions equal the intercept, which equals \(\bar{y}\) in this model:
\[\begin{align} \hat{y}_i &= \hat{\beta}_0 \\ &= \bar{y} \end{align} \]
For a model with predictor \(x_i\), the predictions lie on a line determined by \(\hat{\beta}_0\) and \(\hat{\beta}_1\):
\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i\]
The intercept-only model explains no variation in \(y_i\).
As predictors are added, the predictions \(\hat{y}_i\) on average move away from the sample mean \(\bar{y}\) and closer to their observed values \(y_i\).
The distance that predictions move away from the sample mean towards the observed values measures how much the model is explaining variation.
A steeper slope moves the predicted values farther from the sample mean, explaining more variation.
Residuals, \(\hat{\epsilon}_i\), are the difference between the observed and predicted values and estimate the model errors, \(\epsilon_i\):
\[\hat{\epsilon}_i = y_i - \hat{y}_i\]
Residuals represent the variation in the outcome that cannot be explained by the model.
Residuals shrink as explained variation increases.
Below we see how the residual variance (expressed as a standard deviation) shrinks when we add a predictor to the model.
# intercept-only model (no predictors)
# residual standard error is estimate of square root of error variance
m0 <- lm(y ~ 1, data=small)
summary(m0)
##
## Call:
## lm(formula = y ~ 1, data = small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.556 -6.556 1.444 5.694 16.444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.556 1.979 25.55 5.31e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.396 on 17 degrees of freedom
# Adding predictor x to the model reduces residual standard error
m <- lm(y ~ x, data=small)
summary(m)
##
## Call:
## lm(formula = y ~ x, data = small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.113 -7.009 1.564 5.941 11.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.1008 11.6468 2.842 0.0118 *
## x 0.3574 0.2353 1.519 0.1482
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.091 on 16 degrees of freedom
## Multiple R-squared: 0.1261, Adjusted R-squared: 0.07144
## F-statistic: 2.308 on 1 and 16 DF, p-value: 0.1482
In regression, we can partition total variation into explained and residual variation:
\[\text{Total Variation = Explained Variation + Residual Variation}\]
As we add relevant predictors to the model, explained variation grows while residual variation shrinks.
\(R^2\) expresses the proportion of \(\text{Total Variation}\) that is \(\text{Explained Variation}\):
Most \(R^2\) measures can be expressed generally as:
\[R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}\] \(R^2\) is thus interpreted as the proportion of outcome variation explained by the model predictors.
Sometimes, residual variation is easier to estimate than explained variation, so some \(R^2\) measures are defined as:
\[R^2 = 1 - \frac{\text{Residual Variation}}{\text{Total Variation}}\]
Different \(R^2\) measures use different statistics (e.g. variances, sums of squares) to represent variation in the numerator and denominator.†
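As a quick check of the general formula, we can reconstruct the \(R^2\) reported above from the two residual standard errors (a rough back-of-the-envelope calculation; each sum of squares is the residual variance estimate times its degrees of freedom):

# reconstruct R-squared from the residual standard errors reported above
ss_res <- 8.091^2 * 16   # residual SS from the model with x (residual SE on 16 df)
ss_tot <- 8.396^2 * 17   # total SS = residual SS of the intercept-only model (17 df)
1 - ss_res/ss_tot        # approximately 0.126, matching Multiple R-squared above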
Because the intercept-only model explains no variation, \(R^2=0\).
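The value below can be extracted from the fitted model object (one way to do it, using the m0 model fit above):

# R-squared of the intercept-only model
summary(m0)$r.squared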
## [1] 0
The predictor \(x_i\) explains about 12.6% of the variation in \(y_i\).
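Similarly, for the model with predictor x (using the m object fit above):

# R-squared of the model with x
summary(m)$r.squared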
## [1] 0.1260599
Linear regression \(R^2\) properties
†\(R^2\) for linear regression uses sums of squares for explained and total variation
What determines how much variance is explained by a predictor?
Holding everything else constant, the variance explained by a predictor increases as:
- the magnitude of the predictor's coefficient increases
- the variance of the predictor itself increases

The height data below illustrate these points.
## height male before_july disease
## 4 70.33877 0 1 1
## 5 62.55333 0 0 0
## 6 64.84237 0 1 0
## 7 61.73652 0 0 0
## 8 64.03135 0 1 0
## 9 64.42657 0 0 0
## 10 66.27407 0 1 0
# male and disease have similar coefficients
# before_july has nearly zero effect
summary(lm(height ~ male + before_july + disease, data=height))
##
## Call:
## lm(formula = height ~ male + before_july + disease, data = height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6396 -1.7393 0.0938 1.4779 6.9888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.2630 0.2760 232.816 < 2e-16 ***
## male 4.8323 0.3165 15.266 < 2e-16 ***
## before_july 0.1727 0.3165 0.546 0.586
## disease 4.9490 0.8077 6.128 4.81e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.238 on 196 degrees of freedom
## Multiple R-squared: 0.5802, Adjusted R-squared: 0.5738
## F-statistic: 90.3 on 3 and 196 DF, p-value: < 2.2e-16
Below we graph the individual contributions to \(R^2\) by predictor:
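One rough way to approximate each predictor's contribution (an illustration only; the original graph may use a different decomposition) is the drop in \(R^2\) when that predictor is removed from the full model:

# drop in R-squared when each predictor is removed from the full model
full <- lm(height ~ male + before_july + disease, data = height)
r2_full <- summary(full)$r.squared
sapply(c("male", "before_july", "disease"), function(v) {
  reduced <- update(full, as.formula(paste(". ~ . -", v)))
  r2_full - summary(reduced)$r.squared
})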
Multilevel data are hierarchically structured where sampled observations (level-1) can be aggregated into higher level units (level-2).
Multilevel data arise when:
Level-2 units are assumed to be independent of each other but level-1 observations within the same level-2 unit are not assumed to be independent, violating the independence assumption of single-level models.
Note: Henceforth, we will refer to multilevel data arising from clustered sampling, but all concepts discussed apply equally to repeated measures data.
We will discuss multilevel models with a simulated dataset of worker stress, in which 15 companies were sampled and 10 employees were sampled within each company.
Variables:
- company_id: company identifier
- employee_id: employee identifier
- stress: the outcome, ranges 1-100
- salary: employee's salary (level-1), in thousands of dollars
- pct_open: percent of the company's positions that are unfilled (level-2)

The dataset is called empstress.
## company_id employee_id salary pct_open stress
## 1 1 1 76.594 11.082 48
## 2 1 2 80.259 11.082 41
## 3 1 3 67.790 11.082 43
## 4 1 4 82.514 11.082 47
## 5 1 5 76.245 11.082 54
## 6 1 6 78.169 11.082 41
The multilevel model explains variation in outcome \(y_{ij}\), measured for individual \(i=1,...,n_j\) within level-2 unit \(j=1,...,J\).
Each of the \(J\) level-2 units has \(n_j\) individuals sampled from it, and the total sample size is \(N =\sum_{j=1}^J{n_j}\).
Predictors in multilevel models can be measured at level-1 or level-2.
For the empstress data, salary is a level-1 predictor (it varies across employees within a company), while pct_open is a level-2 predictor (it is constant for all employees within a company):
## company_id employee_id salary pct_open stress
## 1 1 1 76.594 11.082 48
## 2 1 2 80.259 11.082 41
## 3 1 3 67.790 11.082 43
## 4 1 4 82.514 11.082 47
## 5 1 5 76.245 11.082 54
## 6 1 6 78.169 11.082 41
For variables measured at level-1 like \(y_{ij}\), we can calculate 2 kinds of means:

the overall (grand) mean across all \(N\) observations,

\[\bar{y}_{..} = \frac{1}{N}\sum_{j=1}^J{\sum_{i=1}^{n_j}{y_{ij}}}\]

and the cluster means, one per level-2 unit,

\[\bar{y}_{.j} = \frac{1}{n_j}\sum_{i=1}^{n_j}{y_{ij}}\]
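In R, these two means can be computed directly from the empstress data (a sketch using dplyr; the column name stress_mean is ours):

library(dplyr)

# grand mean of stress across all employees
mean(empstress$stress)

# company means of stress
empstress %>%
  summarize(.by = company_id,
            stress_mean = mean(stress))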
Below we see a graph of stress scores (blue points), its overall mean (line) and company means (larger red points).
We can decompose a level-1 outcome \(y_{ij}\) into the sum of a cluster mean \(\bar{y}_{.j}\) and the deviation from that cluster mean:
\[\begin{align} y_{ij} &= \bar{y}_{.j} + (y_{ij} - \bar{y}_{.j}) \\ y_{ij} &= \text{cluster mean + deviation from cluster mean} \end{align}\]
We can estimate the variation of each of the terms in the equation above:
- the variation of the cluster means \(\bar{y}_{.j}\) around the overall mean \(\bar{y}_{..}\) (between-cluster variation)
- the variation of the deviations \(y_{ij} - \bar{y}_{.j}\) within each cluster (within-cluster variation)
The total variation of \(y_{ij}\) in a multilevel model can thus be decomposed as:
\[\text{Total Variation = Between-Cluster Variation + Within-Cluster Variation}\]
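As a quick numeric check of this decomposition with the empstress data (a sketch; the column name stress_clmean is ours, and dplyr is assumed loaded as above), the between- and within-company sums of squared deviations add up to the total:

# total, between-company, and within-company sums of squares for stress
empstress %>%
  mutate(.by = company_id, stress_clmean = mean(stress)) %>%
  summarize(total   = sum((stress - mean(stress))^2),
            between = sum((stress_clmean - mean(stress))^2),
            within  = sum((stress - stress_clmean)^2))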
Importantly, both \(\text{Between-Cluster Variation}\) and \(\text{Within-Cluster Variation}\) can be decomposed into explained and residual variation:
\[\text{Between-Cluster Variation = Explained Between-Cluster Variation + Residual Between-Cluster Variation}\]
\[\text{Within-Cluster Variation = Explained Within-Cluster Variation + Residual Within-Cluster Variation}\]
Within-cluster variation describes the variation of individuals within the same cluster (e.g. employees' stress scores varying within a company).
Level-1 predictors explain within-cluster variation in the outcome (e.g. differences in employees’ salaries within a company explain differences in their stress scores).
Between-cluster variation describes the variation of cluster means about the overall mean (e.g. company stress score means varying around the overall mean).
Level-2 predictors explain between-cluster variation in the outcome (e.g. differences in percent unfilled positions can explain differences in mean stress across companies).
The random intercept model has no predictors and thus decomposes the total variation into residual within-cluster and between-cluster variation.
\[\text{Total Variation}_{ri} = \text{Residual Between-Cluster Variation + Residual Within-Cluster Variation}\]
\[\begin{align}y_{ij} &= \beta_{0j} + \epsilon_{ij} \\ \beta_{0j} &= \gamma_{00} + u_{0j} \end{align}\]
In the random intercept model:
- \(\beta_{0j}\) is the intercept for cluster \(j\)
- \(\gamma_{00}\) is the fixed intercept, the overall mean of the outcome across clusters
- \(u_{0j}\) is the random intercept, cluster \(j\)'s deviation from the overall mean, with variance \(\tau_{00}^2\)
- \(\epsilon_{ij}\) is the level-1 error, with variance \(\sigma_{\epsilon}^2\)
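The output below comes from fitting this random intercept model with lmer() from the lme4 package; the call is a sketch (the object name m_ri is ours, since the original code is not shown):

# random intercept model: no predictors, intercepts vary by company
library(lme4)
m_ri <- lmer(stress ~ 1 + (1 | company_id), data = empstress)
summary(m_ri)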
## Linear mixed model fit by REML ['lmerMod']
## Formula: stress ~ 1 + (1 | company_id)
## Data: empstress
##
## REML criterion at convergence: 1107.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.2379 -0.6096 -0.0342 0.5618 3.3270
##
## Random effects:
## Groups Name Variance Std.Dev.
## company_id (Intercept) 116.08 10.774
## Residual 73.41 8.568
## Number of obs: 150, groups: company_id, 15
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.000 2.868 15.69
In the output above, we see:
- the estimated random intercept variance, \(\hat{\tau}_{00}^2 = 116.08\), which quantifies residual between-company variation in stress
- the estimated residual variance, \(\hat{\sigma}_{\epsilon}^2 = 73.41\), which quantifies residual within-company variation in stress
- the estimated fixed intercept, \(\hat{\gamma}_{00} = 45\), the overall mean stress score

Overall, \(stress_{ij}\) varies more between companies than within companies.
†Random effect estimates are partially pooled and shrunken towards the grand mean and will generally not equal the observed cluster means \(\bar{y}_{.j} = \frac{1}{n_j}\sum_{i=1}^{n_j}{y_{ij}}\).
The random intercept model is often used to calculate the intraclass correlation coefficient (ICC), the proportion of total variation in \(y_{ij}\) that is residual between-cluster variation.
The ICC is calculated by dividing \(\hat{\tau}_{00}^2\), the random intercept variance, by \(\hat{\tau}_{00}^2 + \hat{\sigma}_{\epsilon}^2\), the sum of the random intercept variance and residual variance, which estimates the total variance.
\[ICC = \frac{\hat{\tau}_{00}^2}{\hat{\tau}_{00}^2 + \hat{\sigma}_{\epsilon}^2}\]
\(ICC>.5\) suggests that cluster means vary more than individual scores within cluster, or \(\hat{\tau}_{00}^2 > \hat{\sigma}_{\epsilon}^2\). Here, level-2 predictors will be important to explain variation in \(y_{ij}\).
\(ICC<.5\) suggests that cluster means vary less than individual scores within cluster, or \(\hat{\tau}_{00}^2 < \hat{\sigma}_{\epsilon}^2\). Here, level-1 predictors will be important to explain variation in \(y_{ij}\).
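With the random intercept model fit earlier, the ICC can be computed from the estimated variance components (a minimal sketch, assuming the m_ri object from the sketch above):

# ICC = random intercept variance / (random intercept variance + residual variance)
vc <- as.data.frame(VarCorr(m_ri))
vc$vcov[1] / sum(vc$vcov)   # 116.08/(116.08 + 73.41), about 0.61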
Like \(y_{ij}\), a predictor measured at level-1 can be decomposed into a cluster mean and a deviation from the cluster mean†:
\[\begin{align} x_{ij} &= \bar{x}_{.j} + (x_{ij} - \bar{x}_{.j}) \\ x_{ij} &= \text{cluster mean + deviation from cluster mean} \end{align}\]
A variable superscripted with \(^w\), e.g., \(x_{ij}^w\), signifies its deviation from the cluster mean, and is often called group-mean-centered or cluster-mean-centered:
\[x_{ij}^w = (x_{ij} - \bar{x}_{.j})\]
and thus
\[x_{ij} = \bar{x}_{.j} + x_{ij}^w\]
Below we use mutate() to create both the company means of salary, salary_m, and the company-mean-centered salary, salary_c.
# to create group-mean centered salary, salary_c:
# 1. add cluster means of predictor salary (salary_m) to data
# 2. subtract salary_m from original salary to create salary_c
empstress <- empstress %>%
mutate(.by=company_id, # process data by company_id
salary_m = mean(salary)) %>%
mutate(salary_c = salary-salary_m) # separate mutate so we don't process by company
As a predictor, \(x_{ij}^w\) can only explain level-1 variation because it does not vary at level-2, as all between-cluster differences have been removed (all cluster means are 0).
The cluster means \(x_{.j}\) can themselves be added as level-2 predictors to explain level-2 variation in the outcome.
†If the cluster means of \(x_{ij}\) are all equal, which can happen with experimentally manipulated variables (e.g. using time as a predictor when everyone is measured at the same times), then \(x_{ij}\) will only explain level-1 variation, so within-cluster centering is not necessary.
Recall that
\[x_{ij} = x_{ij}^w + \bar{x}_{.j}\]
In the model in which \(x_{ij}^w\) and \(\bar{x}_{.j}\) are predictors:
\[y_{ij} = \beta_{0j} + \beta_1x_{ij}^w + \beta_2\bar{x}_{.j} + \epsilon_{ij}\]
Importantly, the above model is generally not equivalent to the model in which the original \(x_{ij}\) is the predictor:
\[y_{ij} = \beta_{0j}^* + \beta_1^*x_{ij} + \epsilon_{ij}\]
Implying that:
\[\begin{align}\beta_1^* &\ne \beta_1 \\ \beta_1^* &\ne \beta_2\end{align}\]
Instead, the coefficient for the original \(x_{ij}\), \(\beta_1^*\), will be some uninterpretable combination of the within-cluster and between-cluster effects, \(\beta_1\) and \(\beta_2\), respectively.
Compare the coefficients of the model where the group-mean-centered salary_c and the company means salary_m are entered as predictors to the model where the original salary is entered:
# salary_c and salary_m coefficients have clear within-company and between-company interpretations
m_salary_c_m <- lmer(stress ~ salary_c + salary_m +
(1 | company_id),
data=empstress)
summary(m_salary_c_m)
## Linear mixed model fit by REML ['lmerMod']
## Formula: stress ~ salary_c + salary_m + (1 | company_id)
## Data: empstress
##
## REML criterion at convergence: 1072
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.14261 -0.58580 0.02166 0.46986 3.14261
##
## Random effects:
## Groups Name Variance Std.Dev.
## company_id (Intercept) 22.71 4.766
## Residual 64.73 8.046
## Number of obs: 150, groups: company_id, 15
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 125.7257 11.9571 10.515
## salary_c 0.7414 0.1696 4.371
## salary_m -1.0053 0.1479 -6.798
##
## Correlation of Fixed Effects:
## (Intr) slry_c
## salary_c 0.000
## salary_m -0.993 0.000
# salary coefficient is uninterpretable average of coefficients above
m_salary <- lmer(stress ~ salary +
(1 | company_id),
data=empstress)
summary(m_salary)
## Linear mixed model fit by REML ['lmerMod']
## Formula: stress ~ salary + (1 | company_id)
## Data: empstress
##
## REML criterion at convergence: 1102.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.23736 -0.62578 0.00259 0.53223 3.10754
##
## Random effects:
## Groups Name Variance Std.Dev.
## company_id (Intercept) 239.80 15.485
## Residual 65.62 8.101
## Number of obs: 150, groups: company_id, 15
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 4.6013 13.3741 0.344
## salary 0.5031 0.1587 3.170
##
## Correlation of Fixed Effects:
## (Intr)
## salary -0.953
Multilevel models can include fixed and random effects.
The intercept and any regression coefficient for a level-1 predictor can be either:
- fixed, taking a single value that applies to every cluster, or
- random, varying from cluster to cluster around a fixed mean
In either case, the fixed effect, e.g. \(\gamma_{10}\), is interpreted as the mean coefficient across clusters.
# fixed slope of salary_c to explain variation in stress
m_salary_c_fixed <- lmer(stress ~ salary_c +
(1 | company_id),
data=empstress)
summary(m_salary_c_fixed)
## Linear mixed model fit by REML ['lmerMod']
## Formula: stress ~ salary_c + (1 | company_id)
## Data: empstress
##
## REML criterion at convergence: 1091.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.22676 -0.61828 -0.01386 0.49847 3.00557
##
## Random effects:
## Groups Name Variance Std.Dev.
## company_id (Intercept) 116.95 10.814
## Residual 64.73 8.046
## Number of obs: 150, groups: company_id, 15
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.0000 2.8685 15.688
## salary_c 0.7414 0.1696 4.371
##
## Correlation of Fixed Effects:
## (Intr)
## salary_c 0.000
The coefficient \(\beta_1 = \gamma_{10} \approx 0.741\) is interpreted as the fixed slope of group-mean-centered salary, salary_c, which suggests that employees with higher salaries than other employees within the same company have more stress (perhaps because they have more responsibilities).
Random effects in multilevel models are typically used to model heterogeneity (variation) around the fixed effect.
In the model below with level-1 predictors \(x_{ij}^w\) and \(z_{ij}^w\):
\[\begin{align}y_{ij} &= \beta_{0j} + \beta_{1j}x_{ij}^w + \beta_2z_{ij}^w + \epsilon_{ij} \\ \beta_{0j} &= \gamma_{00} + u_{0j} \\ \beta_{1j} &= \gamma_{10} + u_{1j} \\ \beta_2 &= \gamma_{20} \end{align}\]
Random effects like \(u_{0j}\) and \(u_{1j}\) are typically assumed to have a multivariate normal distribution, each with zero means and a covariance matrix quantifying their variances and covariances:
\[\Big(\begin{matrix} u_{0j} \\ u_{1j} \end{matrix}\Big) \sim N\Bigg( \Big(\begin{matrix} 0 \\ 0 \end{matrix}\Big), \Big(\begin{matrix} \tau_{00}^2 & \tau_{01} \\ \tau_{01} & \tau_{11}^2 \end{matrix}\Big) \Bigg)\]
The elements of the covariance matrix \(\tau_{00}^2\), \(\tau_{11}^2\), \(\tau_{01}=\tau_{10}\) are parameters estimated in the multilevel model.
# fixed and random slope of salary_c to explain variation in stress
m_salary_c_random <- lmer(stress ~ salary_c +
(1 + salary_c | company_id),
data=empstress)
summary(m_salary_c_random)
## Linear mixed model fit by REML ['lmerMod']
## Formula: stress ~ salary_c + (1 + salary_c | company_id)
## Data: empstress
##
## REML criterion at convergence: 1087.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.31751 -0.58010 -0.05927 0.47469 2.48187
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## company_id (Intercept) 117.4425 10.837
## salary_c 0.3329 0.577 0.19
## Residual 59.3291 7.703
## Number of obs: 150, groups: company_id, 15
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.0000 2.8679 15.691
## salary_c 0.8178 0.2245 3.643
##
## Correlation of Fixed Effects:
## (Intr)
## salary_c 0.126
Below we see the estimated company-specific intercepts and slopes of salary_c across companies (the output of coef() applied to the random slope model):
## $company_id
## (Intercept) salary_c
## 1 46.17442 0.48449509
## 2 27.79865 0.79443093
## 3 41.50957 0.95495935
## 4 31.01874 0.73785965
## 5 53.72239 0.71527322
## 6 51.56692 0.87882989
## 7 44.06388 0.89459952
## 8 52.49329 1.26991889
## 9 62.81527 1.60314719
## 10 39.82638 0.58183403
## 11 39.10364 0.78680564
## 12 27.18478 0.55404316
## 13 55.85613 -0.08883465
## 14 50.70981 0.86780080
## 15 51.15612 1.23142129
##
## attr(,"class")
## [1] "coef.mer"
A random slope for a level-1 predictor may provide a better fit to the data than a fixed slope alone.
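One way to check whether the random slope improves fit is a likelihood ratio test comparing the two models fit earlier (a sketch; lme4's anova() refits both models with ML before comparing):

# compare the fixed-slope and random-slope models of salary_c
anova(m_salary_c_fixed, m_salary_c_random)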
To review multilevel modeling and practice using the r2mlm package, we will use a dataset that simulates the following experiment:
Subjects are college students who were asked to study a different list of 100 words every night for 2 weeks for between 0 and 2 hours per night, at their discretion, and then asked to recall as many words as possible the next morning.
30 subjects, 14 measurements each
Variables:
- subj_id: subject ID
- words: the outcome, number of words recalled each morning
- sleep: the number of hours the subject slept the night before
- study: the number of hours the subject studied the night before
- gpa: subject's grade point average (level-2)

Exercises:
1. Fit a model with words as the outcome and random intercepts by subj_id.
2. Calculate the ICC of words.
3. Use dplyr::mutate() to add the following to the data set:
   - the cluster means of sleep and study (call them sleep_m and study_m)
   - the cluster-mean-centered sleep and study (call them sleep_c and study_c)
4. Add sleep_c as a predictor to the model in (1).
5. Add sleep_m to the model in (4).
Rights and Sterba (2019) decompose the model-implied total variance of the outcome as follows:
\[\widehat{var}(y_{ij}) = f_1 + f_2 + v + m + \hat{\sigma}_\epsilon^2\]

Explained variance components:
- \(f_1\): outcome variance explained by level-1 predictors via their fixed slopes
- \(f_2\): outcome variance explained by level-2 predictors via their fixed slopes
- \(v\): outcome variance explained by level-1 predictors via random slope variation and covariation
- \(m\): outcome variance explained by the cluster-specific outcome means, i.e. the random intercept variation††

Residual variance component:
- \(\hat{\sigma}_\epsilon^2\): the level-1 residual variance
This complete variance decomposition allowed Rights and Sterba (2019) to develop a general framework for defining clearly interpretable multilevel \(R^2\) measures, a framework that subsumes most multilevel \(R^2\) measures previously defined in the literature.
†In this decomposition, all level-1 predictors are assumed to be cluster-mean-centered.
††Up until now we have interpreted random intercept variation to be residual or unexplained between-cluster variation, but the r2mlm package interprets it as variation explained by cluster IDs.
Within this framework, we can also fully decompose the model-implied within-cluster variance decomposition:
\[\widehat{var}(y_{ij}-\bar{y}_{.j}) = f_1 + v + \hat{\sigma}_\epsilon^2\]
Explained variance components:
- \(f_1\): within-cluster variance explained by level-1 predictors via their fixed slopes
- \(v\): within-cluster variance explained by level-1 predictors via random slope variation and covariation

Residual variance component:
- \(\hat{\sigma}_\epsilon^2\): the level-1 residual variance
Below we see graphs that show how the fixed slope and random slope explain increasingly more within-company variation, reflected by the decreasing residuals.
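Once a model with cluster-mean-centered level-1 predictors has been fit, the Rights and Sterba (2019) \(R^2\) measures themselves can be computed with the r2mlm package (a minimal sketch, using the random-slope model of salary_c fit earlier):

# Rights & Sterba (2019) R-squared measures for the random slope model;
# r2mlm() returns the full set of R-squared measures and can also draw
# bar charts of the variance decomposition
library(r2mlm)
r2mlm(m_salary_c_random)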