How do I interpret the coefficients of an effect-coded variable involved in an interaction in a regression model?
Table of Contents:
- Categorical predictors in regression models
- What is effect coding?
- Effect coding for a binary predictor
- Effect coding for categorical predictors with 3 or more levels
- Effect-coded predictors interacted with a continuous covariate
- Interaction of 2 effect-coded categorical predictors
- Summary
Categorical predictors in regression models
Categorical or nominal variables that are to be included as predictors in regression models must first be transformed into a set of variables (henceforth referred to as “regressors”), where each variable typically codes for membership in a single category. The resulting set of regressors is then entered into the model in the same way as a quantitative predictor variable. Many different coding schemes can be used for these regressors that will produce models with equivalent fit, but the coefficients will have different interpretations. The most commonly used coding scheme for regression is dummy coding (also known as reference or indicator coding), in which a \(0/1\) variable is created for each of the \(k\) categories, where a \(1\) represents membership in that category and a \(0\) represents membership in a different category. Only \(k-1\) of these regressors are then entered into the regression model (because of linear dependencies), and the category represented by the omitted variable serves as the reference group. The intercept in a model using dummy-coded variables is an estimate of the mean of the dependent variable in the reference group, and the regression coefficients for the regressors represent mean deviations of each category from this reference group.
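For reference, here is a minimal Stata sketch of dummy coding; the outcome \(y\) and predictor \(group\) are hypothetical placeholders, and \(group\) must be a numeric categorical variable (encode a string variable first):

// Stata's factor-variable notation applies dummy (reference) coding by
// default: i.group enters k-1 0/1 regressors, omitting the first level
// of group as the reference group.
regress y i.group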
What is effect coding?
Effect coding is an alternative coding scheme that produces coefficient estimates with an interpretation analogous to the effects estimated by analysis of variance (ANOVA). One popular reason to use effect coding is that, in a model where an effect-coded categorical predictor is involved in an interaction, the coefficients for its regressors are interpreted as main effects. As mentioned above, the overall model fit and model predictions will be the same whether dummy coding or effect coding is used.
We again need to select one level of the categorical predictor whose regressor will be omitted from the regression model; this level is the contrasting group, and membership in it is coded \(-1\) on all regressors. Membership in the category represented by a given regressor is coded \(1\), and membership in any category other than the target or contrasting group is coded \(0\).
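Here is a minimal Stata sketch of building such effect codes by hand; the variables \(y\) and \(group\) (a string variable with hypothetical levels a, b, and c) are placeholders, with c as the contrasting group:

// E1 targets category a, E2 targets category b; c is coded -1 on both.
gen E1 = cond(group=="a", 1, cond(group=="c", -1, 0))
gen E2 = cond(group=="b", 1, cond(group=="c", -1, 0))
regress y E1 E2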
Effect coding for a binary predictor
With just 2 levels, we can effect code a binary predictor using a single regressor with values \(-1\) and \(1\).
Imagine we have a simple data set which looks at the effects of 2 different study methods, \(recitation\) and \(writing\), on the recall of a list of 20 words. The dependent variable is the number of words recalled.
Here are the data, where we have 5 subjects in the \(recitation\) condition and 3 subjects in the \(writing\) condition:
Method | Words |
---|---|
recitation | 9 |
recitation | 12 |
recitation | 6 |
recitation | 6 |
recitation | 7 |
writing | 16 |
writing | 9 |
writing | 11 |
The overall mean of \(Words\) is \(\overline{Words}=9.5\). The means of \(Words\) by \(Method\) are \(\overline{Words}_{rec}=8\) and \(\overline{Words}_{wri}=12\).
We only need a single regressor to enter the \(Method\) categorical predictor into a regression model. We will choose \(writing\) as the contrasting group, so observations in this group will be assigned a \(-1\) on the regressor, while those in the \(recitation\) group will be assigned a \(1\). We will call this new effect-coded variable \(M1\):
Method | M1 | Words |
---|---|---|
recitation | 1 | 9 |
recitation | 1 | 12 |
recitation | 1 | 6 |
recitation | 1 | 6 |
recitation | 1 | 7 |
writing | -1 | 16 |
writing | -1 | 9 |
writing | -1 | 11 |
The regression model equation of \(Words\) on \(M1\) would be:
\[Words_i = b_0 + b_1M1_i + \epsilon_i\]
where \(Words_i\) is the number of words recalled, \(M1_i\) is the regressor value, and \(\epsilon_i\) is the error term for person \(i\). The regression coefficients are \(b_0\) and \(b_1\).
By plugging in the values \(1\) and \(-1\), we can get the predicted values for the two \(Method\) groups.
First, for the \(recitation\) group, \(M1=1\), so the regression equation becomes (the \(\hat{}\) symbol denotes a coefficient estimated from our sample data):
\[ \begin{aligned} \widehat{Words}_{rec} & = \hat{b}_0 + \hat{b}_1(1) \\ & = \hat{b}_0 + \hat{b}_1 \\ \end{aligned} \]
The predicted number of words recalled for the \(recitation\) group is thus the intercept, \(\hat{b}_0\), plus the coefficient for \(M1\), \(\hat{b}_1\).
For the writing group, \(M1=-1\), so the regression equation becomes:
\[ \begin{aligned} \widehat{Words}_{wri} & = \hat{b}_0 + \hat{b}_1(-1) \\ & = \hat{b}_0 - \hat{b}_1 \\ \end{aligned} \]
The predicted number of words recalled for the \(writing\) group is the intercept minus \(\hat{b}_1\).
Interpretation of intercept \(\hat{b}_0\). We can understand how to interpret the intercept by averaging the predictions for the two groups, which will isolate \(\hat{b}_0\):
\[ \begin{aligned} (\widehat{Words}_{rec} + \widehat{Words}_{wri})/2 & = ((\hat{b}_0 + \hat{b}_1) + (\hat{b}_0 - \hat{b}_1))/2 \\ & = 2\hat{b}_0/2 \\ & = \hat{b}_0 \end{aligned} \]
The intercept \(\hat{b}_0\) thus represents the average of the predicted means for the two \(Method\)s. Sometimes this average is referred to as the grand mean, but note that this grand mean is unweighted, which we will explain a bit later.
Interpretation of coefficient \(\hat{b}_1\). Now that we know the interpretation of the intercept \(\hat{b}_0\), we can easily understand \(\hat{b}_1\). The equation for the predicted mean of the \(recitation\) group is \(\widehat{Words}_{rec} = \hat{b}_0 + \hat{b}_1\). Because \(\hat{b}_0\) represents the estimate of the grand mean, \(\hat{b}_1\) must represent the deviation of the \(recitation\) group mean from the grand mean.
The equation for the predicted mean of the \(writing\) group is \(\widehat{Words}_{wri} = \hat{b}_0 - \hat{b}_1\). Therefore, \(-\hat{b}_1\) is the deviation of the \(writing\) group mean from the grand mean.
Regression model. Let’s take a look at the estimates of a regression of \(Words\) on \(M1\):
------------------------------------------------------------------------------
       Words |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          M1 |         -2   1.074968    -1.86   0.112    -4.630351    .6303512
       _cons |         10   1.074968     9.30   0.000     7.369649    12.63035
------------------------------------------------------------------------------
Above we can see that the estimate of the intercept (labeled “_cons”) is \(\hat{b}_0 = 10\). This is the unweighted grand mean, which is simply the mean of the group means, \((\overline{Words}_{rec} + \overline{Words}_{wri})/2 = (8 + 12)/2 = 10\). This unweighted grand mean treats the groups as if they were balanced. The overall mean of all words recalled that we manually calculated above, \(\overline{Words}=9.5\), is a weighted mean, adjusting for the relative sample sizes of the two groups: \((5\overline{Words}_{rec} + 3\overline{Words}_{wri})/8 = (40 + 36)/8 = 9.5\). Because the grand mean estimate \(\hat{b}_0\) is unweighted, one should not interpret the intercept as the population mean unless the groups are balanced in the population.
The coefficient for \(M1\), \(\hat{b}_1=-2\), represents the deviation of the \(recitation\) group, \(M1=1\), from the grand mean, and \(-\hat{b}_1=2\) is the deviation of the \(writing\) group, \(M1=-1\), from the grand mean. The predicted score for the \(recitation\) group is \(\hat{b}_0 + \hat{b}_1 = 10 + (-2) = 8\) and the predicted score for the \(writing\) group is \(\hat{b}_0 - \hat{b}_1 = 10 - (-2) = 12\), which match our observed means!
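The following Stata sketch reproduces this example end to end; the variable names are our own, but the data and estimates are exactly those shown above:

// Enter the data, build the effect code, fit the model, and recover
// the two group means from the coefficients.
clear
input str10 Method Words
"recitation" 9
"recitation" 12
"recitation" 6
"recitation" 6
"recitation" 7
"writing" 16
"writing" 9
"writing" 11
end
gen M1 = cond(Method=="recitation", 1, -1)
regress Words M1
display _b[_cons] + _b[M1]  // recitation: 10 + (-2) = 8
display _b[_cons] - _b[M1]  // writing:    10 - (-2) = 12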
Effect coding for categorical predictors with 3 or more levels
Now let’s look at how to effect code a categorical predictor with 3 levels and how to interpret its regression coefficients. The extension from a binary predictor is quite straightforward.
Imagine we have an additional method to memorizing a list of 20 words, which we’ll call the \(imagery\) condition. Here is some sample data:
Method | Words |
---|---|
recitation | 9 |
recitation | 12 |
recitation | 6 |
recitation | 6 |
recitation | 7 |
imagery | 13 |
imagery | 16 |
imagery | 9 |
imagery | 14 |
writing | 16 |
writing | 9 |
writing | 11 |
The means for each of the 3 groups are \(\overline{Words}_{rec}=8\), \(\overline{Words}_{ima}=13\) and \(\overline{Words}_{wri}=12\). The unweighted grand mean is \((8 + 13 + 12)/3 = 11\).
Now that we have 3 conditions, we will need 2 (\(k-1\)) effect-coded regressors. We will again use \(writing\) as the contrasting group.
Remember that each effect-coded variable has the value \(1\) for membership in the target category, \(0\) for membership in any category other than the target or contrasting category, and \(-1\) for membership in the contrasting category.
We will create one regressor named \(M1\) for membership to \(recitation\) and another named \(M2\) for \(imagery\):
Method | M1 | M2 | Words |
---|---|---|---|
recitation | 1 | 0 | 9 |
recitation | 1 | 0 | 12 |
recitation | 1 | 0 | 6 |
recitation | 1 | 0 | 6 |
recitation | 1 | 0 | 7 |
imagery | 0 | 1 | 13 |
imagery | 0 | 1 | 16 |
imagery | 0 | 1 | 9 |
imagery | 0 | 1 | 14 |
writing | -1 | -1 | 16 |
writing | -1 | -1 | 9 |
writing | -1 | -1 | 11 |
The regression model equation is:
\[Words_i = b_0 + b_1M1_i + b_2M2_i + \epsilon_i\]
We can again get model predicted values for each group by substituting values of \(1\), \(0\), and \(-1\) for \(M1\) and \(M2\).
The predicted value for \(recitation\), where \(M1=1\) and \(M2=0\):
\[ \begin{aligned} \widehat{Words}_{rec} & = \hat{b}_0 + \hat{b}_1(1) + \hat{b}_2(0)\\ & = \hat{b}_0 + \hat{b}_1 \\ \end{aligned} \]
The predicted value for \(imagery\), where \(M1=0\) and \(M2=1\):
\[ \begin{aligned} \widehat{Words}_{ima} & = \hat{b}_0 + \hat{b}_1(0) + \hat{b}_2(1)\\ & = \hat{b}_0 + \hat{b}_2 \\ \end{aligned} \]
The predicted value for \(writing\), where \(M1=-1\) and \(M2=-1\):
\[ \begin{aligned} \widehat{Words}_{wri} & = \hat{b}_0 + \hat{b}_1(-1) + \hat{b}_2(-1)\\ & = \hat{b}_0 - \hat{b}_1 - \hat{b}_2 \\ \end{aligned} \]
Interpretation of intercept \(\hat{b}_0\). The intercept estimate \(\hat{b}_0\) is again interpreted as the grand (unweighted) mean of the 3 groups:
\[ \begin{aligned} (\widehat{Words}_{rec} + \widehat{Words}_{ima} + \widehat{Words}_{wri})/3 & = ((\hat{b}_0 + \hat{b}_1) + (\hat{b}_0 + \hat{b}_2) + (\hat{b}_0 - \hat{b}_1 - \hat{b}_2))/3 \\ & = 3\hat{b}_0/3 \\ & = \hat{b}_0 \end{aligned} \]
Interpretation of coefficients \(\hat{b}_1\) and \(\hat{b}_2\). It is easy to see from the prediction equation for \(recitation\), \(\widehat{Words}_{rec} = \hat{b}_0 + \hat{b}_1\), that \(\hat{b}_1\) is the deviation of the \(recitation\) group from the grand mean. Similarly, from \(\widehat{Words}_{ima} = \hat{b}_0 + \hat{b}_2\), we see that \(\hat{b}_2\) is the deviation of the \(imagery\) group from the grand mean.
Additionally, we see from \(\widehat{Words}_{wri} = \hat{b}_0 - \hat{b}_1 - \hat{b}_2\) that the sum \(\hat{b}_1 + \hat{b}_2\) is the negative deviation of the \(writing\) group from the grand mean.
So in summary, regression coefficients for effect-coded regressors represent deviations of a particular category from the grand mean, and the sum of the regression coefficients for all effect-coded regressors is the negative deviation of the contrasting (omitted) group from the grand mean.
Regression model. Now let’s interpret the estimates of a regression of \(Words\) on \(M1\) and \(M2\):
------------------------------------------------------------------------------
       Words |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          M1 |         -3   1.154166    -2.60   0.029    -5.610905   -.3890955
          M2 |          2   1.215131     1.65   0.134    -.7488172    4.748817
       _cons |         11   .8685165    12.67   0.000     9.035279    12.96472
------------------------------------------------------------------------------
The unweighted grand mean of words recalled across all groups is \(\hat{b}_0 = 11\). The deviation of the \(recitation\) group from the grand mean is \(\hat{b}_1 = -3\), so the predicted value for \(recitation\) is \(11 - 3 = 8\). The \(recitation\) group would thus be interpreted as significantly different from the mean of all groups, as \(p < 0.05\). The deviation of the \(imagery\) group from the grand mean is \(\hat{b}_2 = 2\), so the predicted value is \(11 + 2 = 13\). The negative deviation of the contrasting group \(writing\) is \(\hat{b}_1 + \hat{b}_2 = -3 + 2 = -1\), so the predicted value is \(11 - (-1) = 12\). Each of these predicted values matches the observed means we calculated above.
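A short Stata sketch, assuming \(M1\) and \(M2\) have been constructed as in the table above, that recovers these three means from the fitted coefficients:

regress Words M1 M2
display _b[_cons] + _b[M1]           // recitation: 11 - 3 = 8
display _b[_cons] + _b[M2]           // imagery:    11 + 2 = 13
display _b[_cons] - _b[M1] - _b[M2]  // writing:    11 + 1 = 12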
More than 3 categories. Extending these methods to more than 3 categories is straightforward. For a categorical predictor with \(k\) levels, we will need to enter \(k-1\) effect-coded regressors into the regression model. The intercept estimate \(\hat{b}_0\) is interpreted as the unweighted grand mean of all groups. Each regression coefficient, \(\hat{b}_1\), \(\hat{b}_2\), \(…\), \(\hat{b}_{k-1}\), is interpreted as a deviation from the grand mean, and the sum of all the regression coefficients, \(\sum_{i=1}^{k-1}\hat{b}_i\), is the negative deviation of the contrasting group from the grand mean.
The reader is encouraged to confirm these interpretations with the following data and regression model:
Method | M1 | M2 | M3 | Words |
---|---|---|---|---|
recitation | 1 | 0 | 0 | 9 |
recitation | 1 | 0 | 0 | 12 |
recitation | 1 | 0 | 0 | 6 |
recitation | 1 | 0 | 0 | 6 |
recitation | 1 | 0 | 0 | 7 |
imagery | 0 | 1 | 0 | 13 |
imagery | 0 | 1 | 0 | 16 |
imagery | 0 | 1 | 0 | 9 |
imagery | 0 | 1 | 0 | 14 |
distraction | 0 | 0 | 1 | 4 |
distraction | 0 | 0 | 1 | 2 |
writing | -1 | -1 | -1 | 16 |
writing | -1 | -1 | -1 | 9 |
writing | -1 | -1 | -1 | 11 |
------------------------------------------------------------------------------
       Words |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          M1 |         -1   1.200694    -0.83   0.424    -3.675313    1.675313
          M2 |          4   1.281275     3.12   0.011      1.14514     6.85486
          M3 |         -6    1.62532    -3.69   0.004     -9.62144    -2.37856
       _cons |          9    .801041    11.24   0.000     7.215169    10.78483
------------------------------------------------------------------------------
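As a starting point, here is a Stata sketch for this exercise, assuming a string variable Method holding the four condition labels; note that the grand mean is now \((8 + 13 + 3 + 12)/4 = 9\), matching _cons:

gen M1 = cond(Method=="recitation", 1, cond(Method=="writing", -1, 0))
gen M2 = cond(Method=="imagery", 1, cond(Method=="writing", -1, 0))
gen M3 = cond(Method=="distraction", 1, cond(Method=="writing", -1, 0))
regress Words M1 M2 M3
display _b[_cons] - _b[M1] - _b[M2] - _b[M3]  // writing: 9 + 1 - 4 + 6 = 12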
Effect-coded predictors interacted with a continuous covariate
One of the advantages of using effect coding in interaction models is that some of the coefficients for lower-order effects are interpreted as main (averaged) effects, instead of as simple (specific) effects (as they are when dummy/reference coding is used). Let’s take a look.
Imagine we have augmented our word recall data set with the number of hours (\(Hours\)) each subject spent studying the word list with a particular method:
Method | M1 | M2 | Hours | Words |
---|---|---|---|---|
recitation | 1 | 0 | 2 | 9 |
recitation | 1 | 0 | 3.5 | 12 |
recitation | 1 | 0 | 0 | 6 |
recitation | 1 | 0 | 1 | 6 |
recitation | 1 | 0 | 1 | 7 |
imagery | 0 | 1 | 6 | 13 |
imagery | 0 | 1 | 13 | 16 |
imagery | 0 | 1 | 5 | 9 |
imagery | 0 | 1 | 8 | 14 |
writing | -1 | -1 | 5 | 16 |
writing | -1 | -1 | 2 | 9 |
writing | -1 | -1 | 3.5 | 11 |
If we believe that the effect of the number of hours on words recalled varies by the method used, we should model an interaction between \(Method\) and \(Hours\). Interaction variables are formed by multiplying the component variables together. Here \(Method\) is represented by 2 variables, \(M1\) and \(M2\), so we will multiply each of these by \(Hours\) to create 2 interaction variables, \(M1Hours\) and \(M2Hours\) (see the sketch after the table below):
Method | M1 | M2 | Hours | M1Hours | M2Hours | Words |
---|---|---|---|---|---|---|
recitation | 1 | 0 | 2 | 2 | 0 | 9 |
recitation | 1 | 0 | 3.5 | 3.5 | 0 | 12 |
recitation | 1 | 0 | 0 | 0 | 0 | 6 |
recitation | 1 | 0 | 1 | 1 | 0 | 6 |
recitation | 1 | 0 | 1 | 1 | 0 | 7 |
imagery | 0 | 1 | 6 | 0 | 6 | 13 |
imagery | 0 | 1 | 13 | 0 | 13 | 16 |
imagery | 0 | 1 | 5 | 0 | 5 | 9 |
imagery | 0 | 1 | 8 | 0 | 8 | 14 |
writing | -1 | -1 | 5 | -5 | -5 | 16 |
writing | -1 | -1 | 2 | -2 | -2 | 9 |
writing | -1 | -1 | 3.5 | -3.5 | -3.5 | 11 |
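A minimal Stata sketch of this step, assuming \(M1\), \(M2\), and a variable Hours exist as in the table above:

// Build the interaction regressors and fit the interaction model.
gen M1Hours = M1*Hours
gen M2Hours = M2*Hours
regress Words M1 M2 Hours M1Hours M2Hours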
The model regression equation is:
\[Words_i = b_0 + b_1M1_i + b_2M2_i + b_3Hours_i + b_4M1_iHours_i + b_5M2_iHours_i + \epsilon_i\]
Let’s look at the predicted number of words recalled for each \(Method\) when the number of hours studied is \(Hours=h\).
For \(recitation\) (\(M1=1\), \(M2=0\)):
\[ \begin{aligned} \widehat{Words}_{rec} & = \hat{b}_0 + \hat{b}_1(1) + \hat{b}_2(0) + \hat{b}_3(h) + \hat{b}_4(1)(h) + \hat{b}_5(0)(h)\\ & = \hat{b}_0 + \hat{b}_1 + \hat{b}_3h + \hat{b}_4h \\ \end{aligned} \]
For \(imagery\) (\(M1=0\), \(M2=1\)):
\[ \begin{aligned} \widehat{Words}_{ima} & = \hat{b}_0 + \hat{b}_1(0) + \hat{b}_2(1) + \hat{b}_3(h) + \hat{b}_4(0)(h) + \hat{b}_5(1)(h)\\ & = \hat{b}_0 + \hat{b}_2 + \hat{b}_3h + \hat{b}_5h \\ \end{aligned} \]
For \(writing\) (\(M1=-1\), \(M2=-1\)):
\[ \begin{aligned} \widehat{Words}_{wri} & = \hat{b}_0 + \hat{b}_1(-1) + \hat{b}_2(-1) + \hat{b}_3(h) + \hat{b}_4(-1)(h) + \hat{b}_5(-1)(h)\\ & = \hat{b}_0 - \hat{b}_1 - \hat{b}_2 + \hat{b}_3h - \hat{b}_4h - \hat{b}_5h \\ \end{aligned} \]
Interpretation of intercept \(\hat{b}_0\). Let’s take the average of the 3 predicted values above (assuming \(Hours=h\) for each prediction):
\[ \begin{aligned} (\widehat{Words}_{rec} + \widehat{Words}_{ima} + \widehat{Words}_{wri})/3 & = ((\hat{b}_0 + \hat{b}_1 + \hat{b}_3h + \hat{b}_4h) + (\hat{b}_0 + \hat{b}_2 + \hat{b}_3h + \hat{b}_5h) + (\hat{b}_0 - \hat{b}_1 - \hat{b}_2 + \hat{b}_3h - \hat{b}_4h - \hat{b}_5h))/3 \\ & = (3\hat{b}_0 + 3\hat{b}_3h)/3 \\ & = \hat{b}_0 + \hat{b}_3h \end{aligned} \]
If we set \(Hours\) to zero, we will isolate the intercept \(\hat{b}_0\):
\[ \begin{aligned} (\widehat{Words}_{rec|h=0} + \widehat{Words}_{ima|h=0} + \widehat{Words}_{wri|h=0})/3 & = \hat{b}_0 + \hat{b}_3(0) \\ & = \hat{b}_0 \end{aligned} \]
The intercept estimate \(\hat{b}_0\) is thus interpreted as the unweighted grand mean of the 3 methods when \(Hours=0\).
Interpretation of effect-coded regressor coefficients \(\hat{b}_1\) and \(\hat{b}_2\). Let’s look at the prediction equations to interpret \(\hat{b}_1\) and \(\hat{b}_2\). The prediction equation for \(recitation\) is \(\widehat{Words}_{rec} = \hat{b}_0 + \hat{b}_1 + \hat{b}_3h + \hat{b}_4h\). If we set \(Hours=0\), then the equation becomes \(\widehat{Words}_{rec|h=0} = \hat{b}_0 + \hat{b}_1 + \hat{b}_3(0) + \hat{b}_4(0) = \hat{b}_0 + \hat{b}_1\). Therefore, \(\hat{b}_1\) is the deviation of the \(recitation\) group from the grand mean of all groups when \(Hours=0\).
Similarly, for \(imagery\), if we set \(Hours=0\), then the prediction equation is \(\widehat{Words}_{ima|h=0} = \hat{b}_0 + \hat{b}_2 + \hat{b}_3(0) + \hat{b}_5(0) = \hat{b}_0 + \hat{b}_2\). The coefficient \(\hat{b}_2\) is the deviation of the \(imagery\) group from the grand mean of all groups when \(Hours=0\).
The sum of the coefficients, \(\hat{b}_1 + \hat{b}_2\), is also the negative deviation of the contrasting group \(writing\) from the grand mean when \(Hours=0\), as \(\widehat{Words}_{wri|h=0} = \hat{b}_0 - \hat{b}_1 - \hat{b}_2 + \hat{b}_3(0) - \hat{b}_4(0) - \hat{b}_5(0) = \hat{b}_0 - \hat{b}_1 - \hat{b}_2\).
Interpretation of coefficient for continuous covariate \(\hat{b}_3\).
The effect of a continuous covariate, often called a “slope”, is expressed as the change in the expected value of the outcome per unit increase in the covariate. Let’s look at how a one-unit increase in hours, from \(Hours=h\) to \(Hours=h+1\), changes the predicted outcome in each of the \(Method\) groups.
Starting with \(recitation\), the prediction when \(Hours=h\) is:
\[\widehat{Words}_{rec|Hours=h} = \hat{b}_0 + \hat{b}_1 + \hat{b}_3h + \hat{b}_4h\]
Now substituting \(h+1\) for \(h\), we get the prediction for the \(recitation\) when \(Hours=h+1\):
\[\widehat{Words}_{rec|Hours=h+1} = \hat{b}_0 + \hat{b}_1 + \hat{b}_3(h+1) + \hat{b}_4(h+1)\]
The change in the outcome can be calculated by taking the difference between these two predictions:
\[ \begin{aligned} \widehat{Words}_{rec|Hours=h+1}-\widehat{Words}_{rec|Hours=h} & = (\hat{b}_0 + \hat{b}_1 + \hat{b}_3(h+1) + \hat{b}_4(h+1)) - (\hat{b}_0 + \hat{b}_1 + \hat{b}_3h + \hat{b}_4h) \\ & = \hat{b}_3(h + 1 - h) + \hat{b}_4(h + 1 - h) \\ & = \hat{b}_3 + \hat{b}_4 \end{aligned} \]
The expected change in the number of \(Words\) recalled for a one-unit increase in \(Hours\) when the person is using the \(recitation\) method is thus \(\hat{b}_3 + \hat{b}_4\).
We can do the same sort of calculations for the \(imagery\) group:
\[ \begin{aligned} \widehat{Words}_{ima|Hours=h+1}-\widehat{Words}_{ima|Hours=h} & = (\hat{b}_0 + \hat{b}_2 + \hat{b}_3(h+1) + \hat{b}_5(h+1)) - (\hat{b}_0 + \hat{b}_2 + \hat{b}_3h + \hat{b}_5h) \\ & = \hat{b}_3(h + 1 - h) + \hat{b}_5(h + 1 - h) \\ & = \hat{b}_3 + \hat{b}_5 \end{aligned} \]
The expected change in the number of \(Words\) recalled for a one-unit increase in \(Hours\) when the person is using the \(imagery\) method is thus \(\hat{b}_3 + \hat{b}_5\).
And for the \(writing\) group:
\[ \begin{aligned} \widehat{Words}_{wri|Hours=h+1} - \widehat{Words}_{wri|Hours=h} & = (\hat{b}_0 - \hat{b}_1 - \hat{b}_2 + \hat{b}_3(h+1) - \hat{b}_4(h+1) - \hat{b}_5(h+1)) - (\hat{b}_0 - \hat{b}_1 - \hat{b}_2 + \hat{b}_3h - \hat{b}_4h - \hat{b}_5h) \\ & = \hat{b}_3(h + 1 - h) - \hat{b}_4(h + 1 - h) - \hat{b}_5(h + 1 - h) \\ & = \hat{b}_3 - \hat{b}_4 - \hat{b}_5 \end{aligned} \]
The expected change in the number of \(Words\) recalled for a one-unit increase in \(Hours\) when the person is using the \(writing\) method is thus \(\hat{b}_3 – \hat{b}_4 – \hat{b}_5\).
If we average the effect of a one-unit increase in \(Hours\) across the 3 groups, we can isolate \(\hat{b}_3\):
\[ \begin{aligned} ((\widehat{Words}_{rec|Hours=h+1}-\widehat{Words}_{rec|Hours=h}) + (\widehat{Words}_{ima|Hours=h+1}-\widehat{Words}_{ima|Hours=h}) + (\widehat{Words}_{wri|Hours=h+1} - \widehat{Words}_{wri|Hours=h}))/3 & = ((\hat{b}_3 + \hat{b}_4) + (\hat{b}_3 + \hat{b}_5) + (\hat{b}_3 - \hat{b}_4 - \hat{b}_5))/3 \\ & = 3\hat{b}_3/3 \\ & = \hat{b}_3 \end{aligned} \]
The coefficient \(\hat{b}_3\) is thus interpreted as the unweighted average effect of a one-unit increase in \(Hours\) across all 3 groups. In other words, \(\hat{b}_3\) is the main effect of \(Hours\).
Interpretation of interaction coefficients \(\hat{b}_4\) and \(\hat{b}_5\). Now that we know that \(\hat{b}_3\) is the average effect of \(Hours\) across groups, we can easily interpret the interaction coefficients \(\hat{b}_4\) and \(\hat{b}_5\). Looking back at the equation for the effect of a one-unit increase in \(Hours\) in the \(recitation\) group, \(\widehat{Words}_{rec|Hours=h+1}-\widehat{Words}_{rec|Hours=h} = \hat{b}_3 + \hat{b}_4\). We now see that \(\hat{b}_4\) is the deviation from the average effect of hours (\(\hat{b}_3\)) for the \(recitation\) group. In other words, it is the additional effect of hours on top of the average effect, for the \(recitation\) group.
From the equation for the effect of hours in the \(imagery\) group, \(\widehat{Words}_{ima|Hours=h+1}-\widehat{Words}_{ima|Hours=h} = \hat{b}_3 + \hat{b}_5\), we see that \(\hat{b}_5\) is the deviation from the average effect of hours for the \(imagery\) group.
The sum of the coefficients \(\hat{b}_4 + \hat{b}_5\) is the negative deviation from the average effect of hours for the \(writing\) (contrasting) group, as \(\widehat{Words}_{wri|Hours=h+1} - \widehat{Words}_{wri|Hours=h} = \hat{b}_3 - \hat{b}_4 - \hat{b}_5\).
Regression model
To confirm that our interpretations are correct, we will run regressions of \(Words\) on \(Hours\) separately in the 3 groups.
For the \(recitation\) group:
------------------------------------------------------------------------------
       Words |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       Hours |   1.857143   .2973809     6.24   0.008     .9107442    2.803541
       _cons |   5.214286   .5681453     9.18   0.003     3.406194    7.022378
------------------------------------------------------------------------------
For the \(imagery\) group:
------------------------------------------------------------------------------
       Words |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       Hours |   .7105263   .2994686     2.37   0.141    -.5779831    1.999036
       _cons |   7.315789   2.567408     2.85   0.104    -3.730877    18.36246
------------------------------------------------------------------------------
For the \(writing\) group:
------------------------------------------------------------------------------
       Words |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       Hours |   2.333333   .5773503     4.04   0.154    -5.002597    9.669264
       _cons |   3.833333   2.140872     1.79   0.324    -23.36903    31.03569
------------------------------------------------------------------------------
The average of the three intercepts is \((5.21+7.32+3.83)/3=5.45\), which should match the \(\hat{b}_0\) (intercept) estimate in the combined interaction model. The average of the 3 \(Hours\) effects is \((1.86+.71+2.33)/3 = 1.63\), which should match the \(\hat{b}_3\) (coefficient for \(Hours\)) estimate in the interaction model.
Now we will run all groups together in one regression model with interactions. The resulting estimates should be equivalent to those from the 3 separate models:
------------------------------------------------------------------------------
       Words |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          M1 |  -.2401838   1.154527    -0.21   0.842     -3.06521    2.584842
          M2 |    1.86132   1.459926     1.27   0.249    -1.710991    5.433631
       Hours |   1.633668   .2715401     6.02   0.001     .9692329    2.298102
     M1Hours |   .2234754   .3930287     0.57   0.590    -.7382313    1.185182
     M2Hours |  -.9231412   .2976688    -3.10   0.021    -1.651511   -.1947719
       _cons |    5.45447   1.018941     5.35   0.002      2.96121    7.947729
------------------------------------------------------------------------------
The estimate of \(\hat{b}_0\) is indeed 5.45, and the estimate of the average \(Hours\) effect, \(\hat{b}_3\), is 1.63.
We can recover the intercepts and \(Hours\) slopes for each \(Method\) with these model coefficients. For example, the intercept for the \(recitation\) condition should be the grand mean intercept plus the intercept deviation for \(recitation\) (\(M1\)), \(\hat{b}_0 + \hat{b}_1 = 5.45 - .24 = 5.21\), and the slope for \(recitation\) should be the grand mean slope plus the slope deviation for \(recitation\), \(\hat{b}_3 + \hat{b}_4 = 1.63 + .22 \approx 1.86\). These two estimates match (up to rounding) the estimates of the individual regression for the \(recitation\) group above.
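A Stata sketch, run immediately after fitting the interaction model above, that recovers every group-specific intercept and slope:

display _b[_cons] + _b[M1]                     // recitation intercept: 5.21
display _b[Hours] + _b[M1Hours]                // recitation slope:     1.86
display _b[_cons] + _b[M2]                     // imagery intercept:    7.32
display _b[Hours] + _b[M2Hours]                // imagery slope:        0.71
display _b[_cons] - _b[M1] - _b[M2]            // writing intercept:    3.83
display _b[Hours] - _b[M1Hours] - _b[M2Hours]  // writing slope:        2.33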
Interaction of 2 effect-coded categorical predictors
Interpreting a regression model where 2 effect-coded categorical predictors are interacted will be very similar to interpreting a 2-way ANOVA with interactions.
To demonstrate, we will add a 2-level categorical predictor to our data that codes whether the subject was sleep-deprived while studying the word list (we will not be modeling the continuous covariate \(Hours\) for simplicity):
Method | M1 | M2 | Deprived | Words |
---|---|---|---|---|
recitation | 1 | 0 | No | 9 |
recitation | 1 | 0 | No | 12 |
recitation | 1 | 0 | Yes | 6 |
recitation | 1 | 0 | Yes | 6 |
recitation | 1 | 0 | Yes | 7 |
imagery | 0 | 1 | No | 13 |
imagery | 0 | 1 | Yes | 16 |
imagery | 0 | 1 | Yes | 9 |
imagery | 0 | 1 | No | 14 |
writing | -1 | -1 | No | 16 |
writing | -1 | -1 | Yes | 9 |
writing | -1 | -1 | Yes | 11 |
We will transform the \(Deprived\) variable into a single effect-coded regressor, \(D1\), where \(Yes\) will be coded as \(1\) and \(No\), the contrasting group, will be coded as \(-1\). We then create 2 interaction terms, \(M1D1\) and \(M2D1\), by multiplying \(M1\) and \(M2\) by \(D1\) (see the sketch after the table below):
Method | M1 | M2 | Deprived | D1 | M1D1 | M2D1 | Words |
---|---|---|---|---|---|---|---|
recitation | 1 | 0 | No | -1 | -1 | 0 | 9 |
recitation | 1 | 0 | No | -1 | -1 | 0 | 12 |
recitation | 1 | 0 | Yes | 1 | 1 | 0 | 6 |
recitation | 1 | 0 | Yes | 1 | 1 | 0 | 6 |
recitation | 1 | 0 | Yes | 1 | 1 | 0 | 7 |
imagery | 0 | 1 | No | -1 | 0 | -1 | 13 |
imagery | 0 | 1 | Yes | 1 | 0 | 1 | 16 |
imagery | 0 | 1 | Yes | 1 | 0 | 1 | 9 |
imagery | 0 | 1 | No | -1 | 0 | -1 | 14 |
writing | -1 | -1 | No | -1 | 1 | 1 | 16 |
writing | -1 | -1 | Yes | 1 | -1 | -1 | 9 |
writing | -1 | -1 | Yes | 1 | -1 | -1 | 11 |
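A minimal Stata sketch of this step, assuming \(M1\) and \(M2\) exist as before and a string variable Deprived holds the labels Yes and No:

// Effect-code Deprived and form the interaction regressors.
gen D1 = cond(Deprived=="Yes", 1, -1)
gen M1D1 = M1*D1
gen M2D1 = M2*D1
regress Words M1 M2 D1 M1D1 M2D1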
For a model where we regress \(Words\) on \(Method\), \(Deprived,\) and their interaction, the model regression equation is:
\[Words_i = b_0 + b_1M1_i + b_2M2_i + b_3D1_i + b_4M1_iD1_i + b_5M2_iD1_i + \epsilon_i.\]
There are 6 possible groups formed by crossing the 3 \(Method\) groups by the 2 \(Deprived\) groups. Let’s get the prediction equation for each group.
First, \(recitation\) and \(Deprived=Yes\) (\(M1=1\), \(M2=0\), \(D1=1\)):
\[ \begin{aligned} \widehat{Words}_{rec, yes} & = \hat{b}_0 + \hat{b}_1(1) + \hat{b}_2(0) + \hat{b}_3(1) + \hat{b}_4(1)(1) + \hat{b}_5(0)(1)\\ & = \hat{b}_0 + \hat{b}_1 + \hat{b}_3 + \hat{b}_4 \\ \end{aligned} \]
Then, \(recitation\) and \(Deprived=No\) (\(M1=1\), \(M2=0\), \(D1=-1\)):
\[ \begin{aligned} \widehat{Words}_{rec, no} & = \hat{b}_0 + \hat{b}_1(1) + \hat{b}_2(0) + \hat{b}_3(-1) + \hat{b}_4(1)(-1) + \hat{b}_5(0)(-1)\\ & = \hat{b}_0 + \hat{b}_1 - \hat{b}_3 - \hat{b}_4 \\ \end{aligned} \]
Then, \(imagery\) and \(Deprived=Yes\) (\(M1=0\), \(M2=1\), \(D1=1\)):
\[ \begin{aligned} \widehat{Words}_{ima, yes} & = \hat{b}_0 + \hat{b}_1(0) + \hat{b}_2(1) + \hat{b}_3(1) + \hat{b}_4(0)(1) + \hat{b}_5(1)(1)\\ & = \hat{b}_0 + \hat{b}_2 + \hat{b}_3 + \hat{b}_5 \\ \end{aligned} \]
Then, \(imagery\) and \(Deprived=No\) (\(M1=0\), \(M2=1\), \(D1=-1\)):
\[ \begin{aligned} \widehat{Words}_{ima, no} & = \hat{b}_0 + \hat{b}_1(0) + \hat{b}_2(1) + \hat{b}_3(-1) + \hat{b}_4(0)(-1) + \hat{b}_5(1)(-1)\\ & = \hat{b}_0 + \hat{b}_2 - \hat{b}_3 - \hat{b}_5 \\ \end{aligned} \]
Then, \(writing\) and \(Deprived=Yes\) (\(M1=-1\), \(M2=-1\), \(D1=1\)):
\[ \begin{aligned} \widehat{Words}_{wri, yes} & = \hat{b}_0 + \hat{b}_1(-1) + \hat{b}_2(-1) + \hat{b}_3(1) + \hat{b}_4(-1)(1) + \hat{b}_5(-1)(1)\\ & = \hat{b}_0 - \hat{b}_1 - \hat{b}_2 + \hat{b}_3 - \hat{b}_4 - \hat{b}_5 \\ \end{aligned} \]
Finally, \(writing\) and \(Deprived=No\) (\(M1=-1\), \(M2=-1\), \(D1=-1\)):
\[ \begin{aligned} \widehat{Words}_{wri, no} & = \hat{b}_0 + \hat{b}_1(-1) + \hat{b}_2(-1) + \hat{b}_3(-1) + \hat{b}_4(-1)(-1) + \hat{b}_5(-1)(-1)\\ & = \hat{b}_0 - \hat{b}_1 - \hat{b}_2 - \hat{b}_3 + \hat{b}_4 + \hat{b}_5 \\ \end{aligned} \]
Interpretation of intercept \(\hat{b}_0\). Based on the previous examples, we should expect to interpret the intercept estimate \(\hat{b}_0\) as the unweighted grand mean across all 6 groups. That expectation will prove to be correct.
The mean of the predictions for the 6 groups is:
\[ \begin{aligned} \text{grand mean} & = (\widehat{Words}_{rec,yes} + \widehat{Words}_{rec,no} + \widehat{Words}_{ima,yes} + \widehat{Words}_{ima, no} + \widehat{Words}_{wri,yes} + \widehat{Words}_{wri,no})/6 \\ & = ((\hat{b}_0 + \hat{b}_1 + \hat{b}_3 + \hat{b}_4) + (\hat{b}_0 + \hat{b}_1 - \hat{b}_3 - \hat{b}_4) + (\hat{b}_0 + \hat{b}_2 + \hat{b}_3 + \hat{b}_5) + (\hat{b}_0 + \hat{b}_2 - \hat{b}_3 - \hat{b}_5) + (\hat{b}_0 - \hat{b}_1 - \hat{b}_2 + \hat{b}_3 - \hat{b}_4 - \hat{b}_5) + (\hat{b}_0 - \hat{b}_1 - \hat{b}_2 - \hat{b}_3 + \hat{b}_4 + \hat{b}_5))/6 \\ & = 6\hat{b}_0/6 \\ & = \hat{b}_0 \\ \end{aligned} \]
So we see that the intercept \(\hat{b}_0\) is indeed the unweighted grand mean estimate across all 6 groups.
Interpretation of lower-order coefficients \(\hat{b}_1\), \(\hat{b}_2\), and \(\hat{b}_3\)
One of the reasons to use effect coding is so that the lower-order coefficients (not for the interactions) are interpreted as averaged or main effects.
To understand how to interpret \(\hat{b}_1\), let’s take the average of the predictions for the groups (\(recitation\), \(Deprived=Yes\)) and (\(recitation\), \(Deprived=No\)):
\[ \begin{aligned} (\widehat{Words}_{rec,yes} + \widehat{Words}_{rec,no})/2 & = ((\hat{b}_0 + \hat{b}_1 + \hat{b}_3 + \hat{b}_4) + (\hat{b}_0 + \hat{b}_1 - \hat{b}_3 - \hat{b}_4))/2 \\ & = (2\hat{b}_0 + 2\hat{b}_1)/2 \\ & = \hat{b}_0 + \hat{b}_1 \\ \end{aligned} \]
We know that \(\hat{b}_0\) is the grand mean of all observations, so \(\hat{b}_1\) is a deviation from this grand mean. The two groups we just averaged together are both \(Method=recitation\), but one is \(Deprived=Yes\) and the other is \(Deprived=No\). Therefore, \(\hat{b}_1\) is the deviation of the \(recitation\) group from the grand mean, averaged across \(Deprived\) groups. In other words, \(\hat{b}_1\) is the main effect of \(recitation\).
To interpret \(\hat{b}_2\), we will average the predictions for the groups (\(imagery\), \(Deprived=Yes\)) and (\(imagery\), \(Deprived=No\)):
\[ \begin{aligned} (\widehat{Words}_{ima,yes} + \widehat{Words}_{ima,no})/2 & = ((\hat{b}_0 + \hat{b}_2 + \hat{b}_3 + \hat{b}_5) + (\hat{b}_0 + \hat{b}_2 - \hat{b}_3 - \hat{b}_5))/2 \\ & = (2\hat{b}_0 + 2\hat{b}_2)/2 \\ & = \hat{b}_0 + \hat{b}_2 \\ \end{aligned} \]
Using the same logic, we see that \(\hat{b}_2\) is the main effect of \(imagery\), or the average deviation of the \(imagery\) groups from the grand mean, averaged across \(Deprived=Yes\) and \(Deprived=No\).
The sum of the coefficients \(\hat{b}_1 + \hat{b}_2\) is also the average negative deviation of the two \(writing\) groups from the grand mean.
Finally, to interpret \(\hat{b}_3\), we will need to average the predictions from 3 groups, (\(recitation\), \(Deprived=Yes\)), (\(imagery\), \(Deprived=Yes\)) and (\(writing\), \(Deprived=Yes\)):
\[ \begin{aligned} (\widehat{Words}_{rec,yes} + \widehat{Words}_{ima,yes} + \widehat{Words}_{wri,yes})/3 & = ((\hat{b}_0 + \hat{b}_1 + \hat{b}_3 + \hat{b}_4) + (\hat{b}_0 + \hat{b}_2 + \hat{b}_3 + \hat{b}_5) + (\hat{b}_0 - \hat{b}_1 - \hat{b}_2 + \hat{b}_3 - \hat{b}_4 - \hat{b}_5))/3 \\ & = (3\hat{b}_0 + 3\hat{b}_3)/3 \\ & = \hat{b}_0 + \hat{b}_3 \\ \end{aligned} \]
We see that \(\hat{b}_3\) is the average deviation of the 3 \(Deprived=Yes\) groups from the grand mean, averaged across the 3 \(Method\) groups. In other words, \(\hat{b}_3\) is the main effect of \(Deprived=Yes\), or the average effect of being sleep deprived, averaged across the 3 methods. It is also the negative of the averaged effect of not being sleep deprived (\(Deprived=No\)).
Interpretation of interaction coefficients \(\hat{b}_4\), \(\hat{b}_5\)
The interpretation of the interaction coefficients can be gleaned by looking at the prediction equations again. First we look at the equations for groups (\(recitation\), \(Deprived=Yes\)) and (\(recitation\), \(Deprived=No\)):
\[\widehat{Words}_{rec,yes} = \hat{b}_0 + \hat{b}_1 + \hat{b}_3 + \hat{b}_4\] \[\widehat{Words}_{rec,no} = \hat{b}_0 + \hat{b}_1 - \hat{b}_3 - \hat{b}_4\]
We know that \(\hat{b}_0\) is the estimate of the grand mean, \(\hat{b}_1\) is the average deviation of the \(recitation\) groups from the grand mean, and \(\hat{b}_3\) is the average deviation of the \(Deprived=Yes\) groups from the grand mean (and \(-\hat{b}_3\) is the deviation of the \(Deprived=No\) groups). The interaction coefficient \(\hat{b}_4\) is the additional deviation from the grand mean if you are both \(recitation\) and \(Deprived=Yes\), and \(-\hat{b}_4\) is the additional deviation if you are both \(recitation\) and \(Deprived=No\).
The coefficient \(\hat{b}_5\) is interpreted similarly. Looking at the prediction equations for (\(imagery\), \(Deprived=Yes\)) and (\(imagery\), \(Deprived=No\)):
\[\widehat{Words}_{ima,yes} = \hat{b}_0 + \hat{b}_2 + \hat{b}_3 + \hat{b}_5\] \[\widehat{Words}_{ima,no} = \hat{b}_0 + \hat{b}_2 - \hat{b}_3 - \hat{b}_5\]
The intercept \(\hat{b}_0\) is the grand mean, \(\hat{b}_2\) is the main effect of being in the \(imagery\) group, and \(\hat{b}_3\) is the main effect of being in \(Deprived=Yes\) (and \(-\hat{b}_3\) is the main effect of being in the \(Deprived=No\) group). Thus, \(\hat{b}_5\) is the additional deviation for being in both \(imagery\) and \(Deprived=Yes\), while \(-\hat{b}_5\) is the additional deviation for being in both \(imagery\) and \(Deprived=No\).
The sum of the interaction coefficients, \(\hat{b}_4 + \hat{b}_5\), also has an additional interpretation for the \(writing\) group. Recall the prediction equations for (\(writing\), \(Deprived=Yes\)) and (\(writing\), \(Deprived=No\)):
\[\widehat{Words}_{wri, yes} = \hat{b}_0 - \hat{b}_1 - \hat{b}_2 + \hat{b}_3 - \hat{b}_4 - \hat{b}_5\] \[\widehat{Words}_{wri, no} = \hat{b}_0 - \hat{b}_1 - \hat{b}_2 - \hat{b}_3 + \hat{b}_4 + \hat{b}_5\]
The sum \(\hat{b}_4 + \hat{b}_5\) is the additional deviation for being in both \(writing\) and \(Deprived=No\), and \(-\hat{b}_4 - \hat{b}_5\) is the additional deviation for being in both \(writing\) and \(Deprived=Yes\).
Regression Model. Let’s see if we can use the coefficients of a fully-interacted regression model to recover the means of the 6 groups.
Here are the means:
Method, Deprived | mean Words |
---|---|
recitation, yes | 6.3333 |
recitation, no | 10.5 |
imagery, yes | 12.5 |
imagery, no | 13.5 |
writing, yes | 10 |
writing, no | 16 |
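These observed means can be computed directly with a two-way tabulation (a sketch, assuming the variable names used above):

tabulate Method Deprived, summarize(Words) means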
Now the regression of \(Words\) on \(M1\), \(M2\), \(D1\), \(M1D1\), and \(M2D1\):
------------------------------------------------------------------------------
       Words |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          M1 |  -3.055556     .93204    -3.28   0.017    -5.336175   -.7749358
          M2 |   1.527778   .9711634     1.57   0.167    -.8485736    3.904129
          D1 |  -1.861111    .704556    -2.64   0.038    -3.585098   -.1371247
        M1D1 |  -.2222222     .93204    -0.24   0.819    -2.502842    2.058397
        M2D1 |   1.361111   .9711634     1.40   0.211     -1.01524    3.737462
       _cons |   11.47222    .704556    16.28   0.000     9.748236    13.19621
------------------------------------------------------------------------------
The estimate for the group (\(recitation\), \(Deprived=Yes\)) is \(\hat{b}_0 + \hat{b}_1 + \hat{b}_3 + \hat{b}_4 = 11.47 - 3.06 - 1.86 - .22 = 6.33\), which matches the observed mean for that group. We encourage the reader to confirm the other means as well.
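A Stata sketch, run immediately after fitting the model above, that recovers all six cell means for comparison with the table of observed means:

display _b[_cons] + _b[M1] + _b[D1] + _b[M1D1]  // recitation, yes:  6.33
display _b[_cons] + _b[M1] - _b[D1] - _b[M1D1]  // recitation, no:  10.50
display _b[_cons] + _b[M2] + _b[D1] + _b[M2D1]  // imagery, yes:    12.50
display _b[_cons] + _b[M2] - _b[D1] - _b[M2D1]  // imagery, no:     13.50
display _b[_cons] - _b[M1] - _b[M2] + _b[D1] - _b[M1D1] - _b[M2D1]  // writing, yes: 10
display _b[_cons] - _b[M1] - _b[M2] - _b[D1] + _b[M1D1] + _b[M2D1]  // writing, no:  16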
Summary
When using effect coding for categorical predictors in a regression model without interactions:
- The intercept is interpreted as the estimate of the unweighted grand mean of the dependent variable across all groups comprising the categorical predictor(s) (and at zero for all other predictors in the model)
- a coefficient for an effect-coded regressor is interpreted as a deviation from the grand mean for the category coded as \(1\) on that regressor
- the sum of the coefficients for all effect-coded regressors representing a single categorical predictor is the negative deviation from the grand mean of the contrasting group (group coded by \(-1\) in the regressors)
When using effect coding for categorical predictors in regression models with interactions:
- the intercept is still interpreted the same as in a model without interactions
- if the effect-coded regressors are interacted with a continuous predictor:
- the lower-order coefficients for the effect-coded regressors are interpreted as deviations from the grand mean for that group (the group coded \(1\) on the regressor) when the interacting continuous predictor is zero
- the lower-order coefficient for the continuous predictor is interpreted as the unweighted average slope (i.e. main effect) of the continuous predictor, averaged across groups of the interacting categorical predictor
- the interaction coefficients are interpreted as deviations of a group’s slope from the average slope estimate (or, equivalently, as the change in the group’s effect per unit increase in the continuous predictor)
- if the effect-coded regressor is interacted with another effect-coded predictor:
- the lower-order coefficient of an effect-coded regressor is interpreted as the average effect (deviation from the grand mean) of the group coded as \(1\) on the regressor, averaged across all levels of the interacting categorical predictor
- the interaction coefficients are interpreted as the additional effect of being simultaneously in the two groups defined by the interaction regressor (or, alternatively, if A and B are interacted, the change in the average effect of A for group B compared to the average effect of A)