Does the drug cause heart attacks?
Control group (no drug):
| | Heart attack | No heart attack |
|---|---|---|
| Female | 1 | 19 |
| Male | 12 | 28 |
| Total | 13 | 47 |
Treatment group (took the drug):
| | Heart attack | No heart attack |
|---|---|---|
| Female | 3 | 37 |
| Male | 8 | 12 |
| Total | 11 | 49 |
Does the drug cause heart attacks?
Control group (no drug):
| | Heart attack | No heart attack |
|---|---|---|
| Low blood pressure | 1 | 19 |
| High blood pressure | 12 | 28 |
| Total | 13 | 47 |
Treatment group (took the drug):
| | Heart attack | No heart attack |
|---|---|---|
| Low blood pressure | 3 | 37 |
| High blood pressure | 8 | 12 |
| Total | 11 | 49 |
Fictitious data from “The Book of Why”
Aggregating is not always wrong and partitioning is not always right.
The answer is in the data generating process.
Gender is a confounder on a backdoor path. Controlling for a confounder is appropriate.
Blood pressure is a mediator on a causal path; controlling for it would block part of the causal effect.
In that case, aggregating the data is appropriate.
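A quick way to see the reversal is to compute the heart-attack rates implied by the gender tables above. A minimal R sketch (the group labels follow the book's control/treatment split):
# Rows: female, male; columns: heart attack (HA), no heart attack
control <- matrix(c(1, 19, 12, 28), nrow = 2, byrow = TRUE,
                  dimnames = list(c("female", "male"), c("HA", "no_HA")))
drug    <- matrix(c(3, 37, 8, 12), nrow = 2, byrow = TRUE,
                  dimnames = list(c("female", "male"), c("HA", "no_HA")))
rate <- function(m) m[, "HA"] / rowSums(m)  # per-gender heart-attack rates
rate(control)                 # female 0.05, male 0.30
rate(drug)                    # female 0.075, male 0.40 (higher within each gender)
sum(control[, "HA"]) / sum(control)  # 13/60 ~ 0.217
sum(drug[, "HA"]) / sum(drug)        # 11/60 ~ 0.183 (lower overall)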
Good control: A variable that must be adjusted for to identify the causal effect of interest (i.e., to block confounding paths).
Bad control: A variable that introduces bias when included in the model, typically because it is affected by the treatment or because conditioning on it opens a non-causal (collider) path.
Real-world outcomes have many causes.
Multiple variables often influence the outcome of interest.
When we estimate the causal effect of one exposure using a linear model, the estimated relationship is typically confounded: the effect of interest is mixed with the influence of other pathways.
“Deconfounding” means isolating the effect of the exposure from other influences.
Some bad rules:
Include every variable you have (RM: causal salad)
Rely on rough heuristics (e.g., only pre-treatment variables, nothing highly collinear)
None of these is satisfactory: not all covariates are “good controls.” Some introduce bias (“bad controls”), while others block the very causal pathways we want to estimate.
The key is causal structure, not variable count.
The building blocks of any causal model:
1) Common cause (or a Fork).
2) Mediator (or a Pipe).
3) Common effect (or a Collider).
library(ggplot2) # for plotting
# Fork (common cause): X <- Z -> Y
set.seed(200) # for reproducibility
N <- 300 # number of cases
# Z is independent
Z <- rbinom(N, size=1, prob=0.5)
# X and Y depend on Z
X <- rnorm(N, mean=2*Z-1, sd=1)
Y <- rnorm(N,mean=2*Z-1, sd=1)
d <- data.frame(X, Z, Y)
d$Z <- factor(d$Z) # make a categorical variable
p1 <- ggplot(d, aes(x = X, y = Y)) +
geom_point(size = 3) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
geom_point(size = 3) +
geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "#d73027", linewidth = 2) +
geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "#91bfdb", linewidth = 2) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
scale_color_manual(values = c("#91bfdb", "#d73027") ) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
# Pipe (mediator): X -> Z -> Y
set.seed(20) # for reproducibility
N <- 300 # number of cases
# X is independent
X <- rnorm(N, mean=0, sd=1)
# Z depends on X
Z <- rbinom(N, size=1,
prob=plogis(q=X, location=0, scale=1) )
# Y depends on Z
Y <- rnorm(N,mean=2*Z-1, sd=1)
d <- data.frame(X, Z, Y)
d$Z <- factor(d$Z) # make a categorical variable
p1 <- ggplot(d, aes(x = X, y = Y)) +
geom_point(size = 3) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
geom_point(size = 3) +
geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "#d73027", linewidth = 2) +
geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "#91bfdb", linewidth = 2) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
scale_color_manual(values = c("#91bfdb", "#d73027") ) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
# Collider (common effect): X -> Z <- Y
set.seed(1983) # for reproducibility
N <- 300 # number of cases
# X and Y are independent
X <- rnorm(N, mean=0, sd=1)
Y <- rnorm(N, mean=0, sd=1)
# Z depends on X and Y
Z <- rbinom(N,size=1,
prob=plogis(q=2*X+2*Y-2, location=0, scale=1))
d <- data.frame(X, Z, Y)
d$Z <- factor(d$Z) # make a categorical variable
p1 <- ggplot(d, aes(x = X, y = Y)) +
geom_point(size = 3) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
geom_point(size = 3) +
geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "#d73027", linewidth = 2) +
geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "#91bfdb", linewidth = 2) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
scale_color_manual(values = c("#91bfdb", "#d73027") ) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
Controlling for a descendant of a variable is equivalent to “partially” controlling for that variable.
For example, Daily Exercise → Physical Fitness → Mental Health.
Body Mass Index (BMI) is a descendant of Physical Fitness.
| Path | Type | Not conditioning on Z | Conditioning on Z |
|---|---|---|---|
| Common Causes (Confounders): X ← Z → Y | Non-Causal (Spurious) | Open | Closed |
| Mediators: X → Z → Y | Causal | Open | Closed |
| Common Effects (Colliders): X → Z ← Y | Non-Causal (Spurious) | Closed | Open |
Controlling for the effects of a variable (Descendant) is equivalent to partially controlling for that variable
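The table can also be checked numerically. Below is a minimal sketch (a fresh, illustrative simulation with lowercase names, so it does not overwrite the objects created above) comparing the slope on x in lm(y ~ x) with and without conditioning on z for each structure:
set.seed(1) # illustrative seed
n <- 1e4
# Fork: x <- z -> y (no causal effect of x on y)
z <- rnorm(n); x <- z + rnorm(n); y <- z + rnorm(n)
coef(lm(y ~ x))["x"]       # non-causal path open: clearly non-zero
coef(lm(y ~ x + z))["x"]   # conditioning on z closes it: roughly 0
# Pipe: x -> z -> y (the causal effect runs through z)
x <- rnorm(n); z <- x + rnorm(n); y <- z + rnorm(n)
coef(lm(y ~ x))["x"]       # causal path open: total effect, roughly 1
coef(lm(y ~ x + z))["x"]   # conditioning on z closes it: roughly 0
# Collider: x -> z <- y (no causal effect of x on y)
x <- rnorm(n); y <- rnorm(n); z <- x + y + rnorm(n)
coef(lm(y ~ x))["x"]       # path closed: roughly 0
coef(lm(y ~ x + z))["x"]   # conditioning on z opens it: clearly non-zero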
The target of our analysis is the Average Causal Effect (ACE).
The ACE is the expected increase in \(Y\) in response to a unit increase in \(X\) under an intervention.
\[ACE(x) = E[Y \mid do(X = x+1)] - E[Y \mid do(X = x)]\]
This formula compares the average outcomes under two interventions that differ only in the value of \(X\).
Their difference is the causal effect of increasing \(X\) by one unit.
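To make the definition concrete, consider a linear structural model of the kind used in the bias-amplification simulation below, \(Y = \beta_X X + \gamma_U U + \epsilon_Y\). Intervening on \(X\) leaves \(U\) and \(\epsilon_Y\) untouched, so
\[E[Y \mid do(X = x)] = \beta_X x + \gamma_U E[U] + E[\epsilon_Y]\]
\[ACE(x) = E[Y \mid do(X = x+1)] - E[Y \mid do(X = x)] = \beta_X\]
In a linear model the ACE is therefore the structural slope on \(X\), which is what the regression-based examples below try to recover.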
\(Z\) is a confounder, a fork (a common cause).
An open backdoor path, \(X\) ← \(Z\) → \(Y\), creates a spurious association between \(X\) and \(Y\).
To block the backdoor path, we control for \(Z\), which gives an unbiased estimate of the ACE.
Models 2 and 3: \(Z\) is not a traditional confounder.
The unobserved \(U\) is the confounder; \(Z\) is not a common cause of \(X\) and \(Y\).
However, controlling for \(Z\) blocks the backdoor path created by \(U\) and yields an unbiased estimate of the ACE.
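A minimal simulation sketch of one of these cases, assuming the Model 2-type structure X ← Z ← U → Y (coefficients and seed are illustrative):
set.seed(2) # illustrative
n <- 1e4
U <- rnorm(n)             # unobserved confounder
Z <- U + rnorm(n)         # Z mediates U's effect on X
X <- Z + rnorm(n)
Y <- X + 2*U + rnorm(n)   # true effect of X on Y is 1
coef(lm(Y ~ X))["X"]      # biased: backdoor X <- Z <- U -> Y is open
coef(lm(Y ~ X + Z))["X"]  # close to 1: conditioning on Z blocks the backdoor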
M-Bias
\(Z\) is a pre-treatment variable.
Controlling for \(Z\) opens a backdoor path and induces bias: the ACE estimate, which was unbiased without \(Z\) in the model, becomes biased once \(Z\) is included.
See a detailed example in the Introduction to DAGs for Causal Inference in R
Neutral in terms of bias (not on any causal path)
A minimal adjustment set is not always optimal
Can improve efficiency (increase precision)
Possibly good for precision
Possibly bad for precision
Causes of \(Y\) that do not spoil identification are beneficial for precision
Causes of \(X\) that are not necessary for identification harm the asymptotic variance of the estimator
Bias Amplification
Simulation
n <- 1e4 # population size
Z <- rnorm(n) #exogenous
U <- rnorm(n) #exogenous
b_Z <- 2 #slope of Z for X
b_U <- 1 #slope for U for X
error_x <- rnorm(n) #random noise
# X is a function of Z, U and random noise
X <- b_Z*Z + b_U*U + error_x
b_X <- 1 #slope for X for Y
g_U <- 2 # slope for U for Y
error_Y <- rnorm(n) # random noise
# Y is a function of X, U and random noise
Y <- b_X*X + g_U*U + error_Y
# data frame (U is not observed, but we have it in the simulated data)
d <- data.frame(X, Z, Y, U)
What is going on?
We assumed the following system:
\[X = \beta_Z Z + \beta_U U + \epsilon_X\] \[ Y = \beta_X X + \gamma_U U + \epsilon_Y\]
Total variation in \(X\):
\[ Var(X) = \beta_Z^2 Var(Z) + \beta_U^2 Var(U) + Var(\epsilon_X)\]
# The Variance of X in the model Y ~ X
Var_X <- b_Z^2 * var(d$Z) + b_U^2 * var(d$U) + var(error_x)
Var_X
[1] 6.181868
After conditioning on \(Z\) (i.e., removing its contribution, see here):
\[ Var(X|Z) = \beta_U^2 Var(U) + Var(\epsilon_X)\]
# The variance of X when controlling for Z, in the model Y ~ X + Z
Var_X_Z <- b_U^2 * var(d$U) + var(error_x)
Var_X_Z
[1] 1.994982
\[ \text{Bias}_{\beta_X} = \frac{\text{Cov}(X, U)}{\text{Var}(X)} \, \gamma_{U}\]
\[ \text{Bias}_{\beta_{X|Z}} = \frac{\text{Cov}(X, U)}{\text{Var}(X|Z)} \, \gamma_{U}\]
Bias increases purely because we made \(X\) vary less overall, even though \(U\) did not change.
With less total variation in \(X\), \(U\)'s influence makes up a larger share of it, so the bias from \(U\) gets amplified.
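We can check this directly on the simulated data frame d defined above. The true effect is b_X = 1, and the Z-adjusted estimate should be further from it than the unadjusted one:
# Naive vs. Z-adjusted estimates of the effect of X on Y (true value: b_X = 1)
coef(lm(Y ~ X, data = d))["X"]       # biased by U
coef(lm(Y ~ X + Z, data = d))["X"]   # bias from U is amplified, not reduced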
Bias Amplification: Motivation and Training
To estimate the ACE, we want to avoid blocking any causal pathways from X to Y.
In Model 11, conditioning on Z blocks the entire causal pathway (X → Z → Y), producing classic overcontrol bias.
In Model 12, conditioning on Z blocks only part of the effect, because Z is a descendant of M, which lies on the causal path (X → M → Y).
The same logic applies even if there were an additional direct path X → Y.
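A minimal sketch of both cases (names, coefficients, and seed are illustrative; M is the mediator, Z a noisy descendant of M):
set.seed(11) # illustrative
n <- 1e4
X <- rnorm(n)
M <- 0.8*X + rnorm(n)     # mediator: X -> M -> Y
Z <- 0.9*M + rnorm(n)     # Z is a descendant of M
Y <- 0.5*M + rnorm(n)     # all of X's effect on Y runs through M
coef(lm(Y ~ X))["X"]      # total effect, roughly 0.8 * 0.5 = 0.4
coef(lm(Y ~ X + M))["X"]  # Model 11: overcontrol, roughly 0
coef(lm(Y ~ X + Z))["X"]  # Model 12: partial overcontrol, between 0 and 0.4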
Possibly good for precision
\(Z\) is a cause, not an effect, of the mediator \(M\), and consequently a cause of \(Y\) (Model 8)
Controlling for \(Z\) is either neutral or may increase the precision of the ACE estimate in finite samples
Some intuition of what is going on:
\(Z\) affects \(Y\) indirectly through \(M\). \(Y\) inherits some of \(Z\)’s influence, adding extra noise.
Adjusting for \(Z\) removes some of the variation in \(Y\) that is due to \(Z\). As a result, \(Y\) becomes less noisy, there’s less unexplained variance.
The precision of an estimate (the SE of the coefficient on \(X\)) depends on how much unexplained variance remains in the model.
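A minimal sketch of this precision gain (structure assumed here: X → Y and Z → M → Y, with Z unrelated to X; values are illustrative):
set.seed(8) # illustrative
n <- 1e3
Z <- rnorm(n)
M <- Z + rnorm(n)         # Z -> M
X <- rnorm(n)             # X is independent of Z and M
Y <- X + 2*M + rnorm(n)   # M -> Y and X -> Y
# SE of the coefficient on X: smaller once Z soaks up part of Y's variance
summary(lm(Y ~ X))$coefficients["X", "Std. Error"]
summary(lm(Y ~ X + Z))$coefficients["X", "Std. Error"]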
Possibly helpful in case of selection bias
Suppose we only have samples with \(W = w\) (a case of selection bias)
Imagine a study examining the effect of academic performance in high school on future earnings.
Academic performance may increase future earnings.
Students with high academic performance are also more likely to have strong recommendation letters.
Strong recommendation letters help students get into prestigious colleges.
Family wealth also influences college admissions.
Family wealth directly affects future earnings.
set.seed(50)
# Simulate data
n <- 1e4
performance <- rnorm(n) # Academic performance
wealth <- rnorm(n) # Family wealth
# Binary Recommendation Letter (Z)
letter <- rbinom(n, 1, prob = plogis(2*performance))
# Binary College Admission (W)
college <- rbinom(n, 1, prob = plogis(3*letter + 3*wealth))
# Future earnings (Y) depend on performance and wealth
earnings <- 1*performance + 5*wealth + rnorm(n)
# Create a dataset
data <- data.frame(performance, wealth, letter, college, earnings)
# Subset used below: only students admitted to a prestigious college
subd <- data[data$college == 1, ]
Selection bias: we only collect data from those students who attend prestigious universities.
This is effectively the same as conditioning on W (admission to a prestigious college) in a model, and it therefore opens a spurious path due to conditioning on a collider.
Biased estimate due to the collider:
Call:
lm(formula = earnings ~ performance, data = subd)
Residuals:
Min 1Q Median 3Q Max
-13.4359 -2.8610 -0.1702 2.7207 15.6359
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.26560 0.05109 44.343 <2e-16 ***
performance 0.44858 0.05139 8.728 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.134 on 6655 degrees of freedom
Multiple R-squared: 0.01132, Adjusted R-squared: 0.01117
F-statistic: 76.19 on 1 and 6655 DF, p-value: < 2.2e-16
Adjusting for Z (the recommendation letter) blocks the spurious path.
Call:
lm(formula = earnings ~ performance + letter, data = subd)
Residuals:
Min 1Q Median 3Q Max
-14.370 -2.817 -0.221 2.655 16.040
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.37026 0.08861 38.03 <2e-16 ***
performance 1.01213 0.06277 16.12 <2e-16 ***
letter -1.92218 0.12701 -15.13 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.064 on 6654 degrees of freedom
Multiple R-squared: 0.04422, Adjusted R-squared: 0.04393
F-statistic: 153.9 on 2 and 6654 DF, p-value: < 2.2e-16
What is going on?
1. Two routes to prestigious universities.
- Students are admitted (black dots) either through family wealth or academic performance (via letters of recommendation).
- This means some really wealthy but low-performing students get in, and some very high-performing but less wealthy students get in.
- Among admitted students (black dots), wealth and performance become negatively related — unlike in the general population, where they’re unrelated.
2. The relationship between performance and earnings is now distorted.
- In the general population, better academic performance leads to higher future earnings.
- Within prestigious universities (black dots), nearly all students do well later in life because they are either smart or wealthy.
- This makes the observed effect of academic performance on earnings appear weaker than it really is.
3. Controlling for letters of recommendation corrects the distortion.
- The bias arises because selection on admission ties wealth and performance together.
- When we compare students with the same letters of recommendation (middle plot), that link disappears.
- Within each letter level, wealth varies as in the general population, because wealth does not affect letters of recommendation.
- Since academic performance is a parent of letters of recommendation, wealth and academic performance are no longer associated once we condition on the letter; the sample then behaves like the general population.
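As a check, the same regression on the full, unselected sample should recover a coefficient on performance close to its true value of 1 (compare with the two outputs above):
coef(lm(earnings ~ performance, data = data))["performance"]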
Selection bias
Controlling for Z is no longer harmless: it is a collider, and conditioning on it opens otherwise closed path(s)
The birth-weight paradox (Hernández-Díaz et al. 2006)
What is going on?
Conclusion:
Smoking does not protect against mortality. The apparent protective effect among LBW infants arises from selection bias due to conditioning on a collider (birth weight), which induces a spurious inverse relationship between maternal smoking and unmeasured birth defects. As a result, LBW infants of non-smoking mothers tend to have the most severe birth defects, leading to higher mortality in that subgroup.
Case-control bias
\(Z\) is a descendant of \(Y\), which then acts as a "virtual collider": once we condition on its descendant \(Z\), we might accidentally open a non-causal path
Let’s add some unobserved exogenous variables
“Case-control” bias
Selecting participants based on an outcome can distort causal relationships
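A minimal sketch of this mechanism (structure assumed: X → Y → Z, with Z the selection indicator; names, numbers, and seed are illustrative):
set.seed(3) # illustrative
n <- 1e4
X <- rnorm(n)
Y <- X + rnorm(n)                # true effect of X on Y is 1
Z <- rbinom(n, 1, plogis(2*Y))   # selection depends on the outcome
sel <- Z == 1
coef(lm(Y ~ X))["X"]             # full sample: close to 1
coef(lm(Y[sel] ~ X[sel]))[2]     # outcome-selected sample: biased (attenuated)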
The Antebellum puzzle (Schneider, 2020)
During the 19th century, in Britain and the USA, the average height of adult men fell even though economic conditions and childhood nutrition improved
Possible explanation: selection bias (case-control). The data came from individuals enlisted in the military or held in prisons, effectively conditioning on colliders
Example: Military records
CN = childhood nutrition
E = enlisted in military
H = height
The causal path from height → enlisted represents the fact that taller men may have better opportunities in the civilian market, and thus shorter men were more likely to enlist
Restricting analysis to those enlisted in the military is therefore equivalent to controlling for enlisted and leads to selection bias
The status of a single control as “good” or “bad” may change depending on the context of the variables under consideration
A set of control variables Z will be “good” if:
It blocks all non-causal paths from the treatment to the outcome
It leaves any mediating paths from the treatment to the outcome “untouched” (since we are interested in the total effect)
It does not open new spurious paths between the treatment and the outcome (e.g., due to colliders).
For efficiency: