DAGs
People who drink coffee have a substantially higher chance of developing pancreatic cancer compared to those who don’t.
Should you stop drinking coffee?
Would changing our coffee-drinking habits change our chances of getting pancreatic cancer?
Coffee and Cigarettes, Jim Jarmusch (2003)
Coffee is not likely to cause pancreatic cancer.
A confounder: Smoking.
People who smoke are more likely to drink a lot of coffee.
Smoking is a well-established risk factor for pancreatic cancer.
Once later studies accounted for smoking, the apparent link between coffee and pancreatic cancer disappeared.
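The confounding story above can be sketched with a small simulation. This is a hypothetical illustration, not the original studies' data: the sample size, the prevalence of smoking, and all effect sizes are invented, and coffee is given no effect on cancer by construction.

```r
# Hypothetical simulation: smoking confounds coffee and pancreatic cancer.
# All probabilities and effect sizes below are invented for illustration.
set.seed(1)
N <- 20000
smoke  <- rbinom(N, size = 1, prob = 0.3)
# smokers are more likely to drink coffee
coffee <- rbinom(N, size = 1, prob = plogis(-1 + 2 * smoke))
# cancer depends on smoking only; coffee has no effect by construction
cancer <- rbinom(N, size = 1, prob = plogis(-3 + 1.5 * smoke))

unadjusted <- glm(cancer ~ coffee, family = binomial)
adjusted   <- glm(cancer ~ coffee + smoke, family = binomial)

# log-odds ratio for coffee, without and with adjustment for smoking
b_unadjusted <- unname(coef(unadjusted)["coffee"])
b_adjusted   <- unname(coef(adjusted)["coffee"])
round(c(unadjusted = b_unadjusted, adjusted = b_adjusted), 2)
```

Under these assumptions the unadjusted coefficient should be clearly positive, while the adjusted one should sit close to zero.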
Judea Pearl, 1966
Is there a statistical procedure that mimics randomization?
do-calculus is a formal set of rules, developed by Judea Pearl and colleagues, for determining whether and how causal effects can be identified from observational data. It works by mathematically transforming expressions that involve interventions (the do-operator) into purely observational probabilities.
It does so via:
Backdoor Criterion ←
Frontdoor Criterion: when some confounders are unmeasured
Beyond do-calculus: Instrumental Variables
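For reference, the backdoor criterion leads to the standard adjustment formula. The notation is an assumption here, since the slides do not introduce it: \(Z\) is a set of discrete covariates that satisfies the backdoor criterion relative to \(X\) and \(Y\).

\[P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)\]

The left-hand side involves an intervention; the right-hand side contains only quantities that can be estimated from observational data.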
library(ggplot2) # plotting
set.seed(200) # for reproducibility
N <- 300 # number of cases
# Z is independent
Z <- rbinom(N, size=1, prob=0.5)
# X and Y depend on Z
X <- rnorm(N, mean=2*Z-1, sd=1)
Y <- rnorm(N,mean=2*Z-1, sd=1)
d <- data.frame(X, Z, Y)
head(d, 3)
#>           X Z          Y
#> 1 1.2653517 1  1.2734351
#> 2 0.3402119 1  0.4166855
#> 3 0.5938416 1 -0.3498030
d$Z <- factor(d$Z) # make a categorical variable
p1 <- ggplot(d, aes(x = X, y = Y)) +
geom_point(size = 3) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
geom_point(size = 3) +
geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "red", linewidth = 2) +
geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "slateblue", linewidth = 2) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
scale_color_manual(values = c("slateblue", "red") ) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
Chain or mediation
Influence of \(X\) on \(Y\) is transmitted through \(Z\)
\(X\) and \(Y\) are associated
Once we account for \(Z\), the association is eliminated
The mediator \(Z\) “screens off” information about \(X\) from \(Y\)
set.seed(20) # for reproducibility
N <- 300 # number of cases
# X is independent
X <- rnorm(N, mean=0, sd=1)
# Z depends on X
Z <- rbinom(N, size=1,
prob=plogis(q=X, location=0, scale=1) )
# Y depends on Z
Y <- rnorm(N,mean=2*Z-1, sd=1)
d <- data.frame(X, Z, Y)
head(d, 3)
#>            X Z         Y
#> 1  1.1626853 1 0.2521431
#> 2 -0.5859245 1 1.8369678
#> 3  1.7854650 1 2.3036799
d$Z <- factor(d$Z) # make a categorical variable
p1 <- ggplot(d, aes(x = X, y = Y)) +
geom_point(size = 3) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
geom_point(size = 3) +
geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "red", linewidth = 2) +
geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "slateblue", linewidth = 2) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
scale_color_manual(values = c("slateblue", "red") ) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
\(X\) and \(Y\) are not associated (share no causes)
\(X\) and \(Y\) both influence \(Z\)
Once we account for \(Z\), \(X\) and \(Y\) become associated
set.seed(1983) # for reproducibility
N <- 300 # number of cases
# X and Y are independent
X <- rnorm(N, mean=0, sd=1)
Y <- rnorm(N, mean=0, sd=1)
# Z depends on X and Y
Z <- rbinom(N,size=1,
prob=plogis(q=2*X+2*Y-2, location=0, scale=1))
d <- data.frame(X, Z, Y)
head(d, 3)
#>             X Z         Y
#> 1 -0.01705205 1 3.0004107
#> 2 -0.78367184 0 0.7962816
#> 3  1.32662703 1 0.7537700
d$Z <- factor(d$Z) # make a categorical variable
p1 <- ggplot(d, aes(x = X, y = Y)) +
geom_point(size = 3) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
geom_point(size = 3) +
geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "red", linewidth = 2) +
geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "slateblue", linewidth = 2) +
geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
scale_color_manual(values = c("slateblue", "red") ) +
theme_minimal() +
theme(
axis.title.x = element_text(size = 30),
axis.title.y = element_text(size = 30),
legend.title = element_text(size = 30),
legend.text = element_text(size = 26))
Controlling for a descendant of a variable is equivalent to “partially” controlling for that variable.
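A minimal sketch of what “partially” controlling can look like, assuming a fork in which \(Z\) is a common cause of \(X\) and \(Y\) and \(D\) is a noisy descendant of \(Z\). The variable `D`, the linear relationships, and the unit coefficients are assumptions made for this illustration.

```r
# Hypothetical fork with a descendant: Z -> X, Z -> Y, Z -> D.
# X has no effect on Y; all coefficients are invented for illustration.
set.seed(2)
N <- 5000
Z <- rnorm(N)        # common cause
X <- Z + rnorm(N)
Y <- Z + rnorm(N)
D <- Z + rnorm(N)    # descendant (noisy proxy) of Z

# coefficient of X under three adjustment strategies
b_none       <- unname(coef(lm(Y ~ X))["X"])      # nothing adjusted
b_descendant <- unname(coef(lm(Y ~ X + D))["X"])  # descendant adjusted
b_parent     <- unname(coef(lm(Y ~ X + Z))["X"])  # common cause adjusted
round(c(none = b_none, descendant = b_descendant, parent = b_parent), 2)
```

With this setup the spurious coefficient is about 1/2 in expectation with no adjustment, about 1/3 when adjusting for `D`, and about 0 when adjusting for `Z` itself: adjusting for the descendant removes only part of the bias.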
The four elemental confounds:
Fork: Controlling for the common cause: \(X\) ⊥ \(Y\) | \(Z\)
Pipe: Controlling for the mediator: \(X\) ⊥ \(Y\) | \(Z\)
Collider: Controlling for the collider: \(X\) ⊥̸ \(Y\) | \(Z\)
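Descendant: Controlling for a descendant partially controls for its parent, so the result depends on whether that parent is a common cause, a mediator, or a collider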
⊥ : d-separation (directional separation); implies conditional independence.
⊥̸ : not d-separated; implies conditional dependence.
\(X\) ⊥ \(Y\) | \(Z\) : \(X\) is d-separated from \(Y\) given \(Z\); Conditioning on \(Z\) renders \(X\) and \(Y\) independent.
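The d-separation statements for the three basic structures can be checked with `dseparated()` from the dagitty package. This is a small sketch; the three-node graphs and the helper `check()` are defined here only for illustration.

```r
library(dagitty)

# the three basic structures, each on the nodes X, Y, Z
fork     <- dagitty("dag { Z -> X; Z -> Y }")
pipe     <- dagitty("dag { X -> Z; Z -> Y }")
collider <- dagitty("dag { X -> Z; Y -> Z }")

# hypothetical helper: is X d-separated from Y, marginally and given Z?
check <- function(g) {
  c(marginal = dseparated(g, "X", "Y"),
    given_Z  = dseparated(g, "X", "Y", "Z"))
}

res <- sapply(list(fork = fork, pipe = pipe, collider = collider), check)
res
```

For the fork and the pipe, `X` and `Y` are d-separated only after conditioning on `Z`; for the collider it is the other way around.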
How do we use this knowledge?
To estimate the total causal effect of \(X\) on \(Y\):
Identify all paths between \(X\) and \(Y\): causal and spurious (backdoor).
Do not block or adjust for any variables on the causal paths from \(X\) to \(Y\).
Block all backdoor paths between \(X\) and \(Y\) by conditioning on appropriate variables.
Causal path: All arrows flow from \(X\) to \(Y\) — represents how \(X\) affects \(Y\) (directly or via mediators).
\(X\) → \(Y\) : causal path
\(X\) → \(Z\) → \(Y\) : causal path
\(X\) → \(C\) → \(Y\) : causal path
Spurious path: Any non-causal path — induces association but not causation.
Backdoor path: A spurious path that starts with an arrow into \(X\).
\(X\) → \(Y\) : causal path
\(X\) → \(C\) → \(Y\) : causal path
\(X\) ← \(Z\) → \(Y\) : spurious backdoor path
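The three paths listed above can also be enumerated with `paths()` from dagitty. This sketch assumes a DAG containing exactly those three paths.

```r
library(dagitty)

# DAG with the paths X -> Y, X -> C -> Y and X <- Z -> Y
g <- dagitty("dag { X -> Y; X -> C; C -> Y; Z -> X; Z -> Y }")

# all paths between X and Y, and whether each is open
p_none <- paths(g, from = "X", to = "Y")           # no adjustment
p_Z    <- paths(g, from = "X", to = "Y", Z = "Z")  # adjusting for Z

data.frame(path            = p_none$paths,
           open_unadjusted = p_none$open,
           open_given_Z    = p_Z$open)
```

All three paths are open without adjustment; conditioning on `Z` closes the backdoor path and leaves the two causal paths open.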
\[Y = \alpha + \beta X\]
\[Y = \alpha + \beta_1 X + \beta_2 Z\]
\[Y = \alpha + \beta X\]
Total causal effect of \(X\) on \(Y\):
\(X\) → \(Y\)
\(X\) ← \(U\) → \(Z\) → \(Y\)
\(X\) → \(Y\): causal path
\(X\) ← \(U\) → \(Z\) → \(Y\): open backdoor path (BP), close with \(Z\)
Adjustment set: \(Z\).
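A simulated version of this DAG, as a sketch under assumed linear relationships. The true effect of \(X\) on \(Y\) is set to 0.5 and \(U\) is treated as unobserved; both choices, and all other coefficients, are invented for illustration.

```r
# Hypothetical data for X -> Y and X <- U -> Z -> Y
set.seed(3)
N <- 5000
U <- rnorm(N)                  # unobserved common cause of X and Z
Z <- U + rnorm(N)
X <- U + rnorm(N)
Y <- 0.5 * X + Z + rnorm(N)    # assumed true effect of X on Y: 0.5

b_unadjusted <- unname(coef(lm(Y ~ X))["X"])      # backdoor path open
b_adjusted   <- unname(coef(lm(Y ~ X + Z))["X"])  # backdoor path closed by Z
round(c(unadjusted = b_unadjusted, adjusted = b_adjusted), 2)
```

The unadjusted estimate is biased upward (about 1.0 in expectation here), while adjusting for `Z` recovers a value near 0.5 even though `U` is never used in the model.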
Total causal effect of \(X\) on \(Y\):
\(X\) → \(M\) → \(Y\)
\(X\) ← \(Z\) → \(M\) → \(Y\)
\(X\) → \(M\) → \(Y\): causal path (with a mediator \(M\)).
\(X\) ← \(Z\) → \(M\) → \(Y\): open BP, close with \(Z\).
Do not control for \(M\)!
Adjustment set: \(Z\).
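Why the mediator must be left alone can be seen in a short simulation. This is a sketch with invented linear coefficients, chosen so that the total effect of \(X\) on \(Y\) is 0.8 × 0.5 = 0.4.

```r
# Hypothetical data for X -> M -> Y and X <- Z -> M
set.seed(4)
N <- 5000
Z <- rnorm(N)
X <- Z + rnorm(N)
M <- 0.8 * X + Z + rnorm(N)    # mediator
Y <- 0.5 * M + rnorm(N)        # assumed total effect of X on Y: 0.4

b_unadjusted <- unname(coef(lm(Y ~ X))["X"])          # backdoor path open
b_Z          <- unname(coef(lm(Y ~ X + Z))["X"])      # correct adjustment set
b_ZM         <- unname(coef(lm(Y ~ X + Z + M))["X"])  # also blocks the causal path
round(c(unadjusted = b_unadjusted, Z = b_Z, Z_and_M = b_ZM), 2)
```

Adjusting for `Z` alone gives an estimate near 0.4; adding `M` drives the coefficient of `X` toward zero because the causal path itself is blocked.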
Total causal effect of \(X\) on \(Y\):
\(X\) → \(Y\)
\(X\) ← \(U_1\) → \(Z\) ← \(U_2\) → \(Y\)
\(X\) → \(Y\): causal path
\(X\) ← \(U_1\) → \(Z\) ← \(U_2\) → \(Y\): closed BP due to collider \(Z\)
Adjustment set: none.
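The cost of adjusting for the collider anyway can be sketched with simulated data. The linear coefficients are invented, and the true effect of \(X\) on \(Y\) is set to 0.5 for illustration.

```r
# Hypothetical data for X -> Y and X <- U1 -> Z <- U2 -> Y
set.seed(5)
N  <- 5000
U1 <- rnorm(N)                  # unobserved
U2 <- rnorm(N)                  # unobserved
X  <- U1 + rnorm(N)
Z  <- U1 + U2 + rnorm(N)        # collider
Y  <- 0.5 * X + U2 + rnorm(N)   # assumed true effect of X on Y: 0.5

b_unadjusted <- unname(coef(lm(Y ~ X))["X"])      # backdoor path stays closed
b_collider   <- unname(coef(lm(Y ~ X + Z))["X"])  # conditioning on Z opens it
round(c(unadjusted = b_unadjusted, collider_adjusted = b_collider), 2)
```

Here the unadjusted model is the correct one (estimate near 0.5); adding `Z` biases the estimate downward, to about 0.3 in expectation under these assumptions.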
Total causal effect of \(X\) on \(Y\):
\(X\) → \(Y\)
\(X\) ← \(C\) → \(Y\)
\(X\) ← \(Z\) → \(Y\)
\(X\) ← \(A\) → \(Z\) → \(Y\)
\(X\) ← \(Z\) ← \(B\) → \(Y\)
\(X\) ← \(A\) → \(Z\) ← \(B\) → \(Y\)
\(X\) → \(Y\): causal path
\(X\) ← \(C\) → \(Y\): open BP, close with \(C\)
\(X\) ← \(Z\) → \(Y\): open BP, close with \(Z\)
\(X\) ← \(A\) → \(Z\) → \(Y\): open BP, close with \(A\) or \(Z\)
\(X\) ← \(Z\) ← \(B\) → \(Y\): open BP, close with \(Z\) or \(B\)
\(X\) ← \(A\) → \(Z\) ← \(B\) → \(Y\): closed BP (\(Z\) is a collider)
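Adjustment set: \(\{C, Z, A\}\) or \(\{C, Z, B\}\) (conditioning on \(Z\) opens the collider path through \(A\) and \(B\), so one of them must also be included)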
Does civic engagement of Individual 1 lead to civic engagement of Individual 2 in a subsequent time period? (Elwert and Winship, 2014)
Total causal effect of civic engagement of individual 1 on civic engagement of individual 2:
Civic eng 1 → Civic eng 2: causal path
Civic eng 1 ← Altruism 1 → Friends ← Altruism 2 → Civic eng 2: closed BP due to collider Friends
Adjustment set: none.
Individuals may show similar civic engagement not because one influences the other, but because similar individuals tend to form friendships and share similar levels of civic engagement.
The common cause—altruism (unobserved)—drives both friendship formation and each individual’s civic engagement.
Conditioning on friendship (a collider) creates an association between civic engagement of individual 1 and civic engagement of individual 2, even when there is no causal effect.
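The friendship example can be sketched as a simulation. Everything here is hypothetical: altruism is simulated even though it is unobserved in practice, friendship is made more likely when altruism levels are similar, and civic engagement of individual 1 is given no effect on individual 2.

```r
# Hypothetical data: Civic 1 <- Altruism 1 -> Friends <- Altruism 2 -> Civic 2
set.seed(6)
N <- 10000
altruism1 <- rnorm(N)
altruism2 <- rnorm(N)
civic1 <- altruism1 + rnorm(N)
civic2 <- altruism2 + rnorm(N)  # no effect of civic1 on civic2 by construction

# friendship is more likely when altruism levels are similar (homophily)
friends <- rbinom(N, size = 1,
                  prob = plogis(2 - 3 * abs(altruism1 - altruism2)))

cor_all     <- cor(civic1, civic2)                              # all pairs
cor_friends <- cor(civic1[friends == 1], civic2[friends == 1])  # friends only
round(c(all_pairs = cor_all, friends_only = cor_friends), 2)
```

Across all pairs the correlation is close to zero; restricting the sample to friends, which conditions on the collider, produces a clearly positive correlation without any causal effect.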
For each treatment/exposure, design a unique statistical model based on a DAG.
Beware The Table 2 Fallacy: (Westreich and Greenland, 2013)
It is common to present multiple adjusted effect estimates from one model in a single table.
Not all coefficients are causal effects.
The Table 2 Fallacy is misinterpreting coefficients of control variables as causal effects, even when the model wasn’t designed to estimate them.
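A minimal sketch of the fallacy, assuming a hypothetical DAG with \(Z\) → \(X\) → \(Y\) and \(Z\) → \(Y\) and invented linear coefficients. The model is built to estimate the effect of \(X\), so \(Z\) enters only as a control.

```r
# Hypothetical data: Z -> X -> Y and Z -> Y
set.seed(7)
N <- 5000
Z <- rnorm(N)
X <- 0.8 * Z + rnorm(N)
Y <- 0.5 * X + 0.3 * Z + rnorm(N)
# assumed total effect of Z on Y: 0.3 + 0.8 * 0.5 = 0.7

fit_X <- lm(Y ~ X + Z)  # designed for the effect of X (Z closes the backdoor)
fit_Z <- lm(Y ~ Z)      # designed for the total effect of Z (X is a mediator)

b_X_total      <- unname(coef(fit_X)["X"])  # total effect of X
b_Z_in_X_model <- unname(coef(fit_X)["Z"])  # only the direct effect of Z
b_Z_total      <- unname(coef(fit_Z)["Z"])  # total effect of Z
round(c(X = b_X_total, Z_in_X_model = b_Z_in_X_model, Z_total = b_Z_total), 2)
```

The coefficient of `Z` in the first model is near 0.3, not 0.7: reading it from the same table as the total effect of `Z` would be the Table 2 Fallacy.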
The problem of deciding which variables to control for has been automated.
# required package
library(dagitty)
# for reproducibility
set.seed(22)
# specify causal relationships
d <- dagitty("dag {
X -> Y;
A -> X;
A -> Z;
B -> Z;
B -> Y;
Z -> X;
Z -> Y;
C -> X;
C -> Y
}")
# plot the DAG
plot(d)
{ B, C, Z }
{ A, C, Z }
{ A, B }
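The calls that produced the sets listed above are not shown, so the following is a reconstruction rather than the original code. It sketches how `adjustmentSets()` from dagitty can be queried for the same DAG with two different exposures; treating \(Z\) as a second exposure is an assumption made here.

```r
library(dagitty)

# same DAG as above
d <- dagitty("dag {
  X -> Y; A -> X; A -> Z; B -> Z; B -> Y;
  Z -> X; Z -> Y; C -> X; C -> Y
}")

# minimal adjustment sets for the total effect of X on Y
sets_X <- adjustmentSets(d, exposure = "X", outcome = "Y")
sets_X

# minimal adjustment sets for the total effect of Z on Y
sets_Z <- adjustmentSets(d, exposure = "Z", outcome = "Y")
sets_Z
```

For the exposure `X` the minimal sets are { A, C, Z } and { B, C, Z }; for the exposure `Z` the minimal set is { A, B }. Each exposure needs its own adjustment set, and therefore its own model.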
# required packages
library(dagitty)
library(tidyverse)
library(ggdag)
# specify causal relationships
d <- dagitty("dag {
X -> Y;
A -> X;
A -> Z;
B -> Z;
B -> Y;
Z -> X;
Z -> Y;
C -> X;
C -> Y
}")
# assign coordinates for the nodes
coordinates(d) <-
list(x=c(X=0, Z=1, Y=2, A=0, B=2, C=1),
y=c(X=0, Z=1, Y=0, A=1, B=1, C=-1))
# save DAG as a dataset
d1 <- d %>%
tidy_dagitty()
# make a plot
p <- ggplot(d1, aes(x=x, y=y, xend=xend, yend=yend)) +
geom_dag_edges() +
geom_dag_point( aes(color = name), size=20, show.legend = FALSE) +
geom_dag_text(colour = 'black', size=8) +
scale_color_manual(values = c(rep("white",6))) +
theme_dag()
p
R packages:
dagitty (Textor et al., 2016)
pcalg (Kalisch et al., 2012)
causaleffect (Tikka and Karvanen, 2017)
Python packages:
causalgraphicalmodels (Barr, 2018)
DoWhy (Sharma and Kiciman, 2018)
SAS:
CAUSALGRAPH (Thompson, 2019)
Other:
causalfusion (Bareinboim and Pearl, 2016)
DAGitty (Textor et al., 2016)
A Crash Course in Good and Bad Controls (Cinelli, Forney and Pearl, 2022)
The Table 2 Fallacy (Westreich and Greenland, 2013)
The C-Word (Hernán, 2018)