Of Coffee and Cancer


  • People who drink coffee have a substantially higher chance of developing pancreatic cancer compared to those who don’t.

  • Should you stop drinking coffee?

  • Would changing our coffee-drinking habits change our chances of getting pancreatic cancer?

Photo of an article

Of Coffee and Cigarettes

Still from Coffee and Cigarettes, Jim Jarmusch (2003)

  • Coffee is not likely to cause pancreatic cancer.

  • A confounder: Smoking.

  • People who smoke are more likely to drink a lot of coffee.

  • Smoking is a well-established risk factor for pancreatic cancer.

  • Once later studies accounted for smoking, the apparent link between coffee and pancreatic cancer disappeared.

Was the initial finding wrong?

Still from Coffee and Cigarettes, Jim Jarmusch (2003)

  • The original study answered the question: “Are coffee drinkers more likely to have pancreatic cancer?”
  • Different from a causal question: “If I stop drinking coffee, will my risk of pancreatic cancer decrease?”
  • The associations in the data alone cannot answer causal questions.

Classical Statistics

  • Focuses on associations rather than causation.
  • Only summarizes data.
  • No machine can derive explanations from raw data.


  • The research questions are often about causal relationships.
  • Some language implies causality (e.g., “dependent” and “independent” variables).

Experiments: the Gold Standard for Causality

  • A well-designed experiment (e.g., a randomized controlled trial) lets us intervene (if I do \(X\), what happens to \(Y\)?) and thereby establish a cause-and-effect relationship.
  • Secret weapon – randomization:
    • Any confounder (measured or unmeasured) is equally distributed across treatment and control groups, breaking its ability to bias the results.
    • Eliminates confounder bias by ensuring that treatment and control groups have no systematic differences.
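A minimal simulation sketch of this idea (all variable names and effect sizes below are hypothetical; the true effect of coffee on risk is set to zero):

# Sketch: randomization breaks the smoking confounder (hypothetical numbers)
set.seed(1)
N <- 1e5
smoke <- rbinom(N, size=1, prob=0.3)                   # confounder
coffee_obs <- rbinom(N, size=1, prob=0.2 + 0.5*smoke)  # observational: smokers drink more coffee
coffee_rct <- rbinom(N, size=1, prob=0.5)              # randomized: independent of smoking
risk_obs <- 2*smoke + rnorm(N)                         # coffee has no true effect on risk
risk_rct <- 2*smoke + rnorm(N)
coef(lm(risk_obs ~ coffee_obs))["coffee_obs"]  # biased away from zero by smoking
coef(lm(risk_rct ~ coffee_rct))["coffee_rct"]  # approximately zero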

Experiments are just a special case of deconfounding

  • Causal inference is about deconfounding, and experiments are just one way to achieve it, not inherently superior.
  • Often experimentation is not possible:
    1. Intervention may be physically impossible.
      E.g., Study of the effects of body weight on heart disease: We cannot randomly assign patients to groups based on higher or lower body weight.
    2. Intervention might be unethical.
      E.g., Study on smoking: We cannot randomly assign people to smoke for 10 years.
  • Observational studies have the advantage of being conducted in the natural habitat of the target population, not in an artificial lab setting.
  • If you can deconfound in another way, you don’t necessarily need an experiment for causal inference.

Causal Inference without Randomization

Photo of Judea Pearl playing a guitar in 1966
Judea Pearl, 1966

  • Is there a statistical procedure that mimics randomization?

  • do-calculus is a formal set of rules developed by Judea Pearl and colleagues. It allows us to determine whether, and how, causal effects can be identified from observational data, by mathematically transforming expressions involving interventions (the do-operator) into purely observational probabilities.

  • It does so via:

    1. Backdoor Criterion (the focus of this workshop)

    2. Frontdoor Criterion: when some confounders are unmeasured

    3. Beyond do-calculus: Instrumental Variables

The Workshop Plan

  1. DAG basics
  2. Four elemental confounds
  3. Backdoor criterion
  4. Examples
  5. Software

Directed Acyclic Graphs (DAGs)


  • A heuristic representation of a causal model.
  • Specifies which variables influence which others (which variables “listen” to which).
  • No assumptions are needed about the functional form of the causal relationships, nor about the distributions of the variables.

Generic DAG 2

Basics of DAGs


  • The letters represent random variables (\(X\), \(Y\), \(Z\)).
  • \(U\) typically represents unmeasured variables.
  • Arrows denote direct causal effects, e.g. \(X\) on \(Y\).
  • Analyze the DAG to deduce an appropriate statistical model for the causal effect of \(X\) on \(Y\):
    • Which adjusted (control) variables to include?
    • It is absolutely not safe to add everything.

Generic DAG 2

Richard McElreath’s Four Elemental Confounds

The Fork


  • \(X\) and \(Y\) share a common cause \(Z\).
  • \(X\) and \(Y\) are statistically correlated, even though there is no direct causal link.
  • Once we account for \(Z\), the spurious association is eliminated.
  • Shoe size ← Age of Child → Reading ability.
  • Ice cream consumption ← Temperature → Violent crime rates.

A plot showing correlation between violent crime and ice cream consumption

The Fork DAG

Simulate Fork

The Fork DAG

set.seed(200) # for reproducibility

N <- 300 # number of cases
# Z is independent
Z <- rbinom(N, size=1, prob=0.5) 
# X and Y depend on Z
X <- rnorm(N, mean=2*Z-1, sd=1) 
Y <- rnorm(N,mean=2*Z-1, sd=1) 

d <- data.frame(X, Z, Y)
head(d,3)
          X Z          Y
1 1.2653517 1  1.2734351
2 0.3402119 1  0.4166855
3 0.5938416 1 -0.3498030
Plot code
d$Z <- factor(d$Z) # make a categorical variable

p1 <- ggplot(d, aes(x = X, y = Y)) +
  geom_point(size = 3) +  
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 3) +  
  geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "red", linewidth = 2) +
  geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "slateblue", linewidth = 2) +
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  scale_color_manual(values = c("slateblue", "red") ) +  
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))
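A quick regression check on the simulated fork data (a sketch using the d created above):

coef(lm(Y ~ X, data=d))      # X slope clearly nonzero: spurious association
coef(lm(Y ~ X + Z, data=d))  # X slope near zero once the common cause Z is adjusted for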

The Pipe


  • Chain or mediation

  • Influence of \(X\) on \(Y\) is transmitted through \(Z\)

  • \(X\) and \(Y\) are associated

  • Once we account for \(Z\), the association is eliminated

  • The mediator \(Z\) “screens off” information about \(X\) from \(Y\)

  • Smoking → Lung damage → Shortness of breath

The Pipe DAG

Simulate Pipe

The Pipe DAG

set.seed(20) # for reproducibility

N <- 300 # number of cases
# X is independent
X <- rnorm(N, mean=0, sd=1) 
# Z depends on X
Z <- rbinom(N, size=1,
            prob=plogis(q=X, location=0, scale=1) )
# Y depends on Z
Y <- rnorm(N,mean=2*Z-1, sd=1) 

d <- data.frame(X, Z, Y)
head(d,3)
           X Z         Y
1  1.1626853 1 0.2521431
2 -0.5859245 1 1.8369678
3  1.7854650 1 2.3036799
Plot code
d$Z <- factor(d$Z) # make a categorical variable

p1 <- ggplot(d, aes(x = X, y = Y)) +
  geom_point(size = 3) +  
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 3) +  
  geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "red", linewidth = 2) +
  geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "slateblue", linewidth = 2) +
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  scale_color_manual(values = c("slateblue", "red") ) +  
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))
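The same regression check on the simulated pipe data (a sketch using the d created above):

coef(lm(Y ~ X, data=d))      # X and Y are associated through the mediator
coef(lm(Y ~ X + Z, data=d))  # X slope near zero: Z screens off X from Y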

The Collider


  • \(X\) and \(Y\) are not associated (share no causes)

  • \(X\) and \(Y\) both influence \(Z\)

  • Once we account for \(Z\), \(X\) and \(Y\) become associated

  • Food quality → Restaurant survival ← Location
  • Acting skills → Success ← Physical appearance

The Collider DAG

Simulate Collider

The Collider DAG

set.seed(1983) # for reproducibility

N <- 300 # number of cases
# X and Y are independent
X <- rnorm(N, mean=0, sd=1) 
Y <- rnorm(N, mean=0, sd=1) 
# Z depends on X and Y
Z <- rbinom(N,size=1,
  prob=plogis(q=2*X+2*Y-2, location=0, scale=1))

d <- data.frame(X, Z, Y)
head(d,3)
            X Z         Y
1 -0.01705205 1 3.0004107
2 -0.78367184 0 0.7962816
3  1.32662703 1 0.7537700
Plot code
d$Z <- factor(d$Z) # make a categorical variable

p1 <- ggplot(d, aes(x = X, y = Y)) +
  geom_point(size = 3) +  
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 3) +  
  geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "red", linewidth = 2) +
  geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "slateblue", linewidth = 2) +
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  scale_color_manual(values = c("slateblue", "red") ) +  
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))
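And the regression check on the simulated collider data (a sketch using the d created above):

coef(lm(Y ~ X, data=d))      # X slope near zero: marginally independent
coef(lm(Y ~ X + Z, data=d))  # X slope nonzero: conditioning on the collider Z opens the path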

The Descendant


Controlling for a descendant of a variable is equivalent to “partially” controlling for that variable.


A descendant example DAG where \(X\) points to \(Z\), and \(Z\) points to both \(Y\) and \(A\) (so \(A\) is a descendant of \(Z\)).
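A small simulation sketch of this DAG (all coefficients below are arbitrary choices):

set.seed(11)
N <- 10000
X <- rnorm(N, mean=0, sd=1)
Z <- rnorm(N, mean=X, sd=1)  # X -> Z
Y <- rnorm(N, mean=Z, sd=1)  # Z -> Y
A <- rnorm(N, mean=Z, sd=1)  # Z -> A: A is a noisy descendant of Z
coef(lm(Y ~ X))["X"]      # ~1: the full effect of X flows through Z
coef(lm(Y ~ X + A))["X"]  # ~0.5: adjusting for A partially blocks the pipe
coef(lm(Y ~ X + Z))["X"]  # ~0: adjusting for Z fully screens off X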

Bringing it all together

The four elemental confounds:

  1. Fork: Controlling for the common cause: \(X\) ⊥ \(Y\) | \(Z\)

  2. Pipe: Controlling for the mediator: \(X\) ⊥ \(Y\) | \(Z\)

  3. Collider: Controlling for the collider: \(X\) ⊥̸ \(Y\) | \(Z\)

  4. Descendant: Depends on the causal relationship above

⊥ : d-separation (directional separation); implies conditional independence.

⊥̸ : not d-separated; implies conditional dependence.

\(X\) ⊥ \(Y\) | \(Z\) : \(X\) is d-separated from \(Y\) given \(Z\); conditioning on \(Z\) renders \(X\) and \(Y\) independent.
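These d-separation claims can be checked mechanically; a minimal sketch with the dagitty package (introduced in the software section below) and the three elemental DAGs:

library(dagitty)
fork     <- dagitty("dag { Z -> X ; Z -> Y }")
pipe     <- dagitty("dag { X -> Z ; Z -> Y }")
collider <- dagitty("dag { X -> Z ; Y -> Z }")
impliedConditionalIndependencies(fork)      # X _||_ Y | Z
impliedConditionalIndependencies(pipe)      # X _||_ Y | Z
impliedConditionalIndependencies(collider)  # X _||_ Y  (marginally, not given Z)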



How do we use this knowledge?

Backdoor criterion

  • Assume a DAG.
  • Define the exposure (\(X\)) and the outcome (\(Y\)) based on the research question.

To estimate the total causal effect of \(X\) on \(Y\):

  1. Identify all paths between \(X\) and \(Y\): causal and spurious (backdoor).

  2. Do not block or adjust for any variables on the causal paths from \(X\) to \(Y\).

  3. Block all backdoor paths between \(X\) and \(Y\) by conditioning on appropriate variables.

Types of Paths

Causal path: All arrows flow from \(X\) to \(Y\) — represents how \(X\) affects \(Y\) (directly or via mediators).

\(X\) → \(Y\) : causal path

\(X\) → \(Z\) → \(Y\) : causal path

\(X\) → \(C\) → \(Y\) : causal path

Spurious path: Any non-causal path — induces association but not causation.

Backdoor path: A spurious path that starts with an arrow into \(X\).

\(X\) → \(Y\) : causal path
\(X\) → \(C\) → \(Y\) : causal path
\(X\) ← \(Z\) → \(Y\) : spurious backdoor path

Statistical Trend vs. Causal Effect

\[Y = \alpha + \beta X\]

  • Parameter \(\beta\) = average observed trend; doesn’t convey a causal relationship.
  • A coefficient may represent a causal effect, but data alone won’t reveal when.
  • Two additional ingredients are required:
    1. Path diagram (e.g., DAG).
    2. The adjusted variable(s) \(Z\) should satisfy the backdoor criterion.

\[Y = \alpha + \beta_1 X + \beta_2 Z\]

Generic DAG

\[Y = \alpha + \beta X\]

Generic DAG 2

Example 1

Total causal effect of \(X\) on \(Y\):

  1. Identify all paths between \(X\) and \(Y\)
  2. Do not perturb causal paths
  3. Block all open backdoor paths by conditioning on appropriate variables

\(X\) → \(Y\)

\(X\) ← \(U\) → \(Z\) → \(Y\)

\(X\) → \(Y\): causal path

\(X\) ← \(U\) → \(Z\) → \(Y\): open backdoor path (BP), close with \(Z\)

Adjustment set: \(Z\).
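A sketch of how dagitty (see the software section below) confirms this, with \(U\) declared latent:

library(dagitty)
g1 <- dagitty("dag { U [latent]
                     X -> Y ; U -> X ; U -> Z ; Z -> Y }")
paths(g1, from="X", to="Y")                    # both paths, and whether each is open
adjustmentSets(g1, exposure="X", outcome="Y")  # { Z }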

Example 2

Total causal effect of \(X\) on \(Y\):

  1. Identify all paths between \(X\) and \(Y\)
  2. Do not perturb causal paths
  3. Block all open backdoor paths by conditioning on appropriate variables

\(X\) → \(M\) → \(Y\)

\(X\) ← \(Z\) → \(M\) → \(Y\)

\(X\) → \(M\) → \(Y\): causal path (with a mediator \(M\)).

\(X\) ← \(Z\) → \(M\) → \(Y\): open BP, close with \(Z\).

Do not control for \(M\)!

Adjustment set: \(Z\).
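The corresponding dagitty check (a sketch):

library(dagitty)
g2 <- dagitty("dag { X -> M ; M -> Y ; Z -> X ; Z -> M }")
adjustmentSets(g2, exposure="X", outcome="Y")  # { Z }; the mediator M is left alone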

Example 3

The M-bias example DAG

Total causal effect of \(X\) on \(Y\):

  1. Identify all paths between \(X\) and \(Y\)
  2. Do not perturb causal paths
  3. Block all open backdoor paths by conditioning on appropriate variables

\(X\) → \(Y\)

\(X\) ← \(U1\) → \(Z\) ← \(U2\) → \(Y\)

\(X\) → \(Y\): causal path

\(X\) ← \(U1\) → \(Z\) ← \(U2\) → \(Y\): closed BP due to collider \(Z\)

Adjustment set: none.
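The dagitty check for the M-bias structure (a sketch; \(U1\) and \(U2\) declared latent):

library(dagitty)
g3 <- dagitty("dag { U1 [latent] U2 [latent]
                     X -> Y ; U1 -> X ; U1 -> Z ; U2 -> Z ; U2 -> Y }")
adjustmentSets(g3, exposure="X", outcome="Y")  # {} : adjust for nothing; do not touch Z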

Example 4

The example DAG

Total causal effect of \(X\) on \(Y\):

  1. Identify all paths between \(X\) and \(Y\)
  2. Do not perturb causal paths
  3. Block all open backdoor paths by conditioning on appropriate variables

\(X\) → \(Y\)

\(X\) ← \(C\) → \(Y\)

\(X\) ← \(Z\) → \(Y\)

\(X\) ← \(A\) → \(Z\) → \(Y\)

\(X\) ← \(Z\) ← \(B\) → \(Y\)

\(X\) ← \(A\) → \(Z\) ← \(B\) → \(Y\)

\(X\) → \(Y\): causal path

\(X\) ← \(C\) → \(Y\): open BP, close with \(C\)

\(X\) ← \(Z\) → \(Y\): open BP, close with \(Z\)

\(X\) ← \(A\) → \(Z\) → \(Y\): open BP, close with \(A\) or \(Z\)

\(X\) ← \(Z\) ← \(B\) → \(Y\): open BP, close with \(Z\) or \(B\)

\(X\) ← \(A\) → \(Z\) ← \(B\) → \(Y\): closed BP (\(Z\) is a collider)

Adjustment set: \(C\), \(Z\), and \(A\) or \(B\). (Conditioning on \(Z\) opens the collider path \(X\) ← \(A\) → \(Z\) ← \(B\) → \(Y\), so one of \(A\) or \(B\) must be adjusted as well.)

Homophily bias in social networks

Does civic engagement of Individual 1 lead to civic engagement of Individual 2 in a subsequent time period? (Elwert and Winship, 2014)

The M-bias example DAG

Total causal effect of civic engagement of Individual 1 on civic engagement of Individual 2:

  1. Identify all paths between civic eng 1 and civic eng 2
  2. Do not perturb causal paths
  3. Block all open backdoor paths by conditioning on appropriate variables

Civic eng 1 → Civic eng 2: causal path

Civic eng 1 ← Altruism 1 → Friends ← Altruism 2 → Civic eng 2: closed BP due to collider Friends

Adjustment set: none.

The M-bias example DAG

Individuals may show similar civic engagement not because one influences the other, but because similar individuals tend to form friendships and share similar levels of civic engagement.

The common cause—altruism (unobserved)—drives both friendship formation and each individual’s civic engagement.

Conditioning on friendship (a collider) creates an association between civic engagement of individual 1 and civic engagement of individual 2, even when there is no causal effect.

Causal Inference with Linear Models

  • For each treatment/exposure, design a unique statistical model based on a DAG.

  • Beware the Table 2 Fallacy (Westreich and Greenland, 2013):

    • It is common to present multiple adjusted effect estimates from one model in a single table.

    • Not all coefficients are causal effects.

    • The Table 2 Fallacy is misinterpreting coefficients of control variables as causal effects, even when the model wasn’t designed to estimate them.
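A minimal simulated illustration of the fallacy (hypothetical coefficients):

set.seed(3)
N <- 1e5
Z <- rnorm(N)
X <- rnorm(N, mean=Z)      # Z confounds X and Y
Y <- rnorm(N, mean=X + Z)  # total effects: X on Y is 1; Z on Y is 2 (1 direct + 1 via X)
coef(lm(Y ~ X + Z))
# X coefficient ~1: its total causal effect (Z satisfies the backdoor criterion for X)
# Z coefficient ~1: only Z's direct effect, not its total effect of 2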

Dagitty in R

The problem of deciding which variables to control for has been automated.


# required package
library(dagitty)

# for reproducibility
set.seed(22)

# specify causal relationships
d <- dagitty("dag { 
                X -> Y; 
                A -> X;  
                A -> Z;  
                B -> Z;  
                B -> Y;  
                Z -> X; 
                Z -> Y; 
                C -> X; 
                C -> Y
             }")

# plot the DAG
plot(d)

adjustmentSets(d, exposure = "X", outcome = "Y")
{ B, C, Z }
{ A, C, Z }
adjustmentSets(d, exposure = "Z", outcome = "Y")
{ A, B }


DAGs with ggdag package
# required packages
library(dagitty)
library(tidyverse)
library(ggdag)

# specify causal relationships
d <- dagitty("dag { 
                X -> Y; 
                A -> X;  
                A -> Z;  
                B -> Z;  
                B -> Y;  
                Z -> X; 
                Z -> Y; 
                C -> X; 
                C -> Y
             }")

# assign coordinates for the nodes
coordinates(d) <-
    list(x=c(X=0, Z=1, Y=2, A=0, B=2, C=1),
         y=c(X=0, Z=1, Y=0, A=1, B=1, C=-1))


# save DAG as a dataset
d1 <- d %>% 
  tidy_dagitty()

# make a plot
p <- ggplot(d1, aes(x=x, y=y, xend=xend, yend=yend)) +
  geom_dag_edges() +
  geom_dag_point( aes(color = name), size=20, show.legend = FALSE) +
  geom_dag_text(colour = 'black', size=8) +
  scale_color_manual(values = c(rep("white",6))) +
  theme_dag()
p

Other software


R packages:

dagitty (Textor et al., 2016)

pcalg (Kalisch et al., 2012)

causaleffect (Tikka and Karvanen, 2017)

Python packages:

causalgraphicalmodels (Barr, 2018)

DoWhy (Sharma and Kiciman, 2018)

SAS:

CAUSALGRAPH (Thompson, 2019)

Other:

causalfusion (Bareinboim and Pearl, 2016)

DAGitty (Textor et al., 2016)

Additional Resources

A Crash Course in Good and Bad Controls (Cinelli, Forney and Pearl, 2022)

The Table 2 Fallacy (Westreich and Greenland, 2013)

The C-Word (Hernán, 2018)