Of Coffee and Cancer


  • People who drink coffee have a substantially higher chance of developing pancreatic cancer compared to those who don’t.

  • Should you stop drinking coffee?

  • Would changing our coffee-drinking habits change our chances of getting pancreatic cancer?

Photo of an article

Of Coffee and Cigarettes

Still from Coffee and Cigarettes, Jim Jarmusch (2003)

  • Coffee is not likely to cause pancreatic cancer.

  • A confounder: Smoking.

  • People who smoke are more likely to drink a lot of coffee.

  • Smoking is a well-established risk factor for pancreatic cancer.

  • Once later studies accounted for smoking, the apparent link between coffee and pancreatic cancer disappeared.

Was the initial finding wrong?

Still from Coffee and Cigarettes, Jim Jarmusch (2003)

  • The original study answered the question: “Are coffee drinkers more likely to have pancreatic cancer?”
  • Different from a causal question: “If I stop drinking coffee, will my risk of pancreatic cancer decrease?”
  • The associations in the data alone cannot answer causal questions.

Classical Statistics

  • Focuses on associations rather than causation.
  • Only summarizes data.
  • No machine can derive explanations from raw data.


  • The research questions are often about causal relationships.
  • Some language implies causality (e.g., “dependent” and “independent” variables).

Experiments: the Gold Standard for Causality

  • A well-designed experiment (e.g., a randomized controlled trial) lets us intervene (if I do \(X\), what happens to \(Y\)?) and thereby establish a cause-and-effect relationship.
  • Secret weapon – randomization:
    • Any confounder (measured or unmeasured) is equally distributed across treatment and control groups, breaking its ability to bias the results.
    • Eliminates confounder bias by ensuring that treatment and control groups have no systematic differences.
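A minimal simulation sketch of this idea (all variable names and effect sizes below are hypothetical; the true effect of coffee on risk is set to zero):

# Sketch: randomization breaks the smoking confounder (hypothetical numbers)
set.seed(1)
N <- 1e5
smoke <- rbinom(N, size=1, prob=0.3)                   # confounder
coffee_obs <- rbinom(N, size=1, prob=0.2 + 0.5*smoke)  # observational: smokers drink more coffee
coffee_rct <- rbinom(N, size=1, prob=0.5)              # randomized: independent of smoking
risk_obs <- 2*smoke + rnorm(N)                         # coffee has no true effect on risk
risk_rct <- 2*smoke + rnorm(N)
coef(lm(risk_obs ~ coffee_obs))["coffee_obs"]  # biased away from zero by smoking
coef(lm(risk_rct ~ coffee_rct))["coffee_rct"]  # approximately zero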

Experiments are just a special case of deconfounding

  • Causal inference is about deconfounding, and experiments are just one way to achieve it, not inherently superior.
  • Often experimentation is not possible:
    1. Intervention may be physically impossible.
      E.g., Study of the effects of body weight on heart disease: We cannot randomly assign patients to groups based on higher or lower body weight.
    2. Intervention might be unethical.
      E.g., Study on smoking: We cannot randomly assign people to smoke for 10 years.
  • Observational studies have the advantage of being conducted in the natural habitat of the target population, not in an artificial lab setting.
  • If you can deconfound in another way, you don’t necessarily need an experiment for causal inference.

Causal Inference without Randomization

Photo of Judea Pearl playing a guitar in 1966
Judea Pearl, 1966

  • Is there a statistical procedure that mimics randomization?

  • do-calculus is a formal set of rules developed by Judea Pearl and colleagues. It allows us to determine whether, and how, causal effects can be identified from observational data, by mathematically transforming expressions involving interventions (the do-operator) into purely observational probabilities.

  • It does so via:

    1. Backdoor Criterion (the focus of this workshop)

    2. Frontdoor Criterion: when some confounders are unmeasured

    3. Beyond do-calculus: Instrumental Variables

The Workshop Plan

  1. DAG basics
  2. Four elemental confounds
  3. Backdoor criterion
  4. Examples
  5. Software

Directed Acyclic Graphs (DAGs)


  • A heuristic representation of a causal model.
  • Specifies which variables influence which others (which variables “listen” to which).
  • No assumptions are needed about the functional form of the causal relationships, nor about the distributions of the variables.

Generic DAG 2

Basics of DAGs


  • The letters represent random variables (\(X\), \(Y\), \(Z\)).
  • \(U\) typically represents unmeasured variables.
  • Arrows denote direct causal effects, e.g. \(X\) on \(Y\).
  • Analyze the DAG to deduce an appropriate statistical model for the causal effect of \(X\) on \(Y\):
    • Which adjusted (control) variables to include?
    • It is absolutely not safe to add everything.

Generic DAG 2

Richard McElreath’s Four Elemental Confounds

The Fork


  • \(X\) and \(Y\) share a common cause \(Z\).
  • \(X\) and \(Y\) are statistically correlated, even though there is no direct causal link.
  • Once we account for \(Z\), the spurious association is eliminated.
  • Shoe size ← Age of Child → Reading ability.
  • Ice cream consumption ← Temperature → Violent crime rates.

A plot showing correlation between violent crime and ice cream consumption

The Fork DAG

Simulate Fork

The Fork DAG

set.seed(200) # for reproducibility

N <- 300 # number of cases
# Z is independent
Z <- rbinom(N, size=1, prob=0.5) 
# X and Y depend on Z
X <- rnorm(N, mean=2*Z-1, sd=1) 
Y <- rnorm(N,mean=2*Z-1, sd=1) 

d <- data.frame(X, Z, Y)
head(d,3)
          X Z          Y
1 1.2653517 1  1.2734351
2 0.3402119 1  0.4166855
3 0.5938416 1 -0.3498030
Plot code
d$Z <- factor(d$Z) # make a categorical variable

p1 <- ggplot(d, aes(x = X, y = Y)) +
  geom_point(size = 3) +  
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 3) +  
  geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "red", linewidth = 2) +
  geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "slateblue", linewidth = 2) +
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  scale_color_manual(values = c("slateblue", "red") ) +  
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))
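A quick regression check on the simulated fork data (a sketch using the d created above):

coef(lm(Y ~ X, data=d))      # X slope clearly nonzero: spurious association
coef(lm(Y ~ X + Z, data=d))  # X slope near zero once the common cause Z is adjusted for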

The Pipe


  • Chain or mediation

  • Influence of \(X\) on \(Y\) is transmitted through \(Z\)

  • \(X\) and \(Y\) are associated

  • Once we account for \(Z\), the association is eliminated

  • The mediator \(Z\) “screens off” information about \(X\) from \(Y\)

  • Smoking → Lung damage → Shortness of breath

The Pipe DAG

Simulate Pipe

The Pipe DAG

set.seed(20) # for reproducibility

N <- 300 # number of cases
# X is independent
X <- rnorm(N, mean=0, sd=1) 
# Z depends on X
Z <- rbinom(N, size=1,
            prob=plogis(q=X, location=0, scale=1) )
# Y depends on Z
Y <- rnorm(N,mean=2*Z-1, sd=1) 

d <- data.frame(X, Z, Y)
head(d,3)
           X Z         Y
1  1.1626853 1 0.2521431
2 -0.5859245 1 1.8369678
3  1.7854650 1 2.3036799
Plot code
d$Z <- factor(d$Z) # make a categorical variable

p1 <- ggplot(d, aes(x = X, y = Y)) +
  geom_point(size = 3) +  
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 3) +  
  geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "red", linewidth = 2) +
  geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "slateblue", linewidth = 2) +
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  scale_color_manual(values = c("slateblue", "red") ) +  
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))
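The same regression check on the simulated pipe data (a sketch using the d created above):

coef(lm(Y ~ X, data=d))      # X and Y are associated through the mediator
coef(lm(Y ~ X + Z, data=d))  # X slope near zero: Z screens off X from Y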

The Collider


  • \(X\) and \(Y\) are not associated (share no causes)

  • \(X\) and \(Y\) both influence \(Z\)

  • Once we account for \(Z\), \(X\) and \(Y\) become associated

  • Food quality → Restaurant survival ← Location
  • Acting skills → Success ← Physical appearance

The Collider DAG

Simulate Collider

The Collider DAG

set.seed(1983) # for reproducibility

N <- 300 # number of cases
# X and Y are independent
X <- rnorm(N, mean=0, sd=1) 
Y <- rnorm(N, mean=0, sd=1) 
# Z depends on X and Y
Z <- rbinom(N,size=1,
  prob=plogis(q=2*X+2*Y-2, location=0, scale=1))

d <- data.frame(X, Z, Y)
head(d,3)
            X Z         Y
1 -0.01705205 1 3.0004107
2 -0.78367184 0 0.7962816
3  1.32662703 1 0.7537700
Plot code
d$Z <- factor(d$Z) # make a categorical variable

p1 <- ggplot(d, aes(x = X, y = Y)) +
  geom_point(size = 3) +  
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 3) +  
  geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "red", linewidth = 2) +
  geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "slateblue", linewidth = 2) +
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  scale_color_manual(values = c("slateblue", "red") ) +  
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))
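And the regression check on the simulated collider data (a sketch using the d created above):

coef(lm(Y ~ X, data=d))      # X slope near zero: marginally independent
coef(lm(Y ~ X + Z, data=d))  # X slope nonzero: conditioning on the collider Z opens the path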

The Descendant


Controlling for a descendant of a variable is equivalent to “partially” controlling for that variable.


A descendant example DAG where \(X\) points to \(Z\), and \(Z\) points to both \(Y\) and \(A\) (so \(A\) is a descendant of \(Z\)).
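A small simulation sketch of this DAG (all coefficients below are arbitrary choices):

set.seed(11)
N <- 10000
X <- rnorm(N, mean=0, sd=1)
Z <- rnorm(N, mean=X, sd=1)  # X -> Z
Y <- rnorm(N, mean=Z, sd=1)  # Z -> Y
A <- rnorm(N, mean=Z, sd=1)  # Z -> A: A is a noisy descendant of Z
coef(lm(Y ~ X))["X"]      # ~1: the full effect of X flows through Z
coef(lm(Y ~ X + A))["X"]  # ~0.5: adjusting for A partially blocks the pipe
coef(lm(Y ~ X + Z))["X"]  # ~0: adjusting for Z fully screens off X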

Bringing it all together

The four elemental confounds:

  1. Fork: Controlling for the common cause: \(X\) ⊥ \(Y\) | \(Z\)

  2. Pipe: Controlling for the mediator: \(X\) ⊥ \(Y\) | \(Z\)

  3. Collider: Controlling for the collider: \(X\) ⊥̸ \(Y\) | \(Z\)

  4. Descendant: Depends on the causal relationship above

⊥ : d-separation (directional separation); implies conditional independence.

⊥̸ : not d-separated; implies conditional dependence.

\(X\) ⊥ \(Y\) | \(Z\) : \(X\) is d-separated from \(Y\) given \(Z\); conditioning on \(Z\) renders \(X\) and \(Y\) independent.
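These d-separation claims can be checked mechanically; a minimal sketch with the dagitty package (introduced in the software section below) and the three elemental DAGs:

library(dagitty)
fork     <- dagitty("dag { Z -> X ; Z -> Y }")
pipe     <- dagitty("dag { X -> Z ; Z -> Y }")
collider <- dagitty("dag { X -> Z ; Y -> Z }")
impliedConditionalIndependencies(fork)      # X _||_ Y | Z
impliedConditionalIndependencies(pipe)      # X _||_ Y | Z
impliedConditionalIndependencies(collider)  # X _||_ Y  (marginally, not given Z)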



How do we use this knowledge?

Backdoor criterion

  • Assume a DAG.
  • Define the exposure (\(X\)) and the outcome (\(Y\)) based on the research question.

To estimate the total causal effect of \(X\) on \(Y\):

  1. Identify all paths between \(X\) and \(Y\): causal and spurious (backdoor).

  2. Do not block or adjust for any variables on the causal paths from \(X\) to \(Y\).

  3. Block all backdoor paths between \(X\) and \(Y\) by conditioning on appropriate variables.

Types of Paths

Causal path: All arrows flow from \(X\) to \(Y\) — represents how \(X\) affects \(Y\) (directly or via mediators).

\(X\) → \(Y\) : causal path

\(X\) → \(Z\) → \(Y\) : causal path

\(X\) → \(C\) → \(Y\) : causal path

Spurious path: Any non-causal path — induces association but not causation.

Backdoor path: A spurious path that starts with an arrow into \(X\).

\(X\) → \(Y\) : causal path
\(X\) → \(C\) → \(Y\) : causal path
\(X\) ← \(Z\) → \(Y\) : spurious backdoor path

Statistical Trend vs. Causal Effect

\[Y = \alpha + \beta X\]

  • Parameter \(\beta\) = average observed trend; doesn’t convey a causal relationship.
  • A coefficient may represent a causal effect, but data alone won’t reveal when.
  • Two additional ingredients are required:
    1. Path diagram (e.g., DAG).
    2. The adjusted variable(s) \(Z\) should satisfy the backdoor criterion.

\[Y = \alpha + \beta_1 X + \beta_2 Z\]

Generic DAG

\[Y = \alpha + \beta X\]

Generic DAG 2

Example 1

Total causal effect of \(X\) on \(Y\):

  1. Identify all paths between \(X\) and \(Y\)
  2. Do not perturb causal paths
  3. Block all open backdoor paths by conditioning on appropriate variables

\(X\) → \(Y\)

\(X\) ← \(U\) → \(Z\) → \(Y\)

\(X\) → \(Y\): causal path

\(X\) ← \(U\) → \(Z\) → \(Y\): open backdoor path (BP), close with \(Z\)

Adjustment set: \(Z\).
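A sketch of how dagitty (see the software section below) confirms this, with \(U\) declared latent:

library(dagitty)
g1 <- dagitty("dag { U [latent]
                     X -> Y ; U -> X ; U -> Z ; Z -> Y }")
paths(g1, from="X", to="Y")                    # both paths, and whether each is open
adjustmentSets(g1, exposure="X", outcome="Y")  # { Z }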

Example 2

Total causal effect of \(X\) on \(Y\):

  1. Identify all paths between \(X\) and \(Y\)
  2. Do not perturb causal paths
  3. Block all open backdoor paths by conditioning on appropriate variables

\(X\) → \(M\) → \(Y\)

\(X\) ← \(Z\) → \(M\) → \(Y\)

\(X\) → \(M\) → \(Y\): causal path (with a mediator \(M\)).

\(X\) ← \(Z\) → \(M\) → \(Y\): open BP, close with \(Z\).

Do not control for \(M\)!

Adjustment set: \(Z\).
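The corresponding dagitty check (a sketch):

library(dagitty)
g2 <- dagitty("dag { X -> M ; M -> Y ; Z -> X ; Z -> M }")
adjustmentSets(g2, exposure="X", outcome="Y")  # { Z }; the mediator M is left alone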

Example 3

The M-bias example DAG

Total causal effect of \(X\) on \(Y\):

  1. Identify all paths between \(X\) and \(Y\)
  2. Do not perturb causal paths
  3. Block all open backdoor paths by conditioning on appropriate variables

\(X\) → \(Y\)

\(X\) ← \(U1\) → \(Z\) ← \(U2\) → \(Y\)

\(X\) → \(Y\): causal path

\(X\) ← \(U1\) → \(Z\) ← \(U2\) → \(Y\): closed BP due to collider \(Z\)

Adjustment set: none.
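The dagitty check for the M-bias structure (a sketch; \(U1\) and \(U2\) declared latent):

library(dagitty)
g3 <- dagitty("dag { U1 [latent] U2 [latent]
                     X -> Y ; U1 -> X ; U1 -> Z ; U2 -> Z ; U2 -> Y }")
adjustmentSets(g3, exposure="X", outcome="Y")  # {} : adjust for nothing; do not touch Z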

Example 4

The example DAG

Total causal effect of \(X\) on \(Y\):

  1. Identify all paths between \(X\) and \(Y\)
  2. Do not perturb causal paths
  3. Block all open backdoor paths by conditioning on appropriate variables

\(X\) → \(Y\)

\(X\) ← \(C\) → \(Y\)

\(X\) ← \(Z\) → \(Y\)

\(X\) ← \(A\) → \(Z\) → \(Y\)

\(X\) ← \(Z\) ← \(B\) → \(Y\)

\(X\) ← \(A\) → \(Z\) ← \(B\) → \(Y\)

\(X\) → \(Y\): causal path

\(X\) ← \(C\) → \(Y\): open BP, close with \(C\)

\(X\) ← \(Z\) → \(Y\): open BP, close with \(Z\)

\(X\) ← \(A\) → \(Z\) → \(Y\): open BP, close with \(A\) or \(Z\)

\(X\) ← \(Z\) ← \(B\) → \(Y\): open BP, close with \(Z\) or \(B\)

\(X\) ← \(A\) → \(Z\) ← \(B\) → \(Y\): closed BP (\(Z\) is a collider)

Adjustment set: \(C\), \(Z\), and \(A\) or \(B\). (Conditioning on \(Z\) opens the collider path \(X\) ← \(A\) → \(Z\) ← \(B\) → \(Y\), so one of \(A\) or \(B\) must be adjusted as well.)

Homophily bias in social networks

Does civic engagement of Individual 1 lead to civic engagement of Individual 2 in a subsequent time period? (Elwert and Winship, 2014)

The M-bias example DAG

Total causal effect of civic engagement of Individual 1 on civic engagement of Individual 2:

  1. Identify all paths between civic eng 1 and civic eng 2
  2. Do not perturb causal paths
  3. Block all open backdoor paths by conditioning on appropriate variables

Civic eng 1 → Civic eng 2: causal path

Civic eng 1 ← Altruism 1 → Friends ← Altruism 2 → Civic eng 2: closed BP due to collider Friends

Adjustment set: none.

The M-bias example DAG

Individuals may show similar civic engagement not because one influences the other, but because similar individuals tend to form friendships and share similar levels of civic engagement.

The common cause—altruism (unobserved)—drives both friendship formation and each individual’s civic engagement.

Conditioning on friendship (a collider) creates an association between civic engagement of individual 1 and civic engagement of individual 2, even when there is no causal effect.

Causal Inference with Linear Models

  • For each treatment/exposure, design a unique statistical model based on a DAG.

  • Beware the Table 2 Fallacy (Westreich and Greenland, 2013):

    • It is common to present multiple adjusted effect estimates from one model in a single table.

    • Not all coefficients are causal effects.

    • The Table 2 Fallacy is misinterpreting coefficients of control variables as causal effects, even when the model wasn’t designed to estimate them.
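A minimal simulated illustration of the fallacy (hypothetical coefficients):

set.seed(3)
N <- 1e5
Z <- rnorm(N)
X <- rnorm(N, mean=Z)      # Z confounds X and Y
Y <- rnorm(N, mean=X + Z)  # total effects: X on Y is 1; Z on Y is 2 (1 direct + 1 via X)
coef(lm(Y ~ X + Z))
# X coefficient ~1: its total causal effect (Z satisfies the backdoor criterion for X)
# Z coefficient ~1: only Z's direct effect, not its total effect of 2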

Dagitty in R

The problem of deciding which variables to control for has been automated.


# required package
library(dagitty)

# for reproducibility
set.seed(22)

# specify causal relationships
d <- dagitty("dag { 
                X -> Y; 
                A -> X;  
                A -> Z;  
                B -> Z;  
                B -> Y;  
                Z -> X; 
                Z -> Y; 
                C -> X; 
                C -> Y
             }")

# plot the DAG
plot(d)

adjustmentSets(d, exposure = "X", outcome = "Y")
{ B, C, Z }
{ A, C, Z }
adjustmentSets(d, exposure = "Z", outcome = "Y")
{ A, B }


DAGs with ggdag package
# required packages
library(dagitty)
library(tidyverse)
library(ggdag)

# specify causal relationships
d <- dagitty("dag { 
                X -> Y; 
                A -> X;  
                A -> Z;  
                B -> Z;  
                B -> Y;  
                Z -> X; 
                Z -> Y; 
                C -> X; 
                C -> Y
             }")

# assign coordinates for the nodes
coordinates(d) <-
    list(x=c(X=0, Z=1, Y=2, A=0, B=2, C=1),
         y=c(X=0, Z=1, Y=0, A=1, B=1, C=-1))


# save DAG as a dataset
d1 <- d %>% 
  tidy_dagitty()

# make a plot
p <- ggplot(d1, aes(x=x, y=y, xend=xend, yend=yend)) +
  geom_dag_edges() +
  geom_dag_point( aes(color = name), size=20, show.legend = FALSE) +
  geom_dag_text(colour = 'black', size=8) +
  scale_color_manual(values = c(rep("white",6))) +
  theme_dag()
p

Other software


R packages:

dagitty (Textor et al., 2016)

pcalg (Kalisch et al., 2012)

causaleffect (Tikka and Karvanen, 2017)

Python packages:

causalgraphicalmodels (Barr, 2018)

DoWhy (Sharma and Kiciman, 2018)

SAS:

CAUSALGRAPH (Thompson, 2019)

Other:

causalfusion (Bareinboim and Pearl, 2016)

DAGitty (Textor et al., 2016)

Additional Resources

A Crash Course in Good and Bad Controls (Cinelli, Forney and Pearl, 2022)

The Table 2 Fallacy (Westreich and Greenland, 2013)

The C-Word (Hernán, 2018)