To Aggregate or To Partition?


  • Aggregate: use the totals.
  • Partition: analyze males and females separately.
  • Bad/Bad/Good drug: bad for women, bad for men, yet seemingly good overall.
  • The drug cannot simultaneously cause heart attacks and prevent them.
  • The data alone cannot tell us which population we should use to make inferences about the effectiveness of the drug.

Does the drug cause heart attacks?

Table 1: Fictitious data from “The Book of Why”

(a) Control (No Drug)

         Heart attack   No heart attack
Female         1               19
Male          12               28
Total         13               47

(b) Treatment (Took Drug)

         Heart attack   No heart attack
Female         3               37
Male           8               12
Total         11               49

Does the drug cause heart attacks?

Table 2: Control (No Drug)

                      Heart attack   No heart attack
Low blood pressure          1               19
High blood pressure        12               28
Total                      13               47

Table 3: Treatment (Took Drug)

                      Heart attack   No heart attack
Low blood pressure          3               37
High blood pressure         8               12
Total                      11               49

Fictitious data from “The Book of Why”


Table 4: Control (No Drug)

         Heart attack   No heart attack
Female         1               19
Male          12               28
Total         13               47

Table 5: Treatment (Took Drug)

         Heart attack   No heart attack
Female         3               37
Male           8               12
Total         11               49

Fictitious data from “The Book of Why”

Aggregating is not always wrong and partitioning is not always right.

The answer is in the data generating process.

Simpson’s paradox: control for gender

Gender is a confounder on a backdoor path. Controlling for a confounder is appropriate.

Simpson’s paradox: control for blood pressure

Blood pressure is a mediator on a causal path.
Sometimes aggregating data is appropriate.

Screenshot of paper title


Good control: A variable that must be adjusted for to identify the causal effect of interest (i.e., to block confounding paths).

Bad control: A variable that introduces bias when included in the model, typically because it is affected by the treatment or acts as (or opens) a collider path.


Paper

Supplemental Code

Confounding: The Core Challenge

  • Real-world outcomes have many causes.
    Multiple variables often influence the outcome of interest.

  • When we estimate the causal effect of one exposure using a linear model, that relationship is typically confounded — mixed with other causal pathways.

  • “Deconfounding” means isolating the effect of the exposure from other influences.

  • Some bad rules:

    • Include everything you have (what Richard McElreath calls a “causal salad”)

    • Some heuristics (e.g., include only pre-treatment variables, avoid highly collinear ones)

  • None of these rules is satisfactory.

  • Because not all covariates are “good controls.” Some introduce bias (“bad controls”), others block the causal pathways.

  • The key is causal structure, not variable count.

Randomized Controlled Trials: The Gold Standard for Eliminating Confounding

  • Randomized controlled trials (RCTs) assign participants to treatment or control groups at random.
  • As a result, the distribution of confounders, both measured and unmeasured, is expected to be balanced between groups.
  • Therefore, any systematic difference in outcomes between groups can be attributed to the treatment, rather than to confounding.
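This balancing property can be seen in a minimal R sketch (the confounder `U` and the sample size are illustrative assumptions, not part of the slides):

```r
set.seed(1)
n <- 1e5
U <- rnorm(n)                            # an unmeasured confounder
treat <- rbinom(n, size = 1, prob = 0.5) # random assignment, independent of U

# U is (approximately) balanced across the two arms
mean(U[treat == 1]) - mean(U[treat == 0])
```

Because assignment ignores `U`, the difference in means is close to zero, for this and any other confounder, measured or not.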

Observational studies

  • In observational data, we did not randomize anything.
  • We need a strategy that mimics randomization to deconfound.
  • do-calculus (via the backdoor criterion) identifies a set of variables which, if controlled for, removes confounding bias between exposure and outcome.
  • Using DAGs (Directed Acyclic Graphs) and the backdoor criterion, we can determine which variables to adjust for.

Controlling for variables in linear models

  • When you condition on \(Z\) in a linear model, you compare units of \(X\) within each level of \(Z\). Averaging across these strata gives the overall effect.
  • Within each stratum, \(Z\) is constant, so differences in \(Y\) across levels of \(X\) cannot be attributed to \(Z\).
  • Conditioning does not remove the influence of \(Z\) on \(X\) and \(Y\) globally, only randomization can do that automatically. But by conditioning, we mimic randomization: we compare units of \(X\) that share the same value of \(Z\), effectively controlling for its confounding effect.
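A minimal R sketch of this idea (variable names and coefficients are illustrative): the adjusted coefficient recovers the within-stratum slopes, while the unadjusted one is confounded.

```r
set.seed(2)
N <- 1e4
Z <- rbinom(N, size = 1, prob = 0.5)        # confounder
X <- rnorm(N, mean = 2 * Z, sd = 1)         # Z affects X
Y <- rnorm(N, mean = 1 * X + 3 * Z, sd = 1) # true effect of X on Y is 1

coef(lm(Y ~ X))["X"]      # confounded: noticeably larger than 1
coef(lm(Y ~ X + Z))["X"]  # adjusted: close to the true value 1

# Within each stratum of Z the slope is also close to 1
coef(lm(Y ~ X, subset = Z == 0))["X"]
coef(lm(Y ~ X, subset = Z == 1))["X"]
```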

Controlling for Z in Linear Models

Directed Acyclic Graphs (DAGs)


  • A heuristic representation of a causal model.
  • Letters represent random variables.
  • Arrows denote direct causal effects.
  • Specifies which variables influence (or “listen” to) others.

Generic DAG 2

Three main sources of association


The building blocks of any causal model:
1) Common cause (or a Fork).
2) Mediator (or a Pipe).
3) Common effect (or a Collider).

The Fork

The Fork DAG

  • \(X\) and \(Y\) share a common cause \(Z\).
  • \(Z\) is a confounder (backdoor path)
  • This path is open: \(X\) and \(Y\) are statistically correlated, even though there is no direct causal link.
  • Once we condition on \(Z\), the path is closed and the spurious association is eliminated.
  • Drug ← Gender → Health outcomes.
Code
library(ggplot2) # for the plots below

set.seed(200) # for reproducibility

N <- 300 # number of cases
# Z is independent
Z <- rbinom(N, size=1, prob=0.5) 
# X and Y depend on Z
X <- rnorm(N, mean=2*Z-1, sd=1) 
Y <- rnorm(N,mean=2*Z-1, sd=1) 

d <- data.frame(X, Z, Y)

d$Z <- factor(d$Z) # make a categorical variable

p1 <- ggplot(d, aes(x = X, y = Y)) +
  geom_point(size = 3) +  
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 3) +  
  geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "#d73027", linewidth = 2) +
  geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "#91bfdb", linewidth = 2) +
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  scale_color_manual(values = c("#91bfdb", "#d73027") ) +  
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

The Pipe

The Pipe DAG

  • Chain or mediator.
  • \(X\) causally affects \(Y\) through \(Z\).
  • The path is open.
  • If we condition on \(Z\), we block the flow of association.
  • Drug → Blood pressure → Health outcomes.
Code
library(ggplot2) # for the plots below

set.seed(20) # for reproducibility

N <- 300 # number of cases
# X is independent
X <- rnorm(N, mean=0, sd=1) 
# Z depends on X
Z <- rbinom(N, size=1,
            prob=plogis(q=X, location=0, scale=1) )
# Y depends on Z
Y <- rnorm(N,mean=2*Z-1, sd=1) 

d <- data.frame(X, Z, Y)
d$Z <- factor(d$Z) # make a categorical variable

p1 <- ggplot(d, aes(x = X, y = Y)) +
  geom_point(size = 3) +  
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 3) +  
  geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "#d73027", linewidth = 2) +
  geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "#91bfdb", linewidth = 2) +
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  scale_color_manual(values = c("#91bfdb", "#d73027") ) +  
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

The Collider

The Collider DAG

  • \(X\) and \(Y\) are not associated (share no causes), but both influence \(Z\).
  • Closed path. A collider does not induce any association between the causes if left alone.
  • Once we condition on \(Z\), \(X\) and \(Y\) become non-causally associated.
  • Food quality → Restaurant survival ← Location.
Code
library(ggplot2) # for the plots below

set.seed(1983) # for reproducibility

N <- 300 # number of cases
# X and Y are independent
X <- rnorm(N, mean=0, sd=1) 
Y <- rnorm(N, mean=0, sd=1) 
# Z depends on X and Y
Z <- rbinom(N,size=1,
  prob=plogis(q=2*X+2*Y-2, location=0, scale=1))

d <- data.frame(X, Z, Y)

d$Z <- factor(d$Z) # make a categorical variable

p1 <- ggplot(d, aes(x = X, y = Y)) +
  geom_point(size = 3) +  
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

p2 <- ggplot(d, aes(x = X, y = Y, color = Z)) +
  geom_point(size = 3) +  
  geom_smooth(data = d[d$Z == 1, ], method = "lm", se = FALSE, color = "#d73027", linewidth = 2) +
  geom_smooth(data = d[d$Z == 0, ], method = "lm", se = FALSE, color = "#91bfdb", linewidth = 2) +
  geom_smooth(data = d, method = "lm", se = FALSE, color = "black", linewidth = 2) +
  scale_color_manual(values = c("#91bfdb", "#d73027") ) +  
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 30),  
    axis.title.y = element_text(size = 30),
    legend.title = element_text(size = 30),
    legend.text = element_text(size = 26))

The Descendant


Controlling for a descendant of a variable is equivalent to “partially” controlling for that variable.

For example, Daily Exercise → Physical Fitness → Mental Health.
Body Mass Index (BMI) is a descendant of Physical Fitness.


A descendant example DAG with four variables (X, Y, Z, A), where X points to Z, and Z points to both Y and its descendant A.


Path                                     Type                    Not conditioning on Z   Conditioning on Z
Common Causes (Confounders): X ← Z → Y   Non-Causal (Spurious)   Open                    Closed
Mediators: X → Z → Y                     Causal                  Open                    Closed
Common Effects (Colliders): X → Z ← Y    Non-Causal (Spurious)   Closed                  Open

Controlling for the effects of a variable (Descendant) is equivalent to partially controlling for that variable
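A small R sketch of this “partial control” (the data-generating process and coefficients are illustrative): adjusting for a noisy descendant `A` of the confounder `Z` removes only part of the bias.

```r
set.seed(3)
N <- 1e5
Z <- rnorm(N)                   # confounder
A <- Z + rnorm(N)               # noisy descendant of Z
X <- rnorm(N, mean = Z)
Y <- rnorm(N, mean = 1 * X + Z) # true effect of X on Y is 1

coef(lm(Y ~ X))["X"]      # fully confounded
coef(lm(Y ~ X + A))["X"]  # partially deconfounded (between the two)
coef(lm(Y ~ X + Z))["X"]  # fully deconfounded, close to 1
```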

Backdoor criterion

  • Assume a DAG.
  • Define exposure (\(X\)) and outcome (\(Y\)) based on research question.
  • Causal paths – paths consisting of mediators.
  • All other paths are non-causal.


  • To estimate the total causal effect of \(X\) on \(Y\):
  1. Identify all paths between \(X\) and \(Y\): causal and spurious (backdoor).
  2. Do not adjust for any variables on the causal paths from \(X\) to \(Y\).
  3. Block all backdoor paths between \(X\) and \(Y\) by conditioning on appropriate variables.

Average Causal Effect (ACE)


The target of our analysis is the Average Causal Effect (ACE).

The ACE is the expected increase in \(Y\) in response to a unit increase in \(X\) due to an intervention.


\[ACE(x) = E [ Y | do(x+1)] - E [ Y | do(x) ]\]

This formula compares the average outcomes under two interventions that differ only in the value of \(X\).
Their difference is the causal effect of increasing \(X\) by one unit.
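The \(do()\) expectations can be approximated by simulation: hold the rest of the data-generating process fixed and set \(X\) by intervention. A sketch assuming an illustrative linear structural model Y = 2X + 3Z + noise, so the true ACE is 2:

```r
set.seed(4)
n <- 1e5
Z <- rnorm(n)  # background variable, unaffected by the intervention

# Generate Y under the intervention do(X = x)
simulate_Y <- function(x) rnorm(n, mean = 2 * x + 3 * Z)

x0 <- 1
mean(simulate_Y(x0 + 1)) - mean(simulate_Y(x0))  # approximates the ACE of 2
```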

Models 1, 2, 3: Good Controls

Model 1

\(Z\) is a confounder, a fork (a common cause).

An open backdoor path, \(X\) ← \(Z\) → \(Y\), creates a spurious association between \(X\) and \(Y\).

To block the backdoor path, we control for \(Z\), yielding an unbiased estimate of the ACE.

Model 2

Model 3

Models 2 and 3: \(Z\) is not a traditional confounder.

The unobserved \(U\) is a confounder; \(Z\) is not a common cause of \(X\) and \(Y\).

However, controlling for \(Z\) blocks the backdoor path through \(U\) and produces an unbiased estimate of the ACE.

Models 4, 5, 6: Good Controls

Model 4

Model 5

Model 6
  • In Model 4, \(Z\) is a common cause of \(X\) and the mediator \(M\).
  • In Models 5 and 6, \(U\) is also a common cause of \(X\), \(Z\), and/or \(M\).
  • Any variable that causes both \(X\) and a mediator confounds the total effect of \(X\) on \(Y\).
  • Therefore, controlling for \(Z\) helps block these backdoor paths and yields an unbiased estimate of the causal effect of \(X\) on \(Y\).

Model 7: Bad controls

M-Bias

The M-Bias

Model 7



\(Z\) is a pretreatment variable.

Controlling for \(Z\) induces bias by opening a backdoor path: the previously unbiased ACE (without \(Z\) in the model) becomes biased.

See a detailed example in the Introduction to DAGs for Causal Inference in R

Neutral controls

  • Neutral in terms of bias (not on any causal path)

  • The minimal adjustment set is not always optimal

  • Can improve efficiency (increase precision)

Model 8: Neutral Control

Possibly good for precision

Model 8
  • \(Z\) is not a confounder.
  • Estimating the effect of \(X\) on \(Y\) requires variation in both.
  • The goal is to isolate \(Y\)’s variation that can be attributed to changes in \(X\).
  • Reducing unexplained variation in \(Y\) that arises from factors other than \(X\) improves precision.
  • If \(Z\) is a cause of \(Y\) but not a cause of \(X\), adjusting for \(Z\) removes variation in \(Y\) that is due to \(Z\), thereby reducing the residual variance and improving the precision of the estimated effect of \(X\) on \(Y\).
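A minimal R sketch of this precision gain (coefficients and sample size are illustrative): `Z` causes `Y` but not `X`, and adjusting for it shrinks the standard error of the estimated effect of `X`.

```r
set.seed(5)
n <- 1e4
Z <- rnorm(n)                       # causes Y but not X
X <- rnorm(n)
Y <- rnorm(n, mean = 1 * X + 3 * Z) # true effect of X is 1

se_X <- function(m) summary(m)$coefficients["X", "Std. Error"]
se_X(lm(Y ~ X))      # larger standard error
se_X(lm(Y ~ X + Z))  # smaller: residual variance in Y is reduced
```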

Model 9: Neutral Control

Possibly bad for precision

Model 9
  • \(Z\) is a cause of \(X\), not \(Y\), and does not open or close back-door paths between \(X\) and \(Y\).
  • Controlling for \(Z\) removes part of the natural variation in \(X\), reducing information to estimate its effect on \(Y\).
  • Unlike predictors of \(Y\), which reduce unexplained outcome variance ( Model 8 ), predictors of \(X\) primarily reduce the signal ( see here ).
  • As a result, the standard error of the estimated effect of \(X\) on \(Y\) typically increases, lowering precision.
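The mirror-image sketch in R (again with illustrative coefficients): here `Z` causes `X` but not `Y`, and adjusting for it inflates the standard error without changing the estimand.

```r
set.seed(6)
n <- 1e4
Z <- rnorm(n)
X <- rnorm(n, mean = 2 * Z)  # Z causes X only
Y <- rnorm(n, mean = 1 * X)  # Z reaches Y only through X; true effect is 1

se_X <- function(m) summary(m)$coefficients["X", "Std. Error"]
se_X(lm(Y ~ X))      # smaller standard error
se_X(lm(Y ~ X + Z))  # larger: adjusting removes useful variation in X
```

Both regressions estimate the same (unbiased) effect; only the precision differs.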

General rule of thumb

  • Causes of \(Y\) which do not spoil identification are beneficial

  • Causes of \(X\) which are not necessary for identification are harmful for the asymptotic variance of the estimator

Model 10: Bad Controls

Bias Amplification

Model 10
  • \(Z\) is a “pre-treatment” variable.
  • Bias amplification: controlling for \(Z\) will not only fail to deconfound but, in linear models, will amplify any existing bias (since \(U\) is unobserved).
  • Stratifying by \(Z\) increases bias and is inefficient.


  • Covariation in \(X\) and \(Y\) requires variation in their causes.
  • Stratifying by \(Z\) reduces \(X\)’s variation ( Model 9 ).
  • The confounder \(U\) becomes relatively more important within each level of \(Z\), amplifying the bias.

Simulation

n <- 1e4 # population size
Z <- rnorm(n) #exogenous
U <- rnorm(n) #exogenous

b_Z <- 2 #slope of Z for X
b_U <- 1 #slope for U for X
error_x <- rnorm(n) #random noise

# X is a function of Z, U and random noise
X <- b_Z*Z + b_U*U + error_x

b_X <- 1 #slope for X for Y
g_U <- 2 # slope for U for Y
error_Y <- rnorm(n) # random noise

# Y is a function of X, U and random noise
Y <- b_X*X + g_U*U + error_Y

# data frame (U is not observed, but we have it in the simulated data)
d <- data.frame(X, Z, Y, U)


# Biased estimate of X 
# Confounded by U
m1 <- lm(Y ~ X, data=d)
coef(m1)
 (Intercept)            X 
-0.005624466  1.333955337 


# Biased estimate of X 
# Confounded by U and amplified when controlling for Z
m2 <- lm(Y ~ X + Z, data=d)
coef(m2)
(Intercept)           X           Z 
 0.01042681  2.01180858 -2.00779634 


What is going on?

We assumed the following system:

\[X = \beta_Z Z + \beta_U U + \epsilon_X\] \[ Y = \beta_X X + \gamma_U U + \epsilon_Y\]

  1. Variance in \(X\) changes when we control for \(Z\)

Total variation in \(X\):

\[ Var(X) = \beta_Z^2 Var(Z) + \beta_U^2 Var(U) + Var(\epsilon_X)\]

# The Variance of X in the model Y ~ X
Var_X <- b_Z^2 * var(d$Z) + b_U^2 * var(d$U) + var(error_x)
Var_X
[1] 6.181868

After conditioning on \(Z\) (i.e., removing its contribution, see here ):

\[ Var(X|Z) = \beta_U^2 Var(U) + Var(\epsilon_X)\]

# The Variance of X when controlling for Z, Y ~ X + Z
Var_X_Z <-  b_U^2 * var(d$U) + var(error_x)
Var_X_Z
[1] 1.994982


  2. How this affects bias

\[ \text{Bias}_{\beta_X} = \frac{\text{Cov}(X, U)}{\text{Var}(X)} \, \gamma_{U}\]

Bias_betaX <- cov(d$X, d$U) / Var_X * g_U
Bias_betaX
[1] 0.3337669


\[ \text{Bias}_{\beta_{X|Z}} = \frac{\text{Cov}(X, U)}{\text{Var}(X|Z)} \, \gamma_{U}\]

Bias_betaX_Z <- cov(d$X, d$U) / Var_X_Z * g_U
Bias_betaX_Z
[1] 1.034247
  • Bias increases purely because we made \(X\) vary less overall, even though \(U\) didn’t change.

  • With less total variation in \(X\), \(U\)’s influence makes up a larger share, so the bias from \(U\) gets amplified.

Model 10 example

Bias Amplification: Motivation and Training

  • Motivation affects both training and job performance.
    • Motivated employees tend to complete more training and perform better.
    • Because motivation is unmeasured, it is a confounder.
  • Department affects training.
    • Engineering may require more mandatory training hours than Sales.
    • But department does not directly affect performance except through training.

  • When you don’t control for department, there’s a lot of natural variation in training (since departments differ).
    • That makes it easier to see the relationship between training and performance, and the bias from motivation is somewhat “diluted.”
  • When you do control for department, you’re only comparing employees within the same department ( see here ).
    • Within each department, training hours vary less — mostly due to motivation differences.
    • So motivation’s influence on training is relatively stronger, making the bias from motivation larger.

Models 11, 12: Bad Controls

Model 11

Model 12
  • To estimate the ACE, we want to avoid blocking any causal pathways from X to Y.

  • In Model 11, conditioning on Z blocks the entire causal pathway (X → Z → Y), producing classic overcontrol bias.

  • In Model 12, conditioning on Z blocks only part of the effect, because Z is a descendant of M, which lies on the causal path (X → M → Y).

  • The same logic applies even if there were an additional direct path X → Y.
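A minimal R sketch of Model 11’s overcontrol bias (the coefficients are illustrative): conditioning on the mediator `Z` blocks the entire causal path, so the estimated effect of `X` collapses toward zero.

```r
set.seed(7)
n <- 1e4
X <- rnorm(n)
Z <- rnorm(n, mean = 2 * X)  # mediator: X -> Z
Y <- rnorm(n, mean = 3 * Z)  # Z -> Y; total effect of X on Y is 2 * 3 = 6

coef(lm(Y ~ X))["X"]      # close to 6, the total effect
coef(lm(Y ~ X + Z))["X"]  # close to 0: the causal path is blocked
```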

Model 13: Neutral Control

Possibly good for precision

  • \(Z\) is a cause, not an effect, of the mediator \(M\), and consequently an indirect cause of \(Y\) (cf. Model 8)

  • Controlling for \(Z\) will be either neutral or may increase the precision of the ACE estimate in finite samples

Some intuition of what is going on:

  • \(Z\) affects \(Y\) indirectly through \(M\). \(Y\) inherits some of \(Z\)’s influence, adding extra noise.

  • Adjusting for \(Z\) removes some of the variation in \(Y\) that is due to \(Z\). As a result, \(Y\) becomes less noisy, there’s less unexplained variance.

  • Precision of an estimate (SE of \(X\)) depends on how much unexplained variance exists in the model.

Models 14, 15: Neutral Controls

Possibly helpful in case of selection bias

Model 14

Model 15
  • Controlling for Z in Model 14 reduces variation in the treatment X and so may hurt the precision of the ACE estimate in finite samples


  • Not all “post-treatment” variables are bad controls, e.g., Model 15 (a case of selection bias)

Model 15


Suppose we have only samples with \(W = w\) (a case of selection bias)

  1. Implicitly conditioning on a collider. (see here)
  • Creates a spurious association between \(X\) and \(U\) in the sample.
  • The relationship between \(X\) and \(Y\) becomes statistically confounded by \(U\).
  • \(U\) behaves like a confounder in the data because it explains joint variation in both \(X\) and \(Y\).
  • The sample’s causal structure is no longer the same as in the population.
  2. Deconfounding.
  • Block this open path by conditioning on \(Z\).
  • Conditioning on \(Z\) d-separates \(X\) from \(U\) within the sample.
  • The conditional independence structure among \(X\), \(U\), and \(Y\) is restored to what was in the population DAG.
  • The causal effect of \(X\) on \(Y\) that we now estimate in the restricted sample reflects the same average causal effect (ACE) found in the full population.

Model 15 simulation

Imagine a study examining the effect of academic performance in high school on future earnings.

Academic performance may increase future earnings.
Students with high academic performance are also more likely to have strong recommendation letters.
Strong recommendation letters help students get into prestigious colleges.
Family wealth also influences college admissions.
Family wealth directly affects future earnings.


Code
set.seed(50)  

# Simulate data
n <- 1e4  
performance <- rnorm(n) # Academic performance
wealth <- rnorm(n) # Family wealth 

# Binary Recommendation Letter (Z)
letter <- rbinom(n, 1, prob = plogis(2*performance))  

# Binary College Admission (W)
college <- rbinom(n, 1, prob = plogis(3*letter + 3*wealth))  

# Future Earnings (Y) depends on X and U
earnings <- 1*performance + 5*wealth + rnorm(n) 

# Create a dataset
data <- data.frame(performance, wealth, letter, college, earnings)

Selection bias: we only collect data from those students who attend prestigious universities.

This is effectively the same as conditioning on W (university prestige) in a model, and it therefore opens a spurious path by conditioning on a collider.

Code
# Selection bias: Only keep students who got into a prestigious college (college = 1)
subd <- data[data$college == 1, ]

Biased estimate due to the collider.

Code
# Model 1: regression without adjusting for letters of rec (biased due to collider bias)
m1 <- lm(earnings ~ performance, data = subd)
summary(m1)

Call:
lm(formula = earnings ~ performance, data = subd)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.4359  -2.8610  -0.1702   2.7207  15.6359 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.26560    0.05109  44.343   <2e-16 ***
performance  0.44858    0.05139   8.728   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.134 on 6655 degrees of freedom
Multiple R-squared:  0.01132,   Adjusted R-squared:  0.01117 
F-statistic: 76.19 on 1 and 6655 DF,  p-value: < 2.2e-16

Adjusting for Z blocks the spurious path.

Code
# Model 2: Adjusting for letter of rec (blocks spurious path through W)
m2 <- lm(earnings ~ performance + letter, data = subd)
summary(m2)

Call:
lm(formula = earnings ~ performance + letter, data = subd)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.370  -2.817  -0.221   2.655  16.040 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.37026    0.08861   38.03   <2e-16 ***
performance  1.01213    0.06277   16.12   <2e-16 ***
letter      -1.92218    0.12701  -15.13   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.064 on 6654 degrees of freedom
Multiple R-squared:  0.04422,   Adjusted R-squared:  0.04393 
F-statistic: 153.9 on 2 and 6654 DF,  p-value: < 2.2e-16

What is going on?

1. Two routes to prestigious universities.
- Students are admitted (black dots) either through family wealth or academic performance (via letters of recommendation).
- This means some really wealthy but low-performing students get in, and some very high-performing but less wealthy students get in.
- Among admitted students (black dots), wealth and performance become negatively related — unlike in the general population, where they’re unrelated.

2. The relationship between performance and earnings is now distorted.
- In the general population, better academic performance leads to higher future earnings.
- Within prestigious universities (black dots), nearly all students do well later in life because they are either smart or wealthy.
- This makes the observed effect of academic performance on earnings appear weaker than it really is.

3. Controlling for letters of recommendation corrects the distortion.
- The bias arises because selection on admission ties wealth and performance together.
- When we compare students with the same letters of recommendation (middle plot), that link disappears.
- Within each letter level, wealth varies as in the general population, because wealth does not affect letters of recommendation.
- Since academic performance is a parent of letters of recommendation, wealth and academic performance are also no longer associated. The sample then behaves like the general population.

Models 16, 17: Bad Controls

Selection bias

Model 16

Model 17

Controlling for Z is no longer harmless: it is a collider, and controlling for it opens otherwise closed path(s)

Model 16 example

The birth-weight paradox (Hernández-Díaz et al. 2006)

  • Infants born to smokers have higher risk of mortality.
  • However, among infants with low birth weight (LBW), infant mortality is lower for smokers than non-smokers.
  • Competing theories for this paradox:
    • Is maternal smoking somehow protective among LBW infants?
    • Or is it a collider stratification bias?


What is going on?

  • Birth weight data is readily available and often used as a control variable.
  • Hernández-Díaz et al. (2006) use DAGs to represent causal structure.

  • Maternal smoking and birth defects both cause LBW.
  • Consequently, all LBW infants must have either been exposed to tobacco or had a birth defect (or both).
  • All LBW infants of non-smokers would necessarily have birth defects, which contribute more strongly to mortality than maternal smoking.
  • Conditioning on birth weight (a collider) induces an inverse association between smoking and birth defects:
    • If the mother smokes, the infant likely has fewer or milder defects.
    • If the mother does not smoke, the infant must have more severe defects (to explain the LBW).
  • Result: within the LBW group, infants of non-smokers have more severe defects and therefore higher mortality than those of smokers.

Conclusion:

Smoking does not protect against mortality. The apparent protective effect among LBW infants arises from selection bias due to conditioning on a collider (birth weight), which induces a spurious inverse relationship between maternal smoking and unmeasured birth defects. As a result, LBW infants of non-smoking mothers tend to have the most severe birth defects, leading to higher mortality in that subgroup.

Model 18: Bad Control

Case-control bias

\(Z\) is a descendant of \(Y\), which acts as a “virtual collider”: once we condition on its descendant \(Z\), we might accidentally open a non-causal path


Let’s add some unobserved exogenous variables


  • “Case-control” bias

  • Selecting participants based on an outcome can distort causal relationships

  • The Antebellum puzzle (Schneider, 2020)

  • During the 19th century, in Britain and the USA, the average height of adult men fell even though economic conditions and childhood nutrition in these countries improved

  • Possible explanation: selection bias (case-control); the data came from individuals enlisted in the military or held in prisons, effectively conditioning on colliders

Example: Military records

CN = childhood nutrition

E = enlisted in military

H = height

  • The causal path from height → enlisted represents the fact that taller men may have better opportunities in the civilian market, and thus shorter men were more likely to enlist

  • Restricting analysis to those enlisted in the military is therefore equivalent to controlling for enlisted and leads to selection bias

Model 18: when X has no causal effect on Y

  • An exception: when X has no causal effect on \(Y\), \(X\) is still d-separated from \(Y\) even after conditioning on \(Z\). Thus, adjusting for \(Z\) is valid for testing whether the effect of \(X\) on \(Y\) is zero.

Model 18 X causes Y

Model 18 X has no causal effect on Y

General Rules

The status of a single control as “good” or “bad” may change depending on the context of the variables under consideration

A set of control variables Z will be “good” if:

  • It blocks all non-causal paths from the treatment to the outcome

  • It leaves any mediating paths from the treatment to the outcome “untouched” (since we are interested in the total effect)

  • It does not open new spurious paths between the treatment and the outcome (e.g., due to colliders).

For efficiency:

  • Give preference to variables “closer” to the outcome rather than those closer to the treatment, so long as this does not spoil identification
