Of Coffee and Cancer


  • People who drink coffee have a substantially higher chance of developing pancreatic cancer compared to those who don’t.

  • Should you stop drinking coffee?

  • Would changing our coffee-drinking habits change our chances of getting pancreatic cancer?

Photo of an article

Of Coffee and Cigarettes

Photo of an article
Coffee and Cigarettes, Jim Jarmusch (2003)

  • Coffee is not likely to cause pancreatic cancer.

  • A confounder: Smoking.

  • People who smoke are more likely to drink a lot of coffee.

  • Smoking is a well-established risk factor for pancreatic cancer.

  • Once later studies accounted for smoking, the apparent link between coffee and pancreatic cancer disappeared.
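
The pattern described above can be reproduced in a small simulation. This is only a sketch: the probabilities are hypothetical numbers chosen for illustration, not estimates from any study, and coffee is given no effect on cancer by construction.

```python
import random

random.seed(0)

def simulate_person():
    # All probabilities below are made up for illustration.
    smoker = random.random() < 0.3
    # Smoking makes heavy coffee drinking more likely.
    coffee = random.random() < (0.8 if smoker else 0.4)
    # Cancer risk depends on smoking only; coffee plays no role.
    cancer = random.random() < (0.05 if smoker else 0.01)
    return smoker, coffee, cancer

people = [simulate_person() for _ in range(200_000)]

def risk(rows):
    """Share of people with cancer among the given rows."""
    rows = list(rows)
    return sum(cancer for _, _, cancer in rows) / len(rows)

# Crude comparison: coffee drinkers vs. non-drinkers, ignoring smoking.
crude_diff = risk(p for p in people if p[1]) - risk(p for p in people if not p[1])

# Same comparison within each smoking stratum ("accounting for smoking").
stratum_diffs = {}
for is_smoker in (False, True):
    stratum = [p for p in people if p[0] == is_smoker]
    stratum_diffs[is_smoker] = (
        risk(p for p in stratum if p[1]) - risk(p for p in stratum if not p[1])
    )

print("crude risk difference:", round(crude_diff, 4))
print("risk difference by smoking status:", stratum_diffs)
```

The crude difference is clearly positive, while the differences within smokers and within non-smokers hover around zero, which is the sense in which the apparent link "disappears" once smoking is accounted for.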

Was the initial finding wrong?

Image from Jim Jarmusch's movie 'Coffee and Cigarettes'
Coffee and Cigarettes, Jim Jarmusch (2003)

  • The study answered an associational question: “Are coffee drinkers more likely to have pancreatic cancer?”
  • Different from a causal question: “If I stop drinking coffee, will my risk of pancreatic cancer decrease?”
  • The associations in the data alone cannot answer causal questions.

Classical Statistics

  • Focuses on associations rather than causation.
  • Only summarizes data.
  • No machine can derive explanations from raw data.


  • The research questions are often about causal relationships.
  • Some language implies causality (e.g., dependent and independent variables).

Experiments: the Gold Standard for Causality

  • A well-designed experiment (a randomized controlled trial) lets us intervene directly (e.g., if I do \(X\), what will happen to \(Y\)?) and so establish a cause-and-effect relationship.
  • Secret weapon – randomization:
    • In expectation, any confounder (measured or unmeasured) is equally distributed across treatment and control groups, breaking its ability to bias the results.
    • Eliminates confounding bias by ensuring that treatment and control groups differ only by chance, not systematically.
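
A minimal sketch of why randomization works, using made-up probabilities and a hypothetical treatment that has no effect at all. Because assignment is a coin flip that ignores smoking, the share of smokers ends up nearly identical in both arms, and so does the outcome rate.

```python
import random

random.seed(1)

def run_trial(n=200_000):
    """Simulate a trial; returns outcomes grouped by treatment arm."""
    arms = {True: [], False: []}
    for _ in range(n):
        smoker = random.random() < 0.3    # confounder (hypothetical rate)
        treated = random.random() < 0.5   # coin-flip assignment, ignores smoking
        # Outcome depends on smoking only; the treatment has no effect.
        outcome = random.random() < (0.05 if smoker else 0.01)
        arms[treated].append((smoker, outcome))
    return arms

def share(rows, index):
    """Fraction of rows where the field at `index` is True."""
    return sum(row[index] for row in rows) / len(rows)

arms = run_trial()

# Balance check: difference in the share of smokers between arms.
smoker_gap = share(arms[True], 0) - share(arms[False], 0)
# Effect estimate: difference in outcome rates between arms.
outcome_gap = share(arms[True], 1) - share(arms[False], 1)

print("smoker share gap:", round(smoker_gap, 4))
print("outcome rate gap:", round(outcome_gap, 4))
```

Both gaps are close to zero up to sampling noise. The balance holds for smoking even though the assignment never looked at it, which is why randomization also handles unmeasured confounders.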

Experiments are just a special case of deconfounding

  • Causal inference is about deconfounding, and experiments are just one way to achieve it, not inherently superior.
  • Often experimentation is not possible:
    1. Intervention may be physically impossible.
      E.g., Study of the effects of body weight on heart disease: We cannot randomly assign patients to groups based on higher or lower body weight.
    2. Intervention might be unethical.
      E.g., Study on smoking: We cannot randomly assign people to smoke for 10 years.
  • Observational studies have the advantage of being conducted in the natural habitat of the target population, not in an artificial lab setting.
  • If you can deconfound in another way, you don’t necessarily need an experiment for causal inference.

Causal Inference without Randomization

Photo of Judea Pearl playing a guitar in 1966
Judea Pearl, 1966

  • Is there a statistical procedure that mimics randomization?

  • do-calculus is a formal set of rules, developed by Judea Pearl and colleagues, for determining whether and how causal effects can be identified from observational data. It works by mathematically transforming expressions involving interventions (the do-operator) into purely observational probabilities.

  • It does so via:

    1. Backdoor Criterion ←

    2. Frontdoor Criterion: when some confounders are unmeasured

    3. Beyond do-calculus: Instrumental Variables
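
As a concrete instance of turning an interventional quantity into observational ones, here is the standard backdoor adjustment formula. It is stated under two assumptions: a set of measured variables \(Z\) (a symbol introduced only for this sketch) satisfies the backdoor criterion relative to \(X\) and \(Y\), and \(Z\) is discrete.

\[
P(Y = y \mid do(X = x)) = \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z)
\]

In the coffee example, if smoking were the only confounder, \(Z\) would be smoking status: compute the cancer risk for a given coffee habit separately among smokers and non-smokers, then average those risks using the population shares of smokers and non-smokers.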

The Workshop Plan

  1. DAG basics
  2. Four elemental confounds
  3. Backdoor criterion
  4. Examples
  5. Software

Directed Acyclic Graphs (DAGs)


  • A heuristic representation of a causal model.
  • Specifies which variables influence (or “listen” to) others.
  • No assumptions need to be made about the functional form of the causal relationships or about the distribution of the variables.
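
A DAG can be written down as nothing more than a mapping from each variable to the variables it "listens" to. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the variable names and the helper `is_acyclic` are illustrative choices, not part of the workshop material.

```python
from graphlib import CycleError, TopologicalSorter

# Each variable maps to its direct causes (the variables it "listens" to).
# Note that only the arrows are specified: no functional forms, no distributions.
parents = {
    "smoking": set(),
    "coffee": {"smoking"},
    "cancer": {"smoking"},
}

def is_acyclic(parent_map):
    """Return True if the graph has no directed cycle."""
    try:
        # static_order() raises CycleError when a cycle exists.
        tuple(TopologicalSorter(parent_map).static_order())
        return True
    except CycleError:
        return False

# A causal ordering lists every variable after all of its causes.
causal_order = list(TopologicalSorter(parents).static_order())

print("acyclic:", is_acyclic(parents))
print("one valid causal order:", causal_order)
```

The "acyclic" part of the name is what makes such an ordering possible: a graph in which two variables caused each other would be rejected by `is_acyclic`.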

Generic DAG 2

Basics of DAGs


  • The letters represent random variables (\(X\), \(Y\), \(Z\)).
  • \(U\) typically represents unmeasured variables.
  • Arrows denote direct causal effects, e.g. \(X\) on \(Y\).
  • Analyze the DAG to deduce the appropriate statistical model for the causal effect of \(X\) on \(Y\):
    • Which adjustment (control) variables to include?
    • It is absolutely not safe to add everything.
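
One standard reason why adding every available variable is unsafe is the collider: a variable caused by both \(X\) and \(Y\). The sketch below is a hypothetical simulation (the noise model and the cutoff are arbitrary choices) in which \(X\) and \(Y\) are independent, yet restricting attention to high values of their common effect \(Z\) creates an association between them. Selecting on \(Z\) stands in here for adjusting for it.

```python
import math
import random

random.seed(2)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rows = []
for _ in range(50_000):
    x = random.gauss(0, 1)            # X and Y are generated independently,
    y = random.gauss(0, 1)            # so X has no effect on Y.
    z = x + y + random.gauss(0, 1)    # Z is a common effect (collider) of X and Y.
    rows.append((x, y, z))

# Correlation of X and Y in the full sample.
corr_all = pearson([r[0] for r in rows], [r[1] for r in rows])

# Correlation of X and Y after conditioning on the collider (Z above a cutoff).
selected = [r for r in rows if r[2] > 1.0]
corr_selected = pearson([r[0] for r in selected], [r[1] for r in selected])

print("correlation, full sample:", round(corr_all, 3))
print("correlation, Z > 1 only:", round(corr_selected, 3))
```

In the full sample the correlation is near zero, as it should be. Within the selected subgroup it is clearly negative, even though neither variable influences the other.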

Generic DAG 2