Chapter 18: Formalizing Causal Structure — The Three-Rung Ladder and do-calculus
From data alone, one can never deduce causation. Unless you are willing to admit that certain structures are assumptions, not discoveries.
Chapter 17 left us with an uncomfortable fact: probability theory, however refined, cannot distinguish "
This is not a problem of computational power, nor of data quantity, but a structural limitation of mathematics: observation describes how the world looks when it is still, not how the world looks after it is prodded.
Yet we prod the world constantly. Doctors prescribe medicine, policymakers adjust tax rates, engineers modify parameters. Every intervention asks a question that probability theory cannot answer: "If I change $X$, what happens to $Y$?"
Answering this question requires a new kind of inference rule. The task of this chapter is to formalize the very act of "changing."
18.0 The Causal Wire-Cutting Game: Observation Is Not Intervention
In the probability game, you can only see how the chips move. In the causal game, you finally get to cut the wires.
Imagine a causal circuit board. Each variable is a node, each directed edge is a wire:
This is the game meaning of
"Seeing
The Formal Skeleton of This Game
- State space: a directed acyclic graph $G$, together with a set of structural equations $F$.
- Legal moves: observe certain nodes; or execute the intervention $do(X = x)$, cutting all incoming edges to $X$ and fixing its value.
- Transition rules: observation updates the distribution; intervention replaces structural equations, yielding the mutilated graph (the graph after wire-cutting).
- Victory condition: identify $P(Y \mid do(X = x))$ or counterfactual quantities from the observable distribution and causal graph.
- Failure mode: treating $P(Y \mid X = x)$ as $P(Y \mid do(X = x))$, misreading correlation as causation.
The enjoyable part of this game is that it turns "causation" from a metaphysical debate into a very concrete action: which wire to cut? Which wire to keep? After cutting, how does the signal propagate?
A backdoor path is a wire that sneaks around and comes in from behind; a confounding variable is a hand hidden behind the scenes pulling both nodes simultaneously; d-separation is the judgment of whether, after certain wires are blocked, information can still travel from one side to the other.
The first lesson of causal inference is brutal: seeing is not changing. Seeing the sprinkler on, versus turning the sprinkler on with your own hand, are not the same world. The most common arrogance in statistics is treating observation as intervention. The scissors of causal graphs are made precisely to cut away this arrogance.
18.1 The Three-Rung Ladder
Judea Pearl uses a metaphor to describe the hierarchical structure of causal inference; he calls it the Ladder of Causation. The ladder has three rungs, from low to high, each requiring capabilities that the previous rung lacks.
First Rung: Association
This is the territory of probability theory. The form of the question is:
"Seeing
, what is ?"
In mathematical language:
Second Rung: Intervention
The form of the question becomes:
"If I set $X$ to a certain value, what would $Y$ be?"
The difference from the first rung is fundamental. "Seeing
In passive observation,
This rung requires not only data but also action — or, when action is infeasible, some tool that allows you to simulate action mathematically.
Third Rung: Counterfactual
The question becomes even harder:
"If back then $X$ had not been that value, what would $Y$ have been?"
This is already an inquiry about a single individual in another possible world. "This patient recovered after taking the medicine — if he had not taken the medicine back then, would he still have recovered?" This question cannot be answered directly by any observation or experiment, because the parallel world of "not taking the medicine" is one we can never enter.
These three rungs correspond to genuine boundaries of capability. Pure observational data can only answer first-rung questions. Randomized controlled trials (RCTs) can answer second-rung questions, but at the cost of actually executing interventions. Counterfactual inference requires a complete causal model, plus additional assumptions about "individual mechanisms" — this exceeds the capability of any experiment. Most statisticians spend the better part of their careers working at the first rung, mistakenly believing they are answering second-rung questions. This confusion has produced erroneous inferences across vast swaths of scientific literature.
"Mistakenly believing" is too polite. This is not a cognitive error; it is a tool error. When the only hammer you hold is correlation, every problem looks like a correlation nail. The issue is not the intellect of statisticians, but that standard training has never clearly drawn the boundary between the first and second rungs.
18.2 Graphs: The Geometry of Causation
The first step is to give "causal structure" a precise mathematical representation.
Directed acyclic graphs (DAGs) are the most natural tool. Nodes represent variables, directed edges represent direct causal influence:
A simple example. Consider three variables: season, sprinkler, and grass.
Season affects the sprinkler (more likely on in summer), season also directly affects the grass (rain), and the sprinkler also directly affects the grass.
In this graph,
Probability theory sees the superposition of both, unable to distinguish them. The graph explicitly draws out this structure.
This is the most common beginner's mistake. A causal graph represents domain knowledge, not a product of statistical inference. You cannot derive this graph from
The direction of the arrows is brought in by you, not given by the data — this sentence will make many data scientists uncomfortable, because they are accustomed to "letting the data speak." But the data cannot speak on this matter. Admitting this requires courage; many papers evade this step. The cost of evasion is burying the assumption inside the method, pretending it is the conclusion.
18.3 Structural Causal Models
A graph is only a skeleton. To turn it into a machine capable of inference, content must be attached to every edge. This is the Structural Causal Model (SCM).
An SCM consists of three parts:
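Exogenous variables $U$: noise terms determined outside the model.
Endogenous variables $V$: the variables the model explains.
Structural equations: for each endogenous variable $V_i$,
$$V_i = f_i(\mathrm{pa}_i, U_i)$$
where $\mathrm{pa}_i$ denotes the parents of $V_i$ in the graph and $U_i$ is its exogenous noise term.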
This equation is not a statistical regression equation — it is a mechanism, describing "given the parent nodes and noise, what value does this variable take." This mechanism is stable and local: changing the equations of other variables does not affect this equation. This local stability is the core feature that distinguishes causal models from statistical models.
Returning to the sprinkler example, the structural equations could be:
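One hypothetical way to write such equations down is as a small table of mechanisms, each a function of its parents and its own noise term. The variable names, the probabilities, and the helper `evaluate` below are assumptions made for illustration, not the chapter's own numbers.

```python
import numpy as np

# Hypothetical mechanisms: each endogenous variable is a function of the
# values of its parents (v) and its own exogenous noise u, uniform on [0, 1).
MECHANISMS = {
    "summer": lambda v, u: int(u < 0.5),
    "sprinkler": lambda v, u: int(u < (0.8 if v["summer"] else 0.2)),
    "wet": lambda v, u: int(u < 0.1 + 0.6 * v["sprinkler"] + 0.3 * (1 - v["summer"])),
}
ORDER = ["summer", "sprinkler", "wet"]  # a topological order of the graph


def evaluate(noise):
    """Solve the structural equations for one unit, given its noise values."""
    values = {}
    for name in ORDER:
        values[name] = MECHANISMS[name](values, noise[name])
    return values


rng = np.random.default_rng(0)
noise = {name: rng.random() for name in ORDER}
print(evaluate(noise))
```

Each mechanism reads only its own parents and its own noise, which is the locality property described above: editing one entry of `MECHANISMS` leaves the others untouched.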
18.4 The do Operator: Formalizing Intervention
Now we can precisely define "intervention."
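The operational definition of the intervention $do(X = x)$: replace the structural equation for $X$ with the constant assignment $X = x$, and leave every other equation unchanged.
The effect of this operation on the graph is intuitive: delete all edges pointing into $X$.
This "surgically altered graph" is called the intervention graph, denoted $G_{\overline{X}}$.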
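The graph surgery itself is a small operation. The sketch below assumes the graph is stored as a dictionary from each node to its list of parents; the function name `mutilate` and the node names are illustrative choices.

```python
def mutilate(parents, target):
    """Return a copy of the graph with every edge into `target` removed."""
    return {node: ([] if node == target else list(pa)) for node, pa in parents.items()}


# Graph stored as node -> list of parents (assumed representation).
sprinkler_graph = {
    "season": [],
    "sprinkler": ["season"],
    "grass": ["season", "sprinkler"],
}

after_do = mutilate(sprinkler_graph, "sprinkler")
print(after_do)
```

Only the parent list of the intervened node changes; the edge from the sprinkler to the grass survives, so the effect of the forced value still propagates downstream.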
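The do Operator vs. Conditioning: Why are $P(Y \mid X = x)$ and $P(Y \mid do(X = x))$ different?
Conditioning, $P(Y \mid X = x)$: restrict attention to the cases in which $X = x$ was observed; whatever caused $X$ to take that value is left in place.
Intervention, $P(Y \mid do(X = x))$: force $X = x$ in every case, overriding whatever would otherwise have determined $X$.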
Concrete example:
- $P(\text{recovery} \mid \text{medication})$: among those who chose to take medication, what is the recovery rate (possibly inflated, because patients with milder symptoms are more likely to choose medication)
- $P(\text{recovery} \mid do(\text{medication}))$: if I randomly force a group of people to take medication (randomized controlled trial), what is the recovery rate (true drug efficacy)
The do operator = the mathematical expression of a randomized controlled experiment. When RCTs are infeasible, do-calculus provides rules for estimating intervention effects from observational data.
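The probability distribution after intervention is denoted $P(Y \mid do(X = x))$.
Compare it with conditional probability: $P(Y \mid X = x)$.
The former is "the distribution of $Y$ after $X$ has been set to $x$"; the latter is "the distribution of $Y$ among the cases where $X = x$ was observed."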
The two can differ dramatically. In the sprinkler example,
The core operation of a randomized controlled trial (RCT) is, mathematically, precisely $do(X = x)$: random assignment cuts every incoming edge to the treatment variable.
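The gap between seeing and doing can be checked by simulation. The sketch below assumes a toy sprinkler model with made-up probabilities (summer makes the sprinkler more likely and rain less likely); the function `simulate` and all numbers are illustrative assumptions.

```python
import numpy as np


def simulate(n, rng, do_sprinkler=None):
    """Sample the toy model; if do_sprinkler is given, force the sprinkler."""
    summer = rng.random(n) < 0.5
    if do_sprinkler is None:
        # Observational regime: the sprinkler listens to the season.
        sprinkler = rng.random(n) < np.where(summer, 0.8, 0.2)
    else:
        # Interventional regime: the incoming edge from season is cut.
        sprinkler = np.full(n, bool(do_sprinkler))
    p_wet = 0.1 + 0.6 * sprinkler + 0.3 * ~summer
    wet = rng.random(n) < p_wet
    return summer, sprinkler, wet


rng = np.random.default_rng(0)
n = 200_000

_, s_obs, w_obs = simulate(n, rng)
p_seeing = w_obs[s_obs].mean()  # estimate of P(wet | sprinkler = on)

_, _, w_do = simulate(n, rng, do_sprinkler=1)
p_doing = w_do.mean()  # estimate of P(wet | do(sprinkler = on))

print(f"seeing: {p_seeing:.3f}  doing: {p_doing:.3f}")
```

Under these assumed numbers the exact values are 0.76 for seeing and 0.85 for doing. Conditioning on "sprinkler on" over-samples summer days, which are drier in this model, so the observed rate understates what forcing the sprinkler on would achieve.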
Runnable do Operator: CocDo Implementation
The do operator is not merely a mathematical symbol — it can be precisely implemented as term substitution plus β-reduction in λ-calculus.
CocDo encodes each causal variable as a node in COC type theory, and each edge
do(X = v) is implemented in only two steps:
# 1. Replace variable X with constant v (severing all incoming edges)
intervened = subst(mechanism, var="X", replacement=Const("X", v))
# 2. β-reduction: propagate effects along topological order
result = beta_reduce(intervened)

subst is capture-avoiding substitution; beta_reduce is call-by-value reduction to a fixed point. When both operands of Add/Mul are Const with values, the reducer directly computes tensor operations:

App(App(Mul, Const(w)), Const(v)) → Const(w · v)

This means the entire propagation process of the structural equation
Correspondence with Pearl's definition:
| Pearl's do operator | CocDo implementation |
|---|---|
| Replace the structural equation of $X$ with $X = v$ | subst(mechanism, "X", Const("X", v)) |
| Delete all edges pointing into $X$ | After substitution, the parent node terms of $X$ no longer appear |
| Propagate effects along descendants | beta_reduce reduces along topological order |
| Cyclic graphs are illegal | Pi types require TypeError |
from cocdo import NeuralSCM
import numpy as np
# Three-node graph: ad_spend → clicks → revenue
A = np.array([[0, 0.9, 0.8],
[0, 0, 0.7],
[0, 0, 0]])
E = np.random.randn(3, 16)
scm = NeuralSCM.from_embeddings(["ad_spend", "clicks", "revenue"], A, E)
# do(ad_spend = 3.0): sever ad_spend's incoming edges, propagate effects
state, E_next = scm.step({"ad_spend": 3.0})
print(state) # {"ad_spend": 3.0, "clicks": ..., "revenue": ...}

18.5 The Backdoor Criterion: The Geometry of Confounding
The core question of do-calculus is: under what conditions can causal effects be estimated from observational data, that is, when can $P(Y \mid do(X = x))$ be rewritten as an expression free of the $do$ operator?
Answering this question requires understanding the geometric structure of "confounding."
In a causal graph, the total effect of
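The Backdoor Criterion gives a precise condition: a set of variables $Z$ satisfies the backdoor criterion relative to $(X, Y)$ if:
- No node in $Z$ is a descendant of $X$;
- $Z$ blocks all backdoor paths from $X$ to $Y$, that is, all paths that begin with an arrow pointing into $X$.
If such a $Z$ exists, then:
$$P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)$$
This formula is called the adjustment formula. Its meaning is: for each value of $Z$, compute the conditional distribution of $Y$ given $X = x$ within that stratum, then average the strata using the marginal weights $P(Z = z)$.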
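In the sprinkler example, season blocks the only backdoor path between the sprinkler and the grass, so adjusting for it gives:
$$P(\text{grass} \mid do(\text{sprinkler})) = \sum_{s} P(\text{grass} \mid \text{sprinkler}, \text{season} = s)\, P(\text{season} = s)$$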
This quantity is entirely determined by observational data; no actual manipulation of the sprinkler is needed.
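As a rough numerical check, the adjustment can be computed from purely observational samples. The data-generating probabilities and the helper `backdoor_adjust` below are assumptions for illustration; under them the interventional value is 0.85 by construction.

```python
import numpy as np

# Observational data from an assumed toy model: season confounds
# the sprinkler and the grass.
rng = np.random.default_rng(1)
n = 200_000
summer = rng.random(n) < 0.5
sprinkler = rng.random(n) < np.where(summer, 0.8, 0.2)
wet = rng.random(n) < 0.1 + 0.6 * sprinkler + 0.3 * ~summer


def backdoor_adjust(y, x, z, x_value):
    """Estimate P(y = 1 | do(x = x_value)) by adjusting for a discrete z."""
    total = 0.0
    for z_value in np.unique(z):
        stratum = z == z_value
        p_y = y[stratum & (x == x_value)].mean()  # P(y = 1 | x, z)
        total += p_y * stratum.mean()             # weighted by P(z)
    return total


naive = wet[sprinkler].mean()                           # P(wet | sprinkler on)
adjusted = backdoor_adjust(wet, sprinkler, summer, True)
print(f"naive: {naive:.3f}  adjusted: {adjusted:.3f}")
```

The naive conditional estimate stays near 0.76, while the stratified estimate recovers a value near 0.85 without anyone touching the sprinkler.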
18.6 The Three Rules of do-calculus
The backdoor criterion covers a large number of practical situations, but not all. In some causal graphs, backdoor paths cannot be fully blocked by any set of observable variables — for instance, when unobservable confounding factors exist.
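To handle more general cases, Pearl proposed do-calculus: three inference rules concerning the $do$ operator.
Let $X$, $Y$, $Z$, $W$ be disjoint sets of nodes in the causal graph $G$. Write $G_{\overline{X}}$ for the graph with all edges into $X$ deleted, and $G_{\underline{X}}$ for the graph with all edges out of $X$ deleted.
Rule 1 (Insertion/deletion of observations):
$$P(y \mid do(x), z, w) = P(y \mid do(x), w)$$
if $(Y \perp Z \mid X, W)$ holds in $G_{\overline{X}}$.
Meaning: if in the intervention graph, $Z$ carries no information about $Y$ once $X$ and $W$ are given, then the observation of $Z$ can be inserted or deleted freely.
Rule 2 (Action/observation exchange):
$$P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)$$
if $(Y \perp Z \mid X, W)$ holds in $G_{\overline{X}\underline{Z}}$.
Meaning: under certain conditions, intervening on $Z$ is equivalent to observing $Z$.
Rule 3 (Insertion/deletion of interventions):
$$P(y \mid do(x), do(z), w) = P(y \mid do(x), w)$$
if $(Y \perp Z \mid X, W)$ holds in $G_{\overline{X}\,\overline{Z(W)}}$, where $Z(W)$ is the set of nodes in $Z$ that are not ancestors of any node in $W$ in $G_{\overline{X}}$.
Meaning: under certain conditions, the intervention on $Z$ has no effect on $Y$ and can be deleted.
Pearl and Shpitser proved: for any causal graph, if a causal effect is identifiable, then it can be derived from the observational distribution by a finite sequence of applications of these three rules.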
The form of these three rules is highly similar to the inference rules of Chapter 14: above the line is the condition, below the line is the inference that may be drawn. The difference is that the "language" here is not merely propositions, but probability expressions carrying the $do$ operator.
Returning to the causal wire-cutting game, the three rules of do-calculus are the move table after wire-cutting. When can an observation be deleted? When can an action be exchanged for an observation? When can an intervention be entirely eliminated? The answer lies not on the surface of the formula, but in the graph after wire-cutting. Causal inference is stronger than probabilistic inference because it has incorporated "action" into the grammar; it is also more dangerous, because every action depends on the correctness of the graph you have drawn.
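As a worked illustration of the move table, consider a graph that is not taken from this chapter: $X \to M \to Y$, with an unobserved $U$ pointing into both $X$ and $Y$. No observable set blocks the backdoor path through $U$, yet the three rules still remove every $do$. Here $G_{\overline{X}}$ means the graph with edges into $X$ deleted and $G_{\underline{X}}$ the graph with edges out of $X$ deleted. The derivation below is a sketch under those assumptions.

```latex
\begin{align*}
P(y \mid do(x))
  &= \sum_m P(y \mid do(x), m)\, P(m \mid do(x)) \\
  &= \sum_m P(y \mid do(x), m)\, P(m \mid x)
     && \text{Rule 2: } (M \perp X) \text{ in } G_{\underline{X}} \\
  &= \sum_m P(y \mid do(x), do(m))\, P(m \mid x)
     && \text{Rule 2: } (Y \perp M \mid X) \text{ in } G_{\overline{X}\underline{M}} \\
  &= \sum_m P(y \mid do(m))\, P(m \mid x)
     && \text{Rule 3: } (Y \perp X \mid M) \text{ in } G_{\overline{M}\,\overline{X}} \\
  &= \sum_m P(m \mid x) \sum_{x'} P(y \mid do(m), x')\, P(x' \mid do(m)) \\
  &= \sum_m P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
     && \text{Rule 2 in } G_{\underline{M}},\ \text{Rule 3 in } G_{\overline{M}}
\end{align*}
```

Every independence on the right is read off a wire-cut graph, and the final line contains only observational quantities. This result is usually called the front-door formula.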
18.7 d-separation: The Independence Language of Graphs
Each of the three rules of do-calculus depends on a core judgment: in a given graph, are two sets of variables conditionally independent given a third set of variables? How does one directly read conditional independence from graph structure? This is precisely the job of d-separation (directional separation).
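Given a directed acyclic graph $G$ and three disjoint sets of nodes $X$, $Y$, $Z$, we say that $Z$ d-separates $X$ from $Y$ if every path between a node in $X$ and a node in $Y$ is blocked by $Z$.
The definition of "blocking" depends on the types of nodes along the path:
- Chain $A \to B \to C$: when $B$ is in $Z$, the path is blocked ($B$ transmits information; controlling $B$ stops the flow of information).
- Fork $A \leftarrow B \to C$: when $B$ is in $Z$, the path is blocked ($B$ is a common cause; controlling $B$ makes the association vanish).
- Collider $A \to B \leftarrow C$: when $B$ is not in $Z$, and no descendant of $B$ is in $Z$, the path is blocked (collider nodes block information by default; but controlling them or their descendants instead opens the path; this is the source of "collider bias").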
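The collider case is the easiest of the three to check numerically. The sketch below assumes two independent standard normal causes and a collider equal to their sum plus noise; the variable names and the selection threshold are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

a = rng.normal(size=n)           # first cause
c = rng.normal(size=n)           # second cause, independent of the first
b = a + c + rng.normal(size=n)   # collider: a -> b <- c

# Unconditioned, the path through the collider is blocked.
corr_all = np.corrcoef(a, c)[0, 1]

# Selecting on the collider is a form of conditioning and opens the path.
selected = b > 1.0
corr_selected = np.corrcoef(a[selected], c[selected])[0, 1]

print(f"all units: {corr_all:.3f}  selected units: {corr_selected:.3f}")
```

In the full sample the correlation sits near zero. Among units selected for a high collider value it turns clearly negative: if one cause is low, the other must have been high for the unit to pass the filter.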
The behavior of collider nodes is counterintuitive and worth pausing to think through clearly. Consider "height" (
18.8 Counterfactuals: Another World for a Single Individual
The highest rung of the three-rung ladder, counterfactuals, requires a precise definition within structural causal models.
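Consider a specific individual, in a specific situation, who experienced $X = x$ and $Y = y$. We ask: had $X$ been $x'$ instead, what would $Y$ have been?
In the SCM, this question has a three-step answer:
Step 1: Abduction. Using the observed evidence $X = x$, $Y = y$, update the distribution of the exogenous variables $U$ to the posterior $P(U \mid X = x, Y = y)$.
Step 2: Action. In the structural equations, replace the equation for $X$ with $X = x'$.
Step 3: Prediction. Under the modified model and the posterior distribution of $U$, compute the distribution of $Y$.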
These three steps together give an operational definition of counterfactual inference — not philosophical speculation, but a computable procedure, provided you have a sufficiently complete SCM.
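The three steps can be traced in a deliberately simple case. The sketch assumes a linear mechanism $Y = 2X + U_Y$; the slope and the function name `counterfactual_y` are hypothetical. With an additive deterministic mechanism, abduction pins the noise down to a single value rather than a posterior distribution.

```python
def counterfactual_y(x_obs, y_obs, x_cf, slope=2.0):
    """Counterfactual Y for one unit under the assumed model Y = slope * X + U_Y."""
    # Step 1, abduction: recover this unit's noise from what was observed.
    u_y = y_obs - slope * x_obs
    # Step 2, action: replace the equation for X with the constant x_cf.
    x = x_cf
    # Step 3, prediction: rerun the Y mechanism with the same noise.
    return slope * x + u_y


# A unit observed with X = 1 and Y = 3; what if X had been 0?
print(counterfactual_y(x_obs=1.0, y_obs=3.0, x_cf=0.0))  # 1.0
```

Setting the counterfactual value equal to the observed one returns the observed outcome, which is a useful sanity check on any implementation of the procedure.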
The "Average Treatment Effect" (ATE) commonly used in epidemiology is a second-rung concept:
$$\mathrm{ATE} = E[Y \mid do(X = 1)] - E[Y \mid do(X = 0)]$$
18.9 Causation and Inference: A Larger Picture
Let us review the path traveled in this chapter.
Starting from Chapter 14, inference rules concerned propositions: from a set of assumptions in
Now, what Chapter 18 does is: add a new verb to the language of inference — intervention. The
This structure is entirely parallel to the structure of formal systems:
| Formal system | Causal inference |
|---|---|
| Propositional variables | Variables |
| Inference rules | The three rules of do-calculus |
| Axioms | Graph (local independence assumptions) |
| Provable ($\vdash$) | Identifiable |
| Incompleteness theorems | Unidentifiability theorems |
Unidentifiability theorems play the role of the incompleteness theorems: there exist certain causal effects that, even with complete observational data and a complete graph, can never be computed from observation alone; an actual experiment must be performed. This is not a defect of the method, but a structural boundary.
Unresolved
Where do causal graphs come from? The entire chapter assumes you already possess a correct causal graph. But in practice, graphs are products of subjective knowledge and domain assumptions. If the graph is drawn incorrectly, the answers from do-calculus are also incorrect — the reliability of the system depends on the correctness of the graph, and the correctness of the graph cannot be verified from data. This is the deepest dilemma of causal inference: the more precise the tool, the heavier the weight of the assumptions.
Can causal graphs be automatically discovered from data? Causal discovery algorithms — PC algorithm, FCI algorithm, LiNGAM — attempt to infer certain aspects of causal structure from observational data. But in the general case, observable data can only determine a "Markov equivalence class" — a set of graphs that produce the same conditional independence relations, with arrow directions indistinguishable within the class. Selecting one graph from an equivalence class still requires domain knowledge, or additional assumptions about the error distribution.
What is the cost of performing inference on causal graphs? Even with a graph, the identification and computation of causal effects are, in the general case, computationally expensive. Determining whether a causal effect is identifiable can be done in polynomial time; but in complex graphs with latent variables, computing the specific expression for an identifiable effect sees the cost rise sharply. The full picture of this problem is the theme of Chapter 19: the cost of reasoning is not only a limitation of logic, but also a limitation of computation.
Exercises
★ Warm-up
Classify each of the following questions as belonging to which rung of the causal ladder (association/intervention/counterfactual), and give reasons.
- "How much higher is the lung cancer incidence among smokers compared to non-smokers?"
- "If a smoking ban were forcibly implemented, by how much would lung cancer incidence decrease?"
- "If this lung cancer patient had not smoked back then, would he now have cancer?"
- "Among the population detected with hypertension, what proportion uses antihypertensive medication?"
- "If this hypertensive patient is prescribed antihypertensive medication, by how much would his 10-year risk of cardiovascular events decrease?"
★★ Derivation
Consider the causal graph:
- List all paths from $X$ to $Y$. Which are causal paths (following arrow directions), and which are backdoor paths?
- Using the backdoor criterion, write the adjustment formula for $P(Y \mid do(X = x))$.
- If the variable you adjusted for is an unobservable latent variable, can the adjustment formula still be used? Is $P(Y \mid do(X = x))$ identifiable in this case?
★★★ Challenge
In the causal graph
You want to separately estimate:
- The direct effect of $X$ on $Y$ (the part not passing through the mediator $M$)
- The indirect effect of $X$ on $Y$ (the part passing through $M$)
To which rung of the causal ladder do direct effects and indirect effects belong? Try to write the definitions of these two quantities using the language of the $do$ operator.
