Bonus Chapter: From Theory to Code — CocDo Neural Causal Operators
Theory tells you what causal inference is. Code tells you how it gets done.
A Pallas's Cat Mini-Reflection from the Reasoning Kingdom: From Theory to Code — What Exactly Are We Translating?
The first volume's 13 chapters covered the theory: causal boundaries, attention mechanisms, the art of search, the limits of reasoning. Now the question is: how do these theories become runnable code?
This seems like an engineering problem, but it has a fundamental obstacle before the very first line of code: what gets lost in the translation between mathematical structure and computational structure?
Linear attention is a perfect example. Its algebraic core is a Monoid — the associativity of prefix sums allows
But a Monoid has no inverse. Once
This is the fundamental misalignment between mathematical structure and causal semantics: the Monoid's elegance comes from associativity; causal inference's precision comes from invertibility. The two are algebraically incompatible.
So you face a choice: preserve the Monoid's complexity advantage and give up the full semantics of the do operator; or abandon the Monoid, preserve the do operator, and bear the
The Spartacus model chose the first path, pushing the memory mechanism to its limit within the Monoid framework: vectorized decay gates give different dimensions independent memory lifespans—fast-decaying dimensions handle local syntax, slow-decaying dimensions handle long-range entity memory. It proves that even without upgrading to a Group, the vectorized decay structure can already approximate the behavior of "addressable memory."
CocDo chose the second path: abandon compressed history, preserve the complete causal adjacency matrix
The fork between these two paths is precisely whether the Monoid has an inverse element.
But this is not the end of the story. The translation from theory to code exposes other misalignments: how does the type system exclude cycles? How does β-reduction correspond to structural equation propagation? How does gradient descent become the inverse of the do operator? What is the relationship between attention mechanisms and causal discovery?
The answers to these questions are not in the theory—they are in the code. This chapter showcases CocDo—a neural causal model library that fuses Pearl's causal calculus with the Calculus of Constructions (CoC) type theory. It is not a toy implementation, but a complete system that can train, intervene, and plan on real data.
The translation process will lose some elegance, introduce some compromises, but will also expose structures invisible in pure theory—like "the attention matrix is the soft causal adjacency matrix," like "gradient planning is the calculus version of search," like "the system can use itself as a knowledge graph."
Theory tells you what causal inference is. Code tells you how it gets done—and why some things, within existing mathematical structures, simply cannot be done.
The first volume traversed 13 chapters: from entropy increase and prediction (Chapter 1), to the ceiling of symbolic systems (Chapter 2), to the boundaries of causality (Chapter 6), attention mechanisms (Chapter 9), the art of search (Chapter 10), and finally stopping before the boundaries of reasoning (Chapter 13).
How do these theories become runnable code?
This chapter showcases CocDo—a neural causal model library that fuses Pearl's causal calculus with CoC type theory. It is not a toy implementation, but a complete system that can train, intervene, and plan on real data.
The translation process from theory to code exposes problems invisible in pure mathematics: how does the type system exclude cycles at the structural level? What is the relationship between β-reduction and matrix propagation? How does gradient descent become the inverse of the do operator? What is the relationship between attention mechanisms and causal discovery?
B.1 The Type System Excludes Cycles
Chapter 6 said that causal graphs must be acyclic—causality cannot form loops, otherwise it creates a paradox in time. But how is "acyclic" guaranteed in code?
The traditional approach is a runtime check: after constructing the graph, run a topological sort or DFS, and report an error if a cycle is found. The problem is that this error occurs after you have already spent a lot of time constructing the graph and training the model.
CocDo uses the type system to exclude cycles at compile time. The core idea is a single sentence:
Each causal edge
is encoded as a dependent Pi type , requiring .
What is Dependent Type Theory (CoC)?
Calculus of Constructions (CoC) is a type theory where types themselves can depend on values.
Key concepts:
- Sort (universe):
Sort(i)represents the type universe at level, Sort(0)is the most basic type layer - Pi type:
is the dependent function type—"for any of type , return type " - Level constraint: CoC requires strict stratification of type universes; the type of
is , preventing self-referential paradoxes
CocDo borrows this structure: each causal node is assigned a level
What does a cyclic graph mean? It means there exists a path
from cocdo.kernel.terms import Sort, Pi, Var
from cocdo.kernel.typing import type_of, Context
# Legal edge: X(level 0) → Y(level 1)
ctx: Context = {"X": Sort(0), "Y": Sort(1)}
edge = Pi("X", Sort(0), Sort(1)) # ✓ 0 < 1
# Illegal edge: Y(level 1) → X(level 0), forms a cycle
bad_edge = Pi("Y", Sort(1), Sort(0)) # type_of will reject thisThe significance of this design is not just engineering defense. It says: the acyclicity of a causal graph is not a runtime check, but a type invariant. A causal model with cycles simply cannot be constructed in CocDo—just as in a strongly-typed language, you cannot assign a string to an integer variable.
The type system elevates constraints from "runtime crash" to "compile-time rejection." This is not a technical detail; it is an epistemological step: you are not checking whether a model is legal; you are making it so that illegal models cannot be expressed. Chapter 14 says the value of formal systems is eliminating ambiguity; the value here is eliminating the possibility of an entire class of errors.
B.2 The Implementation of do(): Term Substitution in λ-Calculus
Chapter 6 defined the do operator—"replace
In λ-calculus, this operation has a precise name: capture-avoiding substitution, denoted
β-reduction and capture-avoiding substitution: the two core operations of
β-reduction is the core computation rule of
Read as: applying the function
Capture-avoiding substitution
Significance in CocDo: the semantics of do(X=v) is precisely "replace beta_reduce is "propagating the effect to descendant nodes along the topological order." Pearl's mathematical definition and the computational semantics of
CocDo's implementation:
def subst(term, var: str, replacement):
"""Replace all free occurrences of var in term with replacement."""
if isinstance(term, Var):
return replacement if term.name == var else term
elif isinstance(term, Lam):
if term.var == var: # bound variable shadows, stop substitution
return term
return Lam(term.var, term.domain, subst(term.body, var, replacement))
elif isinstance(term, App):
return App(subst(term.func, var, replacement),
subst(term.arg, var, replacement))
return term # Const, Sort contain no free variables"Capture-avoiding" handles a subtle situation: if if term.var == var: return term line in the code is this check.
After substitution, β-reduction is needed to simplify the result to normal form:
def beta_reduce(term, steps=100):
"""Call-by-value reduction to a fixed point."""
for _ in range(steps):
reduced = _step(term)
if reduced is term: # fixed point: cannot reduce further
break
term = reduced
return termThe core rule of _step is β-reduction: Add/Mul are Const with values, the reducer directly computes:
App(App(Mul, Const(w=0.9)), Const(v=3.0)) → Const(2.7)This means the propagation of the structural equation
| Pearl's operation | λ-calculus operation |
|---|---|
| Replace | subst(mechanism, "X", Const(v)) |
| Delete all incoming edges of | After substitution, parent node terms vanish, no longer appear on the reduction path |
| Propagate effect along descendants | beta_reduce reduces to a fixed point along topological order |
| Cyclic graph illegal | Pi type requires strictly increasing levels; a cycle is a TypeError |
B.3 NOTEARS: Learning Causal Structure from Data
Chapter 6 said "observational data is never enough"—because correlation does not equal causation. But we still need to learn causal structure from data; this is the causal discovery problem.
Traditional methods (PC algorithm, FCI) are combinatorial search—finding the DAG that best fits the data among all possible DAGs. With
In 2018, Zheng et al. proposed NOTEARS, which transforms the discrete constraint "is it a DAG" into a continuous differentiable equality constraint:
where
NOTEARS: Using the matrix exponential to turn "is it acyclic" into continuous optimization (prior work: Zheng et al., 2018)
The traditional difficulty of causal discovery: one must search among all possible DAGs (directed acyclic graphs) for the one that best fits the data. The number of DAGs for
NOTEARS' key insight: transform the discrete constraint "is a DAG" into a smooth continuous equality:
is the edge weight matrix, represents the weight of edge is element-wise squaring (making negative values positive) - The trace of
satisfies , with equality if and only if the matrix is acyclic
With this continuous constraint, causal discovery becomes a gradient optimization problem—it can be directly solved using standard optimizers like Adam, without combinatorial search. This is the key step that makes causal structure learning feasible within a neural network framework.
CocDo uses
With this constraint, causal discovery becomes a continuous optimization with an equality constraint:
What is the reconstruction loss
Under the linear structural equation model (SEM) assumption, each variable is a linear combination of its parent nodes plus noise:
This is a least-squares objective—maximizing the degree to which the causal graph "explains" the data. Minimizing it alone would cause
Augmented Lagrangian Method
The standard Lagrangian method handles the equality constraint
But the pure Lagrangian method is numerically unstable; the multiplier
The algorithm alternates two steps:
- Fix
, perform gradient descent on (inner optimization) - Update the multiplier:
; if the constraint converges too slowly, increase
The larger
Solved using the augmented Lagrangian method, tightening the multiplier examples/demo_gcastle.py):
# Use gCastle to generate a 5-node linear Gaussian DAG, 1000 samples
from castle.datasets import DAG, IIDSimulation
true_dag = DAG.erdos_renyi(n_nodes=5, n_edges=6, seed=42)
dataset = IIDSimulation(W=true_dag, n=1000, method="linear", sem_type="gauss")
X = dataset.X # (1000, 5)
# Train CausalFFNN to learn the causal weight matrix A
ffnn = CausalFFNN(d_embed=32, hidden=128)
rho, lam = 1.0, 0.0
for epoch in range(500):
A, _ = ffnn(E_raw)
recon = ((X @ A - X) ** 2).mean()
h = acyclicity_loss(A) # ≈ tr(e^{A∘A}) - n
loss = recon + lam * h + (rho / 2) * h ** 2
loss.backward(); optim.step()
# Tighten constraint every 50 steps
if (epoch + 1) % 50 == 0:
lam += rho * h.item()
if h > 0.25 * h_prev:
rho = min(rho * 10, 1e6)After training, A is the learned causal weight matrix. CocDo automatically extracts the topological order from A (Kahn's algorithm) and constructs NeuralSCM.
Attention is causal discovery. CausalFFNN uses low-rank bilinear scoring:
This is mathematically identical to Transformer attention. The only differences are: using sigmoid instead of softmax (edges compete independently), and forcing the diagonal to zero (a variable cannot be its own cause).
A Transformer's attention heads compute "the influence weight of token
Abstract Algebra Interlude: Why Linear Attention Is Inherently Unable to Do the do Operator (Yes, It's Pallas's Cat's Research Again)
Standard attention requires accessing the full KV cache at each step, at a cost of
Behind this recurrence lies an algebraic structure. Define the binary operator:
This operator satisfies associativity, with identity element
The engineering dividend from associativity is direct: during training, one can do
But Monoid has no inverse.
A Group = Monoid + inverse. In the matrix addition group,
Pearl's do operator requires: sever all incoming edges of variable
Spartacus goes further within this framework: it uses vectorized decay instead of scalar decay,
This is an attempt to push the memory mechanism to its limit within the Monoid structure. It does not solve the do operator problem—because the Monoid itself does not allow it. But it proves: even without upgrading to a Group, the vectorized decay structure can already let different dimensions take on different time scales, approximating the behavior of "addressable memory" without truly addressing it.
CocDo chose the other path: abandon compressed history, preserve the complete causal adjacency matrix
The fork between these two paths is precisely whether the Monoid has an inverse element.
For Spartacus' full implementation and model weights, see NoesisLab/Spartacus-1B-Instruct.
B.4 Gradient Planning: The Calculus Version of Search
Chapter 10 discussed the art of search—navigating vast reasoning spaces. Traditional search is sampling: try many times, remember which ones were good. Gradient planning is calculus: walk along the gradient direction.
The do operator is forward: given an intervention value
where
The core of CausalPlanner is a differentiable single-step propagation:
# do-calculus in embedding space:
A_do = A * (1 - col_mask) # zero out the incoming edge columns of intervened nodes
E_do = (1 - row_mask) * E + row_mask * interv_E # inject intervention values
E_next = A_do.T @ E_do + U # structural equation propagationThen compute the gradient with respect to the intervention value
a = torch.tensor([0.0], requires_grad=True)
opt = torch.optim.Adam([a], lr=0.05)
for _ in range(200):
E_next = planner._step(a, interv_nodes, E_init)
energy = ((E_next[target_idx].norm(dim=-1) - scalar_targets) ** 2).sum()
energy.backward(); opt.step()The entire computation graph backpropagates from the target all the way to the intervention value—no sampling needed, no reinforcement learning.
Reinforcement learning turns planning into a sampling problem: try many times, remember which ones were good. Gradient planning turns planning into a calculus problem: walk along the gradient direction. The latter requires a differentiable world model—this is precisely what neural SCMs provide. The cost is that the world must be differentiable, or at least approximable by a differentiable model. This assumption does not always hold, but in embedding space it is usually good enough.
B.5 CausalSearch: The System Uses Itself as a Knowledge Graph
Chapter 13 discussed the boundaries of reasoning, mentioning that systems begin to reason about themselves—this is the beginning of self-reference. CausalSearch is a concrete self-referential application: using the chapters of the Reasoning Kingdom as a causal knowledge graph.
The previous four sections dealt with "variables"—scalars or vectors, with explicit numerical meanings. But knowledge can also be nodes in a causal structure: one paragraph depends on another, one concept is built upon another.
demo_causal_search.py embeds all chapters of the Reasoning Kingdom (22 chapters, ~1800 paragraphs) using BGE, trains CausalFFNN to learn the causal weight matrix
Step 1: Abduction—find the paragraph
Step 2: Action—
Step 3: Prediction—
Positive
Query: "The relationship between Transformer attention and Bayesian inference"
[Vector Retrieval RAG]
1. ch9·The success of Transformers triggered... cos=0.763
2. ch9·Attention as causality... cos=0.682
[CausalSearch · Pearl's Three Steps]
Abduction anchor → ch9·The success of Transformers...
+ Downstream activation:
ch17·Comparison of Bayesian updating and ch14... Δ=+2.69e-02
ch20·PAC and Bayesian: continuation of ch17... Δ=+2.34e-02
* Unique to CausalSearch (missed by RAG):
ch17 Bayesian inference ×4, ch1 generative model layer, ch19 proofs...Vector retrieval finds "surface similarity"; CausalSearch finds "causal relatedness"—propagating along learned causal edges rather than measuring distance in embedding space.
RAG is the first rung of the ladder: association. CausalSearch is the second rung: intervention. What you're asking is not "which paragraphs are similar to this query," but "if I inject this query into the knowledge graph, which nodes will be activated." These are two completely different questions; they just happen to both be called "retrieval."
B.6 Review: The Mapping from Theory to Code
Having reached this point in the first volume, we can draw a complete line:
| Volume 1 Chapter | Core Problem | Corresponding in CocDo |
|---|---|---|
| Chapter 6 Causal Boundaries | Observational data cannot deduce causality | NOTEARS: learning DAG structure from data |
| Chapter 6 do operator | Intervention vs. observation | subst + beta_reduce = implementation of do |
| Chapter 9 Transformer | Mathematical structure of attention | CausalFFNN: low-rank bilinear scoring |
| Chapter 9 Bonus Attention as Causality | Attention matrix is the soft causal adjacency matrix | sigmoid edge weights, diagonal forced to zero |
| Chapter 10 The Art of Search | Navigating reasoning spaces | CausalPlanner: gradient planning, not sampling |
| Chapter 13 Boundaries of Reasoning | System reasoning about itself | CausalSearch: using its own chapters as a knowledge graph |
This is not a coincidence. Every chapter of the first volume asks: what is reasoning, where are its boundaries, how does it work under real constraints? CocDo is a runnable answer to these questions—incomplete, but honest.
The value of formalization is not only mathematical rigor, but also that it can be implemented. The CoC type system makes cycles into type errors; NOTEARS turns discrete search into gradient optimization—these are the engineering dividends of formalization.
Thought Exercises
★ Warm-up
- In CocDo, why is
sigmoidused instead ofsoftmaxto compute edge weights? What impact does this choice have on the sparsity of the causal graph? - What situation does the "capture-avoiding" handling in the
substfunction address? Give an example of what would go wrong without this check.
★★ Derivation
Consider a three-node graph
- Manually compute the value of
after (express in terms of ). - Verify your calculation using CocDo's
stepmethod, settingrollout_steps=2. Why isrollout_steps=2needed instead of 1? - If only
rollout_steps=1were used, what would the value ofbe? Which rung of Pearl's ladder does this correspond to?
★★★ Challenge
NOTEARS' acyclicity constraint
- Prove:
(the trivial DAG). In what situations is this approximation exact? - Construct a non-zero DAG (with at least one edge) such that
but . In what situations does this approximation fail? - During actual training, the reconstruction loss
pushes to explain the data, while pushes toward zero. What is the equilibrium point of these two forces? How does it relate to L1 regularization?
