Chapter 8: The Contract of Heuristics: How Much Courage Does It Take to Accept "Close Enough"?
If A*'s heuristic function lies, it will still give you a path. It's just that the path will be wrong.
I. The 1968 Wager
In 1968, Peter Hart, Nils Nilsson, and Bertram Raphael of the Stanford Research Institute proposed an algorithm in a paper that would later prove to be one of the most elegant ideas in the history of artificial intelligence.
The problem they wanted to solve was simple: let the robot Shakey find the shortest path from point A to point B in a room.
Brute-force search can guarantee finding the optimal solution—try all possible paths and pick the shortest one. But Shakey's world has millions of possible path combinations, and brute-force search requires an astronomical amount of time.
Greedy search is fast—always move toward the target direction. But it will crash into obstacles, walk into dead ends, and the path it finds may be much longer than the optimal solution.
The A* algorithm proposed by Hart and his two colleagues made a seemingly impossible promise: I can be orders of magnitude faster than brute-force search, while still guaranteeing finding the optimal solution.
What is the price of this promise? A contract.
If you give A* an "honest" heuristic function—one that never overestimates the remaining cost—it will keep its promise. But if the heuristic function "lies," overestimating the remaining cost, A* will still give you a path. It's just that the path may be wrong.
This is not a bug in the algorithm; it is its design. A* uses a mathematical contract to exchange for a balance between speed and optimality.
Over fifty years later, the idea of this contract remains at the core of heuristic algorithms: when perfection is unattainable, we need approximations with clearly defined costs.
II. The Curse of Combinatorial Explosion
Let me first quantify the matter of "perfection unattainable."
In 2016, AlphaGo defeated Lee Sedol. The world marveled at AI's "intuition" and "creativity." But few people noticed a technical detail: behind every move AlphaGo made, Monte Carlo tree search explored millions of possibilities on the 19×19 board.
Not hundreds, not thousands, but millions.
Why so many? Because the state space of Go is approximately
Even if you could search
This is not an engineering problem; this is physical impossibility.
Similar combinatorial explosions are everywhere:
Traveling Salesman Problem: All possible routes for visiting n cities number
. For 20 cities, that's routes. Protein Folding: For a protein of 100 amino acids, with each amino acid having 3 main conformations, there are approximately
possible folding configurations. Theorem Proving: In first-order logic, for a proof of length n, the number of possible inference steps grows exponentially with n.
This is the reason heuristics exist: when the optimal solution is unattainable, we need a "good enough" solution.
But how good is "good enough"? This is not a vague engineering question, but a precise mathematical one. Heuristic algorithms must obey certain contracts—if these contracts are violated, the "approximate solution" they provide may be completely wrong.
Let me return to 1968 and see how Hart, Nilsson, and Raphael designed this contract.
III. A*'s Mathematical Contract
Suppose you are in a maze, trying to get from start point S to end point G. Every step has a cost (such as time or distance). You want to find the path with the minimum cost.
Brute-force search (Dijkstra's algorithm) will try all possible paths, guaranteeing finding the optimal solution, but the time complexity is
Dijkstra's Algorithm: The Brute-Force Guarantee of Shortest Paths
Dijkstra's algorithm is a shortest-path algorithm proposed by Dutch computer scientist Edsger Dijkstra in 1956. The idea is: starting from the origin, at each step select the unvisited node with the smallest currently known cost, expand its neighbors, and continue until reaching the destination.
- Guarantee: If all edge costs are non-negative, Dijkstra will certainly find the optimal (minimum-cost) path
- Cost: It explores all possible paths, without knowing "roughly which direction is correct," making it inefficient
Branching factor
Search depth
Combined, the search space is
Greedy search at each step chooses the direction that "looks closest to the goal," which is fast but may walk into dead ends. Worse, the path it finds may be several times longer than the optimal solution.
When the A* algorithm was proposed in 1968, it made a seemingly impossible promise: I can be faster than Dijkstra, while still guaranteeing finding the optimal solution.
Its core is an evaluation function:
: the actual cost from the start to node n (known, definite) : the estimated cost from n to the goal (unknown, guessed) : the total estimated cost of reaching the goal via n
A* at each step expands the node with the smallest
The intuition behind this design is simple: if you have already taken 10 steps to reach node n, and you estimate you still need 5 steps to reach the goal, then the total cost via n is approximately 15 steps. A* prioritizes exploring the path with the smallest total cost.
But there is a fatal question here: how do you estimate
If
If
Hart, Nilsson, and Raphael proved a profound theorem: if the heuristic function satisfies admissibility, A* is guaranteed to find the optimal solution.
Definition of admissibility:
where
In other words, the heuristic function must be optimistic—it must never overestimate the remaining cost. You may underestimate (this will slow down the search), but you must not overestimate (this will cause the search to find the wrong answer).
This is A*'s contract: give me an admissible heuristic function, and I guarantee finding the optimal solution. Violate the contract, and the guarantee is void.
Why is this contract effective? Let me illustrate with a concrete example.
IV. Hands-On: Seeing How the Contract Is Broken
Admissibility is not an abstract mathematical requirement. It is a necessary and sufficient condition for A* to find the optimal solution.
Let us verify this with our own hands.
Experiment objective: Using the same maze, run A* with an admissible heuristic function and an inadmissible heuristic function respectively, and see how the results differ.
Maze map (0 = passable, 1 = obstacle, S = start, G = goal):
S . . . . .
. . . # . .
. . . . # .
. . . . . .
. . . . . #
. . . . . GS is the start point (0,0), G is the goal (5,5)
.is a passable cell, cost per step = 1#is an obstacle, impassable
Task: Find the shortest path from S to G.
import heapq
# ------------------------------------------------------------------
# Define the maze: 6 rows, 6 cols, 1 = obstacle, 0 = passable
# S=(0,0), G=(5,5)
# ------------------------------------------------------------------
GRID = [
[0, 0, 0, 0, 0, 0], # row 0: S . . . . .
[0, 0, 0, 1, 0, 0], # row 1: . . . # . .
[0, 0, 0, 0, 1, 0], # row 2: . . . . # .
[0, 0, 0, 0, 0, 0], # row 3: . . . . . .
[0, 0, 0, 0, 0, 1], # row 4: . . . . . #
[0, 0, 0, 0, 0, 0], # row 5: . . . . . G
]
ROWS, COLS = len(GRID), len(GRID[0])
START = (0, 0)
GOAL = (5, 5)
def neighbors(r, c):
"""Return the four (up/down/left/right) passable neighbors"""
for dr, dc in [(-1,0),(1,0),(0,-1),(0,1)]:
nr, nc = r+dr, c+dc
if 0 <= nr < ROWS and 0 <= nc < COLS and GRID[nr][nc] == 0:
yield nr, nc
def reconstruct_path(parent, node):
"""Backtrack from the parent dict to reconstruct the path"""
path = []
while node is not None:
path.append(node)
node = parent[node]
return list(reversed(path))
def astar(h_func, label="A*"):
"""
Generic A* search.
h_func(r, c) is the heuristic function; returns (path length, path, number of expanded nodes).
"""
# Priority queue elements: (f, g, node)
open_heap = [(h_func(*START), 0, START)]
g_cost = {START: 0} # Actual cost from start to each node
parent = {START: None} # Path backtracking
closed = set()
expanded = 0 # Count of expanded nodes
while open_heap:
f, g, node = heapq.heappop(open_heap)
if node in closed:
continue
closed.add(node)
expanded += 1
# Reached goal
if node == GOAL:
path = reconstruct_path(parent, node)
return len(path) - 1, path, expanded # steps = nodes - 1
# Expand neighbors
for nb in neighbors(*node):
new_g = g + 1 # cost per step = 1
if nb not in g_cost or new_g < g_cost[nb]:
g_cost[nb] = new_g
parent[nb] = node
f_val = new_g + h_func(*nb)
heapq.heappush(open_heap, (f_val, new_g, nb))
return None, [], expanded # No solution
# ------------------------------------------------------------------
# Step 1: Admissible heuristic function — Manhattan distance
# h₁(n) = |n_row - G_row| + |n_col - G_col|
# Never overestimates remaining cost → satisfies admissibility → guarantees optimal solution
# ------------------------------------------------------------------
def h_admissible(r, c):
"""Manhattan distance: admissible, does not overestimate"""
return abs(r - GOAL[0]) + abs(c - GOAL[1])
steps1, path1, exp1 = astar(h_admissible, "Admissible A*")
print("=== Step 1: Admissible heuristic (Manhattan distance) ===")
print(f"Path length: {steps1} steps")
print(f"Expanded nodes: {exp1}")
print(f"Path: {path1}")
# ------------------------------------------------------------------
# Step 2: Inadmissible heuristic function — Manhattan distance × 2
# h₂(n) = 2 × (|n_row - G_row| + |n_col - G_col|)
# Overestimates remaining cost → violates admissibility → may find suboptimal path
# ------------------------------------------------------------------
def h_inadmissible(r, c):
"""2× Manhattan distance: inadmissible, overestimates"""
return 2 * (abs(r - GOAL[0]) + abs(c - GOAL[1]))
steps2, path2, exp2 = astar(h_inadmissible, "Inadmissible A*")
print("\n=== Step 2: Inadmissible heuristic (2× Manhattan distance) ===")
print(f"Path length: {steps2} steps")
print(f"Expanded nodes: {exp2}")
print(f"Path: {path2}")
# ------------------------------------------------------------------
# Comparison conclusion
# ------------------------------------------------------------------
print("\n=== Comparison Conclusion ===")
print(f"Admissible heuristic → {steps1} steps (optimal solution guaranteed)")
print(f"Inadmissible heuristic → {steps2} steps ({'optimal' if steps1==steps2 else 'suboptimal, non-optimal'})")
if steps1 != steps2:
print(f" Gap: {steps2 - steps1} steps (contract was broken, a longer path was found)")
print(f"\nNote: Admissible version expanded {exp1} nodes, inadmissible version expanded {exp2} nodes.")
print(" The inadmissible heuristic expands fewer nodes, is faster, but sacrifices the optimality guarantee.")Step 1: Using an Admissible Heuristic Function
The simplest admissible heuristic function is Manhattan Distance:
This is the cost of "if there were no obstacles, walking in a straight line to the goal." Because the actual path must go around obstacles,
Starting from S, compute
| Node | g(n) | h₁(n) | f(n) |
|---|---|---|---|
| S right | 1 | 9 | 10 |
| S down | 1 | 9 | 10 |
Both tie at
Step 2: Using an Inadmissible Heuristic Function
Now deliberately violate the contract. Define an overestimating heuristic function:
This function estimates the remaining cost as twice the Manhattan distance—clearly overestimating.
Restarting from S:
| Node | g(n) | h₂(n) | f(n) |
|---|---|---|---|
| S right | 1 | 18 | 19 |
| S down | 1 | 18 | 19 |
Both tie at
This creates a trap. After (0,2), at the fork:
- Right path: 5 consecutive steps straight to (0,5),
decreasing at each step, like a snowball rolling faster and faster - Gap path: down to (1,2), the right neighbor (1,3) is blocked by an obstacle, must go down one more level before turning right—needs one extra step to see progress
The decreasing effect of
Result: The inadmissible
This failure is not accidental. A fundamental limitation of heuristic design is precisely this: you cannot reliably predict the behavior of a heuristic function before runtime. We tend to think that overestimating a little is merely "less accurate," but the
This is the meaning of admissibility: if
But admissibility also has a cost.
V. An Information-Theoretic Perspective: The Entropy Cost of Heuristic Functions
Admissibility guarantees correctness, but at what cost?
Consider two extremes:
Extremely conservative:
This is clearly admissible (because the true cost
Extremely aggressive:
This is the ideal case—A* would walk directly along the optimal path without exploring any useless nodes. This amounts to zero-entropy search: you already know the answer, no exploration needed. But the problem is: if you already know
Real-world heuristic functions lie between the two: as close as possible to
This trade-off can be quantified using the effective branching factor
where
Pearl, in his classic 1984 textbook, provided experimental data:
When
, (on the 8-puzzle) When
, When using more precise heuristic functions (such as pattern databases),
Going from 2.8 down to 1.1 means the number of nodes in the search tree drops from
The quality of the heuristic function directly determines the computational resources the search consumes.
But there is a deep tension here: a good heuristic function requires domain knowledge. Manhattan distance works for grids, but what about graph search? What about continuous spaces? What about Go?
In complex environments, designing a heuristic function that is both admissible and close to
This is precisely the problem that \[Zixi Li, 2026a\]'s ADS work seeks to solve: can we let machines automatically learn heuristic functions?
VI. ADS: The Information-Theoretic Reformulation of Heuristics
Traditional heuristic design has a fundamental limitation:
But the search space is dynamic. In open areas, the heuristic function is trustworthy; in obstacle-dense regions, any heuristic function may mislead. A fixed weight
\[Zixi Li, 2026a\]'s ADS (Adaptive Dual Search) posed a question: can we use information theory to quantify the trustworthiness of the heuristic function, letting the weight adapt automatically to the uncertainty of the current state?
The core idea is to link the heuristic function's weight
- When the model is highly certain about the current state (low entropy),
is large—trust the heuristic, advance quickly - When the model is highly uncertain about the current state (high entropy),
—forming a potential barrier that refuses entry into high-uncertainty regions
This is not parameter tuning, but rather quantifying uncertainty into an actionable search constraint. A*'s admissibility ensures "no overestimation"; ADS's entropy barrier ensures "no exploration of high-entropy dead ends."
Relationship to admissibility: Admissibility is a constraint on the value range of
The more complete technical details of ADS, and how it is applied to the LLM reasoning space, are unfolded in Chapter 10.
VII. The Contract of Approximation Algorithms: Abandoning Optimality, But Not Abandoning Guarantees
A*'s admissibility guarantees the optimal solution, but at the cost of having to explore all "potentially optimal" paths. On some problems, this is still too slow.
In 1973, the theory of computational complexity met a turning point. Stephen Cook proved that the SAT problem is NP-complete—if you can solve it in polynomial time, you can solve all NP problems.
This proof triggered a profound philosophical shift: perhaps we should give up seeking the optimal solution and instead seek a "good enough" solution.
But "good enough" needs a precise definition. Otherwise, any algorithm could claim to be "good enough."
In 1974, David Johnson introduced the concept of approximation ratio. For minimization problems, an α-approximation algorithm guarantees:
This is a worst-case guarantee—no matter the input, the approximation ratio will not exceed α.
Let me illustrate with a classic example.
7.1 Vertex Cover: A 2-Approximation Algorithm
Vertex Cover problem: Given a graph, find the smallest set of vertices such that every edge in the graph has at least one endpoint in the set.
This problem is NP-complete. Brute-force search requires checking
But there is an extremely simple 2-approximation algorithm:
import random
def greedy_vertex_cover(edges):
"""
Greedy vertex cover: 2-approximation algorithm, time complexity O(|E|).
edges: [(u, v), ...] list of edges
Returns: vertex cover set C, guaranteeing |C| ≤ 2|C*|
"""
remaining = list(edges)
C = set()
while remaining:
u, v = random.choice(remaining) # randomly pick an edge
C.add(u)
C.add(v)
# Delete all edges connected to u or v
remaining = [(a, b) for a, b in remaining if a not in (u, v) and b not in (u, v)]
return CThe running time of this algorithm is
where
Proof (this is a classic technique in approximation algorithm analysis):
For each edge
Our algorithm selects both. So the number of vertices we select is at most twice the optimal solution.
This guarantee is tight—there exist inputs where the algorithm achieves exactly 2 times. But it is a worst-case guarantee: no matter the input, the approximation ratio will never exceed 2.
7.2 PTAS: Arbitrarily Close to Optimal
The holy grail of approximation algorithms is PTAS (Polynomial-Time Approximation Scheme).
A PTAS is a family of algorithms such that, for any
In other words, you can get arbitrarily close to the optimal solution—want 99% optimal? You can. Want 99.9% optimal? You can too. The cost is only a polynomial increase in time.
\[Li & Jin, 2015\] provided the first PTAS for the weighted unit disk cover problem. This problem has important applications in wireless network design: given points in the plane, cover all points with the minimum number of unit disks.
The time complexity of their algorithm is
This reveals the limitation of PTAS: theoretically feasible, practically may be infeasible.
The contract of approximation algorithms: I do not guarantee giving you the optimal solution, but I guarantee the solution I give will not be too bad, and I can tell you precisely the degree to which it "won't be too bad." This is an honest contract—much stronger than "I tried my best."
VIII. Pseudocode: A* and Adaptive Heuristic Weights
Let me formalize the core algorithms discussed above.
Algorithm 1: Standard A* Search
import heapq
def astar(graph, start, goal, h):
# graph: dict {node: [(neighbor, cost), ...]}
# h: heuristic function h(node) -> estimated cost to goal
# Returns: shortest path list, or None (no solution)
open_heap = [(h(start), 0, start, [start])] # (f, g, node, path)
closed = set()
while open_heap:
f, g, node, path = heapq.heappop(open_heap)
if node == goal:
return path
if node in closed:
continue
closed.add(node)
for neighbor, cost in graph.get(node, []):
if neighbor in closed:
continue
g_new = g + cost
heapq.heappush(open_heap, (g_new + h(neighbor), g_new, neighbor, path + [neighbor]))
return None # No solutionTime complexity:
Algorithm 2: ADS Adaptive Heuristic Weights
import math
def ads_search(start, goal, cost_fn, h_fn, policy_fn):
# policy_fn(s) -> action probability distribution dict {action: prob}
# h_fn(s) -> heuristic estimate (cost to goal)
# cost_fn(s,a) -> cost of executing action a
def entropy(probs):
return -sum(p * math.log(p + 1e-10) for p in probs if p > 0)
s, path, g = start, [start], 0.0
while s != goal:
action_probs = policy_fn(s)
actions = list(action_probs.keys())
probs = list(action_probs.values())
U_s = entropy(probs)
H_max = math.log(len(actions) + 1e-10)
# Logarithmic barrier: as B_s -> 1, alpha -> infinity, completely excluding high-entropy dead ends
B_s = U_s / (H_max + 1e-10)
alpha = -math.log(1 - B_s + 1e-10)
best_action, best_f = None, float("inf")
for action in actions:
f_val = g + cost_fn(s, action) + alpha * h_fn(action)
if f_val < best_f:
best_f, best_action = f_val, action
g += cost_fn(s, best_action)
s = best_action
path.append(s)
return pathKey innovation: α is not fixed, but dynamically adjusted based on the information-theoretic properties of the current state. When entropy is high (large uncertainty), reduce reliance on the heuristic; when entropy is low, trust the heuristic to advance quickly.
IX. Visualization: How Heuristic Weights Affect Search Paths
Imagine a 2D grid world with obstacles and a goal. Different heuristic weights α lead to completely different search behaviors.

Search behavior under three heuristic weight configurations: (a) α=0 (Dijkstra): uniform diffusion, exploring all directions, guaranteed optimal but inefficient. (b) α=1 (standard A): advancing along heuristic direction, moderate backtracking when encountering obstacles. (c) α=2 (greedy): over-trusting the heuristic, may fall into local optima. Green is the start, red is the goal, blue shading indicates expanded nodes, yellow line is the final path.*
Key observations: - α too small: search is too conservative, expanding many useless nodes - α too large: search is too aggressive, potentially missing the optimal path - α=1 with h admissible: the equilibrium point, guaranteeing optimality with high efficiency
The innovation of ADS lies in: α is not fixed, but adaptively adjusted based on local terrain.
This adaptivity comes from the computation of the information-theoretic barrier B(s): - In open areas: IG is high (large information gain along the heuristic direction), U is low (subsequent states are deterministic) → B is high → α is high - Near obstacles: IG is low (heuristic direction is blocked), U is high (multiple detour choices) → B is low → α is low
X. The Philosophy of Heuristics: The Courage to Accept Imperfection
Let me return to the question that opened this chapter: when perfection is unattainable, what should we do?
The answer from heuristic algorithms is: accept approximation with clearly defined costs.
This requires courage, because you must admit that perfection is unattainable. But this is not failure; it is wisdom.
Admissibility is A*'s contract: I may be optimistic, but I cannot lie. If
Approximation ratio is the contract of approximation algorithms: I do not promise optimality, but I promise not to be too bad. A 2-approximation algorithm guarantees that the solution does not exceed twice the optimal solution. This is a worst-case mathematical guarantee—not an average case, not "usually good," but "will never exceed 2 times."
Probabilistic guarantee is the contract of randomized algorithms: I do not guarantee correctness every time, but I guarantee correctness with probability
These three contracts correspond to three different kinds of "imperfection":
Conditional guarantee: If the assumptions hold, I guarantee optimality (A*)
Worst-case guarantee: I guarantee not to be too bad, but may not be optimal (approximation algorithms)
Probabilistic guarantee: I am correct with high probability, but have a small probability of failure (randomized algorithms)
By contrast, contractless heuristics—"I tried my best," "usually good," "works on the data I tested"—are dishonest approximations. They do not tell you the conditions of failure, do not quantify the degree of "not good enough," do not give the probability of risk.
But heuristics also have costs:
Design cost: Good heuristic functions require domain knowledge. Manhattan distance works for grids, but what about graph search? What about continuous spaces? Every new problem requires redesigning the heuristic function.
Computational cost: A complex heuristic function may be slower than the search itself. If
Generalization risk: A heuristic designed for a specific problem may fail when the scenario changes. This is why ADS uses neural networks to learn heuristic strategies—it can generalize across tasks. But the cost is the loss of mathematical guarantees.
This leads to a deep tension: provable guarantees vs. empirical generalization.
A* has admissibility guarantees, but requires handcrafted
This is the question of Chapter 12. For now, hold onto this tension.
XI. Yao's Principle: Algorithm Design as a Game
Before discussing how Collins breaks through deterministic lower bounds, we need to first understand a deeper question: who, exactly, is the adversary in heuristic design?
In 1977, Andrew Chi-Chih Yao, in a paper just a few pages long, gave an answer that changed the landscape of algorithmic complexity theory.
8.1 Turning Algorithm Design into a Game
Imagine a game like this.
There are two players:
- Player A (the algorithm designer): Chooses an algorithm
to solve some problem - Player B (the adversary / input distribution): Chooses an input distribution
, trying to make the algorithm run as slowly as possible
Does Player A move first or second? If A chooses the algorithm first, B constructs a worst-case input targeting that algorithm. If B chooses the distribution first, A optimizes the algorithm for that distribution.
This is a zero-sum game.
8.2 The Two Sides of the MiniMax Theorem
Yao's Minimax Principle says: Between deterministic algorithms and randomized inputs, there exists a MiniMax equality.
The left side of this equality is: "The expected time of the optimal deterministic algorithm under the worst-case input distribution."
The right side of this equality is: "The expected time of the optimal deterministic algorithm under some fixed input distribution."
The two sides are equal. This means: You can obtain a lower bound for all randomized algorithms by analyzing the lower bound of deterministic algorithms under a fixed distribution.
8.3 A Feynman-Style Explanation: Why Is This Profound?
Let me make this clear with an analogy.
Suppose you are a general (algorithm designer), trying to formulate a universal tactic (algorithm) to face different enemies (inputs). You don't know how the enemy will deploy.
Pessimistic scenario (MaxMin): You choose your tactic first, then the enemy chooses the deployment most unfavorable to you.
- Your optimal tactic = "The best result I can achieve under the worst enemy deployment"
Optimistic scenario (MinMax): The enemy chooses the distribution first, then you choose the best tactic.
- Your optimal tactic = "The best result I can achieve knowing the enemy's average deployment statistics"
The MiniMax theorem says: These two values are equal.
What does this mean?
The worst-case lower bound for deterministic algorithms = The average-case lower bound for randomized algorithms.
More bluntly: if you want to prove that "any deterministic algorithm requires at least
This is an extraordinarily powerful tool—you convert a "worst-case" analysis into an "average-case" analysis, and the latter is usually easier to handle.
8.4 A Game-Theoretic Perspective on Heuristic Design
Now apply this framework to heuristic design.
The design of the heuristic function
| Player | Choice | Goal |
|---|---|---|
| Heuristic designer | Heuristic function | Minimize search time |
| Problem instance generator | Distribution of input graphs/mazes | Maximize search time |
Yao's principle tells us: No matter how clever a deterministic heuristic is, there is a performance lower bound determined by the input distribution.
This lower bound is not an engineering limitation, but an information-theoretic limitation: if your heuristic function is deterministic, the adversary (input distribution) can always find a class of problems that degrades your performance to the worst case.
8.5 From Relative Lower Bounds to Absolute Lower Bounds
Here a critical distinction emerges:
Relative Lower Bound:
- "For the deterministic heuristic
, there exists an input such that the search time is " - This is a lower bound relative to a specific algorithm—change the algorithm, and the lower bound may change
Absolute Lower Bound:
- "For any deterministic algorithm, the search time is
" - This is a lower bound about the problem itself—no deterministic algorithm can break it
Yao's principle builds a bridge between the two: through the game-theoretic MiniMax equality, you can elevate the worst-case lower bound of a specific algorithm into an absolute lower bound that holds for all deterministic algorithms.
But there is a trap here: this absolute lower bound only holds for deterministic algorithms.
Randomized algorithms are not subject to the same constraints.
Why? Because randomized algorithms break the symmetry of the game: the adversary cannot construct a worst-case input targeting a randomized algorithm—because the randomized algorithm runs differently each time, and the adversary does not know how to "target" it.
This brings us to Collins's core idea.
XII. Collins: Breaking Deterministic Lower Bounds with Randomization
Heuristics are typically deterministic: given state s, h(s) always returns the same value. But \[Zixi Li, 2026b\]'s Collins work shows that randomization can break through deterministic information-theoretic lower bounds.
Collins targets the optimizer state compression problem, but its core idea reveals a deep principle of heuristic design.
8.1 The Deterministic Information-Theoretic Lower Bound
Proposition 1 (Deterministic lower bound): Any deterministic data structure that maintains the state of d parameters with precision ε must use
This is not an engineering limitation; this is an information-theoretic counting argument:
To distinguish
This means: if you want to use a deterministic method to store the first-order and second-order moments of the Adam optimizer (for d parameters), you need at least
For large language models (d can be in the billions), this is a massive bottleneck.
8.2 The Breakthrough of Randomization: Trading Probability for Space
Collins uses Count-Sketch + 2-Universal Hashing to reduce state complexity from
The key idea: do not store exact states, but store random projections of the states.
This is akin to coarse-graining in physics: you do not track the precise position of every particle, but track macroscopic statistics. The cost is introducing noise, but if the noise is sufficiently small, the macroscopic behavior remains correct.
The Chernoff bound gives a safety upper bound:
where
Experiments verified the theoretical prediction: for
8.3 Implications for Heuristics: Deterministic vs. Probabilistic Guarantees
Collins's work reveals a deep trade-off in heuristic design:
Deterministic heuristics: Guarantee that for a given input, they always return the same estimate. But subject to information-theoretic lower bounds—a good heuristic function requires storing a large amount of domain knowledge.
Randomized heuristics: Trade probabilistic guarantees for lower computational/storage costs. Do not guarantee correctness every time, but guarantee "correct with high probability."
This is a different kind of contract—from "guaranteed correct" to "correct with probability
From a physics perspective, this is the manifestation of the Second Law of Thermodynamics: you can trade entropy (randomness) for energy (computational resources). Deterministic systems need to maintain a low-entropy state, at the cost of high energy consumption. Randomized systems permit higher entropy, so energy consumption is lower.
The Landauer principle gives the theoretical lower bound: erasing 1 bit of information requires at least
Heuristics perform approximation on known structures. In the next chapter, we will see how Transformers fundamentally reconstruct the concept of "structure" by dynamically connecting all nodes.
Unresolved Questions
Can the "goodness" of a heuristic function be formally defined?
What
is theoretically optimal? Pearl proved that consistency is stronger than admissibility, but is this the strongest useful condition? Does there exist "the admissible function closest to "? This question is equivalent to: given finite computational resources, how should they be optimally allocated to the design of the heuristic function? What is the boundary of applicability of the entropy barrier?
ADS's dynamic logarithmic barrier works on arithmetic tasks but fails on abstract algebra. Is this failure principled or engineering? In other words: is entropy regularization itself unsuitable for high-order abstraction, or is it just that the 360M model is too small? If given a 3B model, would abstract algebra tasks succeed? This question touches on the fundamental limits of entropy as a reasoning constraint.
What is the essential difference between implicit heuristics learned by neural networks and explicit heuristics designed by humans?
AlphaGo's value network is a kind of heuristic function, but it does not satisfy admissibility—why does it still work? One possible answer: what the neural network learns is an average-case heuristic, while admissibility is a worst-case guarantee. But what does this mean? Does there exist a concept of "average-case admissibility"?
What are the theoretical boundaries of randomized heuristics?
Collins proved the deterministic lower bound
, but what is the upper bound for randomized approaches? The Chernoff bound gives , but is this tight? Does there exist an "optimal randomized heuristic"—one that minimizes computational cost for a given failure probability ? Can heuristics learn online?
Classical heuristics are designed offline—fixed before the search begins. But could we design a heuristic function that learns and adjusts itself during the search? Each time a node is expanded, update the estimate of
, making subsequent search more efficient? This is similar to reinforcement learning, but the goal is to accelerate the current search, not to generalize to future tasks. Where are the lower bounds for approximation algorithms?
We know vertex cover has a 2-approximation algorithm, but can we achieve 1.5-approximation? 1.1-approximation? For some NP-complete problems, there exist approximation lower bounds—unless P=NP, no
-approximation algorithm exists. These lower bounds tell us: for some problems, even approximation is hard.
Extra Chapter: Search as Convergence—A Dynamical Systems Perspective on Heuristics
Every step of A* is, in fact, a Euler step.
Return to the evaluation function
This has exactly the same structure as the Euler discretization of gradient descent:
What is a Euler step?
In calculus, the derivative describes continuous change. But computers can only do discrete things—each time, take a small step.
The Euler method divides "continuous flow" into small steps:
Meaning: currently at
Gradient descent is a special case of the Euler step—the direction of change is the negative gradient (steepest downhill direction), the step size is the learning rate
A*'s search has the same structure: current state
The triangle inequality is the guarantee of convergence.
The consistency condition
What is the triangle inequality?
You've known since elementary school: the sum of any two sides of a triangle is greater than the third side.
In heuristic search, this rule becomes: the estimated cost from
Intuition: your guess about the remaining distance cannot be more pessimistic than "guess again after actually taking one step." Otherwise, your map is self-contradictory—like a map claiming "Beijing to Shanghai: 2000 km, but Beijing to Nanjing: only 100 km, Nanjing to Shanghai: only 100 km." You would suspect it's lying.
A heuristic function satisfying this rule is called consistent. Consistency guarantees that A* will not backtrack, and each node only needs to be expanded once.
What is a Lyapunov function?
Imagine a ball rolling in a bowl. No matter where you release it, the ball will eventually roll to the bottom of the bowl.
The Lyapunov function is that "height of the bowl"—a numerical measure of how far the system is from the goal. As long as this value decreases (or does not increase) at each step, the system will certainly converge to the goal and won't wander off.
In A*, the
Lyapunov used this idea in 1892 to prove the stability of a large class of dynamical systems. A*'s optimality proof is, in essence, the same thing.
What is ADS doing?
ADS's logarithmic barrier
The barrier is not "tuning a parameter," but reshaping the geometry of the search space based on local uncertainty.
This "search as convergence" intuition is, in the context of heuristics, a geometric story.
But if we replace the search space with the LLM's belief space, replace
To be revealed in Chapter 12.
Further Reading
Hart, Nilsson, Raphael (1968). A Formal Basis for the Heuristic Determination of Minimum Cost Paths — The original A* algorithm paper, defining admissibility and consistency
Yao, A. C.-C. (1977). Probabilistic Computations: Toward a Unified Measure of Complexity — The original paper on Yao's principle; the MiniMax theorem elevates deterministic lower bounds to randomized lower bounds
→ [FOCS 1977]\[Zixi Li, 2026a\] — ADS adaptive dual search, the information-theoretic reformulation of heuristics, knowledge distillation framework
\[Zixi Li, 2026b\] — Collins randomized optimizer, deterministic lower bound and Chernoff upper bound, phase transition at
\[Li & Jin, 2015\] — PTAS for weighted unit disk cover, the gold standard of approximation algorithms
→ [arXiv:1502.04918]\[Esmer et al., 2022\] — Exponential-time approximation algorithms for approximate monotone local search
→ [arXiv:2206.13481]Pearl, J. (1984). Heuristics: Intelligent Search Strategies for Computer Problem Solving — The classic textbook on heuristic search
Motwani, R. & Raghavan, P. (1995). Randomized Algorithms — The authoritative textbook on randomized algorithms published by Cambridge University Press, systematically covering Chernoff bounds, 2-Universal hashing, the MiniMax principle, and their applications in algorithm design
Russell & Norvig (2020). Artificial Intelligence: A Modern Approach (4th ed.) — Chapters 3-4 discuss A* and its variants in detail
