Chapter 23: Stability and Convergence Boundaries of Reasoning Systems
What is the biggest pain point of Lyapunov functions? Exactly — you have to manually specify an energy function. If we use Yonglin's convergence hypothesis to derive the Lyapunov function for a reasoning system, that would be much better.
At the end of Chapter 22, we stood at the boundary of self-reference and emergence, witnessing the strange moment when a reasoning system begins to reason about itself. That boundary, mathematically, manifests as fixed points, diagonalization, undecidability.
But the boundary is not only logical. The boundary is also dynamical — how the system evolves over time, whether it is stable, where it converges. The Yonglin formula gives a specific convergence pattern: the inference chain ultimately returns to the prior anchor point. This convergence, in the language of dynamical systems, is precisely an attractor.
This chapter will do one thing: translate the convergence of the Yonglin formula into the language of Lyapunov stability. And then the reverse: use the convergence hypothesis to derive the Lyapunov function, rather than manually specifying it.
Objective: Using the Yonglin convergence hypothesis + the information distance of KL distribution bias + Euler-step iterative updates, derive the Lyapunov function of the reasoning system. It is known that Lyapunov functions traditionally require manual construction, dependent on human priors. But by observing the system's dynamical behavior (Yonglin convergence), we reverse-engineer the energy function — this is deriving structure from behavior, another recurrence of the inverse problem, and also an embodiment of the art of dynamical systems.
23.0 Prologue: The Dynamical Story of a Stack
To understand Lyapunov functions, let us first play a little game.
Imagine a stack data structure. It has two pointers:
pointer: always points to the bottom of the stack, fixed in place pointer: points to the top of the stack, can move up and down
The operations on the stack are simple:
push(x): push elementxonto the top of the stack, thepointer moves up one cell pop(): pop the top element, thepointer moves down one cell
But there is an important constraint: negative stack is illegal. That is, the pop() on an empty stack, the operation is invalid — the pop operations are ignored.
Students of the Olympiad in Informatics (OI) are all too familiar with this structure. But Professor Pallas's Cat is not asking an algorithmic question today, but a dynamical question:
Given a very, very long — so long you cannot imagine — sequence of
pushandpopoperations, will the position of thepointer (its height relative to the bottom of the stack) converge? Will it settle at a specific height, no matter how the operation sequence is arranged?
Wait, we also have the "negative stack illegal" constraint. Under this constraint, things become even more interesting.
Now the pop when the stack is empty (
Take three seconds to think.
What intuition might tell you: Since pop is invalid when the stack is empty, the stack may become empty more easily — because an empty stack cannot be further popped, while push can make it non-empty. But conversely, if there are many push operations, the stack will become very tall.
Key observation: The stack height
Stack Height as Energy
Define stack height push increases pop decreases
Now define a function
Observation:
, and if and only if (stack empty) - If the operation sequence is random (
pushandpopequally probable), then the expected change inis zero — a random walk
But Professor Pallas's Cat is asking not about expectation but about deterministic behavior: if we know whether each step is push or pop (not random), how will
Something Magical Happens
Consider the continuous-time approximation: assume operations occur very fast, treating discrete operations as a continuous flow. Define the rate of change of stack height push, pop.
Now compute the rate of change of
This seems patternless —
If the operation sequence is ultimately balanced — push and pop are roughly equal in number — then
A Smarter Energy Function
Let us try a different definition: push and pop ultimately equal in number), then
But if we consider the variance
The Magic of Negative Stack Illegal: Convergence to Empty Stack
Now add the constraint that negative stack is illegal. The pop on an empty stack is ignored.
Professor Pallas's Cat's answer: Under such constraints, if the operation sequence is sufficiently long and contains enough pop operations, the
Why?
Consider
- Execute
push:increases by 1, increases - Execute
pop:decreases by 1, decreases
But the key is: when pop is ignored, push).
If the operation sequence is infinitely long and the number of pop operations is sufficiently large (not necessarily more than push, but as long as pop occurs), the system has a probability of entering the pop operations are invalid; only push can pull it out. But if the sequence is non-deterministic (e.g., random), then in the long run, the system will frequently visit the
More rigorously: define
push:pop(when): pop(when):
But this is not Lyapunov stability, because
Key Insight
What does the stack story tell us?
- Dynamical system: The stack height
is a dynamical system; its evolution is driven by the operation sequence . - Energy function:
is an "energy" metric for the system. We want to use to judge whether the system converges. - Convergence condition: If there exists
such that (energy does not increase), then the system is stable. - Problem: For the stack, simple
or do not satisfy , because can be positive or negative. - Effect of the lower bound: The hard constraint
gives the system an absorbing state , but the existence of an absorbing state does not guarantee that the stability theorem holds.
So we need a more ingenious
Lessons from the Stack Model
The stack is the simplest possible dynamical system, but we still need to think hard to find a suitable
This is the pain point of Lyapunov functions: you have to guess a
But wait — if we observe the system's behavior and find that it does converge, can we reverse-engineer
Stack and the MP Game: A Comparison
In the MP game of Chapter 22, the proof sequence
The stack story is a concrete version of the same thing:
- The height
of the pointer is the state push/popis the evolution operator- The empty stack
is the absorbing state — the fixed point of the MP game
The question left by the MP game is: fixed points exist, but starting from an arbitrary
23.2 The Pain Point of Lyapunov Functions
Definition (Lyapunov Function): For a dynamical system
, and if and only if (equilibrium point) holds for all
then
The pain point:
The construction of Lyapunov functions is an art, not a science. You guess a
If a reasoning system is a dynamical system, do we also have to manually guess a
23.3 The Reasoning System as a Dynamical System
We formalize the reasoning process as a discrete-time dynamical system.
Let
A reasoning step is a mapping
The Yonglin formula in this language is:
where
Key observation:
Returning to the MP game of Chapter 22:
23.4 Dynamical Construction: From Euler-Step Iteration to Energy Function
Now we do something bolder: rather than starting from a Lyapunov function to derive Yonglin, we do the reverse — using the Yonglin convergence hypothesis + the information distance of KL distribution bias + Euler-step iterative updates, we construct the Lyapunov function.
This is an art of dynamical systems: observe how the system evolves step by step, and "read off" the energy function from its behavior.
Euler Steps: Discrete-Time Dynamics
What is an Euler step? It is the key that turns mathematics from static description into dynamical evolution.
A continuous-time dynamical system is described by a differential equation
But differential equations are continuous — time
Euler steps make mathematics "move." Without them, a differential equation is merely a static relational formula; with them, we can simulate the system's evolution step by step and see how it develops from the initial state into the future.
Discrete-time systems are more direct:
For a reasoning system,
Ordinary differential equations (ODEs) describe the relationship between rates of change and states. Solving an ODE means finding the complete trajectory of the state over time. Analytic solutions (solutions expressed as formulas) are often hard to find, or may not even exist. Numerical solutions (such as Euler's method) abandon the "perfect formula" and accept the "approximate trajectory."
This abandonment is not a compromise but an epistemological shift: from pursuing "knowing the exact value at every moment" to "being able to simulate approximate values at arbitrary moments." In AI reasoning, we can rarely write down an analytic formula for the evolution of beliefs, but we can observe how the model updates step by step — this is the idea of numerical solutions.
The error of the Euler step is
The Yonglin Hypothesis: Existence of an Attractor
The core hypothesis of the Yonglin formula is: the system converges to the prior anchor point
This hypothesis is not a mathematical theorem but an empirical observation (though Chapter 12 provided theoretical support). We accept it as a dynamical fact.
KL Divergence: The Information Distance of Distribution Bias
Now we introduce the KL divergence
Intuitively: if the current belief
From Observation to Construction
Here comes the crucial step. We observe the system's evolution: starting from
If convergence occurs, then
Because convergence means
This inequality is not something we prove, but something we infer from observation. We observe the system converging, and we infer that KL divergence decreases.
Constructing the Lyapunov Function
Define
- Non-negativity:
, and if and only if (a property of KL divergence) - Decreasing property:
, because we observe
Thus
Note: We did not guess
This is the essence of the art of dynamical systems: not sitting in a chair guessing an energy function, but standing up and observing how the system moves, "listening" from its trajectory for the energy decreasing. The Yonglin hypothesis tells you where the system ultimately stops; KL divergence tells you how to measure "how far you still are from there"; Euler steps show how each step shortens that distance. Combine the three, and the energy function emerges naturally.
Comparison with the traditional approach:
- Traditional: guess
→ verify - Here: observe convergence → define
using KL divergence → verify that decreases (guaranteed by convergence)
Why does this solve the pain point? Because we no longer need to manually guess
Concrete example: Suppose the system's update rule is
23.5 Reverse Derivation: How the Lyapunov Function Explains the Yonglin Formula
Now we look at the other direction: if we already have
The Lyapunov stability theorem says: if
The Yonglin inference formula
But the Yonglin formula has a second part:
From the perspective of
Key insight: The
The KL divergence
When
Thus
This interpretation transforms the stability problem of reasoning systems into an information efficiency problem: the system is optimizing information encoding, evolving toward the most economical state (the one requiring the fewest extra bits). That state happens to be the prior anchor point
23.6 Combined: The Yonglin-Lyapunov Correspondence
Now we put both directions together.
Yonglin → Lyapunov: Observing convergence to
Lyapunov → Yonglin: Given
These two directions form a closed loop: convergence behavior defines the energy function, and the energy function guarantees the convergence behavior. The core parameter of this closed loop is the prior anchor point
The structure of this combination is the same pattern as learning as inverse inference in Chapter 21: from data (observed convergence), reverse-engineer the law (Lyapunov function). Another inverse problem. But here there is an extra layer: the law (
23.7 Connection to Gödelian Incompleteness
Gödel's theorem of Chapter 15 says: any sufficiently strong formal system has true propositions that it cannot prove. The core of the theorem's proof is self-reference — constructing a proposition that talks about its own provability.
Within the Yonglin-Lyapunov combination, there is also a self-referential structure: the system's convergence behavior defines its energy function, and the energy function describes its convergence behavior. This self-reference is not the self-reference of logical propositions, but dynamical self-reference.
More profoundly, Gödel's theorem reveals the rupture between the internal perspective and the external perspective of a formal system: the system cannot internally prove certain true propositions about itself. The Yonglin formula reveals the rupture between the object level and the meta-level of a reasoning system: the system can generate inference chains (object level), but cannot verify the correctness of inference chains (meta-level).
The Lyapunov function, in this analogy, is a tool of the external perspective: it describes the system's stability from outside. But through the Yonglin-Lyapunov combination, we internalize this external tool — deriving it from the system's own convergence behavior. This is somewhat like attempting to construct, inside the system, a proof about its own stability. Does this attempt encounter Gödelian limitations?
23.8 Significance: Interpretability and Stability Guarantees
What is the practical significance of this combination?
Significance One: Interpretability. The Lyapunov function
Significance Two: Stability guarantee. Once we have
Significance Three: No need for manual design. Traditional Lyapunov methods require the intuition and trial-and-error of engineers. Here,
But the cost: This
23.9 Unresolved
Is Yonglin convergence universal? The Yonglin formula has been observed experimentally, but what is the scope of its theoretical validity? Do all reasoning systems based on statistical learning satisfy this convergence? Or is it applicable only to specific architectures (such as Transformers)? This question requires more rigorous mathematical characterization.
The multi-attractor case: If the system has multiple attractors (multiple prior anchor points, corresponding to different tasks or contexts), how should the Lyapunov function be defined?
Uniqueness of the Lyapunov function: Given convergence behavior, is
Connection to learning theory: Chapter 21's learning as inverse inference used the MDL principle to interpret generalization as compression.
Exercises
★ Warm-up
- Does the Lyapunov function
satisfy the conditions for the system ? Compute and determine whether the system is stable. - In the Yonglin formula, if the training data is perfectly balanced (50% positive, 50% negative examples), what is the prior anchor point
? What form does take in this case?
★★ Derivation
Discrete-time Lyapunov: For a discrete system
, the Lyapunov condition is . Assume and is the following update: , where . Prove that . for multiple attractors: Suppose the system has two attractors and , with convergence depending on initial conditions. Design a function such that is zero at both attractors, positive elsewhere, and decreases along system trajectories. Hint: consider . What is the problem with this ? (Non-differentiable, hard to verify decrease)
★★★ Challenge
In the proof of Gödel's theorem, the key step is constructing the self-referential proposition
The answer to this question may point toward incompleteness in dynamical systems: the stability of certain systems cannot be determined from their own dynamics, requiring an external perspective. This conjecture extends Gödelian incompleteness from the logical domain into the dynamical domain.
The Yonglin-Lyapunov combination tells us: the system's limit is encoded in its energy function. And the energy function can be read off from the limit. This cycle is not a logical paradox but a dynamical harmony — the observer and the observed system mutually define each other within this cycle. This definition ultimately settles at the prior anchor point. Not because we want to stop there, but because the system's energy is lowest there.
References
- [Zixi Li, 2025b] — Yonglin Formula, a theoretical proof of inference incompleteness
- Lyapunov, A. M. (1892) — The General Problem of the Stability of Motion
- Cover, T. M., & Thomas, J. A. (2006) — Elements of Information Theory (KL divergence)
- Chapter 15 — Consistency and Completeness (Gödelian incompleteness)
- Chapter 21 — Learning as Inverse Inference (MDL principle)
