Part B (Continued): TD-MPC and Planning Mechanism Comparison

Mechanism 3: TD-MPC, the Bridge Between the Two

TD-MPC (Temporal Difference Model Predictive Control) [Hansen et al., 2022] combines the lookahead planning capability of MPC with the temporal-difference learning efficiency of Actor-Critic.

Core design:

Component	Role
Latent consistency loss	Trains the implicit dynamics model: the prediction $\hat{z}_{t+1} = f(z_{t}, a_{t})$ should be consistent with the encoder output $sg(z_{t+1})$
Temporal-difference target	Updates the Q function via the Bellman equation: $Q(z_{t}, a_{t}) = r_{t} + γ \cdot Q(z_{t+1}, π(z_{t+1}))$. Here $Q(s, a)$ is the action-value function: the expected cumulative discounted reward obtained by executing action $a$ in state $s$ and following the policy thereafter. The discount factor $γ$ causes future rewards to decay exponentially
CEM planning	At each decision step, uses CEM to search for the optimal action sequence in latent space

These three components are trained jointly: the consistency loss shapes the latent space, while the TD target trains the Q function to guide CEM search.
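The two loss terms named above can be written out on a single transition. The following is a minimal sketch, not the published implementation: the linear encoder, dynamics, Q function and policy, the toy dimensions, and the `consistency_coef` weight are all hypothetical stand-ins. NumPy has no autograd, so the stop-gradient is marked only by a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: observation dim 4, latent dim 3, action dim 2.
OBS, LAT, ACT = 4, 3, 2
W_enc = rng.normal(size=(LAT, OBS)) * 0.5        # encoder:  z = W_enc @ s
W_dyn = rng.normal(size=(LAT, LAT + ACT)) * 0.5  # dynamics: z' = W_dyn @ [z, a]
w_q = rng.normal(size=LAT + ACT) * 0.5           # Q(z, a) = w_q . [z, a]
W_pi = rng.normal(size=(ACT, LAT)) * 0.5         # policy:   a = tanh(W_pi @ z)


def encode(s):
    return W_enc @ s


def dynamics(z, a):
    return W_dyn @ np.concatenate([z, a])


def q_value(z, a):
    return float(w_q @ np.concatenate([z, a]))


def policy(z):
    return np.tanh(W_pi @ z)


def joint_loss(s_t, a_t, r_t, s_next, gamma=0.99, consistency_coef=2.0):
    """Return (total, consistency, value) losses for one transition."""
    z_t = encode(s_t)
    # In an autograd framework this target would be detached: sg(z_{t+1}).
    z_next_target = encode(s_next)
    z_next_pred = dynamics(z_t, a_t)
    consistency = float(np.sum((z_next_pred - z_next_target) ** 2))
    # TD target: r_t + gamma * Q(z_{t+1}, pi(z_{t+1})).
    td_target = r_t + gamma * q_value(z_next_target, policy(z_next_target))
    value = (q_value(z_t, a_t) - td_target) ** 2
    return consistency_coef * consistency + value, consistency, value


# Usage on one random transition (made-up data).
s_t, s_next = rng.normal(size=OBS), rng.normal(size=OBS)
a_t = rng.uniform(-1.0, 1.0, size=ACT)
total, consistency, value = joint_loss(s_t, a_t, r_t=1.0, s_next=s_next)
```

Summing the terms into one scalar is what "trained jointly" amounts to here: a single backward pass would update the encoder, the dynamics model, and the Q function together.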

The role of stop-gradient: The $sg(z_{t+1})$ in the consistency loss denotes stop-gradient. If the encoder receives gradient updates through both the prediction side and the target side, the model may learn a constant function that maps all states to a single point, driving the consistency loss to zero while being completely meaningless. Stop-gradient fixes the target side, preventing this representation collapse (where the model finds a degenerate solution: mapping all different inputs to the same output, minimizing the loss but producing no useful representation).
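The degenerate solution described above is easy to reproduce numerically. The sketch below is an illustration under simplifying assumptions: actions are dropped, the predictor is the identity, and both encoders are hypothetical. It shows only that a constant encoder reaches zero consistency loss while carrying no information. It does not simulate training dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 4))       # 5 made-up states, 4 dims each
next_states = rng.normal(size=(5, 4))  # their made-up successors


def consistency_loss(encode, predict, s, s_next):
    """Mean squared error between predicted and encoded next latents."""
    z_pred = predict(encode(s))
    z_target = encode(s_next)
    return float(np.mean((z_pred - z_target) ** 2))


def collapsed_encode(s):
    # Degenerate encoder: every state maps to the same latent point.
    return np.zeros((s.shape[0], 3))


W = rng.normal(size=(4, 3))


def informative_encode(s):
    # Non-degenerate encoder: a random linear projection.
    return s @ W


def identity_predict(z):
    return z


collapsed = consistency_loss(collapsed_encode, identity_predict, states, next_states)
informative = consistency_loss(informative_encode, identity_predict, states, next_states)
latent_spread = float(np.var(collapsed_encode(states)))
```

Here `collapsed` is exactly zero and `latent_spread` is zero as well, which is the point: the loss alone cannot tell a useless representation from a good one.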

📖 Bellman Equation: $Q(s_{t}, a_{t}) = r_{t} + γ \cdot \max_{a'} Q(s_{t+1}, a')$. This transforms the infinite-horizon cumulative reward problem into a form that only looks at "one-step reward + next-step Q value". Bootstrapping: using the model's own estimates (such as $Q(s_{t+1}, a')$) as training targets, "predicting from oneself". TD learning uses the Bellman equation for bootstrapping, allowing learning to occur at every step without waiting for an episode to end.

TD learning uses the Bellman equation to substitute "current reward + next-step Q value estimate" for a full rollout: everything beyond the next step is summarized by the Q function. This reduces the depth needed to form a value estimate from a complete multi-step model rollout to "1 step + Q function bootstrapping".
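A small arithmetic check makes the substitution concrete. In this sketch the reward sequence and discount factor are made-up numbers, and the next-step Q value is assumed to be exact, which a learned Q function only approximates.

```python
def monte_carlo_return(rewards, gamma):
    """Discounted sum of a full reward sequence (needs the whole rollout)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward, next_q, gamma):
    """One-step target: current reward plus discounted next-step Q estimate."""
    return reward + gamma * next_q


rewards = [1.0, 1.0, 1.0, 1.0]  # made-up rollout
gamma = 0.5

full_rollout = monte_carlo_return(rewards, gamma)       # 1.875
next_q_exact = monte_carlo_return(rewards[1:], gamma)   # 1.75
one_step = td_target(rewards[0], next_q_exact, gamma)   # 1.875
```

When the next-step Q value is accurate, the one-step target matches the full-rollout return. In practice the Q estimate carries error, so bootstrapping trades rollout depth for dependence on the quality of the Q function.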

Comparison with DreamerV3:

Dimension	DreamerV3	TD-MPC2
World model form	Explicit generative (reconstructs pixels/observations)	Implicit (only guarantees accurate value prediction)
Planning approach	Latent space Actor-Critic	CEM + TD
Applicable task scope	Visually complex tasks requiring rich observations	State-observation tasks, efficient continuous control
Interpretability	Can visualize reconstructions	Latent space has no direct semantics

Comparison of Three Planning Mechanisms

Dimension	CEM-MPC	Dreamer Actor-Critic	TD-MPC
Planning approach	Random search	Policy gradient (differentiable)	Random search + TD
Requires pixel reconstruction	No	Yes	No
Long-horizon planning capability	Limited by $H$	Relies on Critic bootstrapping	TD + MPC combined
Computational cost	High (large $N$)	Medium (imagined rollouts)	Low to medium
High-dimensional action space	Low efficiency	Gradient optimizes directly	Q function guides search
Model exploitation risk	Medium (myopic)	High (policy can exploit model)	Medium (TD suppresses accumulated error)
Typical scenario	Simple continuous control	Visually complex tasks	Efficient continuous control
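The "TD + MPC combined" entry can be illustrated with a CEM planner that scores each sampled action sequence by $H$ steps of model reward plus a discounted terminal value. This is a minimal sketch under stated assumptions: the 1-D dynamics, the reward, the `terminal_value` function (a stand-in for a learned Q function), and all hyperparameters are hypothetical.

```python
import numpy as np


def cem_plan(z0, dynamics, reward, terminal_value, horizon=3, action_dim=1,
             num_samples=64, num_elites=8, iterations=5, gamma=0.99, seed=0):
    """Return the first action of the plan found by CEM."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        noise = rng.normal(size=(num_samples, horizon, action_dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        scores = np.zeros(num_samples)
        for i in range(num_samples):
            z, discount = z0, 1.0
            for t in range(horizon):
                scores[i] += discount * reward(z, actions[i, t])
                z = dynamics(z, actions[i, t])
                discount *= gamma
            # Bootstrapping: the value term covers everything beyond H.
            scores[i] += discount * terminal_value(z)
        elites = actions[np.argsort(scores)[-num_elites:]]
        # Refit the sampling distribution to the best sequences.
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]


# Toy problem: drive a scalar latent state toward 0 with actions in [-1, 1].
z0 = np.array([3.0])
first_action = cem_plan(
    z0,
    dynamics=lambda z, a: z + a,
    reward=lambda z, a: -float(np.sum(z ** 2)),
    terminal_value=lambda z: -10.0 * float(np.sum(z ** 2)),
)
```

Starting from 3.0 with a horizon of 3 and actions bounded by 1, the goal is only just reachable within the horizon, so the terminal value carries the information about what happens after step $H$. The planner should return a negative first action, pushing the state toward zero.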

Lecture Summary

Seven architecture families represent different directions for overcoming the GRU memory bottleneck: RNN/RSSM is the most computationally lightweight, Transformer handles long-range dependencies best, Diffusion produces the most realistic visuals, JEPA focuses most on semantics, RWM focuses most on deployment stability, Genie automatically discovers actions from video, and WAM unifies world prediction with action planning.
Three learning paradigms determine the knowledge boundary of a model: observation-based learning captures visual patterns but cannot control, interaction-based learning captures action causality but its data is expensive, and counterfactual-based learning captures value reasoning but has weak interpretability. WAM represents a fourth paradigm: video as dense physical supervision for joint training of world and action.
Three planning mechanisms determine how a model is used for decision-making: CEM is the most straightforward but inefficient in high-dimensional spaces, Actor-Critic is the most elegant but carries model exploitation risk, and TD-MPC most pragmatically balances both.
Dreamer = interaction-based paradigm + RSSM + latent Actor-Critic, and is the core reference system for this curriculum.
TD-MPC = counterfactual-based paradigm + CEM + TD, and will be implemented hands-on and compared with Dreamer in P04.

Next Lecture

After building and running world models, the next question is: how do we judge whether they are good? Lecture 4 provides dedicated evaluation metrics for each architecture: FID and reward correlation for Dreamer, MCTS visit entropy for MuZero, latent consistency loss for TD-MPC, long-horizon PSNR for STORM, and one universal failure mode that all models encounter: horizon drift.
