
Part B (Continued): TD-MPC and Planning Mechanism Comparison

Mechanism 3: TD-MPC, the Bridge Between the Two

TD-MPC (Temporal Difference Model Predictive Control) [Hansen et al., 2022] combines the lookahead planning capability of MPC with the temporal-difference learning efficiency of Actor-Critic.

Core design:

| Component | Role |
| --- | --- |
| Latent consistency loss | Trains the implicit dynamics model: the prediction $\hat{z}_{t+1} = f(z_t, a_t)$ should be consistent with the encoder output $\mathrm{sg}(z_{t+1})$ |
| Temporal-difference target | Updates the Q function (the action-value function; $Q(s, a)$ is the expected cumulative discounted reward obtained by executing action $a$ in state $s$ and following the policy thereafter) via the Bellman equation $Q(z_t, a_t) = r_t + \gamma Q(z_{t+1}, \pi(z_{t+1}))$, where the discount factor $\gamma$ causes future rewards to decay exponentially |
| CEM planning | At each decision step, uses CEM to search for the optimal action sequence in latent space |

These three components are trained jointly: the consistency loss shapes the latent space, while the TD target trains the Q function to guide CEM search.
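The planning step in the table can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the TD-MPC implementation: the 1-D latent, the `dynamics`, `reward`, and `q_value` functions, and all hyperparameters are hypothetical stand-ins for learned networks, and `q_value` takes only the latent (standing in for $Q(z, \pi(z))$). It shows CEM scoring each candidate action sequence by its discounted model return plus a bootstrapped terminal value.

```python
import random
import statistics

def dynamics(z, a):
    """Hypothetical latent dynamics: the action nudges a 1-D latent."""
    return z + 0.5 * a

def reward(z, a):
    """Hypothetical reward: stay near z = 1 and keep actions small."""
    return -(z - 1.0) ** 2 - 0.01 * a ** 2

def q_value(z):
    """Hypothetical terminal value, standing in for a learned Q(z, pi(z))."""
    return -(z - 1.0) ** 2

def score(z0, actions, gamma=0.99):
    """Discounted model return over the horizon plus a bootstrapped tail."""
    z, total, discount = z0, 0.0, 1.0
    for a in actions:
        total += discount * reward(z, a)
        z = dynamics(z, a)
        discount *= gamma
    return total + discount * q_value(z)

def cem_plan(z0, horizon=5, samples=64, elites=8, iterations=6, seed=0):
    """Return the first planned action and the full mean action sequence."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian,
        # clipping each action to the range [-1, 1].
        candidates = [
            [max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            for _ in range(samples)
        ]
        # Keep the highest-scoring sequences and refit the Gaussian to them.
        candidates.sort(key=lambda seq: score(z0, seq), reverse=True)
        top = candidates[:elites]
        mean = [statistics.fmean(seq[t] for seq in top) for t in range(horizon)]
        std = [statistics.pstdev([seq[t] for seq in top]) + 1e-3
               for t in range(horizon)]
    return mean[0], mean

first_action, plan = cem_plan(z0=0.0)
```

In a receding-horizon loop only `first_action` would be executed before replanning from the next latent state.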

The role of stop-gradient: the $\mathrm{sg}(z_{t+1})$ in the consistency loss denotes stop-gradient. If both the prediction side and the target side of the loss can receive gradient updates through the encoder, the model may learn a constant function that maps all states to a single point, driving the consistency loss to zero while carrying no information. Stop-gradient fixes the target side, which helps prevent this representation collapse (a degenerate solution in which all different inputs map to the same output, minimizing the loss but producing no useful representation).
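The degenerate solution is easy to make concrete. The sketch below is an assumption-laden toy, not code from TD-MPC: states are scalars, the encoders and dynamics are hypothetical lambdas, and because plain Python has no automatic differentiation, the stop-gradient appears only as a comment marking which side is the fixed target. It shows that a collapsed encoder reaches zero consistency loss while an informative encoder with slightly wrong dynamics does not.

```python
def consistency_loss(encode, predict, transitions):
    """Mean squared error between predicted and encoded next latents."""
    total = 0.0
    for s, a, s_next in transitions:
        target = encode(s_next)             # target side: sg(z_{t+1}), held fixed
        prediction = predict(encode(s), a)  # prediction side: f(z_t, a_t)
        total += (prediction - target) ** 2
    return total / len(transitions)

# Hypothetical transitions (state, action, next_state) with s_next = s + a.
transitions = [(0.0, 1.0, 1.0), (1.0, -0.5, 0.5), (2.0, 0.25, 2.25)]

# Collapsed encoder: every state maps to 0, and the dynamics leave z unchanged.
collapsed_loss = consistency_loss(lambda s: 0.0, lambda z, a: z, transitions)

# Informative encoder: z = s, with slightly inaccurate dynamics.
informative_loss = consistency_loss(
    lambda s: s, lambda z, a: z + 0.9 * a, transitions
)
```

Judged by the consistency loss alone, the collapsed encoder looks better than the informative one, which is why the loss cannot be trusted on its own terms.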

📖 Bellman Equation: $Q(s_t, a_t) = r_t + \gamma \max_{a} Q(s_{t+1}, a)$. This transforms the infinite-horizon cumulative reward problem into a form that only looks at "one-step reward + next-step Q value". The form in the table above replaces the max over actions with the action chosen by the policy, $\pi(z_{t+1})$. Bootstrapping: using the model's own estimates (such as $Q(s_{t+1}, a)$) as training targets, "predicting from oneself". TD learning uses the Bellman equation for bootstrapping, allowing learning to occur at every step without waiting for an episode to end.
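A tabular example makes the bootstrapped update concrete. This is a minimal sketch on a hypothetical three-state chain; the function names, the learning rate `alpha`, and the environment are illustrative assumptions, and TD-MPC itself uses neural networks rather than a table.

```python
def td_target(reward, next_q_values, gamma=0.9, done=False):
    """One-step Bellman target: r + gamma * max_a Q(s', a)."""
    if done:
        return reward  # no future value after a terminal transition
    return reward + gamma * max(next_q_values)

def td_update(q, state, action, reward, next_state,
              alpha=0.5, gamma=0.9, done=False):
    """Move Q(state, action) toward the bootstrapped target; return the TD error."""
    target = td_target(reward, q[next_state], gamma, done)
    td_error = target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error

# Hypothetical chain 0 -> 1 -> 2: action 1 moves right, reaching state 2 pays 1.
q = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
for _ in range(50):
    td_update(q, 0, 1, 0.0, 1)             # 0 -> 1, no reward
    td_update(q, 1, 1, 1.0, 2, done=True)  # 1 -> 2, reward 1, episode ends
```

Each update uses only a single transition. The value of state 0 approaches `gamma * 1.0` purely through the estimate stored for state 1, without ever observing a complete return.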

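TD learning uses the Bellman equation to substitute "current reward + next-step Q value estimate" for a full rollout to the end of the episode. The learning target therefore needs only one step of reward plus a bootstrapped Q value, and a planner can stop its model rollout after a short horizon and let the Q function account for everything beyond it.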
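Written out, the substitution looks as follows. This is a sketch in the notation already used here; the return $G_t$ and the planning horizon $H$ are symbols introduced for illustration.

```latex
\begin{aligned}
G_t &= \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} = r_t + \gamma\, G_{t+1} \\
    &\approx r_t + \gamma\, Q(s_{t+1}, a_{t+1})
      && \text{(one-step TD target)} \\
G_t &\approx \sum_{k=0}^{H-1} \gamma^{k} r_{t+k} + \gamma^{H} Q(z_{t+H}, a_{t+H})
      && \text{($H$-step model rollout with a bootstrapped tail)}
\end{aligned}
```

The first approximation replaces the unknown tail $G_{t+1}$ with the current Q estimate. The second applies the same replacement after $H$ model steps, so errors in the dynamics model accumulate over only $H$ steps while the Q function covers the remainder.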

Comparison with DreamerV3:

| Dimension | DreamerV3 | TD-MPC2 |
| --- | --- | --- |
| World model form | Explicit generative (reconstructs pixels/observations) | Implicit (only guarantees accurate value prediction) |
| Planning approach | Latent space Actor-Critic | CEM + TD |
| Applicable task scope | Visually complex tasks requiring rich observations | State-observation tasks, efficient continuous control |
| Interpretability | Can visualize reconstructions | Latent space has no direct semantics |

Comparison of Three Planning Mechanisms

| Dimension | CEM-MPC | Dreamer Actor-Critic | TD-MPC |
| --- | --- | --- | --- |
| Planning approach | Random search | Policy gradient (differentiable) | Random search + TD |
| Requires pixel reconstruction | No | Yes | No |
| Long-horizon planning capability | Limited by H | Relies on Critic bootstrapping | TD + MPC combined |
| Computational cost | High (large N) | Medium (imagined rollouts) | Low to medium |
| High-dimensional action space | Low efficiency | Gradient optimizes directly | Q function guides search |
| Model exploitation risk | Medium (myopic) | High (policy can exploit model) | Medium (TD suppresses accumulated error) |
| Typical scenario | Simple continuous control | Visually complex tasks | Efficient continuous control |

Lecture Summary

  • Seven architecture families represent different directions for overcoming the GRU memory bottleneck: RNN/RSSM is the most computationally lightweight, Transformer handles long-range dependencies best, Diffusion produces the most realistic visuals, JEPA focuses most on semantics, RWM focuses most on deployment stability, Genie automatically discovers actions from video, and WAM unifies world prediction with action planning.
  • Three learning paradigms determine the knowledge boundary of a model: observation-based learns visual patterns but cannot control, interaction-based learns action causality but data is expensive, counterfactual-based learns value reasoning but has weak interpretability. WAM represents a fourth paradigm: video as dense physical supervision for joint training of world and action.
  • Three planning mechanisms determine how a model is used for decision-making: CEM is the most straightforward but inefficient in high-dimensional spaces, Actor-Critic is the most elegant but carries model exploitation risk, and TD-MPC most pragmatically balances both.
  • Dreamer = interaction-based paradigm + RSSM + latent Actor-Critic, and is the core reference system for this curriculum.
  • TD-MPC = counterfactual-based paradigm + CEM + TD, and will be implemented hands-on and compared with Dreamer in P04.

Next Lecture

After building and running world models, the next question is: how do we judge whether they are good? Lecture 4 provides dedicated evaluation metrics for each architecture: FID and reward correlation for Dreamer, MCTS visit entropy for MuZero, latent consistency loss for TD-MPC, long-horizon PSNR for STORM, and one universal failure mode that all models encounter: horizon drift.


Further Reading

Key papers covered in this lecture, listed in order of appearance:

Foundational Architectures

Transformer Architectures

Diffusion Architectures

Planning Mechanisms

JEPA Series

Genie / Interactive Generation

RWM / Robot Deployment

WAM / Joint Learning