
Part B: Latent Dynamics

The Encoder Is Not Enough: We Need to Predict the Future

With a VAE encoder, we can compress the current frame $o_t$ into a latent vector $z_t$. But the central task of a world model is predicting the future:

In latent space, given the current state $z_t$ and action $a_t$, predict the next state $z_{t+1}$.

This prediction capability lets the agent "simulate" the future internally, enabling planning without actually executing actions in the environment. This is the key reason world models reduce sample complexity.


The Simplest Dynamics Model: GRU

The Gated Recurrent Unit (GRU) is a foundational tool for sequence modeling. As a dynamics model, the GRU takes $(z_t, a_t)$ and predicts the next latent state:

$$z_{t+1} = \mathrm{GRU}(z_t, a_t; \theta)$$

📖 GRU internals (brief): A GRU controls information flow through two gates: the reset gate decides "how much of the past to forget", and the update gate decides "how much of the old state to retain vs. how much new information to write in". Gate values lie between 0 and 1, determined jointly by the current input and the previous hidden state. This allows the GRU to selectively retain long-term dependencies while discarding irrelevant information, making it better at handling longer sequences than a plain RNN. Compared to an LSTM, the GRU has one fewer gate (no separate memory cell), fewer parameters, and trains faster.

The GRU's strengths are simplicity and stable training. Its limitation is that it produces deterministic predictions and cannot express uncertainty. In real environments, the same action can lead to multiple different outcomes (for example, pushing a box might succeed or might get stuck).
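
The gate mechanics described above can be sketched in a few lines of NumPy. This is a minimal, untrained illustration rather than a reference implementation: the class name `GRUDynamics`, the weight names, and the choice to treat $z_t$ as the recurrent state and $a_t$ as the input are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUDynamics:
    """Hypothetical GRU cell used as a deterministic latent dynamics model."""

    def __init__(self, latent_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)

        def init(rows, cols):
            # Small random weights; a real model would learn these.
            return rng.normal(0.0, 0.1, size=(rows, cols))

        # W_* act on the action (input), U_* act on the latent (recurrent state).
        self.W_r, self.U_r = init(latent_dim, action_dim), init(latent_dim, latent_dim)
        self.W_u, self.U_u = init(latent_dim, action_dim), init(latent_dim, latent_dim)
        self.W_c, self.U_c = init(latent_dim, action_dim), init(latent_dim, latent_dim)
        self.b_r = np.zeros(latent_dim)
        self.b_u = np.zeros(latent_dim)
        self.b_c = np.zeros(latent_dim)

    def step(self, z, a):
        """Return the predicted next latent state z_{t+1} given (z_t, a_t)."""
        r = sigmoid(self.W_r @ a + self.U_r @ z + self.b_r)  # reset gate
        u = sigmoid(self.W_u @ a + self.U_u @ z + self.b_u)  # update gate
        candidate = np.tanh(self.W_c @ a + self.U_c @ (r * z) + self.b_c)
        # u close to 1 keeps the old state; u close to 0 writes the candidate.
        return u * z + (1.0 - u) * candidate
```

Calling `step` twice with the same `(z, a)` returns exactly the same vector, which is the determinism limitation discussed above: the model has no way to say that two different next states are both plausible.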


MDN-RNN: Modeling Uncertainty

MDN-RNN (Mixture Density Network + RNN), proposed in Ha & Schmidhuber (2018), models uncertainty over the next state using a mixture of Gaussians:

$$p(z_{t+1} \mid z_t, a_t) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(z_{t+1}; \mu_k, \sigma_k^2)$$
  • $K$ Gaussian components, each with its own mean $\mu_k$ (center of the distribution) and variance $\sigma_k^2$ (width of the distribution)
  • Mixture weights $\pi_k$: the probability mass of the $k$-th Gaussian component, satisfying $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \ge 0$. These can be read as "the probability that the $k$-th future occurs". The network outputs one unnormalized score per component, and a softmax turns these scores into weights that are non-negative and sum to 1.

MDN-RNN can capture multimodal distributions: the environment may transition to several distinct next states, and the model can represent all of them.
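
To make the mixture concrete, here is a small NumPy sketch for a single latent dimension. It assumes the network has already produced `logits`, `mu`, and `log_sigma` for the $K$ components; these names, and the choice to parameterize the width as a log standard deviation, are illustrative assumptions and are not taken from the paper's code. The negative of `mixture_log_prob` is the usual training loss for a mixture density network.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def mixture_log_prob(z_next, logits, mu, log_sigma):
    """log p(z_next) under a 1-D Gaussian mixture with K components."""
    log_pi = np.log(softmax(logits))
    sigma = np.exp(log_sigma)
    log_normal = (-0.5 * np.log(2.0 * np.pi) - log_sigma
                  - 0.5 * ((z_next - mu) / sigma) ** 2)
    terms = log_pi + log_normal
    m = np.max(terms)
    return m + np.log(np.sum(np.exp(terms - m)))  # log-sum-exp over components

def sample_mixture(rng, logits, mu, log_sigma):
    """Pick a component according to pi, then sample from that Gaussian."""
    pi = softmax(logits)
    k = rng.choice(len(pi), p=pi)
    return mu[k] + np.exp(log_sigma[k]) * rng.standard_normal()

# Two equally likely futures centered at -2 and +2.
logits = np.zeros(2)
mu = np.array([-2.0, 2.0])
log_sigma = np.log(np.array([0.1, 0.1]))
rng = np.random.default_rng(0)
samples = [sample_mixture(rng, logits, mu, log_sigma) for _ in range(5)]
```

With these parameters the samples cluster around either -2 or +2 and almost never land near 0, which a single Gaussian (or a deterministic GRU prediction) could not represent.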

[Figure: MDN-RNN architecture from Ha & Schmidhuber (2018). The RNN hidden state is passed through a fully connected layer to produce $K$ parameter groups $(\pi_k, \mu_k, \sigma_k)$, which parameterize the mixture weights, means, and variances and together define the Gaussian mixture distribution over the next latent state.]

RSSM: Separating Deterministic and Stochastic Components

The RSSM (Recurrent State Space Model), introduced in PlaNet and carried forward as the core of the Dreamer series, splits the state into two parts:

  • Deterministic hidden state $h_t$: maintained by an RNN, aggregating information from the historical trajectory, with no stochasticity
  • Stochastic latent state $z_t$: sampled from a distribution conditioned on $h_t$, expressing current uncertainty

Core equations of the RSSM:

📖 Subscript $\phi$ (phi): the subscript $\phi$ in $f_\phi$, $p_\phi$, $q_\phi$ denotes "this function has parameters $\phi$", i.e., the learnable weights of the neural network. $f_\phi(\cdot)$ is read as "function $f$ parameterized by $\phi$". During training, gradient descent updates $\phi$ so that the predictions of these functions become increasingly accurate. Similarly, $\theta$ (theta) is another commonly used symbol for a distinct set of learnable parameters.

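$$
\begin{aligned}
h_t &= f_\phi(h_{t-1}, z_{t-1}, a_{t-1}) && \text{(deterministic update, GRU/RNN)} \\
z_t &\sim p_\phi(z_t \mid h_t) && \text{(prior)} \\
z_t &\sim q_\phi(z_t \mid h_t, o_t) && \text{(posterior)}
\end{aligned}
$$

  • Prior: has no access to real observations and infers the current state from the history $h_t$ alone; used for pure imagination/prediction
  • Posterior: corrects the prior using the real observation $o_t$; used during training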

📖 Prior vs. posterior: these are fundamental concepts in Bayesian statistics. The prior is "belief before seeing data", the RSSM's guess about the current state $z_t$ based on historical memory $h_t$. The posterior is "belief updated after seeing data", refining the prior with the real observation $o_t$ to obtain a more accurate estimate. During training, the posterior generates $z_t$ and the KL loss is computed (measuring the gap between prior and posterior). During inference and imagination, only the prior is available (there is no real $o_t$), so the RSSM rolls forward using the prior alone.
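
When both prior and posterior are diagonal Gaussians, the KL term has a closed form. The sketch below computes it in NumPy. The function name is hypothetical, and the direction KL(posterior ‖ prior) is an assumption based on the usual convention in the PlaNet/Dreamer line of work; the text above only says the loss measures the gap between the two.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions.

    q is the posterior q(z_t | h_t, o_t); p is the prior p(z_t | h_t).
    """
    var_q = sigma_q ** 2
    var_p = sigma_p ** 2
    per_dim = (np.log(sigma_p / sigma_q)
               + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p)
               - 0.5)
    return float(np.sum(per_dim))

# The KL is zero when prior and posterior agree, and grows as the
# observation pulls the posterior mean away from the prior's guess.
same = kl_diag_gaussians(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3))
shifted = kl_diag_gaussians(np.ones(3), np.ones(3), np.zeros(3), np.ones(3))
```

Minimizing this term pushes the prior toward what the posterior learned from the observation, which is what allows the prior to stand in for the posterior when no observation is available.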

Why separate them?

| State | Role | Property |
| --- | --- | --- |
| $h_t$ | Memory | Deterministic, aggregates history |
| $z_t$ | Perception | Stochastic, expresses uncertainty |

After separation, the model can roll forward using only the prior $p(z_t \mid h_t)$ without real observations, enabling planning purely in imagination. This is the fundamental reason for Dreamer's sample efficiency.
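
A prior-only rollout can be sketched as follows. Everything here is illustrative: `TinyRSSM` uses random, untrained weights, a plain tanh RNN stands in for the GRU, and the prior is a diagonal Gaussian whose standard deviation comes from a softplus. None of these choices are claimed to match the Dreamer implementation; the point is only that the loop never touches an observation.

```python
import numpy as np

class TinyRSSM:
    """Hypothetical, untrained RSSM with a tanh RNN in place of a GRU."""

    def __init__(self, h_dim, z_dim, a_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_h = rng.normal(0.0, 0.1, (h_dim, h_dim + z_dim + a_dim))
        self.W_mu = rng.normal(0.0, 0.1, (z_dim, h_dim))
        self.W_sigma = rng.normal(0.0, 0.1, (z_dim, h_dim))

    def update_h(self, h_prev, z_prev, a_prev):
        # h_t = f(h_{t-1}, z_{t-1}, a_{t-1}): the deterministic path.
        return np.tanh(self.W_h @ np.concatenate([h_prev, z_prev, a_prev]))

    def prior(self, h):
        # p(z_t | h_t): mean and standard deviation from h_t alone.
        mu = self.W_mu @ h
        sigma = np.log1p(np.exp(self.W_sigma @ h)) + 1e-3  # softplus keeps sigma > 0
        return mu, sigma

def imagine(model, h0, z0, actions, rng):
    """Roll the model forward using only the prior; no observation is used."""
    h, z = h0, z0
    trajectory = []
    for a in actions:
        h = model.update_h(h, z, a)
        mu, sigma = model.prior(h)
        z = mu + sigma * rng.standard_normal(mu.shape)  # sample z_t from the prior
        trajectory.append((h, z))
    return trajectory

model = TinyRSSM(h_dim=8, z_dim=4, a_dim=2)
rng = np.random.default_rng(0)
actions = [np.array([1.0, 0.0])] * 5
trajectory = imagine(model, np.zeros(8), np.zeros(4), actions, rng)
```

During training, the sampling line would instead draw $z_t$ from the posterior, which also sees $o_t$; swapping that one line is the whole difference between learning from data and imagining.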

The PlaNet paper (Hafner et al., ICML 2019) verified this design through ablation studies, which remove one component of the model at a time and measure the change in performance to confirm that the component is necessary. A purely stochastic path (no deterministic $h_t$) struggles to retain information reliably across multiple steps: in principle some dimensions could collapse to near-zero variance and act as long-term memory, but optimization may fail to find such solutions. A purely deterministic path (no stochastic $z_t$) cannot express the inherent stochasticity of the environment, so the distribution gap between imagined and real trajectories grows. Both paths are indispensable. The observation model is therefore conditioned on both $h_t$ and $z_t$: $o_t \sim p(o_t \mid h_t, z_t)$, with deterministic memory and stochastic perception jointly determining the reconstructed image.


Comparison of Three Dynamics Models

| Model | Uncertainty Modeling | Memory Mechanism | Primary Use |
| --- | --- | --- | --- |
| GRU | None (deterministic output) | Fixed-dimension hidden state $h_t$ | Simple sequence prediction, rapid prototyping |
| MDN-RNN | Mixture of Gaussians (multimodal) | Fixed-dimension hidden state $h_t$ | Multimodal uncertainty, Ha & Schmidhuber M-module |
| RSSM | Separated prior/posterior (Gaussian) | Dual-track: deterministic $h_t$ + stochastic $z_t$ | Core of Dreamer, supports pure-imagination planning |

The three form a progression: GRU establishes the foundation for sequence modeling, MDN-RNN introduces uncertainty, and RSSM further decouples "memory" from "perceptual uncertainty", enabling the model to roll forward and plan without real observations.