Part B: Latent Dynamics

The Encoder Is Not Enough: We Need to Predict the Future

With a VAE encoder, we can compress the current frame $o_{t}$ into $z_{t}$. But the central task of a world model is predicting the future:

In latent space, given the current state $z_{t}$ and action $a_{t}$, predict the next state $z_{t + 1}$.

This prediction capability lets the agent "simulate" the future internally, enabling planning without actually executing actions in the environment. This is the key reason world models reduce sample complexity.

The Simplest Dynamics Model: GRU

The Gated Recurrent Unit (GRU) is a foundational tool for sequence modeling. As a dynamics model, the GRU takes $(z_{t}, a_{t})$ and predicts the next latent state:

$$z_{t + 1} = \mathrm{GRU}(z_{t}, a_{t}; θ)$$

📖 GRU internals (brief): A GRU controls information flow through two gates: the reset gate decides "how much of the past to forget", and the update gate decides "how much of the old state to retain vs. how much new information to write in". Gate values lie between 0 and 1, determined jointly by the current input and the previous hidden state. This allows the GRU to selectively retain long-term dependencies while discarding irrelevant information, making it better at handling longer sequences than a plain RNN. Compared to an LSTM, the GRU has one fewer gate (no separate memory cell), fewer parameters, and trains faster.

The GRU's strengths are simplicity and stable training. Its limitation is that it produces deterministic predictions and cannot express uncertainty. In real environments, the same action can lead to multiple different outcomes (for example, pushing a box might succeed or might get stuck).
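The gating described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any particular library: the class name `GRUDynamics`, the helper functions, the separate hidden state `h`, the linear readout from `h` to the predicted latent, and the random initialisation are all assumptions made for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, vec):
    """Compute W @ vec + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def init_linear(rng, n_out, n_in, scale=0.3):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class GRUDynamics:
    """Hypothetical GRU-style dynamics model: (z_t, a_t, h) -> (z_{t+1}, h')."""

    def __init__(self, latent_dim, action_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        n_in = latent_dim + action_dim + hidden_dim
        self.reset = init_linear(rng, hidden_dim, n_in)
        self.update = init_linear(rng, hidden_dim, n_in)
        self.candidate = init_linear(rng, hidden_dim, n_in)
        # Assumption: a linear readout maps the hidden state to the next latent.
        self.readout = init_linear(rng, latent_dim, hidden_dim)

    def step(self, z, a, h):
        x = list(z) + list(a)
        # Reset gate: how much of the past to forget when forming the candidate.
        r = [sigmoid(v) for v in linear(*self.reset, x + h)]
        # Update gate: how much of the old state to retain.
        u = [sigmoid(v) for v in linear(*self.update, x + h)]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(v) for v in linear(*self.candidate, x + gated_h)]
        # Blend the old state and the candidate according to the update gate.
        h_next = [ui * hi + (1.0 - ui) * ci for ui, hi, ci in zip(u, h, c)]
        z_next = linear(*self.readout, h_next)
        return z_next, h_next


model = GRUDynamics(latent_dim=4, action_dim=2, hidden_dim=8, seed=1)
h = [0.0] * 8
z = [0.5, -0.2, 0.1, 0.0]
for a in ([1.0, 0.0], [0.0, 1.0]):
    z, h = model.step(z, a, h)  # roll the latent forward one action at a time
```

Calling `step` twice with identical inputs returns identical outputs, which is the deterministic behaviour noted above: the model has no way to say that one action could lead to several different next states.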

MDN-RNN: Modeling Uncertainty

MDN-RNN (Mixture Density Network + RNN), proposed in Ha & Schmidhuber (2018), models uncertainty over the next state using a mixture of Gaussians:

$$p(z_{t + 1} \mid z_{t}, a_{t}) = \sum_{k = 1}^{K} π_{k} \cdot \mathcal{N}(z_{t + 1}; μ_{k}, σ_{k}^{2})$$

$K$ Gaussian components, each with its own mean $μ_{k}$ (center of the distribution) and variance $σ_{k}^{2}$ (width of the distribution)
Mixture weights $π_{k}$: the probability mass of the $k$-th Gaussian component, satisfying $\sum_{k = 1}^{K} π_{k} = 1$ and $π_{k} \geq 0$. These can be read as "the probability that the $k$-th future occurs". The network outputs one unnormalized score per component, and a softmax turns those scores into weights that are nonnegative and sum to 1.

MDN-RNN can capture multimodal distributions: the environment may transition to several distinct next states, and the model can represent all of them.
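As a sketch of how the mixture above behaves, the following plain-Python example evaluates and samples a one-dimensional Gaussian mixture. It assumes the network emits, per component, a logit, a mean, and a log standard deviation. That parameterisation, the function names, and the two-outcome numbers (a box that moves with probability 0.7 or stays stuck with probability 0.3) are illustrative assumptions, not details taken from Ha & Schmidhuber (2018).

```python
import math
import random


def softmax(logits):
    """Turn unnormalized scores into mixture weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_pdf(x, mean, std):
    var = std * std
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, logits, means, log_stds):
    """Density of sum_k pi_k * N(x; mu_k, sigma_k^2)."""
    weights = softmax(logits)
    return sum(w * gaussian_pdf(x, m, math.exp(s))
               for w, m, s in zip(weights, means, log_stds))


def sample_mixture(rng, logits, means, log_stds):
    """Pick a component according to pi, then sample from that Gaussian."""
    weights = softmax(logits)
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], math.exp(log_stds[k]))


# Hypothetical network outputs for "box moves" (z near 1) vs. "box stuck" (z near 0).
logits = [math.log(0.7), math.log(0.3)]
means = [1.0, 0.0]
log_stds = [math.log(0.05), math.log(0.05)]

rng = random.Random(0)
draws = [sample_mixture(rng, logits, means, log_stds) for _ in range(5000)]
```

The samples cluster around the two means rather than around their average, which is what a single Gaussian or a deterministic prediction could not represent.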

Figure: MDN-RNN architecture from Ha & Schmidhuber (2018), combining a mixture density network with an RNN. The RNN hidden state is passed through a fully connected layer to produce $K$ parameter groups $(π_{k}, μ_{k}, σ_{k})$, representing mixture weights, means, and standard deviations, which together define the Gaussian mixture distribution over the next latent state.

RSSM: Separating Deterministic and Stochastic Components

The RSSM (Recurrent State Space Model), introduced in PlaNet (Hafner et al., 2019), is the core dynamics model of the Dreamer series. It splits the state into two parts:

Deterministic hidden state $h_{t}$: maintained by an RNN, aggregating information from the historical trajectory, with no stochasticity
Stochastic latent state $z_{t}$: sampled from a distribution conditioned on $h_{t}$, expressing current uncertainty

Core equations of the RSSM:

📖 Subscript $ϕ$ (phi): the subscript $ϕ$ in $f_{ϕ}$, $p_{ϕ}$, $q_{ϕ}$ denotes "this function has parameters $ϕ$", i.e., the learnable weights of the neural network. $f_{ϕ}(\cdot)$ is read as "function $f$ parameterized by $ϕ$". During training, gradient descent updates $ϕ$ so that the predictions of these functions become increasingly accurate. Similarly, $θ$ (theta), which appears later, is another commonly used symbol for a distinct set of learnable parameters.

$h_{t} = f_{ϕ}(h_{t - 1}, z_{t - 1}, a_{t - 1})$ (deterministic update, GRU/RNN)

$z_{t} \sim p_{ϕ}(z_{t} \mid h_{t})$ (prior: no access to real observations, infers current state from history $h_{t}$ alone; used for pure imagination/prediction)

$z_{t} \sim q_{ϕ}(z_{t} \mid h_{t}, o_{t})$ (posterior: corrects the prior using real observation $o_{t}$; used during training)

📖 Prior vs. posterior: these are fundamental concepts in Bayesian statistics. The prior is "belief before seeing data": the RSSM's guess about the current state $z_{t}$ based on historical memory $h_{t}$. The posterior is "belief updated after seeing data": it refines the prior with the real observation $o_{t}$ to obtain a more accurate estimate. During training, $z_{t}$ is sampled from the posterior, and a KL loss penalizes the gap between the prior and the posterior. During imagination, there is no real $o_{t}$, so only the prior is available and the RSSM rolls forward using the prior alone.
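Assuming both the prior and the posterior are diagonal Gaussians, the KL term mentioned above has a closed form. The sketch below computes KL(q || p), from posterior to prior. The function name, the argument layout, and the example numbers are choices made for this illustration.

```python
import math


def kl_diag_gaussians(mean_q, std_q, mean_p, std_p):
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions.

    Per dimension: log(sp / sq) + (sq^2 + (mq - mp)^2) / (2 * sp^2) - 1/2
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mean_q, std_q, mean_p, std_p):
        total += (math.log(sp / sq)
                  + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
                  - 0.5)
    return total


# Hypothetical values: the prior (from h_t alone) is broad, while the
# posterior (after seeing o_t) is shifted and sharper in the first dimension.
prior_mean, prior_std = [0.0, 0.0], [1.0, 1.0]
post_mean, post_std = [0.8, -0.1], [0.3, 0.9]
kl = kl_diag_gaussians(post_mean, post_std, prior_mean, prior_std)
```

The value is zero when the two distributions coincide and grows as the observation moves the posterior away from what the prior expected, so minimizing it trains the prior to anticipate what the observation will reveal.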

Why separate them?

State	Role	Property
$h_{t}$	Memory	Deterministic, aggregates history
$z_{t}$	Perception	Stochastic, expresses uncertainty

After separation, the model can roll forward using only the prior $p (z_{t} | h_{t})$ without real observations, enabling planning purely in imagination. This is the fundamental reason for Dreamer's sample efficiency.
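The difference between a step with a real observation and a step in imagination can be sketched as follows. This is a toy under stated assumptions, not the Dreamer or PlaNet code: `TinyRSSM` and its methods are hypothetical names, $f_{ϕ}$ is replaced by a single tanh layer instead of a GRU, the prior and posterior are diagonal Gaussians produced by one linear layer each, and the observation `o` is assumed to be an already-encoded feature vector.

```python
import math
import random


def linear(weights, bias, vec):
    """Compute W @ vec + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def init_linear(rng, n_out, n_in, scale=0.3):
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class TinyRSSM:
    def __init__(self, h_dim, z_dim, a_dim, o_dim, seed=0):
        self.rng = random.Random(seed)
        self.z_dim = z_dim
        self.f = init_linear(self.rng, h_dim, h_dim + z_dim + a_dim)
        self.prior_net = init_linear(self.rng, 2 * z_dim, h_dim)
        self.post_net = init_linear(self.rng, 2 * z_dim, h_dim + o_dim)

    def _sample(self, stats):
        # The first half of the outputs are means, the second half log std devs.
        mean, log_std = stats[:self.z_dim], stats[self.z_dim:]
        return [self.rng.gauss(m, math.exp(s)) for m, s in zip(mean, log_std)]

    def update_h(self, h, z, a):
        """Deterministic path: h_t = f(h_{t-1}, z_{t-1}, a_{t-1})."""
        return [math.tanh(v) for v in linear(*self.f, h + z + a)]

    def prior(self, h):
        """z_t ~ p(z_t | h_t): no observation needed."""
        return self._sample(linear(*self.prior_net, h))

    def posterior(self, h, o):
        """z_t ~ q(z_t | h_t, o_t): uses the real observation."""
        return self._sample(linear(*self.post_net, h + o))

    def observe(self, h, z, a, o):
        h = self.update_h(h, z, a)
        return h, self.posterior(h, o)

    def imagine(self, h, z, actions):
        """Roll forward with the prior only, one step per action."""
        trajectory = []
        for a in actions:
            h = self.update_h(h, z, a)
            z = self.prior(h)
            trajectory.append((h, z))
        return trajectory


rssm = TinyRSSM(h_dim=6, z_dim=3, a_dim=2, o_dim=4, seed=0)
h, z = [0.0] * 6, [0.0] * 3
h, z = rssm.observe(h, z, [1.0, 0.0], [0.2, 0.1, -0.3, 0.5])  # one real step
dream = rssm.imagine(h, z, [[0.0, 1.0]] * 5)  # five imagined steps
```

Note that `imagine` never touches `post_net` or any observation: once the state has been initialised from real data, every later step depends only on the chosen actions and the learned prior.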

The PlaNet paper (Hafner et al., ICML 2019) verified this design through ablation studies, which systematically remove one component of the model and observe the change in performance to confirm that the component is necessary. A purely stochastic path (no deterministic $h_{t}$) struggles to retain information reliably across multiple steps: in principle the model could store long-term information by driving the variance of some dimensions to near zero, but optimization may fail to find such solutions. A purely deterministic path (no stochastic $z_{t}$) cannot express the inherent stochasticity of the environment, so the distribution gap between imagined and real trajectories grows. Both paths are indispensable. The observation model is therefore conditioned on both $h_{t}$ and $z_{t}$: $o_{t} \sim p(o_{t} \mid h_{t}, z_{t})$, with deterministic memory and stochastic perception jointly determining the reconstructed image.

Comparison of Three Dynamics Models

Model	Uncertainty Modeling	Memory Mechanism	Primary Use
GRU	None (deterministic output)	Fixed-dimension hidden state $h_{t}$	Simple sequence prediction, rapid prototyping
MDN-RNN	Mixture of Gaussians (multimodal)	Fixed-dimension hidden state $h_{t}$	Multimodal uncertainty, Ha & Schmidhuber M-module
RSSM	Separated prior/posterior (Gaussian)	Dual-track: deterministic $h_{t}$ + stochastic $z_{t}$	Core of Dreamer, supports pure-imagination planning

The three form a progression: GRU establishes the foundation for sequence modeling, MDN-RNN introduces uncertainty, and RSSM further decouples "memory" from "perceptual uncertainty", enabling the model to roll forward and plan without real observations.
