
Part B (Continued): Dreamer Series Architecture Iterations

Transformer Dynamics: From GRU to Sequence Modeling

The core limitation of the GRU comes from its information bottleneck: all historical information must be compressed into a fixed-dimensional hidden state $h_t$. The longer the sequence, the harder it becomes to retain early information, and long-range dependencies are easily lost. This is not a serious problem on short sequences of video game frames, but in tasks that require remembering events from dozens of steps ago to make correct decisions, the GRU's memory capacity becomes a hard constraint.

The Transformer takes a different approach. Instead of summarizing history with a single hidden state, it performs attention directly over the entire history of latent states. Each step's prediction can "look back" at any historical state, with no information compression bottleneck. The trade-off is that computation grows with context length, and inference memory usage is higher. The full principles and formulas of the Transformer self-attention mechanism are covered in the Transformer architecture section of Lecture 03.
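To make the "look back" idea concrete, here is a minimal sketch of attention over a history of latent states. It assumes the queries, keys, and values are the raw latent vectors themselves (no learned projections, a single head), and the function name `attend_over_history` is hypothetical. It is an illustration of the mechanism, not the implementation used by STORM or Dreamer V4.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_over_history(history):
    """Summarize a latent history by attending from the newest state.

    history: list of equal-length float lists, oldest first.
    Returns (context_vector, attention_weights).
    Assumption: no learned query/key/value projections, to keep this short.
    """
    query = history[-1]
    dim = len(query)
    # Scaled dot-product score between the newest state and every past state.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
              for key in history]
    weights = softmax(scores)
    # The context is a weighted average of all stored states.
    context = [sum(w * value[i] for w, value in zip(weights, history))
               for i in range(dim)]
    return context, weights

# Toy history of four 2-D latents; the oldest and newest states are identical.
history = [[2.0, 0.0], [0.0, 1.0], [0.0, -1.0], [2.0, 0.0]]
context, weights = attend_over_history(history)
print([round(w, 3) for w in weights])
```

In this toy case the oldest state receives the same weight as the newest one, because the weights depend on similarity rather than on distance in time. The cost is that every past state has to be kept in memory and scored at each step.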

STORM (2023) replaced the GRU backbone in RSSM with a Transformer, achieving measurable gains in prediction accuracy and policy return on long-sequence Atari tasks. Dreamer V4 (2025) made the same replacement and combined it with offline policy learning, making long-horizon imagined trajectories more coherent and reliable. Lecture 03 will use RSSM as a baseline and compare these two backbone types side by side across different task constraints.


Architecture Iterations of the Dreamer Series

RSSM is the foundational architecture established by Dreamer V1. The three subsequent versions evolved incrementally on top of it, with each iteration targeting a specific bottleneck of the previous version.

Dreamer V1 (2019) established the overall framework of RSSM plus latent space Actor-Critic, the structure described earlier in this lecture. It is the starting point for all subsequent versions.

Dreamer V2 (2020) replaced the continuous Gaussian $z_t$ with a discrete Categorical latent variable, which selects from a finite set of categories rather than sampling from a continuous real-valued space. To propagate gradients it used the straight-through estimator, a technique that lets gradients "pass through" a non-differentiable discrete sampling operation: the forward pass uses the discrete sample, while the backward pass treats the operation as an identity function so gradients flow through directly. Discrete latent variables produced two effects: training curves became notably more stable, and the semantic structure of the latent space became clearer. The dynamics backbone remained GRU, and the policy was still trained online.
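The straight-through idea can be sketched without an autodiff library by writing the forward and backward passes by hand. The function names below are hypothetical, and the sketch assumes a single categorical variable whose class probabilities come from a softmax over logits. It illustrates the estimator, not Dreamer V2's actual code.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def straight_through_forward(logits, rng):
    """Forward pass: the output is a discrete one-hot sample."""
    probs = softmax(logits)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    sample = [1.0 if i == index else 0.0 for i in range(len(probs))]
    return sample, probs

def straight_through_backward(grad_output, probs):
    """Backward pass: treat sampling as the identity function.

    The gradient w.r.t. the sample is passed unchanged to the probabilities,
    then pushed through the softmax Jacobian to reach the logits.
    """
    dot = sum(g * p for g, p in zip(grad_output, probs))
    return [p * (g - dot) for g, p in zip(grad_output, probs)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
logits = [1.0, 0.0, -1.0]
sample, probs = straight_through_forward(logits, rng)
# Suppose the downstream loss gradient w.r.t. the sample favors class 0.
grad_logits = straight_through_backward([1.0, 0.0, 0.0], probs)
print(sample, [round(g, 3) for g in grad_logits])
```

The sample itself stays strictly one-hot, yet the logits still receive a non-zero gradient. The estimator is biased, since the backward pass ignores the sampling step.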

Dreamer V3 (2023) changed the training recipe rather than the architecture. It relies on two key techniques. The symlog transform (symmetric log) applies symmetric logarithmic compression, $\operatorname{symlog}(x) = \operatorname{sign}(x)\,\ln(|x| + 1)$, so that rewards of vastly different magnitudes fall into a comparable numerical range and extreme reward values do not dominate the gradients. Percentile normalization uses the 5th and 95th percentiles of the return distribution as scaling references rather than fixed min/max values, which makes normalization robust to outliers and decouples reward scaling from the choice of units. The result is that a single set of hyperparameters can be run directly on the full Atari suite, DMControl, and Minecraft without per-task tuning. Training an agent from scratch in Minecraft that can mine diamonds is the landmark result of this version, and it shows that the GRU backbone still has untapped potential given a sufficiently robust training recipe.
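Both techniques fit in a few lines. The sketch below assumes linear-interpolation percentiles and a lower bound of 1 on the scale so that small ranges are not amplified. The helper names `percentile` and `percentile_scale` are hypothetical, and `symexp` is the inverse of symlog, included so that compressed predictions can be mapped back.

```python
import math

def symlog(x):
    """sign(x) * ln(|x| + 1): compresses large magnitudes, keeps the sign."""
    return math.copysign(math.log(abs(x) + 1.0), x)

def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.exp(abs(y)) - 1.0, y)

def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

def percentile_scale(values, low_q=5.0, high_q=95.0, floor=1.0):
    """Range between two percentiles, floored (assumption) at `floor`."""
    return max(floor, percentile(values, high_q) - percentile(values, low_q))

# Values spanning four orders of magnitude land in a narrow range.
print([round(symlog(r), 2) for r in [-1000.0, -1.0, 0.0, 1.0, 1000.0]])

# One extreme outlier barely moves the percentile-based scale.
values = [float(v) for v in range(101)]
print(percentile_scale(values), percentile_scale(values + [1e6]))
```

Dividing by the percentile range instead of the min/max range means a single extreme value cannot shrink every other signal toward zero.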

Dreamer V4 (2025) is a qualitative architectural change rather than a recipe adjustment. The dynamics core switches from GRU to Transformer, which lets the world model use longer contexts and improves long-horizon prediction accuracy. Policy learning also switches from online Actor-Critic to offline policy learning: the policy is trained entirely from stored imagined trajectories, without real-time interaction with the environment and without relying on online rollouts. The distinction is that online learning updates the policy while interacting, whereas offline learning uses only a fixed dataset. This design is close in philosophy to STORM (Zhang et al., 2023) and IRIS (Micheli et al., 2022), both introduced in Lecture 03. In a sense, Dreamer V4 represents the GRU camp's formal convergence toward the Transformer camp.

| Version | Dynamics Core | Latent Variable Type | Policy Learning | Key Advance |
|---|---|---|---|---|
| V1 | GRU | Continuous Gaussian | Online Actor-Critic | RSSM architecture established |
| V2 | GRU | Discrete Categorical | Online Actor-Critic | Discrete latent variables, stable training |
| V3 | GRU | Discrete Categorical | Online Actor-Critic | Single hyperparameter set across domains, Minecraft benchmark |
| V4 | Transformer | Discrete Categorical | Offline policy learning | Architectural shift, long-horizon reasoning |

Each version targets a specific bottleneck of its predecessor rather than redesigning the whole system.

Figure: PlaNet open-loop state diagnostics from Hafner et al. (2019). The RSSM dynamics model is frozen, and small neural networks are trained to predict the simulator's ground-truth positions, velocities, and reward from the learned latent states. These predictions remain accurate over long horizons, further than the planning horizons used in the paper, which indicates that the latent space captures most of the information present in the underlying system.

The Encoder's Role as a Bridge in Dreamer

The encoder is more than a compression tool. It is the bridge connecting the pixel world to the latent dynamics world. The complete Dreamer pipeline:

  1. Encode: $o_t \xrightarrow{\text{encoder}} z_t$
  2. Dynamics: $(z_t, a_t) \xrightarrow{\text{RSSM}} z_{t+1}, z_{t+2}, \ldots$ (pure imagination)
  3. Policy learning: train Actor-Critic on imagined trajectories, without interacting with the real environment
  4. Execution: apply the policy to the real environment, collect a small number of new samples, and iterate
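Steps 1 to 3 of the pipeline above can be sketched with toy stand-ins. Everything here is an assumption made for illustration: the latent is a single float, `encode` and `dynamics` are fixed hand-written functions rather than learned networks, and `improve_policy` replaces Actor-Critic training with a search over a few candidate policy gains. Step 4 (acting in the real environment and collecting new samples) is omitted.

```python
def encode(observation):
    """Hypothetical encoder: maps an observation to a latent (a float here)."""
    return 0.1 * observation

def dynamics(latent, action):
    """Hypothetical latent dynamics that also predicts a reward."""
    next_latent = 0.9 * latent + action
    reward = -abs(next_latent)  # toy objective: drive the latent toward zero
    return next_latent, reward

def policy(latent, gain):
    """Toy linear policy with a single parameter."""
    return -gain * latent

def imagine(start_latent, gain, horizon):
    """Roll the model forward in latent space, with no environment calls."""
    latent, total_reward = start_latent, 0.0
    for _ in range(horizon):
        latent, reward = dynamics(latent, policy(latent, gain))
        total_reward += reward
    return total_reward

def improve_policy(start_latent, horizon, candidates):
    """Stand-in for Actor-Critic: keep the gain with the best imagined return."""
    return max(candidates, key=lambda g: imagine(start_latent, g, horizon))

z0 = encode(10.0)                                    # step 1: encode
best_gain = improve_policy(z0, horizon=5,
                           candidates=[0.0, 0.3, 0.6, 0.9])  # steps 2 and 3
print(best_gain)
```

The policy parameter is chosen using imagined rollouts only. How good that choice is in the real environment depends entirely on how well `encode` and `dynamics` match it, which is the point made in the next paragraph about encoder quality.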

The quality of the encoder directly determines the upper bound of RSSM: the more semantically clear the latent space, the easier it is for the dynamics model to learn meaningful transition patterns.


Summary

| Concept | Role | Key Equation / Structure |
|---|---|---|
| VAE encoder | Compress pixels to $z$ | ELBO = reconstruction log-likelihood − KL divergence |
| GRU dynamics | Deterministic prediction of next state | $z_{t+1} = \mathrm{GRU}(z_t, a_t)$ |
| MDN-RNN | Model multimodal uncertainty | Mixture-of-Gaussians output distribution |
| RSSM | Separate deterministic/stochastic state | $h_t$ (memory) + $z_t$ (perception) |
| Transformer dynamics | Global attention replacing fixed hidden state | $h_t = \mathrm{Attention}(z_{1:t}, a_{1:t-1})$ |
| Dreamer series | Stepwise evolution from V1 to V4 | GRU to Transformer, continuous to discrete latent, online to offline policy |

A good world model equals a good encoder (perceptual compression) plus a good dynamics model (temporal prediction). RSSM achieves an elegant balance between expressiveness and computational efficiency by separating the two types of state. The evolution across the four Dreamer versions shows that beyond the architecture itself, the type of latent variable and the training recipe are equally decisive factors.


Next Lecture

The question for Lecture 03 is: RSSM is not the only option. How do Transformer-backbone world models (STORM, IRIS) perform on long-sequence tasks, and where does Dreamer V4 stand relative to them after switching to a Transformer?

After completing P01 and P02, you have a working RSSM baseline. Lecture 03 uses it as an anchor to compare six architecture families side by side, including Transformer dynamics, diffusion models, and JEPA, and explains where Dreamer V4 sits on that map. The comparison is not a ranking of better versus worse, but a map of where each architecture applies given different task constraints.


Further Reading