P05: World Model Evaluation Dashboard

Format: Jupyter notebook (p05_evaluation_dashboard.ipynb)
Prerequisites: P03, P04, L04


What You Will Build

A diagnostic notebook that loads both trained world models, runs them on the same held-out environment episodes, computes L04 metrics for each, and renders the results side by side using inline plots. The goal is not to declare a winner but to make the characteristic failure modes of each architecture visible.


Notebook Sections

Section 1: Load Both Models

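Load the P03 Dreamer checkpoint (encoder + RSSM + Actor) and the P04 Transformer world model. Freeze both: put them in evaluation mode and disable gradient updates. Then run a set of held-out episodes in the environment to collect the ground-truth trajectories that both models will be evaluated against.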
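Freezing can be sketched as follows. This is a minimal sketch that assumes both models are PyTorch `nn.Module` objects, which this page does not state; the `freeze` helper and the stand-in network are hypothetical and only illustrate the pattern, not the P03 or P04 code.

```python
import torch
from torch import nn


def freeze(model: nn.Module) -> nn.Module:
    """Put a module in eval mode and disable gradients for all its parameters."""
    model.eval()  # switches off dropout and batch-norm statistic updates
    for param in model.parameters():
        param.requires_grad_(False)
    return model


# Hypothetical stand-in network. In the notebook this would be the modules
# restored from the P03 and P04 checkpoints (assumption: PyTorch modules).
stand_in = freeze(nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4)))

# Run inference without building an autograd graph.
with torch.no_grad():
    out = stand_in(torch.zeros(2, 8))
```

Calling `eval()` and disabling gradients are separate steps: the first changes layer behaviour, the second prevents accidental parameter updates and saves memory during rollouts.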

Section 2: Per-Model Metrics

Compute for each model on the held-out set:

| Metric | Model | What it measures |
| --- | --- | --- |
| Reward correlation ρ | Dreamer | How well imagined rewards track real rewards |
| Token prediction loss | Transformer | One-step latent prediction accuracy |
| Long-horizon PSNR (steps 1, 3, 5, 10) | Both | How fast prediction quality degrades |
| Latent drift | Both | Divergence between imagined and real latent trajectories |
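Three of the metrics above can be sketched with NumPy. This is an illustrative sketch under stated assumptions, not the L04 reference implementation: frames are assumed to be float arrays scaled to [0, 1], ρ is taken to be the Pearson correlation, and latent drift is taken to be the per-step Euclidean distance between imagined and real latents within each model's own latent space. All function names are hypothetical.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((np.asarray(pred) - np.asarray(target)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


def psnr_at_horizons(pred_frames, true_frames, horizons=(1, 3, 5, 10)):
    """PSNR at selected horizon steps.

    Both inputs have shape (T, H, W, C); index 0 holds horizon step 1.
    """
    return {h: psnr(pred_frames[h - 1], true_frames[h - 1]) for h in horizons}


def reward_correlation(imagined_rewards, real_rewards):
    """Pearson correlation between imagined and real reward sequences."""
    return float(np.corrcoef(imagined_rewards, real_rewards)[0, 1])


def latent_drift(imagined_latents, real_latents):
    """Per-step L2 distance between two (T, D) latent trajectories."""
    diff = np.asarray(imagined_latents) - np.asarray(real_latents)
    return np.linalg.norm(diff, axis=-1)


# Synthetic check only: the "prediction" is ground truth plus noise whose
# scale grows with the horizon step. These are not model outputs.
rng = np.random.default_rng(0)
true_frames = rng.random((10, 16, 16, 3))
noise_scale = np.linspace(0.01, 0.1, 10).reshape(10, 1, 1, 1)
pred_frames = true_frames + noise_scale * rng.standard_normal(true_frames.shape)
psnr_by_step = psnr_at_horizons(pred_frames, true_frames)
```

On the synthetic data the PSNR falls as the horizon grows, which is the shape of curve the long-horizon metric is meant to expose. Averaging each metric over all held-out episodes before comparing models keeps single-episode noise out of the table.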

Section 3: Side-by-Side Plots

Render in the notebook:

  • PSNR vs horizon step: both models on one set of axes
  • Latent drift curves: both models on one set of axes
  • Decoded frame sequences: ground truth, Dreamer imagined, Transformer imagined (a grid of frames at steps 1, 5, 10)
  • Summary metrics table printed inline
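The shared-axes plots can be sketched with Matplotlib. The helper name `plot_curves` is hypothetical, and the numbers in the usage example are placeholders chosen only to exercise the function; they are not measurements from either model.

```python
import matplotlib.pyplot as plt


def plot_curves(curves, steps, ylabel, ax=None):
    """Draw one line per model on a single set of axes.

    curves: dict mapping a model name to a sequence of values, one per step.
    steps:  horizon steps shared by all models.
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 3.5))
    for name, values in curves.items():
        ax.plot(steps, values, marker="o", label=name)
    ax.set_xlabel("Horizon step")
    ax.set_ylabel(ylabel)
    ax.set_xticks(list(steps))
    ax.legend()
    return ax


# Placeholder values, NOT real results: replace with the Section 2 metrics.
placeholder_psnr = {
    "Dreamer": [0.0, 0.0, 0.0, 0.0],
    "Transformer": [1.0, 1.0, 1.0, 1.0],
}
ax = plot_curves(placeholder_psnr, steps=[1, 3, 5, 10], ylabel="PSNR (dB)")
```

The same helper can be reused for the latent drift curves by passing a different `ylabel`. Putting both models on one set of axes keeps the y-scale shared, so a gap between the curves reads directly as a difference in prediction quality.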

Section 4: Diagnostic Summary

For each metric where the two models differ substantially, write one markdown cell explaining the architectural reason. Use the L04 framework: which failure mode does this metric surface, and which design choice explains the gap?


Deliverables

  • Completed notebook with all cells executed and all plots rendered inline
  • Metrics table with values filled in for both models
  • PSNR and latent drift plots
  • Diagnostic summary (5-10 sentences in markdown cells)

Reference

Per-model metrics and interpretation: L04. Horizon drift and mitigation strategies: L04.