ICLR 2025 Oral | Playing Minecraft with Pure Vision on a Single 3090: LS-Imagine for Reinforcement Learning over Long Short-Term Imagination in Open Worlds

LS-Imagine plays Minecraft from pure visual observation, as a human player does, learning its RL control policy without cheats or privileged information.

Training visual reinforcement learning agents in high-dimensional open worlds presents numerous challenges. Model-based reinforcement learning (MBRL) methods improve sample efficiency by learning interactive world models, but the resulting agents are often "myopic" because they are typically trained only on short imagined experience fragments. We argue that the primary challenge in open-world decision-making is improving exploration efficiency in vast state spaces, particularly for tasks that require reasoning about long-term rewards. We therefore propose LS-Imagine, a reinforcement learning method built on a Long Short-Term World Model that simulates goal-driven, jump-style state transitions and computes the corresponding affordance maps by zooming into specific regions of single images. This lets the agent extend its imagination horizon within a limited number of state transition steps and explore behaviors that may yield favorable long-term rewards.

Paper Title: Open-World Reinforcement Learning over Long Short-Term Imagination
Authors: Jiajian Li*, Qi Wang*, Yunbo Wang (corresponding author), Xin Jin, Yang Li, Wenjun Zeng, Xiaokang Yang (* equal contribution)
Project Page: https://qiwang067.github.io/ls-imagine
Paper Link: https://openreview.net/pdf?id=vzItLaEoDa
Code Link: https://github.com/qiwang067/LS-Imagine

1. Introduction

In the context of reinforcement learning, decision-making in open worlds exhibits the following characteristics:

Vast State Space: The agent operates in an interactive environment with an enormous state space.
Highly Flexible Policy: The learned policy possesses high flexibility, enabling the agent to interact with various objects in the environment.
Environmental Perception Uncertainty: The agent cannot fully observe the internal state and physical dynamics of the external world, meaning its perception of the environment (e.g., raw images) typically carries significant uncertainty.

For example, Minecraft is a typical open-world game that satisfies the above properties.

Based on recent advances in visual control, the goal of open-world decision-making is to train agents to approach human-level intelligence using only high-dimensional visual observations. However, this also brings numerous challenges. For instance, in Minecraft tasks:

High-level API-based methods (such as Voyager) perform high-level control through environment-specific APIs, which do not conform to standard visual control settings, limiting generalization ability and applicability.
Model-free reinforcement learning methods (such as DECKARD) lack understanding of the underlying mechanisms of the environment, relying primarily on costly trial-and-error exploration, resulting in low sample efficiency and poor exploration effectiveness.
Model-based reinforcement learning methods (such as DreamerV3), while improving sample efficiency, exhibit "myopia" problems due to optimizing policies only on short-term experience, making effective long-term exploration difficult.

To improve the efficiency of behavior learning in model-based reinforcement learning, we propose a novel method: LS-Imagine. The core of this method lies in enabling the world model to efficiently simulate the long-term impact of specific behaviors without repeatedly performing step-by-step predictions.

Figure 1: Overall framework of LS-Imagine

As shown in Figure 1, the core of LS-Imagine lies in training a Long Short-Term World Model that integrates task-specific guidance during representation learning. After training, the world model can perform both immediate state transitions and jump-style state transitions, while generating corresponding intrinsic rewards, thereby optimizing the policy in a joint space of short-term and long-term imagination. Jump-style state transitions enable the agent to bypass intermediate states and directly simulate task-relevant future states $s_{t + H}$ in a single imagination step, encouraging the agent to explore behaviors that may yield favorable long-term rewards.

However, this approach raises a classic "chicken-and-egg" problem:

Without real data showing that the agent has already achieved the goal, how can we effectively train the model to simulate jump-style transitions from the current state to future states highly correlated with the goal?

To address this problem, we repeatedly zoom in on specific regions of an observation image to simulate the consecutive video frames the agent would observe while approaching that region. We then assess the correlation between these simulated frames and the task's text description, producing affordance maps that highlight regions of the observation likely to be relevant to the task. Building on this, we interact with the environment to collect a dataset of image observation pairs from adjacent time steps as well as image pairs spanning longer time intervals, and train dedicated branches of the world model to perform immediate and jump-style state transitions, respectively. Once the world model is trained, we use it to generate sequences of imagined latent states for optimizing the agent's policy. During decision-making, jump-style state transitions make it possible to estimate long-term rewards directly, which strengthens the agent's decision-making capability.

2. Main Innovations and Contributions

We propose a novel model-based reinforcement learning method capable of simultaneously performing immediate state transitions and jump-style state transitions, applying them to behavior learning to improve the agent's exploration efficiency in open worlds.

LS-Imagine brings the following four specific contributions:

A world model architecture combining long-term and short-term components.
A method for generating affordance maps by simulating exploration processes through image zooming.
A novel intrinsic reward mechanism based on affordance maps.
An improved behavior learning method that incorporates long-term value estimation and operates on mixed long-short-term imagination sequences.

3. Method

LS-Imagine includes the following key algorithmic steps:

1. Affordance Map Computation

As shown in Figure 2, to generate affordance maps, we simulate and evaluate the agent's exploration process without relying on real successful trajectories.

Specifically, for a single-frame observation image, we use a sliding bounding box to scan the entire observation image from left to right and top to bottom. For each position of the sliding bounding box, we crop 16 images starting from the original image, narrowing the field of view to focus on the region where the bounding box is located, and resize them back to the original image size, obtaining 16 consecutive frames to simulate the visual changes as the agent moves toward the region indicated by the bounding box.

Subsequently, we use the pre-trained MineCLIP model to evaluate the correlation between the simulated exploration video and the task text description, using this as the potential exploration value of that region. After the sliding bounding box scans the entire image, we fuse the correlation values from all bounding box positions to generate a complete affordance map, providing guidance for the agent's exploration.

📖 MineCLIP (Fan et al., 2022): a CLIP-style model (see L01 for CLIP) pretrained specifically on Minecraft gameplay videos paired with narration text, so that it can score how well a short video clip matches a natural-language task description. Here it is repurposed as a reward signal: instead of classifying images, it evaluates whether a simulated exploration trajectory is moving toward the goal described in text.
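The zoom-and-score procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `score_fn` is a hypothetical stand-in for MineCLIP's video-text correlation, the linear crop schedule and nearest-neighbour resizing are illustrative choices, and fusing overlapping windows by averaging is an assumption.

```python
import numpy as np

def zoom_in_frames(image, box, num_frames=16):
    """Simulate approaching `box` = (top, left, height, width) by cropping
    progressively tighter windows and resizing each crop back to full size
    (nearest-neighbour sampling)."""
    h, w = image.shape[:2]
    top, left, bh, bw = box
    frames = []
    for k in range(num_frames):
        a = k / max(num_frames - 1, 1)       # 0: full view, 1: exactly the box
        crop_top, crop_left = a * top, a * left
        crop_h, crop_w = h + a * (bh - h), w + a * (bw - w)
        rows = (crop_top + (np.arange(h) + 0.5) * crop_h / h).astype(int)
        cols = (crop_left + (np.arange(w) + 0.5) * crop_w / w).astype(int)
        rows = np.clip(rows, 0, h - 1)
        cols = np.clip(cols, 0, w - 1)
        frames.append(image[rows][:, cols])
    return np.stack(frames)

def affordance_map(image, score_fn, box_size, stride):
    """Slide a box left-to-right, top-to-bottom; score each simulated clip and
    average the scores of all boxes covering a pixel (assumed fusion rule)."""
    h, w = image.shape[:2]
    bh, bw = box_size
    total = np.zeros((h, w))
    count = np.zeros((h, w))
    for top in range(0, h - bh + 1, stride):
        for left in range(0, w - bw + 1, stride):
            clip = zoom_in_frames(image, (top, left, bh, bw))
            total[top:top + bh, left:left + bw] += score_fn(clip)
            count[top:top + bh, left:left + bw] += 1
    return total / np.maximum(count, 1)

# Toy usage: a bright patch plays the role of the task-relevant target, and the
# hypothetical scorer simply measures brightness of the final (closest) frame.
image = np.zeros((32, 32))
image[20:28, 4:12] = 1.0
toy_score = lambda clip: float(clip[-1].mean())
amap = affordance_map(image, toy_score, box_size=(8, 8), stride=4)
```

In this toy setting the map peaks over the bright patch, which mirrors how the real procedure would highlight regions whose simulated approach video matches the task text.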

2. Fast Affordance Map Generation

The affordance map computation in step 1 traverses many window positions and runs a pre-trained video-text alignment model at each one. This is computationally intensive and time-consuming, which makes it hard to apply in real time. To address this, we designed a multimodal U-Net based on Swin-Unet and trained it on data annotated by the virtual-exploration method described above, using those affordance maps as supervision signals. The trained network efficiently generates an affordance map at each time step from the visual observation and the language instruction, as shown in Figure 3.

Figure 3: Efficient affordance map generation using multimodal U-Net

3. Computing Intrinsic Rewards from Affordance Maps and Assessing the Necessity of Jump-Style State Transitions

As shown in Figure 4, to leverage the task-relevant prior knowledge provided by affordance maps, we compute the mean of element-wise multiplication between the affordance map and a 2D Gaussian matrix of the same size, using it as the affordance-driven intrinsic reward. This reward encourages the agent to continuously approach the target and align it in the center of the view.
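The reward computation above is simple enough to sketch directly. The following is a minimal illustration, assuming a Gaussian centered on the image whose width is set by a hypothetical `sigma_frac` parameter (the text does not specify the Gaussian's parameters).

```python
import numpy as np

def gaussian_matrix(h, w, sigma_frac=0.25):
    """2D Gaussian centered on the image; sigma_frac is an assumed parameter
    giving the standard deviation as a fraction of each image side."""
    ys = np.arange(h) - (h - 1) / 2
    xs = np.arange(w) - (w - 1) / 2
    sy, sx = sigma_frac * h, sigma_frac * w
    return np.exp(-0.5 * ((ys[:, None] / sy) ** 2 + (xs[None, :] / sx) ** 2))

def affordance_reward(amap, sigma_frac=0.25):
    """Mean of the element-wise product of the affordance map and the Gaussian."""
    g = gaussian_matrix(amap.shape[0], amap.shape[1], sigma_frac)
    return float(np.mean(amap * g))

# Toy usage: the same high-affordance blob placed at the center vs. a corner.
centered = np.zeros((64, 64))
centered[28:36, 28:36] = 1.0
cornered = np.zeros((64, 64))
cornered[0:8, 0:8] = 1.0
print(affordance_reward(centered), affordance_reward(cornered))
```

Because the Gaussian weights the center most heavily, the centered blob earns the larger reward, which is the behavior the text describes: approach the target and keep it in the middle of the view.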

Furthermore, to assess the necessity of jump-style transitions during imagination, we introduce a jumping flag. As shown in Figure 5, when a distant task-relevant target appears in the agent's observation, it manifests as highly concentrated high-value regions on the affordance map, which also causes the kurtosis of the affordance map to increase significantly. In such cases, the agent should adopt jump-style state transitions (also called long-term transitions) to efficiently reach the target region.

Figure 5: Assessment of jump-style state transition necessity
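To make the kurtosis criterion concrete, here is a minimal sketch that computes the (Pearson) kurtosis of an affordance map and thresholds it. The fixed `threshold` value and the plain thresholding rule are assumptions for illustration; the text only states that concentrated high-value regions raise the kurtosis.

```python
import numpy as np

def kurtosis(amap):
    """Pearson kurtosis (fourth standardized moment) of all map values."""
    x = np.asarray(amap, dtype=float).ravel()
    var = x.var()
    if var == 0.0:
        return 0.0  # a constant map has no peak to jump toward
    return float(np.mean((x - x.mean()) ** 4) / var ** 2)

def jumping_flag(amap, threshold=10.0):
    """Hypothetical rule: request a jump-style transition when the map is
    sharply peaked, i.e. its kurtosis exceeds an assumed threshold."""
    return kurtosis(amap) > threshold

# A small, distant target yields a sharply peaked map ...
peaked = np.zeros((64, 64))
peaked[30:33, 30:33] = 1.0
# ... while a map with evenly spread values has low kurtosis.
diffuse = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64)
print(kurtosis(peaked), kurtosis(diffuse))
```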

4. Long Short-Term World Model

In LS-Imagine, the world model needs to simultaneously support immediate state transitions (short-term state transitions) and jump-style state transitions (long-term state transitions). Therefore, as shown in Figure 6 (a), we designed short-term and long-term branches in the state transition model. The short-term state transition model combines the current state and action to perform single-step immediate state transitions to predict the next adjacent time step's state. The long-term transition model simulates goal-oriented jump-style state transitions, guiding the agent to rapidly imagine exploration toward the goal. The agent can decide which type of transition to adopt based on the current state and predict the next state through the selected transition branch.

Figure 6: Long short-term world model architecture and behavior learning based on long short-term imagination

Unlike traditional world model architectures, we specifically designed a Jump predictor to determine which type of transition should be performed based on the current state. Additionally, for jump-style state transitions, we designed an Interval predictor to estimate the number of environment time steps $\hat{\Delta}'_{t}$ between the states before and after the jump, as well as the cumulative discounted reward $\hat{G}'_{t}$ during that period, which will be used to estimate long-term rewards in subsequent behavior learning. Furthermore, we also input the affordance map $M_{t}$ to the encoder, which provides goal-based prior guidance for the agent and makes the decision-making process more effective.

Based on this architecture, the agent interacts with the environment and collects new data. Sample pairs from adjacent time steps correspond to short-term state transitions, while sample pairs spanning longer time intervals, constructed with the help of affordance maps, correspond to long-term state transitions. We use this data to update the replay buffer and sample from it to train the long short-term world model.
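The interplay between the two transition branches and the jump and interval predictors can be sketched with toy stand-ins. Everything here is hypothetical: the latent state is reduced to a single number (distance to the target), every callable is a hand-written placeholder for a learned network, and treating a short-term step as an interval of 1 whose reward sum is the one-step reward is an assumption.

```python
from dataclasses import dataclass

@dataclass
class ImaginedStep:
    state: float       # toy 1-D latent state: distance to the target
    jumped: bool       # True if the long-term branch produced this state
    interval: int      # predicted environment steps elapsed
    reward_sum: float  # predicted discounted reward accumulated over the interval

def rollout(state, horizon, policy, short_step, long_step,
            jump_predictor, interval_predictor, reward_predictor):
    """Imagine `horizon` transitions, choosing a branch at every step."""
    steps = []
    for _ in range(horizon):
        jumped = jump_predictor(state)
        if jumped:
            # Long-term branch: skip intermediate states toward the goal.
            nxt = long_step(state)
            interval, reward_sum = interval_predictor(state, nxt)
        else:
            # Short-term branch: ordinary one-step, action-conditioned transition.
            nxt = short_step(state, policy(state))
            interval, reward_sum = 1, reward_predictor(nxt)
        steps.append(ImaginedStep(nxt, jumped, interval, reward_sum))
        state = nxt
    return steps

# Toy stand-ins (all hypothetical).
trajectory = rollout(
    state=12.0,
    horizon=4,
    policy=lambda s: 1.0,                          # always walk one unit
    short_step=lambda s, a: max(s - a, 0.0),
    long_step=lambda s: 2.0,                       # land close to the target
    jump_predictor=lambda s: s > 5.0,              # jump only when far away
    interval_predictor=lambda s, n: (int(s - n), 0.0),
    reward_predictor=lambda s: 1.0 if s == 0.0 else 0.0,
)
for step in trajectory:
    print(step)
```

The single jump covers ten environment steps in one imagination step, so a horizon of four imagined transitions spans thirteen environment steps in this toy run.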

5. Behavior Learning on Long Short-Term Imagination Sequences

As shown in Figure 6 (b), LS-Imagine employs an actor-critic algorithm to learn behavior through latent state sequences predicted by the world model. The actor's objective is to optimize the policy to maximize the discounted cumulative reward $R_{t}$, while the critic's role is to estimate the discounted cumulative reward for each state based on the current policy.

Figure 7: Dynamically selecting long-term or short-term transition models to predict long short-term imagination sequences

As shown in Figure 7, starting from the initial state encoded from sampled observations and affordance maps, we dynamically select the long-term or short-term state transition model to predict each subsequent state, based on the jumping flag $\hat{j}_{t}$ predicted by the jump predictor. For a long short-term imagination sequence with imagination horizon $L$, the predictors in the world model provide the reward $\hat{r}_{t}$ for each state, the continue flag $\hat{c}_{t}$, the number of environment time steps $\hat{\Delta}_{t}$ between adjacent states, and the cumulative discounted reward $\hat{G}_{t}$ over that interval. We then compute the discounted cumulative reward for each state with a modified bootstrapped $\lambda$-return that combines long-term and short-term imagination:

$$
R_{t}^{\lambda} \doteq \begin{cases} \hat{c}_{t} \left\{ \hat{G}_{t+1} + \gamma^{\hat{\Delta}_{t+1}} \left[ (1 - \lambda)\, v_{\psi}(\hat{s}_{t+1}) + \lambda R_{t+1}^{\lambda} \right] \right\} & \text{if } t < L \\ v_{\psi}(\hat{s}_{L}) & \text{if } t = L \end{cases},
$$

and employ the actor-critic algorithm for behavior learning.
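The recursion above can be evaluated backward from the end of the imagined sequence. The sketch below is a plain-Python illustration of that formula only; the default values of `gamma` and `lam` are assumptions, and the list layout (index `t` from 0 to `L`, with entry 0 of `reward_sums` and `intervals` unused) is a convention chosen here.

```python
def long_short_lambda_returns(values, reward_sums, intervals, continues,
                              gamma=0.997, lam=0.95):
    """Backward recursion for the long short-term lambda-return.

    values[t]      : critic estimate v(s_t), t = 0..L
    reward_sums[t] : predicted cumulative discounted reward G_t (entry 0 unused)
    intervals[t]   : predicted environment steps Delta_t (entry 0 unused)
    continues[t]   : predicted continue flag c_t, t = 0..L-1
    """
    L = len(values) - 1
    returns = [0.0] * (L + 1)
    returns[L] = values[L]                      # case t = L: bootstrap from critic
    for t in range(L - 1, -1, -1):              # case t < L
        bootstrap = (1 - lam) * values[t + 1] + lam * returns[t + 1]
        discount = gamma ** intervals[t + 1]    # a jump discounts over its whole interval
        returns[t] = continues[t] * (reward_sums[t + 1] + discount * bootstrap)
    return returns

# Toy usage: a short-term step (interval 1) followed by a jump spanning 3 steps.
example = long_short_lambda_returns(
    values=[0.0, 0.0, 2.0],
    reward_sums=[0.0, 1.0, 1.0],
    intervals=[1, 1, 3],
    continues=[1.0, 1.0],
    gamma=0.5,
    lam=1.0,
)
print(example)  # [1.625, 1.25, 2.0]
```

When every interval equals 1 and each reward sum is a one-step reward, the recursion has the same form as the familiar bootstrapped $\lambda$-return; the exponent on $\gamma$ is what accounts for the environment steps skipped by a jump.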

4. Experimental Results

We conducted experiments in the Minecraft game environment to test the LS-Imagine agent. We set up 5 open-ended tasks as shown in Table 1 for experimentation:

Task	Language Description	Max Steps
Collect logs in plains	"Cut a tree."	1000
Collect water with bucket	"Obtain water."	1000
Collect sand	"Obtain sand."	1000
Shear sheep	"Obtain wool."	1000
Mine iron ore	"Mine iron ore."	2000

We compared LS-Imagine with various methods including VPT, STEVE-1, PTGM, Director, and DreamerV3. The evaluation metrics include success rate in completing tasks within specified steps and average interaction steps required to complete tasks. The numerical results are shown in Table 2.

📖 Baseline methods in this comparison: VPT (Video PreTraining, Baker et al., 2022) pretrains a Minecraft policy on unlabeled YouTube gameplay video: it first trains an inverse dynamics model to infer actions from video using a small labeled dataset, then uses that model to pseudo-label roughly 70,000 hours of unlabeled footage for behavior cloning. STEVE-1 (Lifshitz et al., 2023) builds a text-conditioned or visual-goal-conditioned policy on top of VPT, following the instruction-tuning recipe used for language models. PTGM (Pre-Training Goal-based Models, Yuan et al., 2024) pretrains a goal-conditioned low-level policy on task-agnostic data and clusters the goals in the dataset into a small discrete set from which a high-level policy selects. Director (Hafner et al., 2022) is a hierarchical extension of Dreamer in which a manager policy sets latent-space subgoals and a worker policy learns to reach them; it targets the same long-horizon "myopia" problem that LS-Imagine addresses with jump-style transitions. All four serve here as prior model-free or model-based baselines that LS-Imagine is compared against.

Model	Collect logs in plains: succ. (%)	Collect logs in plains: succ. step	Collect water with bucket: succ. (%)	Collect water with bucket: succ. step	Collect sand: succ. (%)	Collect sand: succ. step	Shear sheep: succ. (%)	Shear sheep: succ. step	Mine iron ore: succ. (%)	Mine iron ore: succ. step
VPT	6.97	963.32	0.61	987.65	12.99	880.54	1.94	987.49	0.00	—
STEVE-1	57.00	752.47	6.00	989.07	37.00	770.40	3.00	992.36	0.00	—
PTGM	41.86	811.19	2.78	977.78	17.71	833.64	21.54	887.03	15.14	1586.03
Director	8.67	968.09	20.90	931.74	36.36	825.35	1.27	995.99	7.82	1906.31
DreamerV3	53.33	711.22	55.72	628.79	59.88	548.76	25.13	841.14	16.79	1789.06
LS-Imagine	80.63	503.35	77.31	502.61	62.68	601.18	54.28	633.78	20.28	1748.55

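We found that LS-Imagine achieves the highest success rate on all five tasks and requires the fewest steps on three of them (collecting logs, collecting water, and shearing sheep); DreamerV3 needs fewer steps on Collect sand, and PTGM on Mine iron ore. Its advantages are particularly pronounced in task scenarios with sparsely distributed targets.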
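As a quick check on the size of these gains, the success rates from Table 2 can be compared against the strongest baseline on each task. The snippet below only re-arranges the numbers reported in the table; the short task labels are abbreviations chosen here.

```python
# Success rates (%) copied from Table 2, in task order:
# logs, water, sand, wool (shear sheep), iron ore.
tasks = ["logs", "water", "sand", "wool", "iron ore"]
success = {
    "VPT":        [6.97, 0.61, 12.99, 1.94, 0.00],
    "STEVE-1":    [57.00, 6.00, 37.00, 3.00, 0.00],
    "PTGM":       [41.86, 2.78, 17.71, 21.54, 15.14],
    "Director":   [8.67, 20.90, 36.36, 1.27, 7.82],
    "DreamerV3":  [53.33, 55.72, 59.88, 25.13, 16.79],
    "LS-Imagine": [80.63, 77.31, 62.68, 54.28, 20.28],
}

ours = success["LS-Imagine"]
baselines = {name: s for name, s in success.items() if name != "LS-Imagine"}

margins = {}
for i, task in enumerate(tasks):
    # Strongest baseline on this task and LS-Imagine's margin over it.
    best_name = max(baselines, key=lambda name: baselines[name][i])
    margins[task] = (best_name, round(ours[i] - baselines[best_name][i], 2))

for task, (name, gain) in margins.items():
    print(f"{task:9s} best baseline: {name:10s} margin: +{gain:.2f} points")
```

The margin over the best baseline ranges from about 3 percentage points (sand, iron ore) to more than 20 (logs, water, wool).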

Additionally, we present visualization results of reconstructed observation images and affordance maps based on long short-term imagination state sequences in Figure 10. The first row shows latent states before and after jump-style state transitions, decoded back to pixel space to intuitively present state changes. The second row visualizes affordance maps reconstructed from latent states to more clearly understand how affordance maps facilitate jump-style state transitions and whether they can provide effective goal-oriented guidance. The last row overlays affordance maps on reconstructed observation images through transparent superposition to more intuitively highlight the regions the agent focuses on.

Figure 10: Visualization of long short-term imagination sequences

These visualization results demonstrate that LS-Imagine's long short-term world model can adaptively decide when to perform long-term imagination based on current visual observations. Furthermore, the generated affordance maps can effectively align with regions highly correlated with the final goal, thereby facilitating more efficient policy exploration by the agent.

Moreover, given that our method relies on affordance maps to identify high-value exploration regions to achieve long-term state jumps, one might think that if the target is occluded or invisible, our method would fail. To demonstrate that our affordance map generation method is not merely a target recognition algorithm and does not only highlight relevant regions when targets are visible, we present examples of affordance maps generated when targets are occluded or invisible in Figure 11.

Figure 11: Affordance maps when targets are occluded or invisible

Thanks to the MineCLIP model's pre-training on a large corpus of Minecraft gameplay videos, our affordance map generation method can provide effective guidance for exploration even when targets are completely occluded or invisible. For example, as shown in Figure 11(a), in the village-finding task, although the village is not visible in the current observation, the affordance map still provides clear exploration directions, suggesting that the agent explore toward the forest on the right or the open area on the left hillside. Similarly, in the mining task shown in Figure 11(b), although ores are typically located underground and occluded in the current observation, the affordance map still guides the agent to dig into the mountain on the right or underground ahead. These examples illustrate that affordance maps can help agents explore effectively even when targets are occluded.

5. Conclusion

Our work proposes a novel method, LS-Imagine, aimed at overcoming the challenges faced in training visual reinforcement learning agents in high-dimensional open worlds. By expanding the imagination horizon and leveraging a long short-term world model, LS-Imagine can efficiently perform policy exploration in vast state spaces. Additionally, introducing goal-based jump-style state transitions and affordance maps enables the agent to better understand long-term value, thereby enhancing its decision-making capability. Experimental results show that in the Minecraft environment, LS-Imagine achieves significant performance improvements compared to existing methods. This not only highlights LS-Imagine's potential in open-world reinforcement learning but also provides new inspiration for future research in this field.

The paper's code, checkpoints, and environment configuration documentation are all provided. We welcome GitHub stars and citations!

GitHub link: https://github.com/qiwang067/LS-Imagine

Citation:

@inproceedings{li2025open,
    title={Open-World Reinforcement Learning over Long Short-Term Imagination},
    author={Jiajian Li and Qi Wang and Yunbo Wang and Xin Jin and Yang Li and Wenjun Zeng and Xiaokang Yang},
    booktitle={ICLR},
    year={2025}
}
