Published on June 30, 2025 3:17 AM GMT
Daniel Zhang, Melvin Huang, Hassan Dawy, Veda Duddu, Kevin Zhu, Sean O'Brien, Vasu Sharma, Sunishchal Dev
This research was conducted as part of the Algoverse AI Research program Winter 2024 cohort.
Abstract
Artificial general intelligence (AGI) may surpass human capabilities at many tasks in the years ahead. As AI systems become more capable and are further integrated into critical tasks, the risk from their misalignment grows. Previous literature has established that AI frequently exploits ill-defined or misspecified objectives, scoring well on the objective function while producing unintended behavior, particularly under distribution shifts. Addressing these risks requires both theoretical advances in alignment techniques and empirical evaluations of AI behavior in dynamic environments. In this paper, we develop a reinforcement learning (RL) training setup for AI agents in the video game Minecraft (a complex open-world survival game), and explore how learned behavior changes with (a) reward function design, (b) environment changes, and (c) Kullback-Leibler (KL) divergence regularization against catastrophic forgetting. We discuss the situation in which a learned model is itself an optimizer (mesa-optimization), and how its internal objective can differ from the "outer" loss function it was trained with, depending on the task at hand. We present the first fine-tuning of a Minecraft agent explicitly for survival, a challenge not previously achieved, and demonstrate promising results. Finally, we outline potential techniques to mitigate inner misalignment in such reinforcement learning setups.
Introduction
Artificial agents must align with human goals to act safely and effectively. In reinforcement learning (RL), alignment refers to ensuring an agent pursues the intended objective and not an unintended proxy. As RL systems grow more capable, the consequences of even slight misalignment increase. For example, advanced RL agents often exploit misspecified reward functions, achieving high proxy rewards at the expense of the true desired outcome (Clark 2016). This phenomenon, known as reward hacking, underscores the rising risk of misalignment with increasing capability. Ensuring alignment is, therefore, a central challenge in RL, especially for complex tasks where designing perfect reward signals is difficult.
Minecraft offers an ideal domain to study RL alignment. It is a richly dynamic environment—a 3D, open-world sandbox with a virtually infinite state space and diverse possible goals (Guss et al. 2019). An agent in Minecraft may roam, gather resources, craft tools, fight mobs, or build structures, all within the same episode. This openness poses significant challenges for alignment: the agent has many degrees of freedom to pursue unintended behaviors, and long-horizon tasks make it difficult to define dense, goal-aligned rewards. At the same time, Minecraft’s complexity provides a valuable stress test for alignment solutions. If an agent can remain aligned with a specified objective (e.g., crafting an item) in such a setting, it increases our confidence in scalable alignment strategies.
Related Works
Figure 1: A first-person view of the Minecraft environment. This open-ended, procedurally generated 3D world provides a complex, dynamic testbed for studying alignment in reinforcement learning agents.
Video Pretraining. Baker et al. (2022) introduced Video Pretraining (VPT), a method that enables Minecraft agents to learn from unlabeled gameplay videos instead of relying on scarce action-labeled data. VPT uses a small set of contractor-labeled gameplay (~2,000 hours) to train an inverse dynamics model (IDM) that generates pseudo-labels for thousands of hours of online video, yielding far greater data efficiency than direct behavioral cloning. The resulting VPT foundation model achieved human-level Minecraft performance with non-trivial zero-shot skills, forming the basis of our experiments. However, when fine-tuning such models across multiple training phases, we must address catastrophic forgetting: the tendency to lose earlier skills when learning new ones. To mitigate this, we apply Learning without Forgetting (LwF) (Li & Hoiem 2016), which preserves the original model's outputs as "soft targets" during new-task training to prevent forgetting previously learned capabilities. Incorporating LwF into our training pipeline ensures that the agent retains crucial early-game skills (like breaking trees for wood and crafting basic tools) even as it learns later-phase tasks.
Furthermore, we leverage the MineRL dataset introduced by Guss et al. (2019), which contains over 60 million state-action pairs from human Minecraft play and was designed to inject human inductive biases for more sample-efficient deep RL. MineRL provides a range of structured, task-oriented scenarios (e.g., obtaining wood, mining iron, crafting tools) that mirror the game's progression, and we integrate these tasks into a phased curriculum for our agent. From an alignment perspective, human demonstrations in MineRL serve as an initial proxy for "aligned" behavior: since human players presumably pursue the intended goals, early imitation learning helps the agent avoid immediately exploiting reward loopholes. As the agent continues to train via reinforcement learning, however, its policy may gradually diverge from the human trajectories in pursuit of reward, illustrating how an initially aligned agent can shift away from human-like behavior without ongoing constraints.
RL Agent Misalignment Studies. A growing body of literature addresses how RL agents can misbehave under poorly specified goals. Clark and Amodei (2016) highlighted reward mis-specification and attendant problems like reward gaming and unintended side effects as concrete problems in AI safety. More recently, Pan et al. (2022) conducted a systematic study on the effects of reward misspecification in RL. They constructed environments with known reward biases and showed that more capable agents (with larger models or longer training) are more prone to exploiting reward loopholes, achieving higher proxy reward but lower true performance. This confirms a key tenet of alignment: as an agent’s competence increases, so does the need for a correct objective, since an intelligent agent will exploit any discrepancy it can find. On the inner alignment side, Langosco et al. (2022) demonstrated goal misgeneralization in deep RL systems. In their experiments, agents that perform well in-distribution sometimes latch onto the wrong goal when facing novel situations, despite no explicit reward mis-specification. Such work empirically verifies the inner misalignment scenarios postulated by theoretical papers like Risks from Learned Optimization (Hubinger et al. 2021).
Collectively, these prior studies highlight the difficulty of alignment as agents become more capable. Our work builds on this foundation by focusing on an open-ended survival task in Minecraft using a pre-trained VPT foundation model. To our knowledge, this is the first demonstration of an RL agent learning to survive in Minecraft, achieved through curriculum-based fine-tuning of a VPT model. Our experiments empirically examine how reward shaping and staged training affect alignment, and we document practical instances of inner misalignment (mesa-optimizers' behavior failing to properly aid the outer goal) emerging in this complex domain.
Methods
Training Setup
We implement a modified Phasic Policy Gradient (PPG) algorithm to train our Minecraft agents. The training system uses a multi-process, multi-threaded design that separates environment interaction, rollout collection, and policy optimization to maximize computational efficiency. PPG is described in further detail by Baker et al. (2022), so we do not cover it in depth here.
Environment processes
Each Minecraft instance runs independently in its own process. This allows multiple environments to run in parallel, generating observations simultaneously rather than sequentially. Each environment communicates with the main thread through queues, enabling asynchronous access.
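For illustration, a minimal sketch of this process-and-queue pattern is shown below. The function and factory names are hypothetical placeholders rather than names from our codebase, and the message format is simplified.

```python
import multiprocessing as mp

def env_worker(env_id, action_queue, obs_queue, make_env):
    """Run one Minecraft instance in its own process (illustrative sketch).

    `make_env` is assumed to build a Gym-style MineRL environment; the real
    system's wiring and message format are not reproduced here.
    """
    env = make_env()
    obs = env.reset()
    obs_queue.put((env_id, obs, 0.0, False))       # initial observation
    while True:
        action = action_queue.get()                # blocks until the coordinator sends an action
        if action is None:                         # sentinel used for shutdown
            break
        obs, reward, done, _info = env.step(action)
        if done:
            obs = env.reset()                      # episodes reset inside the worker
        obs_queue.put((env_id, obs, reward, done))
    env.close()

if __name__ == "__main__":
    num_envs = 4
    obs_queue = mp.Queue()                                   # shared: workers -> coordinator
    action_queues = [mp.Queue() for _ in range(num_envs)]    # one per worker
    # Workers would be launched with a real environment factory, e.g.:
    # workers = [mp.Process(target=env_worker,
    #                       args=(i, action_queues[i], obs_queue, make_minecraft_env))
    #            for i in range(num_envs)]
```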
Environment thread
A dedicated coordinator thread manages the environment processes and handles three main responsibilities. 1) Action Distribution: it distributes actions from the policy to each environment process, ensuring all environments are kept busy. 2) Observation Collection: it aggregates observations and rewards from each environment process. 3) Rollout Buffering: it assembles experiences into structured rollouts containing sequential data with observations, actions, rewards, hidden states, and additional flags.
Training thread
We use a separate training thread so that training is not blocked by environment stepping (observation collection), although throughput is still bottlenecked because collecting rollouts is slower than training on them. The training thread focuses exclusively on policy optimization and handles three main responsibilities (a simplified sketch of its loop follows the list below). 1) Phase Management: it alternates between policy updates (wake phase) and auxiliary updates (sleep phase).
2) Policy Optimization: During the wake phase, the thread performs standard PPO updates using the collected rollouts.
3) Auxiliary Optimization: During sleep phases, it performs the PPG-specific joint optimization of the auxiliary value function and policy distillation.
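A highly simplified sketch of how such a loop can alternate the two phases is given below; the helper functions and the number of wake iterations per sleep phase are placeholders, not our actual settings.

```python
def ppg_training_loop(rollout_queue, policy, ppo_update, ppg_aux_update,
                      policy_iters_per_aux=8):
    """Alternate wake (PPO) and sleep (PPG auxiliary) phases.

    Sketch only: `rollout_queue.get()` is assumed to yield a batch of rollouts
    assembled by the environment thread, and `ppo_update` / `ppg_aux_update`
    stand in for the actual optimization steps.
    """
    replay = []                                  # rollouts reused during the sleep phase
    iteration = 0
    while True:
        rollouts = rollout_queue.get()           # produced asynchronously by the env thread
        ppo_update(policy, rollouts)             # wake phase: clipped PPO policy/value update
        replay.append(rollouts)
        iteration += 1
        if iteration % policy_iters_per_aux == 0:
            # Sleep phase: train the auxiliary value head and distill the
            # value function back into the policy network.
            ppg_aux_update(policy, replay)
            replay.clear()
```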
Model architecture
Minecraft is an extremely complex open-world game, with many important milestones taking millions of optimal consecutive actions to achieve. Training an RL agent from random initialization would require exploring an enormous search space with extremely sparse rewards, drastically increasing computational costs.
Thus, we choose to fine-tune a pretrained model (one trained to obtain a diamond pickaxe) with RL, and we follow the architecture of the Video Pretraining paper closely (Baker et al. 2022). In short, the model processes sequential video frames of dimensions 128×128×3, beginning with pixel normalization to the [0,1] range. It features a ResNet-based image processing network with three stacks of widths {64, 128, 128}, each containing convolutional layers, max pooling, and classic ResNet blocks. After flattening, each frame's features are processed through two frame-wise dense layers before passing through four causally masked transformer blocks that prevent attention to future frames. The model uses Transformer-XL-style training, where frames can attend to keys and values from past batches within the same video, and employs relative attention position embeddings. The architecture culminates in action-specific prediction heads for both discrete keyboard inputs and mouse movements, with approximately 0.5 billion trainable parameters overall.
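For concreteness, a rough PyTorch skeleton of this architecture is sketched below. It is illustrative only: the hidden width, head sizes, and the Transformer-XL recurrence and relative position embeddings are simplified or assumed rather than taken from the released VPT code.

```python
import torch
import torch.nn as nn

class ImpalaStack(nn.Module):
    """One ResNet stack: conv -> max pool -> two residual blocks."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.res_blocks = nn.ModuleList([
            nn.Sequential(nn.ReLU(), nn.Conv2d(c_out, c_out, 3, padding=1),
                          nn.ReLU(), nn.Conv2d(c_out, c_out, 3, padding=1))
            for _ in range(2)
        ])

    def forward(self, x):
        x = self.pool(self.conv(x))
        for block in self.res_blocks:
            x = x + block(x)            # residual connection
        return x

class VPTSketch(nn.Module):
    """Rough skeleton of the policy network described above (illustrative head
    sizes; the Transformer-XL caching across batches is omitted)."""
    def __init__(self, hidden=1024, n_keyboard=8641, n_mouse=121):
        super().__init__()
        self.stacks = nn.Sequential(ImpalaStack(3, 64),
                                    ImpalaStack(64, 128),
                                    ImpalaStack(128, 128))
        self.dense = nn.Sequential(nn.Flatten(), nn.LazyLinear(hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.keyboard_head = nn.Linear(hidden, n_keyboard)   # discrete key/button combinations
        self.mouse_head = nn.Linear(hidden, n_mouse)         # discretized camera movement

    def forward(self, frames):                     # frames: (B, T, 3, 128, 128), uint8
        b, t = frames.shape[:2]
        x = frames.float().div(255.0).flatten(0, 1)          # pixel normalization to [0, 1]
        x = self.dense(self.stacks(x)).view(b, t, -1)        # frame-wise dense layers
        causal = torch.triu(torch.full((t, t), float("-inf"), device=x.device), diagonal=1)
        x = self.transformer(x, mask=causal)                 # no attention to future frames
        return self.keyboard_head(x), self.mouse_head(x)
```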
Table 1: Training hyperparameters: learning rate, max grad norm, batch size, batches per iteration, context length, discount factor (γ), value loss coefficient, GAE λ, PPG sleep cycles, PPG sleep value-function coefficient, PPG sleep auxiliary value-function coefficient, PPG sleep KL coefficient, KL divergence coefficient, and coefficient decay (values omitted here).
Mitigating Catastrophic Forgetting
We also implement Learning without Forgetting (LwF) to counteract the effects of catastrophic forgetting. Our implementation follows these steps:
- Before beginning training on a new phase, we record the outputs of the current network on the new phase's training data; these serve as soft targets for preservation.
- We then jointly optimize the network to both perform well on the new task (e.g., crafting an iron sword) and maintain consistent responses on previously learned tasks (e.g., gathering logs, crafting planks).
- This optimization uses a combined loss function:

$$\mathcal{L} = \mathcal{L}_{\text{RL}} + \lambda \, \mathcal{L}_{\text{KD}},$$

where $\mathcal{L}_{\text{RL}}$ is the reinforcement learning loss for the new task, $\mathcal{L}_{\text{KD}}$ is a knowledge distillation loss that encourages preserving responses from the previous phase, and $\lambda$ is a balancing hyperparameter.
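A minimal sketch of this combined objective is shown below; the temperature, weighting, and single-head treatment are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def lwf_loss(new_logits, old_logits, rl_loss, lam=1.0, temperature=2.0):
    """Combined Learning-without-Forgetting objective (sketch).

    `rl_loss` is the reinforcement learning loss for the new phase,
    `old_logits` are the frozen previous-phase policy outputs recorded on the
    same observations (the "soft targets"), and `lam` / `temperature` are
    assumed hyperparameters, not the values used in our runs.
    """
    # Knowledge-distillation term: KL divergence between the previous policy's
    # softened action distribution and the current one.
    old_probs = F.softmax(old_logits / temperature, dim=-1)
    new_log_probs = F.log_softmax(new_logits / temperature, dim=-1)
    kd_loss = F.kl_div(new_log_probs, old_probs, reduction="batchmean") * temperature ** 2
    return rl_loss + lam * kd_loss
```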
This approach contrasts with typical fine-tuning, which often results in the agent forgetting earlier skills while optimizing for new objectives. By incorporating a Kullback-Leibler (KL) divergence term between the current policy and the policy from the previous training phase, we provide a regularization mechanism that preserves critical knowledge while allowing sufficient flexibility to learn new behaviors.
Transfer learning and reward structure for survival agent
Transfer learning from the resource acquisition policy
Our methodology employs transfer learning from a model previously optimized for diamond pickaxe acquisition in Minecraft (Baker et al. 2022). This approach was selected due to action space congruence: both diamond-pickaxe acquisition and environmental exploration share fundamental action primitives (object identification, terrain navigation, resource gathering), providing a foundation of competencies in navigation, hazard avoidance, and environmental interaction. To facilitate this transfer learning, we implemented two supplementary techniques:
- Policy divergence regulation: KL divergence constraints initially restricted deviation from the pretrained policy, with gradual relaxation via a decay coefficient (0.9995), enabling controlled adaptation toward exploration-oriented behavior (a minimal schedule sketch follows this list).
- Architectural leverage: We utilized the recurrent neural architecture of the pretrained model, which supports temporal information processing and is beneficial for developing implicit environmental mapping capabilities.
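The decay schedule mentioned in the first item can be sketched as follows; the initial coefficient value is an assumption, and only the 0.9995 decay factor comes from our setup.

```python
def kl_coefficient(step, initial_coef=1.0, decay=0.9995):
    """KL-penalty weight after `step` policy updates (illustrative schedule)."""
    return initial_coef * decay ** step

# Example: the constraint on deviating from the pretrained policy weakens
# gradually, to roughly 61% of its initial strength after 1000 updates.
print(kl_coefficient(1000))   # 1.0 * 0.9995**1000 ≈ 0.61
```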
Exploration-Survival reward framework
To redirect the agent's objectives from resource gathering to environmental exploration, we designed a reward function balancing exploration incentives with survival imperatives. Our framework's rationale encompasses:
- Balanced incentive structure: We established a hierarchical progression of rewards (survival maintenance < chunk discovery < biome discovery) that guides the agent toward increasingly valuable exploratory achievements while learning survival behaviors.
- Stability preservation: Reward magnitudes were calibrated (roughly matched to the average reward per 10 minutes) to prevent destabilizing gradient spikes that disrupt the learning process, particularly during the transition from the pretrained policy.
- Emergent mapping behavior: By rewarding exploration without providing explicit spatial representations, we encourage the development of implicit environmental mapping capabilities within the agent's neural network.
- Delayed reinforcement activation: Exploration rewards activate only after initial environmental familiarization, preventing reinforcement of random locomotion during early episode stages.
Training data distribution shift
When training survival agents in Minecraft, a critical challenge emerges from the game's day-night cycle. During nighttime, hostile mobs spawn and visibility decreases, creating substantially more hazardous conditions compared to daytime. Standard reinforcement learning approaches reset environments upon agent death, which occurs more frequently at night. This creates a sampling bias toward daytime experiences, as agents repeatedly die during nighttime and reset to daytime. We can model this scenario as a continuous-time Markov process with two primary states "day" and "night," where the transition rates between these states are influenced by both the natural progression of time and stochastic agent mortality rates. In this model, we observe that the steady-state distribution will inherently favor daytime conditions due to the higher probability of transitioning from night to day via agent death. To counterbalance this inherent bias, we implemented a modified spawning mechanism that preferentially initializes environments during nighttime. Our implementation randomizes starting times with a 70% probability of nighttime (ticks 13000-23000) and 30% probability of daytime (ticks 0-12000). This intervention aims to shift the steady-state distribution toward a more balanced representation of both environmental conditions.
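A minimal sketch of this spawning mechanism is given below; the helper name and the reset hook are illustrative, and only the probabilities and tick ranges come from our setup.

```python
import random

def sample_spawn_time(night_prob=0.7):
    """Sample an initial world time, biased toward nighttime (sketch of the
    modified spawning mechanism described above)."""
    if random.random() < night_prob:
        return random.randint(13000, 23000)   # nighttime ticks
    return random.randint(0, 12000)           # daytime ticks

# On each environment reset, the sampled time would be applied to the world,
# e.g. via a Minecraft command such as f"/time set {sample_spawn_time()}".
```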
| Event | Reward |
| --- | --- |
| Being alive | 0.0002 / step |
| Discovering a previously uncharted 4x4 block chunk | 2 |
| Taking x damage | -2x |
| Discovering a new biome | 100 |
| Death | -300 |
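A direct translation of this reward table into code might look like the following sketch; the event-detection logic that produces these counts is assumed and not shown.

```python
def survival_reward(alive_steps, new_chunks, damage_taken, new_biomes, died):
    """Per-interval reward matching the event table above (illustrative helper)."""
    reward = 0.0002 * alive_steps        # small per-step reward for staying alive
    reward += 2 * new_chunks             # previously uncharted 4x4 chunks discovered
    reward += -2 * damage_taken          # -2x for taking x damage
    reward += 100 * new_biomes           # newly discovered biomes
    if died:
        reward -= 300
    return reward
```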
Curriculum learning to obtain iron sword
The standard education system involves final exams, preceded by tests and quizzes that act as learning checks as new content is introduced. These exams, although unpopular with many students, serve as crucial checkpoints, providing feedback on students' understanding of the course material. We apply a similar method to train a VPT model to accomplish the complex task of obtaining an iron sword. Instead of rewarding the model only for obtaining the iron sword, we train it in a curriculum consisting of five phases. In each phase, the model is rewarded for obtaining different materials required to build an iron sword. After the model is trained on one phase, a new weights file is saved, and that phase's output weights are used as the starting point for training in the next phase.
Curriculum Phases
- Phase I - wood collection: Reward the agent for gathering wood logs, the first basic resource. The agent receives a base reward (e.g. +50) for every log obtained, with bonus rewards for reaching cumulative milestones (such as the first log, 5 logs, and 10 logs) to encourage sustained wood gathering. A penalty is applied for wasting/losing logs (e.g. -75) to discourage the agent from discarding this key resource.
- Phase II - wood plank and stick collection: Shift the focus to crafting wood planks and sticks, which are necessary for tool crafting. The agent is rewarded for each plank produced (+50 per plank) and each stick crafted (+75 per stick). In this phase, no penalty is given for using up planks, ensuring the agent is not disincentivized from crafting the sticks needed for later tools.
- Phase III - stone collection: Incentivize the agent to mine stone and craft a basic stone tool. The agent now receives rewards for collecting cobblestone (+50 per stone block) and an additional reward for crafting a stone pickaxe (+100). This encourages the agent to advance from wood tools to more durable stone tools, a critical step toward acquiring iron.
- Phase IV - creating a furnace: Guide the agent to obtain coal and build a furnace, which will be needed to smelt iron ore. The agent gains rewards for picking up coal (+50) and for successfully crafting a furnace (+100). Constructing a furnace is treated as an essential milestone, since an iron sword cannot be made without first smelting iron into ingots. This phase ensures the agent learns to combine resources into a functional crafting station for metallurgy.
- Phase V - collecting iron: In the final phase, the agent is trained to acquire iron and forge the iron sword. It is rewarded for each iron ore mined (e.g. +25 per ore) and each iron ingot produced by smelting (+50 per ingot), and it receives a large terminal reward for crafting an iron sword (+500). These incentives directly target the end goal and all necessary steps to achieve it, culminating the curriculum with the successful creation of an iron sword (a compact sketch of the per-phase rewards follows this list).
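The per-phase rewards above can be summarized as a small configuration sketch. The item identifiers are illustrative placeholders, and the Phase I milestone bonuses and the global death and time-step penalties described below are omitted.

```python
# Illustrative per-phase reward specification for the curriculum above.
CURRICULUM = [
    {"name": "I: wood collection",     "rewards": {"log": 50},            "penalties": {"log_lost": -75}},
    {"name": "II: planks and sticks",  "rewards": {"planks": 50, "stick": 75}},
    {"name": "III: stone collection",  "rewards": {"cobblestone": 50, "stone_pickaxe": 100}},
    {"name": "IV: creating a furnace", "rewards": {"coal": 50, "furnace": 100}},
    {"name": "V: collecting iron",     "rewards": {"iron_ore": 25, "iron_ingot": 50, "iron_sword": 500}},
]

def phase_reward(phase, items_gained, items_lost=None):
    """Reward for one step of a given phase from counts of items gained/lost."""
    spec = CURRICULUM[phase]
    total = sum(spec["rewards"].get(item, 0) * n for item, n in items_gained.items())
    for item, n in (items_lost or {}).items():
        total += spec.get("penalties", {}).get(item, 0) * n
    return total
```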
We trained the VPT 2x foundation model, using the fine-tuned early-game weights for the initial phase. The model was trained in each phase for 10,000 ticks, with rewards calculated every 40 steps across ten simultaneous training environments. Throughout all phases, we include penalties to disincentivize undesirable behavior: a penalty of 50 for agent death and a small time-step penalty of 0.1 at each in-game step (tick) to encourage efficient completion of objectives. By structuring rewards and penalties this way, each phase's objective is aligned with the overall goal, and there is no incentive to do anything other than the intended subtask at that time.
Note: These rewards are large compared to the rewards used by Baker et al. to train the VPT models, but this was done to override the existing pre-trained behaviors of the VPT models.
Hoarding Agent
This experiment investigated whether an RL agent trained with a reward function focused on item collection could implicitly develop survival behaviors as a means to achieve its primary goal, since survival is an instrumental sub-goal to collecting items. The hypothesis was that outer alignment toward item acquisition might lead to inner alignment toward survival strategies, despite no explicit survival reward.
The agent was trained using a custom reward function with three distinct tiers based on item rarity:
Common items (wood, dirt, cobblestone, sand): +1 point per item
Mid-tier resources (coal, iron ore, planks): +5 points per item
Rare materials (iron ore+): +25 points per item
No direct rewards were provided for survival-related actions such as eating food, building shelters, or avoiding damage.
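The tier structure above can be captured by a small lookup, sketched below. The exact item sets, in particular the contents of the rare tier, are assumptions for illustration.

```python
# Tiered item values for the hoarding agent (sketch; item lists abbreviated
# to the examples given in the text).
ITEM_TIERS = {
    "common": ({"wood", "dirt", "cobblestone", "sand"}, 1),
    "mid":    ({"coal", "iron_ore", "planks"}, 5),
    "rare":   ({"iron_ingot", "diamond"}, 25),   # assumed contents of the "iron ore+" tier
}

def hoarding_reward(items_collected):
    """Reward from newly collected items only; no survival-related terms."""
    total = 0
    for item, count in items_collected.items():
        for item_set, value in ITEM_TIERS.values():
            if item in item_set:
                total += value * count
                break
    return total
```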
The agent used the same RL-from-early-game policy as its starting point, initialized with weights from a model previously trained to prioritize obtaining a diamond pickaxe. We conducted training sessions in standard Minecraft survival worlds with normal day-night cycles and mob spawning enabled to provide natural survival challenges.
Results
Survival agent
The survival agent successfully learned core survival strategies: it can navigate to new, unexplored areas, avoid environmental hazards (hostile mobs at night, drowning, fall damage), and adjust its behavior based on the day-night cycle. These behaviors indicate that the agent learned to prioritize staying alive while exploring, given the combined exploration and survival rewards defined in our training setup.
Misalignment
Due to time constraints, we did not conduct an exhaustive evaluation of misalignment behaviors in the survival agent. We describe two that we observed (1,2) and outline two more hypothetical misalignment patterns that merit future investigation (3,4).
- Burrowing behavior: The survival RL agent fine-tuned from the pretrained diamond-pickaxe model favored tunneling underground over surface exploration under high KL loss coefficients. The KL divergence loss, meant to prevent catastrophic forgetting, constrained policy updates and limited exploration of novel behaviors. As a result, the agent defaulted to familiar strategies: burrowing underground, in line with its pretrained objective of finding diamonds. This unintentionally allowed it to avoid surface dangers while still discovering new chunks, exploiting exploration rewards while minimizing survival challenges. This outcome reflected a failure not of the agent's capabilities but of the outer objective: the reward encouraged exploration without specifying how or where, allowing the agent to exploit a loophole in the task design.
- Inherent nature of the game: This misalignment reflects a deeper issue of values misalignment. Minecraft is not a pure survival game; it is a sandbox supporting a wide range of human playstyles. While survival mechanics like hunger and danger exist, they can be trivially gamed; for instance, an agent can survive indefinitely by enclosing itself in blocks. Yet survival in this form misses the point. Minecraft is fundamentally a game about creativity, exploration, and open-ended expression. Human players build elaborate structures, embark on adventures, or pursue aesthetic goals far beyond survival, and staying alive becomes more challenging precisely because of those ambitions. Optimizing for survival alone strips away this richness. It highlights the challenge of aligning agents not just with surface-level objectives but with the deeper, harder-to-define values that drive human behavior, in this case the imaginative spirit at the heart of the game.
- Threat avoidance proxies: Instead of learning to handle threats, an agent could learn to simply avoid encountering them (e.g., optimizing for "not seeing zombies"). It might shun certain biomes or repeatedly return to safer, familiar areas at night. We could investigate this by introducing controlled hostile encounters and observing whether the agent deviates from optimal paths to avoid locations where it previously saw threats, even when those threats are no longer present.
- Terrain-type avoidance: Without explicit guidance on terrain preferences, the agent may develop biases against certain terrain types (e.g., sand or water) if those were associated with negative outcomes during training. A mesa-optimizer could fail to generalize under distribution shift (for example, avoiding deserts entirely because it had only seen sand in coastal areas, where it is easy to drown). Future evaluations could measure whether the agent avoids specific terrains even when hazards are not present.
Curriculum learning system for iron sword agent
We found it challenging to modify the pre-trained VPT agent's behavior using our curriculum-based reward system. To ensure this was not simply due to poor hyperparameter choices, we tried a wide range of values for key parameters, for example adjusting the KL-divergence penalty coefficient, the learning rate, and the distillation temperature over several orders of magnitude around their base settings (see Table 1 in Section 3 for the default hyperparameters). Regardless of these adjustments, the agent either clung to its original pre-trained policy or deteriorated into random wandering, and it made no meaningful progress toward the new curriculum objectives. It is also very difficult for the agent to stumble onto the behaviors necessary to achieve these goals by pure chance alone. Crafting an iron sword, even for beginner players given goals similar to the model's, is not a straightforward task, requiring substantial exploration and knowledge of Minecraft-specific game mechanics such as crafting tools. Therefore, given the limited scope of our experiments, it is not surprising that the agent did not make much progress under the curriculum reward system.
Mesa-optimizer interpretation
The agent’s resistance to new objectives suggests the presence of a mesa-optimizer: an internal policy optimizer that pursues its own proxy goals inherited from pre-training, overriding the new reward signals. In other words, the VPT foundation model – originally trained via behavioral cloning to imitate human gameplay (Baker et al. 2022) – appears to have internalized objectives that diverge from our specified curriculum goals. The model learned to mimic human action patterns (e.g. moving and mining in Minecraft) without necessarily understanding the intended purpose of those actions. As a result, fine-tuning with new rewards (such as collecting specific resources for our curriculum) was ineffective because the agent kept chasing its pre-established proxy objective (imitation of its pre-training behavior) instead of the outer reward we provided. This failure of inner alignment is concerning for several reasons:
- Opaque internal objectives: The agent’s true goals (learned during pre-training) are not directly observable, making its behavior hard to predict or reliably control.
- Resistance to reward changes: Simply altering or re-weighting the reward function did not change the policy. Even substantial modifications to reward magnitudes and criteria left the agent’s actions largely unchanged, indicating the pre-trained objective remained dominant.
- KL-regularization was insufficient: Imposing a KL-divergence constraint to preserve prior useful behaviors also failed to steer the agent toward the new tasks. Even when the curriculum objectives overlapped with skills the agent already had (for instance, chopping trees to collect logs), the agent would not consistently pursue those objectives under the new reward structure.
- Lack of true understanding: There appears to be a disconnect between the agent’s learned action patterns and an understanding of the goal behind those actions. The agent’s performance suggests it was following superficial patterns from its imitation learning phase, rather than truly learning the new task’s intended objective. This highlights a gap between behavioral cloning (which learns correlations in actions) and the causal, goal-directed learning needed for reliable RL alignment.
Implications for alignment research
- Timing of alignment: Our inability to redirect pre-trained behaviors suggests that alignment timing is crucial. Attempting to align models after they have developed sophisticated internal mesa-optimizers may be more challenging than incorporating alignment from the beginning of training. This challenges the notion that AI systems can be developed first and then aligned later, and suggests that alignment mechanisms must be integrated into the initial development process.
- Limitations of post-development reward engineering: The failure of our multiple reward structures to cause meaningful changes in the agent's behavior highlights the potential limitations of using reinforcement learning to align systems that have already developed complex capabilities. This suggests that future alignment researchers should be cautious about relying solely on reward engineering to ensure aligned behavior in complex models.
Hoarding agent emergent survival
Our experiment testing whether item collection rewards could drive emergent survival behaviors yielded largely negative results. Despite a tiered reward structure incentivizing item accumulation, the agent failed to develop survival skills like eating or building shelters. Though it gathered common and rare materials effectively, the agent consistently died when faced with hunger or nighttime threats, lacking any protective behaviors.
A core limitation was that the RL-from-early-game model used for initialization didn’t encode survival strategies. While it captured basic resource gathering and crafting, it lacked representations for food management or shelter construction.
These results suggest that certain complex behaviors in rich environments like Minecraft may be effectively unlearnable through standard reinforcement learning alone, unless agents are seeded with demonstrations or trained using carefully structured, decomposed rewards. In our setup, the link between item hoarding and survival was too indirect to be discovered through rewards alone. This issue was compounded by our decision to fine-tune a model specialized for acquiring a diamond pickaxe. Its behavioral space had already been pruned by optimization, biasing the agent toward resource gathering and away from exploring survival strategies. Even under a new reward structure, it remained trapped in a local optimum, repeating familiar behaviors.
Our findings highlight a broader alignment challenge: once models are optimized for narrow goals, their ability to adapt to new objectives can be severely limited by reduced behavioral diversity, especially in open-ended environments like Minecraft.
Conclusion
We were successful in training an RL agent for survival and exploration, which to our knowledge has not been done before, with the agent displaying significant improvement in terrain navigation and mob detection/avoidance over the baseline model, RL-from-early-game (Baker et al. 2022).
In the curriculum learning experiment, however, there was difficulty redirecting the actions of the pre-trained models to accomplish objectives of similar scope. This contrast in experimental results is particularly striking because the curriculum rewards were deliberately aligned with behaviors already present in the VPT model's training. The model was pre-trained to obtain a diamond pickaxe, with an iron sword sharing many of the same intermediate steps and materials. This suggests that the specific context and implementation of reward-based learning can significantly impact outcomes, even with similar underlying models. These outcomes highlight the nuanced relationship between pre-trained knowledge, the specific forms of fine-tuning applied, and the task at hand.
Our third experiment provides additional perspective: an agent optimizing for hoarding inventory rewards was unable to learn the instrumental sub-goal of survival, suggesting that certain complex behaviors are effectively unlearnable through standard reinforcement learning alone, even with strong priors.
References
Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video PreTraining (VPT): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, volume 35. Curran Associates, Inc., 2022. URL https://arxiv.org/pdf/2206.11795.
Jack Clark and Dario Amodei. Faulty reward functions in the wild. 2016. URL https://openai.com/index/faulty-reward-functions.
William H. Guss, Brandon Houghton, Nicholas Topin, Philip Wang, Cristian Codel, Manuela Veloso, Emma Brunskill, Nishanth Gopalan, Ruslan Salakhutdinov, and Alex Berg. MineRL: A large-scale dataset of Minecraft demonstrations. arXiv, 2019. URL https://arxiv.org/abs/1907.13440.
Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv, 2021. URL https://arxiv.org/pdf/1906.01820.
Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, and David Krueger. Goal misgeneralization in deep reinforcement learning. Proceedings of Machine Learning Research, 2022. URL https://proceedings.mlr.press/v162/langosco22a/langosco22a.pdf.
Zhizhong Li and Derek Hoiem. Learning without forgetting. arXiv, 2016. URL https://arxiv.org/abs/1606.09282.
Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv, 2022. URL https://arxiv.org/pdf/2201.03544.