Published on June 30, 2025 3:17 AM GMT
Daniel Zhang, Melvin Huang, Hassan Dawy, Veda Duddu, Kevin Zhu, Sean O'Brien, Vasu Sharma, Sunishchal Dev
This research was conducted as part of the Algoverse AI Research program Winter 2024 cohort.
Abstract
Artificial general intelligence (AGI) may surpass human capabilities at many tasks in the years ahead. As AI systems become more capable and are further integrated into critical tasks, the risk from their misalignment grows. Previous literature has established that AI frequently exploits ill-defined or misspecified objectives, scoring well on the objective function while producing unintended behavior, particularly under distribution shifts. Addressing these risks requires both theoretical advances in alignment techniques and empirical evaluations of AI behavior in dynamic environments. In this paper, we develop a reinforcement learning (RL) training setup for AI agents in the video game Minecraft (a complex open-world survival game), and explore how learned behavior changes with (a) reward function design, (b) environment changes, and (c) Kullback-Leibler (KL) divergence regularization against catastrophic forgetting. We discuss the situation in which a learned model is itself an optimizer (mesa-optimization), and how its internal objective can differ from the "outer" loss function it was trained with, depending on the task at hand. We present the first fine-tuning of a Minecraft agent explicitly for survival, a challenge not previously achieved, and demonstrate promising results. Finally, we outline potential techniques to mitigate inner misalignment in such reinforcement learning setups.
Introduction
Artificial agents must align with human goals to act safely and effectively. In reinforcement learning (RL), alignment refers to ensuring an agent pursues the intended objective and not an unintended proxy. As RL systems grow more capable, the consequences of even slight misalignment increase. For example, advanced RL agents often exploit misspecified reward functions, achieving high proxy rewards at the expense of the true desired outcome (Clark 2016). This phenomenon, known as reward hacking, underscores the rising risk of misalignment with increasing capability. Ensuring alignment is, therefore, a central challenge in RL, especially for complex tasks where designing perfect reward signals is difficult.
Minecraft offers an ideal domain to study RL alignment. It is a richly dynamic environment—a 3D, open-world sandbox with a virtually infinite state space and diverse possible goals (Guss et al. 2019). An agent in Minecraft may roam, gather resources, craft tools, fight mobs, or build structures, all within the same episode. This openness poses significant challenges for alignment: the agent has many degrees of freedom to pursue unintended behaviors, and long-horizon tasks make it difficult to define dense, goal-aligned rewards. At the same time, Minecraft’s complexity provides a valuable stress test for alignment solutions. If an agent can remain aligned with a specified objective (e.g., crafting an item) in such a setting, it increases our confidence in scalable alignment strategies.
Related Works
Figure 1: A first-person view of the Minecraft environment. This open-ended, procedurally generated 3D world provides a complex, dynamic testbed for studying alignment in reinforcement learning agents.
Video Pretraining. Baker et al. (2022) introduced Video Pretraining (VPT), a method that enables Minecraft agents to learn from unlabeled gameplay videos instead of relying on scarce action-labeled data. VPT uses a small set of contractor-labeled gameplay (~2,000 hours) to train an inverse dynamics model (IDM) that generates pseudo-labels for thousands of hours of online video, yielding far greater data efficiency than direct behavioral cloning. The resulting VPT foundation model achieved human-level Minecraft performance with non-trivial zero-shot skills, forming the basis of our experiments. However, when fine-tuning such models across multiple training phases, we must address catastrophic forgetting: the tendency to lose earlier skills when learning new ones. To mitigate this, we apply Learning without Forgetting (LwF) (Li & Hoiem 2016), which preserves the original model's outputs as "soft targets" during new-task training to prevent forgetting previously learned capabilities. Incorporating LwF into our training pipeline ensures that the agent retains crucial early-game skills (like breaking trees for wood and crafting basic tools) even as it learns later-phase tasks.
Furthermore, we leverage the MineRL dataset introduced by Guss et al. (2019), which contains over 60 million state-action pairs from human Minecraft play and was designed to inject human inductive biases for more sample-efficient deep RL. MineRL provides a range of structured, task-oriented scenarios (e.g., obtaining wood, mining iron, crafting tools) that mirror the game's progression, and we integrate these tasks into a phased curriculum for our agent. From an alignment perspective, human demonstrations in MineRL serve as an initial proxy for "aligned" behavior: since human players presumably pursue the intended goals, early imitation learning helps the agent avoid immediately exploiting reward loopholes. As the agent continues to train via reinforcement learning, however, its policy may gradually diverge from the human trajectories in pursuit of reward, illustrating how an initially aligned agent can shift away from human-like behavior without ongoing constraints.
RL Agent Misalignment Studies. A growing body of literature addresses how RL agents can misbehave under poorly specified goals. Clark and Amodei (2016) highlighted reward mis-specification and attendant problems like reward gaming and unintended side effects as concrete problems in AI safety. More recently, Pan et al. (2022) conducted a systematic study on the effects of reward misspecification in RL. They constructed environments with known reward biases and showed that more capable agents (with larger models or longer training) are more prone to exploiting reward loopholes, achieving higher proxy reward but lower true performance. This confirms a key tenet of alignment: as an agent’s competence increases, so does the need for a correct objective, since an intelligent agent will exploit any discrepancy it can find. On the inner alignment side, Langosco et al. (2022) demonstrated goal misgeneralization in deep RL systems. In their experiments, agents that perform well in-distribution sometimes latch onto the wrong goal when facing novel situations, despite no explicit reward mis-specification. Such work empirically verifies the inner misalignment scenarios postulated by theoretical papers like Risks from Learned Optimization (Hubinger et al. 2021).
Collectively, these prior studies highlight the difficulty of alignment as agents become more capable. Our work builds on this foundation by focusing on an open-ended survival task in Minecraft using a pre-trained VPT foundation model. To our knowledge, this is the first demonstration of an RL agent learning to survive in Minecraft, achieved through curriculum-based fine-tuning of a VPT model. Our experiments empirically examine how reward shaping and staged training affect alignment, and we document practical instances of inner misalignment (mesa-optimizers' behavior failing to properly aid the outer goal) emerging in this complex domain.
Methods
Training Setup
We implement a modified Phasic Policy Gradient (PPG) algorithm to train our Minecraft agents. The training system uses a multi-process, multi-threaded design that separates environment interaction, rollout collection, and policy optimization to maximize computational efficiency. PPG is described in further detail by Baker et al. (2022), so we do not cover it in depth here.
Environment processes
Each Minecraft instance runs independently in its own process. This allows multiple environments to run in parallel, generating observations simultaneously rather than sequentially. Each environment communicates with the main thread through queues, enabling asynchronous access.
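For illustration, a minimal sketch of this process-and-queue pattern is shown below. The function and factory names are hypothetical placeholders rather than names from our codebase, and the message format is simplified.

```python
import multiprocessing as mp

def env_worker(env_id, action_queue, obs_queue, make_env):
    """Run one Minecraft instance in its own process (illustrative sketch).

    `make_env` is assumed to build a Gym-style MineRL environment; the real
    system's wiring and message format are not reproduced here.
    """
    env = make_env()
    obs = env.reset()
    obs_queue.put((env_id, obs, 0.0, False))       # initial observation
    while True:
        action = action_queue.get()                # blocks until the coordinator sends an action
        if action is None:                         # sentinel used for shutdown
            break
        obs, reward, done, _info = env.step(action)
        if done:
            obs = env.reset()                      # episodes reset inside the worker
        obs_queue.put((env_id, obs, reward, done))
    env.close()

if __name__ == "__main__":
    num_envs = 4
    obs_queue = mp.Queue()                                   # shared: workers -> coordinator
    action_queues = [mp.Queue() for _ in range(num_envs)]    # one per worker
    # Workers would be launched with a real environment factory, e.g.:
    # workers = [mp.Process(target=env_worker,
    #                       args=(i, action_queues[i], obs_queue, make_minecraft_env))
    #            for i in range(num_envs)]
```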
Environment thread
A dedicated coordinator thread manages the environment processes and handles three main responsibilities. 1) Action Distribution: it distributes actions from the policy to each environment process, ensuring all environments are kept busy. 2) Observation Collection: it aggregates observations and rewards from each environment process. 3) Rollout Buffering: it assembles experiences into structured rollouts containing sequential data with observations, actions, rewards, hidden states, and additional flags.
Training thread
We use a separate training thread so that training is not blocked by environment stepping (observation collection), although throughput is still bottlenecked because collecting rollouts is slower than training on them. The training thread focuses exclusively on policy optimization and handles three main responsibilities (a simplified sketch of its loop follows the list below). 1) Phase Management: it alternates between policy updates (wake phase) and auxiliary updates (sleep phase).
2) Policy Optimization: During the wake phase, the thread performs standard PPO updates using the collected rollouts.
3) Auxiliary Optimization: During sleep phases, it performs the PPG-specific joint optimization of the auxiliary value function and policy distillation.
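A highly simplified sketch of how such a loop can alternate the two phases is given below; the helper functions and the number of wake iterations per sleep phase are placeholders, not our actual settings.

```python
def ppg_training_loop(rollout_queue, policy, ppo_update, ppg_aux_update,
                      policy_iters_per_aux=8):
    """Alternate wake (PPO) and sleep (PPG auxiliary) phases.

    Sketch only: `rollout_queue.get()` is assumed to yield a batch of rollouts
    assembled by the environment thread, and `ppo_update` / `ppg_aux_update`
    stand in for the actual optimization steps.
    """
    replay = []                                  # rollouts reused during the sleep phase
    iteration = 0
    while True:
        rollouts = rollout_queue.get()           # produced asynchronously by the env thread
        ppo_update(policy, rollouts)             # wake phase: clipped PPO policy/value update
        replay.append(rollouts)
        iteration += 1
        if iteration % policy_iters_per_aux == 0:
            # Sleep phase: train the auxiliary value head and distill the
            # value function back into the policy network.
            ppg_aux_update(policy, replay)
            replay.clear()
```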
Model architecture
Minecraft is an extremely complex open-world game, with many important milestones taking millions of optimal consecutive actions to achieve. Training an RL agent from random initialization would require exploring an enormous search space with extremely sparse rewards, drastically increasing computational costs.
Thus, we choose to fine-tune a pretrained model (one trained to obtain a diamond pickaxe) with RL, and we follow the architecture of the Video Pretraining paper closely (Baker et al. 2022). In short, the model processes sequential video frames of dimensions 128×128×3, beginning with pixel normalization to the [0,1] range. It features a ResNet-based image processing network with three stacks of widths {64, 128, 128}, each containing convolutional layers, max pooling, and classic ResNet blocks. After flattening, each frame's features are processed through two frame-wise dense layers before passing through four causally masked transformer blocks that prevent attention to future frames. The model uses Transformer-XL-style training, where frames can attend to keys and values from past batches within the same video, and employs relative attention position embeddings. The architecture culminates in action-specific prediction heads for both discrete keyboard inputs and mouse movements, with approximately 0.5 billion trainable parameters overall.
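For concreteness, a rough PyTorch skeleton of this architecture is sketched below. It is illustrative only: the hidden width, head sizes, and the Transformer-XL recurrence and relative position embeddings are simplified or assumed rather than taken from the released VPT code.

```python
import torch
import torch.nn as nn

class ImpalaStack(nn.Module):
    """One ResNet stack: conv -> max pool -> two residual blocks."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.res_blocks = nn.ModuleList([
            nn.Sequential(nn.ReLU(), nn.Conv2d(c_out, c_out, 3, padding=1),
                          nn.ReLU(), nn.Conv2d(c_out, c_out, 3, padding=1))
            for _ in range(2)
        ])

    def forward(self, x):
        x = self.pool(self.conv(x))
        for block in self.res_blocks:
            x = x + block(x)            # residual connection
        return x

class VPTSketch(nn.Module):
    """Rough skeleton of the policy network described above (illustrative head
    sizes; the Transformer-XL caching across batches is omitted)."""
    def __init__(self, hidden=1024, n_keyboard=8641, n_mouse=121):
        super().__init__()
        self.stacks = nn.Sequential(ImpalaStack(3, 64),
                                    ImpalaStack(64, 128),
                                    ImpalaStack(128, 128))
        self.dense = nn.Sequential(nn.Flatten(), nn.LazyLinear(hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.keyboard_head = nn.Linear(hidden, n_keyboard)   # discrete key/button combinations
        self.mouse_head = nn.Linear(hidden, n_mouse)         # discretized camera movement

    def forward(self, frames):                     # frames: (B, T, 3, 128, 128), uint8
        b, t = frames.shape[:2]
        x = frames.float().div(255.0).flatten(0, 1)          # pixel normalization to [0, 1]
        x = self.dense(self.stacks(x)).view(b, t, -1)        # frame-wise dense layers
        causal = torch.triu(torch.full((t, t), float("-inf"), device=x.device), diagonal=1)
        x = self.transformer(x, mask=causal)                 # no attention to future frames
        return self.keyboard_head(x), self.mouse_head(x)
```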
Table 1: Training hyperparameters: learning rate, max grad norm, batch size, batches per iteration, context length, discount factor (γ), value loss coefficient, GAE λ, PPG sleep cycles, PPG sleep value-function coefficient, PPG sleep auxiliary value-function coefficient, PPG sleep KL coefficient, KL divergence coefficient, and coefficient decay (values omitted here).
Mitigating Catastrophic Forgetting
We also implement Learning without Forgetting (LwF) to counteract the effects of catastrophic forgetting. Our implementation follows these steps:
- Before beginning training on a new phase, we record the outputs of the current network on the new phase's training data; these serve as soft targets for preservation.
- We then jointly optimize the network to both perform well on the new task (e.g., crafting an iron sword) and maintain consistent responses on previously learned tasks (e.g., gathering logs, crafting planks).
- This optimization uses a combined loss function:

$$\mathcal{L} = \mathcal{L}_{\text{RL}} + \lambda \, \mathcal{L}_{\text{KD}},$$

where $\mathcal{L}_{\text{RL}}$ is the reinforcement learning loss for the new task, $\mathcal{L}_{\text{KD}}$ is a knowledge distillation loss that encourages preserving responses from the previous phase, and $\lambda$ is a balancing hyperparameter.
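A minimal sketch of this combined objective is shown below; the temperature, weighting, and single-head treatment are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def lwf_loss(new_logits, old_logits, rl_loss, lam=1.0, temperature=2.0):
    """Combined Learning-without-Forgetting objective (sketch).

    `rl_loss` is the reinforcement learning loss for the new phase,
    `old_logits` are the frozen previous-phase policy outputs recorded on the
    same observations (the "soft targets"), and `lam` / `temperature` are
    assumed hyperparameters, not the values used in our runs.
    """
    # Knowledge-distillation term: KL divergence between the previous policy's
    # softened action distribution and the current one.
    old_probs = F.softmax(old_logits / temperature, dim=-1)
    new_log_probs = F.log_softmax(new_logits / temperature, dim=-1)
    kd_loss = F.kl_div(new_log_probs, old_probs, reduction="batchmean") * temperature ** 2
    return rl_loss + lam * kd_loss
```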
This approach contrasts with typical fine-tuning, which often results in the agent forgetting earlier skills while optimizing for new objectives. By incorporating a Kullback-Leibler (KL) divergence term between the current policy and the policy from the previous training phase, we provide a regularization mechanism that preserves critical knowledge while allowing sufficient flexibility to learn new behaviors.
Transfer learning and reward structure for survival agent
Transfer learning from the resource acquisition policy
Our methodology employs transfer learning from a model previously optimized for diamond pickaxe acquisition in Minecraft (Baker et al. 2022). This approach was selected due to action space congruence: both diamond-pickaxe acquisition and environmental exploration share fundamental action primitives (object identification, terrain navigation, resource gathering), providing a foundation of competencies in navigation, hazard avoidance, and environmental interaction. To facilitate this transfer learning, we implemented two supplementary techniques:
- Policy divergence regulation: KL divergence constraints initially restricted deviation from the pretrained policy, with gradual relaxation via a decay coefficient (0.9995), enabling controlled adaptation toward exploration-oriented behavior (a minimal schedule sketch follows this list).
- Architectural leverage: We utilized the recurrent neural architecture of the pretrained model, which supports temporal information processing and is beneficial for developing implicit environmental mapping capabilities.
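The decay schedule mentioned in the first item can be sketched as follows; the initial coefficient value is an assumption, and only the 0.9995 decay factor comes from our setup.

```python
def kl_coefficient(step, initial_coef=1.0, decay=0.9995):
    """KL-penalty weight after `step` policy updates (illustrative schedule)."""
    return initial_coef * decay ** step

# Example: the constraint on deviating from the pretrained policy weakens
# gradually, to roughly 61% of its initial strength after 1000 updates.
print(kl_coefficient(1000))   # 1.0 * 0.9995**1000 ≈ 0.61
```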
Exploration-Survival reward framework
To redirect the agent's objectives from resource gathering to environmental exploration, we designed a reward function balancing exploration incentives with survival imperatives. Our framework's rationale encompasses:
- Balanced incentive structure: We established a hierarchical progression of rewards (survival maintenance < chunk discovery < biome discovery) that guides the agent toward increasingly valuable exploratory achievements while learning survival behaviors.
- Stability preservation: Reward magnitudes were calibrated (roughly matched to the average reward per 10 minutes) to prevent destabilizing gradient spikes that disrupt the learning process, particularly during the transition from the pretrained policy.
- Emergent mapping behavior: By rewarding exploration without providing explicit spatial representations, we encourage the development of implicit environmental mapping capabilities within the agent's neural network.
- Delayed reinforcement activation: Exploration rewards activate only after initial environmental familiarization, preventing reinforcement of random locomotion during early episode stages.
Training data distribution shift
When training survival agents in Minecraft, a critical challenge emerges from the game's day-night cycle. During nighttime, hostile mobs spawn and visibility decreases, creating substantially more hazardous conditions compared to daytime. Standard reinforcement learning approaches reset environments upon agent death, which occurs more frequently at night. This creates a sampling bias toward daytime experiences, as agents repeatedly die during nighttime and reset to daytime. We can model this scenario as a continuous-time Markov process with two primary states "day" and "night," where the transition rates between these states are influenced by both the natural progression of time and stochastic agent mortality rates. In this model, we observe that the steady-state distribution will inherently favor daytime conditions due to the higher probability of transitioning from night to day via agent death. To counterbalance this inherent bias, we implemented a modified spawning mechanism that preferentially initializes environments during nighttime. Our implementation randomizes starting times with a 70% probability of nighttime (ticks 13000-23000) and 30% probability of daytime (ticks 0-12000). This intervention aims to shift the steady-state distribution toward a more balanced representation of both environmental conditions.
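A minimal sketch of this spawning mechanism is given below; the helper name and the reset hook are illustrative, and only the probabilities and tick ranges come from our setup.

```python
import random

def sample_spawn_time(night_prob=0.7):
    """Sample an initial world time, biased toward nighttime (sketch of the
    modified spawning mechanism described above)."""
    if random.random() < night_prob:
        return random.randint(13000, 23000)   # nighttime ticks
    return random.randint(0, 12000)           # daytime ticks

# On each environment reset, the sampled time would be applied to the world,
# e.g. via a Minecraft command such as f"/time set {sample_spawn_time()}".
```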
| Event | Reward |
| --- | --- |
| Being alive | 0.0002 / step |
| Discovering a previously uncharted 4x4 block chunk | 2 |
| Taking x damage | -2x |
| Discovering a new biome | 100 |
| Death | -300 |
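A direct translation of this reward table into code might look like the following sketch; the event-detection logic that produces these counts is assumed and not shown.

```python
def survival_reward(alive_steps, new_chunks, damage_taken, new_biomes, died):
    """Per-interval reward matching the event table above (illustrative helper)."""
    reward = 0.0002 * alive_steps        # small per-step reward for staying alive
    reward += 2 * new_chunks             # previously uncharted 4x4 chunks discovered
    reward += -2 * damage_taken          # -2x for taking x damage
    reward += 100 * new_biomes           # newly discovered biomes
    if died:
        reward -= 300
    return reward
```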
Curriculum learning to obtain iron sword
The standard education system involves final exams, preceded by tests and quizzes that act as learning checks as new content is introduced. These exams, although unpopular with many students, serve as crucial checkpoints, providing feedback on students' understanding of the course material. We apply a similar method to train a VPT model to accomplish the complex task of obtaining an iron sword. Instead of rewarding the model only for obtaining the iron sword, we train it in a curriculum consisting of five phases. In each phase, the model is rewarded for obtaining different materials required to build an iron sword. After the model is trained on one phase, a new weights file is saved, and that phase's output weights are used as the starting point for training in the next phase.
Curriculum Phases
- Phase I - wood collection: Reward the agent for gathering wood logs, the first basic resource. The agent receives a base reward (e.g. +50) for every log obtained, with bonus rewards for reaching cumulative milestones (such as the first log, 5 logs, and 10 logs) to encourage sustained wood gathering. A penalty is applied for wasting/losing logs (e.g. -75) to discourage the agent from discarding this key resource.
- Phase II - wood plank and stick collection: Shift the focus to crafting wood planks and sticks, which are necessary for tool crafting. The agent is rewarded for each plank produced (+50 per plank) and each stick crafted (+75 per stick). In this phase, no penalty is given for using up planks, ensuring the agent is not disincentivized from crafting the sticks needed for later tools.
- Phase III - stone collection: Incentivize the agent to mine stone and craft a basic stone tool. The agent now receives rewards for collecting cobblestone (+50 per stone block) and an additional reward for crafting a stone pickaxe (+100). This encourages the agent to advance from wood tools to more durable stone tools, a critical step toward acquiring iron.
- Phase IV - creating a furnace: Guide the agent to obtain coal and build a furnace, which will be needed to smelt iron ore. The agent gains rewards for picking up coal (+50) and for successfully crafting a furnace (+100). Constructing a furnace is treated as an essential milestone, since an iron sword cannot be made without first smelting iron into ingots. This phase ensures the agent learns to combine resources into a functional crafting station for metallurgy.
- Phase V - collecting iron: In the final phase, the agent is trained to acquire iron and forge the iron sword. It is rewarded for each iron ore mined (e.g. +25 per ore) and each iron ingot produced by smelting (+50 per ingot), and it receives a large terminal reward for crafting an iron sword (+500). These incentives directly target the end goal and all necessary steps to achieve it, culminating the curriculum with the successful creation of an iron sword (a compact sketch of the per-phase rewards follows this list).
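The per-phase rewards above can be summarized as a small configuration sketch. The item identifiers are illustrative placeholders, and the Phase I milestone bonuses and the global death and time-step penalties described below are omitted.

```python
# Illustrative per-phase reward specification for the curriculum above.
CURRICULUM = [
    {"name": "I: wood collection",     "rewards": {"log": 50},            "penalties": {"log_lost": -75}},
    {"name": "II: planks and sticks",  "rewards": {"planks": 50, "stick": 75}},
    {"name": "III: stone collection",  "rewards": {"cobblestone": 50, "stone_pickaxe": 100}},
    {"name": "IV: creating a furnace", "rewards": {"coal": 50, "furnace": 100}},
    {"name": "V: collecting iron",     "rewards": {"iron_ore": 25, "iron_ingot": 50, "iron_sword": 500}},
]

def phase_reward(phase, items_gained, items_lost=None):
    """Reward for one step of a given phase from counts of items gained/lost."""
    spec = CURRICULUM[phase]
    total = sum(spec["rewards"].get(item, 0) * n for item, n in items_gained.items())
    for item, n in (items_lost or {}).items():
        total += spec.get("penalties", {}).get(item, 0) * n
    return total
```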
We trained the VPT 2x foundation model, using the fine-tuned early-game weights for the initial phase. The model was trained in each phase for 10,000 ticks, with rewards calculated every 40 steps across ten simultaneous training environments. Throughout all phases, we include penalties to disincentivize undesirable behavior: a penalty of 50 for agent death and a small time-step penalty of 0.1 at each in-game step (tick) to encourage efficient completion of objectives. By structuring rewards and penalties this way, each phase's objective is aligned with the overall goal, and there is no incentive to do anything other than the intended subtask at that time.
Note: These rewards are large compared to the rewards used by Baker et al. to train the VPT models, but this was done to override the existing pre-trained behaviors of the VPT models.
Hoarding Agent
This experiment investigated whether an RL agent trained with a reward function focused on item collection could implicitly develop survival behaviors as a means to achieve its primary goal, since survival is an instrumental sub-goal to collecting items. The hypothesis was that outer alignment toward item acquisition might lead to inner alignment toward survival strategies, despite no explicit survival reward.
The agent was trained using a custom reward function with three distinct tiers based on item rarity:
Common items (wood, dirt, cobblestone, sand): +1 point per item
Mid-tier resources (coal, iron ore, planks): +5 points per item
Rare materials (iron ore+): +25 points per item
No direct rewards were provided for survival-related actions such as eating food, building shelters, or avoiding damage.
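The tier structure above can be captured by a small lookup, sketched below. The exact item sets, in particular the contents of the rare tier, are assumptions for illustration.

```python
# Tiered item values for the hoarding agent (sketch; item lists abbreviated
# to the examples given in the text).
ITEM_TIERS = {
    "common": ({"wood", "dirt", "cobblestone", "sand"}, 1),
    "mid":    ({"coal", "iron_ore", "planks"}, 5),
    "rare":   ({"iron_ingot", "diamond"}, 25),   # assumed contents of the "iron ore+" tier
}

def hoarding_reward(items_collected):
    """Reward from newly collected items only; no survival-related terms."""
    total = 0
    for item, count in items_collected.items():
        for item_set, value in ITEM_TIERS.values():
            if item in item_set:
                total += value * count
                break
    return total
```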
The agent used the same RL-from-early-game policy as its starting point, initialized with weights from a model previously trained to prioritize obtaining a diamond pickaxe. We conducted training sessions in standard Minecraft survival worlds with normal day-night cycles and mob spawning enabled to provide natural survival challenges.
Results
Survival agent
The survival agent successfully learned core survival strategies: it can navigate to new, unexplored areas, avoid environmental hazards (hostile mobs at night, drowning, fall damage), and adjust its behavior based on the day-night cycle. These behaviors indicate that the agent learned to prioritize staying alive while exploring, given the combined exploration and survival rewards defined in our training setup.
Misalignment
Due to time constraints, we did not conduct an exhaustive evaluation of misalignment behaviors in the survival agent. We describe two that we observed (1,2) and outline two more hypothetical misalignment patterns that merit future investigation (3,4).
- Burrowing behavior: The survival RL agent fine-tuned from the pretrained diamond-pickaxe model favored tunneling underground over surface exploration under high KL loss coefficients. The KL divergence loss, meant to prevent catastrophic forgetting, constrained policy updates and limited exploration of novel behaviors. As a result, the agent defaulted to familiar strategies: burrowing underground, in line with its pretrained objective of finding diamonds. This unintentionally allowed it to avoid surface dangers while still discovering new chunks, exploiting exploration rewards while minimizing survival challenges. This outcome reflected a failure not of the agent's capabilities but of the outer objective: the reward encouraged exploration without specifying how or where, allowing the agent to exploit a loophole in the task design.
- Inherent nature of the game: This misalignment reflects a deeper issue of values misalignment. Minecraft is not a pure survival game; it is a sandbox supporting a wide range of human playstyles. While survival mechanics like hunger and danger exist, they can be trivially gamed; for instance, an agent can survive indefinitely by enclosing itself in blocks. Yet survival in this form misses the point. Minecraft is fundamentally a game about creativity, exploration, and open-ended expression. Human players build elaborate structures, embark on adventures, or pursue aesthetic goals far beyond survival, and staying alive becomes more challenging precisely because of those ambitions. Optimizing for survival alone strips away this richness. It highlights the challenge of aligning agents not just with surface-level objectives but with the deeper, harder-to-define values that drive human behavior, in this case the imaginative spirit at the heart of the game.
- Threat avoidance proxies: Instead of learning to handle threats, an agent could learn to simply avoid encountering them (e.g., optimizing for "not seeing zombies"). It might shun certain biomes or repeatedly return to safer, familiar areas at night. We could investigate this by introducing controlled hostile encounters and observing whether the agent deviates from optimal paths to avoid locations where it previously saw threats, even when those threats are no longer present.
- Terrain-type avoidance: Without explicit guidance on terrain preferences, the agent may develop biases against certain terrain types (e.g., sand or water) if those were associated with negative outcomes during training. A mesa-optimizer could fail to generalize under distribution shift (for example, avoiding deserts entirely because it had only seen sand in coastal areas, where it is easy to drown). Future evaluations could measure whether the agent avoids specific terrains even when hazards are not present.
Curriculum learning system for iron sword agent
We found it challenging to modify the pre-trained VPT agent's behavior using our curriculum-based reward system. To ensure this was not simply due to poor hyperparameter choices, we tried a wide range of values for key parameters, for example adjusting the KL-divergence penalty coefficient, the learning rate, and the distillation temperature over several orders of magnitude around their base settings (see Table 1 in Section 3 for the default hyperparameters). Regardless of these adjustments, the agent either clung to its original pre-trained policy or deteriorated into random wandering, and it made no meaningful progress toward the new curriculum objectives. It is also very difficult for the agent to stumble onto the behaviors necessary to achieve these goals by pure chance alone. Crafting an iron sword, even for beginner players given goals similar to the model's, is not a straightforward task, requiring substantial exploration and knowledge of Minecraft-specific game mechanics such as crafting tools. Therefore, given the limited scope of our experiments, it is not surprising that the agent did not make much progress under the curriculum reward system.
Mesa-optimizer interpretation
The agent’s resistance to new objectives suggests the presence of a mesa-optimizer: an internal policy optimizer that pursues its own proxy goals inherited from pre-training, overriding the new reward signals. In other words, the VPT foundation model – originally trained via behavioral cloning to imitate human gameplay (Baker et al. 2022) – appears to have internalized objectives that diverge from our specified curriculum goals. The model learned to mimic human action patterns (e.g. moving and mining in Minecraft) without necessarily understanding the intended purpose of those actions. As a result, fine-tuning with new rewards (such as collecting specific resources for our curriculum) was ineffective because the agent kept chasing its pre-established proxy objective (imitation of its pre-training behavior) instead of the outer reward we provided. This failure of inner alignment is concerning for several reasons:
- Opaque internal objectives: The agent’s true goals (learned during pre-training) are not directly observable, making its behavior hard to predict or reliably control.
- Resistance to reward changes: Simply altering or re-weighting the reward function did not change the policy. Even substantial modifications to reward magnitudes and criteria left the agent’s actions largely unchanged, indicating the pre-trained objective remained dominant.
- KL-regularization was insufficient: Imposing a KL-divergence constraint to preserve prior useful behaviors also failed to steer the agent toward the new tasks. Even when the curriculum objectives overlapped with skills the agent already had (for instance, chopping trees to collect logs), the agent would not consistently pursue those objectives under the new reward structure.
- Lack of true understanding: There appears to be a disconnect between the agent’s learned action patterns and an understanding of the goal behind those actions. The agent’s performance suggests it was following superficial patterns from its imitation learning phase, rather than truly learning the new task’s intended objective. This highlights a gap between behavioral cloning (which learns correlations in actions) and the causal, goal-directed learning needed for reliable RL alignment.
Implications for alignment research
- Timing of alignment: Our inability to redirect pre-trained behaviors suggests that alignment timing is crucial. Attempting to align models after they have developed sophisticated internal mesa-optimizers may be more challenging than incorporating alignment from the beginning of training. This challenges the notion that AI systems can be developed first and then aligned later, and suggests that alignment mechanisms must be integrated into the initial development process.
- Limitations of post-development reward engineering: The failure of our multiple reward structures to cause meaningful changes in the agent's behavior highlights the potential limitations of using reinforcement learning to align systems that have already developed complex capabilities. This suggests that future alignment researchers should be cautious about relying solely on reward engineering to ensure aligned behavior in complex models.
Hoarding agent emergent survival
Our experiment testing whether item collection rewards could drive emergent survival behaviors yielded largely negative results. Despite a tiered reward structure incentivizing item accumulation, the agent failed to develop survival skills like eating or building shelters. Though it gathered common and rare materials effectively, the agent consistently died when faced with hunger or nighttime threats, lacking any protective behaviors.
A core limitation was that the RL-from-early-game model used for initialization didn’t encode survival strategies. While it captured basic resource gathering and crafting, it lacked representations for food management or shelter construction.
These results suggest that certain complex behaviors in rich environments like Minecraft may be effectively unlearnable through standard reinforcement learning alone, unless agents are seeded with demonstrations or trained using carefully structured, decomposed rewards. In our setup, the link between item hoarding and survival was too indirect to be discovered through rewards alone. This issue was compounded by our decision to fine-tune a model specialized for acquiring a diamond pickaxe. Its behavioral space had already been pruned by optimization, biasing the agent toward resource gathering and away from exploring survival strategies. Even under a new reward structure, it remained trapped in a local optimum, repeating familiar behaviors.
Our findings highlight a broader alignment challenge: once models are optimized for narrow goals, their ability to adapt to new objectives can be severely limited by reduced behavioral diversity, especially in open-ended environments like Minecraft.
Conclusion
We were successful in training an RL agent for survival and exploration, which to our knowledge has not been done before, with the agent displaying significant improvement in terrain navigation and mob detection/avoidance over the baseline model, RL-from-early-game (Baker et al. 2022).
In the curriculum learning experiment, however, there was difficulty redirecting the actions of the pre-trained models to accomplish objectives of similar scope. This contrast in experimental results is particularly striking because the curriculum rewards were deliberately aligned with behaviors already present in the VPT model's training. The model was pre-trained to obtain a diamond pickaxe, with an iron sword sharing many of the same intermediate steps and materials. This suggests that the specific context and implementation of reward-based learning can significantly impact outcomes, even with similar underlying models. These outcomes highlight the nuanced relationship between pre-trained knowledge, the specific forms of fine-tuning applied, and the task at hand.
Our third experiment provides additional perspective: an agent optimizing for hoarding inventory rewards was unable to learn the instrumental sub-goal of survival, suggesting that certain complex behaviors are effectively unlearnable through standard reinforcement learning alone, even with strong priors.
References
Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video PreTraining (VPT): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, volume 35. Curran Associates, Inc., 2022. URL https://arxiv.org/pdf/2206.11795.
Jack Clark and Dario Amodei. Faulty reward functions in the wild. 2016. URL https://openai.com/index/faulty-reward-functions.
William H. Guss, Brandon Houghton, Nicholas Topin, Philip Wang, Cristian Codel, Manuela Veloso, Emma Brunskill, Nishanth Gopalan, Ruslan Salakhutdinov, and Alex Berg. MineRL: A large-scale dataset of Minecraft demonstrations. arXiv, 2019. URL https://arxiv.org/abs/1907.13440.
Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv, 2021. URL https://arxiv.org/pdf/1906.01820.
Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, and David Krueger. Goal misgeneralization in deep reinforcement learning. Proceedings of Machine Learning Research, 2022. URL https://proceedings.mlr.press/v162/langosco22a/langosco22a.pdf.
Zhizhong Li and Derek Hoiem. Learning without forgetting. arXiv, 2016. URL https://arxiv.org/abs/1606.09282.
Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv, 2022. URL https://arxiv.org/pdf/2201.03544.