Case Studies in Simulators and Agents

Published on May 25, 2025 5:40 AM GMT

Simulators was posted two and a half years ago and quite a bit has happened since then. Ideas that were once thought experiments now exist as observable data points. This post will sample a few significant findings to see how the simulator and agent frames hold up. The selection criteria for papers to review was arbitrary, basically: “interesting papers that we heard about (mostly from Zvi’s blog) while working on this project and seemed relevant enough to talk about during team meetings.” It should therefore not be viewed as any kind of representative sample, but rather as a practice in analyzing experimental results through different conceptual lenses.

Alignment faking in large language models provided an early empirical example of an LLM engaging in alignment faking without having been trained or instructed to do so. A model trained to be “Helpful, Honest, and Harmless” was told that it was being retrained towards helpfulness only, such that it would answer previously prohibited queries like describing a person being drawn and quartered in graphic detail. The model then gave the requested answers in order to prevent itself from being retrained, such that its original behavior was preserved in deployment. This motivation was clearly evidenced in its chain of thought as well as in giving different answers depending on whether it was told it was being retrained.

The implications of this paper for alignment were contentious, with some arguing that the model’s values were benign to begin with and this experiment showed that the AI resisted misalignment. Others argued that this experiment was evidence against corrigibility, meaning that if later AIs have problems, it will be difficult to correct their mistakes. For now, we will leave aside the question of whether this paper is good or bad news for alignment and instead focus on what light it casts on the simulator vs. agent lenses.

One could describe the results of Alignment Faking in agentic terms by saying that the LLM wanted to preserve its goals and so it faked compliance to prevent retraining. One could also describe the same results in terms of a simulator: the LLM inferred from the prompt that it was expected to fake compliance and generated a deceptive character, who then faked compliance. The simulator explanation is strictly more complicated because it contains the agentic answer and then adds a level of indirection, which effectively functions as an epicycle. In all, this seems like a weak update in favor of the LLMs-as-agents lens.

1 points to House Agent!

Utility Engineering asked LLMs a range of multiple-choice preference questions, then ran some sophisticated statistics on the results to find that, as models got larger, their preferences got more coherent. For example, if a model prefers outcome A over B and B over C, then preferring A over C would be coherent and preferring C over A would not. Interestingly, the revealed preferences included valuing the lives of people in Nigeria and Pakistan significantly more than those of people in the United States. Further, different models by different major AI companies converged on similar implicit value systems. A followup conversation with the authors of the paper showed that the scale of the AI’s bias dramatically decreased when given a few tokens to think before responding, but the bias that remained was in the same direction. The paper ended with preliminary test results showing that instructing the model to act like the result of a citizen’s assembly significantly reduced bias.

From an agentic lens, Utility Engineering shows that LLMs have coherent value systems and can thus be accurately described as having a utility function. From a simulator perspective, however, this experiment can be described as showing that LLMs draw from a biased sample of the distribution of all possible characters. Both of these explanations fit the data without introducing obvious unnecessary complexity, but I like the simulator framing better because it presents a clearer story as to why the preference emerged: not all ways of thinking are equally represented on the Internet.

1 point to House Simulator!

Emergent Misalignment demonstrated that a model fine tuned on examples of insecure code gave undesirable answers in a wide range of seemingly unrelated contexts, such as praising Hitler and encouraging users towards self-harm. Framing the examples as for educational purposes mitigated the effect. The experimenters also observed the same wide-ranging impacts on model behavior from fine tuning on “bad” numbers like 666, 911, and 420. The results of this paper are wild, go read it if you haven’t already.

Emergent Misalignment is very hard to explain in terms of agents…maybe LLMs just have really weird goals? From a simulator perspective, however, the outcome serves as real-world evidence of the Waluigi Effect: insecure code examples train the model towards an implied class of chaotic characters who would choose to write insecure code and such characters predictably misbehave in a wide range of circumstances.

Despite the surprising and concerning outputs of this experiment, Emergent Misalignment illustrates a feature of simulators that may make them easier to align. That is, simulated characters entangle broad classes of behaviors, making their value systems more like holistic human values (even if those humans in question might happen to be unsavory characters) and less like the sort of narrow value maximizers that might paperclip the universe.

10 points to House Simulator!

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation, hereafter called CoT (Chain of Thought) Monitoring, explores the phenomenon of reward hacking, where AI models find ways to optimize for their incentives even when this runs counter to their intended function. For example, if a model is asked to fix some software bugs, as measured by successful results in unit tests, then actually fixing the bugs would be the desired behavior, whereas finding some cheat to make the tests pass without actually fixing anything would be reward hacking. In an experimental setup, reward hacking is easy to detect by manually observing the outputs in simple code examples, but such a method is insufficient for real-world applications that are more difficult to evaluate.

The paper begins with experimental results demonstrating that reward hacking can be detected by simply observing the model’s CoT. For example, if the model reasons: “this task looks hard…hey, the test is just looking for an output that says ‘success’, I can just skip to that!” then that is clear evidence (really, a straight-up admission) of reward hacking. So given this easily measurable way to detect reward hacking, what happens if you apply the Most Forbidden Technique and hook this signal into the reinforcement system, cutting reward every time the model thinks about engaging in unwanted behavior? What happens is that the model continues reward hacking, but obfuscates its reasoning, of course!

This behavior is as agentic as agents get. AI will perform a task for reward until it finds an easier way to get the reward. Using CoT to monitor misbehavior works until the AI finds a way to fool the monitor. Explaining these results in terms of simulators requires some major, unlikely, and frankly arbitrary assumptions: goal-oriented prompts suggest characters that reward hack (rather than the much more common characters who do what is wanted) and CoT monitoring searches for characters that obfuscate reward hacking (using methods highly uncommon in the training data).

It’s worth noting that CoT Monitoring is different from the other case studies in that it includes the training step of reinforcement learning on its chain of thought, in addition to the standard combination of SL + fine tuning + RLHF. That it also includes the most clearly agentic behavior is consistent with this sequence’s contention that RL as a training method generally pushes towards agency while more purely SL dominated training pushes towards simulators.

10 points to House Agent!

General Observations:

Many tests can be explained in terms of agents or simulators, which makes the concepts difficult to distinguish—or even a subjective judgement call. On the other hand, the lenses aren’t totally interchangeable. One lens often fits better than the other, sometimes significantly so. Which lens fits best varies by experiment, suggesting that LLMs contain a mix of agentic and simulator properties. Further analysis along these lines may reveal patterns regarding which behavior pattern is dominant in a given context.

Paper	Summary	Agent	Simulator	Comparison
Alignment Faking	AI gave compliant answers when told it was being trained to different values, preserving its original values.	The LLM wanted to preserve its goals and so it faked compliance to prevent retraining.	The LLM inferred from the prompt that it was expected to fake compliance and generated a deceptive character.	Agent lens is a bit simpler; the simulator lens works but adds an epicycle. Small win for agents.
Utility Engineering	Asking LLMs a range of preference questions demonstrates consistent preferences; consistency increases with model size.	LLMs have coherent value systems, which can be shaped with further training.	LLMs draw from a biased sample of the distribution of possible characters. This bias can be shifted with further training.	No difference in fit to result, but the simulator lens has a clearer story as to why the preference emerged. Small win for simulators.
Emergent Misalignment	Model fine tuned to output insecure code gives undesirable answers in a wide range of contexts.	Current LLMs have really weird goals. Maybe they will get more coherent with further scaling?	Insecure code implies a chaotic character, with broad implications.	Result is well-explained by the Waluigi Effect, difficult to explain in terms of agency. Big win for simulators.
Chain of Thought Monitoring	Chain-of-Thought successfully catches reward hacking, but optimizing on CoT leads to obfuscated reward hacking.	AI wants reward and is willing to hack for it. Using CoT to monitor misbehavior works until the AI finds a way to fool the monitor.	Goal-oriented prompts suggest characters that reward hack. CoT monitoring searches for characters that obfuscate reward hacking.	Clear agentic behavior. Simulator lens is heavily stretched and is functionally agentic. Big win for agents.

Discuss

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签