Published on June 25, 2025 6:59 AM GMT
In Agents, Tools, and Simulators we outlined what it means to describe a system through each of these lenses and how they overlap. When a system looks like both an agent and a simulator, our central question is: which property is "driving the bus" with respect to the system's behavior, utilizing the other in its service?
Aligning Agents, Tools, and Simulators explores the implications of the above distinction, predicting different types of values (and thus behavior) from:
- An agent that contains a simulation of the world that it uses to navigate.
- A simulator that generates agents because such agents are part of the environment the system is modelling.
- A system where the modes are so entangled it is meaningless to think about where one ends and the other begins.
Specifically, we expect simulator-first systems to have holistic goals that internalize (an approximation of) human values, and agent-first systems to have narrower values that push them towards maximization.
Given these conceptual distinctions, we observe that the best-fitting lens for past AI systems seems to correspond to their training process. In short, RL-based systems (like AlphaZero) seem almost purely agentic, pre-trained LLMs seem very simulator-like, and current systems that apply (increasing amounts of) RL feedback to LLMs are an entangled mix.
Assuming this correspondence is accurate, this post asks: why?
Why Does RL Create Agents?
Reinforcement learning (RL) naturally gives rise to agentic behavior because of how it optimizes for long-term rewards. RL agents interact with an environment and adjust their strategies based on feedback. Crucially, their goals are not explicitly programmed but emerge through training. The objective function defines what is rewarded, but the model must infer how to achieve high reward by exploring possible actions. Over time, given sufficient training and model capacity, internal goals tend to align with maximizing reward. However, this process depends on avoiding local optima—suboptimal behaviors that yield short-term reward but fail to generalize. The structure of the environment therefore plays a critical role in pushing the agent toward deeper abstractions and strategic planning.
Consider a simple grid world environment where an RL agent is rewarded for collecting coins. If the coins always appear in fixed locations, the agent might learn a rigid sequence of movements, optimizing for a narrow pattern without generalizing. If the coin locations vary, however, the agent must instead develop a more flexible strategy—navigating toward coins wherever they appear. This adaptation represents the beginning of meaningfully goal-directed behavior. Introducing obstacles forces another level of adaptation, where the agent learns to avoid barriers, but only at the minimal level necessary to maintain reward. As the environment grows more complex, the agent’s learned policies shift from simple memorization to increasingly abstract heuristics, incorporating instrumental reasoning about how different elements of the environment contribute to its goal.
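To make the grid-world example concrete, here is a minimal sketch (not from the original post) of tabular Q-learning in a hypothetical grid where the coin's location changes every episode. Because the state includes the coin's position, a memorized route stops working and the learned policy has to become "navigate to wherever the coin is"; every environment detail and hyperparameter below is an illustrative assumption.

```python
# Illustrative sketch: tabular Q-learning in a 5x5 grid world where the coin
# moves each episode. A fixed coin could be solved by memorizing one route;
# a moving coin forces a policy conditioned on where the coin actually is.
import random

SIZE = 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
EPSILON, ALPHA, GAMMA = 0.1, 0.5, 0.95        # exploration, learning rate, discount

Q = {}  # maps (agent_pos, coin_pos) -> list of action values

def q_values(state):
    return Q.setdefault(state, [0.0] * len(ACTIONS))

def step(pos, action):
    # Move one cell, clamped to the grid boundaries.
    r, c = pos[0] + action[0], pos[1] + action[1]
    return (min(max(r, 0), SIZE - 1), min(max(c, 0), SIZE - 1))

for episode in range(5000):
    agent = (0, 0)
    coin = (random.randrange(SIZE), random.randrange(SIZE))  # coin location varies per episode
    for t in range(50):
        state = (agent, coin)
        # Epsilon-greedy action selection.
        if random.random() < EPSILON:
            a = random.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: q_values(state)[i])
        agent = step(agent, ACTIONS[a])
        reward = 1.0 if agent == coin else 0.0
        done = agent == coin
        # Standard Q-learning update: current reward plus discounted future value.
        target = reward if done else reward + GAMMA * max(q_values((agent, coin)))
        q_values(state)[a] += ALPHA * (target - q_values(state)[a])
        if done:
            break
```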
In more advanced settings, where even the reward function itself varies, RL agents may begin to identify deeper patterns. Rather than navigating the environment within the means intentionally designed into the system, they might infer a meta-strategy of pursuing generically useful instrumental strategies like power-seeking—or simply discover a way to tamper with the reward system itself. The more capable and adaptable the system, the more likely the agent is to discover these strategies.
Fortunately for AI safety timelines, it is difficult to push RL systems to reach such high levels of generality. Enter self-supervised learning (SSL).
Why Does SSL Create Simulators?
Simulators may have emerged from the SSL training process of LLMs because of the unique nature of text prediction. Unlike RL systems, which optimize toward achieving a specific goal through a series of instrumental steps, SSL-trained models have no such intermediate objectives. Their task is purely to predict the next token given a context—once that prediction is made, their job is complete. There is no direct feedback loop where the model's actions influence the world and shape future outcomes, as is the case with RL.
…Actually, this isn’t entirely accurate. LLMs are not fully myopic next-token predictors. But the contrast to RL with respect to the degree of lookahead stands.
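For contrast with the RL setup above, here is a minimal sketch of the self-supervised next-token objective, assuming a generic PyTorch-style causal language model whose forward pass maps token ids to logits (the model and the function name are placeholders, not something defined in this post). The only training signal is the cross-entropy between the model's prediction at each position and the token that actually came next; nothing the model does during training feeds back into its future inputs.

```python
# Illustrative sketch of the SSL next-token objective (PyTorch-style; the model is assumed).
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """tokens: integer tensor of shape (batch, seq_len) holding token ids."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict position t+1 from positions <= t
    logits = model(inputs)                           # assumed shape: (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),         # flatten batch and sequence positions
        targets.reshape(-1),                         # the "label" at each position is just the next token
    )
```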
Because language is deeply entangled with human cognition, the space of text that an LLM must model is extraordinarily complex. The vast amount of text available for training far exceeds what can be directly memorized within an LLM’s parameters. And on the output side, an LLM must be able to generate plausible text across an enormous range of contexts, from technical explanations to fictional dialogue. Discarding irrelevant information or hyper-focusing on narrow objectives are not winning strategies for text prediction because all information might be relevant and any narrow goal might become a focus given the right context.
Abstraction serves the dual purposes of extending behavior and compressing data. For LLMs, this means developing efficient internal representations that capture patterns of human communication. If the model can simulate an author or a reasoning process well, it can predict how that author or process would naturally continue a given text. If it can abstract the properties that differentiate one author or process from another, it can recombine those properties into a broad range of plausible speakers, reasoning styles, and mental states.
In short, simulation is a useful proxy for prediction.
General Principles
If RL seems to produce agents without broadly general capabilities, and SSL seems to produce simulators with limited agency, what underlying principles drive the emergence of agency and simulation, and how can we use them to anticipate what will happen as these techniques get blended together or supplemented with other forms of training? Over the course of researching and writing this sequence, we’ve formed the following intuitions:
Agent emergence: A core driver of instrumentality seems to be the feedback gap—the delay, uncertainty, or inference depth required between action and feedback. The larger this gap, the more an AI system must develop instrumental strategies, such as situational awareness, long-term planning, and influencing its environment to achieve high reward.
Simulator emergence: Simulators are the obvious (in hindsight) result of directly training a system to imitate a dataset. They diverge from agents in that there is essentially no gap between action and feedback. Simulation becomes more interesting—progressing from memorization to finding abstract rules—as a result of the need to compress diverse interactions within a complex environment or dataset. The richer and more varied the context, the more an AI system must develop generalizable abstractions, such as world modeling.
Under the current training paradigm of SSL + RL + CoT, we hypothesize that blended agency and simulation work as follows: an agent’s cognition mainly takes place in the later layers of the network, while the earlier layers produce a simulation that the agent can draw on as input. At the beginning of agency-producing training, the agent knows nothing and passively accepts the results of simulation. As such training progresses, the agent learns which simulator concepts are useful and how to interact with the simulator to gain more utility, eventually gaining increasing control of the system.
One can visualize this dynamic with the metaphor of a conversation between a simulated human chess grandmaster and an alien optimizer, where the former is trying to predict the moves of its character and the latter is actually trying to win the game. At the beginning of the optimizer’s training, its goals are best served by asking the grandmaster for the best move and choosing that. As training proceeds, the optimizer starts asking the grandmaster increasingly sophisticated questions about why it is choosing its moves. Eventually, the optimizer starts to notice mistakes in the grandmaster’s reasoning—or, more accurately, points of divergence between reasoning that leads to effective imitation and reasoning that leads to winning. As the optimizer’s understanding continues to grow, it progresses from working within the grandmaster’s concepts, and occasionally finding “mistakes,” towards a deeper, first-principles understanding of chess. This allows the optimizer to pursue increasingly radical new lines of inquiry, increasingly disregarding the grandmaster’s recommendations on moves while continuing to use the grandmaster’s understanding of how pieces move on the board. Finally, the optimizer is in full control of the reasoning process, with the grandmaster’s influence persisting only as a vestige of the bootstrapping process.
In practice these dynamics do not necessarily follow a clean, sequential order. If, for example, training alternates between different modes, then we would expect the boundaries between simulation and agency to continuously blur, with capabilities co-evolving rather than cleanly layering, leading to behavior that is harder to characterize.
Agency Plus Simulation Go Boom?
RL-based agents made for powerful optimizers but struggled with the kind of abstraction needed to achieve common sense, let alone superhuman situational awareness. SSL-based simulators contain impressive feats of abstraction but have to be pushed to display any goal-directedness at all. There may be a third, mystery ingredient to acting effectively in the world, but if so it is not obvious what that is. In any case, one would hope that whoever puts agency and simulation together knows on a very deep level what they are doing.