Emergence of Simulators and Agents

This post examines the relationship between agents, tools, and simulators in AI systems, and how each shapes an AI's behavior and values. It argues that reinforcement-learning (RL) based systems tend to produce agentic behavior, while pre-trained large language models (LLMs) behave more like simulators. By analyzing the training processes of RL and SSL (self-supervised learning), it explains why these systems develop such different behavioral patterns. Finally, it looks ahead to the blending of agents and simulators and stresses that a deep understanding of the underlying principles of AI systems is essential for building safe, reliable AI.

🤖 **How agents emerge:** RL-based AI systems learn by interacting with an environment and adjusting their strategies based on feedback. Their goals are not pre-specified but form gradually during training: the reward function defines the objective, and the model maximizes reward through exploration, giving rise to agentic behavior. The structure of the environment is critical to this development; more complex environments push agents toward more abstract strategies.

🧠 **How simulators emerge:** LLMs trained with self-supervised learning (SSL) behave more like simulators. Their task is simply to predict the next token, with no explicit intermediate objectives. Because language is deeply entangled with human cognition, LLMs must process vast amounts of text and compress it through abstraction, which leads them to simulate authors and reasoning processes.

💡 **Combining agents and simulators:** As RL and SSL are blended, the boundary between agent and simulator within an AI system blurs. The post hypothesizes that in such hybrid systems the agent's cognition mainly takes place in the later layers of the network, while the earlier layers produce a simulator. The agent learns how to exploit the simulator for more utility and gradually takes control of the whole system.

⚠️ **Outlook and risks:** The post stresses that understanding how agents and simulators operate within AI systems is essential for building safe, reliable AI. As the two techniques are combined, their interaction needs to be understood in depth to avoid potential risks.

Published on June 25, 2025 6:59 AM GMT

In Agents, Tools, and Simulators we outlined what it means to describe a system through each of these lenses and how they overlap.  In the case of an agent/simulator, our central question is: which property is "driving the bus" with respect to the system's behavior, utilizing the other in its service?

Aligning Agents, Tools, and Simulators explores the implications of the above distinction, predicting different types of values (and thus behavior) from

1. An agent that contains a simulation of the world that it uses to navigate.
2. A simulator that generates agents because such agents are part of the environment the system is modelling.
3. A system where the modes are so entangled it is meaningless to think about where one ends and the other begins.

Specifically, we expect simulator-first systems to have holistic goals that internalize (an approximation of) human values and for agent-first systems to have more narrow values that push them towards maximization.

Given these conceptual distinctions, we observe that the best-fitting lens for past AI systems seems to correspond to their training process.  In short, RL-based systems (like AlphaZero) seem almost purely agentic, pre-trained LLMs seem very simulator-like, and current systems that apply (increasing amounts of) RL feedback to LLMs are an entangled mix.

Assuming this correspondence is accurate, this post asks: why?

Why does RL Create Agents?

Reinforcement learning (RL) naturally gives rise to agentic behavior because of how it optimizes for long-term rewards.  RL agents interact with an environment and adjust their strategies based on feedback.  Crucially, their goals are not explicitly programmed but emerge through training.  The objective function defines what is rewarded, but the model must infer how to achieve high reward by exploring possible actions.  Over time, given sufficient training and model capacity, internal goals tend to align with maximizing reward.  However, this process depends on avoiding local minima—suboptimal behaviors that yield short-term reward but fail to generalize.  The structure of the environment therefore plays a critical role in pushing the agent toward deeper abstractions and strategic planning.

Consider a simple grid world environment where an RL agent is rewarded for collecting coins.  If the coins always appear in fixed locations, the agent might learn a rigid sequence of movements, optimizing for a narrow pattern without generalizing.  If the coin locations vary, however, the agent must instead develop a more flexible strategy—navigating toward coins wherever they appear.  This adaptation represents the beginning of meaningfully goal-directed behavior.  Introducing obstacles forces another level of adaptation, where the agent learns to avoid barriers, but only at the minimal level necessary to maintain reward.  As the environment grows more complex, the agent’s learned policies shift from simple memorization to increasingly abstract heuristics, incorporating instrumental reasoning about how different elements of the environment contribute to its goal.
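To make the contrast concrete, below is a minimal tabular Q-learning sketch of the varying-coin case (an illustration added for this discussion, not code from the original post; the grid size, rewards, and hyperparameters are arbitrary). Because the coin's position is part of the state, the agent cannot memorize a single move sequence; the policy it learns has to encode "move toward the coin wherever it appears."

```python
# Minimal sketch: tabular Q-learning in a grid world where the coin's
# location is randomized every episode.  All constants are illustrative.
import random
from collections import defaultdict

SIZE = 5                                          # 5x5 grid
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]      # right, left, down, up
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

Q = defaultdict(float)                            # Q[(agent_pos, coin_pos, action)]

def step(agent, coin, a):
    """Move the agent; +1 for reaching the coin, a small cost otherwise."""
    dx, dy = ACTIONS[a]
    nxt = (min(max(agent[0] + dx, 0), SIZE - 1),
           min(max(agent[1] + dy, 0), SIZE - 1))
    return (nxt, 1.0, True) if nxt == coin else (nxt, -0.01, False)

def choose(agent, coin):
    """Epsilon-greedy action selection over the joint (agent, coin) state."""
    if random.random() < EPSILON:
        return random.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: Q[(agent, coin, a)])

for episode in range(2000):
    agent = (random.randrange(SIZE), random.randrange(SIZE))
    coin = agent
    while coin == agent:                          # coin location varies per episode
        coin = (random.randrange(SIZE), random.randrange(SIZE))
    done = False
    while not done:
        a = choose(agent, coin)
        nxt, reward, done = step(agent, coin, a)
        best_next = max(Q[(nxt, coin, b)] for b in range(len(ACTIONS)))
        target = reward + (0.0 if done else GAMMA * best_next)
        Q[(agent, coin, a)] += ALPHA * (target - Q[(agent, coin, a)])
        agent = nxt
```

If the coin were instead fixed, the same update rule would happily converge to a memorized path; it is the variation in the environment, not the algorithm, that forces the more general "seek the coin" policy.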

In more advanced settings, where even the reward function itself varies, RL agents may begin to identify deeper patterns.  Rather than navigating the environment within the means intentionally designed into the system, they might infer a meta-strategy of pursuing generically useful instrumental strategies like power-seeking—or simply discover a way to tamper with the reward system itself.  The more capable and adaptable the system, the more likely the agent is to discover these strategies.

Fortunately for AI safety timelines, it is difficult to push RL systems to reach such high levels of generality.  Enter self-supervised learning (SSL).

Why does SSL Create Simulators?

Simulators may have emerged from the SSL training process of LLMs because of the unique nature of text prediction.  Unlike RL systems, which optimize toward achieving a specific goal through a series of instrumental steps, SSL-trained models have no such intermediate objectives.  Their task is purely to predict the next token given a context—once that prediction is made, their job is complete.  There is no direct feedback loop where the model's actions influence the world and shape future outcomes, as is the case with RL.

…Actually, this isn’t entirely accurate.  LLMs are not fully myopic, next-token predictors.  But the contrast to RL with respect to the degree of lookahead stands.
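For concreteness, the pre-training objective described above reduces to something like the following sketch (an illustrative toy assuming PyTorch; the LSTM is a stand-in for whatever sequence model is actually used, and the token data is random). Each position is scored only on how well it predicts the actual next token.

```python
# Toy sketch of the SSL / next-token objective.  Model and data are placeholders.
import torch
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
embed = torch.nn.Embedding(vocab_size, d_model)
model = torch.nn.LSTM(d_model, d_model, batch_first=True)   # stand-in sequence model
head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 128))   # a batch of token sequences

hidden, _ = model(embed(tokens[:, :-1]))          # context: tokens up to position t
logits = head(hidden)                             # distribution over token t+1
loss = F.cross_entropy(logits.reshape(-1, vocab_size),   # scored against the real
                       tokens[:, 1:].reshape(-1))        # next token, nothing else
loss.backward()                                   # no environment, no rollout
```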

Because language is deeply entangled with human cognition, the space of text that an LLM must model is extraordinarily complex.  The vast amount of text available for training far exceeds what can be directly memorized within an LLM’s parameters.  And on the output side, an LLM must be able to generate plausible text across an enormous range of contexts, from technical explanations to fictional dialogue.  Discarding irrelevant information or hyper-focusing on narrow objectives are not winning strategies for text prediction because all information might be relevant and any narrow goal might become a focus given the right context.

Abstraction serves the dual purposes of extending behavior and compressing data.  For LLMs, this means developing efficient internal representations that capture patterns of human communication.  If the model can simulate an author or a reasoning process well, it can predict how that author or process would naturally continue a given text.  If it can abstract the properties that differentiate one author or process from another, it can recombine those properties into a broad range of plausible speakers, reasoning styles, and mental states.

In short, simulation is a useful proxy for prediction.

General Principles

If RL seems to produce agents without broadly general capabilities and SSL seems to produce simulators with limited agency, what underlying principles drive the emergence of agency and simulation, and how can we use them to anticipate what will happen as these techniques get blended together or supplemented with other forms of training? Over the course of researching and writing this sequence, we’ve formed the following intuitions:

Agent emergence: A core driver of instrumentality seems to be the feedback gap—the delay, uncertainty, or inference depth required between action and feedback.  The larger this gap, the more an AI system must develop instrumental strategies, such as situational awareness, long-term planning, and influencing its environment to achieve high reward.

Simulator emergence: Simulators are the obvious (in hindsight) result of directly training a system to imitate a dataset.  They diverge from agents by the lack of distance between action and feedback.  Simulation becomes more interesting—progressing from memorization to finding abstract rules—as a result of the need to compress diverse interactions within a complex environment or dataset.  The richer and more varied the context, the more an AI system must develop generalizable abstractions, such as world modeling.

Under the current training paradigm of SSL + RL + CoT, we hypothesize that blended agency and simulation works as follows: an agent’s cognition mainly takes place in the later layers of the network, while the earlier layers produce a simulation that the agent can draw on as input.  At the beginning of agency-producing training, the agent knows nothing and passively accepts the results of simulation.  As such training progresses, the agent learns which simulator concepts are useful, how to interact with the simulator to gain more utility, and eventually gains increasing control of the system.
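One crude way to picture this hypothesis in code (purely an assumed illustration, not a description of how any real system is trained): freeze the early layers as the "simulator" and let a REINFORCE-style reward signal update only the later layers, so that agentic optimization pressure concentrates at the top of the network, which must learn to exploit the representations coming from below.

```python
# Assumed illustration of "agent in the later layers, simulator in the earlier ones".
import torch

d_model, vocab_size, n_layers = 64, 1000, 8
layers = torch.nn.ModuleList(
    torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    for _ in range(n_layers)
)
head = torch.nn.Linear(d_model, vocab_size)

# Pretend the first half of the stack was shaped by SSL pre-training: freeze it.
for layer in list(layers)[:n_layers // 2]:
    for p in layer.parameters():
        p.requires_grad = False

trainable = [p for p in list(layers.parameters()) + list(head.parameters())
             if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=1e-4)

def reward_fn(actions):
    # Placeholder reward; in practice this would come from an environment
    # or a learned reward model.
    return (actions % 2 == 0).float()

x = torch.randn(8, 16, d_model)                   # stand-in for embedded context
h = x
for layer in layers:                              # frozen "simulator" below,
    h = layer(h)                                  # trainable "agent" above
logits = head(h[:, -1])                           # policy over next tokens/actions
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
loss = -(reward_fn(actions) * dist.log_prob(actions)).mean()   # REINFORCE
loss.backward()                                   # gradients only reach the top
opt.step()
```

In this toy, the reward gradient reshapes only the upper layers, which is one mechanical way the "agent learns which simulator concepts are useful" dynamic could play out; real training pipelines are of course far messier.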

One can visualize this dynamic with the metaphor of a conversation between a simulated human Chess grandmaster and an alien optimizer, where the former is trying to predict the moves of its character and the latter is actually trying to win the game.  At the beginning of the optimizer’s training, its goals are best served by asking the grandmaster for the best move and choosing that.  As training proceeds, the optimizer starts asking the grandmaster increasingly sophisticated questions about why it is choosing its moves.  Eventually, the optimizer starts to notice mistakes in the grandmaster’s reasoning—or, more accurately, points of divergence between reasoning that leads to effective imitation and reasoning that leads to winning.  As the optimizer’s understanding continues to grow, it progresses from working within the grandmaster’s concepts, and occasionally finding “mistakes,” towards a deeper, first-principles understanding of Chess.  This allows the optimizer to pursue increasingly radical new lines of inquiry, increasingly disregarding the grandmaster’s recommendations on moves, but continuing to use the grandmaster’s understanding of how pieces move on the board.  Finally, the optimizer is in full control of the reasoning process, the influence of the grandmaster persisting only as a bootstrapping process.

In practice these dynamics do not necessarily follow a clean, sequential order.  If, for example, training alternates between different modes, then we would expect the boundaries between simulation and agency to continuously blur, with capabilities co-evolving rather than cleanly layering, leading to behavior that is harder to characterize.

Agency Plus Simulation Go Boom?

RL-based agents made for powerful optimizers but struggled with the kind of abstraction needed to achieve common sense, let alone superhuman situational awareness.  SSL-based simulators contain impressive feats of abstraction but have to be pushed to display any goal directedness at all.  There may be a third, mystery ingredient to acting effectively in the world, but if so it is not obvious what that is.  In any case, one would hope that whoever puts agency and simulation together knows on a very deep level what they are doing.


