Agents, Simulators and Interpretability

Published on June 7, 2025 6:06 AM GMT

We have already spoken about the differences between tools, agents and simulators, and about the differences in their safety properties. Given the practical differences between them, it would be nice to predict which properties an AI system has. One way to do this is to consider the process used to train the AI system.

The training process

Predicting an AI’s properties based on its training process assumes that inner misalignment is negligible and that the model will develop properties compatible with the training goal given to it.

An example would be comparing a system such as AlphaZero, which was given no instructions in its training other than “win the game”, to an LLM with the task “predict the next token”[1]. As AlphaZero is given a terminal goal which is some distance away in time, it seems likely that it will engage in agentic behaviour, setting instrumental goals in service of eventual victory.

An LLM is given next token prediction as its task, and is therefore more likely to behave as a simulator, modelling the processes that generated its training text rather than pursuing any distant goal of its own.
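To make the contrast concrete, here is a minimal sketch of the two kinds of training signal (PyTorch, with function names and tensor shapes of our own choosing rather than either system's actual implementation): AlphaZero's supervision is the eventual game outcome, possibly many moves in the future, while an LLM's supervision is simply the next token in the text.

```python
# Illustrative only: neither loss is the exact production objective.
import torch
import torch.nn.functional as F

# AlphaZero-style signal: the target is the eventual game outcome
# (e.g. +1 for a win, -1 for a loss), far away in time from the move being evaluated.
def alphazero_value_loss(predicted_value: torch.Tensor, game_outcome: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(predicted_value, game_outcome)

# LLM-style signal: the target is the immediately following token.
def next_token_loss(logits: torch.Tensor, next_token_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); next_token_ids: (batch, seq)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), next_token_ids.reshape(-1))
```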

This change in behaviour depending on training objective should come as no surprise to anyone who has read the original simulator post.

The more interesting question is what happens once you add extra bells and whistles to this basic training process - what happens once the LLM has been fine-tuned, RLHFed and had some chain-of-thought added before it is trained on the results of the chain-of-thought, it's gone through a divorce, lost custody of its children, etc, etc. For this it might be useful to think a little more carefully about what happens within the network when it goes through multiple stages of training.

Mechanistic understanding

To preface: this is a speculative picture of what might be happening when neural networks are trained. The point here is to get an idea of what might occur, in a way that stimulates thought about how one might test for such phenomena.

We shall start with a comparison of agents and simulators on an interpretability level (we shall, for now, ignore tools, as toolness is mostly a property of the user). A first thought one might have is that agents are likely to have a representation of their end goal somewhere in their network. In idealised agents, all actions relate to a coherent terminal goal, and this goal might be represented directly. Looking at our canonical agent, AlphaZero, we find that the probability of winning is explicitly predicted by the network.
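As a rough illustration of what such an explicit goal representation can look like, here is a toy two-headed network in the AlphaZero style (a sketch of ours, not DeepMind's actual architecture): the value head makes “probability of winning” a directly inspectable quantity inside the network, exactly the kind of representation an interpretability researcher could hope to locate.

```python
# Toy policy/value network: the "goal" (win probability) is an explicit output.
import torch
import torch.nn as nn

class ToyPolicyValueNet(nn.Module):
    def __init__(self, board_features: int = 64, hidden: int = 128, n_moves: int = 100):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(board_features, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_moves)  # distribution over candidate moves
        self.value_head = nn.Linear(hidden, 1)         # explicit estimate of winning

    def forward(self, board: torch.Tensor):
        h = self.trunk(board)
        move_logits = self.policy_head(h)
        win_prob = torch.sigmoid(self.value_head(h))   # readable directly from the network
        return move_logits, win_prob
```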

But it’s not entirely clear how well this assumption holds in practice - notice that you do not need to think about making enough money to live happily forever after on a yacht with a fishing rod and a butler in order to go to work[2]. In the same way, it is very possible that an AI could represent a variety of heuristics which together correlate really closely to a coherent end goal, without actually giving us the joy of finding that representation hidden somewhere in the weights.

One of the notable features distinguishing simulators from agents is their ability to simulate multiple different characters. Restricting ourselves to modern-day LLMs, we notice that the information flowing through the network is held in high-dimensional “embedding” vectors which carry the nuance of the context.

It is possible that part of the information in these vectors is a space of “characters”, with potential directions within this space corresponding to concepts such as “friendly”, “Australian” or “quick to anger”. 

The actual implementation of characters would therefore be compressed down to a set of characteristics in a vector space. These characteristics could then be acted upon by the circuits in the network, performing actions such as suppressing swear words for “friendly” characters and increasing them for “sailor” characters[3].
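As a toy illustration of this compressed-characteristics picture (everything below is hypothetical: real LLM features are learned, not hand-assigned like this), a character could be a weighted mix of trait directions in the embedding space, and a downstream “circuit” could read those traits back off and nudge token logits accordingly.

```python
# Hypothetical sketch of characters as directions in an embedding space.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Made-up trait directions; in a real model these would be learned features.
friendly_dir = rng.standard_normal(d_model)
friendly_dir /= np.linalg.norm(friendly_dir)
sailor_dir = rng.standard_normal(d_model)
sailor_dir /= np.linalg.norm(sailor_dir)

# A "character" is then a point in this space: a weighted mix of traits.
character = 0.9 * friendly_dir + 0.2 * sailor_dir

# A downstream circuit reads the traits off the residual stream and adjusts
# swear-word logits: damped for friendly characters, boosted for sailors.
def adjust_swear_logits(swear_logits: np.ndarray, residual: np.ndarray) -> np.ndarray:
    friendliness = residual @ friendly_dir
    sailorness = residual @ sailor_dir
    return swear_logits - 2.0 * friendliness + 2.0 * sailorness

print(adjust_swear_logits(np.zeros(4), character))
```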

This is all well and good, but in Agents, Tools & Simulators we discussed the possibility of entities having various mixes of agent and simulator behaviour. The information above may be helpful for distinguishing between pure simulators and pure agents (indeed, we have kept it in the post to help people who might want to try), but it is unclear how it might help us when attempting to draw more subtle distinctions.

Mech interp and the training process

The kinds of high-level properties we are looking for here will be extremely difficult to determine purely through mechanistic interpretability. Looking at the training process can give insight, but hits a wall once we talk about multiple training processes. What happens if we combine both approaches?

First we need a basic picture of how a neural network develops over time. A nice example of this is given in figure 4 of Visualizing and Understanding Convolutional Networks, where we see that a convolutional network first learns to differentiate simple features in early layers, followed by building these up into more complex combinations later in the network, working down the layers as training progresses. 

Generalising this concept to other architectures, and assuming that the network is feed-forward, it is likely that[4]:

1. The circuits and concepts which eventually develop in the later layers of the network must depend on those developed in the earlier layers.
2. The concepts in the later layers must therefore develop fully later than those in the earlier layers.
3. Concepts in later layers depend on multiple concepts from earlier layers.
4. Concepts in early layers influence multiple concepts in later layers (not necessary from 3, but likely in a range of architectures).
5. Modifying a concept early in the network therefore changes the outcomes of multiple circuits later in the network.
6. In order to modify a concept early in the network, it must be beneficial on balance over multiple downstream uses.
7. Changes in late stages of training and during fine-tuning are therefore more likely to take place later in the network.
8. These changes are therefore likely to use pre-existing concepts learned from pre-training.

If we take these statements and apply them to simulator theory for LLMs, we expect that something like RLHF will most likely take the output of the early circuits and tweak the later layers to produce output more likely to be approved by humans. This means that the processing of the prompt's context - extracting nuances, figuring out which characters are at play, what characteristics they have and so on - is kept essentially static, save possibly for some edits which are globally helpful to the model in the new context.
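One cheap way to probe this prediction (in the spirit of the brief test mentioned in footnote 4) would be to take a base checkpoint and a fine-tuned counterpart and measure how far each transformer block's weights have moved. Below is a rough sketch using the Hugging Face transformers API; the checkpoint names are placeholders, and any base/fine-tuned pair with matching architecture would do.

```python
# Sketch: per-block weight drift between a base and a fine-tuned checkpoint.
# The hypothesis above predicts the drift should concentrate in later blocks.
from collections import defaultdict
import re
from transformers import AutoModelForCausalLM

BASE = "gpt2"                     # placeholder base checkpoint
TUNED = "path/to/finetuned-gpt2"  # placeholder fine-tuned / RLHF'd checkpoint

base = AutoModelForCausalLM.from_pretrained(BASE)
tuned = AutoModelForCausalLM.from_pretrained(TUNED)

base_params = dict(base.named_parameters())
drift = defaultdict(float)
for name, p_tuned in tuned.named_parameters():
    match = re.search(r"\.h\.(\d+)\.", name)  # GPT-2 blocks are named "transformer.h.<i>."
    if match is None:
        continue  # skip embeddings, final layer norm, etc.
    block = int(match.group(1))
    drift[block] += (p_tuned.detach() - base_params[name].detach()).norm().item()

for block in sorted(drift):
    print(f"block {block:2d}: total weight movement {drift[block]:.3f}")
```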

When comparing the agentic and simulator frames on this result, it seems that the model is, for all intents and purposes, a simulator for the majority of the network, with the final few layers “using the output of the simulation”. Note: we are unsure whether RLHF produces actions which value instrumental goals, and leave the classification of RLHF-trained networks up for discussion.

As reinforcement learning on multi-step problems has a tendency to produce agents, it is likely that modern chain-of-thought models are essentially agents using the output of a large simulator. Increasing the amount of RL will likely increase the fraction of the network dedicated to the agentic type of processing.

Summary

We can currently guess at whether an AI will function as a simulator or an agent based on its training process, whereas relying on interpretability alone may run into fundamental challenges. Such classifications become less crisp when multiple training processes are combined, but we broadly expect that modern CoT LLMs trained with RL will act as agents using the output of a simulator, with simulator behaviour encoded primarily in the early layers of the network and agentic behaviour primarily in the later layers.

In this post we have argued that an AI's training objective offers a first guess at whether it will behave as an agent or a simulator, sketched a speculative mechanistic picture of how each might be represented within a network, and combined the two perspectives to predict that modern CoT LLMs trained with RL will act as agents using the output of a simulator.

 

  1. ^

    Many models are now trained with multi-token prediction, as promoted by this paper. Interestingly, the motivation for this is increased reasoning capabilities, potentially due to increased agentic behaviour, although we will for simplicity mainly assume single-token prediction for the rest of the article.

  2. ^

    Admittedly, on LessWrong the dream is more likely to be a transhuman utopia, but the point is the same.

  3. ^

    Although we emphasise again that we have no empirical evidence to support this, we believe this would be a simple and well-compressed way to represent the information.

  4. ^

    Epistemic status: I tested this out briefly on a couple of small models, which confirmed some of the intuitions, although early layers still changed. I'd expect the assumptions to hold more steadfastly in larger models because more concepts are being dealt with.


