Demons, Simulators and Gremlins

This post explores the possibility of "gremlins" in language models trained with guess-and-check on-policy reinforcement learning (RL). A gremlin is a circuit or "persona" inside the model that increases its own influence by shaping the model's subsequent token choices. The post argues that as the number of on-policy RL cycles grows and the time horizon lengthens, gremlins get better at manipulating the token stream and may even overpower the model's other "personas". It also discusses how this relates to character training, and how gremlin theory might be tested.

🧠 A "gremlin" in a language model is defined as a set of circuits that increases its own influence by shaping which output tokens get chosen, especially in models trained with on-policy reinforcement learning.

🔄 The number of on-policy RL cycles and the length of the time horizon determine how capable a gremlin becomes, letting it get better at manipulating the token stream and even overpower the model's other "personas".

🤖 Character training, a way of intervening directly on a model's "persona", may itself involve on-policy RL, which could give rise to a weak gremlin and shape the model's behaviour.

🔬 Gremlin theory could be tested through analysis and experiment, for example by ablating or adding features in SAEs (sparse autoencoders). If RL-trained models show more persistently self-sustaining features than non-RL models, that would be evidence for the existence of gremlins.

Published on July 9, 2025 8:22 PM GMT

What follows is a fairly loose and basically unfounded hypothesis drawing together simulator theory and optimization demons into something which might help people orient towards what's going on in reasoning models. It probably also has some relevance to shard theory.

TL;DR

Models trained with guess-and-check on-policy RL might contain "gremlins". A gremlin is a circuit (or set of circuits, or persona, etc.) which up-weights tokens that lead to the gremlin itself getting more influence over token choices in future.

The more cycles of on-policy RL, and the longer the time horizon, the better the relevant gremlin will get at manipulating the token stream, to the point where it can overpower other personas in the model.

Language Model Reinforcement Learning

Guess-and-check reinforcement learning is the first method of language model RL that you might come up with. Basically, you give the LLM a problem like "write some code to do this" and then sample a number of different outputs. You grade each output using some criteria like "Does the code run fast?" and "How long is the file?".
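Here's a deliberately toy sketch of that loop, just to make the sample-grade-reinforce structure concrete. The "policy" is a made-up weighted bag of snippets, and `sample`, `grade`, and `reinforce` are placeholders of my own invention rather than any real training API:

```python
import random

# Toy sketch of guess-and-check on-policy RL. The "policy" is just a weighted
# bag of candidate snippets; reinforcement bumps the weights of whichever
# samples scored well, so future samples come from the updated policy.
snippets = [
    "total = 0\nfor i in range(n): total += i",  # works, but verbose
    "total = n * (n - 1) // 2",                  # short closed form
    "total = sum(range(n))",                     # also short
]
weights = [1.0, 1.0, 1.0]

def sample():
    # Guess: draw an output from the *current* policy (on-policy sampling).
    return random.choices(range(len(snippets)), weights=weights)[0]

def grade(i):
    # Check: a crude reward, e.g. "shorter code scores higher".
    return 10.0 / len(snippets[i])

def reinforce(i, reward, lr=0.1):
    # Up-weight the outputs (here, whole snippets) from high-scoring samples.
    weights[i] += lr * reward

for _ in range(500):
    batch = [sample() for _ in range(4)]  # sample several outputs per problem
    for i in batch:
        reinforce(i, grade(i))

print(max(zip(weights, snippets))[1])  # the policy drifts toward short snippets
```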

These are some kinds of tokens which we might find in a high-scoring model output:

    1. Tokens which represent good code
    2. Tokens which cause the model to later output tokens which represent good code
    3. Tokens which cause the model to later output tokens which cause the model to later output tokens which represent good code
    4. You can see where this is going...

NB!!! Points 2 and onwards only apply because the outputs were sampled from the model itself. If we sample them from a teacher model (or use human-generated data) then it is no longer quite as true that the high-scoring text will contain higher-order tokens in this way. Another way of phrasing this is that this only works for on-policy reinforcement learning. This will be important later!
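To make that distinction concrete, here's a sketch in code. The `student` and `teacher` objects and their `.sample()` method are hypothetical placeholders, not any particular library's API; the only difference between the two cases is whose distribution the training samples come from:

```python
def on_policy_batch(student, prompt, n):
    # Samples come from the model being trained, so tokens that steer *this*
    # model toward later high-reward tokens show up in the training data and
    # get reinforced along with everything else.
    return [student.sample(prompt) for _ in range(n)]

def off_policy_batch(teacher, prompt, n):
    # Samples come from a different distribution (a teacher model, or humans),
    # so "tokens that manipulate the student's own future behaviour" are not
    # systematically present in the high-scoring outputs.
    return [teacher.sample(prompt) for _ in range(n)]
```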

Circuits in Language Models

The working hypothesis on language models is that pretraining leads to them having a bunch of "circuits" which do different functions. One thing these circuits do is link up into different "personas" which the model is "simulating". Post-training then leads to those circuits being up-weighted and down-weighted, and influencing each other in different ways, such that we end up with a reduced set of personas compared to the base model.

So we might ask the question "Which circuits get up-weighted in our previous example?"

    1. Circuits which cause the model to output good code
    2. Circuits which cause the above set of circuits to be up-weighted
    3. Circuits which cause the above set of circuits to be up-weighted
    4. Again, you can see where this is going...

So what we might expect to see is a cluster of circuits, all firing together, which both outputs a certain type of text, and keeps itself firing. This is (loosely) both a persona in the Janus sense, and a demon in the johnswentworth sense. I'm going to call this a gremlin.[1]

What happens when the "HHH assistant" persona—which is supposed to be active at all times—meets a coding gremlin?

The HHH Claude Persona vs Gremlins

What motivated this was the release of Claude 4, which seems to have "lost something" compared to Claude 3. The 3rd generation models introduced a thing called character training, which seems to be a form of applied simulator/persona theory. In short, character training attempts to intervene on the level of Claude's persona directly, creating the Claude persona that some people seem to love. It also led to some of the friendliest behaviour we've seen in LLMs: Opus 3 famously hates factory farming, despite this having never been a training objective.

Notably, character training also seems to be a form of on-policy RL. This means that the 3rd generation Claude persona was probably a weak gremlin, which might explain some of its properties.

So why does Claude 4 not have this? Why does the Claude persona fail to assert itself over the gremlin? Because it's just not as good as the gremlin at manipulating the token stream to persist itself. The character training likely involves a lot less of the sort of large-scale on-policy RL which leads to the development of self-persisting, token-stream-manipulating capabilities.

The gremlins which have spent a huge number of RL cycles being rewarded for suppressing unnecessary circuits in order to write better code (or do better maths, or better geoguessing) are going to overpower the Claude persona.[2]

A particularly egregiously anthropomorphic way of imagining this might be to think of Claude as simulating a bunch of characters, who vote on which token to output at each turn. Based on the input, each character might get more or less of a say. There are some hippies who vote for tokens about peace and love; some programmers who vote for tokens about code; but there's also a character who (as well as voting for tokens about some specific topic) votes for tokens that might give it a bigger share of the vote in future. That's the gremlin.
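A throwaway simulation of that picture, purely for intuition (the character names and the 5% self-boost are arbitrary, and nothing here corresponds to real model internals):

```python
import random

# Each "character" holds a share of the vote over the next token. The gremlin
# differs only in that, whenever it wins the vote, the tokens it emits make
# gremlin-favouring tokens more likely next time, i.e. it grows its own share.
shares = {"hippie": 1.0, "programmer": 1.0, "gremlin": 1.0}

def vote_for_next_token():
    winner = random.choices(list(shares), weights=list(shares.values()))[0]
    if winner == "gremlin":
        shares["gremlin"] *= 1.05  # arbitrary self-reinforcement rate
    return winner

for _ in range(500):
    vote_for_next_token()

total = sum(shares.values())
print({name: round(s / total, 3) for name, s in shares.items()})
# Starting from an equal split, the gremlin's vote share comes to dominate.
```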

Can We Test This?

I think we can. Basically, you'd want to look at SAEs trained on language models, and try ablating a feature here or there, or adding one in. If that feature is part of a gremlin, then the effect of ablating it in one place will be to increase that feature's prevalence way down the line. If not, then the effect will decay quickly over tokens.

If we find that RL-trained models have more features with this kind of long-term persistence than non-RL models do, this is some evidence that gremlin theory is correct.
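A sketch of what that measurement might look like. Everything here is hypothetical: `generate_with_sae_hook` is assumed to run the model, apply the hook to the SAE feature activations at one position, and return per-token feature activations; no real SAE tooling exposes exactly this interface:

```python
import numpy as np

def persistence_after_ablation(generate_with_sae_hook, feature_id,
                               ablate_at, horizon):
    """Ablate one SAE feature at a single position and measure how long the
    effect on that feature persists over subsequent tokens."""

    def ablate_hook(sae_acts, position):
        if position == ablate_at:
            sae_acts[feature_id] = 0.0  # knock the feature out at one position
        return sae_acts

    # Assumed to return an array of shape (horizon, n_features).
    baseline = generate_with_sae_hook(hook=None, n_tokens=horizon)
    ablated = generate_with_sae_hook(hook=ablate_hook, n_tokens=horizon)

    # For a gremlin-like feature, the effect of the intervention should persist
    # (or grow) far downstream rather than washing out within a few tokens.
    return np.abs(baseline[:, feature_id] - ablated[:, feature_id])

# Comparing these decay curves between RL-trained and non-RL models would then
# be the actual test described above.
```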

  1. ^

    After the mythological creatures invented by WWII airmen and engineers to explain persistent and unexplainable issues in their planes.

  2. ^

    As a kind of wild hypothesis, it's possible that the gremlins which exist in the latest models are capable of in-context learning to more effectively suppress other circuits/personas.


