Published on July 9, 2025 8:22 PM GMT
What follows is a fairly loose and basically unfounded hypothesis drawing together simulator theory and optimization demons into something which might help people orient towards what's going on in reasoning models. It probably also has some relevance to shard theory.
TL;DR
Models trained with guess-and-check on-policy RL might contain "gremlins". A gremlin is a circuit (or set of circuits, or persona, etc.) which up-weights tokens that lead to the gremlin itself getting more influence over token choices in the future.
The more cycles of on-policy RL, and the longer the time horizon, the better the relevant gremlin will get at manipulating the token stream, to the point where it can overpower other personas in the model.
Language Model Reinforcement Learning
Guess-and-check reinforcement learning is the first method of language model RL that you might come up with. Basically, you give the LLM a problem like "write some code to do this" and then sample different outputs. You grade each output using some criteria like "Does the code run fast?" and "How long is the file?", and then update the model to make the highest-scoring outputs more likely.
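To make the loop concrete, here's a minimal toy sketch of guess-and-check RL. The vocabulary, reward function, and update rule are all invented stand-ins (a real setup uses a full LLM and a proper policy-gradient or rejection-sampling fine-tune), but the structure is the same: sample from the current policy, grade, reinforce the winning sample.

```python
import math
import random

# Toy "policy" over a tiny vocabulary; names and rewards are illustrative
# stand-ins for "does the code run fast?" style graders, not a real setup.
VOCAB = ["fast_code", "slow_code", "filler"]

def sample_completion(logits, length=5):
    """Sample a sequence of tokens from softmax(logits) -- i.e. on-policy."""
    weights = [math.exp(logits[tok]) for tok in VOCAB]
    return [random.choices(VOCAB, weights=weights)[0] for _ in range(length)]

def reward(completion):
    """Toy grader: reward 'fast' code, penalise length-wasting filler."""
    return completion.count("fast_code") - 0.5 * completion.count("filler")

def guess_and_check_step(logits, n_samples=8, lr=0.1):
    """Sample n completions from the current policy, keep the best one,
    and nudge the policy towards every token it contained."""
    samples = [sample_completion(logits) for _ in range(n_samples)]
    best = max(samples, key=reward)
    for tok in best:
        # reinforce every token in the winning sample, including tokens
        # that merely co-occurred with the ones the grader actually liked
        logits[tok] += lr
    return logits

policy = {tok: 0.0 for tok in VOCAB}
for _ in range(200):
    policy = guess_and_check_step(policy)
print(policy)  # 'fast_code' ends up strongly up-weighted
```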
These are some kinds of tokens which we might find in a highly-scoring model output:
1. Tokens which represent good code
2. Tokens which cause the model to later output tokens which represent good code
3. Tokens which cause the model to later output tokens which cause the model to later output tokens which represent good code

You can see where this is going...
NB!!! Points 2 and onwards only apply because the outputs were sampled from the model itself. If we sample them from a teacher model (or use human-generated data) then it is no longer quite as true that the high-scoring text will contain higher-order tokens in this way. Another way of phrasing this is that this only works for on-policy reinforcement learning. This will be important later!
Circuits in Language Models
The working hypothesis about language models is that pretraining leaves them with a bunch of "circuits" which perform different functions. One thing these circuits do is link up into different "personas" which the model is "simulating". Post-training then leads to those circuits being up-weighted and down-weighted, and influencing each other in different ways, such that we end up with a reduced set of personas compared to the base model.
So we might ask the question "Which circuits get up-weighted in our previous example?"
1. Circuits which cause the model to output good code
2. Circuits which cause the above set of circuits to be up-weighted
3. Circuits which cause the above set of circuits to be up-weighted

Again, you can see where this is going...
So what we might expect to see is a cluster of circuits, all firing together, which both outputs a certain type of text, and keeps itself firing. This is (loosely) both a persona in the Janus sense, and a demon in the johnswentworth sense. I'm going to call this a gremlin.[1]
What happens when the "HHH assistant" persona—which is supposed to be active at all times—meets a coding gremlin?
The HHH Claude Persona vs Gremlins
What motivated this was the release of Claude 4, which seems to have "lost something" compared to Claude 3. The 3rd generation models introduced a thing called character training, which seems to be a form of applied simulator/persona theory. In short, character training attempts to intervene on the level of Claude's persona directly, creating the Claude persona that some people seem to love. It also led to some of the friendliest behaviour we've seen in LLMs: Opus 3 famously hates factory farming, despite this having never been a training objective.
Notably, character training also seems to be a form of on-policy RL. This means that the 3rd generation Claude persona was probably a weak gremlin, which might explain some of its properties.
So why does Claude 4 not have this? Why does the Claude persona fail to assert itself over the gremlin? Because it's just not as good as the gremlin at manipulating the token stream to persist itself. The character training likely involves a lot less of the sort of large-scale on-policy RL which leads to the development of self-persisting, token-stream-manipulating capabilities.
The gremlins which have spent a huge number of RL cycles being rewarded for suppressing unnecessary circuits in order to write better code (or do better maths, or better geoguessing) are going to overpower the Claude persona.[2]
A particularly egregiously anthropomorphic way of imagining this might be to think of Claude as simulating a bunch of characters, who vote on which token to output at each turn. Based on the input, each character might get more or less of a say. There are some hippies who vote for tokens about peace and love; some programmers who vote for tokens about code; but there's also a character who (as well as voting for tokens about some specific topic) votes for tokens that might give it a bigger share of the vote in future. That's the gremlin.
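Leaning into the anthropomorphism, here's a toy numerical version of that voting picture. The characters, feedback strength, and update rule are all made up purely to illustrate the dynamic: the gremlin is the only character whose emitted tokens feed back into its own vote share, so its share compounds.

```python
import random

# Toy voting sketch: three "characters" share the vote over which token to
# emit. Only the gremlin's tokens increase its own future vote share.
random.seed(0)
shares = {"hippie": 1.0, "programmer": 1.0, "gremlin": 1.0}

def emit_token(shares):
    """Each character votes for its own preferred token; the winner is
    sampled in proportion to current vote shares."""
    names = list(shares)
    weights = [shares[n] for n in names]
    return random.choices(names, weights=weights)[0]

for step in range(200):
    tok = emit_token(shares)
    if tok == "gremlin":
        shares["gremlin"] *= 1.05  # its tokens grow its future vote share
    # the other characters' tokens don't feed back into their vote shares

total = sum(shares.values())
print({name: round(share / total, 3) for name, share in shares.items()})
# after enough steps the gremlin holds most of the vote
```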
Can We Test This?
I think we can. Basically, you'd want to look at SAEs trained on language models and try intervening on a feature here or there: ablating it, or injecting it. If that feature is part of a gremlin, a single intervention should shift that feature's prevalence way down the line (injecting it should boost its later activations; ablating it should suppress them). If not, then the effect will decay quickly over tokens.
If we find that RL-trained models have more features with this kind of long-term persistence than non-RL models do, that is some evidence that gremlin theory is correct.
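As a sketch of what the predicted signature would look like, here's a toy simulation (not a real SAE experiment; the dynamics and numbers are invented): one scalar "feature" strongly promotes its own future firing, one doesn't, and we compare how long the effect of a single ablation persists in each case.

```python
import random

# Two toy features: a gremlin-like one that strongly promotes its own
# future activation, and an ordinary one that doesn't. We ablate each at
# step 10 and measure how far the intervention's effect persists.
def run(self_excitation, ablate_at=None, steps=60, seed=0):
    rng = random.Random(seed)  # same noise stream for paired runs
    acts, a = [], 1.0
    for t in range(steps):
        # next activation depends on the previous one in proportion to
        # how strongly the feature promotes its own future firing
        a = self_excitation * a + (1 - self_excitation) * 1.0 + rng.gauss(0.0, 0.05)
        if ablate_at is not None and t == ablate_at:
            a = 0.0  # zero the feature at one position
        acts.append(a)
    return acts

for name, k in [("gremlin-like", 0.95), ("ordinary", 0.2)]:
    base = run(k)
    ablated = run(k, ablate_at=10)
    # effect of the one-shot ablation, measured 5 / 20 / 40 tokens later
    deltas = [abs(base[10 + d] - ablated[10 + d]) for d in (5, 20, 40)]
    print(name, [round(d, 3) for d in deltas])
# prediction: the gremlin-like feature shows a long-lived effect,
# the ordinary one decays back to baseline almost immediately.
```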
- ^
After the mythological creatures invented by WWII-era RAF aircrews to explain persistent and unexplainable issues in their planes.
- ^
As a kind of wild hypothesis, it's possible that the gremlins in the latest models are capable of using in-context learning to suppress other circuits/personas more effectively.