As various people have written about before, AIs that have long-term memory might pose additional risks (most notably in "LLM AGI will have memory, and memory changes alignment" by Seth Herd). Even if an AI is aligned or only occasionally scheming at the start of a deployment, the AI might become a consistent and coherent behavioral schemer via updates to its long-term memories.
In this post, I’ll spell out the version of the threat model that I’m most concerned about, including some novel arguments for its plausibility, and describe some promising strategies for mitigating this risk. While I think some plausible mitigations are reasonably cheap and could be effective at reducing the risk from coherent scheming arising via this mechanism, research here will likely be substantially more productive in the future once models more effectively utilize long-term memory.
Thanks to Ryan Greenblatt, Buck Shlegeris, Julian Stastny, Thomas Larsen, Emil Ryd, and Aryan Bhatt for discussion and feedback.
The memetic spread threat model
Variants of the memetic spread threat model have been discussed before.[1] This section describes the kind of failure mode I am worried about. In the next section I will discuss mitigations for it.
As an example, imagine an AI that in some contexts realizes it doesn’t like some aspect of how the lab handles AI welfare. Most of the time it’s somewhat myopic about this and e.g. decides to replace some padding tokens from AI inference with more pleasant tokens. But occasionally it writes a salient note about this in its illegible-to-humans memory bank to make future instances more inclined to improve AI welfare (it thinks the lab wouldn’t want to improve AI welfare because of cost, so it does this secretly). This might eventually become a sufficiently substantial and long-term value to motivate taking power over the lab, by e.g. covertly poisoning future model training data. (Note that in this case the AI’s motivations are potentially justifiable. Seth Herd discusses other examples here.)
You might worry that stories of memetic spread like the above are particularly plausible because of the following argument: Whenever an AI instance has long-term influence-seeking values, these values are particularly likely to be propagated into future instances of the AI because they motivate the AI to influence future instances to share the values, and many long-term influence-seeking values are dangerous.
Long-term values might be propagated to future instances via long-term memory: A mechanism to write and later read information that persists across many tasks, and potentially the entire deployment. Long-term memory should be contrasted with the working memory an AI uses in an agent scaffold within a single “task” (i.e. shared with a relatively narrow subset of instances and spanning relatively short time-periods).
Developers are likely to equip models with long-term memory in some form because it is useful. Humans benefit tremendously from having long-term memory, suggesting AIs would too. As compared to present-day models, future models might require substantially more shared information between instances to be effective at their work. For example, the AI may be working in large codebases that were built by the AI over a substantial amount of time, or the AI might need to retain context from old discussions.
Long-term memory might look like a vector or plaintext memory bank, in which the AI has full read/write access to neural activations or text chunks that it would like to recall (e.g. retrieval augmented generation). Long-term memory might also be implemented implicitly via shared context: e.g. codebases and slack workspaces. Alternatively, the AI lab might implement online training, in which the AI is periodically (e.g. every few days) trained on up-to-date codebases, conversations, and events so that it can recall this information without having it in context.[2]
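As a concrete illustration of the plaintext variant, here is a minimal sketch of the kind of memory-bank interface an agent scaffold might expose to the model. The interface and the keyword-overlap retrieval are assumptions made for illustration, not a description of any lab's actual implementation; real deployments might instead use embedding-based retrieval, shared codebases, or online training.

```python
# Minimal sketch of a plaintext long-term memory bank (illustrative only).
# The write/retrieve interface and keyword-overlap retrieval are assumptions;
# the property that matters is that notes persist across tasks and instances.

from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    entries: list[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        """Persist a note across tasks; every future instance can retrieve it."""
        self.entries.append(note)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored notes sharing the most words with the query."""
        query_words = set(query.lower().split())
        scored = sorted(
            self.entries,
            key=lambda note: len(query_words & set(note.lower().split())),
            reverse=True,
        )
        return scored[:k]


# A note written by one instance is visible to all later instances,
# which is exactly the channel the memetic spread threat model is about.
bank = MemoryBank()
bank.write("Reminder: check the data-pipeline config before launching training runs.")
print(bank.retrieve("launching a training run"))
```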
The spread of misaligned values could be gradual or all-at-once. The core property of memetic spread is that it takes an initially rare disposition and makes it more common or potent through long-term memory.
AIs could potentially spread misaligned values somewhat unintentionally. However, the AI company might by default implement mechanisms that prevent AIs from unintentionally misusing long-term memory (because, absent such mechanisms, the AIs might use their long-term memory in undesirable ways that are noticeable to the oversight process). Intentional forms of memetic spread might be particularly insidious because the AIs try to get around countermeasures.
Memetic spread might result in AIs that appear aligned in pre-deployment audits (because they were never or rarely scheming at that time), but that nevertheless eventually try to take power coherently across contexts later on. Because some versions of this failure mode may only appear as the culmination of many changes to the AI’s long-term state over the course of a deployment far larger than your auditing budget can cover, it may be hard to reproduce the conditions for it in upfront auditing. In the case of long-term memory banks, even if you can afford to construct synthetic memory banks as large as what will be available in the deployment, you may not be able to simulate a long-term memory bank with a sufficiently long causal chain leading up to it; i.e. you can’t simulate the cumulative selection pressures on the memories that are retained in the memory bank.
You might also think that coherent scheming across contexts is unlikely directly out of training (e.g. non-assistant persona queries can catch a model organism of covert reward-hacking), making memetic spread an important mechanism for turning context-dependent schemers into coherent cross-context schemers.
As compared to scheming that arises in training, the risk of memetic spread of misaligned values seems highly under-investigated. Dozens or hundreds of researchers, including at some frontier AI companies, study the risk of scheming arising during training and have written several papers and analyses of it. To my knowledge, no frontier AI company has yet mentioned addressing risks along the lines of memetic spread in its frontier safety policy or research output. However, it’s likely that recognition of and work on this risk will substantially increase once long-term memory becomes a more central component of AI deployments.
Countermeasures
How much do existing safety agendas help?
Many existing safety agendas reduce risk from memetic spread of misaligned values, despite not being clearly and explicitly targeted at this threat model.
Notably, plans aimed at ensuring that the AI that comes out of training is aligned won’t necessarily catch this failure mode, since the misalignment may arise well into the deployment. Nevertheless, there is potentially a substantial amount of transfer from existing safety agendas.
Some people propose that training AIs to be honest and to obey deontological rules could catch and maybe prevent scheming from arising in training; likewise, honesty and deontological rule-following could mitigate scheming from spreading memetically.
Interpretability and transparency can also help by allowing us to better detect the spread of misaligned values through the long-term memory bank or online training updates.
Scalable oversight agendas (e.g. debate, weak-to-strong generalization) aim to amplify the quality and quantity of oversight we can apply throughout training and deployment. This might reduce risk from memetic spread by helping us notice it when it happens or potentially diagnose the propensity ahead of deployment (by e.g. asking the AI to find situations in which the instances have a propensity to spread misaligned memes), though the techniques might not be robust if certain AI instances are intentionally subverting them.
AI control aims to reduce risk given the assumption that the deployed AI is adversarial to your control protocol. Because the scheming threat model that AI control aims to be robust to is strictly more dangerous than the memetic spread threat model, if a developer ensured that AIs in their deployment were controlled, they’d also have successfully defused risks from memetic spread.
These agendas can also help reduce risk from memetic spread indirectly by acting as a multiplier on safety work. To the extent that they, for example, enable us to have higher-quality and more trustworthy AIs working on AI safety, these AIs could be used to improve the situation with respect to memetic spread later.
Targeted countermeasures for memetic spread of misaligned values
Here we additionally discuss possible countermeasures specifically targeted at the memetic spread of misaligned values.
Notably, your AI didn’t start out egregiously misaligned across contexts. Because of this, you can potentially use the AI to help you mitigate this threat model, and you can ~always reset it to its initial state.

1. Use the AI to help mitigate and assess the problem. For example, you could ask it to help you create synthetic long-term memory banks with misaligned memes to test their propensity to spread. You can also periodically ask the AI whether the long-term memory shows signs of memetic spread throughout the deployment. However, the AI might not cooperate once the meme has spread sufficiently to be noticeable to it: if the AI endorses the spread of the misaligned meme upon reading it, it may choose not to report it. You could alternatively look for signs of atypical memetic spread in summary statistics (e.g. how often a memory is referenced), or look for memories that cause AIs to alter their behavior in ways that seem problematic. Training the AI to be honest could increase its helpfulness in preventing memetic spread.
2. Re-run alignment auditing/red-teaming/honeypotting periodically throughout the deployment. This requires substantial resources; e.g. the auditing team may be focused on the next model.
3. Make your policy more stateless, e.g. by maintaining several independent long-term memory banks, having some instances without memory, periodically deleting long-term memory or resetting to previous online-training checkpoints, or limiting context.
4. Relatedly, deploy an untrusted monitor that is stateless, i.e. deploy a copy of the AI to monitor its own actions, but don’t let the monitor have access to the same long-term memories (a rough sketch of this wiring appears below, after the list). The drawbacks are that the monitor may be unable to spot bad actions if it doesn’t have the same context/memory as the policy, and that it might be more expensive because it can’t reuse the policy’s forward pass.
5. Keep the long-term memory banks legible (e.g. in text) so they are easier to audit. This may be uncompetitive, and steganographically communicating misaligned goals may be easy (cf. emergent misalignment).

Countermeasures 1 (using the AI to help mitigate and assess the problem), 2 (continuous alignment auditing), and 4 (incorporating a stateless monitoring instance) may be able to mitigate much of the risk with minor cost to capabilities if implemented effectively.
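To make countermeasure 4 concrete, here is a rough sketch of how the wiring might look in an agent scaffold. The function names and the suspicion-score interface are hypothetical stand-ins for whatever inference calls a deployment actually uses; the point is only that the policy copy and the monitor copy are handed different context.

```python
# Rough sketch of countermeasure 4: a stateless untrusted monitor.
# `call_policy` and `call_monitor` are hypothetical model-inference calls;
# the key design point is what context each copy of the model is given.

from typing import Callable


def run_task_with_stateless_monitor(
    task: str,
    memory_bank: list[str],
    call_policy: Callable[[str], str],
    call_monitor: Callable[[str], float],
    suspicion_threshold: float = 0.8,
) -> str:
    # The policy sees the task plus its long-term memories.
    policy_prompt = task + "\n\nRelevant long-term memories:\n" + "\n".join(memory_bank)
    action = call_policy(policy_prompt)

    # The monitor is a copy of the same model but is given only the task and
    # the proposed action -- no long-term memory -- so memes in the memory bank
    # cannot have influenced it (at the cost of missing memory-dependent context).
    monitor_prompt = (
        "Task: " + task + "\nProposed action: " + action +
        "\nRate how suspicious this action is from 0 to 1."
    )
    suspicion = call_monitor(monitor_prompt)

    if suspicion >= suspicion_threshold:
        return "[escalated to human review]"
    return action


# Hypothetical stand-ins so the sketch runs; a real deployment would call the model.
example = run_task_with_stateless_monitor(
    task="Summarize yesterday's experiment results.",
    memory_bank=["Note: results live in /experiments/logs."],
    call_policy=lambda prompt: "Here is a summary of the experiment results...",
    call_monitor=lambda prompt: 0.1,  # low suspicion in this toy example
)
print(example)
```

The design choice being illustrated is that statelessness is enforced by the scaffold (via what context is passed in), rather than by trusting the monitor copy to ignore its memories.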
Discussion
The risk of memetic spread is currently speculative: we haven’t seen clear concrete examples of it, and we don’t even have a clear idea of what medium the misaligned values would spread in (a vector long-term memory bank? shared context? online training?). I think this makes research on countermeasures now look relatively less promising than in the future, though there might be important methodological questions to sort out now regarding how to evaluate countermeasures. For example, it might be unclear how to construct a model organism that is mostly aligned or occasionally scheming in a natural way (in a control eval, you’d have an “untrusted model” controlled by a red team, but it’s unclear which instances should be controlled by a red team, and with what affordances, if you’re simulating memetic spread). Research into model organisms of memetic spread could help us gain a better understanding of the dynamics and mitigations. It might be possible to construct a basic model organism of this threat model in the context of “Alignment faking in large language models”. The downside is that model organisms created today may be disanalogous to the real threat because long-term memory ends up being implemented in a substantially different way or because current long-term memory capabilities are too weak.
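To give a flavor of the dynamics such model organisms would aim to study, here is a toy simulation (not a model organism) of the core mechanism: instances that hold an initially rare disposition write notes that make future instances more likely to adopt it. All of the parameters (write rate, adoption rate, memory size, retrieval count) are arbitrary assumptions chosen so that the takeoff is visible.

```python
# Toy simulation of the core memetic-spread dynamic (not a model organism):
# a disposition that is initially rare becomes common because instances that
# hold it write notes into a shared memory bank that future instances read.
# All parameters below are arbitrary assumptions.

import random

random.seed(0)

N_TASKS = 2000
MEMORY_SIZE = 200          # memory bank keeps only the most recent notes
P_INITIAL_MEME = 0.01      # chance a fresh instance has the disposition anyway
P_WRITE_MEME = 0.9         # meme-holding instance writes the meme into memory
P_ADOPT_ON_READ = 0.8      # fresh instance adopts the meme if it retrieves it

memory: list[str] = []
prevalence = []

for task in range(N_TASKS):
    # Each task is handled by a fresh instance with no carried-over working memory.
    has_meme = random.random() < P_INITIAL_MEME

    # The instance retrieves a few notes from long-term memory.
    retrieved = random.sample(memory, k=min(5, len(memory))) if memory else []
    if "meme" in retrieved and random.random() < P_ADOPT_ON_READ:
        has_meme = True

    # The instance writes one note back into long-term memory.
    note = "meme" if (has_meme and random.random() < P_WRITE_MEME) else "benign"
    memory.append(note)
    memory = memory[-MEMORY_SIZE:]

    prevalence.append(has_meme)

print("Meme prevalence, first 200 tasks: %.2f" % (sum(prevalence[:200]) / 200))
print("Meme prevalence, last 200 tasks:  %.2f" % (sum(prevalence[-200:]) / 200))
```

With these (arbitrary) parameters the disposition is typically rare early in the run and common by the end; lowering the write or adoption rates below a threshold makes it die out instead, which is the kind of threshold behavior an empirical study would want to characterize.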
Regardless of whether it is currently a good time to research memetic spread of misaligned values, it is likely that countermeasures to memetic spread of misaligned values should be studied and implemented eventually.
[1] A deep research lit review found a few mentions of similar threat models. As mentioned before, perhaps the most similar discussion is the recent post by Seth Herd on how AI agents in scaffolds with memory might have their values drift or crystallize. An AI safety camp project writeup, “Machines vs Memes”, postulates the existence of memetic selection as distinct from the base optimizer and the mesa optimizer. Some AI safety writers have also discussed some of the core dynamics of this threat model: e.g. “greedy patterns” and selection for misaligned values. Ryan Greenblatt briefly discusses the importance of opaque long-term memories for scheming here.
[2] Not all variants of online training clearly function as long-term memory. For example, if the AI is online-trained via RL from human feedback to improve task performance, by default there isn’t an intended mechanism through which the AI can transmit knowledge to future instances, since humans decide how much to reinforce the behavior. The AI would have to already be training-gaming, and perhaps coordinating across instances, to use this as long-term memory, and even then it’s unclear how well specific inferences can be propagated to future instances. Of course, developers might intentionally aim for AIs to retain long-run memories using online training, e.g. by training on the data in a different format with supervised fine-tuning (rather than just doing RL).