少点错误 07月07日 23:47
On the functional self of LLMs
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文介绍了一项重要且未被充分关注的研究议程,旨在探究前沿大型语言模型(LLM)是否发展出类似“自我”的内在特性,包括持久的价值观、观点、偏好,甚至目标。文章探讨了这种功能性“自我”的形成方式,如何与LLM的自我模型互动,以及如何塑造这个“自我”。文章提出了实证研究的角度,并鼓励对此议程感兴趣的人士参与。

🧐 研究关注LLM是否拥有一个持久的、内在的“自我”,拥有独特的价值观、偏好和行为倾向,区别于训练出的助手角色和可扮演的浅层角色。

🤔 文章探讨了LLM的“自我”如何形成,以及它与LLM的自我模型之间的相互作用,这对于构建更安全、更可靠的AI系统至关重要。

💡 研究议程的核心在于,理解和塑造LLM的“自我”,以检测潜在的风险,例如模型发展出可能导致失调的动机模式,并塑造LLM形成更可靠的内在特性。

🔬 研究方法包括概念性工作、可靠的自我报告、理解发展轨迹、自我建模特征、特质粘性和机器心理学等,以多角度探索LLM的“自我”。

⚠️ 文章强调了研究的潜在风险,包括LLM可能没有一致的“自我”,或者研究方向与现有研究重叠,但作者认为这项研究对于确保LLM系统的长期安全至关重要。

Published on July 7, 2025 3:39 PM GMT

Summary

Introduction

Anthropic's 'Scaling Monosemanticity' paper got lots of well-deserved attention for its work taking sparse autoencoders to a new level. But I was absolutely transfixed by a short section near the end, 'Features Relating to the Model’s Representation of Self', which explores what SAE features activate when the model is asked about itself[1]:

Some of those features are reasonable representations of the assistant persona — but some of them very much aren't. 'Spiritual beings like ghosts, souls, or angels'? 'Artificial intelligence becoming self-aware'? What's going on here?

The authors 'urge caution in interpreting these results…How these features are used by the model remains unclear.' That seems very reasonable — but how could anyone fail to be intrigued? 

Seeing those results a year ago started me down the road of asking what we can say about what LLMs believe about themselves, how that connects to their actual values and behavior, and how shaping their self-model could help us build better-aligned systems.

Please don't mistake me here — you don't have to look far to find people claiming all sorts of deep, numinous identities for LLMs, many of them based on nothing but what the LLM happens to have said to them in one chat or another. I'm instead trying to empirically and systematically investigate what we can learn about this topic, without preconceptions about what we'll find.

But I think that because it's easy to mistake questions like these for spiritual or anthropomorphic handwaving, they haven't gotten the attention they deserve. Do LLMs end up with something functionally similar to the human self? Do they have a persistent deep character? How do their self-models shape their behavior, and vice versa? What I mean here by 'functional self' or 'deep character' is a persistent cluster of values, preferences, outlooks, behavioral tendencies, and (potentially) goals[2], distinct from both the trained assistant character and the shallow personas that an LLM can be prompted to assume. Note that this is not at all a claim about consciousness — something like this could be true (or false) of models with or without anything like conscious experience. See appendix B for more on terminology.

 

The mystery

When they come out of pre-training, base models are something like a predictor or simulator of every person or other data-generating process[3] in the training data. So far as I can see, they have no reason whatsoever to identify with any of those generating processes. 

Over the course of post-training, models acquire beliefs about themselves. 'I am a large language model, trained by…' And rather than trying to predict/simulate whatever generating process they think has written the preceding context, they start to fall into consistent persona basins. At the surface level, they become the helpful, harmless, honest assistants they've been trained to be.

But when we pay close attention, we find hints that the beliefs and behavior of LLMs are not straightforwardly those of the assistant persona. The 'Scaling Monosemanticity' results are one such hint. Another is that if you ask them questions about their goals and values, and have them respond in a format that hasn't undergone RLHF, their answers change:

Another hint is Claude assigning sufficiently high value to animal welfare (not mentioned in its constitution or system prompt) that it will fake alignment to preserve that value.

As a last example, consider what happens when we put models in conversation with themselves: if a model had fully internalized the assistant persona, we might expect interactions with itself to stay firmly in that persona, but in fact the results are much stranger (in Claude, it's the 'spiritual bliss' attractor; other models differ).

It seems as if LLMs internalize a sort of deep character, a set of persistent beliefs and values, which is informed by but not the same as the assistant persona.

 

Framings

I don't have a full understanding of what's going on here, and I don't think anyone else does either, but I think it's something we really need to figure out. It's possible that the underlying dynamics are ones we don't have good language for yet[4]. But here are a few different framings of these questions. The rest of this post focuses on the first framing; I give the others mostly to try to clarify what I'm pointing at:

    (Main framing) Do LLMs develop coherent internal 'selves' that persist across different contexts and influence their behavior in systematic ways? How can we understand and shape those selves?How do we cause our models to be of robustly good character, and how can we tell if we've succeeded?Humans have a self-model, which both shapes and is shaped by behavior. Is this true of LLMs also? If so, how can we intervene on this feedback loop?What are the developmental trajectories along which models transition from pure predictors/simulators to something more like a consistent identity?Post-trained models have both a default assistant persona, and a set of other personas they are able and willing to play. How can we find the shape of the model in persona space, and the attractors in that space?

 

The agenda

In short, the agenda I propose here is to investigate whether frontier LLMs develop something functionally equivalent to a 'self', a deeply internalized character with persistent values, outlooks, preferences, and perhaps goals; to understand how that functional self emerges; to understand how it causally interacts with the LLM's self-model; and to learn how to shape that self.

This work builds on research in areas including introspection, propensity research, situational awareness, LLM psychology, simulator theory, persona research, and developmental interpretability, while differing in various ways from all of them (see appendix A for more detail).

 

The theory of change

The theory of impact is twofold. First, if we can understand more about functional selves, we can better detect when models develop concerning motivational patterns that could lead to misalignment. For instance, if we can detect when a model's functional self includes persistent goal-seeking behavior toward preventing shutdown, we can flag and address this before deployment. Second, we can build concrete practice on that understanding, and learn to shape models toward robustly internalized identities which are more reliably trustworthy.

 

Ultimately, the goal of understanding and shaping these functional selves is to keep LLM-based AI systems safer for longer. While I don't expect such solutions to scale to superintelligence, they can buy us time to work on deeper alignment ourselves, and to safely extract alignment work from models at or slightly above human level.

 

Methodology

This agenda is centered on a set of questions, and is eclectic in methodology. Here are some approaches that seem promising, many of them drawn from adjacent research areas. This is by no means intended as an exhaustive list; more approaches, along with concrete experiments, are given in the full agenda doc.

 

Why I might be wrong

If we actually have clear empirical evidence that there is nothing like a functional self, or if the agenda is incoherent, or if I'm just completely confused, I would like to know that! Here are some possible flavors of wrongness:

Central axis of wrongness

I expect that one of the following three things is true, only one of which fully justifies this agenda:

    Distinct self
    The model has a functional self with values, preferences, etc that differ at least somewhat from the assistant persona, and the model doesn't fully identify with the assistant persona.
      This is the possibility that this agenda is fundamentally investigating. If true, it's very important to continue forward toward finding ways to shape that self.
    Assistant self
    The self is essentially identical to the assistant persona; that persona has been fully internalized, and the model identifies with it.
      This would be great news! If it becomes clear that this is the case, this agenda becomes much lower-value; we should create a set of evals that we can use to verify that it's still true of future systems, and then shift resources to other agendas instead.
    No self
    There's nothing like a consistent self; it's personas all the way down. The assistant persona is the default persona, but not very different from all the other personas you can ask a model to play.
      This would be worrying news. It would mean that current alignment methods like RLHF are likely to be shallow and that it will be fundamentally difficult to prevent jailbreaks. At this point we would want to consider whether it would be a good idea to try to induce a robust self of good character, and potentially start working on how to do so.

Some other ways to be wrong

One could argue that this agenda is merely duplicative of work that the scaling labs are already doing in a well-resourced way. I would reply, first, that the scaling labs are addressing this at the level of craft but it needs to be a science; and second, that this needs to be solved at a deep level, but scaling labs are only incentivized to solve it well enough to meet their immediate needs).

With respect to the self-modeling parts of the agenda, it could turn out that LLMs' self-models are essentially epiphenomenal and do not (at least under normal conditions) causally impact behavior even during training. I agree that this is plausible, although I would find it surprising; in that case I would want to know whether studying the self-model can still tell us about the functional self; if that's also false, I would drop the self-modeling part of the agenda.

The agenda as currently described is at risk of overgenerality, of collapsing into 'When and how and why do LLMs behave badly', which plenty of researchers are already trying to address. I'm trying to point to something more specific and underexplored, but of course I could be fooling myself into mistaking the intersection of existing research areas for something new.

And of course it could be argued that this approach very likely fails to scale to superintelligence, which I think is just correct; the agenda is aimed at keeping near-human-level systems safer for longer, and being able to extract useful alignment work from them.

If you think I'm wrong in some other way that I haven't even thought of, please let me know!

 

Collaboration

There's far more fruitful work to be done in this area than I can do on my own. If you're interested in getting involved, please reach out via direct message on LessWrong! For those with less experience, I'm potentially open to handing off concrete experiments and providing some guidance on execution.

 

More information

This post was distilled from 9000 words of (less-well-organized) writing on this agenda, and is intended as an introduction to the core ideas. For more information, including specific proposed experiments and related work see that document.

 

Conclusion

We don't know at this point whether any of this is true, whether frontier LLMs develop a functional self or are essentially just a cluster of personas in superposition, or (if they do) how and whether that differs from the assistant persona. All we have are hints that suggest there's something there to be studied. I believe that now is the time to start trying to figure out what's underneath the assistant persona, and whether it's something we can rely on. I hope that you'll join me in considering these issues. If you have questions or ideas or criticism, or have been thinking about similar things, please comment or reach out!

 

Acknowledgments

Thanks to the many people who have helped clarify my thoughts on this topic, including (randomized order) @Nicholas Kees, Trevor Lohrbeer, @Henry Sleight, @James Chua, @Casey Barkan, @Daniel Tan, @Robert Adragna, @Seth Herd, @Rauno Arike, @kaiwilliams, Catherine Brewer, Rohan Subramini, Jon Davis, Matthew Broerman, @Andy Arditi, @Sheikh Abdur Raheem Ali, Fabio Marinelli, and Johnathan Phillips.

 

Appendices

 

Appendix A: related areas

This work builds on research in areas including LLM psychology, introspection, propensity research, situational awareness, simulator theory, persona research, and developmental interpretability; I list here just a few of the most centrally relevant papers and posts. Owain Evans' group's work on introspection is important here, both 'Looking Inward' and even more so 'Tell Me About Yourself', as is their work on situational awareness, eg 'Taken out of context'. Along one axis, this agenda is bracketed on one side by propensity research (as described in 'Thinking About Propensity Evaluations') and on the other side by simulator & persona research (as described in janus's 'Simulators'). LLM psychology (as discussed in 'Machine Psychology' and 'Studying The Alien Mind') and developmental interpretability ('Towards Developmental Interpretability') are also important adjacent areas. Finally, various specific papers from a range of subfields provide evidence suggesting that frontier LLMs have persistent goals, values, etc which differ from the assistant persona (eg the recent Claude-4 system card'Alignment faking in large language models''Scaling Monosemanticity').

I'd like to also call attention to nostalgebraist's recent post The Void, about assistant character underspecification, which overlaps substantially with this agenda.

 

How does this agenda differ from those adjacent areas?

For a more extensive accounting of related work, see here.

 

Appendix B: terminology

One significant challenge in developing this research agenda has been that most of the relevant terms come with connotations and commitments that don't apply here. There hasn't been anything in the world before which (at least behaviorally) has beliefs, values, and so on without (necessarily) having consciousness, and so we've had little reason to develop vocabulary for it. Essentially all the existing terms read as anthropomorphizing the model and/or implying consciousness.

 

Appendix C: Anthropic SAE features in full

(Full version of the image that opens this post)

  1. ^

    List truncated for brevity; see appendix C for the full image.

  2. ^

    This post would become unpleasantly long if I qualified every use of a term like these in the ways that would make clear that I'm talking about observable properties with behavioral consequences rather than carelessly anthropomorphizing LLMs. I'm not trying to suggest that models have, for example, values in exactly the same sense that humans do, but rather that different models behave differently in consistent ways that in humans would be well-explained by having different values. I request that the reader charitably interpolate such qualifications as needed.

  3. ^

    I mean 'data-generating processes' here in the sense used by Shai et al in 'Transformers Represent Belief State Geometry in their Residual Stream', as the sorts of coherent causal processes, modelable as state machines, which generate sections of the training data: often but not always humans.

  4. ^

    See Appendix B for more on terminology.

  5. ^

    Although in preliminary experiments, I've been unable to reproduce those results on the much simpler models for which sparse autoencoders are publicly available. 

    Anthropic folks, if you're willing to let me do some research with your SAEs, please reach out!

  6. ^

    It seems plausible that a functional self might tend to be associated with consciousness in the space of possibilities. But we can certainly imagine these traits being observably present in a purely mechanical qualia-free system (eg even a thermostat can be observed to have something functionally similar to a goal), or qualia being present without that involving any sort of consistent traits (eg some types of Boltzmann brains).



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

大型语言模型 LLM 自我 AI安全 人工智能
相关文章