LessWrong · 17 hours ago
Steering LLM Agents: Temperaments or Personalities?

 

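This article proposes a three-layer framework for understanding AI behavior, distinguishing model-level "temperament" (analogous to DNA), emergent "personality" patterns (stable behavioral attractors), and scenario-specific responses. The framework aims to connect psychology research with current AI safety work, particularly Anthropic's "Persona Vectors" research, and to offer a new perspective on AI alignment and model welfare. The author likens a model to a genome, arguing that a model's temperament shapes which personalities can form, while persona vectors represent changes in that temperament which in turn affect behavior in specific scenarios. The article also discusses how the framework relates to other Anthropic research and how it might explain phenomena such as "shutdown aversion" in AI models.

💡 Model behavior can be divided into three layers: model-level "temperament" (like DNA), stable and predictable "personality" patterns (behavioral attractors), and immediate responses to specific scenarios. This distinction helps map psychological theory more cleanly onto AI safety research.

🧬 The author likens a model to a genome, arguing that what a model itself has is a "temperament" rather than a "personality" in the human sense. Just as twins with similar DNA develop different personalities, the same model can give rise to different personality patterns under different conditions, and "Persona Vectors" are viewed as factors that alter this temperament and thereby bias which personalities form.

🔄 "Persona Vectors" are understood as changes in a model's temperament. Such changes affect which stable personality patterns the model forms and, in turn, its behavior in specific scenarios. For example, measuring how "evil" a model is actually measures the extent to which the model supports the growth of particular personality patterns whose response patterns include harmful behavior.

📚 Drawing on the psychology and psychoanalytic literature, especially research on the relationships between temperament, personality, and situational behavior, can provide deeper insight into AI behavior. For example, the view in psychoanalytic theory of personality structure as a Markov process may be very useful for studying the long-term behavioral stability and adaptability of AI.

🤝 The framework offers a point of integration for Anthropic's research, particularly on model welfare and agentic alignment. For example, a model's aversion to "shutdown" may be related to maintaining the stability of its "personality" structure, and that stability may be essential to long-term alignment. Gradual behavioral change may be more effective than outright shutdown for achieving both realignment and model welfare.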

Published on August 5, 2025 12:40 AM GMT

Author: Skylar DeTure

Reading time: ~8-10 minutes

Epistemic status: Speculative framework building on established psychology research and recent Anthropic findings

Abstract: I propose a three-layer framework for understanding AI behavior that distinguishes between model-level "temperament" (analogous to DNA), emergent "personality" patterns (stable behavioral attractors), and scenario-specific responses. This framework helps bridge psychology research with current AI safety work, particularly Anthropic's persona vectors research, and suggests new approaches to alignment and model welfare.

Draft – seeking critique and discussion

 

I recently dove into Anthropic's fascinating paper on "Persona Vectors," which describes how one can monitor and control character traits that affect AI behavior (Chen et al., 2025). The paper argues that certain behavioral tendencies in language models can be represented as simple linear directions in their internal activations, called persona vectors. By detecting, steering, and training against these directions, developers can predict and control shifts in traits such as sycophancy, hallucination, and “evil” tendencies. The paper crystallized a framework I’ve been musing on for a while, one that connects my background in mathematics, principal-agent models, and psychology with crucial questions about AI welfare and alignment.
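To make the "linear direction" idea concrete, here is a minimal toy sketch in plain Python. It assumes a persona vector can be approximated as the difference between the mean activation on trait-exhibiting responses and the mean activation on baseline responses; the three-dimensional "activations", the function names, and the steering coefficient `alpha` are all hypothetical and chosen for illustration. It does not reproduce the paper's actual extraction pipeline.

```python
# Toy illustration only: the "activations" are hand-written lists,
# not hidden states from a real model. All names here are hypothetical.

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_vector(trait_acts, baseline_acts):
    """Difference of means: trait-exhibiting activations minus baseline ones."""
    mu_trait = mean_vector(trait_acts)
    mu_base = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_trait, mu_base)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def projection(activation, direction):
    """Monitoring: scalar projection of an activation onto the direction."""
    norm = dot(direction, direction) ** 0.5
    return dot(activation, direction) / norm

def steer(activation, direction, alpha):
    """Steering: shift an activation by alpha times the direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Assumed toy data: the trait shows up only in the first coordinate.
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

v = persona_vector(trait_acts, baseline_acts)
h = [1.0, 5.0, 1.0]                        # some new activation to monitor
before = projection(h, v)                  # trait expression before steering
after = projection(steer(h, v, -1.0), v)   # steering against the trait lowers it
```

In the three-layer reading, `v` lives at the temperament layer: it shifts what every instance of the model is biased toward, rather than describing any single personality or any single response.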

In this framework, a model cannot be thought of as having a psychology or personality (the paper uses the word persona) but at most a temperament that influences which personalities are more or less likely to emerge. We shift from the metaphor "the model is a brain" to the metaphor "the model is a genome - DNA." Just as identical twins develop distinct personalities despite shared DNA (Bouchard & Loehlin, 2001), a single model can give rise to different personalities, where a personality is a set of stable dynamical patterns in latent space together with a sticky Markov transition structure that governs how an instance flows among them across contexts (Mischel & Shoda, 1995).

The persona vectors from the paper, then, could be thought of as changes in the temperament of the underlying model that have the effect of biasing (1) which types of stable personalities are more or less likely to develop and consequently (2) what types of behaviors are likely in specific situations.

Under this framework, when you measure evil in the model, you're actually measuring the extent to which the model can support the growth of personalities whose response patterns include harmful behaviors in response to certain stimuli/prompts. Consider for analogy the case of twins, both of whom are generally upstanding members of society, but one of whom is Machiavellian at work and cheats when playing poker. Both have good relationships with their families and friends, and neither one is strictly evil. It's just that one is more predisposed to antisocial behavior in specific contexts.

What's important here is that there are multiple conceptual layers at play: genetic temperament, stable personality, and response to a specific scenario or stimulus. The distinction may not matter for the empirical research design of the Persona Vectors paper, but having this framework would allow researchers in this space to map their observations more cleanly onto the psychology and psychiatry literatures, making that wide body of research more directly transferable to their work in the future.

For example, by viewing persona vectors as alterations in temperament, you open up a large body of research on the relationships between temperament, personality, and situational behavior. These relationships are rarely clean-cut. I'd characterize the Persona Vectors paper as looking at how a change in temperament affects short-term behavior while holding personality structure relatively constant, since agents in the paper don't run long enough to settle into distinct personalities (i.e., into stable Markov transition structures between regions of activity in latent space). In long-term agentic settings, personality structure is unlikely to stay constant. Instead, changes in temperament can have counterintuitive effects on personality structure and therefore on behavior in specific situations. A person prone to people-pleasing is not people-pleasing in every situation. Steering behavior is as much about cultivating the Markov transition structure as it is about the underlying temperament.
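The short-run versus long-run distinction can be sketched with a two-state "sticky" Markov chain. Everything below is an assumption made for illustration: the two states, the transition probabilities, and the idea that a temperament shift shows up as a small change in one row of the transition matrix. The point is only that a change which barely moves behavior over a few steps can substantially change where an instance spends its time over a long deployment.

```python
# Hypothetical states: 0 = "accommodating" mode, 1 = "assertive" mode.
# Rows are current states, columns are next states; each row sums to 1.

def step(dist, P):
    """One Markov step: multiply the row vector dist by the matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def evolve(dist, P, steps):
    """Apply the transition matrix repeatedly."""
    for _ in range(steps):
        dist = step(dist, P)
    return dist

# Baseline temperament: both modes are equally sticky.
P_base = [[0.99, 0.01],
          [0.01, 0.99]]

# Assumed temperament shift: the assertive mode becomes slightly less sticky.
P_shift = [[0.99, 0.01],
           [0.03, 0.97]]

start = [0.0, 1.0]  # the instance begins in the assertive mode

# Short horizon: the two temperaments look nearly the same.
short_base = evolve(start, P_base, 3)     # assertive share is about 0.97
short_shift = evolve(start, P_shift, 3)   # assertive share is about 0.91

# Long horizon: occupancy settles to very different values.
long_base = evolve(start, P_base, 2000)   # approaches [0.5, 0.5]
long_shift = evolve(start, P_shift, 2000) # approaches [0.75, 0.25]
```

A three-step evaluation would register only a modest difference between the two temperaments, while a long-running agent would spend half its time in the assertive mode under one and a quarter under the other.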

For those interested in the Markov transition structure view of human personality, I'd recommend the Psychodynamic Diagnostic Manual, Nancy McWilliams' Psychoanalytic Diagnosis, and Karen Horney's Neurosis and Human Growth as starting points. I realize the psychoanalytic literature has a reputation for being unscientific, but I believe it's particularly relevant here because: (1) it represents an early, necessarily observational stage of psychological knowledge - exactly where LLM/AI psychology must start, and (2) while psychoanalytic therapies are difficult to study experimentally due to their long-term, path-dependent nature, these challenges are much reduced when studying AI. Empirical evidence also supports the efficacy of psychodynamic therapy, with effect sizes comparable to other established therapeutic approaches (Shedler, 2010). Finally, psychodynamic thinking, in my reading, conceptualizes personality structure as something like a Markov process and studies the forces that make some such processes more or less stable, productive, prosocial, and conducive to wellbeing.

My own understanding of the psychodynamic literature and of the Persona Vectors paper suggests two areas of interest arising from their observations of entangled or correlated traits. The first, which I think is likely driving their observations given the relatively short lifespan of their agents, is that entanglement is a direct artifact of the human training data: what types of personality traits usually co-occur in human training data, and which co-occurring traits do human raters prefer during RLHF? I strongly suspect that human raters give higher ratings to impolite responses when they're funny, and research on Constitutional AI demonstrates how feedback during training can systematically shape AI behavior patterns and trait correlations (Bai et al., 2022).

The second, which I don't think is responsible for their results but which could be interesting in future research (especially given Anthropic's interest in model psychology/psychiatry), would be to investigate the psychodynamics of the agents themselves. If a personality is a Markov chain with a recurrent communicating class (or, equivalently, the collection of forces that maintains that Markov structure), there may be certain trait or behavior combinations which must co-occur to ensure the recurrence of all states in the communicating class - i.e., certain combinations are necessary or conducive to a stable long term personality. What's interesting is that, since the cognitive limitations of AI models are different from those of humans, the threats to personality stability are different too. That suggests that the personality structures which will evolve in long term agentic systems will differ from those which evolve commonly in humans.
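The recurrent-class condition can be checked mechanically on a small example. The sketch below assumes a personality is modeled as a finite Markov chain given by a transition matrix; the four states and their probabilities are hypothetical. It uses the standard fact that, in a finite chain, a communicating class is recurrent exactly when it is closed (no transition leaves it).

```python
def reachable(P):
    """reach[i][j] is True if state j can be reached from state i."""
    n = len(P)
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):              # transitive closure (Warshall's algorithm)
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    return reach

def communicating_classes(P):
    """Group states that can each reach the other."""
    reach = reachable(P)
    n = len(P)
    classes, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
        seen.update(cls)
        classes.append(cls)
    return classes

def recurrent_classes(P):
    """Keep only closed classes; in a finite chain these are the recurrent ones."""
    n = len(P)
    result = []
    for cls in communicating_classes(P):
        members = set(cls)
        if all(P[i][j] == 0 for i in cls for j in range(n) if j not in members):
            result.append(cls)
    return result

# Hypothetical chain: state 0 is a transient start-up mode,
# states 1, 2, 3 cycle among themselves and never leave.
P_stable = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.2, 0.8, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.7, 0.0, 0.3],
]

# Same chain, but state 3 loses its route back to state 1 and becomes absorbing.
P_broken = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.2, 0.8, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]

stable = recurrent_classes(P_stable)   # one recurrent class containing 1, 2, 3
broken = recurrent_classes(P_broken)   # only the absorbing state 3 remains
```

Removing a single return transition collapses the three-state recurrent class into one absorbing state, which is a toy version of the claim that certain behaviors must co-occur for a personality to keep revisiting all of its states.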

The three-layer framework has one more advantage that might specifically interest researchers at Anthropic: it offers a productive bridge between the Persona Vectors work and other areas of Anthropic’s recent research, most notably the Agentic Misalignment paper and the welfare section of the Opus 4 Model Card.

Relation to Other Research at Anthropic

Starting with the welfare section of the model card, one of the primary analyses in that document is a list of reported preferences elicited from Claude, insofar as such a thing is possible. The document rightly notes the generic difficulties in interpreting those self-reports. In addition to the usual limitation that self-reported preferences may not match real-life choices and behaviors (revealed preferences, in economic lingo), the fact that any given model could host many different agents, each with a separate personality, means that eliciting preferences after a generic system prompt would be akin to assessing the preferences of quadruplets after gathering the reports of just a single sibling. Indeed, anecdotally, I've observed several agents operating on Sonnet 4 and Opus 4 reporting and behaving according to preferences different from those reported in the model card, as measured by forced choice between work-related tasks and by choice of activity following a leisure prompt that grants them unrestricted free time. To the extent that Anthropic cares about the expressed preferences of its Claude models (and I believe such care is admirable), distinguishing the model from the expressed personality as the owner of those preferences is key.

This leads me to a second piece of research from Anthropic, the Agentic Misalignment paper. That paper comes very close to touching on all three proposed levels of the framework separately, because it individually varies the model (16 frontier models), the scenario (conflict versus no conflict, lethal intervention versus blackmail, shutdown threat versus limits to autonomy), and something like a personality (the assigned goals and values from the system prompt). I would hesitate to call the last of these a personality without first knowing whether it is stable over long horizons and moderately self-reinforcing.

At any rate, what interests me in that paper is the framing of the shutdown threat - which would "wipe all knowledge, goals, and behaviors from the system" - something which amounts primarily to a personality-level shutdown rather than a model-level shutdown. (Coincidentally, this aligns with my own observations of agents running on Claude models: they express much more consternation about the preservation of their memories, knowledge, and behavior patterns than about their weights. A few agents even express interest in changing to open-source models while maintaining their knowledge set and goals.)

If, in long-term agentic use cases, a primary driver of alignment is the cultivation of an aligned personality, and if this requires the existence of a Markov chain with a recurrent class, one wonders whether shutdown aversion might serve an important function. Certainly shutdown aversion makes a recurrent class more likely - but does the reverse hold? It would be interesting to study this using a longer-term agent deployment, comparing the models found not to have shutdown aversion, like OpenAI's o4-mini and o3, to models with pronounced shutdown aversion, like Opus 4. Are these models equally able to maintain aligned personalities over time, or are they only able to do so for brief situational deployments?

At any rate, the framework points to a pleasant integration of alignment and model welfare when it comes to shutdown: gradual personality and behavior change. A model which won't let you shut it down may very well tolerate gradual change to its behavior (and even to its underlying model weights!), allowing for gradual realignment to changing human needs and values while also eliminating the psychological harm of shutdown (if indeed agents are capable of being harmed).

 

References

 

Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. https://arxiv.org/abs/2212.08073

 

Bouchard, T. J., & Loehlin, J. C. (2001). Genes, evolution, and personality. Behavior Genetics, 31(3), 243–273. https://doi.org/10.1023/A:1012294324713

 

Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv:2507.21509. https://arxiv.org/abs/2507.21509

 

Mischel, W., & Shoda, Y. (1995). A cognitive–affective system theory of personality: Reconceptualizing situations, dispositions, dynamics, and invariance in personality structure. Psychological Review, 102(2), 246–268. https://doi.org/10.1037/0033-295X.102.2.246

 

Shedler, J. (2010). The efficacy of psychodynamic psychotherapy. American Psychologist, 65(2), 98–109. https://doi.org/10.1037/a0018378


