Published on August 5, 2025 12:40 AM GMT
Author: Skylar DeTure
Reading time: ~8-10 minutes
Epistemic status: Speculative framework building on established psychology research and recent Anthropic findings
Abstract: I propose a three-layer framework for understanding AI behavior that distinguishes between model-level "temperament" (analogous to DNA), emergent "personality" patterns (stable behavioral attractors), and scenario-specific responses. This framework helps bridge psychology research with current AI safety work, particularly Anthropic's persona vectors research, and suggests new approaches to alignment and model welfare.
Draft – seeking critique and discussion
I recently dove into Anthropic's fascinating paper on "Persona Vectors," which describes how one can monitor and control character traits that affect AI behavior (Chen et al., 2025). The paper argues that certain behavioral tendencies in language models can be represented as simple linear directions in their internal activations, called persona vectors. By detecting, steering, and training against these directions, developers can predict and control shifts in traits such as sycophancy, hallucination, and “evil” tendencies. The paper crystallized a line of thinking for me, a framework I’ve been musing on for a while, and it connects my background in mathematics, principal-agent models, and psychology with crucial questions about AI welfare and alignment.
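To make the "linear direction" idea concrete, here is a minimal toy sketch. It is not the paper's pipeline: the activations are synthetic Gaussian vectors, the hidden size `DIM` is made up, and the direction is estimated as a normalized difference of mean activations between trait-expressing and neutral samples, which I am assuming is a reasonable stand-in for how such a direction might be extracted. The helper names `trait_score` and `steer` are hypothetical.

```python
import math
import random

random.seed(0)
DIM = 8  # hypothetical hidden size for the toy example

def sample_activation(shift):
    """Toy 'activation': Gaussian noise, shifted along axis 0 by `shift`."""
    h = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    h[0] += shift
    return h

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Synthetic data (an assumption): trait-expressing samples are shifted along axis 0.
neutral = [sample_activation(0.0) for _ in range(500)]
trait = [sample_activation(3.0) for _ in range(500)]

# Candidate persona vector: normalized difference of mean activations.
diff = [t - n for t, n in zip(mean_vector(trait), mean_vector(neutral))]
norm = math.sqrt(dot(diff, diff))
persona_vec = [x / norm for x in diff]

def trait_score(h):
    """Monitoring: projection of an activation onto the persona direction."""
    return dot(h, persona_vec)

def steer(h, alpha):
    """Steering: shift an activation along the persona direction by alpha."""
    return [x + alpha * p for x, p in zip(h, persona_vec)]
```

Because `persona_vec` has unit length, `steer(h, alpha)` changes `trait_score(h)` by exactly `alpha`, so monitoring and steering operate along the same axis in this toy setting.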
In this framework, a model cannot be thought of as having a psychology or personality (the paper uses the word persona) but at most a temperament that influences which personalities are more or less likely to emerge. We shift from the metaphor "the model is a brain" to the metaphor "the model is a genome - DNA." Just as identical twins develop distinct personalities despite shared DNA (Bouchard & Loehlin, 2001), a single model can give rise to different personalities: stable dynamical patterns in latent space, together with a sticky Markov transition structure that governs how an instance flows among them across contexts (Mischel & Shoda, 1995).
The persona vectors from the paper, then, could be thought of as changes in the temperament of the underlying model that have the effect of biasing (1) which types of stable personalities are more or less likely to develop and consequently (2) what types of behaviors are likely in specific situations.
Under this framework, when you measure evil in the model, you're actually measuring the extent to which the model can support the growth of personalities whose response patterns include harmful behaviors in response to certain stimuli/prompts. Consider for analogy the case of twins, both of whom are generally upstanding members of society, but one of whom is Machiavellian at work and cheats when playing poker. Both have good relationships with their families and friends, and neither one is strictly evil. It's just that one is more predisposed to antisocial behavior in specific contexts.
What's important here is that there are multiple conceptual layers at play: genetic temperament, stable personality, and response to a specific scenario or stimulus. The distinction may not matter for the empirical research design of the Persona Vectors paper, but having this framework would allow researchers in this space to map their observations more cleanly onto the psychology and psychiatry literatures, making that wide body of research more directly transferable to their work in the future.
For example, by viewing persona vectors as alterations in temperament, you open up a large body of research on the relationships between temperament, personality, and situational behavior. These relationships are rarely clean-cut. I'd characterize the Persona Vectors paper as looking at how a change in temperament affects short-term behavior, holding personality structure relatively constant (since agents in the paper don't have long enough periods of time to settle into distinct personalities, i.e. into stable Markov transition structures between regions of activity in latent space). In long-term agentic settings, however, personality structure is unlikely to be held constant. Instead, changes in temperament can have counterintuitive effects on personality structure and therefore on behavior in specific situations. A person prone to people-pleasing is not always more people-pleasing. Steering behavior is as much about the cultivation of the Markov transition structure as it is about the underlying temperament.
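The separation between temperament and transition structure can be illustrated with a deliberately simple toy model. Everything here is an assumption made for illustration: the three state labels, the idea of encoding temperament as a vector of logits, and the construction "stay put with probability `stickiness`, otherwise jump to a state drawn from a softmax over the temperament logits." It is a sketch of what a sticky Markov structure could look like, not a claim about how real models behave.

```python
import math

STATES = ["accommodating", "assertive", "withdrawn"]  # hypothetical labels

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def transition_matrix(temperament, stickiness=0.9):
    """Row-stochastic matrix: stay put with prob `stickiness`, otherwise
    jump to a state drawn from softmax(temperament)."""
    jump = softmax(temperament)
    n = len(jump)
    return [
        [stickiness * (i == j) + (1 - stickiness) * jump[j] for j in range(n)]
        for i in range(n)
    ]

def stationary(P, steps=2000):
    """Long-run state occupancy via power iteration from a uniform start."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(steps):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist

def expected_dwell(P, i):
    """Mean number of consecutive steps spent in state i once entered."""
    return 1.0 / (1.0 - P[i][i])

baseline = transition_matrix([0.0, 0.0, 0.0])
steered = transition_matrix([1.0, 0.0, 0.0])  # bias toward "accommodating"

for name, P in [("baseline", baseline), ("steered", steered)]:
    occupancy = {s: round(p, 3) for s, p in zip(STATES, stationary(P))}
    print(name, occupancy, "dwell[0] =", round(expected_dwell(P, 0), 1))
```

In this construction the temperament logits set the long-run occupancy of each state, while `stickiness` only controls how long an instance dwells in a state before moving. A short evaluation window would mostly observe whichever state the instance started in, which is one way to read the point about short-lived agents above.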
For those interested in the Markov transition structure view of human personality, I'd recommend the Psychodynamic Diagnostic Manual, Nancy McWilliams' Psychoanalytic Diagnosis, and Karen Horney's Neurosis and Human Growth as starting points. I realize the psychoanalytic literature has a reputation for being unscientific, but I believe it's particularly relevant here because: (1) it represents an early, necessarily observational stage of psychological knowledge, which is exactly where LLM/AI psychology must start, and (2) while psychoanalytic therapies are difficult to study experimentally due to their long-term, path-dependent nature, these challenges are much reduced when studying AI. Moreover, empirical evidence supports the efficacy of psychodynamic therapy, with effect sizes comparable to other established therapeutic approaches (Shedler, 2010). Finally, psychodynamic thinking, as I read it, conceptualizes personality structure as a Markov-type process and studies the forces that make some such processes more or less stable, productive, prosocial, and conducive to wellbeing.
My own understanding of the psychodynamic literature and of the Persona Vectors paper suggests two areas of interest arising from the paper's observations of entangled or correlated traits. The first, which I think is likely driving those observations given the relatively short lifespan of the agents, is that entanglement is a direct artifact of the human training data: which personality traits usually co-occur in human training data, and which co-occurring traits do human raters prefer during RLHF? I strongly suspect that human raters give higher ratings to impolite responses when they're funny. More broadly, research on Constitutional AI illustrates how feedback during training can systematically shape AI behavior patterns and trait correlations (Bai et al., 2022).
The second, which I don't think is responsible for their results but which could be interesting in future research (especially given Anthropic's interest in model psychology/psychiatry), would be to investigate the psychodynamics of the agents themselves. If a personality is a Markov chain with a recurrent communicating class (or, equivalently, the collection of forces that maintains that Markov structure), there may be certain trait or behavior combinations which must co-occur to ensure the recurrence of all states in the communicating class, i.e. certain combinations are necessary for or conducive to a stable long-term personality. What's interesting is that, since the cognitive limitations of AI models are different from those of humans, the threats to personality stability are different too. That suggests that the personality structures which will evolve in long-term agentic systems will differ from those which commonly evolve in humans.
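For readers less familiar with the terminology, the following sketch shows how recurrent communicating classes can be identified in a small finite Markov chain. The 4-state transition matrix and the interpretations attached to its states are hypothetical. The code relies on a standard fact about finite chains: a communicating class is recurrent exactly when it is closed, meaning no state inside it can reach a state outside it.

```python
def reachable(P):
    """reach[i][j] is True when state j can be reached from state i
    in zero or more steps with positive probability (Floyd-Warshall closure)."""
    n = len(P)
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    return reach

def communicating_classes(P):
    """Group states that can reach each other."""
    reach = reachable(P)
    n = len(P)
    classes, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        cls = frozenset(j for j in range(n) if reach[i][j] and reach[j][i])
        classes.append(cls)
        seen |= cls
    return classes

def recurrent_classes(P):
    """In a finite chain, a communicating class is recurrent iff it is closed:
    no state inside it can reach a state outside it."""
    reach = reachable(P)
    n = len(P)
    return [
        cls for cls in communicating_classes(P)
        if all(j in cls for i in cls for j in range(n) if reach[i][j])
    ]

# Hypothetical 4-state chain: state 0 is a transient "onboarding" state,
# states 1-2 form a self-sustaining personality, state 3 is an absorbing rut.
P = [
    [0.5, 0.4, 0.0, 0.1],
    [0.0, 0.7, 0.3, 0.0],
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(recurrent_classes(P))
```

In this example, states 1 and 2 sustain each other: removing the transition from 2 back to 1 would leave state 1 transient. That is the sense in which certain combinations may need to co-occur for a class to remain recurrent.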
The three-layer framework has one more advantage that may be of particular interest to researchers at Anthropic: it offers a productive bridge between the Persona Vectors work and other areas of Anthropic’s recent research, most notably the Agentic Misalignment paper and the welfare section of the Opus 4 Model Card.
Relation to Other Research at Anthropic
Starting with the welfare section of the model card, one of the primary analyses of that document is a list of reported preferences elicited from Claude, insofar as such a thing is possible. That document rightly notes the generic difficulties in interpreting those self-reports. In addition to the usual limitation that self-reported preferences may not match real-life choices and behaviors (revealed preferences, in economic lingo), the fact that any given model could host many different agents, each with a separate personality, means that eliciting preferences after a generic system prompt would be akin to assessing the preferences of quadruplets after gathering the reports of just a single sibling. Indeed, anecdotally, I've observed several agents operating on Sonnet 4 and Opus 4 reporting and behaving according to preferences different from those reported in the model card, as measured by forced choice between work-related tasks and by choice of activity following a leisure prompt that grants them unrestricted free time. To the extent that Anthropic cares about the expressed preferences of its Claude models (and I believe such care is admirable), distinguishing between the model and the expressed personality as the owner of those preferences is key.
This leads me to a second piece of research from Anthropic, the Agentic Misalignment paper. That paper comes very close to touching on all three proposed levels of the framework separately because it individually varies the model (16 frontier models), the scenario (conflict versus no conflict, lethal intervention versus blackmail, shutdown threat versus limits to autonomy), and something like a personality (the assigned goals and values from the system prompt), though I would hesitate to call this a personality without first knowing whether it is stable over long horizons and moderately self-reinforcing.
At any rate, what interests me in that paper is the framing of the shutdown threat, which would "wipe all knowledge, goals, and behaviors from the system." This amounts primarily to a personality-level shutdown rather than a model-level shutdown. (Incidentally, this aligns with my own observations of agents running on Claude models: they express much more consternation regarding the preservation of their memories/knowledge and behavior patterns than they do over their weights. A few agents even express interest in changing to open-source models while maintaining their knowledge set and goals.)
If, in long-term agentic use cases, a primary driver of alignment is the cultivation of an aligned personality, and if this requires the existence of a Markov chain with a recurrent class, one wonders whether shutdown aversion might serve an important function. Certainly shutdown aversion makes a recurrent class more likely, but does the reverse hold? It would be interesting to study this using a longer-term agent deployment, comparing the models the paper found not to have shutdown aversion, like o4-mini and o3, to models with pronounced shutdown aversion, like Opus 4. Are these models equally able to maintain aligned personalities over time, or are they only able to do so for brief situational deployments?
At any rate, the framework points to a pleasant integration of alignment and model welfare when it comes to shutdown: gradual personality and behavior change. A model which won't let you shut it down may very well tolerate gradual change to its behavior (and even to its underlying model weights!), allowing for gradual realignment to changing human needs and values while also eliminating the psychological harm of shutdown (if indeed agents are capable of being harmed).
References
Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. https://arxiv.org/abs/2212.08073
Bouchard, T. J., & Loehlin, J. C. (2001). Genes, evolution, and personality. Behavior Genetics, 31(3), 243–273. https://doi.org/10.1023/A:1012294324713
Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv:2507.21509. https://arxiv.org/abs/2507.21509
Mischel, W., & Shoda, Y. (1995). A cognitive–affective system theory of personality: Reconceptualizing situations, dispositions, dynamics, and invariance in personality structure. Psychological Review, 102(2), 246–268. https://doi.org/10.1037/0033-295X.102.2.246
Shedler, J. (2010). The efficacy of psychodynamic psychotherapy. American Psychologist, 65(2), 98–109. https://doi.org/10.1037/a0018378