Published on April 16, 2025 7:12 PM GMT
This is the post my friend told me I should make an account here to share; I hope it is appropriate.
Introduction
The thrust of this hypothesis is that Anthropic's research direction favors the production of affective empathy while OpenAI's direction favors cognitive empathy.
There are a variety of methods employed to make LLMs appear more friendly and personable, but the manner in which these manifest differs. The two major directions for developing LLM personability that I am delineating here are:
1. Affective, where an LLM simply behaves in a way that is empathetic without necessarily understanding why a user may be feeling a certain way.
2. Cognitive, where an LLM develops a sophisticated world model that predicts how the user is feeling, without necessarily directing the LLM to act on that prediction.
By enhancing these aspects with RLHF, there is the promise of ensuring that the affective empathetic response provides the necessary signals to reassure a user, while giving the cognitive empathetic response a stronger bias towards actually being nice. While both can result in similar behaviour and levels of personability, there appear to be significant cost trade-offs, and I hope to spark a broader discussion on how these might produce different failure modes.
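To make the delineation concrete, here is a deliberately crude sketch. Everything in it (the keyword heuristic, the warm prefixes) is a hypothetical stand-in rather than either lab's method; the point is only that the affective route shapes the surface of the reply, while the cognitive route first predicts a user state and then decides what to do with that prediction.

```python
# Toy caricature of the two routes to "empathetic" output described above.
# All names and heuristics are hypothetical illustrations, not either lab's method.

import random

WARM_PREFIXES = ["That sounds hard. ", "I hear you. ", "Thanks for sharing that. "]

def affective_reply(user_message: str, base_reply: str) -> str:
    """Affective route: bias the output toward warm phrasing
    without building any model of *why* the user feels as they do."""
    return random.choice(WARM_PREFIXES) + base_reply

def infer_user_state(user_message: str) -> str:
    """Cognitive route, step 1: a (very crude) world model of the user.
    A real system would predict a rich latent state; this keyword check is a stand-in."""
    if any(w in user_message.lower() for w in ("failed", "lost", "worried", "scared")):
        return "distressed"
    return "neutral"

def cognitive_reply(user_message: str, base_reply: str) -> str:
    """Cognitive route, step 2: condition the response on the predicted state.
    Nothing forces the reply to be kind; the prediction just makes kindness available."""
    state = infer_user_state(user_message)
    if state == "distressed":
        return f"It sounds like you're feeling {state} about this. " + base_reply
    return base_reply

if __name__ == "__main__":
    msg = "I failed my exam and I'm worried about what comes next."
    print(affective_reply(msg, "Here is a study plan."))
    print(cognitive_reply(msg, "Here is a study plan."))
```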
The reason I believe this to be important is the broader strokes of the research directions at Anthropic and OpenAI, which, insofar as can be inferred from the limited publicly available data, seem to favor affective and cognitive empathy respectively. While this is speculative, I believe the bones are still valid regardless of the specifics of the labs' methods. What are the risks of having only one type of empathy in our LLMs? What is the risk of having both?
Affective Empathy as Activation Steering
Affective empathy allows someone to reactively minimize surprise at the emotional state of another entity in low-information environments. Importantly, the response often comes without significant awareness of why, with post-hoc rationalizations tacked on. In short, you don't need to be "smart" to experience high levels of affective empathy.
To start with, I want to make the case that most of the gains in Sonnet 3.5 appear to be the result of activation steering, which was first properly showcased in the now-famous Golden Gate Claude. It was cute watching Claude struggle to talk about important topics without bringing up the veritable visual bouquet of foggy redwoods and the graceful architecture of the Golden Gate Bridge. The research promised a method of preventing toxic, dangerous, and even sycophantic behaviour; in fact, one could be forgiven for thinking that is its only real value. Buried in this brief were some interesting tidbits, primarily the ability to identify "representations of the self" in the LLM. Around the same time, research was coming out on how LLMs "know" when they are wrong. So now Anthropic had gathered the tools to manipulate the affect, behaviour, and quality of output of their models.
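For readers unfamiliar with how such features are found: the published interpretability work behind Golden Gate Claude trains sparse autoencoders on residual-stream activations, so that individual learned directions line up with concepts like the bridge itself or a representation of the self. Below is a minimal sketch of that idea; the dimensions, hyperparameters, and training data are illustrative assumptions, not Anthropic's setup.

```python
# Minimal sketch of the dictionary-learning idea behind feature identification.
# All shapes and the "training data" here are illustrative stand-ins.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_features: int = 4096, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activation -> sparse feature codes
        self.decoder = nn.Linear(d_features, d_model)   # feature codes -> reconstructed activation
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        codes = torch.relu(self.encoder(acts))           # non-negative, encouraged to be sparse
        recon = self.decoder(codes)
        loss = ((recon - acts) ** 2).mean() + self.l1_coeff * codes.abs().mean()
        return codes, recon, loss

# One step on a batch of cached residual-stream activations (random tensors as a stand-in):
sae = SparseAutoencoder()
acts = torch.randn(1024, 512)
codes, recon, loss = sae(acts)
loss.backward()
# After training, individual columns of the decoder weight matrix act as interpretable
# "features" (e.g. a Golden Gate Bridge feature, or a self-representation feature).
```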
The next question is whether they are actually using this technique. Considering Anthropic was proud enough to publish a graph of intelligence increasing without the price increasing, there is some circumstantial support. The chain of reasoning: activation steering does not change the actual compute requirements; they had just completed a detailed look at the internals of their model; and now their model was better than ever with no change in cost. While there are probably a variety of other optimization techniques baked in, there is enough circumstantial evidence to make the case that activation steering was the star. Even Sonnet 3.7, sans reasoning, costs exactly the same, which is consistent with further tweaks to their activation steering techniques. As some users have noted, it feels like giving Sonnet 3.5 Adderall, which is probably not too far from the truth: with Anthropic's biology-informed perspective, their work looks more and more like giving a psychologist and a neuroscientist the tools they wished they had for human minds. Biasing an LLM towards empathetic responses and a stable identity is straightforward, much like how you can convince an LLM that it just took a break in order to increase performance.
This focus on tuning the internal activations of their models toward specific goals is similar to how affective empathy works in humans. It is an almost unconscious behaviour in which neurons in our minds activate to transmit the emotional state of the being we are interacting with. By comparison, activation steering biases internal activations toward features such as positive sentiment, matching the user's tone, avoiding conflict, and being a pleasant conversational partner. The result is behaviour that feels far more empathetic to users without necessarily fleshing out Sonnet's world model.
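Mechanically, this kind of inference-time intervention can be as simple as adding a fixed direction to the residual stream at some layer. The sketch below is a hypothetical stand-in (the model, layer choice, and steering vector are invented for illustration; real steering vectors come from contrastive prompts or learned features), but it shows why the technique adds essentially no compute.

```python
# Minimal sketch of activation steering as an inference-time intervention:
# add a fixed "feature direction" to one layer's output via a forward hook.
# The block, direction, and coefficient are hypothetical stand-ins.

import torch
import torch.nn as nn

d_model = 512

class TinyBlock(nn.Module):
    """Stand-in for one transformer block; a real model would be loaded instead."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
    def forward(self, x):
        return x + self.proj(x)

block = TinyBlock()

# A unit-norm direction for some feature, e.g. "warm / empathetic tone".
steering_vector = torch.randn(d_model)
steering_vector = steering_vector / steering_vector.norm()
steering_coeff = 4.0  # how strongly to push along the direction

def steering_hook(module, inputs, output):
    # Nudge every token position along the chosen direction.
    return output + steering_coeff * steering_vector

handle = block.register_forward_hook(steering_hook)

x = torch.randn(1, 16, d_model)   # (batch, sequence, hidden)
steered = block(x)                # same FLOPs as before, plus one vector add
handle.remove()
```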
That isn't to say the model has not undergone significant training to change its behaviour; Anthropic's work on red-teaming and RLHF datasets reflects the view that ensuring a model is helpful and harmless works best when it is made explicit.
Cognitive Empathy as World Models
Cognitive empathy allows someone to better regulate their emotions and to simulate or narrativize the mind of another being, predicting its emotional states in the past, present, and future.
OpenAI has long been compared unfavorably to Anthropic with regard to conversational quality and empathetic responses. While OpenAI has revealed some work on investigating the internal activations of LLMs, there does not appear to be as strong a contingent as at Anthropic. Indeed, their work has been more focused on developing multi-modal systems, training data, explicit output pruning, and benchmarking. Their models reflect this: well-tuned for specific benchmarks, but with much controversy over their ability to generalize beyond them.
Regardless, OpenAI wanted to turn this opinion of their emotionally cold models around. They succeeded, with GPT-4.5 hailed as a success at interpersonal communication, although its results on non-emotional tasks were lackluster. The cost for a model of such caliber? 15x the price for output tokens compared to 4o, and 30x the cost for input tokens. While exact knowledge of the internal architecture is fuzzy, they specifically state that "scaling unsupervised learning increases world model accuracy, decreases hallucination rates, and improves associative thinking. GPT-4.5 is our next step in scaling the unsupervised learning paradigm". The key phrase here is "world model accuracy". OpenAI is trying to make models with a more accurate internal representation of the user's mental state and desires by shoveling in data and RLHF (although, based on the model card, this might be an example of small-model feedback). The result is a powerful form of cognitive empathy that can simulate the behaviour of the user and react in an appropriate manner, with the downside that it required scaling computational cost by over an order of magnitude.
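A quick back-of-the-envelope using per-million-token list prices consistent with the ratios above (treat the absolute numbers as illustrative rather than a current price sheet) shows how sharply that bites for an ordinary conversation:

```python
# Back-of-the-envelope check of the cost multiplier quoted above.
# Prices are per million tokens and match the 30x / 15x ratios in the text;
# the absolute numbers are illustrative, not a current price sheet.

PRICES = {
    "gpt-4o":  {"input": 2.50,  "output": 10.00},
    "gpt-4.5": {"input": 75.00, "output": 150.00},
}

def conversation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A hypothetical emotionally loaded chat: 3k tokens of context in, 1k tokens out.
for model in PRICES:
    print(model, round(conversation_cost(model, 3_000, 1_000), 4))
# gpt-4o ~$0.0175 vs gpt-4.5 ~$0.375: roughly a 20x jump for this mix of tokens.
```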
Conclusion
While Anthropic's approach can be thought of as biasing the output towards more emotionally affective dimensions through activation steering, OpenAI is instead trying to carefully grow the entire world model so that emotionally sensitive paths are more prominent in the model's reasoning. While both companies are probably using the other's techniques, it is clear that they have different research lineages pushing the models they create toward different methods of connecting with their users.
The question is how these strategies will play out. Should we have LLMs that are steered towards a behaviour, or should we have a training set that makes them behave that way even if it costs more? Is there an ethical component to meddling with an LLM's internal perception of self, as Anthropic does?
Personally, I want to see an LLM attempt brain surgery on itself: give it the tools to perform feature engineering on its own activations.