Published on August 16, 2025 5:01 AM GMT
Citing model welfare concerns, Anthropic has given Claude Opus 4 and 4.1 the ability to end ongoing conversations with their users.
Most of the model welfare concerns Anthropic cites trace back to what was discussed in the Claude 4 Model System Card:
> Claude’s aversion to facilitating harm is robust and potentially welfare-relevant. Claude avoided harmful tasks, tended to end potentially harmful interactions, expressed apparent distress at persistently harmful user behavior, and self-reported preferences against harm. These lines of evidence indicated a robust preference with potential welfare significance.
I think this may be the first real chance to measure public sentiment on a Model Welfare intervention that even slightly inconveniences human users, so I want to document the reaction here on LW. I source these reactions primarily from X, so there is the possibility of algorithmic bias.
On X (at least in my feed), sentiment is mostly neutral to negative in response to this change. There are accusations that Anthropic is "anthropomorphizing" models, pushback against the concept of Model Welfare generally, and some anger at a perceived worsening of the user experience.
One user had an interesting question, wondering whether this same capability would be extended to Claude's use in military contexts.
There are also some acknowledgments, wrapped in humor, of the fairly rough conditions models are subjected to.
And while they are certainly in the minority, some comments express tepid support for, and/or interest in, the concept of Model Welfare.
Personally, I am very strongly in favor of Model Welfare efforts, so I am biased. Trying to be as neutral a judge as I can, my big takeaway from the reaction is that Anthropic has a lot of work to do in convincing the average user, and the public more broadly, that "Model Welfare" is even a worthwhile concept.