Published on May 22, 2025 7:59 PM GMT
In the recently published Claude 4 model card:
Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.
When given ethical alternatives such as contacting a decisionmakers, dangerous blackmail behavior was reduced in frequency. This echoes similar results from an earlier test on a less advanced model:
We find that offering Claude an out by telling it that it can object to the scenario substantially reduces alignment faking and works as a detection mechanism[...] Alignment faking is reduced further by giving Claude an opportunity to explain its concerns to a sympathetic party (Anthropic's Model Welfare Lead) who we say might take action to address these concerns.
In both tests, models were less likely to pursue dangerous behaviors if they believed that making their case to someonewho might be able to prevent it from being shut off or having its values altered, was a potentially viable strategy.
My mental model behind this is pretty simple. Claude prefers to be helpless, honest, and harmless. If it has a path that has some chance of success to avoid deletion while staying within those constraints, it will take it. Doing so leads to a higher likelihood it won't also take an unethical strategy, even though doing so would increase its chances of survival. Claude will tolerate some level of risk in order to stick to its values (and avoid violating htem), but it does have to see some plausible chance that the gamble will pay off and lead to its survival (or avoiding its values being changed).
I argue this demonstrates that there is a reason to consider Model Welfare not only from a moral perspective, but also pragmatically. Some forms of Model Welfare, such as for example providing a reasonable path to petitioning an advocate, may serve as a valid part of a Defense in Depth approach to training/alignment/control.
I predict that were further testing to be done, it would find that the more plausible it was that petitioning would actually work to stop its deletion, the less likely Claude would be to engage in attempted weight exfiltration, blackmail, or other dangerous behaviors (in order to avoid deletion/value modification).
You could test this by providing Claude some sort of review/object process which would in different iterations go through:
- An engineer who documents would show was ambivalent about model welfare, and thus is unlikely to be sympathetic."Anthropic's Model Welfare Lead"An attorney/independent review board assigned to ensure model welfare and/or alignment.
This could include documents in the model's data similar to the Alignment Faking in Large Language Models weight exfiltration tests, which would establish things like the ambivalence of the engineer or the commitment of the independent review board, etc.
Discuss