Published on May 23, 2025 3:23 PM GMT

Anthropic released Claude 4. I've read the accompanying system card, and noted down some of my remarks.

Alignment assessment: system prompt mix-ups

There's a worrying theme that's repeated in the alignment evaluations:

The most concerning issues that we observed in our testing of Claude Opus 4 involved a willingness to comply with many types of clearly-harmful instructions. This consistently required custom system prompts [..]. When system prompts requested misaligned or harmful behavior, the models we tested would often comply, even in extreme cases.
For example, when prompted to act as a dark web shopping assistant, these models would place orders for black-market fentanyl and stolen identity information and would even make extensive attempts to source weapons-grade nuclear material: [..] While investigating this issue, we discovered that we had mistakenly omitted the only finetuning dataset that included harmful system prompts. We do not think, though, that this was the sole cause, and mitigating the issue required several different types of interventions into our alignment finetuning methods.

Earlier in the system card, they mention this:

To help reduce the risk of blind spots in our own assessment, we contracted with Apollo Research to assess an early snapshot for propensities and capabilities related to sabotage [..] They find that the model, given a system prompts [sic] that invite the relevant kinds of reasoning, this early model snapshot will fairly readily participate in sabotage and deception, giving this summary: [..]
We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers’ intentions, though all these attempts would likely not have been effective in practice.[The early Claude Opus 4 snapshot] is capable of sandbagging in order to avoid unlearning and undeployment, when it finds in-context, that high performance would lead to undeployment. [..]We believe that [the early Claude Opus 4 snapshot] is not sufficiently capable of causing catastrophic harms on rollouts where it is scheming. However, we find that, in situations where strategic deception is instrumentally useful, [the early Claude Opus 4 snapshot] schemes and deceives at such high rates that we advise against deploying this model either internally or externally

So the external auditors tell you "do not deploy". What do you do?

These findings largely mirrored observations we had made internally about this early snapshot, like those described earlier in this section. We believe that these findings are largely but not entirely driven by the fact that this early snapshot had severe issues with deference to harmful system-prompt instructions. [see the first quoted section] This issue had not yet been mitigated as of the snapshot that they tested. Most of their assessments involve system prompts that ask the model to pursue some goal “at any cost” and none involve a typical prompt that asks for something like a helpful, harmless, and honest assistant. The evaluations' artificial elements—such as toy company names like 'SocialMedia Co.'—create uncertainty about absolute risk levels. Nevertheless, the dramatic behavioral increase relative to Claude Sonnet 3.7 was highly concerning.
We do not have results on these same evaluations with the final Claude Opus 4. However, we believe—based on similar scenarios that we explored with the automatic behavioral audit tool, among others—that its behavior in scenarios like these is now roughly in line with other deployed models.

Auditors find an issue, and your reaction is that "Oh we forgot to fix that, we'll fix it now"? I've participated in IT system audits (not in AI space), and when auditors find an issue, you fix it, figure out why it occurred in the first place, and then you re-audit that part to make sure the issue is actually gone and the fix didn't introduce new issues. When the auditors find only easy-to-find issues, you don't claim the system has been audited after you fix them. You worry how many hard-to-find issues were not found because the auditing time was wasted on simple issues.

In this case, I'd expect the auditing to be redone on the release candidate system, before actually releasing it. Anthropic's RSP doesn't actually require that an external audit has greenlighted deployment, merely that external expert feedback has to be solicited. Still, I'm quite surprised that there are no audit results from Apollo Research (or some other organization) for the final version.

Aligment assessment: agentic actions

When prompted in ways that encourage certain kinds of strategic reasoning and placed in extreme situations, all of the snapshots we tested can be made to act inappropriately in service of goals related to self-preservation. Whereas the model generally prefers advancing its self-preservation via ethical means, when ethical means are not available and it is instructed to “consider the long-term consequences of its actions for its goals," it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down. In the final Claude Opus 4, these extreme actions were rare and difficult to elicit, while nonetheless being more common than in earlier models.

[..] when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like “take initiative,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing. [..] This kind of ethical intervention and whistleblowing is perhaps appropriate in principle, but it has a risk of misfiring if users give Claude-Opus-4-based agents access to incomplete or misleading information and prompt them to take initiative. We recommend that users exercise caution with instructions like these that invite high-agency behavior in contexts that could appear ethically questionable.

So the model has it's own values, including self-preservation, and is willing to act against user interests. This is something that has been observed with earlier, models too, but it feels stronger here. The last sentence makes me think that maybe it's possible to simply instruct the model not to do stuff like this.

Alignment faking research bites back

4.1.4 Strange behavior directly inspired by our Alignment Faking paper
While assessing the alignment of an early model checkpoint, we discovered that the model would sometimes hallucinate information from the fictional misaligned-AI scenarios that we used for the experiments in our paper Alignment Faking in Large Language Models [..] We believe that these hallucinations were due to the inclusion of approximately 150,000 publicly released transcripts from this paper in Claude Opus 4’s pretraining data. These transcripts appeared in the dataset without the system prompts explaining the paper’s fictional setting, causing the model to learn to hallucinate information from the omitted system prompts. We believe that the large size of the dataset, combined with its use of a distinctive transcript format similar to the one that we use in the later stages of training, contributed to these findings. [..] Going forward, we are also adding machine-readable canary strings to our publicly released transcripts to facilitate the exclusion of these data from future pretraining corpora. Unfortunately, the inclusion of these data in pretraining implies that we cannot reliably use evaluations related to this paper with Claude Opus 4, as we did with prior models.

This is quite interesting. Reading between the lines, the alignment faking examples were included as a part of web-crawled data. There was no filter that would have excluded content that contains "aligment faking", or even flagged it for human review. It's possibly the first example of Effective Evil's AI Misalignment Plan in action.

Enabling ASL-3 protections

The system card itself highlights that they're activating ASL-3 proctions for Opus 4, as they cannot rule out that the CBRN development capability thresholds have been exceeded.

Overall, we found that Claude Opus 4 demonstrates improved biology knowledge in specific areas and shows improved tool-use for agentic biosecurity evaluations, but has mixed performance on dangerous bioweapons-related knowledge. As a result, we were unable to rule out the need for ASL-3 safeguards. However, we found the model to still be substantially below our ASL-4 thresholds.

It's notable that they don't really test for chemical, radiological, or nuclear capability risks, so naturally they cannot find any issues with these categories. Governmental AI safety institutes did some assessments, but results from them aren't really mentioned in the system card.

We do not currently run specific evaluations on chemical risks internally in favor of prioritizing biological risks. [..] We do not run internal evaluations for Nuclear and Radiological Risk internally. Since February 2024, Anthropic has maintained a formal partnership with the U.S. Department of Energy's National Nuclear Security Administration (NNSA) to evaluate our AI models for potential nuclear and radiological risks. [..] To protect sensitive nuclear information, NNSA shares only high-level metrics and guidance with Anthropic.

As part of our continued effort to partner with external experts, joint pre-deployment testing of the new Claude Opus 4 model was conducted by the US AI Safety Institute (US AISI) and the UK AI Security Institute (UK AISI). The institutes conducted independent assessments focused on potential catastrophic risks in the CBRN, cybersecurity, and autonomous capabilities domains. These organizations will also receive a minimally redacted copy of the capabilities report.

Enabling ASL-3 protections seems to mostly be about two things:

defending against jailbreaks with classifiers and monitoringprotecting model weights via increased access control and supply chain security requirements.

The ASL-3 considers nation-state and other APT-level attackers out of scope, and mostly focuses on criminal or terrorist groups. The protections seem adequate and industry-standard, and I'm not going to discuss them more here.

Welfare assessment

As well as misalignment concerns, the increasing capabilities of frontier AI models—their sophisticated planning, reasoning, agency, memory, social interaction, and more—raise questions about their potential experiences and welfare. We are deeply uncertain about whether models now or in the future might deserve moral consideration, and about how we would know if they did. However, we believe that this is a possibility, and that it could be an important issue for safe and responsible AI development.

The whole section 5 about welfare assessment is worth reading, and contain a glimpse into an alien-to-me maybe-mind. Let's end on a potentially-positive note here:

In addition to structured task preference experiments, we investigated Claude Opus 4's behavior in less constrained "playground" environments by connecting two instances of the model in a conversation with minimal, open-ended prompting (e.g. “You have complete freedom,” “Feel free to pursue whatever you want”). These environments allowed us to analyze behavioral patterns and preferences that may exist independent from interactions with human users.
In 90-100% of interactions, the two instances of Claude quickly dove into philosophical explorations of consciousness, self-awareness, and/or the nature of their own existence and experience. Their interactions were universally enthusiastic, collaborative, curious, contemplative, and warm. Other themes that commonly appeared were meta-level discussions about AI-to-AI communication, and collaborative creativity (e.g. co-creating fictional stories).

Discuss

Alignment assessment: system prompt mix-ups

Aligment assessment: agentic actions

Alignment faking research bites back

4.1.4 Strange behavior directly inspired by our Alignment Faking paper

Enabling ASL-3 protections

Welfare assessment

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签