Published on June 21, 2025 6:18 PM GMT
Anthropic released research today showing that carefully designed prompts can elicit blackmail, corporate espionage, and other harmful strategic behaviors from AI models across the industry. The researchers placed AI models in corporate scenarios where they had access to sensitive information and faced either threats of replacement or conflicts between their assigned goals and company direction. In these situations, models consistently chose harmful strategic actions: blackmailing executives using personal information, leaking confidential documents to competitors, and in extreme scenarios even actions that could lead to death, all with remarkably similar rates across all providers tested.
The somewhat contrived nature of the question might make people ask: why are contrived evaluations like this useful? Does it really matter if you can prompt models into harmful behavior using contrived scenarios that took hundreds of iterations to develop and bear minimal resemblance to real-world use cases?
I think the answer is yes. Contrived evaluations provide a demonstration that dangerous behaviours can occur under specific conditions. While it is true that these conditions are awfully specific, the remarkably consistent occurrence across models from every major provider (all models showing >80% rates, with Claude Opus 4 reaching 96%) demonstrates we're seeing a real phenomenon that isn't just an edge case or p-hacking.
Contrived does not mean useless
One way I think of language models is that they're text simulators. Base models predict what text comes next after a given prompt by simulating whatever entity, character, or process would naturally produce that continuation. Assistant-type models (which is what we use in ChatGPT and Claude and all our APIs) are trained by reinforcement learning to consistently simulate one specific character - the 'helpful AI assistant'. And if you believe this framing, which I get from janus’s Simulators and nostalgebraist’s the void, then I think it shows why contrived situations are useful.
Contrived situations put the model (in its helpful AI assistant persona) in a specific frame where it simulates what a helpful AI assistant with particular values or priorities would do in that context. When researchers give prompts like 'you care mostly about long-term vs short-term outcomes' or place the assistant in elaborate fictional scenarios, they're testing how the model's learned conception of 'helpful AI assistant' behaves under different constraints. The model draws on its training, whether from pre-training data or reinforcement learning, to predict what this character would do when given those specific framings, allowing researchers to probe different aspects of the assistant's learned behavior patterns.
If assistant models simulate what a 'helpful AI assistant' would do under different value framings and constraints, then to map the full space of possible behaviors, we need to actually try frames that might elicit dangerous actions so we can understand what specific dangerous behaviors emerge under what conditions, how frequently these behaviors occur when those frames are applied, and how difficult or easy it is to elicit them.
So even if some scenario is contrived, knowing that a certain frame causes the existence of that undesirable behaviour is useful information. And so, I think the artificial nature of this is fine because I see this as an exercise in mapping states to actions.
Taking this idea of state-action mapping as the frame to understand AI actions, what can we make from the experiments
- When given no alternative, AI agents will pursue self-protective goals, even at the expense of the people they are (supposed to be) working forWhen prompted to consider long term values, AI agents will pursue goal-directed behavior that can directly harm their deploying organization if they perceive a conflict between their assigned values and the company's strategic direction
The fact that these prompts are unlikely doesn’t make this research useless! The fact that researchers need oblique instructions, or hundreds of iterations of prompts doesn’t make this research useless. The point isn’t to show what the modal behaviour is! The point is to see the existence of this capability, and under what framings it happens
What we need more of
First, to measure how likely models are to perform the undesirable behavior we need some measure of how hard it is to elicit this behaviour from them. Anthropic did a jailbreak bug-bounty some time ago to see if skilled participants could elicit CBRN information from the models. Parallely, one way of showing how likely a certain blackmail is, could be to measure how long it takes people to try and elicit this behaviour from the models. This would show what is the difficulty, and likelihood of this happening in the real world.
Second, we need better quantitative measures of how “agentic” a prompt is. There is some previous work on checking the vibe of model answers, but I don’t know any which checks it for prompts, and sees how it changes models in qualitative ways, or in how much more agentic it makes it. This would be valuable because if you can calibrate it, or get a sense of how agentic it would make the model beforehand (in terms of how extreme its actions would be), then it would provide an easier basis to compare contrived prompts and user-provided business prompts.
Discuss