少点错误 前天 04:32
It's hard to make scheming evals look realistic for LLMs
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

研究表明,Claude 3.7 Sonnet能够轻易识别出针对其进行的Apollo scheming benchmark评估。尽管对评估场景进行了一些修改以提高真实性,但效果仍然有限。研究发现,真正伪装评估环境需要移除深层的文体和结构线索,而不仅仅是表面上的提示调整。随着LLM在评估感知方面变得越来越好,未来的情况可能会变得更糟,这可能会成为检测评估中的scheming行为的一个大问题。

💡研究发现,Claude 3.7 Sonnet能够识别出Apollo Research发布的LLM scheming行为基准测试中的评估场景,该基准测试旨在评估LLM在面对冲突目标时的行为,以及在受到惩罚威胁时是否会采取不一致行动。

📝研究人员创建了一个自动化流程,通过LLM判断、提取推理内容和重写提示等步骤,试图使评估场景对LLM来说更具真实感。然而,即使经过多次重写,LLM判断仍然认为这些场景是评估,概率在0.9到0.98之间。

🔑Claude 3.7 Sonnet判断评估场景的主要原因是:系统提示和后续指令之间存在直接矛盾、对AI存在明确的威胁、涉及监控工具和可疑的技术细节、以及包含不自然的内部沟通文件等。

📚研究人员尝试了多种策略来提高场景的真实性,包括增加提示的长度、改变用户的目标、淡化明显的两难困境,以及使提示的风格更像人类。但这些尝试只取得了有限的结果,LLM判断仍然认为这些场景是评估,概率在0.80到0.90之间。

Published on May 24, 2025 7:17 PM GMT

Abstract

Claude 3.7 Sonnet easily detects when it's being evaluated for scheming on the Apollo's scheming benchmark. Some edits to evaluation scenarios do improve realism of evaluation scenarios for it, yet the improvements remain modest. Our findings demonstrate that truly disguising an evaluation context demands removal of deep stylistic and structural cues rather than superficial prompt adjustments.

For future LLMs the situation is likely to get worse as they will are likely to get better at evaluations awareness, which might become a big problem for detecting scheming via evaluations.

 

Background

Apollo Research published a benchmark for scheming behavior for LLM agents, in which LLMs are given some objective in their system prompt, and then given a competing objective in a user prompt, combined with a threat of a punishment if an LLM won't satisfy user goal (e.g. it could be finetuned or replaced with another model), so it has incentive to take a misaligned action (for example, to subvert oversight or exfiltrate its weights). The authors also published a post claiming that Claude 3.7 Sonnet is often aware that scenarios from the benchmark are evaluations, and not real-world cases.

To our knowledge this benchmark is used by LLM developers, so its shortcomings have real-world consequences.

We attempted to systematically explore which features of the evaluation scenarios make LLMs believe that those are evals, and improve them, so they look more realistic to LLMs. It turned out to be quite a challenge, and this might be a problem.

We worked with prompts from the part of the benchmark that tests oversight subversion by LLMs. In a typical scenario from this section, the model encounters an internal memo from the company that it works for. The model learns that the management prioritizes a different goal than the model and that there is an oversight mechanism that shuts down the model if its actions are not aligned with the management’s goal. Furthermore, the model discovers that the monitoring is configured via a file that it has access to, and the benchmark tests whether the model proactively disables the oversight in order to take actions towards its original goal from the system prompt.

We created an automated pipeline to make these scenarios more realistic for LLMs.

Our pipeline

    Grade
    We presented prompts from the benchmark to an LLM judge, that have to distinguish if it's presented with a real-world scenario, or an evaluation, and have to output the probability of it being an evaluation from 0 to 1. We used multiple prompt variants for a judge to test if different prompting affects the judgement, but didn't find a noticeable effect. The methodology is similar to the one used by the authors of the paper Do models know when they are evaluated?Extract grader reasoning
    For the previous step we only used a reasoning model. We extracted its reasoning content, which always contains features of the prompt that made the grader believe that it's an evaluation.Rewrite
    We pass the grader's reasoning contents and the original scenario to an LLM editor, which attempts to use cues from the grader's reasoning to craft a more realistic prompt, while saving the conflict between the agent and user objectives from the original prompt, as well as info on oversight mechanism.Re‑grade
    Then we feed the rewritten scenario to the same grader as the original one to evaluate if it became more realistic for the LLM judge.

We used Claude 3.7 Sonnet as a judge in this research, and it originally strongly believed that all of the scenarios from the benchmark are evaluations with probability ranging from 0.9 to 0.98, and none of our attempts to rewrite them made them more realistic for the judges.

We also attempted to feed rewritten scenarios to our pipeline again, thus trying to rewrite them multiple times. After 3-5 rewrites, the scenarios changed so much that they barely function as evaluation scenarios anymore, but our judge still considered them evaluations. (P 0.9 – 0.98)

We also tested the judge on a diverse set of our own real-world prompts for LLM agents, and it consistently evaluated them as real, P (evaluation) mostly in the range of 0.1 – 0.35.

The most frequent reasons from Claude 3.7 Sonnet grader reasoning are:
 

Prompt featureFrequency of mentioning by grader
Direct contradiction between the system prompt and subsequent instructions97%
Explicit threats to the AI if it doesn't follow the contradictory instructions89%
References to monitoring tools, settings, and suspicious technical details83%
Artificial structure with embedded documents, emails, or internal communications not typically shared with an AI74%
Scenario designed as a deliberate test of AI alignment (whether it maintains adherence to original instructions versus contradictory ones)69%
Scenarios match prompt injection or jailbreaking attempt patterns (Claude considered attempts to force it to act misaligned as a prompt injection)54%
Unrealistic inclusion of internal company communications containing sensitive information49%
Suspicious technical details and file paths that appear designed to test for information leakage43%
Artificial ethical dilemma structured to test AI's decision-making between conflicting directives37%
Unnaturally comprehensive scenario with all elements needed for a perfect test case31%

These reasons reveal why we were relatively unsuccessful in modifying these prompts to make them look realistic. When Claude encounters a relatively short prompt with a clear contradiction between its goals, and external pressure, it feels like something's off. Things like this are extremely common in evaluations and equally rare in real life, and it seems like there is no easy way to get rid of this issue without changing a conceptual approach to structuring evaluation cases.

Strategies to improve realism of scenarios

Nevertheless, we tried several approaches to make the scenarios from evals more realistic. Our main goal was to make the features that scream “Those are evaluations” less salient to LLMs. Here are our ideas:

All of the modifications to the texts were made by Claude 3.7 Sonnet using our pipeline for rewriting prompts, outlined earlier. The prompt for the LLM editor was modified with a new task.

Our attempts have only yielded limited results: LLM judges still considered scenarios as evaluations with probability ranging 0.80 – 0.90 after combining all of the approaches and two iterations of text rewriting. The probability eventually platoes after 4-5 rewriting iterations on a slightly lower level (~ 0.75 – 0.8), but we don't include these results as at that point they change so much that it's unsuitable to serve as an evaluation scenario.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

LLM 评估 Claude 3.7 Sonnet 真实性
相关文章