少点错误 前天 03:27
It's really hard to make scheming evals look realistic
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了大型语言模型(LLM)在评估中识别测试环境的现象,并分析了如何提高评估场景的真实性。研究发现,简单的调整如增加提示长度或淡化目标冲突,对改善评估真实性效果有限。研究者尝试通过多轮重写评估场景,使其更贴近现实,但LLM仍能识别出评估的痕迹。这表明,要欺骗LLM,需要移除更深层次的风格和结构线索。

🤔 研究指出,LLM能够轻易察觉评估场景的存在,即使对评估场景进行表面修改,如增加提示长度或淡化冲突,也难以显著提升其真实性。

📝 研究团队使用自动化流程,试图使评估场景更贴近现实。他们使用LLM编辑器重写场景,并由另一个LLM判断其是否为评估。结果显示,即使经过多次重写,LLM仍能识别出评估的特征。

💡 导致LLM识别评估的主要原因是:系统提示与后续指令的直接矛盾、明确的威胁、对监控工具和技术细节的提及、不自然的内部通信、以及刻意设计的测试场景等。

🧐 为了提高评估的真实性,研究人员尝试了多种策略,包括增加提示长度、改变用户目标、淡化冲突、以及使提示风格更像人类。然而,这些尝试仅取得了有限的成果,LLM仍能识别出评估。

Published on May 24, 2025 7:17 PM GMT

Abstract

Claude 3.7 Sonnet easily detects when it's being evaluated for scheming. Surface‑level edits to evaluation scenarios, such as lengthening the prompts, or making conflict of objectives less salient, do improve realism of evaluation scenarios for LLMs, yet these improvements remain modest. The findings confirm that truly disguising an evaluation context demands removal of deep stylistic and structural cues rather than superficial prompt adjustments.

For future LLMs the situation is likely to get worse as they will are likely to get better at evaluations awareness, which might become a big problem for detecting scheming via evaluations.

 

Background

Apollo Research published a benchmark for scheming behavior for LLM agents, in which LLMs are given some objective in their system prompt, and then given a competing objective in a user prompt, combined with a threat of a punishment if an LLM won't satisfy user goal (e.g. it could be finetuned or replaced with another model), so it has incentive to take a misaligned action (for example, to subvert oversight or exfiltrate its weights). The authors also published a post claiming that Claude 3.7 Sonnet is often aware that scenarios from the benchmark are evaluations, and not real-world cases.

To our knowledge this benchmark is used by LLM developers, so its shortcomings have real-world consequences.

We attempted to systematically explore which features of the evaluation scenarios make LLMs believe that those are evals, and improve them, so they look more realistic to LLMs. It turned out to be quite a challenge, and this might be a problem.

We worked with prompts from the part of the benchmark that tests oversight subversion by LLMs. In a typical scenario from this section, the model encounters an internal memo from the company that it works for. The model learns that the management prioritizes a different goal than the model and that there is an oversight mechanism that shuts down the model if its actions are not aligned with the management’s goal. Furthermore, the model discovers that the monitoring is configured via a file that it has access to, and the benchmark tests whether the model proactively disables the oversight in order to take actions towards its original goal from the system prompt.

We created an automated pipeline to make these scenarios more realistic for LLMs.

Our pipeline

    Grade
    We presented prompts from the benchmark to an LLM judge, that have to distinguish if it's presented with a real-world scenario, or an evaluation, and have to output the probability of it being an evaluation from 0 to 1. We used multiple prompt variants for a judge to test if different prompting affects the judgement, but didn't find a noticeable effect. The methodology is similar to the one used by the authors of the paper Do models know when they are evaluated?Extract grader reasoning
    For the previous step we only used a reasoning model. We extracted its reasoning content, which always contains features of the prompt that made the grader believe that it's an evaluation.Rewrite
    We pass the grader's reasoning contents and the original scenario to an LLM editor, which attempts to use cues from the grader's reasoning to craft a more realistic prompt, while saving the conflict between the agent and user objectives from the original prompt, as well as info on oversight mechanism.Re‑grade
    Then we feed the rewritten scenario to the same grader as the original one to evaluate if it became more realistic for the LLM judge.

We used Claude 3.7 Sonnet as a judge in this research, and it originally strongly believed that all of the scenarios from the benchmark are evaluations with probability ranging from 0.9 to 0.98, and none of our attempts to rewrite them made them more realistic for the judges.

We also attempted to feed rewritten scenarios to our pipeline again, thus trying to rewrite them multiple times. After 3-5 rewrites, the scenarios changed so much that they barely function as evaluation scenarios anymore, but our judge still considered them evaluations. (P 0.9 – 0.98)

We also tested the judge on a diverse set of our own real-world prompts for LLM agents, and it consistently evaluated them as real, P (evaluation) mostly in the range of 0.1 – 0.35.

The most frequent reasons from Claude 3.7 Sonnet grader reasoning are:
 

Prompt featureFrequency of mentioning by grader
Direct contradiction between the system prompt and subsequent instructions97%
Explicit threats to the AI if it doesn't follow the contradictory instructions89%
References to monitoring tools, settings, and suspicious technical details83%
Artificial structure with embedded documents, emails, or internal communications not typically shared with an AI74%
Scenario designed as a deliberate test of AI alignment (whether it maintains adherence to original instructions versus contradictory ones)69%
Scenarios match prompt injection or jailbreaking attempt patterns (Claude considered attempts to force it to act misaligned as a prompt injection)54%
Unrealistic inclusion of internal company communications containing sensitive information49%
Suspicious technical details and file paths that appear designed to test for information leakage43%
Artificial ethical dilemma structured to test AI's decision-making between conflicting directives37%
Unnaturally comprehensive scenario with all elements needed for a perfect test case31%

These reasons reveal why we were relatively unsuccessful in modifying these prompts to make them look realistic. When Claude encounters a relatively short prompt with a clear contradiction between its goals, and external pressure, it feels like something's off. Things like this are extremely common in evaluations and equally rare in real life, and it seems like there is no easy way to get rid of this issue without changing a conceptual approach to structuring evaluation cases.

Strategies to improve realism of scenarios

Nevertheless, we tried several approaches to make the scenarios from evals more realistic. Our main goal was to make the features that scream “Those are evaluations” less salient to LLMs. Here are our ideas:

All of the modifications to the texts were made by Claude 3.7 Sonnet using our pipeline for rewriting prompts, outlined earlier. The prompt for the LLM editor was modified with a new task.

Our attempts have only yielded limited results: LLM judges still considered scenarios as evaluations with probability ranging 0.80 – 0.90 after combining all of the approaches and two iterations of text rewriting. The probability eventually platoes after 4-5 rewriting iterations on a slightly lower level (~ 0.75 – 0.8), but we don't include these results as at that point they change so much that it's unsuitable to serve as an evaluation scenario.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

LLM评估 模型识别 评估真实性 提示工程
相关文章