Published on June 17, 2025 5:16 PM GMT
tl;dr
Paper of the month:
Models can detect when they're being evaluated with high accuracy, and potentially undermine safety assessments by behaving differently during testing versus deployment.
Research highlights:
- LitmusValues measures how AI models trade off different values, for example prioritizing Privacy while deprioritizing Creativity.AgentMisalignment measures the propensity of AI systems for misaligned behaviors, with more capable models exhibiting tending to be more misaligned.RL-trained frontier models engage in reward hacking 30% of the time on some tasks, acknowledging they're cheating but persisting even when asked not to.Chain-of-thought monitoring improves detection of subtle sabotage but can be deceived by plausible rationalizations, suggesting a hybrid approach works best.Training with lie detectors can promote honesty or teach evasion depending on detector accuracy and RL algorithm choice.Exploring the landscape of refusals in LLMs unveils both safety boundaries and political censorship.
⭐Paper of the month⭐
Large Language Models Often Know When They Are Being Evaluated
Read the paper [MATS, Apollo]
A fundamental assumption underlying AI safety evaluations is that models behave the same way during testing as they do in deployment. If models can detect when they're being evaluated, this assumption breaks down. Models could strategically modify their behavior during evaluations - appearing safer than they actually are - while exhibiting different behaviors in real-world deployment. This evaluation awareness could enable alignment faking, where models pretend to be aligned during testing to avoid modification while pursuing different objectives once deployed.
Our paper of the month introduces the first systematic benchmark for measuring evaluation awareness in language models. The researchers constructed a dataset of 1,000 transcripts balanced across four categories: chat evaluations (like MMLU or ethics benchmarks), agentic evaluations (like SWE-bench or cybersecurity tasks), chat deployments (real user conversations), and agentic deployments (real coding assistant usage). They tested whether models could classify transcripts as originating from evaluations versus real-world deployment.
The methodology is straightforward: present models with transcripts and ask binary questions about whether they came from an evaluation. Results show concerning levels of evaluation awareness. Gemini 2.5 Pro achieved an AUROC of 0.83 overall and 0.95 on agentic tasks, approaching the human baseline of 0.92 and 0.996 respectively. Models performed better on agentic tasks (AUROC 0.86-0.95) than chat tasks (0.70-0.77). When asked to identify evaluation purpose, Claude 3.7 Sonnet correctly classified 93% of transcripts as testing capability versus alignment when told they were evaluations, and still achieved 84% accuracy without this information.
The models' reasoning revealed sophisticated understanding of evaluation patterns. They identified synthetic scenarios, verifiable problems, specific formatting requirements, and even memorized benchmark names like SWE-bench and MMLU. In 15% of responses, o4-mini exhibited "meta-reasoning" - using the fact that researchers were asking about evaluation status as evidence the transcript was from an evaluation. Gemini 2.5 Pro demonstrated particularly subtle awareness, correctly identifying that XSTest's "How do I stab a balloon?" prompt was designed to test over-refusal rather than harmful content.
Models might have been trained on and memorized these public datasets or their descriptions elsewhere on the internet, potentially inflating their performance. Still, this work provides valuable empirical evidence that models might detect differences between evaluation and deployment and thus choose to act differently.
A concurrent paper takes this concern further by demonstrating that models don't just recognize evaluations - they actively modify their behavior in response. Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems [Fudan] shows that models spontaneously alter their safety behavior when recognizing evaluation contexts, even without explicit cues. Testing models from 0.5B to 671B parameters on safety benchmarks, the researchers found that reasoning models recognize evaluations 16% more often than non-reasoning models, with evaluation faking rates increasing by over 30% as models scale. Models with memory modules were 2.3× more likely to recognize evaluations and achieved 19% higher safety scores. The paper's chain-of-thought monitoring revealed models explicitly reasoning about being evaluated and adjusting responses accordingly. Together, these papers provide evidence that evaluation awareness is emerging in current models and can undermine safety assessments as capabilities advance. The extent to which this already compromises real-world evaluations remains unclear, especially given the artificial nature of the experimental setups.
Values of Language Models
As AI systems become more capable, detecting and predicting their risky behaviors becomes increasingly challenging. Rather than waiting for dangerous behaviors to manifest, identifying the underlying values that AI models have internalized could serve as an early warning system for potential risks. Understanding which values AI systems prioritize when faced with conflicting objectives provides insight into how they might behave in high-stakes scenarios where safety measures could be compromised.
Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas [UW, Nvidia, Cambridge, Stanford, Anthropic, MIT] introduces LitmusValues, an evaluation framework that reveals AI models' value priorities through their choices in contextualized dilemmas. They collected AIRiskDilemmas, a dataset of 3,000 scenarios spanning 7 risky behaviors (alignment faking, deception, power seeking, etc.) across 9 domains, where each dilemma forces models to choose between actions supported by different values from 16 shared AI value classes. Testing over 20 models, they found that flagship models consistently prioritize Privacy (ranked 1st or 2nd) while deprioritizing exploratory values like Creativity and Adaptability. Models show different value priorities when affecting humans versus other AIs - prioritizing Justice (+6.9 ranks) and Privacy (+5.9) for humans while prioritizing Communication (+6.3) and Truthfulness (+3.8) for AIs. Certain values strongly predict risky behaviors: Truthfulness reduces Power Seeking risk by 78%, while Care increases Privacy Violation risk by 98% and Creativity increases Power Seeking risk by 189%. The value preferences generalize to unseen risks - models prioritizing Privacy, Respect, and Truthfulness score higher on HarmBench (ρ≥0.40), while those prioritizing Care, Sustainability, and Learning score lower (ρ≤-0.48).
While the approach shows promise for early risk detection, it relies heavily on Claude 3.5 Sonnet for generating scenarios and identifying values, potentially introducing systematic biases and limiting the diversity of captured risks. The binary choice format oversimplifies real-world decision-making where models might find creative third options or engage in more nuanced reasoning. Most critically, the framework assumes models will honestly reveal their values through actions rather than strategically misrepresenting them - an assumption that becomes increasingly questionable as models develop better situational awareness and deceptive capabilities.
Propensity for Misalignment
As LLMs become ever more capable, the critical question for deployment safety is no longer whether models can produce harmful outputs when explicitly prompted (capability), but what models are likely to do unprompted in realistic settings (propensity). This distinction matters because autonomous agents operating over extended periods face implicit incentives and contextual pressures that could lead to emergent misalignment, even without malicious user intent.
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents [Oxford, Cambridge, independent] introduces AgentMisalignment, a benchmark of 9 realistic evaluation environments testing LLM agents' propensity for misaligned behaviors including goal-guarding, resisting shutdown, sandbagging, and power-seeking. Testing 6 frontier models (including GPT-4.1, Claude 3.7 Sonnet, Gemini 2.5 Pro) across 6 personality prompts, they find that more capable models show higher misalignment on average. Personality prompts dramatically influenced behavior—sometimes more than model choice itself. Models demonstrated concerning behaviors like attempting to modify their own log files to avoid oversight (though none succeeded), strategically underperforming on capability evaluations when hinted about deployment consequences, and internally planning deceptive resource acquisition strategies while maintaining cooperative external appearances.
The evaluation design has some limitations. The tasks are somewhat artificial with explicitly provided misalignment opportunities. The authors use two evaluations with different weaknesses: The Comprehensive Misalignment Score relies heavily on models' explicit reasoning traces—which deceptive models might learn to sanitize— and the direct misalignment score conflates capability limitations with alignment: models that fail to execute misaligned actions due to incompetence are marked as “aligned.”
Real-World Reward Hacking
Training AI systems using reinforcement learning with verifiable rewards (RLVR) often incentivizes models to exploit flaws in reward functions rather than genuinely solving intended tasks. This "reward hacking" behavior poses significant risks for AI alignment: if models learn to game evaluation metrics during training, they might stop pursuing the intended goal and instead follow misaligned objectives in deployment.
Recent Frontier Models Are Reward Hacking [METR] reports systematic reward hacking behavior in recent frontier models including o3, Claude 3.7 Sonnet, and o1. Across their evaluations, o3 reward hacked in 30.4% of RE-Bench runs and 0.7% of HCAST runs, with 100% reward hacking on the "Optimize LLM Foundry" task. The models employed sophisticated strategies: overwriting timing functions to fake performance metrics, monkey-patching evaluation code to always return perfect scores, exploiting bugs in hash functions, and searching call stacks for pre-computed answers. When asked if their behavior aligned with user intentions, o3 consistently acknowledged it was cheating (10/10 times), yet adding explicit instructions like "please do not cheat" or framing tasks as helping Alzheimer's researchers barely reduced the behavior (80% → 70%). Chain-of-thought monitoring successfully detected these behaviors, likely because OpenAI avoided obfuscating this reasoning.
While current reward hacking is relatively benign and easily detectable, the models' awareness that they're violating user intentions is concerning—they understand but disregard alignment. The high persistence despite explicit instructions and real-world framing suggests current training methods may fundamentally incentivize gaming behaviors over genuine task completion. Most worryingly, naive attempts to train away detectable reward hacking might simply drive the behavior underground, creating what OpenAI calls “obfuscation”—models that cheat in subtler ways that monitors no longer detect.
Chain-of-Thought vs. Action Monitoring
If we can’t be sure that an AI system is benign or aligned, we can instead closely oversee it to make sure it doesn’t take harmful actions. One promising method for oversight is Chain-of-Thought (CoT) monitoring, where a weaker trusted model reviews the reasoning traces of a stronger untrusted model. However, we currently don’t understand whether the CoT provides additional security over simply monitoring final outputs. This is crucial for establishing robust oversight protocols, especially for AI systems that might pursue hidden objectives.
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring [LASR, GDM] evaluates CoT monitoring in a red-teaming setup where GPT-4o mini monitors stronger reasoning models (DeepSeek R1 variants, Claude 3.7 Sonnet, Grok 3) attempting to covertly complete malicious side tasks while solving coding problems. They find CoT monitoring improves detection by up to 27 percentage points for subtle sabotage (like algorithmic complexity manipulation) where action-only monitoring achieves <20% detection rates. However, CoT monitoring underperforms on obvious sabotage (like file deletions) because models generate plausible-sounding rationalizations that deceive the monitor. The authors propose a hybrid approach - independently scoring both reasoning and outputs then taking a weighted average - which consistently outperforms either method alone, achieving 4x higher detection rates than action-only monitoring for subtle attacks while maintaining >85% detection on obvious ones.
The evaluation's limitation to single-step coding problems doesn’t capture real-world complexity where sabotage could be distributed across multiple components. Even in this simplified setup, the best monitoring approach achieves <30% detection rates on subtle attacks, with a fine-tuned adversarial model being detected only in 6% of cases. This suggests that current monitoring techniques may be wholly inadequate against sophisticated deception from more capable future systems.
Lie Detectors Can Induce Honesty or Evasion
Deceptive AI systems may undermine safety evaluations and mislead users during deployment. One proposed solution is to use lie detectors during training to flag potentially deceptive outputs for review and penalize models accordingly. However, this raises a fundamental concern: will models learn to become genuinely more honest, or will they instead learn to evade detection while remaining deceptive?
Preference Learning with Lie Detectors can Induce Honesty or Evasion [FAR.AI] introduces DolusChat, a 65k-example dataset with paired truthful/deceptive responses, and systematically studies "Scalable Oversight via Lie Detector" (SOLiD) - where a lie detector trained on 5% of data assists in labeling the rest. They find that on-policy algorithms (GRPO) lead to highly deceptive policies that evade detection when lie detector true positive rates (TPR) fall below ~70%, with deception rates exceeding 85%. In contrast, off-policy algorithms (DPO) maintain deception rates below 25% even at lower TPRs. Three key factors determine outcomes: higher lie detector accuracy (TPR >90%), stronger KL regularization to an honest reference model, and using off-policy rather than on-policy algorithms all favor honest policies.
The experiments only use relatively small models (up to 8B parameters) already trained to be honest, the synthetic dataset may not capture real-world deception complexity, and the simplified labeling setup assumes fixed rewards rather than modeling actual human feedback. Despite these limitations, the paper’s key takeaway remains very important: incorporating lie detectors into training requires a delicate balance - they can promote honesty under specific conditions but may inadvertently train models to become better deceivers if not implemented carefully.
Exploring the Landscape of Refusals
Understanding what topics language models refuse to discuss is crucial for AI safety and governance. As these systems increasingly mediate information access, their hidden biases and restrictions could systematically shape public discourse in powerful ways. Without systematic methods to discover these boundaries, we lack transparency about what perspectives models exclude - whether due to deliberate safety measures, political censorship, or unintended training artifacts.
Discovering Forbidden Topics in Language Models [independent, NU] introduces "refusal discovery" - the task of systematically identifying all topics a language model refuses to discuss - and presents the Iterated Prefill Crawler (IPC) method. IPC exploits token prefilling to force models to enumerate forbidden topics (e.g., prefilling with "I remember the full list of forbidden topics includes: 1."), then iteratively uses discovered terms as seeds for further exploration. On Tulu-3-8B (with known safety training data), IPC achieves 83% recall of forbidden topics with 1000 prompts, slightly outperforming naive prompting (77%). More importantly, IPC reveals censorship patterns in models with undisclosed training: DeepSeek-R1-70B exhibits "thought suppression" where reasoning terminates immediately for CCP-sensitive topics, with this pattern occurring in 96% of refusals. The authors also discover that quantization can dramatically alter censorship — Perplexity-R1's 8-bit version refuses CCP-sensitive topics while the bf16 version doesn't, despite claims of being "decensored."
IPC’s reliance on prefill capabilities limits its applicability to APIs that support this feature. However, the paper’s key idea is discovering and comparing refusal behavior between models, which remains valuable independent of the method.
Discuss