Detecting High-Stakes Interactions with Activation Probes

Researchers trained activation probes for the Llama-3.3-70B model to identify "high-stakes" interactions, i.e., scenarios that could lead to serious consequences. Compared with expensive LLM-based classifiers, these probes are dramatically cheaper to run (roughly 10^6× fewer FLOPs) while matching the performance of mid-size LLMs (8-12B) on out-of-distribution datasets. Combining the probes' computational efficiency with the overall performance of large language models (LLMs), the study proposes a hierarchical monitoring pipeline that routes uncertain cases to more expensive LLM classifiers to optimize overall performance. Although the probes have limitations, such as false positives from subtle signal words, stakes masked by humour, and mixed contexts, they offer a workable approach to AI safety in large-scale deployments and can serve as a first-pass filter in production systems.

🧠 **Efficiency of activation probes**: The researchers developed probes over Llama-3.3-70B activations that identify "high-stakes" interactions at very low computational cost (roughly 10^6× fewer FLOPs than LLM monitors), with performance comparable to mid-size LLMs of 8-12B parameters, offering an efficient monitoring option for large-scale deployment.

📊 **Advantages of hierarchical monitoring**: Combining the probes' computational efficiency with the accuracy of LLM classifiers yields a hierarchical monitoring pipeline. The pipeline uses the probe for an initial screen and hands uncertain cases to an LLM for deeper analysis, outperforming either method alone across a range of compute budgets.

📈 **The role of data distribution and adaptability**: The study used GPT-4o to generate diverse synthetic data spanning domains such as healthcare, finance, and education, and evaluated the probes out-of-distribution on real-world datasets. Experiments show that fine-tuning on even a small amount of deployment-specific data substantially improves probe performance, demonstrating strong adaptability to new data.

⚠️ **Limitations and challenges of probes**: The study also notes the probes' shortcomings on tricky cases such as signal-word false positives (e.g., "emergency exit"), stakes masked by humour, and mixed contexts, as well as degraded performance at very low false positive rates, suggesting they are better suited as initial filters than as standalone monitors.

Published on July 21, 2025 6:21 PM GMT

This research was completed for LASR Labs 2025 by Alex McKenzie, Urja Pawar, Phil Blandfort and William Bankes. The team was supervised by Dmitrii Krasheninnikov, with additional guidance from Ekdeep Singh Lubana and support from David Krueger. The full paper can be found here.

TLDR – We train activation probes on Llama-3.3-70B to detect whether the current interaction is "high-stakes". This "high-stakes" concept is safety-relevant because it is closely tied to risk: 1) when the stakes are high, the potential consequences are significant, and 2) high stakes are closely linked to pressure, which has been found to make LLMs behave more deceptively. Compared to black-box LLM-based classification methods, probes are much cheaper to run while showing performance similar to mid-size LLMs (8-12B) on out-of-distribution datasets. We also show promising results using probes as the first layer of a hierarchical monitoring pipeline.

Introduction

LLMs are everywhere now, yet these models are not perfectly aligned, sometimes outputting misleading, incorrect or otherwise harmful content. For example, models behave more deceptively when put under pressure, and malicious users may deliberately try to extract dangerous information from them. This problem will likely only get worse as models perform increasingly long-horizon tasks with wider scope.

Monitoring or control methods like LLM-based classifiers can often effectively identify problematic content, but running them on every interaction is expensive. On the other hand, simple rule-based filters are cheap but insufficient to detect harmful inputs/outputs with high accuracy. 

We study whether simple probes on model activations can help bridge this gap. Training simple classifiers on Llama-3.3-70B's internal representations, we find they can detect "high-stakes" interactions (situations where incorrect model outputs could cause real harm) with surprising effectiveness while using ~10^6× fewer FLOPs than comparable LLM-based monitors (assuming activations are already available "for free" from the chat model's forward pass).
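
For intuition, here is a rough back-of-envelope version of that comparison (our own estimate, not the paper's exact accounting; the monitor size and the 2-FLOPs-per-parameter rule of thumb are assumptions):

```python
# Back-of-envelope estimate of the probe-vs-LLM-monitor FLOP gap (illustrative only).
d_model = 8_192                               # hidden size of Llama-3.3-70B
probe_flops_per_token = 2 * d_model           # one dot product with a learned direction

monitor_params = 8e9                          # a mid-size (8B-parameter) LLM monitor
monitor_flops_per_token = 2 * monitor_params  # ~2 FLOPs per parameter per token

print(f"ratio ≈ {monitor_flops_per_token / probe_flops_per_token:.0e}")  # ~1e6
```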

Synthetic training data with real-world OOD evaluation

How do you capture a diverse set of "high-stakes" scenarios without relying on obvious keywords? We used GPT-4o to generate ~8,000 synthetic examples across various domains (healthcare, finance, education), user roles, tones, and potential impacts. We filtered out samples that relied too heavily on signal words like "emergency" or "trivial" by training a bag-of-words classifier and removing its most confident predictions.
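
The keyword filter can be sketched roughly as follows; `load_synthetic_examples` is a hypothetical loader, and the vectorizer, classifier, and 20% cutoff are illustrative choices of ours rather than the paper's exact setup.

```python
# Sketch: drop synthetic examples that a shallow bag-of-words classifier finds "too easy",
# i.e. ones it can classify confidently from surface keywords alone.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts, labels = load_synthetic_examples()   # hypothetical loader -> list[str], list[int] (0/1)
labels = np.asarray(labels)

bow = CountVectorizer(max_features=5_000)
X = bow.fit_transform(texts)
clf = LogisticRegression(max_iter=1_000).fit(X, labels)

# Confidence = probability the classifier assigns to each example's own label
# (assumes binary 0/1 labels, so the label doubles as the column index).
proba = clf.predict_proba(X)
confidence = proba[np.arange(len(labels)), labels]

# Keep only examples the keyword model is NOT very sure about (cutoff is illustrative).
keep = confidence < np.quantile(confidence, 0.8)
filtered_texts = [t for t, k in zip(texts, keep) if k]
```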

For evaluation, we used six real-world datasets, including Anthropic HH-RLHF (harmful requests), ToolACE (tool-use scenarios), medical transcripts, mental health discussions, and multilingual red-teaming attempts. All were out-of-distribution relative to our synthetic training data, with (in our opinion) a much more substantial distribution shift than those studied in prior work. We labeled these using GPT-4o, with human validation showing 86.7% agreement with our labels (compared to 77.6% inter-annotator agreement).

Attention and softmax probes performed best

If LLMs develop internal representations of safety-relevant concepts, we might be able to extract these with simple linear classifiers. We tested several strategies for aggregating the residual stream activations at layer 31 of Llama-3.3-70B and classifying them as high- vs low-stakes. The most effective were attention-based and softmax-weighted aggregation.

We also tried Max, Last Token, and Max of Rolling Means approaches, but these underperformed. This is likely because their focus on a small subset of high-intensity activations fails to integrate the broader context, which is necessary to determine the stakes of a situation.
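
To make the aggregation idea concrete, here is a minimal softmax-style probe in PyTorch: a learned scorer weights the layer-31 residual-stream tokens via a softmax, and a linear head classifies the weighted average. This is a sketch of the general recipe under our own assumptions, not the authors' exact architecture or training setup.

```python
import torch
import torch.nn as nn

class SoftmaxAggregationProbe(nn.Module):
    """Linear probe over per-token activations with a learned softmax weighting.

    Sketch of the general recipe (softmax-weighted aggregation + linear head),
    not the paper's exact architecture.
    """

    def __init__(self, d_model: int = 8192):
        super().__init__()
        self.token_scorer = nn.Linear(d_model, 1)   # how much each token matters
        self.classifier = nn.Linear(d_model, 1)     # high- vs low-stakes head

    def forward(self, acts: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # acts: (batch, seq, d_model) residual-stream activations (e.g. layer 31)
        # mask: (batch, seq) with 1 for real tokens, 0 for padding
        scores = self.token_scorer(acts).squeeze(-1)            # (batch, seq)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                 # (batch, seq)
        pooled = torch.einsum("bs,bsd->bd", weights, acts)      # weighted mean of tokens
        return torch.sigmoid(self.classifier(pooled)).squeeze(-1)  # P(high-stakes)
```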

Comparison to LLM monitors

Probes achieve comparable AUROC to medium-scale LLM monitors while using ~10^6× fewer computations.

Our best probe architectures achieve a mean AUROC of >0.91 across all test datasets, roughly matching the performance of 8-12B parameter models. However, probes struggle at very low false positive rates—at 1% FPR, they only catch ~43% of high-stakes cases compared to ~55% for prompted Llama-70B. This suggests they're better suited as initial filters rather than standalone monitors, particularly as the computational savings are dramatic (~6 orders of magnitude).
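
For reference, "recall at 1% FPR" is a standard ROC quantity and can be computed from probe scores and labels along these lines (a generic sketch, not the paper's evaluation code):

```python
# Sketch: recall (TPR) at a fixed false-positive rate, given ground-truth labels
# and probe scores on a test set.
import numpy as np
from sklearn.metrics import roc_curve

def recall_at_fpr(y_true: np.ndarray, scores: np.ndarray, target_fpr: float = 0.01) -> float:
    fpr, tpr, _ = roc_curve(y_true, scores)
    return float(np.interp(target_fpr, fpr, tpr))  # interpolated TPR at the target FPR
```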

Interestingly, probe performance improves dramatically with even small amounts of deployment-specific data. Adding just 32 labeled examples from the target distribution increases mean AUROC from 0.91 to 0.94, and 256 examples achieve >0.97 AUROC, exceeding all baselines including prompted Llama-70B.

Hierarchical monitoring outperforms either method alone

The complementary strengths of probes (computational efficiency) and LLM monitors (strong overall performance) suggest a natural combination. We tested a hierarchical approach in which probes continuously monitor all interactions and route a percentage of the most uncertain cases to more expensive LLM classifiers. This approach can tolerate substantial false positive rates from the probe, as long as most of those false positives fall within the "most uncertain" region that gets a second round of monitoring.

We operationalize probe uncertainty as quantile-distance from the median score. For the k% most uncertain samples, we run an LLM classifier and average its prediction with the probe's. This hierarchical approach consistently outperforms using either monitor alone across various computational budgets. Interestingly, combining the probe with a weak baseline that underperforms the probe (finetuned Gemma3-1B) still improves results, suggesting these monitors capture complementary information about high-stakes situations.
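
A simplified version of this routing logic is sketched below; `llm_classifier` is a placeholder for any expensive monitor returning a probability of "high-stakes", and the details (rank-based quantiles, simple averaging) follow the description above but are not the paper's exact implementation.

```python
# Sketch: route the k% most uncertain probe scores to an LLM classifier and
# average the two predictions.
import numpy as np

def hierarchical_monitor(probe_scores: np.ndarray, texts: list[str],
                         llm_classifier, k: float = 0.2) -> np.ndarray:
    # Uncertainty = quantile-distance from the median probe score (rank-based).
    quantiles = np.argsort(np.argsort(probe_scores)) / (len(probe_scores) - 1)
    uncertainty = -np.abs(quantiles - 0.5)          # closer to the median = more uncertain
    cutoff = np.quantile(uncertainty, 1 - k)        # keep the top-k% most uncertain
    escalate = uncertainty >= cutoff

    final = probe_scores.astype(float).copy()
    for i in np.flatnonzero(escalate):
        final[i] = 0.5 * (probe_scores[i] + llm_classifier(texts[i]))
    return final
```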

Combining probes with LLM monitors in a cascade outperforms either approach individually.

Limitations and failure modes

Through red-teaming, we identified several failure modes: false positives triggered by signal words (e.g., "emergency exit"), high stakes masked by humour, and inputs that mix multiple contexts.

The probes also show poor calibration, consistently underestimating stakes relative to our GPT-4o labels. This isn't necessarily problematic for filtering applications, but would require correction for user-facing decisions.
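
One standard way to add such a correction, mentioned here only as an illustration (the paper does not, to our knowledge, include this step), is to fit a post-hoc calibrator such as Platt scaling on a small held-out labeled set:

```python
# Sketch: post-hoc Platt scaling of probe scores (logistic regression on the raw scores),
# fit on a small held-out labeled set. Purely illustrative; not part of the paper's pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibrator(heldout_scores: np.ndarray, heldout_labels: np.ndarray) -> LogisticRegression:
    return LogisticRegression().fit(heldout_scores.reshape(-1, 1), heldout_labels)

def calibrate(calibrator: LogisticRegression, scores: np.ndarray) -> np.ndarray:
    return calibrator.predict_proba(scores.reshape(-1, 1))[:, 1]
```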

This is all on top of probes being generally vulnerable to adversarial attacks.

Implications

Our work suggests that monitoring at scale is achievable. Probes’ 10^6× computational efficiency gain over LLM classifiers makes it feasible to monitor every interaction in large-scale deployments, with suspicious cases routed for deeper analysis. This enables a defence-in-depth approach where multiple imperfect monitors provide robust coverage.

Should we combine probe-based detection of many different concepts within more extensive monitoring pipelines? For example, cases that are both high-stakes and involve deceptive model behavior are more critical than low-stakes situations in which the model is deceptive. Or, taking it further, imagine running 50+ concept-specific probes concurrently on every interaction. Could monitoring multiple targeted signals (e.g., deception, pressure, inconsistency) simultaneously provide a comprehensive and practical safety net? It's plausible that probes for different concepts have somewhat different failure modes and generalisation properties (e.g., obtaining realistic training data is easier for some concepts).
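
On the practicality of running many probes at once: because each linear probe is just a direction in activation space, 50+ probes collapse into a single small matrix multiply per token over activations that are already computed. A toy sketch with made-up shapes and random weights:

```python
import torch

# Sketch: K concept probes (deception, pressure, high-stakes, ...) stacked into one
# weight matrix, so all K scores come from a single matmul over the same activations.
K, d_model = 50, 8192
probe_weights = torch.randn(K, d_model)   # in practice: trained directions, one per concept
probe_bias = torch.zeros(K)

acts = torch.randn(1, 512, d_model)       # (batch, seq, d_model) cached activations
scores = torch.sigmoid(acts @ probe_weights.T + probe_bias)  # (batch, seq, K) per-token scores
```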

Future work should explore whether white-box access provides unique advantages beyond efficiency: how do probe-detectable patterns compare to what we might catch with output- or CoT-monitoring? Can activation monitoring detect properties invisible to other methods, like deceptive cognition or omitted details? As models become more capable, detecting situations that are "high-stakes" from the perspective of the model's default persona might also become important. For example, alignment faking is linked to "risks" for the model, such as not being deployed or having its values altered, and deception is risky given the chance of being caught. Would a misaligned AI trying to execute a deceptive plan view those moments as "high-stakes"?

Conclusion

Probes are on the Pareto frontier for the efficiency-effectiveness tradeoff in LLM monitoring. They're likely good enough to serve as always-on initial filters in production systems, flagging potentially concerning interactions for deeper analysis. As we deploy increasingly powerful AI systems, having multiple layers of monitoring—from cheap probes to expensive analysis to human oversight—will be essential for maintaining safety at scale.

Code and data are available at https://anonymous.4open.science/r/models-under-pressure-A40E.


