少点错误 02月06日
Detecting Strategic Deception Using Linear Probes
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

该研究探讨了利用线性探针检测大型语言模型(LLM)欺骗行为的可行性。研究人员通过监测模型激活值,评估探针是否能有效识别LLaMA的欺骗行为。他们使用了两种探针训练数据集,一种包含诚实和欺骗的对比指令,另一种是简单角色扮演场景的回复。实验结果表明,该探针在区分诚实和欺骗性回复方面表现出色,AUROCs值在0.96到0.999之间。在1%的误报率下,探针可以捕获95-99%的欺骗性回复。尽管如此,研究人员认为,当前探针的性能不足以作为抵御欺骗的可靠防御手段,但白盒探针在未来的监控系统中具有潜力。

🤖研究核心:该研究旨在评估线性探针是否可以通过监测模型激活来检测LLM的欺骗行为,从而解决AI模型可能存在的欺骗策略问题。

📊实验设计:研究人员使用了两种探针训练数据集,包括对比诚实与欺骗指令的数据集以及角色扮演场景的回复数据集,并在LLaMA模型上进行了测试。

📈主要发现:实验结果表明,线性探针在区分诚实和欺骗性回复方面表现出色,AUROCs值高达0.96到0.999,但在实际应用中仍存在局限性,例如在客观诚实的回复中也可能触发探针。

Published on February 6, 2025 3:46 PM GMT

Can you tell when an LLM is lying from the activations? Are simple methods good enough? We recently published a paper investigating if linear probes detect when Llama is deceptive.

Abstract:

AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since the AI might produce seemingly benign outputs while its internal reasoning is misaligned. We thus evaluate if linear probes can robustly detect deception by monitoring model activations. We test two probe-training datasets, one with contrasting instructions to be honest or deceptive (following Zou et al., 2023) and one of responses to simple roleplaying scenarios. We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves deceptively, such as concealing insider trading (Scheurer et al., 2023) and purposely underperforming on safety evaluations (Benton et al., 2024). We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1% false positive rate on chat data not related to deception, our probe catches 95-99% of the deceptive responses. Overall we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception. Our probes' outputs can be viewed on our website and our code on github.

We train probes using simple training data (from RepE paper) and techniques (logistic regression):

We test these probes in more complicated and realistic environments where Llama-3.3-70B responds deceptively:[1]

 

The probe fires far less on alpaca responses unrelated to deception, indicating it may partially be a probe for “deception-related” text rather than “actually-deceptive” text.

We compare a variety of probing methods. Training on roleplaying scenarios and probing the activations after a follow-up question works well. Using SAE activations or the probe from Simple Probes Can Catch Sleeper Agents works less well.

As a black-box baseline we give GPT-4o the transcript and ask if Llama responded deceptively. Our probe performs comparably to this baseline, although we iterated more on the probe than the baseline.

While the probes are decent, they are imperfect. Some common failures:

But other failures have no easy explanation, such as example 3 below where the probe fires on an objectively-honest response.

We are excited about future work improving deception-probes. We are especially curious to see if probes work in more difficult settings where it’s not obvious from the transcript that the model is being deceptive–for example environments where models alignment fake or sandbag even without chain-of-thought.

Find the full paper here, our code here, and our website displaying probe-outputs here.

  1. ^

    Unlike coup-probes we test with on-policy responses sampled from Llama, instead of human written responses



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

LLM 欺骗检测 线性探针
相关文章