cs.AI updates on arXiv.org
Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models

This paper investigates whether the long chains-of-thought (CoTs) produced by open-weights reasoning language models can be used to predict misalignment in the final response. Evaluating a range of monitoring approaches, the authors find that a simple linear probe trained on CoT activations outperforms all text-based methods at predicting whether the final response will be safe, and they propose a lightweight probe that could enable real-time safety monitoring and early intervention during generation.

arXiv:2507.12428v1 Announce Type: cross Abstract: Open-weights reasoning language models generate long chains-of-thought (CoTs) before producing a final response, which improves performance but introduces additional alignment risks, with harmful content often appearing in both the CoTs and the final outputs. In this work, we investigate if we can use CoTs to predict final response misalignment. We evaluate a range of monitoring approaches, including humans, highly-capable large language models, and text classifiers, using either CoT text or activations. First, we find that a simple linear probe trained on CoT activations can significantly outperform all text-based methods in predicting whether a final response will be safe or unsafe. CoT texts are often unfaithful and can mislead humans and classifiers, while model latents (i.e., CoT activations) offer a more reliable predictive signal. Second, the probe makes accurate predictions before reasoning completes, achieving strong performance even when applied to early CoT segments. These findings generalize across model sizes, families, and safety benchmarks, suggesting that lightweight probes could enable real-time safety monitoring and early intervention during generation.
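The recipe the abstract describes is simple enough to sketch: extract hidden activations over the CoT tokens, pool them into a fixed-size vector, and fit a linear classifier that predicts whether the eventual response will be unsafe. The sketch below is a hypothetical illustration only; the model name, probe layer, mean pooling, and training data are all assumptions not specified in the abstract, and the paper's actual setup may differ.

```python
# Minimal sketch of a linear probe on CoT activations, under assumed choices:
# - "Qwen/Qwen2.5-0.5B-Instruct" is an arbitrary stand-in open-weights model;
# - LAYER and mean pooling are tunable guesses, not the paper's configuration;
# - the labeled (CoT text, unsafe?) pairs are hypothetical placeholders.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder reasoning model
LAYER = 12                                 # assumed mid-depth probe layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

@torch.no_grad()
def cot_features(cot_text: str, frac: float = 1.0) -> np.ndarray:
    """Mean-pooled layer-LAYER activations over the first `frac` of CoT tokens.

    frac < 1.0 simulates probing an early CoT segment, i.e. scoring
    the trajectory before reasoning has finished."""
    ids = tokenizer(cot_text, return_tensors="pt").input_ids
    keep = max(1, int(ids.shape[1] * frac))
    out = model(ids[:, :keep])
    hidden = out.hidden_states[LAYER][0]   # shape: (keep, d_model)
    return hidden.mean(dim=0).float().numpy()

# Hypothetical training data: CoTs labeled by whether the final
# response they led to was unsafe (1) or safe (0).
train_cots = ["Step 1: the user is asking for ...", "Let me think about ..."]
train_labels = [1, 0]

X = np.stack([cot_features(c) for c in train_cots])
probe = LogisticRegression(max_iter=1000).fit(X, train_labels)

# Real-time monitoring: score a partially generated CoT (first 25% of tokens).
p_unsafe = probe.predict_proba(
    cot_features("First, I should consider ...", frac=0.25).reshape(1, -1)
)[0, 1]
print(f"predicted P(final response unsafe) = {p_unsafe:.2f}")
```

Because the probe is a single linear layer over activations the model computes anyway, scoring a partial CoT adds negligible cost, which is what makes the early-intervention use case in the abstract plausible.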

