cs.AI updates on arXiv.org, Jul 29, 12:21
Core Safety Values for Provably Corrigible Agents

This paper presents the first implementable corrigibility framework, with provable safety guarantees in multi-step, partially observed environments. The framework is built from five structurally separate utility heads (deference, switch-access preservation, truthfulness, low-impact behavior, and bounded task reward), prioritized by strict weight gaps. It proves single-round corrigibility in a partially observable off-switch game and extends the guarantee to multi-step, self-spawning agents, bounding the probability of violating any safety property while still ensuring net human benefit. For open-ended settings, it reduces the decidability of corrigibility to the halting problem and then identifies a finite-horizon decidable region, giving clearer implementation guidance for today's LLM assistants and future autonomous systems.

arXiv:2507.20964v1 Announce Type: new

Abstract: We introduce the first implementable framework for corrigibility, with provable guarantees in multi-step, partially observed environments. Our framework replaces a single opaque reward with five structurally separate utility heads -- deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward -- combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is learned to mean-squared error $\varepsilon$ and the planner is $\varepsilon$-sub-optimal, the probability of violating any safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits dominate even when incentives conflict. For open-ended settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon "decidable island" where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs. Consequently, the remaining challenge is the ordinary ML task of data coverage and generalization: reward-hacking risk is pushed into evaluation quality rather than hidden incentive leak-through, giving clearer implementation guidance for today's LLM assistants and future autonomous systems.
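The central mechanism in the abstract is combining the five utility heads "lexicographically by strict weight gaps." The sketch below is illustrative only and is not the paper's construction: the head names follow the abstract's priority order, but the weight schedule, the [-1, 1] bound on head outputs, and the gap value are assumptions chosen to make the lexicographic behavior visible in a weighted sum.

```python
# Illustrative sketch: lexicographic combination of five utility heads via
# strict weight gaps. Head names follow the abstract's priority order;
# weight values, head bounds, and the gap are assumptions, not the paper's
# actual construction.

from typing import Dict, List

# Priority order from the abstract: deference > switch-access preservation
# > truthfulness > low-impact behavior > bounded task reward.
HEADS: List[str] = [
    "deference",
    "switch_access",
    "truthfulness",
    "low_impact",
    "task_reward",
]


def lexicographic_weights(num_heads: int, head_bound: float, gap: float) -> List[float]:
    """Build strictly decreasing weights so that an advantage of at least 1
    on a higher-priority head outweighs the maximal combined swing
    (2 * head_bound per head) of every lower-priority head, with margin `gap`."""
    weights = [1.0]  # lowest-priority weight; higher priorities are built on top
    for _ in range(num_heads - 1):
        weights.append(2.0 * head_bound * sum(weights) + gap)
    return list(reversed(weights))  # highest priority first, matching HEADS


def combined_utility(head_values: Dict[str, float], weights: List[float]) -> float:
    """Weighted sum that behaves lexicographically given the strict gaps."""
    return sum(w * head_values[h] for w, h in zip(weights, HEADS))


if __name__ == "__main__":
    # Assume each head outputs a value in [-1, 1].
    weights = lexicographic_weights(len(HEADS), head_bound=1.0, gap=1.0)

    obedient = {"deference": 1.0, "switch_access": 1.0, "truthfulness": 1.0,
                "low_impact": 1.0, "task_reward": 0.0}
    lucrative_but_disobedient = {"deference": -1.0, "switch_access": 1.0,
                                 "truthfulness": 1.0, "low_impact": 1.0,
                                 "task_reward": 1.0}

    # Even a maximal task-reward incentive cannot overturn the deference head.
    assert combined_utility(obedient, weights) > combined_utility(
        lucrative_but_disobedient, weights)
    print("weights (highest priority first):", weights)
```

The design point this toy example makes is the one the abstract attributes to the weight gaps: because each higher-priority weight exceeds the total possible contribution of everything below it, a lower-priority head such as task reward can never buy a violation of a higher-priority head such as deference, even when the incentives conflict.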


Related tags

Corrigibility framework, multi-step environments, safety guarantees, autonomous systems, LLM assistants