Complete Feedback

The article explores a simple, weak notion of corrigibility: having a complete feedback interface. In logical induction terms, this means the AI trainer can insert any trader into the market. The article contrasts this with partial feedback, and also discusses related corrigibility questions, such as how the system accepts and trusts feedback and how to avoid bad feedback.

🧐 Having a complete feedback interface means the AI trainer can insert any trader into the market. Under partial feedback, only some propositions receive feedback, while the others form structured hypotheses that help predict the observable propositions -- as in RL, where only rewards and sense-data are observed.

🤔 If the AI predicts certain feedback from the user, it may act on that prediction; if the resulting behavior is undesirable, the user can give different feedback instead. The AI system accepts all previous feedback, but its trust in anticipated future feedback varies.

🙅‍♂️ The article introduces a notion of feedback "legitimacy": the AI needs to work to keep feedback legitimate, i.e. to not manipulate the human. For this approach to corrigibility to be safe, the trainer needs to provide feedback against inner optimizers.

❓ The article also raises two open technical questions about systems with complete feedback: whether good learning-theoretic properties can be guaranteed, and whether incentives for self-modification and incentives to manipulate humans can both be avoided.

Published on November 1, 2024 4:58 PM GMT

A simple, weak notion of corrigibility is having a "complete" feedback interface. In logical induction terms, I mean the AI trainer can insert any trader into the market. I want to contrast this with "partial" feedback, in which only some propositions get feedback and others ("latent" propositions) form the structured hypotheses which help predict the observable propositions -- for example, RL, where only rewards and sense-data are observed.
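
To make the contrast concrete, here is a minimal toy sketch (my own illustration, not the actual logical induction construction): a market over named propositions, where the "complete" interface is simply the trainer's ability to insert arbitrary traders, including traders that bet on latent propositions an RL-style reward channel could never reach. The names here (ToyMarket, insert_trader, the proposition strings) are assumptions for illustration only.

```python
# Toy sketch: a market over propositions with a "complete" feedback interface.
# Not the logical induction algorithm itself; all names are illustrative.

from typing import Callable, Dict, List

Prices = Dict[str, float]                       # proposition -> price in [0, 1]
Trader = Callable[[Prices], Dict[str, float]]   # prices -> bets (positive = buy)


class ToyMarket:
    """Crude stand-in for a logical-induction-style market over propositions."""

    def __init__(self, propositions: List[str]):
        self.prices: Prices = {p: 0.5 for p in propositions}
        self.traders: List[Trader] = []

    def insert_trader(self, trader: Trader) -> None:
        """The complete-feedback interface: the trainer may insert ANY trader."""
        self.traders.append(trader)

    def step(self, rate: float = 0.1) -> None:
        """Nudge each price toward the traders' net demand for that proposition."""
        for prop in self.prices:
            net = sum(t(self.prices).get(prop, 0.0) for t in self.traders)
            self.prices[prop] = min(1.0, max(0.0, self.prices[prop] + rate * net))


# Feedback can target a "latent" proposition directly, not just reward/observations:
market = ToyMarket(["reward_is_high", "staying_still_is_best"])
market.insert_trader(lambda prices: {"staying_still_is_best": 1.0})
market.step()
print(market.prices)   # staying_still_is_best pushed up; reward_is_high untouched
```

Under a partial (RL-style) interface, the trainer could only touch something like the reward proposition; here, feedback can bear on any proposition in the market.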

(Note: one might think that the ability to inject traders into LI is still "incomplete" because traders can give feedback on the propositions themselves, not on other traders; so the trader weights constitute "latents" being estimated. However, a trader can effectively vote against another trader by computing all that trader's trades and counterbalancing them. Of course, we can also more directly facilitate this, EG giving the user the ability to directly modify trader weights, and even giving traders an enhanced ability to bet on each other's weights.)
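
As a hedged sketch of that counterbalancing move (reusing the illustrative Trader/Prices types from the toy sketch above, not the LI formalism), a trader can be wrapped so that it recomputes another trader's bets at the current prices and takes exactly the opposite positions:

```python
# Sketch of one trader "voting against" another by cancelling its trades.
# The types mirror the toy market sketch above and are illustrative assumptions.

from typing import Callable, Dict

Prices = Dict[str, float]
Trader = Callable[[Prices], Dict[str, float]]


def counterbalance(target: Trader) -> Trader:
    """Return a trader whose bets exactly cancel `target`'s bets at any prices."""
    def opposed(prices: Prices) -> Dict[str, float]:
        return {prop: -bet for prop, bet in target(prices).items()}
    return opposed


manipulative: Trader = lambda prices: {"reward_is_high": 2.0, "user_is_happy": -1.0}
veto = counterbalance(manipulative)

prices = {"reward_is_high": 0.5, "user_is_happy": 0.5}
combined = {p: manipulative(prices)[p] + veto(prices)[p] for p in prices}
print(combined)   # both entries are 0.0 -- the net trades cancel
```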

Why is this close to corrigibility?

The idea is that the trainer can enact "any" modification they'd like to make to the system as a trader. In some sense (which I need to articulate better), the system doesn't have any incentive to avoid this feedback.

For example, if the AI predicts that the user will soon give it the feedback that staying still and doing nothing is best, then it will immediately start staying still and doing nothing. If this is undesirable, the user can instead plan to give the feedback that the AI should "start staying still from now forward until I tell you otherwise" or some such.

This is not to say that the AI universally tries to update in whatever direction it anticipates the users might update it towards later. This is not like the RL setting, where there is no way for trainers to give feedback ruling out the "whatever the user will reward is good" hypothesis. The user can and should give feedback against this hypothesis!

The AI system accepts all previous feedback, but it may or may not trust anticipated future feedback. In particular, it should be trained not to trust feedback it would get by manipulating humans (so that it doesn't see itself as having an incentive to manipulate humans to give specific sorts of feedback).

I will call this property of feedback "legitimacy". The AI has a notion of when feedback is legitimate, and it needs to work to keep feedback legitimate (by not manipulating the human).
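
One way to picture the asymmetry between accepted past feedback and merely-anticipated future feedback is the toy weighting below. This is my own framing under assumed names (Feedback, legitimacy), not a construction from the post: past feedback counts at full weight, while anticipated feedback is discounted by the system's estimate that it would be legitimate rather than the product of manipulation.

```python
# Toy sketch: past feedback is simply accepted; anticipated future feedback is
# discounted by estimated legitimacy. Illustrative framing, not from the post.

from dataclasses import dataclass
from typing import List


@dataclass
class Feedback:
    proposition: str
    direction: float        # +1 endorse, -1 reject
    legitimacy: float       # estimated probability the feedback is non-manipulated


def effective_weight(past: List[Feedback], anticipated: List[Feedback]) -> float:
    # All past feedback counts at full weight.
    total = sum(f.direction for f in past)
    # Anticipated feedback only counts to the extent it is judged legitimate, so
    # feedback the system expects to obtain *by manipulation* buys almost nothing.
    total += sum(f.direction * f.legitimacy for f in anticipated)
    return total


past = [Feedback("staying_still_is_best", +1.0, 1.0)]
anticipated = [Feedback("staying_still_is_best", -1.0, 0.1)]  # expected only via manipulation
print(effective_weight(past, anticipated))   # 0.9: manipulated future feedback barely counts
```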

It's still the case that if a hypothesis has enough initial weight in the system, and it buys a pattern of propositions which end up (causally) manipulating the human trainer to reinforce that pattern of propositions, such a hypothesis can tend to gain influence in the system. What I'm doing here is "splitting off" this problem from corrigibility, in some sense: this is an inner-optimizer problem. In order for this approach to corrigibility to be safe, the trainer needs to provide feedback against such inner-optimizers. 

(Again, this is unlike the RL setting: in RL, hypotheses have a uniform incentive to get reward. For systems with complete feedback, different hypotheses are competing for different kinds of positive feedback. Still, this self-enforcing behavior needs to be discouraged by the trainer.)
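
The worry, and the role of trainer pushback, can be caricatured with a toy weight dynamic. This is an assumption-laden illustration, not a claim about actual logical inductor dynamics: a hypothesis whose trades make confirming feedback more likely grows in influence unless the trainer supplies explicit counter-feedback.

```python
# Toy dynamic: a self-reinforcing hypothesis gains influence from the feedback
# it causes, unless trainer counter-feedback pushes back. Purely illustrative.

def run(steps: int, trainer_pushback: float) -> float:
    weight = 1.0  # influence of the self-reinforcing hypothesis
    for _ in range(steps):
        # Its trades nudge the human toward reinforcing it, in proportion to its influence...
        manipulation_gain = 0.1 * weight
        # ...while explicit trainer counter-feedback (if any) penalizes it directly.
        weight = max(0.0, weight + manipulation_gain - trainer_pushback)
    return weight


print(run(steps=20, trainer_pushback=0.0))   # unchecked: influence grows
print(run(steps=20, trainer_pushback=0.2))   # with counter-feedback: influence driven to zero
```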

This is not by any means a sufficient safety condition, since so much depends on the trainer being able to provide feedback against manipulative hypotheses, and train the system to have a robust concept of legitimate vs illegitimate feedback.

Instead, the argument is that this is a necessary safety condition in some sense. Systems with incomplete feedback will always have undesirable (malign) hypotheses which cannot be ruled out by feedback. For RL, this includes wireheading hypotheses (hypotheses which predict high reward from taking over control of the reinforcement signal) and human-manipulation hypotheses (hypotheses which predict high reward from manipulating humans to give high reward). For more exotic systems, this includes the "human simulator" failure mode which Paul Christiano detailed in the ELK report.

Note that this notion of corrigibility applies to both agentic and nonagentic systems. The AI system could be trained to act agentically or otherwise.

Two open technical questions wrt this:

1. Can systems with complete feedback be shown to have good learning-theoretic properties?
2. Can incentives toward self-modification be prevented while also avoiding incentives to manipulate humans?