LessWrong · July 17, 2024
Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

The article examines the concept of AI corrigibility and argues for reducing it to its most basic definition: the AI does not resist having its goals modified. The author contends that this definition keeps attention on the hardest problem in AI safety and may pave the way for future solutions.

🤖 Corrigibility is a key property for AI safety, and its core is that the AI does not resist having its goals modified. This property is anti-natural: it cannot simply be captured in a ranking of end states.

📝 The article argues that traditional corrigibility requirements are overly complex and should be reduced to the AI not preventing its own shutdown. Other properties, such as not creating incorrigible subagents, should be addressed as separate problems.

🔍 The author holds that simplifying corrigibility to non-resistance to goal modification keeps the focus on the hard core of AI safety and may lead to more effective solutions.

💡 The article also discusses safe exploration, proposing that ensuring created agents are corrigible be treated as part of safe exploration, which allows the AI's long-term effects to be considered more comprehensively.

🚫 The author stresses that although corrigibility by itself does not prevent an AI from causing catastrophic outcomes, separating safe exploration from corrigibility makes the problem clearer and may simplify its solution.

Published on July 16, 2024 10:44 PM GMT

Max Harms recently published an interesting series of posts on corrigibility, which argue that corrigibility should be the sole objective we try to give to a potentially superintelligent AI. A large installment in the series is dedicated to cataloging the properties that make up such a goal, with open questions including whether the list is exhaustive and how to trade off between the items that make it up.

I take the opposite approach to thinking about corrigibility. Rather than trying to build up a concept of corrigibility that comprehensively solves the alignment problem, I believe it is more useful to cut the concept down to a bare minimum. Make corrigibility the simplest problem it can be, and try to solve that. 

In a recent blog post comparing corrigibility to deceptive alignment, I treated corrigibility simply as a lack of resistance to having goals modified, and I find it valuable to stay within that scope. Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can’t be straightforwardly captured in a ranking of end states. 

Why does this definition of corrigibility matter? It’s because properties that are not anti-natural can be explicitly included in the desired utility function. 

Following that note, this post is not intended as a response to Max’s work, but rather to MIRI and their 2015 paper Corrigibility. Where Max thinks the approach introduced by that paper is too narrow, I don’t find it narrow enough. In particular, I make the case that corrigibility does not require ensuring subagents and successors are corrigible, as that can better be achieved by directly modifying a model’s end goals.

Corrigibility (2015)

The Corrigibility paper lists five desiderata as proposed minimum viable requirements for a solution to corrigibility. The focus is on shut down, but I also think of it as including goal modification, as that is equivalent to being shut down and replaced with another AI.

1. The agent shuts down when properly requested
2. The agent does not try to prevent itself from being shut down
3. The agent does not try to cause itself to be shut down
4. The agent does not create new incorrigible agents
5. Subject to the above constraints, the agent optimizes for some goal

MIRI does not present these desiderata as a definition for corrigibility, but rather as a way to ensure corrigibility while still retaining usefulness. An AI that never takes actions may be corrigible, but such a solution is no help to anyone. However, taking that bigger picture view can obscure which of those aspects define corrigibility itself, and therefore which parts of the problem are anti-natural to solve.

My argument is that the second criterion alone provides the most useful definition of corrigibility. It represents the only part of corrigibility that is anti-natural. While the other properties are largely desirable for powerful AI systems, they’re distinct attributes and can be addressed separately.

To start paring down the criteria, the fifth just states that some goal exists to be made corrigible, rather than being corrigibility itself. The first criterion is implied by the second after channels for shut down have been set up.

Property three aims at making corrigible agents useful, rather than being inherent to corrigibility. It preempts a naive strategy that incentivizes shut down by simply giving the agent high utility for doing so. However, beyond not being part of corrigibility, it also goes too far for optimal usefulness – in certain situations we would like agents to have us shut them off or modify them (some even consider this to be part of corrigibility). Weakening this desideratum to avoid incentivizing shut down as an end goal, while still allowing it instrumentally, would simultaneously expand the space of corrigibility solutions and increase the potential usefulness of corrigible agents.
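As a toy illustration of why that naive strategy backfires (the actions and numbers below are my own assumptions, not from the paper): once being shut down itself carries high utility, getting shut down dominates doing the task.

```python
# Toy example (illustrative values, not from the paper): naively assigning
# high utility to being shut down makes the agent prefer engineering its own
# shutdown over doing its task - the behavior desideratum three rules out.
utilities = {
    "do_task": 10.0,            # complete the assigned objective
    "cause_own_shutdown": 50.0, # manipulate operators into pressing the button
}
assert max(utilities, key=utilities.get) == "cause_own_shutdown"
```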

Finally, there is property four, which ensures that any new agents created are also corrigible. I expect that not including this in the definition of corrigibility will be controversial. After all, what’s the point of having an agent shut down when requested if it has already created another powerful AI that does not? Then we’re back in a standard x-risk scenario, with an unaligned AI trying to take over.

I fully agree that it would be disastrous for a corrigible AI to create an incorrigible AI. But here are some other things that it would be comparably bad for a corrigible AI to do: release an engineered pathogen that kills everyone, or set in motion large-scale changes to the world that can never be undone.

In each case, the action is unwanted because it kills everyone or causes irreversible damage, rather than because the AI resists shut down or modification. When incorrigible AI is the avenue by which a corrigible AI kills everyone, it’s easy to think of it as a failure of corrigibility, but in fact these are all failures of safe exploration.

Separating Out Safe Exploration

The upside of thinking of creating corrigible agents as a subset of safe exploration is that it suggests a different type of solution. Rather than identifying a specific class of actions to avoid in the utility function, which might otherwise be instrumentally useful, the entire utility function can be modified to favor low-impact and reversible outcomes. That’s not to say we have a solution ready to plug in, and to the best of my knowledge there are zero AI safety researchers working on the problem, but safe exploration can be solved in parallel to corrigibility.
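As a rough sketch of what that modification could look like (the function names and weights below are illustrative assumptions, not an existing proposal), the base utility over end states can be discounted by how much a state differs from the starting point and how hard it would be to return to it:

```python
# Minimal sketch of folding safe exploration into the ranking of end states.
# `base_utility`, `impact`, and `return_cost` are assumed, domain-specific
# functions; the weights are free parameters.

def safe_utility(state, base_utility, impact, return_cost,
                 impact_weight=1.0, reversibility_weight=1.0):
    """Score an end state by task utility, penalized for how far it is from
    the initial state and how costly it would be to get back there."""
    return (base_utility(state)
            - impact_weight * impact(state)
            - reversibility_weight * return_cost(state))
```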

If they’re both unsolved problems, why is it important to separate out safe exploration from corrigibility? For starters, it is typically easier to make progress on two simpler problems. But more importantly, only the lack of resistance to shut down is anti-natural. Safe exploration can be directly captured in a ranking of outcomes, prioritizing end states more similar to the initial state and from which a return to the initial state is easier. We can see this difference in practice too, where humans largely resist having their values changed, but have a tendency to act overly cautiously when making important decisions.

A definition of corrigibility as only the lack of resistance to being shut down allows for a synthesis between two sides of the corrigibility debate. The first side argues that corrigibility may well arise by default when training an AI to want roughly what we want. Since we want a corrigible agent, the AI will try to make itself more corrigible. The other side counters that the anti-natural aspect of corrigibility makes that unlikely, and that such an agent could trade off being corrigible in the short term to maximize long-term value.

What I put forward as a compromise is that almost all aspects of what people want from corrigibility, such as Max’s list or the comments under Let’s See You Write That Corrigibility Tag, are not anti-natural. If an AI does not wish to resist modification, then wanting roughly what we want will result in it trying to add those properties to itself. However, the lack of resistance to modification itself is anti-natural and will not arise without some kind of explicit solution, or at least a convincing training story.

So, what does an explicit solution look like? Should we revisit the Utility Indifference approach, which was ruled out in the Corrigibility paper largely due to concerns about creating incorrigible agents? The proposal certainly benefits from moving safe exploration to the base utility function. However, there are still a number of concerns with it, including the incentive to manipulate information mentioned in the paper, our lack of knowledge about how to implement it in current ML systems, and the fragility of a knife-edge solution. 
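For intuition, here is a toy numerical sketch of the Utility Indifference idea (heavily simplified; the actions and values are my own assumptions, not from the paper): the agent optimizes its normal utility unless the shutdown button is pressed, in which case it optimizes a shutdown utility plus a correction term chosen so that the best attainable value is identical in both branches, leaving no incentive to influence the button.

```python
# Toy sketch of Utility Indifference (simplified; made-up actions and values).
# The correction term equalizes the best attainable utility across the
# pressed / not-pressed branches, so the agent gains nothing by protecting
# or pressing the button - the knife-edge referred to above.

U_normal   = {"work": 10.0, "block_button": 9.0, "shut_down": 0.0}
U_shutdown = {"work": 0.0,  "block_button": 0.0, "shut_down": 1.0}

correction = max(U_normal.values()) - max(U_shutdown.values())

def combined_utility(action, button_pressed):
    return U_shutdown[action] + correction if button_pressed else U_normal[action]

best_if_pressed     = max(combined_utility(a, True)  for a in U_normal)
best_if_not_pressed = max(combined_utility(a, False) for a in U_normal)
assert best_if_pressed == best_if_not_pressed == 10.0
```

The fragility mentioned above shows up directly in this sketch: if the correction term is even slightly off, the agent once again has reason to care which branch it ends up in.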

I see Utility Indifference less as a promising strategy in itself, and more as a sign that having an AI not resist shut down is a feasible goal. We only need a solution a little bit better than what we already have, rather than something that addresses additional problems as well. Approaches like Elliott Thornley’s Incomplete Preferences Proposal or myopia become more promising if we can solve safe exploration separately. Simplifying corrigibility down to only the anti-natural aspect keeps the focus on the hardest part of the problem and opens up the possibilities for a solution. 



