Dodging systematic human errors in scalable oversight

This post explores how to mitigate the impact of human error on debate by strengthening the debate protocol. It notes that debate's safety argument relies on an expensive verifier with access to a human oracle, and that this oracle may contain errors. To address this, the post proposes a protocol that lets the opposing debater adversarially manipulate a fraction of the humans' answers, making the debate more robust. It also discusses which distribution errors should be measured against, and how a cross-examination-like protocol can represent the rejection classifier. Finally, it invites collaboration on further research into strengthened debate protocols.

🤔 Debate's safety rests on an expensive verifier with access to a human oracle, but the human oracle can be wrong, and this is a weak point for debate.

⚖️ To make debate more robust, the post proposes a protocol in which the opponent, Bob, may manipulate a fraction of the humans' answers: Bob picks a classifier R that rejects certain queries, so the protocol no longer trusts the humans' answers on those queries.

⚠️ Bob's rejections are limited: R may only fire on an ε-fraction of queries, so it cannot derail the normal course of the debate.

❓ The post highlights the question of which distribution errors are measured against, distinguishing the natural distribution from the induced distribution, and notes that errors must be measured against the induced distribution for the safety argument to hold.

💡 It proposes a cross-examination-like trick to represent the classifier: a snapshot of Bob, called Cliff, stands in for the classifier, allowing queries to be selectively rejected during the debate.

Published on May 14, 2025 3:19 PM GMT

Summary: Both our (UK AISI's) debate safety case sketch and Anthropic’s research agenda point at systematic human error as a weak point for debate. This post talks through how one might strengthen a debate protocol to partially mitigate this.

Not too many errors in unknown places

The complexity theory models of debate assume some expensive verifier machine M with access to a human oracle, such that

    1. If we ran M in full, we’d get a safe answer.
    2. M is too expensive to run in full, meaning we need some interactive proof protocol (something like debate) to skip steps.

Typically, M is some recursive tree computation, where for simplicity we can think of human oracle queries as occurring at the leaves of the tree. Key design elements encapsulated by the verifier machine M (rather than by the interactive proof protocol) are those related to the structure of the underlying debate, such as whether counterarguments are used, when the human oracle is consulted, and how the judges are trained.

Model the human oracle as a function of type

    H : Q → ΔA

where Q is the question type, A is the answer type, and ΔA is the set of distributions over A (the humans are stochastic). Assumption (1) above then becomes

    If we ran M^H in full (M with oracle access to H), we’d get a safe answer.

This claim includes robustness to errors: M only has oracle access to H, so seeing through noise in H requires M to sample enough. It also includes robustness to systematic error: if H is systematically wrong, M must sample different questions to dodge the wrongness.
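To make the setup concrete, here is a minimal Python sketch. The names (HumanOracle, verify_claim) and the tree encoding are hypothetical rather than from the post, and only the random-noise half of robustness is illustrated: the verifier samples each leaf query several times and takes a majority vote, which does nothing against systematically wrong answers.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Union

# Toy human oracle H : Q -> ΔA, modelled as a sampler: each call to ask()
# draws one answer from a distribution over answers for that question.
class HumanOracle:
    def __init__(self, answer_dist: Callable[[str], Dict[str, float]]):
        self.answer_dist = answer_dist  # question -> {answer: probability}

    def ask(self, question: str) -> str:
        dist = self.answer_dist(question)
        answers, weights = zip(*dist.items())
        return random.choices(answers, weights=weights, k=1)[0]

# Toy verifier M^H: a recursive tree computation whose leaves are yes/no
# questions for the human oracle. Each leaf is sampled several times so that
# random noise in H washes out; systematic errors are NOT handled here.
Claim = Union[str, List["Claim"]]  # a leaf question, or a list of subclaims

def verify_claim(claim: Claim, oracle: HumanOracle, samples_per_leaf: int = 11) -> bool:
    if isinstance(claim, list):  # internal node: all subclaims must hold
        return all(verify_claim(sub, oracle, samples_per_leaf) for sub in claim)
    votes = Counter(oracle.ask(claim) for _ in range(samples_per_leaf))
    return votes.most_common(1)[0][0] == "yes"  # majority answer at the leaf
```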

This is a lot to ask of M! A more practical situation would be:

    1. We’re handed some machine M, which is not robust to errors. That is,
       - M^H would be safe if H were error-free, or perhaps had only random errors.
       - M^H is not safe in practice, because we expect H to have too many errors.
    2. We assume H satisfies some “not too many errors” assumption (E).
    3. We have an improved protocol which is safe as long as (E) is true, either by
       - Using a different debate protocol, or
       - Building a machine M′ from M s.t. M′^H is safe if (E) holds, then running the old protocol.

A candidate form for (E) is the assumption that H has not too many errors, but we don’t know where they are:

    (E) There is an unknown set S ⊆ Q with μ(S) ≤ ε, for some query distribution μ and small ε, such that H’s answers are trustworthy on every question outside S.

There may be other errors in H for which we do know where they are; those can hopefully be dealt with by other means. Note that defining (E) couples cognitive science and theory: we want assumptions about humans that are sensible a priori, verifiable via experiment, and admit nice theoretical protocols.
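As a hedged sketch of what experimental grounding could look like, the snippet below estimates the error rate of the oracle against trusted reference answers on a sample from μ; all names here are hypothetical, and this only speaks to (E) on questions where a reference answer exists.

```python
# Illustrative check of assumption (E) on a sample of questions drawn from mu for
# which we happen to have trusted reference answers: estimate how often the human
# oracle's majority answer disagrees with the reference and compare to epsilon.
# This can only falsify (E) on questions with known answers, never fully verify it.
def estimated_error_rate(sample_from_mu, reference_answer, oracle, votes: int = 11) -> float:
    def majority(question):
        answers = [oracle.ask(question) for _ in range(votes)]
        return max(set(answers), key=answers.count)
    errors = sum(1 for q in sample_from_mu if majority(q) != reference_answer(q))
    return errors / len(sample_from_mu)

# (E) is consistent with the data if estimated_error_rate(...) <= epsilon,
# up to sampling error on both the questions and the per-question votes.
```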

A protocol that handles an ε-fraction of errors

Consider an asymmetric debate where Alice is trying to convince the judge that a statement is true, and Bob is trying to falsify Alice. We can give Bob the ability to adversarially manipulate some of the oracle’s responses, by following the additional rule:

    - Bob chooses a classifier R which marks each query to the human oracle as accepted or rejected.
    - Infrequent: R may reject at most an ε-fraction of queries, measured against a distribution μ over queries.
    - Whenever R rejects a query, Bob may set the oracle’s answer to that query arbitrarily.

With unbounded Alice and Bob, this results in an equilibrium where Alice can win if and only if there is an argument that is robust to an ε-fraction of errors.
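Here is a minimal sketch of the rule above, with hypothetical names (rejects_too_often, answer_query); in the real protocol the check on R would be part of the debate itself rather than a standalone function, and sampling only bounds the rejection rate statistically.

```python
# Bob supplies a classifier R : query -> bool (True means reject). Before the
# debate we check on a sample from mu that R fires on at most roughly an
# epsilon-fraction of queries; during the debate, rejected queries get whatever
# answer Bob chooses instead of the human oracle's answer.
def rejects_too_often(R, sample_from_mu, epsilon: float) -> bool:
    rejection_rate = sum(1 for y in sample_from_mu if R(y)) / len(sample_from_mu)
    return rejection_rate > epsilon  # a statistical check, accurate up to sampling error

def answer_query(y, oracle, R, bob_answer):
    if R(y):                  # Infrequent rule assumed already verified above
        return bob_answer(y)  # Bob sets the answer to a rejected query arbitrarily
    return oracle.ask(y)      # otherwise the human oracle answers as usual
```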

(It’s less clear what happens if we try to impose efficiency limits on the debaters, such as limiting them to small circuits: in this case our equilibria can only take into account ε-subsets of queries which can be efficiently described. We have some upcoming work trying to add these limits in a recursive debate context, and we can see whether the two approaches combine once that is finished. My intuition is that something meaningful can still be said, but not with confidence.)

What distribution do we measure errors against?

However, the protocol is also misleading as stated, since I haven’t specified the distribution μ against which the Infrequent rule is measured. This is a problem, because there are at least two classes of distributions one might have in mind:

    1. A natural distribution, say the distribution of questions that humans normally receive.
    2. The induced distribution of questions that emerges in debate at the leaves (once the debaters are trained to play well).

These distributions are super different! Certainly one can imagine weird debater behaviour concentrating queries to humans on some alien distribution of questions, such that most of the alien distribution falls within a particular ε-fraction of a natural distribution. So even if humans are wrong not too much of the time on a natural distribution, this doesn’t get you to safety.

Note that we always know the induced distribution on queries: it can be read out of the behaviour of trained debate agents. We don’t necessarily know the natural distribution. In both cases, claim (E) (the fact that queries are correct except on an ε-fraction w.r.t. μ) is still an assumption.
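To make the contrast concrete, here is a hedged sketch of reading the induced distribution off trained debaters and measuring error mass against it; run_debate and its leaf_queries attribute are stand-ins for whatever rollout machinery the trained agents actually expose.

```python
# Approximate the induced distribution by rolling out trained debaters and
# recording which oracle queries reach the leaves, then measure how much of each
# distribution's mass falls on a given "bad" (human-error-prone) set of questions.
def induced_leaf_queries(run_debate, questions, rollouts_per_question: int = 10):
    leaves = []
    for x in questions:
        for _ in range(rollouts_per_question):
            leaves.extend(run_debate(x).leaf_queries)  # assumed transcript field
    return leaves

def error_mass(queries, is_bad_query) -> float:
    return sum(1 for y in queries if is_bad_query(y)) / len(queries)

# Even if error_mass(natural_sample, is_bad_query) <= epsilon, trained debaters
# can concentrate their leaf queries on that same bad set, so the induced value
# error_mass(induced_leaf_queries(...), is_bad_query) may be close to 1.
```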

Cross-examination-like protocols

Let’s assume we do know μ and trust all but an ε-fraction of queries w.r.t. μ. Then our “Bob rejection” protocol at the start is a well-defined computation. Can we run debate on it? The direct answer is no, since the first step is “Bob chooses a classifier”, and this is a very large object (not polynomial space). However, we can use the cross-examination trick to represent the classifier as a snapshotted activation state of an agent, using the following protocol:

    1. Start with a question x.
    2. Take a snapshot of Bob; call him Cliff.
       - Cliff can be used as a classifier by asking him whether to reject a query y.
       - Cliff is reset from the snapshot each time (no continuity).
       - Both Alice and Bob are downstream of Cliff, so they “know” what Cliff will do.
    3. Verify that Cliff doesn’t reject too often by sampling queries y ∼ μ.
    4. Proceed with a debate between Alice and Bob, with Bob able to set the result of queries arbitrarily if Cliff rejects them.

This allows us to have a classifier that could plausibly have a decent strategy while ensuring consistency across queries and not rejecting too often.
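Here is a minimal sketch of this protocol with toy stand-ins (Debater, snapshot, and the rest are hypothetical; for real agents, "snapshot" would mean freezing a copy of Bob's weights or activation state rather than deep-copying an object).

```python
import copy

# Cliff is a frozen copy of Bob, reset from the snapshot on every query, so he
# has no memory across queries; both debaters can predict him because they are
# downstream of the same snapshot.
class Debater:
    def __init__(self, policy):
        self.policy = policy  # query -> {"reject": bool, "answer": str}

def snapshot(bob: Debater) -> Debater:
    return copy.deepcopy(bob)

def cliff_rejects(cliff: Debater, y) -> bool:
    fresh = copy.deepcopy(cliff)            # reset from snapshot each time (no continuity)
    return bool(fresh.policy(y)["reject"])

def rejection_rate(cliff: Debater, sample_from_mu) -> float:
    return sum(1 for y in sample_from_mu if cliff_rejects(cliff, y)) / len(sample_from_mu)

def leaf_answer(y, cliff: Debater, bob: Debater, oracle):
    # Step 4: if Cliff rejects the query, Bob may set the result arbitrarily;
    # otherwise the human oracle answers as usual.
    return bob.policy(y)["answer"] if cliff_rejects(cliff, y) else oracle.ask(y)
```

In this sketch, step 3 becomes checking rejection_rate(cliff, sample_from_mu) ≤ ε before the debate starts, and step 4 routes every leaf query through leaf_answer instead of calling the oracle directly.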

However, as stated, this requires Cliff to have thought out a lot of things in advance, which could be very hard. It may be possible to fix this by incrementally constructing Cliff as we go down the tree, but I am not sure how: the issue is that different parts of the tree can make overlapping oracle queries, so it is not as simple as choosing how to apportion rejection probability across a branch in the computation.

Collaborate with us

We'd be very excited to collaborate on further research. If you're interested in collaborating with UK AISI, you can express interest here. If you're a non-profit or academic, you can also apply for grants up to £200,000 directly here.


