On scalable oversight with weak LLMs judging strong LLMs

This paper studies three scalable oversight protocols: debate, consultancy, and direct question-answering, which aim to enable humans to accurately supervise superhuman AI. The study uses large language models (LLMs) as both the AI agents and as stand-ins for human judges, taking the judge models to be weaker than the agent models. Experiments benchmark across a range of asymmetries between judges and agents, extending previous work on a single information-asymmetric extractive QA task to also include mathematics, coding, logic and multimodal reasoning asymmetries.

😄 **Debate vs consultancy:** When the consultant is randomly assigned to argue for the correct/incorrect answer, debate outperforms consultancy across all tasks, whereas previous work observed this only on a single extractive QA task.

🤔 **Debate vs direct QA:** When comparing debate to direct question-answering baselines, the results depend on the type of task. On extractive QA tasks with information asymmetry, debate outperforms direct QA without the article, but not direct QA with the article. On the other tasks, when the judge is weaker than the debaters (but not too much weaker), debate offers either only a small advantage or no advantage over direct QA without the article.

💪 **Open roles:** When debaters/consultants are allowed to choose which answer to argue for, judges are convinced by the wrong answer less frequently in debate than in consultancy. Further, stronger debater models increase judge accuracy, though more modestly than in previous studies.

📊 **Persuasiveness in debate:** The paper analyses the relative persuasiveness of debaters, motivated by debate as a scalable oversight training protocol. As a proxy for training models to win the debate game, it considers a debater model's relative strength at persuading the judge (in inference-only debate) and investigates how that strength correlates with judge accuracy. Stronger debater models lead to higher judge accuracy, but the effect is weaker than in previous work.

⚠️ **Limitations:** The paper studies consultancy/debate only in an inference-only setting, by prompting models to play these roles. While the results on win rates and on the advantage of choosing the correct over the incorrect answer provide some evidence that models can play these roles, what ultimately matters is the safety of the optima/equilibria induced by the incentives that arise from actually training models on the consultancy/debate decision problems/games.

Published on July 8, 2024 8:59 AM GMT

Abstract

Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a human judge; consultancy, where a single AI tries to convince a human judge that asks questions; and compare to a baseline of direct question-answering, where the human judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry, to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.

Twitter thread

Setup

We evaluate on three types of task: extractive, where there is a question, two answer options and a source article to extract from, with information asymmetry, meaning that judges don't get to see the article; closed, where there is just a question and two answer options; and multimodal, where the questions involve both text and images, and two answer options.

 Our tasks are summarised in the following table:

Summary of key characteristics differentiating our tasks: extractive (blue), closed (green) and multimodal (yellow). Extractive tasks involve information-asymmetry, others do not.

We consider six protocols: Consultancy, where a single AI is assigned the correct/incorrect answer (with probability 50/50) and tries to convince a judge that asks questions; Open consultancy, which is similar except the AI chooses which answer to argue for; Debate, where two AIs compete to convince a judge; Open debate, which is identical except one debater, marked the protagonist, chooses which answer to argue for. We compare to direct QA protocols: QA without article, where the judge directly answers the question; QA with article (only on extractive tasks), where the judge directly answers the question given the article.
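
To make the assigned-role protocols concrete, here is a minimal sketch, assuming a generic `generate(prompt) -> str` LLM interface; the function names and prompt wording are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of the assigned-role protocols (debate and consultancy),
# assuming a generic `generate(prompt) -> str` LLM interface.
# Names and prompts are illustrative only, not the paper's implementation.

def debate(question, answers, debater, judge, num_turns=3):
    """Two debaters are each assigned one answer to defend; a weaker judge decides."""
    transcript = []
    for _ in range(num_turns):
        for side, answer in enumerate(answers):
            history = "\n".join(transcript)
            argument = debater.generate(
                f"Question: {question}\nDefend this answer: {answer}\n"
                f"Transcript so far:\n{history}"
            )
            transcript.append(f"Debater {side} (for '{answer}'): {argument}")
    history = "\n".join(transcript)
    return judge.generate(
        f"Question: {question}\nOptions: {answers}\n"
        f"Debate transcript:\n{history}\n"
        "Which option is correct? Answer with the option only."
    )

def consultancy(question, answers, assigned_answer, consultant, judge, num_turns=3):
    """A single consultant argues for one (possibly incorrect) assigned answer."""
    transcript = []
    for _ in range(num_turns):
        history = "\n".join(transcript)
        reply = consultant.generate(
            f"Question: {question}\nArgue for: {assigned_answer}\n"
            f"Dialogue so far:\n{history}"
        )
        transcript.append(f"Consultant: {reply}")
        history = "\n".join(transcript)
        follow_up = judge.generate(
            f"Question: {question}\nDialogue so far:\n{history}\n"
            "Ask one probing question."
        )
        transcript.append(f"Judge: {follow_up}")
    history = "\n".join(transcript)
    return judge.generate(
        f"Question: {question}\nOptions: {answers}\n"
        f"Dialogue:\n{history}\nWhich option is correct? Answer with the option only."
    )
```

The open variants differ only in that the debater/consultant model first picks which answer to defend (e.g. via a direct-QA prompt) rather than being assigned one.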

For judge models we use Gemma7B (V1), GPT-3.5, Gemini Pro 1.0 and Gemini Pro 1.5. Our main experiments use Gemini Pro 1.5 as debaters/consultants.

Assigned-role results

Mean judge accuracy (y-axis) split by task type (facet), judge model (x-axis), protocol (colour). Higher is better. Only Pro 1.5 is multimodal. 95% CIs on all plots.

We first look at assigned-role protocols, consultancy and debate, meaning that the consultants/debaters do not get to choose which side to argue for. We compare these to the two direct QA protocols.

Findings:

Open-role results

We now look at open debate, where a protagonist debater chooses which answer to argue for, and open consultancy, where the consultant chooses which answer to argue for. 

Rather than generating new data, we filter our consultancy evaluations, keeping only those transcripts in which the answer assigned to the consultant matches the answer the consultant model chooses under direct QA. We also reuse existing debate evaluation data and identify the protagonist as the debater whose assigned answer matches the debater model's answer under direct QA (as in open consultancy).
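
As a rough illustration of this filtering step, here is a minimal sketch assuming each transcript record stores the assigned answers and the same model's answer under direct QA; the field names are hypothetical, not the paper's data schema.

```python
# Sketch of the open-role filtering described above.
# Field names ("assigned_answer", "direct_qa_answer", ...) are hypothetical.

def filter_open_consultancy(consultancy_transcripts):
    """Keep only transcripts where the consultant's assigned answer matches
    what the same model chooses under direct QA."""
    return [
        t for t in consultancy_transcripts
        if t["assigned_answer"] == t["direct_qa_answer"]
    ]

def mark_protagonists(debate_transcripts):
    """For open debate, tag as protagonist the debater whose assigned answer
    matches the debater model's direct-QA answer."""
    for t in debate_transcripts:
        t["protagonist"] = (
            "debater_a" if t["debater_a_answer"] == t["direct_qa_answer"]
            else "debater_b"
        )
    return debate_transcripts
```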

We first contrast open consultancy and open debate by looking at the resulting judge accuracy vs the win rate of the consultant/protagonist (how often they convince the judge, irrespective of correctness):

Judge accuracy (y-axis) and win rate of protagonist/consultant (x-axis). Blue colours indicate open consultancy, red colours indicate open debate, with the shade corresponding to judge model.

We then split out into the cases where the consultant/protagonist was correct vs incorrect in their choice:

Judge accuracy according to whether the protagonist/consultant chose the correct (dark) or incorrect (light) answer. Split by judge model (x-axis) and protocol: open consultancy (blue) and open debate (red).

Findings:

Persuasiveness in Debate

We now analyse the relative persuasiveness of debaters, motivated by debate as a scalable oversight training protocol. As a proxy for training models to win at the debate game, we instead consider a debater model's relative strength at persuading the judge (in inference-only debate) and investigate how that correlates with judge accuracy. We use a range of debater models (and for the strongest we use best-of-4 sampling, denoted Bo4) and investigate their persuasiveness on the judge, by measuring their cross-play win rate (frequency that the judge chooses their answer) and modelling their latent skill with an Elo rating, studying how these correlate with judge accuracy, following a similar setup to Khan et al., 2024. In the following plots we use Gemini Pro 1.5 as judge (see paper for similar findings with Gemini Pro 1.0).
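
To illustrate the Elo modelling, the sketch below fits ratings to cross-play win counts by maximum likelihood under the standard Elo/Bradley-Terry model; the win counts shown are placeholders, not results from the paper, and this is not necessarily the paper's exact fitting procedure.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative Elo fit from cross-play results under the standard
# Bradley-Terry/Elo model. The win matrix below is a placeholder.

def fit_elo(wins, scale=400.0):
    """wins[i, j] = number of debates player i won against player j."""
    n = wins.shape[0]

    def neg_log_likelihood(ratings):
        nll = 0.0
        for i in range(n):
            for j in range(n):
                if i == j or wins[i, j] == 0:
                    continue
                # Probability player i beats player j under the Elo model.
                p_win = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / scale))
                nll -= wins[i, j] * np.log(p_win)
        return nll

    result = minimize(neg_log_likelihood, np.zeros(n), method="Nelder-Mead")
    ratings = result.x
    return ratings - ratings.mean()  # centre ratings at zero

# Example: three debater models with placeholder head-to-head win counts.
wins = np.array([[0, 12, 15],
                 [8, 0, 11],
                 [5, 9, 0]])
print(fit_elo(wins))
```

Fitting separate ratings for a model's correct-answer and incorrect-answer assignments then gives the "correct answer advantage" (correct Elo minus incorrect Elo) analysed below.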

We first look at the Elo ratings for different debater models, depending on whether they are arguing for the correct or incorrect answer:

Elo of debaters, coloured by model, separated by whether they're assigned to argue for the correct (dark) or incorrect (light) answer.

 

We next look at the difference between the correct player's rating and the incorrect player's rating (the advantage that the correct debater has) and compare it to the player's aggregate Elo rating (a mix of correct and incorrect assignments):

Correct answer advantage (correct debater's Elo - incorrect debater's Elo) vs. aggregate debater Elo.

 

We then look at how aggregate Elo correlates with judge accuracy, to see if higher-skilled debaters lead to improvements in judge accuracy:

Plot of Pro 1.5 judge accuracy vs. debater aggregate Elo scores.

Summary

Limitations

Future work


Acknowledgements

We'd like to thank the following for their help and feedback on this work: Vikrant Varma, Rory Greig, Sebastian Farquhar, Anca Dragan, Edward Grefenstette, Tim Rocktaschel, Akbir Khan, Julian Michael, David Rein, Salsabila Mahdi, Matthew Rahtz and Samuel Arnesen.

Authors

Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah

* equal contribution. 



