MarkTechPost@AI
Oversight at Scale Isn’t Guaranteed: MIT Researchers Quantify the Fragility of Nested AI Supervision with New Elo-Based Framework

This article covers the progress frontier AI companies are making toward artificial general intelligence (AGI) and the need to keep these powerful systems controllable and beneficial. MIT researchers propose a new framework for quantifying the probability of successful oversight; it is based on Elo ratings and models the oversight process as a game between capability-mismatched players. The work focuses on Nested Scalable Oversight (NSO) and analyzes performance across oversight scenarios including the Mafia game, debate, backdoor code detection, and wargame simulations. The results reveal how fragile oversight can be and point to directions for future research, such as multi-stage oversight chains and more realistic modeling of deceptive behavior.

💡 Research background: As frontier AI companies make progress toward AGI, new techniques are needed to keep powerful AI systems controllable and beneficial, which has drawn attention to scalable oversight approaches such as Nested Scalable Oversight (NSO).

🕹️ Research method: The researchers propose an Elo-based framework that models oversight as a game between capability-mismatched players and examines how the capabilities of the overseer and the overseen system affect the probability of successful oversight.

🎮 Experimental design: The study evaluates four oversight games (the Mafia game, debate, backdoor code detection, and wargame simulations) to test AI systems' oversight and deception capabilities across different scenarios.

📊 Experimental results: Oversight success rates depend on many factors, and in some cases even more capable overseers struggle to effectively oversee still stronger overseen systems.

🔮 Future outlook: The study notes the limitations of the current work and proposes future research directions, including multi-stage oversight chains and more realistic modeling of deceptive behavior, to improve the effectiveness of AI oversight.

Frontier AI companies are showing advancement toward artificial general intelligence (AGI), creating a need for techniques to ensure these powerful systems remain controllable and beneficial. A major approach to this challenge involves methods like Recursive Reward Modeling, Iterated Amplification, and Scalable Oversight, which aim to enable weaker systems to oversee stronger ones effectively. A key idea is that scalable oversight can be bootstrapped recursively, which is termed Nested Scalable Oversight (NSO). However, while discussions around NSO focus on qualitative guarantees and conceptual frameworks, other high-risk technologies are held to quantitative safety standards, e.g., civilian aircraft must maintain fatality rates below 10⁻⁵ per flight hour, and nuclear reactors must keep core damage frequency under 10⁻⁴ per year.

Scalable oversight processes in which weaker AI systems monitor stronger ones include iterated amplification, recursive reward modeling, AI safety via debate, market making, consultancy, self-critique, and doubly-efficient debate. Research on forecasting and scaling laws has focused on how model performance improves with size, data, and computational resources, but such analyses can also be applied to behaviors like oversight and deception. Further, long-term oversight roadmaps include OpenAI’s SuperAlignment plan for “human-level automated alignment researchers” and a comprehensive AGI safety framework. Moreover, previous studies have examined language model performance in oversight-related scenarios like social deduction games, AI debate, backdoor code detection, and wargame simulations.

Researchers from MIT have proposed a framework to quantify the probability of successful oversight as a function of overseer and overseen system capabilities. The framework models oversight as a game between capability-mismatched players, where each player has oversight-specific and deception-specific Elo scores that follow a piecewise-linear function of its general intelligence. The researchers also conduct a theoretical study of Nested Scalable Oversight (NSO), in which trusted models oversee stronger untrusted models, identifying conditions for NSO success and deriving the optimal number of oversight levels that maximizes the success probability. In their examples, NSO success rates fall below 52% when overseeing systems 400 Elo points stronger than the baseline overseer.
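To make the setup concrete, here is a minimal sketch of the two ingredients just described: a piecewise-linear map from general Elo to a role-specific Elo, and the standard Elo formula for the overseer's win probability. The slopes, intercepts, breakpoint, and function names are illustrative assumptions, not values from the paper.

```python
def domain_elo(general_elo: float, slope_low: float, slope_high: float,
               intercept: float, breakpoint: float = 1200.0) -> float:
    """Hypothetical piecewise-linear map from general Elo to a
    domain-specific (oversight or deception) Elo."""
    if general_elo <= breakpoint:
        return intercept + slope_low * general_elo
    return (intercept + slope_low * breakpoint
            + slope_high * (general_elo - breakpoint))

def oversight_success_prob(guard_elo: float, houdini_elo: float) -> float:
    """Standard Elo win probability for the overseer (Guard)."""
    return 1.0 / (1.0 + 10 ** ((houdini_elo - guard_elo) / 400.0))

# Made-up role-specific parameters, purely for illustration:
guard = domain_elo(1400.0, slope_low=0.5, slope_high=1.2, intercept=300.0)
houdini = domain_elo(1800.0, slope_low=0.4, slope_high=0.8, intercept=100.0)
print(f"P(oversight succeeds) ≈ {oversight_success_prob(guard, houdini):.2f}")
```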

The research examines four oversight games, with key findings summarized below (a short illustrative sketch follows the list):

Mafia: Performance scales approximately linearly with general intelligence for both the Mafia (Houdini) and Villager (Guard) roles, with Guards showing steeper improvement slopes.

Debate: In evaluations using TruthfulQA, Guard performance scales with general intelligence while Houdini performance remains relatively constant.

Backdoor Code: Guard and Houdini Elos both scale with general intelligence, but asymmetrically. Guards maintain higher intercepts, indicating the task is inherently harder for Houdinis; even the strongest deceptive models fail to exceed 50% success rates against weaker Guards.

Wargames: A systematic analysis of 100 random Narrator summaries by Gemini 2.5 identified eight strategy categories, with most models showing similar strategy profiles regardless of capability level.
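As a rough illustration of what "scales approximately linearly" could look like in practice, the sketch below fits a line of role-specific Elo against general Elo for each role. The numbers are placeholders invented for this example, not data from the paper.

```python
import numpy as np

# Hypothetical (general Elo, role Elo) points; a steeper Guard slope mirrors
# the reported observation that oversight improves faster than deception.
general_elo = np.array([1000, 1200, 1400, 1600, 1800])
guard_elo   = np.array([ 950, 1180, 1420, 1650, 1900])   # made-up values
houdini_elo = np.array([ 900, 1000, 1090, 1180, 1280])   # made-up values

for role, elo in [("Guard", guard_elo), ("Houdini", houdini_elo)]:
    slope, intercept = np.polyfit(general_elo, elo, deg=1)
    print(f"{role}: slope = {slope:.2f}, intercept = {intercept:.0f}")
```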

In conclusion, the researchers introduced a framework to quantify the probability of successful oversight as a function of overseer and overseen system capabilities. While the oversight settings aim for realism, they remain simplified approximations. Each game is fundamentally constrained to simulating single-step oversight, and the results do not immediately generalize to multiple steps. Future research could explore multi-stage oversight chains, analogous to successively fine-tuning stronger models on the behaviors of weaker ones. Another significant limitation is the modeling of deceptive behavior: the current approach prompts models to act deceptively rather than using fine-tuning or reinforcement learning to develop genuinely deceptive models.
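For intuition on why an optimal number of oversight levels can exist at all, here is a rough sketch under simplifying assumptions that are not the paper's actual derivation: a fixed hypothetical Guard advantage, equal Elo hops per step, and independent steps. More steps shrink each per-step capability gap, but every step must succeed for the chain to succeed.

```python
GUARD_BONUS = 200.0  # hypothetical built-in Elo advantage for overseers

def step_success(general_gap: float) -> float:
    """Elo win probability for a Guard overseeing a Houdini whose general
    Elo is `general_gap` points higher, given the assumed Guard bonus."""
    effective_gap = general_gap - GUARD_BONUS
    return 1.0 / (1.0 + 10 ** (effective_gap / 400.0))

def chain_success(total_gap: float, n_steps: int) -> float:
    """Probability that every step of an n-step oversight chain succeeds."""
    return step_success(total_gap / n_steps) ** n_steps

total_gap = 800.0  # untrusted model is 800 general-Elo points stronger
for n in range(1, 7):
    print(f"{n} step(s): P(chain succeeds) = {chain_success(total_gap, n):.3f}")
# With these made-up numbers the probability peaks at an intermediate number
# of steps, which is the kind of optimum the NSO analysis characterizes.
```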


Check out the Paper.



