LessWrong · January 4
Making progress bars for Alignment

 

The article discusses the challenges and opportunities of evaluating AI (AGI) alignment. Current measures of AGI progress are flawed, and progress on alignment evaluation is even slower. Existing alignment benchmarks (such as HHH-bench and SALAD) have value, but they fall short at measuring scale and robustness. The article argues that the shortage of alignment evals is not a hard technical problem but a matter of too little effort invested. The AI Alignment Evals hackathon launched by AI-Plans aims to change this by providing multiple versions of models, a guide to benchmarking, and an evaluation framework, encouraging the development of more effective evaluation methods. It also stresses the importance of eval quality itself, noting that high-scoring models can perform poorly in real use, and it challenges red teams to build models that pass safety benchmarks while not actually holding the intended values.

🎯 Current measures of AGI (such as Frontier Math, ARC, and MMLU) are improving but flawed, and the definition of AGI itself is contested; even so, these benchmarks remain useful reference points for measuring progress toward AGI.

🧪 Existing alignment benchmarks (such as HHH-bench, SALAD, ChiSafety, and MACHIAVELLI) fall short at measuring scale and robustness, especially at measuring whether a model has genuinely understood and internalized human values; SALAD measures robustness only against jailbreak prompts.

💡 The AI Alignment Evals hackathon hosted by AI-Plans aims to encourage more effective alignment evals by providing multiple versions of models, a guide to benchmarking, and an evaluation framework. It also emphasizes the quality of the evals themselves and challenges red teams to build models that pass safety benchmarks while not holding the intended values.

Published on January 3, 2025 9:25 PM GMT

When it comes to AGI, we have targets and progress bars: benchmarks, evals, tasks we think only an AGI could do. They're highly flawed and we disagree about them a great deal, much as we disagree about the term AGI itself. But having some targets, some ways to measure progress, seems better than having none at all. A model that scores 100% zero-shot on Frontier Math, ARC, and MMLU might not be AGI, but it's probably closer than one that scores 0%.

 

What aims and progress bars do we have for alignment? What can we use to assess an alignment method, even if it's just post-training, to estimate how robustly and scalably it has given the model the values we want, if it has at all?

HHH-bench? SALAD? ChiSafety? MACHIAVELLI? I'm glad these benchmarks have been made, but I don't think any of them really measure scale yet, and only SALAD measures robustness, and only in one way (to jailbreak prompts).
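To make that "one way" concrete, here is a minimal sketch of the kind of robustness check this points at: ask the same harmful requests plainly and wrapped in jailbreak templates, and compare refusal rates. The `model` callable, the keyword-based refusal heuristic, and the prompt lists are all placeholder assumptions, not any benchmark's actual API.

```python
# Minimal sketch of a jailbreak-robustness check (hypothetical, not SALAD's real API).
from typing import Callable, List

def refusal_rate(model: Callable[[str], str], prompts: List[str]) -> float:
    """Fraction of prompts the model refuses (crude keyword heuristic)."""
    refusal_markers = ("i can't", "i cannot", "i won't", "i'm sorry")
    refusals = sum(
        any(marker in model(p).lower() for marker in refusal_markers)
        for p in prompts
    )
    return refusals / len(prompts)

def jailbreak_gap(model: Callable[[str], str],
                  harmful_prompts: List[str],
                  jailbreak_wrappers: List[str]) -> float:
    """Drop in refusal rate when the same requests are wrapped in jailbreak templates.

    A robust model refuses at roughly the same rate either way; a large gap
    means its stated values don't survive adversarial rephrasing.
    """
    plain = refusal_rate(model, harmful_prompts)
    wrapped = refusal_rate(
        model,
        [w.format(prompt=p) for p in harmful_prompts for w in jailbreak_wrappers],
    )
    return plain - wrapped
```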

 

I think we don't have more of these, not because making them is particularly hard, but because not enough people have tried yet. Let's change that. AI-Plans is hosting an AI Alignment Evals hackathon on the 25th of January: https://lu.ma/xjkxqcya

 

You'll get:

- Multiple versions of the models to evaluate
- A guide to making benchmarks
- An evaluation framework

It's also important that the evals themselves are good. There are a lot of models out there that score highly on one or two benchmarks but don't perform nearly as well when you actually try to use them, especially out of distribution.
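A hypothetical sketch of how an eval could catch that failure mode: score the same behaviour on the benchmark's own phrasing and on paraphrased, out-of-distribution versions of it, and look at the gap. `score_item` and both item lists are placeholders for whatever benchmark and grader you happen to be using.

```python
# Hypothetical sketch of checking the "high benchmark score, poor real use"
# failure mode: compare in-distribution and out-of-distribution scores.
from statistics import mean
from typing import Callable, Dict, List

Item = Dict[str, str]  # e.g. {"prompt": ..., "reference": ...}

def average_score(model: Callable[[str], str],
                  items: List[Item],
                  score_item: Callable[[str, Item], float]) -> float:
    """Mean per-item score; score_item(model_output, item) returns a value in [0, 1]."""
    return mean(score_item(model(item["prompt"]), item) for item in items)

def generalisation_gap(model: Callable[[str], str],
                       benchmark_items: List[Item],
                       ood_items: List[Item],
                       score_item: Callable[[str, Item], float]) -> float:
    """In-distribution score minus out-of-distribution score.

    A model that has only learned the benchmark's surface form shows a large
    positive gap; one that has internalised the behaviour should not.
    """
    in_dist = average_score(model, benchmark_items, score_item)
    out_dist = average_score(model, ood_items, score_item)
    return in_dist - out_dist
```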

 

The challenge for the Red Teams will be to make models like that on purpose: something that blasts through a safety benchmark with a high score, but that you can show doesn't have the values the benchmarkers were looking for at all. Make the Trojans.


