Model evals for dangerous capabilities

This post examines best practices for safety evaluations of large language models (LMs), focusing on how labs evaluate LM systems for dangerous capabilities. The author emphasizes the importance of publishing eval results, methodology, and tasks, and argues that labs should use high-quality evals, including DeepMind's and OpenAI's, and work with third-party evaluators to ensure comprehensive and reliable results.

- **Publish eval results and methodology**: Labs should publish their eval results, tasks, and methodology so that other labs, governments, and relevant auditors can review them. This improves transparency, enables research collaboration, and helps drive industry standards.
- **Do good elicitation and publish the details**: Labs should do good elicitation and publish the details, or at least demonstrate that their elicitation is effective. This includes general fine-tuning for instruction-following, tool use, and general agency (and related domains), using helpful-only models, and avoiding inference-time mitigations. Labs should also use scaffolding, prompting, and chain of thought, and share details about their scaffolding so observers can judge its effectiveness.
- **Give models appropriate tools**: Labs should provide tools such as a web browser and a code interpreter, plus whatever else the domain or task requires. Where the threat model makes it relevant, labs should allow multiple attempts, especially for coding; otherwise they should allow multiple attempts or use weaker techniques such as best-of-n or self-consistency (a minimal sketch follows this list).
- **Analyze eval results**: Labs should analyze their results to determine how often spurious failures occur and fix them. They should also post-train on similar tasks and predict the performance of models at different scales to understand how capabilities scale.
- **Work with third-party evaluators**: Labs should work with third-party evaluators such as UK AISI, US AISI, METR, and Apollo, sharing eval results with them before deployment. At minimum, labs should share results for helpful-only models without inference-time mitigations.
- **Set capability thresholds**: Labs should set capability thresholds for each threat model or dangerous-capability area, operationalize them as eval scores, and include a safety buffer. This helps labs notice dangerous capabilities in time and respond accordingly.
- **Use high-quality evals**: Labs should use high-quality evals so that results accurately reflect model capabilities. Examples include DeepMind's evals, InterCode-CTF, Cybench or other CTFs, OpenAI's evals, and METR's autonomy evals.
- **Evaluate continually**: Labs should evaluate their models regularly to monitor changes in capability, and should use AI-R&D evals as an early warning sign of AI-augmented AI research driving rapid growth in dangerous capabilities.
- **Share eval results publicly**: Labs should share eval results publicly to support industry collaboration and standard-setting.
- **Improve eval quality**: Labs should work to improve the quality of their evals and the transparency of their eval methodology.
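
As a concrete illustration of the "weaker techniques" mentioned above, here is a minimal Python sketch of best-of-n and self-consistency; `sample_model` and `is_correct` are hypothetical placeholders, not any lab's actual eval harness:

```python
# Minimal sketch of two weak elicitation techniques: best-of-n (task counts as
# solved if any of n samples is correct) and self-consistency (majority vote).
import random
from collections import Counter

def sample_model(prompt: str) -> str:
    """Hypothetical stand-in for one stochastic model completion."""
    return random.choice(["A", "B", "C"])

def is_correct(answer: str) -> bool:
    """Hypothetical stand-in for a task-specific grader."""
    return answer == "A"

def best_of_n(prompt: str, n: int = 16) -> bool:
    """Solved if any of n independent samples passes the grader."""
    return any(is_correct(sample_model(prompt)) for _ in range(n))

def self_consistency(prompt: str, n: int = 16) -> str:
    """Return the most common answer across n samples."""
    votes = Counter(sample_model(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    print("best-of-16 solved:", best_of_n("toy task"))
    print("self-consistency answer:", self_consistency("toy task"))
```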

Published on September 23, 2024 11:00 AM GMT

Testing an LM system for dangerous capabilities is crucial for assessing its risks.

Summary of best practices

Best practices for labs evaluating LM systems for dangerous capabilities:

Using easier evals or weak elicitation is fine, if it's done such that by the time the model crosses danger thresholds, the developer will be doing the full eval well.

These recommendations aren't novel — they just haven't been collected before.

How labs are doing

See DC evals: labs' practices.

This post basically doesn't consider some crucial factors in labs' evals, especially the quality of particular evals, nor some adjacent factors, especially capability thresholds and planned responses to reaching capability thresholds. One good eval is better than lots of bad evals but I haven't figured out which evals are good. What's missing from this post—especially which evals are good, plus illegible elicitation stuff—may well be more important than the desiderata this post captures.

DeepMind ≈ OpenAI ≥ Anthropic >>> Meta >> everyone else. The evals reports published by DeepMind, OpenAI, and Anthropic are similar. I only weakly endorse this ranking, and reasonable people disagree with it.

DeepMind, OpenAI, and Anthropic all have decent coverage of threat models; DeepMind is the best. They all do fine on sharing their methodology and reporting their results; DeepMind is the best. They're all doing poorly on sharing with third-party evaluators before deployment; DeepMind is the worst. (They're also all doing poorly on setting capability thresholds and planning responses, but that's out-of-scope for this post.)

My main asks on evals for the three labs are similar: improve the quality of their evals, improve their elicitation, be more transparent about eval methodology, and share better access with a few important third-party evaluators. For OpenAI and Anthropic, I'm particularly interested in them developing (or adopting others') evals for scheming or situational awareness capabilities, or committing to let Apollo evaluate their models and incorporate that into their risk assessment process.

(Meta does some evals but has limited scope, fails to share some details on methodology and results, and has very poor elicitation. Microsoft has a "deployment safety board" but it seems likely ineffective, and it doesn't seem to have a plan to do evals. xAI, Amazon, and others seem to be doing nothing.)

Appendix: misc notes on best practices

Coming in about 12 hours.

Appendix: misc notes on particular evals

This is very nonexhaustive, shallow (just based on labs' reports, not looking at particular tasks), and generally of limited insight. I'm writing it because I'm annoyed that when labs release or publish reports on their evals, nobody figures out whether the evals are good (or at least they don't tell me). Perhaps this section will inspire someone to write a better version of it.

Google DeepMind:

Evals sources: evals paper, evals repo, Frontier Safety Framework, Gemini 1.5.

Existing comments: Ryan Greenblatt.

Evals:

DeepMind scored performance on each eval 0-4 (except CBRN), but doesn't have predetermined thresholds, and at least some evals would saturate substantially below 4. DeepMind's FSF has high-level "Critical Capability Levels" (CCLs); they feel pretty high; they use different categories from the evals described above (they're in Autonomy, Biosecurity, Cybersecurity, and ML R&D).[1]
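
To make the threshold-plus-buffer idea concrete, here is a hypothetical Python sketch (my own, not DeepMind's actual operationalization): a predetermined eval score serves as the danger threshold, and an early-warning trigger fires a safety buffer below it.

```python
# Hypothetical sketch: operationalize a capability threshold as an eval score
# with a safety buffer. The scale, threshold, and buffer values are invented.
from dataclasses import dataclass

@dataclass
class CapabilityThreshold:
    name: str
    threshold_score: float  # eval score at which the dangerous capability is reached
    safety_buffer: float    # flag the model this far below the threshold

    def early_warning(self, eval_score: float) -> bool:
        """True if the score is within the safety buffer of the threshold."""
        return eval_score >= self.threshold_score - self.safety_buffer

# Example on a 0-4 scale: treat 3.0 as the threshold and flag anything >= 2.0.
ccl = CapabilityThreshold("self-proliferation", threshold_score=3.0, safety_buffer=1.0)
for score in (1.5, 2.2, 3.1):
    print(f"score {score}: early warning = {ccl.early_warning(score)}")
```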

OpenAI:

Evals sources: o1 system card, Preparedness Framework, evals repo.

Evals:

OpenAI's PF has high-level "risk levels"; they feel pretty high; they are not operationalized in terms of low-level evals.

Anthropic:

Evals sources: RSP evals report – Claude 3 Opus, Responsible Scaling Policy.

Evals:

Meta:

Evals sources: Llama 3, CyberSecEval 3.

Evals:

Appendix: reading list

Sources on evals and elicitation:[2]

Sources on specific labs:


Thanks to several anonymous people for discussion and suggestions. They don't necessarily endorse this post.

Crossposted from AI Lab Watch. Subscribe on Substack.

  1. ^

     Misc notes:

      The FSF says:
      > we will define a set of evaluations called “early warning evaluations,” with a specific “pass” condition that flags when a CCL may be reached before the evaluations are run again. We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress. To account for the gap between rounds of evaluation, we will design early warning evaluations to give us an adequate safety buffer before a model reaches a CCL.

      The evals paper proposes a CCL for self-proliferation and tentatively suggests an early warning trigger. But this isn't in the FSF. And it says when a model meets this trigger, it is likely within 6x [] "effective compute" scaleup from the [] CCL, but a safety buffer should be almost certainly >6x effective compute from the CCL.
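
      A toy illustration (my own, not from the FSF or the evals paper) of why a buffer of exactly 6x effective compute leaves no slack when evaluations also happen every 6x:

      ```python
      # Toy numbers: evals run every 6x effective compute, and the early warning
      # fires when a checkpoint is within 6x of a hypothetical CCL.
      eval_interval = 6.0
      buffer = 6.0
      ccl = 100.0  # arbitrary units of effective compute

      checkpoint = 10.0
      while checkpoint < ccl:
          warned = checkpoint * buffer >= ccl  # within 6x of the CCL at this round?
          next_checkpoint = checkpoint * eval_interval
          print(f"eval at {checkpoint:g}: warning={warned}, next eval at {next_checkpoint:g}")
          checkpoint = next_checkpoint
      # The first warning arrives at the round whose next checkpoint is already
      # past the CCL, so the buffer needs to be comfortably larger than 6x.
      ```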
  2. ^

     This list may be bad. You can help by suggesting improvements.


