Twitter thread on AI safety evals

Published on July 31, 2024 12:18 AM GMT

Epistemic status: raising concerns, rather than stating confident conclusions.

I’m worried that a lot of work on AI safety evals matches the pattern of “Something must be done. This is something. Therefore this must be done.” Or, to put it another way: I judge eval ideas on 4 criteria, and I often see proposals which fail all 4. The criteria:

1. Possible to measure with scientific rigor.

Some things can be easily studied in a lab; others are entangled with a lot of real-world complexity. If you predict the latter (e.g. a model’s economic or scientific impact) based on model-level evals, your results will often be BS.

(This is why I dislike the term “transformative AI”, by the way. Whether an AI has transformative effects on society will depend hugely on what the society is like, how the AI is deployed, etc. And that’s a constantly moving target! So TAI is a terrible thing to try to forecast.)

Another angle on “scientific rigor”: you’re trying to make it obvious to onlookers that you couldn’t have designed the eval to get your preferred results. This means making the eval as simple as possible: each arbitrary choice adds another avenue for p-hacking, and they add up fast.
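To make “they add up fast” concrete, here is a minimal Monte Carlo sketch (my illustration, not from the thread): model each arbitrary design choice as one extra null comparison tested at alpha = 0.05, and watch how quickly the chance of a spurious “significant” result compounds.

```python
# Illustrative sketch (assumption: each arbitrary eval design choice,
# e.g. prompt template, scoring rubric, or cutoff, gives one extra
# chance to find a "significant" result at alpha = 0.05 on pure noise).
import random

def false_positive_rate(n_choices: int, alpha: float = 0.05,
                        trials: int = 100_000) -> float:
    """Probability that at least one of n_choices null comparisons
    comes out 'significant' purely by chance."""
    hits = sum(
        any(random.random() < alpha for _ in range(n_choices))
        for _ in range(trials)
    )
    return hits / trials

for k in (1, 5, 10, 20):
    print(f"{k:2d} arbitrary choices -> "
          f"~{false_positive_rate(k):.0%} chance of a spurious finding")

# Analytically this is 1 - (1 - alpha)**k: roughly 40% at k = 10
# and 64% at k = 20. Keeping the eval simple keeps this near 5%.
```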

(Paraphrasing a different thread): I think of AI risk forecasts as basically guesses, and I dislike attempts to make them sound objective (e.g. many OpenPhil worldview investigations). There are always so many free parameters that you can get basically any result you want. And so, in practice, they often play the role of laundering vibes into credible-sounding headline numbers. I'm worried that AI safety evals will fall into the same trap.

(I give Eliezer a lot of credit for making roughly this criticism of Ajeya's bio-anchors report. I think his critique has basically been proven right by how much people have updated away from 30-year timelines since then.)

2. Provides signal across scales.

Evals are often designed around a binary threshold (e.g. the Turing Test). But this restricts the impact of the eval to a narrow time window around hitting it. Much better if we can measure (and extrapolate) orders-of-magnitude improvements.
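As a sketch of the difference (illustrative numbers, not from the thread): a threshold eval stays silent until it trips, while a continuous score lets you fit and extrapolate a trend across scales.

```python
# Illustrative sketch: binary-threshold eval vs. continuous score.
# All numbers are made up; a real eval would likely fit in logit
# space rather than raw fractions.
import math

compute = [1e21, 1e22, 1e23, 1e24]   # training FLOP (hypothetical)
scores  = [0.02, 0.08, 0.30, 0.55]   # fraction of tasks solved (hypothetical)

# Binary framing: no signal, no signal, no signal... then saturated.
THRESHOLD = 0.5
binary = [s >= THRESHOLD for s in scores]   # [False, False, False, True]

# Continuous framing: least-squares trend against log10(compute),
# which can be extrapolated an order of magnitude ahead.
xs = [math.log10(c) for c in compute]
n = len(xs)
mx, my = sum(xs) / n, sum(scores) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, scores))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

pred = slope * 25 + intercept   # predicted score at 1e25 FLOP
print("binary signal:", binary)
print(f"extrapolated score at 1e25 FLOP: {pred:.2f}")
```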

3. Focuses on clearly worrying capabilities.

Evals for hacking, deception, etc. track widespread concerns. By contrast, evals for things like automated ML R&D are only worrying to people who already believe in AI x-risk. And even they don’t think such capabilities are necessary for risk.

4. Motivates useful responses.

Safety evals are for creating clear Schelling points at which action will be taken. But if you don’t know what actions your evals should catalyze, it’s often more valuable to focus on fleshing that out. Often nobody else will!

In fact, I expect that things like model releases, demos, warning shots, etc, will by default be much better drivers of action than evals. Evals can still be valuable, but you should have some justification for why yours will actually matter, to avoid traps like the ones above. Ideally that justification would focus either on generating insight or being persuasive; optimizing for both at once seems like a good way to get neither.

Lastly: even if you have a good eval idea, actually implementing it well can be very challenging.

Building evals is scientific research; and so we should expect eval quality to be heavy-tailed, like most other science. I worry that the fact that evals are an unusually easy type of research to get started with sometimes obscures this fact.


