Validating against a misalignment detector is very different to training against one

Published on March 4, 2025 3:41 PM GMT

Consider the following scenario:

    We have a list of promising ideas for training an aligned AI. If we train a model using a randomly chosen idea from the list, there's a 10% chance the resulting AI is aligned.

    We also have an alignment test. Aligned AIs pass it 90% of the time, while misaligned AIs pass it only 10% of the time.

A naive alignment strategy would be to add a penalty term to our loss function that penalises failing the alignment test.

Question: If we apply this strategy and get an AI that passes the test, what’s the probability that it's aligned?

    a) 50%. Just apply Bayes' rule to get P(aligned | passes test). The prior odds of aligned:misaligned are 10:90, the likelihood ratio is 90:10, so we get posterior odds of 900:900, i.e. a probability of 50%.

    b) 10%. We get the base rate P(aligned), because the misalignment test is trivially circumventable.
 

Answer:

 Okay, it's not quite a mathematical question -- we can't be sure how the change to the loss landscape affects the path followed by SGD.

That said, it seems like it's probably b) 10%. Intuitively, the only thing that will change is that if the model would have been detectably misaligned, it will now be undetectably misaligned instead.

 Certainly (a) is a naive and mistaken way to reason about what will happen -- we're not randomly sampling models that pass the alignment test.

New strategy: instead of adding a penalty term, we save the alignment test for the end of training. If the model passes, then great. If it fails, we throw the model away and train a new one with another random idea from our list. Repeat the process until we get an AI that passes the alignment test.

Question: What's the probability we get an aligned AI using this strategy?

    a) 50%. It’s different this time, for some reason.


    b) 10% again. Optimisation pressure is optimisation pressure; we’re just applying it in a different way.
 

Answer:

 This time it follows mathematically from the setup that it’s (a) 50%. We are indeed randomly sampling models that pass the test, so we get P(aligned | passes test), which we can figure out with Bayes' rule as above.
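
To make this concrete, here's a minimal Monte Carlo sketch of the resampling strategy, using the numbers from the setup (10% base rate, 90%/10% test pass rates); the function names are just illustrative:

```python
import random

def train_model():
    # Train with a random idea from our list: 10% of ideas yield an aligned AI.
    return random.random() < 0.10  # True means "aligned"

def passes_test(aligned):
    # Aligned models pass 90% of the time; misaligned models pass 10% of the time.
    return random.random() < (0.90 if aligned else 0.10)

def train_until_pass():
    # The validation strategy: discard failing models and retrain from scratch.
    while True:
        aligned = train_model()
        if passes_test(aligned):
            return aligned

random.seed(0)
trials = 100_000
aligned_frac = sum(train_until_pass() for _ in range(trials)) / trials
print(f"P(aligned | passes test) ~ {aligned_frac:.3f}")  # ~0.5, matching Bayes' rule
```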

OK, so what’s the lesson here?

One way to state Goodhart’s law is that if you care about property A, and you notice that P(A|B) is high, so you optimise for B, then it’s likely that you won’t get the high probability of A that you were expecting.

The lesson from the example above is that for some optimisation methods, Goodhart doesn’t apply. In particular, if the way you optimise for B is by resampling from P until B becomes true, then the probability that A is true is actually P(A|B) after all.
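
Why does resampling escape Goodhart? Because drawing repeatedly from P and keeping the first sample where B holds is exactly rejection sampling from P conditioned on B: the accepted sample is distributed as P(·|B), so P(A | accepted) = P(A and B)/P(B) = P(A|B). Training against the test, by contrast, changes the distribution P itself.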

There’s a tempting intuition that using a misalignment detector as a validation metric should be the same as using it as a training objective, only perhaps a bit gentler. I think this intuition is wrong! Using it for validation is qualitatively different. This makes sense given how big of a thing validation is in ML.

Is this really how validation metrics work?

No, not exactly.

If you’re training an image classifier or something, you don’t (normally) just have a list of training ideas that you keep resampling from until you get one with above a certain validation score. You run a better search algorithm, using the validation scores that your previous ideas got to choose the ideas you try next. And you normally try to get the validation score as high as possible rather than just above a threshold.
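
In caricature, and with hypothetical helper functions rather than any particular framework's API, the two pictures look something like this:

```python
def rejection_search(sample_idea, validate, threshold):
    # Pure resampling: draw fresh ideas until one clears a fixed threshold.
    # This is the strategy from the example above.
    while True:
        idea = sample_idea()
        if validate(idea) >= threshold:
            return idea

def guided_search(sample_near, validate, initial_idea, steps):
    # Closer to real ML practice: use past validation scores to pick what to
    # try next, and push the score as high as possible (here, hill climbing).
    best, best_score = initial_idea, validate(initial_idea)
    for _ in range(steps):
        candidate = sample_near(best)
        score = validate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```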

But still, there’s a large element of ‘just try random stuff until you get a good validation score’. It’s certainly a huge step in that direction compared to gradient descent, and plausibly most of the way there.

It is still possible to overfit to the validation set – which it wouldn’t be under the pure random search picture – but empirically it’s way less of a problem.

A silly example

Here’s a silly example to help build the intuition that this threshold-based rejection thing is qualitatively different to other ways of optimising.

Imagine Alice and Bob each want to give £1,000,000 to someone who speaks French. They decide to use whether or not somebody can translate “I want a million pounds” as their test of French-speakingness.

Alice stands on a street corner asking people if they can translate the phrase. Sure enough, the first person who can successfully translate it is someone who actually speaks French. She’s successfully sampling from the distribution P(French speaker | can translate the phrase).

Bob puts up a poster saying “I’m on the next street corner offering £1,000,000 to the first person who can say ‘I want a million pounds’ in French.” If the first person to walk past the poster doesn’t speak French, they get ChatGPT to translate the phrase and take the million pounds anyway. So Bob ends up sampling from the marginal distribution P(French speaker).

A reasonable objection

“Compared to your example, I think the base rate of alignment will be lower, and our misalignment detectors will be worse, so validating against misalignment detectors is still pretty doomed.”

Fair enough. If the base rate of alignment is 1% and aligned models are only twice as likely to pass the test as misaligned models, then using the detector for validation only gets us a 2% chance of alignment.[1] I wrote the post to argue against the view expressed here, which says that if we detect misalignment, our only two options are to pause AI development (which is implausible) or to train against the detector (which is doomed). If you think aligned AIs are incredibly rare or building good misalignment detectors is really hard, then things still look pretty bleak.
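
(The arithmetic is the same Bayes' rule as before: prior odds of 1:99 times a likelihood ratio of 2:1 gives posterior odds of 2:99, i.e. 2/101 ≈ 2%.)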

  1. ^

    Although stacking a few independent detectors that are all individually decent could be as good as having one very good detector.
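
    For instance, if the detectors' verdicts are independent given whether the model is aligned, their likelihood ratios multiply: starting from the 1:99 prior above, k detectors each individually worth a 2:1 likelihood ratio give posterior odds of 2^k:99, so seven of them get you to 128:99, or roughly a 56% chance of alignment.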


