Stopping unaligned LLMs is easy!

The article explores how to detect bad behaviour from superintelligent LLMs. It makes some reasonable assumptions and argues that LLM output can be checked by feeding it to a simpler LLM, and that the whole system can be stopped if verification fails. It also discusses objections, cost, and how to get AI labs to implement this.

🎯 LLMs will not be superintelligent when predicting a single token

💡 Superintelligence emerges when an LLM is run continuously, using its earlier output as a scratchpad to solve harder problems

🔍 Feed LLM output to a simpler LLM to detect bad behaviour; if verification fails, stop the system

💰 Checking with a GPT-3-level verifier is cheap, with little impact on the cost of a superintelligent AI

Published on February 3, 2025 3:38 PM GMT

Or at least it might be.

Status: butterfly idea; I have only a vague understanding of the internals of LLMs

I have no idea how to align superintelligent LLMs. But I think it's quite likely we'll easily be able to catch them in the act and turn them off before they do anything too dastardly.

I'm going to make the following reasonable (but by no means certain) assumptions:

1. LLMs will never be superintelligent when predicting a single token. Superintelligence emerges when you continuously run an LLM in serial, where it's able to use its earlier output as a scratchpad to solve more complicated problems.

2. LLMs will have no state. Minus some technical complexities, predicting 10 tokens is no different from predicting one token, using that as input to predict the next token, and so on recursively (see the sketch after this list).

3. Superintelligent systems based on LLMs may use multiple LLMs in all sorts of complex configurations, but not in a way that breaks the above assumptions about LLMs, and not in a way where emergent behaviour of the system as a whole fails to be reflected in the outputs of the constituent LLMs.
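
As a toy illustration of the statelessness assumption, here's a minimal sketch in Python. The `predict_next_token` function is a hypothetical stand-in for whatever LLM API you use, not a real interface: the point is just that generating N tokens is N single-token calls, where the only information carried forward is the text itself.

```python
# Hypothetical stateless single-token predictor; stands in for any real LLM API.
def predict_next_token(prompt: str) -> str:
    raise NotImplementedError  # placeholder, not a real implementation

# Generating n tokens is just n single-token passes chained through plaintext.
def generate(prompt: str, n_tokens: int) -> str:
    text = prompt
    for _ in range(n_tokens):
        token = predict_next_token(text)  # no hidden state survives between calls
        text += token                     # the text itself is the only "scratchpad"
    return text
```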

If so, it seems that all you need to do to catch any unwanted behaviour from a superintelligent system is to feed all output from the constituent LLMs to a simpler LLM that flags output which looks like it's leading towards unaligned behaviour. Only once the output has been verified is it passed on to the next system (including being looped back into the same LLM to output more tokens). If it fails verification, immediately stop the whole system.
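
Here's a minimal sketch of that gate, continuing the hypothetical `predict_next_token` from the sketch above and assuming a `verify` function backed by a cheaper LLM (both names are illustrative, not any real API): every pass's output is checked before it is fed anywhere, including back into the model itself.

```python
class UnalignedOutputDetected(Exception):
    """Raised to halt the whole system when a pass fails verification."""

# Hypothetical cheap-LLM classifier: returns True if the candidate text looks benign.
def verify(candidate_text: str) -> bool:
    raise NotImplementedError  # placeholder, not a real implementation

def guarded_generate(prompt: str, n_tokens: int) -> str:
    text = prompt
    for _ in range(n_tokens):
        token = predict_next_token(text)
        # Check each pass before its output becomes input to anything else.
        if not verify(text + token):
            raise UnalignedOutputDetected("stopping the system")
        text += token
    return text
```

In a real multi-LLM system the same check would sit on every edge where one LLM's output becomes another's input; in practice you'd likely verify chunks of output between calls rather than single tokens.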

Objections

But won't the superintelligent AI be able to trick the verifier?

The superintelligent AI definitely could trick the verifier. But the verifier isn't checking the superintelligent AI, it's checking each individual pass of the LLM. Since each individual pass is not superintelligent, if the AI decided to trick the verifier it would have to decide to do so statefully - in the LLM output. So the verifier can see the AI's decision before that decision has been passed anywhere, and can shut down the superintelligent AI.

Effectively, the verifier acts before the system has become superintelligent.

But won't the superintelligent AI prevent itself being shut down?

Same answer. It definitely could, but it would have to decide to do so, and by then the verifier will already have shut it down.

Akkkshully LLMs generate multiple tokens per pass, and what if they use multi headed quantronmium flip flopping dialescretic feedback.

The exact details don't matter. So long as LLMs:

a) continue to be called serially, with the output of one run the input to the next

b) each pass is (relatively) shallow and unlikely to be superintelligent on its own

c) the only information passed from one run to the next is plaintext

the basic idea should work.

Won't this be expensive?

My assumption is that this would probably work even if you use GPT-3-level verifiers to check the text, and those should be so cheap compared to whatever model you're using for your superintelligent AI that they barely impact the cost at all. I believe the leading AI labs already use LLMs to verify output from their APIs, and that's where safety ratings come from (I could be wrong).

Even if you need more powerful LLMs, you shouldn't need ones more powerful than the one being verified, so at worst you double your cost (you avoid superintelligent verifiers by restricting them to a single pass - after all, they're only verifying a single pass).
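
A back-of-envelope sketch with made-up illustrative prices (assumed numbers, not quoted from any provider) shows why a GPT-3-level verifier barely moves the total: if the verifier reads roughly the same text the main model does, the overhead is about the ratio of their per-token prices.

```python
# Illustrative numbers only; assumed for the sake of the estimate.
frontier_price_per_token = 1.00   # cost units for the (hypothetical) superintelligent model
verifier_price_per_token = 0.02   # cost units for a GPT-3-level verifier

overhead = verifier_price_per_token / frontier_price_per_token
print(f"verification overhead: ~{overhead:.0%}")  # ~2% with these made-up prices

# Worst case from the paragraph above: the verifier is as expensive as the model
# it checks, so the total cost at most doubles.
```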

How do we persuade AI labs to do this?

I believe (but could be wrong) that they already use LLMs to check API output for safety. This isn't much different. It's very likely they'll start doing this anyway, and even if not, only a minor push should be required.

One of your assumptions is wrong!

Very possible, please tell me why in the comments!



