少点错误 2024年08月05日
Near-mode thinking on AI
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

文章探讨了作者在AI方面的思考转变,包括对AI风险的近模式思考、预测失误的反思、对超人类能力的新认识以及自动化对齐研究等内容。

文章指出过去对AI的‘欺骗性对齐/谋划’威胁模型过于自信,现在认识到需要仔细思考其先决条件,如模型对输入的理解、对训练过程的认知等。

作者通过DeepMind在国际数学奥林匹克竞赛中的表现,反思自己预测的错误,意识到不能仅从远模式思考,而应从近模式思考具体方法和效果。

作者认为在某些方面达到超人类能力往往比想象中更平凡,如在象棋、杂耍、语言模型的说服力和编程技能等方面。

作者原本对自动化对齐研究持怀疑态度,现在认为存在一些可行的途径,如利用GPT-4进行相关研究,但也存在诸多问题需要考虑。

Published on August 4, 2024 8:47 PM GMT

There is a stark difference between rehearsing classical AI risk 101 arguments about instrumental convergence, and tackling problems like "Design and implement the exact evaluations we'll run on GPT-5 to determine whether it's capable enough that we should worry about it acting aligned until it can execute a takeover". 

And naturally, since I've started working on problems like the one above, I've noticed a large shift in my thinking on AI. I describe it as thinking about risks in near-mode, as opposed to far-mode. 

In this post, I share a few concrete examples about my experiences with this change-of-orientation.

I. Prerequisites for scheming

Continuing with the example from the intro: A year ago I was confident about the "the AI is just playing along with our training and evaluations, until it is in a position where it can take over" threat model (deceptive alignment / scheming) basically being the default outcome and the main source of AI x-risk. I now think I was overconfident.

Past-me hadn't, however, really thought through the prerequisites for scheming. A textbook example of a successful deceptive alignment story, applied to an LLM, paints a picture of a model that:

Now, one may argue whether it's strictly necessary that a model has an explicit picture of the training objective, for example, and revise one's picture of the deceptive alignment story accordingly. We haven't yet achieved consensus on deceptive alignment, or so I've heard. 

It's also the case that, as past-me would remind you, a true superintelligence would have no difficulty with the cognitive feats listed above (and that current models show sparks of competence in some of these). 

But knowing only that superintelligences are really intelligent doesn't help with designing the scheming-focused capability evaluations we should do on GPT-5, and abstracting over the specific prerequisite skills makes it harder to track when we should expect scheming to be a problem (relative to other capabilities of models).[1] And this is the viewpoint I was previously missing.

II. A failed prediction

There's a famous prediction market about whether AI will get gold from the International Mathematical Olympiad by 2025. For a long time, the market was around 25%, and I thought it was too high.

Then, DeepMind essentially got silver from the 2024 IMO, short of gold by one point. The market jumped to 70%, where it has stayed since.

Regardless of whether DeepMind manages to improve on that next year and satisfy all minor technical requirements, I was wrong. Hearing about the news, I (obviously) sat down with pen and paper and thought: Why was I wrong? How could I have thought that faster?

One mistake is that thought it was not-that-likely that the big labs would give a serious attempt at this. But in hindsight, I shouldn't have been shocked that, having seen OpenAI do formal theorem proving and DeepMind doing competitive programming and math olympiad geometry, they just might be looking at the IMO as well.

But for the more important insight: The history of AI is littered with the skulls of people who claimed that some task is AI-complete, when in retrospect this has been obviously false. And while I would have definitely denied that getting IMO gold would be AI-complete, I was surprised by the narrowness of the system DeepMind used.

(I'm mature enough to not be one of those people who dismiss DeepMind by saying that all they did was Brute Force and not Real Intelligence, but not quite mature enough to not poke at those people like this.)

I think I was too much in the far-mode headspace of one needing Real Intelligence - namely, a foundation model stronger than current ones - to do well on the IMO, rather than thinking near-mode "okay, imagine DeepMind took a stab at the IMO; what kind of methods would they use, and how well would those work?"

Even with this meta-level update I wouldn't have in advance predicted that IMO will fall just about now - indeed, I had (half-heartedly) considered the possibility of doing formal theorem proving+RL+tree-search before the announcement - but I would have been much less surprised. I also updated away from a "some tasks are AI-complete" type of view, towards "often the first system to do X will not be the first systems to do Y".[2]

III. Mundane superhuman capabilities

I've come to realize that being "superhuman" at something is often much more mundane than I've thought. (Maybe focusing on full superintelligence - something better than humanity on practically any task of interest - has thrown me off.)

Like:

As a consequence, I now think that the first transformatively useful AIs could look behaviorally quite mundane. (I do worry about later in the game superhuman AIs being better in ways humans cannot comprehend, though.)

IV. Automating alignment research

For a long time, I didn't take the idea of automating alignment research seriously. One reason for my skepticism was that this is just the type of noble good-for-pr goal I would expect people to talk about, regardless of whether it's feasible and going to happen or not. Another reason was that I thought people were talking about getting AIs to do conceptual foundational research like Embedded Agency, which seemed incredibly difficult to me.

Whereas currently I see some actually feasible seeming avenues for doing safety research. Like, if I think about the recent work I've looked at in situational awareness, out-of-context reasoning, dangerous capability evaluations, AI control, hidden cognition and tons of other areas, I really don't see a fundamental reason why you couldn't speed up such research massively. You can think of a pipeline like

And sure enough, this would totally fail for dozens of reasons, there are dozens of things you could do better, and dozens of questions about whether you can do useful versions of this safely or not. I'm also talking about (relatively easily verifiable) empirical research here, which one might argue is not sufficient.

Nevertheless, now that I have this concrete near-mode toy answer to "okay, imagine Anthropic took a stab at automating alignment research; what kind of methods would they use?", it's easier for me to consider the idea of automating alignment research seriously.

  1. ^

    Also, many of the relevant questions are not about pure capability, but also whether the model in fact uses those capabilities in the postulated way, and about murkier things like the developmental trajectory of scheming.

  2. ^

    While keeping in mind that LLMs solved a ton of notoriously hard problems in AI in one swoop, and foundation models sure get lots of different capabilities with scale.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI 思考转变 超人类能力 自动化研究
相关文章