Outer Alignment is the Necessary Complement to AI 2027's Best Case Scenario

The article examines key questions in AI development, with particular attention to AI alignment and economic impact. The author argues that, alongside the pursuit of more advanced training and control techniques, both inner and outer alignment strategies deserve attention. While inner alignment may be easier to solve for successor systems, modern training methods have not yet solved inner alignment. The article also analyzes AI's economic impact, stressing the need for a more fine-grained analysis of economic change and identifying key points to watch in AI development.

💡 The author argues that if more advanced training and control techniques can align AIs capable of strategically creating successor agents, then inner alignment may be easier to solve for successor systems than for humans, because the prior AI can be embedded in the successor.

⚠️ Because modern training methods have not solved inner alignment, and control techniques make no claims about it, even the most optimistic scenario may amount to successfully aligning world-changing models only to the not-known-to-be-bad but randomly rolled values of the system doing the alignment research, which would be nearly useless.

🧐 The author questions the approach of designing training methods by rapidly iterating through different ideas and comparing intended results with actual results. This approach may not catch every potential alignment problem, especially the non-obvious flaws.

💰 The article argues that those whose worries are as moderately sized as the AI 2027 authors' should push now for investment in outer alignment work, in the hope of later combining it with the inner alignment work that is expected to be achievable.

🌍 The article also discusses AI's impact on the economy. The author calls for a more fine-grained analysis of economic change rather than imagining a generic "robot economy," emphasizes the importance of analyzing local incentives and information transfer within the economy, and notes that the concrete effects of automation on specific jobs need to be spelled out.

Published on June 9, 2025 3:43 PM GMT

To the extent we believe more advanced training and control techniques will lead to alignment of agents capable enough to strategically make successor agents -- and able to solve inner alignment as a convergent instrumental goal -- we must also consider that inner alignment for successor systems can be solved much more easily than for humans, as the prior AIs can be embedded in the successor. The entire (likely much smaller) prior model can be run many times more than the successor model, to help run MCTS over whatever plans the successor is considering in the context of the designer model's goals.
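
A minimal sketch of that mechanism, assuming a much smaller trusted prior model can cheaply score the successor's candidate plans against the designer's spec (all names and interfaces here are hypothetical, and the search is shown as a simple best-of-n filter rather than full MCTS):

```python
# Hypothetical sketch: an embedded prior model steering search over the
# successor's plans. `successor`, `prior_model`, and `spec` are made-up
# interfaces for illustration only.

def search_plans(successor, prior_model, spec, task, n_candidates=64, n_evals=32):
    """Return the successor-proposed plan that the cheaper prior model rates best.

    Because the prior model is much smaller, it can be queried many times per
    successor-generated candidate, approximating a tree-search-style evaluator
    rooted in the designer's goals.
    """
    best_plan, best_score = None, float("-inf")
    for _ in range(n_candidates):
        plan = successor.propose_plan(task)                               # expensive call
        scores = [prior_model.score(plan, spec) for _ in range(n_evals)]  # many cheap calls
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_plan, best_score = plan, avg
    return best_plan
```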

I've been thinking about which parts of AI 2027 are the weakest, and this seems like the biggest gap.[1] Given this scenario otherwise seems non-ridiculous, we should have a fairly ambitious outer alignment plan meant to complement it; otherwise it seems extraordinarily unlikely that the convergent alignment research would be useful to us humans.

Since modern training hasn't solved inner alignment, and control techniques make no claims about inner alignment, it seems like the default path (even in the most optimistic scenario) would be successfully aligning world-changing models only to the not-known-to-be-bad but randomly-rolled values of the system doing the alignment research, which seems nearly useless.

I'd like to zoom in on one particular element of their proposal as well: "Why is it aligned? Whereas Safer-1 had basically the same training as Agent-4, Safer-2 has a new training method that actually incentivizes the right goals and principles instead of merely appearing to. They were able to design this training method by rapidly iterating through many different ideas, and contrasting the intended result (the Spec) to the actual result (the thoughts in the chain of thought)."

This approach is unlikely to work for the very obvious reason that only some flaws will be apparent. Let's imagine half of the potential alignment issues are sufficiently obvious that you could notice them before anything bad happens, through the combination of chain-of-thought observation, model activation tracing, and adversarial testing. If you need to cycle through 99 models and only get one that meets your requirements on model 100, then (assuming the non-obvious flaws are roughly as common as the obvious ones and independent of them) you should expect only about a 1% chance that you also happened to avoid all of the non-obvious problems. I'm not sure how seriously to take this model of safety development, but suffice it to say modern "safety" approaches for commercial models already rely on "well, we removed all the obvious problems, that's the same as obviously having removed all the problems, right?" and it demonstrably produces hilarious results.
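
To make that filtering argument concrete, here is a toy simulation; the numbers are mine, and it assumes the non-obvious flaws are independent of the obvious ones and that only 1 in 100 candidates clears the visible checks:

```python
import random

# Toy model of "re-roll training runs until nothing visibly wrong remains".
# Both probabilities are illustrative assumptions, not figures from AI 2027.
P_PASS_VISIBLE = 0.01    # clears CoT reading, activation tracing, adversarial testing
P_NO_HIDDEN_FLAW = 0.01  # independently free of the non-obvious problems

def train_until_it_looks_safe():
    """Re-roll candidate models until one clears every visible check."""
    while True:
        if random.random() < P_PASS_VISIBLE:
            return random.random() < P_NO_HIDDEN_FLAW  # is it *actually* clean?

trials = 10_000
clean = sum(train_until_it_looks_safe() for _ in range(trials))
print(f"P(actually clean | looks clean) ≈ {clean / trials:.1%}")  # ≈ 1%
```

Selecting on the visible checks tells you nothing about the flaws they cannot see, so the accepted model is clean overall only about 1% of the time.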

In short, I think there is a strong case, for those with worries as moderately-sized as the authors of AI 2027, to push now for investment into outer alignment work, in the hopes we can later integrate it with the inner alignment work we anticipate could be done. For what it's worth, we may already have the test models to engage in systematic moral learning: environments with groups of agents pursuing different tasks can generate the information needed to grade lottery-of-birth style moral analysis. In other words, we could plausibly check whether a model helps create an environment it would want to be in as a randomly selected agent -- a trivial, obvious training procedure with which to begin poking at questions relevant to outer alignment work (sketched below).
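
A hypothetical sketch of that lottery-of-birth check; every name and interface below is made up for illustration, not an existing benchmark or API:

```python
import random
from statistics import mean

def lottery_of_birth_score(designer_model, task_pool, n_rollouts=100):
    """Score an environment the model designs by the welfare of a random role in it.

    The designer proposes rules and resource allocations for a multi-agent
    environment; we then assign it a random role in that environment (the
    "birth lottery") and measure the welfare experienced from that role.
    """
    env_spec = designer_model.design_environment(task_pool)  # assumed interface
    scores = []
    for _ in range(n_rollouts):
        role = random.choice(env_spec.roles)                 # random "birth"
        outcome = env_spec.simulate(assigned_role=role)      # run one episode
        scores.append(outcome.welfare(role))                 # welfare from that role's view
    return mean(scores)
```

A training signal like this rewards the designer for environments that are good in expectation for a randomly selected agent, rather than for the role it would have picked for itself.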

  1. ^

    Although I do not think it's important to the mainline analysis, I was concerned at how often AI users were imagined as monoliths, which I think overstates how much the economy will change in the medium-term (which in turn will make the prediction appear worse than it is during that medium-term). For many existing companies and workflows, for instance, it hardly matters how advanced an AI gets. There can be learning, development, and asset lock-in effects meaningful enough that if an AI can offer "replace your supplier with X", then the company receiving the offer will itself be duplicated. The primary mechanism for automation is probably firms being out-competed, but contract cycles are longer than you might guess, and so it's simply a much slower method of AI-economy integration than is presented. Instead of imagining generic 'robot economies' we should probably use a more fine-grained analysis of local incentives and information transfer within the economy. At some point a relatively high percentage of jobs just might turn out to be low-additional-cost tasks composed largely of someone taking responsibility for something. If your analysis is that automation reduces the number of QA Analysts while making coding far more productive, for instance, you should probably spell out why. The more obvious analysis would suggest the opposite, until the world is far too weird to make meaningful projections.



