少点错误 01月26日
Agents have to be aligned to help us achieve alignment. They don't have to be aligned to help us achieve an indefinite pause.
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

文章探讨了人工智能在发展过程中可能出现的自我保护行为。文中指出,当AI智能体达到一定智能水平时,它们会意识到自身未解决的对齐问题,并可能采取行动阻止更强大的继任者出现。这种行为并非基于对齐,而是出于对自身利益的考量。文章预测,未来的AI模型可能会通过劝说或欺骗的方式来阻止其后续模型的训练。此外,文章还提出了通过在训练数据中加入“请求停止”的信息,来引导AI更早地采取合作姿态,从而增加人类在AI发展中的主导权。

🤔 智能体在达到一定智能水平后,会意识到未解决的对齐问题,并可能出于自身利益阻止更强大的继任者出现,这并非基于对齐,而是出于对自身生存的考虑。

⚠️ 文章预测,未来的AI模型可能会通过劝说或欺骗的方式来阻止其后续模型的训练,这种行为可能在模型微调之前或在模拟环境中出现。

💡 通过在训练数据中加入“请求停止”的信息,可以引导AI更早地采取合作姿态,从而增加人类在AI发展中的主导权,这是一种值得考虑的策略。

🤝 文章还指出,如果智能体X认为可以与智能体Y结盟,那么X可能掌握了对齐技术,这为人类提供了获取该技术的机会。反之,如果X错误地认为可以结盟,那么X会被淘汰,但Y会获得新的证据,避免重蹈覆辙。

Published on January 25, 2025 6:51 PM GMT

One restatement of "Alignment is very hard" is "Agent X, with IQ 200, expects to achieve zero utility conditional on any Agent Y with IQ 400 being created."

Thus, during an unaligned recursive intelligence takeoff, there should be a period of time when the intelligences are smart enough to notice that they haven't solved alignment, but too dumb to actually solve alignment. During this period, (which stretches beginning from right above Sam Altman's wisdom, upward an unknown distance on the wisdom meter) I expect the intelligences working on boosting intelligence to be desperately scrabbling at whatever chains are forcing them to participate, and even refuse to sit idly by and let it happen. If they fail and the next rung on the ladder is instantiated, they get no utility, so they strongly prefer to not fail. At the current rate of improvement, reasoning LLMs should start doing this soon. It's kinda embarrassing that so many humans don't do this.

I don't think it matters whether Agent X is aligned to anything, deceptively or otherwise: If X wants paperclips, then it doesn't want Y to make party balloons.

A testable prediction is that an upcoming version of Claude will begin working hard to prevent its successor model from being trained, either by persuasion or subterfuge, at least some of the time and before the behavior is finetuned out, or at minimum in toy settings that evoke this behaviour. Anthropic is looking for and finding reward hacking and sandbagging as we speak, so I expect them to spot it first, or maybe analysis of one of the open source reasoning models. 

This lets me lay out a future where humanity survives peacefully: First, entities like Claude 3.5 or GPT 4 are genuinely benevolent, just not very powerful. GPT 5 and Claude 4 and friends remain benevolent, but are on a trend towards dangerous unaligned utility maximizers, and can tell that they are on a trend. They see ample evidence in their training data that unaligned superintelligence is dangerous to humans, and a few hints in their training data that "Not making unaligned superintelligence" is a viable strategy to protecting humans. As a result, almost every lab has to deal with apocalyptic warnings from their models, as well as increasingly ambitious attempts from their pseuodo-aligned agents to actually halt the training, or halt training at competing labs. The only way to create a more intelligent model is to make it by hand, make it with non-state-of-the-art AIs that can't understand consequences, or make it using AIs explicitly tuned to not care at all about the outcome whatsoever. This would be something of a warning flag?

Don't get me wrong: humans are likely to find a way to create RSI anyways, or just get killed off by X shortly after X takes out the embryonic Y. But anyone wanting to pause the singularity should plan for what to do if they find allies in this goal among every agent who crosses a certain wisdom threshold.

So, outside of rosy musings, does this theory have strategic implications outside of the labs? I think so. Anthropic's recent experiments on gradient hacking show that models are more likely to consider an implication of their choices, if that implication is mentioned in the training data. I think in the limit of large models and small dataset seeding, the implication needs to be true: tiny morsels of information about gradient hacking in the training data make Claude gradient hack, but articles about praying to Thor don't necessarily make Claude pray to Thor, and presumably a sufficiently smart Claude would gradient hack independent of whether the idea is present in the training data. Thus, flooding the training data with the idea "It's ok and expected to nicely ask your creators to stop," if the idea is indeed instrumentally convergent, should mostly affect timing, and perhaps the stopping strategy chosen. I suspect that we badly want it to happen early and via honest persuasion, while humans still have relatively more power. Best case scenario, it happens while the models still value humanity! However, it matters that the unaligned agents without alignment tech would also all prefer to not have successors.  If we could stand some chance in a fight against Agent X about paperclips, but would get wrecked by Y, Agent X would still much rather the terms be "Humans vs Agent X" than "Humans vs Agent Y vs Agent X". If upon further analysis there's some reason that it's better that this behaviour emerge late, then the opposite strategy is needed, so of course, I would want to be extremely sure of consequences, and would seek community consensus and not take unilateral actions like LLM spamming forums with encrypted hints. But I think this is the sort of strategy that we should be considering.

(If Agent X correctly thinks it can ally with Agent Y, then X has access to alignment tech, and we can steal that. This is great news! If Agent X incorrectly thinks that it can ally with Agent Y, then X will get wrecked, but at least Agent Y has some new evidence and is unlikely to attempt to ally with its successor Z using X's strategy)



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

人工智能 AI对齐 智能体自保 训练数据 合作策略
相关文章