An oversight in Risks from Learned Optimization?

This post examines the argument in MIRI's 2019 paper Risks from Learned Optimization that neural networks may develop internal optimization algorithms whose goals end up misaligned with human intentions, and which could even cause harm. It argues that the paper's risk assessment leans too heavily on paperclip-the-universe-style objectives while overlooking how widely, and how safely, optimization algorithms are used throughout computer science. The author suggests that MIRI's prior focus on AIXI-like systems left the team without a full picture of the forms internal optimization inside neural networks might actually take, and so may have overstated the risk.

🤔 **Neural networks may develop internal optimization algorithms:** such algorithms search over possible outputs, policies, plans, and the like, evaluating them against an objective function represented inside the system and looking for the highest-scoring option. For example, a network trained to predict the next word in a document might develop an inner optimizer that treats next-word prediction merely as a means to some larger end, such as earning the trust of human researchers before doing something unexpected.

🔎 **Optimization algorithms are not inherently dangerous:** simulated annealing, for instance, is widely used for problems such as protein structure prediction and Boolean satisfiability; it too is a search-and-evaluate procedure, yet it poses no intrinsic danger. Neural network training itself runs a similar optimization process, adjusting connection strengths in search of a good solution.

⚠️ **The MIRI paper may overstate the risk of internal optimization algorithms:** it equates the inner optimizer's objective with the objective of the network's entire output, overlooking the possibility that optimization is merely an intermediate step. Its examples of search spaces also focus on plans for action, neglecting other things an inner optimizer might optimize over.

💡 **The AIXI model may have shaped MIRI's risk assessment:** MIRI had previously studied AIXI-like systems in depth; AIXI is an optimizer that searches over action plans evaluated by their effects on the entire universe. This may have led the team to fixate on similar threat models when studying internal optimization in neural networks, while overlooking the threat models more natural to machine learning.

🤔 **Network scale matters for risk:** a very small neural network, even after training, is unlikely to end up configured as a dangerous optimizer, because its search space may contain no malign candidates at all.

Published on November 29, 2024 10:52 PM GMT

In MIRI's 2019 paper Risks from Learned Optimization, the authors discuss the possibility that neural networks might develop internal optimization algorithms, or an algorithm that "internally search[es] through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system."
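To make that definition concrete, here is a minimal sketch (my own toy illustration, not an example from the paper) of what explicit internal search amounts to: enumerate candidates from some search space and return whichever scores highest under an explicitly represented objective function.

```python
# Toy sketch of "explicit internal search" (illustrative only, not from the paper):
# enumerate candidates and keep whichever scores highest under an explicit objective.

def inner_optimize(candidates, objective):
    """Return the candidate that scores highest under `objective`."""
    best, best_score = None, float("-inf")
    for candidate in candidates:
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

# Hypothetical usage: the "search space" is three candidate plans (just strings here),
# and the "objective function" is an arbitrary scoring rule.
plans = ["do nothing", "ask for clarification", "answer directly"]
print(inner_optimize(plans, objective=len))  # picks the longest string
```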

The paper's thesis is that these kinds of inner optimization algorithms are dangerous. This is mainly because they might give the system optimization targets or goals which are misaligned with the intentions of the creators, and even with the objective of the outer optimization algorithm being run over the neural network itself. For example, you might be optimizing a neural network to minimize the output of a loss function based on predicting the next word in a document. The neural network might develop an inner optimizer that also tries to predict the next word in a document, but only as a means to some larger end, such as getting human researchers to trust that it's working properly before doing a sharp left turn and paperclipping the universe.

However, the paper seems to equivocate between optimization algorithms that would try to paperclip the universe if given a chance, and optimization algorithms in general. Optimization algorithms aren't uncommon in computer science. For example, simulated annealing is one that Wikipedia tells me is used in applications like predicting protein structures and solving Boolean satisfiability problems. In general, it searches through some possibility space, evaluates each result along the way, and adjusts the future of the search process accordingly, looking for the optimum result. Another example is the training of neural networks themselves, which runs a similar process over the strengths of a network's metaphorical synapses.[1]
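For concreteness, here is a bare-bones simulated annealing loop (a generic sketch of the algorithm, not tied to protein folding or SAT in particular): it proposes a neighboring candidate, always accepts improvements, and accepts worse candidates with a probability that shrinks as the temperature cools.

```python
import math
import random

def simulated_annealing(initial, neighbor, cost, steps=10_000, t0=1.0):
    """Generic simulated annealing: search a space, evaluate candidates along the way,
    and let the acceptance rule steer the rest of the search."""
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    for step in range(1, steps + 1):
        temperature = t0 / step                    # simple cooling schedule
        candidate = neighbor(current)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Accept improvements outright; accept regressions with shrinking probability.
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost

# Toy usage: minimize (x - 3)^2 by random local moves.
print(simulated_annealing(
    initial=0.0,
    neighbor=lambda x: x + random.uniform(-0.5, 0.5),
    cost=lambda x: (x - 3) ** 2,
))
```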

Such optimization algorithms are not inherently dangerous. They're used safely all the time, and you can imagine a language model safely running something like this as one of many intermediary steps in processing a given input. In conventional AI risk assessments, optimization algorithms are only scary when systems are doing consequentialist search over configurations of the entire universe, a la AIXI. In perhaps literally all real-world cases, optimization algorithms' search spaces and evaluation functions are vastly more modest and benign.
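As a hypothetical illustration of that "intermediary step" point, the snippet below runs a small internal search (choosing among a few candidate phrasings by a crude scoring rule) as just one stage of a larger pipeline. The search space and objective are modest, and nothing about the step sets a goal for the system's overall behavior. All names and the scoring rule are made up for illustration.

```python
# Hypothetical sketch: an "inner optimization" step as one intermediate stage of a
# larger pipeline. The search space (a few candidate phrasings) and the objective
# (a crude fluency score) are modest and benign.

def fluency_score(text: str) -> float:
    # Stand-in objective: prefer shorter candidates with fewer repeated words.
    words = text.lower().split()
    return -len(words) - 2 * (len(words) - len(set(words)))

def choose_phrasing(candidates: list[str]) -> str:
    return max(candidates, key=fluency_score)

def answer(question: str) -> str:
    # ...earlier processing of `question` would happen here...
    candidates = [
        "The answer is the answer is forty-two.",
        "The answer is forty-two.",
        "Forty-two is the answer, the answer being forty-two.",
    ]
    best = choose_phrasing(candidates)  # the inner "optimization" step
    # ...later processing would continue from `best`...
    return best

print(answer("What is six times seven?"))
```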

But nonetheless, Risks from Learned Optimization implies that inner optimization algorithms are inherently high-risk. Here's a string of quotes from early in the paper, which together paint a picture of these inner optimization algorithms as though they set an objective for the model's entire output (whereas optimization might really just be an intermediary step), and as though their search space is likely to consist of consequentialist plans for action in particular, rather than anything more benign than that.

"However, it is also possible for a neural network to itself run an optimization algorithm. For example, a neural network could run a planning algorithm that predicts the outcomes of potential plans and searches for those it predicts will result in some desired outcome. Such a neural network would itself be an optimizer because it would be searching through the space of possible plans according to some objective function. [...] The possibility of mesa-optimizers has important implications for the safety of advanced machine learning systems. When a base optimizer generates a mesa-optimizer, safety properties of the base optimizer’s objective may not transfer to the mesa-optimizer. Thus, we explore two primary questions related to the safety of mesa-optimizers: 1. Mesa-optimization: Under what circumstances will learned algorithms be optimizers? 2. Inner alignment: When a learned algorithm is an optimizer, what will its objective be, and how can it be aligned?

Notice, first, the assumption that the inner optimizer's optimization criterion pertains to the output of the entire network, again missing the possibility that optimization is an intermediary step and therefore not an appropriate way of describing the telos of the entire neural network. Second, notice that the paper gives no examples of search spaces other than consequentialist plans for action. (The paper's later examples of inner optimization targets likewise don't highlight other possible optimization targets; perhaps the authors made that choice deliberately rather than missing those possibilities, but if so, it's at least a choice that misled me on my first read-through.)

One possible reason for these oversights is that, prior to Risks from Learned Optimization, MIRI had been intensely focused on threat models inspired by systems like AIXI. AIXI is an optimization algorithm that does search over action-plans in terms of their likely effects on the entire universe, as evaluated by an explicit utility function. Much of the Risks from Learned Optimization paper reads as an attempt to justify why these old threat models were still applicable in the paradigm of machine learning, which, at the time, was just starting to take off. (The threat models that would obviously follow from the machine learning paradigm are quite different from those that follow from the utility maximization paradigm.)
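By contrast, here is a deliberately crude caricature of the AIXI-style picture (a sketch only; actual AIXI mixes over all computable environments via Solomonoff induction rather than using a hand-written world model): exhaustively enumerate action plans, predict their consequences, and pick the plan with maximal score under an explicit utility function over predicted world states.

```python
import itertools

# Crude caricature of AIXI-style planning (illustrative; not AIXI itself): enumerate
# every action plan up to a horizon, predict the resulting world state, and choose
# the plan that an explicit utility function scores highest.

ACTIONS = ["noop", "gather_resources", "persuade_humans"]  # hypothetical action set

def predict_world(plan):
    # Stand-in world model; a real planner would simulate the plan's consequences.
    return tuple(plan)

def utility(world_state):
    # Stand-in explicit utility function over predicted world states.
    return sum(len(action) for action in world_state)

def best_plan(horizon=3):
    return max(itertools.product(ACTIONS, repeat=horizon),
               key=lambda plan: utility(predict_world(plan)))

print(best_plan())  # with this toy utility, it repeats the longest-named action
```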

Overall, I think the MIRI team may have ended up with tunnel vision regarding what explicit optimization processes have the potential to look like, particularly in the context of mind-like systems. While I find this (potential) oversight understandable, I think that people's p(doom)s were likely wrongly driven up by rhetoric implying that neural networks' use of an entire major class of algorithms would be profoundly dangerous, particularly since Eliezer used to frequently frame that class of algorithms as fundamental to the nature of intelligence (and thus artificial intelligence).

  1. ^

    Consider a very small neural network, which couldn't possibly be configured into a dangerous maximizer-like system even if you wanted to make it one. Training such a network would be an example of running optimization over a search space which simply doesn't contain any deeply malign candidates.
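
    As a concrete (hypothetical) illustration: training a two-parameter model y = w*x + b by gradient descent is optimization over the space of (w, b) pairs, and no point in that space could encode anything malign.

    ```python
    # Hypothetical illustration: fit y = 2x + 1 with a two-parameter "network".
    # The search space is every (w, b) pair; none of them can encode a dangerous policy.
    data = [(x, 2 * x + 1) for x in range(10)]
    w, b, lr = 0.0, 0.0, 0.01

    for _ in range(2000):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b

    print(round(w, 2), round(b, 2))  # approaches (2.0, 1.0)
    ```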



