Level 1 and Level 2 Optimizers, or Tendimizers and Optimizers

This post examines different understandings of "optimizer" within the AI alignment field. The author divides optimizers into two types, Level 1 and Level 2. Level 1 optimizers are characterized by a system's ability to steer a large set of possible states into a small set of target states, as when a human builds a car. Level 2 optimizers are systems with explicit internal representations of their goals, which they pursue via search processes, such as machine learning algorithms and planning algorithms. The author suspects that current large language models sit somewhere between the two types: their optimization looks more like human behavior shaped by reinforcement learning than like strict utility maximization. We therefore need to rethink the risk models for large language models rather than simply reusing the threat models the early AI safety movement developed for Level 2 optimizers.

🤔 **Level 1 optimizers**: These are characterized by a system's ability to steer a large set of possible states into a small set of target states. For example, when a human builds a car, the car's atoms could be arranged in many different ways, yet the human reliably produces a car.

💡 **Level 2 optimizers**: These are systems with explicit internal representations of their goals, pursued via search processes. For example, machine learning algorithms search a parameter space to optimize an objective function, and planning algorithms search over possible plans to achieve a goal.

🤖 **Large language models may sit between Level 1 and Level 2**: The author suspects that an LLM's optimization is more like human behavior shaped by reinforcement learning than like strict utility maximization, and that internally it probably does not contain an explicit utility function the way AIXI does.

⚠️ **LLM risk models need rethinking**: Since LLMs may not be Level 2 optimizers, the threat models the early AI safety movement developed for Level 2 optimizers may not fully apply, and new risk models are needed.

🤔 **Analogy to human behavior**: Human behavior also resembles fuzzy Level 2 optimization; we are shaped by reinforcement learning into behavioral tendencies rather than being driven entirely by explicit goals.

Published on November 13, 2024 3:02 AM GMT

I'd like to quickly highlight a discrepancy in how the concept of optimization is understood by different members of the alignment community. On one hand, you have definitions like that of sequences-era Yudkowsky, or that of Alex Flint. Yudkowsky once defined optimization in terms of systems which make certain unlikely outcomes likely, reliably hitting small targets in large search spaces. For example, if a knowledgeable and resource-rich human wants to build a car, they'll reliably be able to build one, even though there are lots of ways the car's atoms could be arranged, and the vast majority of those arrangements are not in fact cars. So when an informed and resource-rich human is motivated to build a car, they are acting as an optimizer, fairly reliably hitting a small target in a large search space.

In The Ground of Optimization, Alex Flint gives a similar definition: "An optimizing system is a physical process in which the configuration of some part of the universe moves predictably towards a small set of target configurations from any point in a broad basin of optimization, despite perturbations during the optimization process." The focus here is still on systems which effectively embed robust attractor states into the larger systems they're a part of.

I'll call systems which meet this definition of optimization Level 1 Optimizers: they meet the criterion of being able to steer a large set of possible starting states into a relatively small set of outcome-states. (Someone who knows, like, dynamical systems theory or something should hurry up and formalize this mathematically, if that hasn't already been done.) However, there's another common definition of optimization in the alignment community, which I'll call Level 2 Optimization. It has to do with systems that feature explicit internal representations of their own optimization targets, and that run explicit search processes for actions likely to achieve those targets.
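As a gesture at what such a formalization might look like (this is my own rough sketch, and none of the notation comes from Yudkowsky or Flint): let $S$ be the state space of the joint system and $\phi_t : S \to S$ its time evolution. A system embedded in $S$ is a Level 1 Optimizer toward a target set $T \subset S$ if there is a broad basin $B \subseteq S$ such that

$$\mu(T) \ll \mu(B) \qquad \text{and} \qquad \lim_{t \to \infty} d(\phi_t(s_0), T) = 0 \ \text{ for (almost) all } s_0 \in B,$$

where $\mu$ is some measure of how "big" a region of state space is and $d$ is a distance to the target set. Robustness to perturbations could be captured by requiring the same limit to hold even when the trajectory is occasionally knocked to a nearby state.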

For example, MIRI's Risks from Learned Optimization paper uses a definition that falls into this category.

> While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

The paper then gives examples of what I'd call Level 2 Optimizers.

> Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible plans, picking those that do well according to some objective.
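To make that definition concrete, here is a minimal toy sketch of a system that counts as an optimizer in the paper's sense (the example is mine, not the paper's): it holds an explicitly represented objective and searches a parameter space for values that score well against it.

```python
import random

def objective(params):
    # Explicitly represented objective: negative squared error of a line
    # with slope params[0] and intercept params[1] against the target y = 2x + 1.
    return -sum((params[0] * x + params[1] - (2 * x + 1)) ** 2 for x in range(5))

def hill_climb(objective, steps=5000):
    """A toy 'learning algorithm': search the parameter space, keeping whichever
    candidate scores highest under the explicitly represented objective."""
    best = [random.uniform(-5, 5), random.uniform(-5, 5)]
    best_score = objective(best)
    for _ in range(steps):
        candidate = [p + random.gauss(0, 0.1) for p in best]
        score = objective(candidate)
        if score > best_score:  # the explicit search-and-select step
            best, best_score = candidate, score
    return best

print(hill_climb(objective))  # ends up near slope 2, intercept 1
```

The point is just that both the objective and the search over parameters are explicitly present in the code, which is what makes this a Level 2 Optimizer rather than merely a system whose behavior tends somewhere.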

Here's an example that makes the distinction I'm drawing especially clear: a system like AIXI is a Level 2 Optimizer. In terms of its actual programming, it has a utility function, and it searches over ways it might reconfigure the world it's in to maximize expected value per that utility function. By contrast, you can imagine a lookup table which perfectly simulates the behavior of AIXI at some particular timestep; it observes the same world, and takes the same action. But since its actual code doesn't include a utility function, an expected utility calculator, or anything else like that, it's merely a Level 1 Optimizer.
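Here's a toy version of that contrast (purely illustrative, and nothing to do with AIXI's actual formalism): two agents that behave identically on a tiny integer-valued world, where only the first contains an explicit utility function and an explicit search over actions.

```python
ACTIONS = [-1, 0, 1]

def utility(state):
    # Explicit internal representation of the goal: be as close to 10 as possible.
    return -abs(state - 10)

def level2_act(state):
    # Explicit search: pick the action that maximizes utility of the next state.
    return max(ACTIONS, key=lambda a: utility(state + a))

# A "tendimizer": a lookup table built by recording the Level 2 agent's choices.
# Its code contains no utility function and no search, just the recorded mapping.
LOOKUP = {s: level2_act(s) for s in range(-100, 101)}

def level1_act(state):
    return LOOKUP[state]

# Behaviorally identical on this domain, hence equally good at steering the
# state toward 10, but only one of them represents that target explicitly.
assert all(level2_act(s) == level1_act(s) for s in range(-100, 101))
```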

(Another set of terms we might use to capture the same concept: the first system is a proper optimizer, and the second one is what I call a tendimizer. It still embeds certain attractor states, or evolutionary tendencies, into its surroundings, i.e. the ones outlined by the former system's utility function. However, it lacks internal symbols which, when naturally interpreted, straightforwardly tell you what those attractor states are going to be. I'll elaborate on what I mean by "when naturally interpreted" in a future post.)

Obviously, in the context of the previous example about AIXI, this distinction wouldn't matter at all in terms of whether the system in question e.g. posed an x-risk to whatever larger system it was optimizing over. Both behave identically, and so they're equally deadly. However, the distinction becomes much more relevant when discussing the risks involved in designing an AI system. If you had an idea for a working AIXI-like architecture, it would be very easy to destroy the world with it: screw up the system's utility function even slightly, and the cosmos could be paperclipped.

By contrast, with a lookup table, you would need to manually specify the system's reaction to literally every last situation it might encounter. Causing its reactions to be malicious and intelligent enough to paperclip the cosmos would be dramatically more difficult. Of course, many Level 1 Optimizers have much cleverer designs than mere lookup tables, which makes them easier for humans to use to accomplish their own goals. However, the point is that no known Level 1 Optimizer architecture makes it easy to accidentally paperclip the cosmos in the manner of a utility maximizer. The threat model has to look different.

The question explored by MIRI's Risks from Learned Optimization paper is whether deep learning systems might produce Level 2 Optimizers, at which point it becomes highly plausible that those optimizers end up with poorly specified utility functions and thereby paperclip the cosmos. If, by contrast, they're mere tendimizers whose tendencies are slowly and steerably chiseled out by the deep learning process, then we at least shouldn't expect them to be dangerous in exactly the way called to mind by paperclip maximizers, unless we can come up with a really good reason they'd end up simulating maximizers anyway.

As for myself, my suspicion is that current LLMs probably fall somewhere in the middle. It's entirely possible to prompt an LLM using text that contains an explicit, though informal, representation of some goal. For example, you can ask it to write you code for a webpage with a heading that says "WELCOME TO HELL" and ASCII art of a smiley face; it will then fulfill this goal when you hit enter. You could even prompt the model to generate multiple drafts, grade each of its own drafts, and select the one with the highest grade. This is effectively the system running a fuzzy facsimile of a Level 2 optimization process at the visible, natural language-driven layer of its own cognition.
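Here's a rough sketch of what that draft-and-grade loop looks like when driven from outside the model. The `generate` function below is a stand-in for whatever LLM call you happen to be using, not a real API:

```python
def generate(prompt: str) -> str:
    # Placeholder for an actual LLM call; swap in your own client here.
    raise NotImplementedError

def best_of_n(task: str, n: int = 4) -> str:
    """Generate several drafts, have the model grade its own drafts, and keep
    the highest-graded one: a fuzzy, prompt-level facsimile of explicit search."""
    drafts = [generate(f"Write a draft for this task:\n{task}") for _ in range(n)]
    scored = []
    for draft in drafts:
        reply = generate(
            f"Task: {task}\nDraft:\n{draft}\n"
            "Grade this draft from 1 to 10. Reply with only the number."
        )
        try:
            score = float(reply.strip())
        except ValueError:
            score = 0.0  # unparseable grade: treat as worst
        scored.append((score, draft))
    return max(scored, key=lambda pair: pair[0])[1]
```

The "objective" here lives entirely in the natural-language prompts; whether anything resembling an objective is explicitly represented inside the network's weights is a separate question, which is the point of the next paragraph.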

However, within the model's hidden layers, I think there's a good chance that nothing like hardcore Level 2 Optimization is going on. Instead, my guess is that the hidden layers have been slowly sculpted to recognize and fulfill these kinds of natural-language requests in fairly benign ways; i.e., I think there's a good chance that this is accomplished using computational techniques which look nothing like rigorous utility maximization at all. (See Anthropic's research on induction heads for some intuition on what kinds of non-AIXI-like computations LLMs might use in practice.)

If this is true, then LLMs might not be very much more likely to end up paperclipping than a human being is. It strikes me that humans also basically just do a fuzzy facsimile of Level 2 Optimization; their own thoughts are, in a sense, their own natural language prompts. And the real reason our behavior ends up optimizing in a consistent direction is that we've been reinforcement-trained to behave in certain ways all our lives. Holding an "explicit goal" in mind only helps you go after it to the extent that you've been reinforced into wanting to pursue that goal, and RL seems like a much more safely steerable process than defining an explicit utility function.

(Related: see Anna Salamon's post "Humans are not automatically strategic".)

Of course, I could be wrong, and LLMs could be doing (Level 2) mesa-optimization in precisely the sense outlined in Risks from Learned Optimization. I think reducing uncertainty about this is very important. However, there's a real chance it will turn out that LLMs don't have permanent, ruthless goals in the sense of AIXI, and instead have goals at best in the same sense that humans have them; that is, goals more like the high-level behavioral tendencies baked into us slowly and steerably by reinforcement learning. Only in certain circumstances do we think about goals that we have, and even then, we only pursue them if we've been reinforced into doing so.

(An LLM may self-modify to develop a stronger tendency to bring about certain attractor states in the environment, but perhaps it would only be motivated to do so under the same circumstances that a human would. Perhaps one can reinforce an LLM to simply not want to do this.)

If all of this is correct, we're going to need different threat models for why LLMs are dangerous, e.g. ones inspired by sim theory or by LLMs' likely analogues to human psychology (humans sometimes being evil). So that's why I'm interested in distinguishing between these two types of optimization: only one of them implies the validity of the threat models developed at the beginning of the AI safety movement.


(post-script: i had the idea for the first half of this post months ago. i came up with the second half on the fly. the second half, particularly the conclusion that LLMs are in some sense in the middle between level 1 and level 2 optimizers, but still probably don't have the failure modes of the latter, makes me question the quality of the distinction i'm drawing here. i think it's an important first step but not actually the final framework i'll use for this topic. posting it anyway because i love wasting everybody's time ^^)


