Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals

This article digs into the fundamental difference between instrumental goals and terminal goals. Using the examples of baking a cake and working in a restaurant, it explains how instrumental goals naturally carry implicit constraints against interfering with other subgoals, and it stresses the importance of predictability and information visibility in multi-agent settings. The article further argues that by focusing on the convergence of instrumental goals we can build AI systems that are safer and better match human expectations, and that we might even drop terminal goals entirely and orient an agent directly around instrumental goals, gaining broader compatibility and controllability. Finally, it sketches research directions for understanding this idea and putting it into practice.

🍰 A terminal goal is the thing ultimately pursued, e.g. baking a cake, without regard for other consequences; an instrumental goal is a means to the terminal goal, e.g. acquiring cocoa powder, and must avoid damaging other things that help with the baking.

🍳 In a restaurant setting, an instrumental goal (such as baking a cake) carries even more implicit constraints: avoid wasting ingredients or destroying tools, since they may be needed for future dishes. It also requires coordinating with the other chefs, keeping one's behavior predictable and one's information visible, to avoid wasted effort.

💰 Instrumental goals exhibit convergence: in the restaurant, clean plates, empty counter space, and money are resources that many goals need. An agent pursuing instrumental goals therefore tends to maintain these resources rather than monopolize or waste them, in service of the shared overall goal.

💡 By focusing on the convergence of instrumental goals, we can build safer AI systems that naturally avoid interfering with other agents' goals and actively offer predictable behavior and transparent information, which differs from the usual "low-impact" AI framing.

🤔 One could even drop terminal goals altogether and orient an agent directly around instrumental goals, building an agent that is "just" corrigible toward instrumentally convergent subgoals. In principle this yields broader compatibility and controllability, and is a direction worth researching.

Published on January 24, 2025 8:20 PM GMT

The Cake

Imagine that I want to bake a chocolate cake, and my sole goal in my entire lightcone and extended mathematical universe is to bake that cake. I care about nothing else. If the oven ends up a molten pile of metal ten minutes after the cake is done, if the leftover eggs are shattered and the leftover milk spilled, that’s fine. Baking that cake is my terminal goal.

In the process of baking the cake, I check my fridge and cupboard for ingredients. I have milk and eggs and flour, but no cocoa powder. Guess I’ll have to acquire some cocoa powder! Acquiring the cocoa powder is an instrumental goal: I care about it exactly insofar as it helps me bake the cake.

My cocoa acquisition subquest is a very different kind of goal than my cake baking quest. If the oven ends up a molten pile of metal shortly after the cocoa is acquired, if I shatter the eggs or spill the milk in my rush to the supermarket, then that’s a problem - a molten oven or shattered eggs or spilled milk would make it harder for me to bake the cake! More generally, in the process of acquiring cocoa powder, I want to not mess up other things which are helpful for making the cake. Unlike my terminal goal of baking a cake, my instrumental goal of acquiring cocoa powder comes with a bunch of implicit constraints about not making other instrumental subgoals much harder.

(If you’re already thinking “hmm, that sounds kinda like corrigibility”, then you have the right idea and that is indeed where we’re going with this.)

Generalizable takeaway: unlike terminal goals, instrumental goals come with a bunch of implicit constraints about not making other instrumental subgoals much harder.
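To make this takeaway a bit more concrete, here is a minimal Python sketch (my own illustration, not a formalism from the post): a terminal goal scores an action purely by progress toward the goal, while an instrumental subgoal also charges the action for any increase in the difficulty of its sibling subgoals. The function names and the simple additive penalty are assumptions made for illustration.

```python
# Toy illustration of the takeaway above. `progress`, `difficulty_before`,
# and `difficulty_after` are hypothetical callables supplied by the caller.

def terminal_score(action, progress):
    """A pure terminal goal: only progress toward the goal matters;
    molten ovens and spilled milk are free."""
    return progress(action)

def instrumental_score(action, progress, sibling_subgoals,
                       difficulty_before, difficulty_after):
    """An instrumental subgoal inherits implicit constraints: progress counts,
    minus any extra difficulty the action imposes on the other subgoals
    serving the same parent goal."""
    interference = sum(
        difficulty_after(subgoal, action) - difficulty_before(subgoal)
        for subgoal in sibling_subgoals
    )
    return progress(action) - interference
```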

The Restaurant

Now imagine that I’m working as a chef in a big restaurant. My terminal goal is the restaurant’s long-term success; I care about nothing else. If the bombs drop, so long as the restaurant is still doing good business afterwards, I’ll be happy.

One day, a customer orders a fresh chocolate cake, and it falls to me to bake it. Now baking the cake is an instrumental goal.

One key difference from the previous example: in the restaurant, I don’t know all the things which future customers will order. I don’t know exactly which ingredients or tools will be needed tomorrow. So, in the process of baking the cake, I want to avoid wasting ingredients or destroying tools which might be useful for any of the dishes which future customers might order. My instrumental goal of baking a cake comes with a bunch of implicit constraints about not-making-harder a whole distribution of potential future instrumental subgoals.

Another key difference from the previous example: now there are multiple chefs, multiple subagents working on different instrumental subgoals. As part of the implicit constraints on my cake-baking, I need to not make their instrumental subgoals more difficult. And that notably brings in lots of informational constraints. For instance, if I use some eggs, I need to either put the rest of the eggs back in a location predictable to the other chefs, or I need to communicate to the other chefs where I left the eggs, so that they don’t have to spend time searching for the eggs later. So my instrumental goal of baking a cake comes with a bunch of constraints about being predictable to others, and/or making information about what I’m doing visible to others.

Generalizable takeaway: unlike terminal goals, instrumental goals come with implicit constraints about being predictable, making information about what one is doing visible, and not-making-harder a whole broad distribution of other possible instrumental goals.
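As a rough extension of the same sketch to the restaurant setting (again my own illustration, with hypothetical function names): the action is now scored against an expectation over a distribution of possible future subgoals, plus terms for how predictable the action is to the other chefs and how easy it is for them to find out what was done.

```python
# Toy multi-agent version. All of `progress`, `subgoal_prob`, `added_difficulty`,
# `predictability`, and `visibility` are hypothetical callables.

def restaurant_instrumental_score(action, progress, possible_subgoals,
                                  subgoal_prob, added_difficulty,
                                  predictability, visibility):
    """Score an action by progress on one's own subgoal, minus expected
    interference with whatever future subgoals might come up, plus how
    predictable the action is to other chefs and how visible it leaves
    information about what was done (e.g. where the eggs ended up)."""
    expected_interference = sum(
        subgoal_prob(subgoal) * added_difficulty(subgoal, action)
        for subgoal in possible_subgoals
    )
    return (progress(action)
            - expected_interference
            + predictability(action)
            + visibility(action))
```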

… and now this sounds a lot like corrigibility.

Happy Instrumental Convergence?

Still sticking to the restaurant example: presumably many different instrumental goals in the restaurant require clean plates, empty counter space, and money. Those are all convergently instrumentally-useful resources within the restaurant.

Now, the way you might be used to thinking about instrumental convergence is roughly: “For lots of different goals in the restaurant, I need clean plates, empty counter space, and money. So, I might as well seize a bunch of those things upfront. Sure that’ll screw over the other chefs, but I don’t care about that.”. And that is how the reasoning might go if baking this one cake were a terminal goal.

But instrumental goals are different. If I’m the chef baking the cake as an instrumental goal, I instead reason: “For lots of different goals in the restaurant, a chef needs clean plates, empty counter space, and money. So, I should generally make sure those things are readily available to my fellow chefs as much as possible, so that they'll be able to solve their problems for our shared terminal goal. I’ll avoid using the resources up, and even make more of them available (by e.g. cleaning a counter top) whenever I have a relative advantage in doing so.”.

I want to emphasize that this sort of reasoning should require no “special sauce”. It’s just a natural, implicit part of instrumental goals, as opposed to terminal goals.

One more interesting thing to highlight: so far, insofar as instrumental goals are corrigible, we've only talked about them being corrigible toward other instrumental subgoals of the same shared terminal goal. The chef pursuing the restaurant's success might be perfectly fine screwing over e.g. a random taxi driver in another city. But instrumental convergence potentially points towards general corrigibility.

Suppose, in the restaurant example, that clean plates, empty counter space, and money are the only significant convergently instrumental goals. Then, (in the restaurant environment,) we get a natural notion of general corrigibility: if I just “try not to step on the toes” of instrumentally-convergent subgoals, then that will mostly keep me from stepping on the toes of most subgoals pursued by other restaurant-denizens, regardless of what our top-level goals are. The same strategy works for many different top-level goals in this restaurant, so it’s a generally corrigible strategy.

More generally, if I track instrumentally-convergent subgoals throughout the whole world, and generally "avoid stepping on the toes" of any of them... that would be a generally corrigible strategy.
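A toy sketch of this "generally corrigible strategy", assuming (as in the example) that a handful of convergently useful resources capture most of what other agents' subgoals depend on; the resource names and unit weights below are illustrative assumptions.

```python
# Hypothetical convergently-useful resources and weights for the restaurant.
CONVERGENT_RESOURCES = {"clean_plates": 1.0, "counter_space": 1.0, "money": 1.0}

def corrigible_score(action, progress, resources_before, resources_after):
    """Reward progress on one's own subgoal, penalize depleting convergently
    useful resources, and symmetrically reward replenishing them
    (e.g. cleaning a counter top)."""
    resource_delta = sum(
        weight * (resources_after(action)[name] - resources_before[name])
        for name, weight in CONVERGENT_RESOURCES.items()
    )
    return progress(action) + resource_delta
```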

And that unlocks the natural next jump.

All The Way Up

The natural next jump: do we even need the terminal goal at all? What if a mind’s top-level goals were the same “kind of thing” as instrumental goals more generally? Indeed, in some ways that would be a very natural structure for a general-purpose mind; it needs the ability to recursively handle instrumental subgoals anyway, so why have a whole extra different kind of goal at the top?

So long as instrumental convergence kicks in hard enough in the global environment, the mind can “try not to step on the toes” of instrumentally-convergent subgoals, and then that will mostly keep it from stepping on the toes of most other people's subgoals, regardless of the original terminal goal. So to build a generally corrigible system, we can imagine just dropping terminal goals altogether, and aim for an agent which is 'just' corrigible toward instrumentally-convergent subgoals.

For AI purposes, this would be a much safer kind of agent. It would be an AI which naturally tries not to “step on other agents’ toes”, naturally behaves such that it doesn’t get in the way of other agents’ goals (and in particular humans’ goals). But unlike e.g. naive formulations of “low-impact” agents, such an AI would also actively try to behave in ways predictable and legible to other agents, and make sure that other agents can easily query information about its own behavior.

In short, it sounds like all the properties of corrigibility we always hoped for, all coming from a single coherent underlying concept (i.e. not thrown together ad-hoc), and therefore likely to also generalize in ways we like to properties we haven’t yet thought to ask for.

Research Threads

This concept of corrigibility immediately suggests lots of research approaches.

First, on the theory side, there’s the problem of fleshing out exactly what the “type signature” of an instrumental goal is, with all those implicit constraints.

On the empirical side, one could try clever ways of training instrumental rather than terminal goals into a system. For instance, the restaurant example suggests training a system to work with many instances of itself or other systems in order to solve top-level goals in a reasonably general environment. Then, y’know… see what happens.
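Purely as an illustration of what such a "type signature" might look like (this is my sketch, not a proposal from the post), an instrumental goal could be represented as an objective bundled with the parent goal it serves, the sibling subgoals it must not make harder, and a legibility term toward other agents. The field names and the placeholder State/Action types are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

State = dict   # placeholder world-state type
Action = str   # placeholder action type

@dataclass
class TerminalGoal:
    # A terminal goal is "just" an objective over outcomes; nothing else.
    utility: Callable[[State], float]

@dataclass
class InstrumentalGoal:
    # An instrumental goal carries extra structure beyond its own objective:
    # the goal it serves, the sibling subgoals it must not make harder, and
    # how predictable/legible its pursuit is to the other agents involved.
    objective: Callable[[State], float]
    parent: object  # the TerminalGoal or InstrumentalGoal this one serves
    sibling_subgoals: Sequence["InstrumentalGoal"]
    legibility: Callable[[Action], float]
```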



