AISafety.info: What is the "natural abstractions hypothesis"?

Published on October 5, 2024 12:31 PM GMT

AISafety.info writes AI safety intro content. We'd appreciate any feedback.

Introduction

The natural abstraction hypothesis (NAH) claims that our physical world abstracts well into low-dimensional summaries of low-level systems, that a wide variety of minds will converge on roughly the same set of these summaries, and that these natural abstractions approximately match the concepts humans use.

If the NAH is true, AI alignment could be dramatically simplified, as it implies that any cognition a very powerful AI uses will be in terms of concepts that humans can understand.[1]

Explanation of the natural abstraction hypothesis

Let's unpack that definition. First, what do we mean by “our physical world abstracts well”? Just that for most things in the world, the information that describes how the thing interacts with other stuff “far away” from the system is much lower-dimensional (i.e., described by fewer numbers) than the thing itself. “Far away” can refer to many kinds of separation, including physical[2], conceptual, or causal separation.

For example, a wheel can be understood without considering the position and velocity of every atom in it. We only need to know a few large-scale properties like its shape, how it rotates, etc. to know how a wheel interacts with other parts of the world. This is a handful of numbers compared to an atomically precise description, which would require over 10^26 numbers! In this sense, the wheel is an abstraction of the atoms that compose it. Or consider a rock: you don't need to keep track of its chemical composition if you’re chucking it at someone. You just need to know how hard and heavy it is.
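
To make that gap concrete, here is a rough back-of-the-envelope count (the 10 kg iron wheel is an assumed example for illustration):

```latex
% Atoms in a ~10 kg iron wheel (molar mass of Fe ~ 56 g/mol):
N_{\text{atoms}} \approx \frac{10\,000\ \mathrm{g}}{56\ \mathrm{g\,mol^{-1}}}
  \times 6.02\times10^{23}\ \mathrm{mol^{-1}} \approx 1.1\times10^{26}
% Three position and three velocity coordinates per atom:
N_{\text{numbers}} \approx 6\,N_{\text{atoms}} \approx 6\times10^{26}
```

A dozen or so macroscopic numbers (radius, width, mass, angular velocity, and so on) stand in for roughly 6×10^26 microscopic ones.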

The NAH claims that different minds will converge to the same set of abstractions because they are the most efficient representations of all relevant info that reaches the mind from “far away”. And many parts of the world that are far from a mind will influence things the mind cares about, so a mind will be incentivized to learn these abstractions. For instance, if a mind mostly cares about building great cars, then things like “Hertzian Zones” may affect its ability to build great cars despite being conceptually far from car design, so it would plausibly have to learn what high-pressure phase transitions are.
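
This “information at a distance” picture can be made concrete with a toy simulation (an illustration of mine, not the article's): the gravitational field of a cloud of point masses, measured far away, depends almost entirely on two summary quantities, the total mass and the center of mass, while up close the micro-configuration still matters.

```python
# Two different micro-configurations of point masses, matched only on total
# mass and center of mass, produce nearly identical gravitational fields far
# away but noticeably different fields nearby.
import numpy as np

rng = np.random.default_rng(0)

def random_cluster(n=1000):
    """n point masses in [-1, 1]^3, total mass 1, center of mass at the origin."""
    pos = rng.uniform(-1.0, 1.0, size=(n, 3))
    mass = rng.uniform(0.5, 1.5, size=n)
    mass /= mass.sum()
    pos -= np.average(pos, axis=0, weights=mass)
    return pos, mass

def field_at(point, pos, mass):
    """Newtonian gravitational acceleration at `point`, with G = 1."""
    diff = pos - point
    dist = np.linalg.norm(diff, axis=1)
    return np.sum(mass[:, None] * diff / dist[:, None] ** 3, axis=0)

cluster_a, cluster_b = random_cluster(), random_cluster()  # different micro-details
for label, point in [("far", np.array([100.0, 0, 0])), ("near", np.array([1.5, 0, 0]))]:
    fa = field_at(point, *cluster_a)
    fb = field_at(point, *cluster_b)
    print(label, np.linalg.norm(fa - fb) / np.linalg.norm(fa))
# The relative difference is orders of magnitude smaller at the far point:
# from far away, only the low-dimensional summary is visible.
```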

Moreover, NAH claims that the abstractions that humans usually use are approximately natural abstractions. That is, any mind that looks at and uses car wheels successfully will have learned what a circle is in approximately the same way as a human. Or if some aliens about the size of humans, born on a planet similar to our own, were to come up with a theory of motion, they’d land on Newtonian physics. Or perhaps General Relativity if they were more sophisticated.

Note how strong a claim NAH is! It applies to aliens, to superintelligences, and even to alien superintelligences! But before we investigate whether it is true, why does NAH matter for alignment?

Why the natural abstraction hypothesis is important for alignment

Alignment is probably easier if NAH is true than if it isn't. If superintelligences will reliably use approximately the same concepts humans use, then there's no fundamental barrier to doing mechanistic interpretability on superintelligences, and maybe even editing their goals to be human-compatible.

If we are lucky, human values, or other alignment targets like “niceness” or corrigibility or property rights, are themselves natural abstractions. If these abstractions are represented in a simple way in most advanced AI systems, then alignment, or control, is simply a matter of locating these abstractions within the AI's mind and forming a goal from them like “be corrigible to your creator”. A crude but remarkably effective technique in this vein is activation steering[3]. And if these values are natural abstractions, then even if they are not represented anywhere in the AI’s cognition, they could still be taught to the AI using little data.
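
Here is a minimal sketch of the difference-of-means flavor of activation steering (an illustration only, not the method behind Golden Gate Claude, which clamped learned sparse-autoencoder features in Anthropic's models; GPT-2, the layer index, and the scale factor below are placeholder assumptions):

```python
# Steer GPT-2 by adding a direction in activation space during generation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # which block's output to steer (a hyperparameter)

def mean_hidden(text):
    """Mean residual-stream activation after block LAYER."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1].mean(dim=1)  # hs[0] is embeddings; hs[i + 1] is block i's output

# Steering vector: the direction from "calm" activations toward "angry" ones.
steer = mean_hidden("I am furious and outraged") - mean_hidden("I am calm and content")

def hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    return (output[0] + 4.0 * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("The weather today is", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```

Pointed at an abstraction like corrigibility rather than anger, the same locate-and-nudge move is the kind of goal editing the NAH would make viable.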

Different alignment targets look more or less plausible as natural abstractions. Any specific conception of value a human has — e.g., natural law deontology or ancient Athenian virtue ethics — is unlikely to be a natural abstraction[4]. But there are some parts of human values, and inputs to them, that are plausibly natural abstractions. If an AI used those abstractions, that would make it easier for a training process to instill values that depend on them into the AI.

Is NAH true?

We don't know. The truth of the NAH is ultimately an empirical question, and we have few distinct kinds of minds we can converse with, or manually inspect, to see if their abstractions are natural. For the few tests we can do on different minds — i.e., humans, some animals, and AI — the data are consistent with NAH.

Humans can quickly share abstractions,[5] and use roughly the same ones in the same environment. Our abstractions continue to work even in drastically different environments from where we acquired them. For example, F=ma still works on the moon. As far as we can tell with our crude ability to measure abstractions, very different AIs trained in different ways on different data develop basically the same abstractions; even more so the more capable the AI.
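
One way to quantify “basically the same abstractions” (the article doesn't name a method; linear centered kernel alignment, CKA, from Kornblith et al. 2019 is one standard choice) is to compare two networks' activations on the same inputs:

```python
# Linear CKA: a similarity score in [0, 1] between representations X (n, d1)
# and Y (n, d2) of the same n inputs; high scores mean shared structure.
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 16))               # a shared underlying "abstraction"
X = latent @ rng.normal(size=(16, 64))            # network A's encoding of it
Y = latent @ rng.normal(size=(16, 32))            # network B's, with a different width
print(linear_cka(X, Y))                           # ~1: same content, different encoding
print(linear_cka(X, rng.normal(size=(500, 32))))  # ~0: unrelated representations
```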

But we have no data for generally superhuman systems. This is where theories of natural abstractions would have to come into play: we could test such theories against existing data and use the best of them to predict what will occur in superhuman systems. Alas, the theory of natural abstractions is far from developed enough to do such things. We do not even have a good technical definition yet, which is why the hypothesis is framed informally.[6]

The work is ongoing.

 

  1. ^

    “Good representations are all alike; every bad representation is bad in its own way” — if Tolstoy had invented the Natural Abstractions Hypothesis, that is what it would say.

  2. ^

    Relative to the size of the system — “far away” from a fly might mean a few centimeters, while “far away” from the sun might mean thousands of kilometers. ↩︎

  3. ^

    See, for instance, Golden Gate Claude.

  4. ^

    If human values aren't natural abstractions, it doesn't follow that they have nothing to do with natural abstractions. Human values may have inputs which are natural abstractions, which can significantly constrain their type signature, making them easier to find, and perhaps even making them good enough proxies for natural abstractions in some training regimes that they get found by default.

  5. ^

    Note that you've never needed 1TB of data to describe an idea to someone, let alone to convince them that something is a rock. 

  6. ^

    It is somewhat sloppy to say “the” natural abstraction hypothesis, as there are various formulations, and of course there might be a few distinct natural abstractions corresponding to a given human abstraction, rather than one. Some of the formulations have different implications for alignment. This is why this article’s exposition has to be fuzzy enough to accommodate most of these variants.


