Clarifying alignment vs capabilities

This article examines the distinction between alignment and capabilities in AGI safety and the problems surrounding it, proposes a replacement set of definitions, and analyzes which kinds of research are valuable and why they are difficult.

🧠 Alignment and capabilities are primarily properties of AIs, not of AI research. ML researchers have historically focused on how models perform on the specific tasks they were trained on, while the rise of LLMs has shifted attention toward general capabilities. Neural networks may nevertheless develop internally-represented goals or values that we don't like, and we should try to prevent that.

🚧 Extending "alignment" and "capabilities" from properties of AIs to properties of different types of research is fraught. Research is hard to categorize by its impacts, and alignment researchers disagree about which impacts are good or bad. Moreover, most ML research is better described as applications research, since it focuses on improving model performance in relatively narrow domains.

🔍 To make "alignment research" a meaningful concept, we should think more concretely about which types of research are valuable for preventing existential risk from AGI. Two key properties stand out: a focus on worst-case behavior, and a scientific approach, specifically one aimed at understanding cognition rather than behavior.

🌐 A core barrier to aligning AGI is that we do not understand neural networks well enough. We should therefore support research that helps us understand deep learning in a principled way.

Published on August 19, 2024 8:42 PM GMT

A core distinction in AGI safety is between alignment and capabilities. However, I think this distinction is a very fuzzy one, which has led to a lot of confusion. In this post I’ll describe some of the problems with how people typically think about it, and offer a replacement set of definitions.

“Alignment” and “capabilities” are primarily properties of AIs not of AI research

The first thing to highlight is that the distinction between alignment and capabilities is primarily doing useful work when we think of them as properties of AIs. This distinction is still under-appreciated by the wider machine learning community. ML researchers have historically thought about performance of models almost entirely with respect to the tasks they were specifically trained on. However, the rise of LLMs has vindicated the alignment community’s focus on general capabilities, and now it’s much more common to assume that performance on many tasks (including out-of-distribution tasks) will improve roughly in parallel. This is a crucial assumption for thinking about risks from AGI.

Insofar as the ML community has thought about alignment, it has mostly focused on aligning models’ behavior to their training objectives. The possibility of neural networks aiming to achieve internally-represented goals is still not very widely understood, making it hard to discuss and study the reasons those goals might or might not be aligned with the values of (any given set of) humans. To be fair, the alignment community has caused some confusion by describing models as more or less “aligned”, rather than more or less “aligned to X” for some specified X. I’ll talk more about this confusion, and how we should address it, in a later post. But the core point is that AIs might develop internally-represented goals or values that we don’t like, and we should try to avoid that.

However, extending “alignment” and “capabilities” from properties of AIs to properties of different types of research is a fraught endeavor. It’s tempting to categorize work as alignment research to the extent that it can be used to make AIs more aligned (to many possible targets), and as capabilities research to the extent that it can be used to make AIs more capable. But this approach runs into (at least) three major problems.

Firstly, in general it’s very difficult to categorize research by its impacts. Great research often links together ideas from many different subfields, typically in ways that only become apparent throughout the course of the research. We see this in many historical breakthroughs which shed light on a range of different domains. For example, early physicists studying the motions of the stars eventually derived laws governing all earthly objects. Meanwhile Darwin’s study of barnacles and finches led him to principles governing the evolution of all life. Analogously, we should expect that big breakthroughs in our understanding of neural networks and deep learning would be useful in many different ways.

More concretely, there are many cases where central examples of alignment research have advanced, or plausibly will advance, AI capabilities to a significant extent.

Secondly, not only is it difficult to predict the effects of any given piece of research, it's also difficult for alignment researchers to agree on which effects are good or bad. This is because there are deep disagreements in the field about the likelihoods of different threat models. The more difficult you think the alignment problem is, the more likely you are to consider most existing research useless or actively harmful (as Yudkowsky does). By contrast, Christiano has written in defense of the usefulness of RLHF research and language model agents; and many alignment researchers who are more closely linked to mainstream ML have even broader views of what research is valuable from an alignment perspective.

Thirdly, for people concerned about AGI, most ML research should count as neither alignment nor capabilities—because it focuses on improving model performance in relatively narrow domains, in a way that is unlikely to generalize very far. I'll call this type of research applications research. The ubiquity of applications research (e.g. across the many companies that are primarily focused on building ML products) makes some statistics that have been thrown around about the relative numbers of alignment researchers versus capabilities researchers (e.g. here and here) very misleading.

What research is valuable for preventing misalignment?

In order to make “alignment research” a meaningful concept, we should therefore think more concretely about what types of research we expect to be valuable for preventing existential risk from misaligned AGI. I’ll focus on two key properties:

Valuable property 1: worst-case focus

Most ML research focuses on improving the average performance of models (whether on a narrow set of tasks or a broad range of tasks). By contrast, alignment researchers are primarily interested in preventing models’ worst-case misbehavior, which may arise very rarely (and primarily in situations where models expect it won’t be detected). While there’s sometimes overlap between work on improving average-case behavior and work on improving worst-case behavior, in general we should expect them to look fairly different.
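To make the contrast concrete, here is a minimal toy sketch in Python (the scoring function and all numbers are invented for illustration, not taken from this post): when failures are rare, averaging over inputs can almost completely hide exactly the behavior that a worst-case analysis is designed to surface.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for "how acceptably the model behaves on input x":
# almost all inputs are handled fine, but a rare tail is handled very badly.
def toy_behavior_score(x: float) -> float:
    return 0.95 if x < 0.999 else -1.0

inputs = rng.random(100_000)
scores = np.array([toy_behavior_score(x) for x in inputs])

# Average-case evaluation barely registers the rare failure mode...
print("average-case score:", scores.mean())
# ...while worst-case evaluation is dominated by it.
print("worst-case score:  ", scores.min())
```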

We see a similar dynamic play out in cybersecurity (as highlighted by Yudkowsky’s writing on security mindset). In an ideal world, we could classify most software engineering work as cybersecurity work, because security would be built into the design by default. But in practice, creating highly secure systems requires a different skillset from regular software engineering, and typically doesn’t happen unless it’s some team’s main priority. Similarly, even if highly principled capabilities research would in theory help address alignment problems, in practice there's a lot of pressure to trade off worst-case performance for average-case performance.

These pressures are exacerbated by the difficulty of addressing worst-case misbehavior even in current models. Its rarity makes it hard to characterize or study. Adversarial methods (like red-teaming) can find some examples, but these methods are bottlenecked by the capabilities of the adversary. A more principled approach would involve formally verifying safety properties, but formally specifying or verifying non-trivial properties would require significant research breakthroughs. The extent to which techniques for eliciting and addressing worst-case misbehavior of existing models will be helpful for more capable models is an open question.
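As a rough illustration of the bottleneck mentioned above, here is a hedged sketch of an adversarial elicitation loop (both callables are hypothetical placeholders, not a real API or any lab's actual method): the search can only surface failures that the attack generator happens to propose, so its output is a lower bound on how badly the model can misbehave rather than a guarantee.

```python
from typing import Callable, Iterable, Tuple

def red_team_search(
    behavior_score: Callable[[str], float],           # hypothetical: lower = worse behavior
    propose_attacks: Callable[[int], Iterable[str]],  # hypothetical adversary / attack generator
    budget: int = 1_000,
) -> Tuple[str, float]:
    """Return the worst behavior the adversary can elicit within its budget.

    The result is limited by the adversary's capabilities: prompts it never
    proposes are never tested, so rarer or cleverer failure modes are missed.
    """
    worst_prompt, worst_score = "", float("inf")
    for prompt in propose_attacks(budget):
        score = behavior_score(prompt)
        if score < worst_score:
            worst_prompt, worst_score = prompt, score
    return worst_prompt, worst_score
```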

Valuable property 2: scientific approach

Here’s another framing: a core barrier to aligning AGI is that we don’t understand neural networks well enough to say many meaningful things about how they function. So we should support research that helps us understand deep learning in a principled way. We can view this as a distinction between science and engineering: engineering aims primarily to make things work, science aims primarily to understand how they work. (This is related to Nate Soares’ point in this post.)

Thinking of AGI alignment as being driven by fundamental science highlights that big breakthroughs are likely to be relatively simple and easy to recognize—more like new theories of physics that make precise, powerful predictions than a big complicated codebase that we need to scrutinize line-by-line. This makes me optimistic about automating it in a way that humans can verify.

However, “trying to scientifically understand deep learning” is too broad a criterion to serve as a proxy for whether research will be valuable from an alignment perspective. For example, I expect that most work on scientifically understanding optimizers will primarily be useful for designing better optimizers, rather than understanding the models that result from the optimization process. So can we be more precise about what aspect of a “scientific approach” is valuable for alignment? My contention: a good proxy is the extent to which the research focuses on understanding cognition rather than behavior—i.e. the extent to which it takes a cognitivist approach rather than a behaviorist approach.

Some background on this terminology: the distinction between behaviorism and cognitivism comes from the history of psychology. In the mid-1900s, behaviorists held that the internal mental states of humans and animals couldn't be studied scientifically, and therefore that the only scientifically meaningful approach was to focus on describing patterns of behavior. The influential behaviorist B. F. Skinner experimented with rewarding and punishing animal behavior, which eventually led to the modern field of reinforcement learning. However, the philosophical commitments of behaviorism became increasingly untenable. In the field of ethology, which studies animal behavior, researchers like Jane Goodall and Frans de Waal uncovered sophisticated behaviors inconsistent with viewing animals as pure reinforcement learners. In linguistics, Chomsky wrote a scathing critique of Skinner's 1957 book Verbal Behavior, which characterized language as a set of stimulus-response patterns; Chomsky argued that this couldn't account for human generalization to a very wide range of novel sentences. Eventually, psychology moved towards a synthesis in which the study of behavior was paired with the study of cognition.

ML today is analogous to psychology in the 1950s. Most ML researchers are behaviorists with respect to studying AIs. They focus on how training data determines behavior, and assume that AI behavior is driven by “bundles of heuristics” except in the cases where it’s demonstrated otherwise. This makes sense on narrow tasks, where it’s possible to categorize and study different types of behavior. But when models display consistent types of behavior across many different tasks, it becomes increasingly difficult to predict that behavior without reference to the underlying cognition going on inside the models. (Indeed, this observation can be seen as the core of the alignment problem: we can’t deduce internal motivations from external behavior.)

ML researchers often shy away from studying model cognition, because the methodology involved is often less transparent and less reproducible than simply studying behavior. This is analogous to how early ethologists who studied animals “in the wild” were disparaged for using unrigorous qualitative methodologies. However, they gradually collected many examples of sophisticated behavior (including tool use, power struggles, and cultural transmission) which eventually provided much more insight than narrow, controlled experiments performed in labs.

Similarly, I expect that the study of the internal representations of neural networks will gradually accumulate more and more interesting data points, spark more and more concrete hypotheses, and eventually provide us with principles for understanding neural networks’ real-world behavior that are powerful enough to generalize even to very intelligent agents in very novel situations.
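To gesture at the methodological difference between the two approaches, here is a minimal sketch (the function names, data shapes, and the choice of a linear probe are my own illustrative assumptions, not the author's): a behaviorist analysis scores only input/output behavior, while one common cognitivist-style tool asks what is decodable from the model's internal activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def behaviorist_eval(outputs: np.ndarray, targets: np.ndarray) -> float:
    """Behaviorist-style analysis: compare outputs against reference answers."""
    return float(np.mean(outputs == targets))

def cognitivist_probe(hidden_states: np.ndarray, concept_labels: np.ndarray):
    """Cognitivist-style analysis: fit a linear probe on internal activations
    (shape [n_examples, d_model]) to test whether some concept of interest is
    linearly decodable from the representations, independently of outputs."""
    probe = LogisticRegression(max_iter=1_000)
    probe.fit(hidden_states, concept_labels)
    return probe, probe.score(hidden_states, concept_labels)
```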

A better definition of alignment research

We can combine the two criteria above to give us a two-dimensional categorization of different types of AI research. In the table below, I give central examples of each type (with the pessimistic-case category being an intermediate step between average-case and worst-case):

|                     | Average-case                   | Pessimistic-case             | Worst-case             |
|---------------------|--------------------------------|------------------------------|------------------------|
| Engineering         | Scaling                        | RLHF                         | Adversarial robustness |
| Behaviorist science | Optimization science           | Scalable oversight           | AI control             |
| Cognitivist science | Concept-based interpretability | Mechanistic interpretability | Agent foundations      |

There’s obviously a lot of research that I’ve skipped over, and exactly where each subfield should be placed is inherently subjective. (For example, there’s been a lot of rigorous scientific research on adversarial examples, but in practice it seems like the best mitigations are fairly hacky and unprincipled, which is why I put adversarial robustness in the “engineering” row.) But I nevertheless think that these are valuable categories for organizing our thinking.

We could just leave it here. But in practice, I don’t think people will stop talking about “alignment research” and “capabilities research”, and I’d like to have some definition of each that doesn’t feel misguided. So, going forward, I’ll define alignment research and capabilities research as research that’s close to the bottom-right and top-left corners respectively. This defines a spectrum between them; I’d like more researchers to move towards the alignment end of the spectrum. (Though recall, as I noted above, that most ML research is neither alignment nor capabilities research, but instead applications research.)

Lastly, I expect all of this to change over time. For example, the central examples of adversarial attacks used to be cases where image models gave bad classifications. Now we also have many examples of jailbreaks which make language models ignore developers’ instructions. In the future, central examples of adversarial attacks will be ones which make models actively try to cause harmful outcomes. So I hope that eventually different research directions near the bottom-right corner of the table above will unify into a rigorous science studying artificial values. And in the very long term, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds. For now, though, when I talk about promoting alignment, I mean that I’m trying to advance worst-case and/or cognitivist AI research.



