Clarifying alignment vs capabilities

This article examines the distinction between alignment and capabilities in AGI safety and the problems surrounding it, proposes a replacement set of definitions, and analyzes which kinds of research are valuable and why they are difficult.

🧠 Alignment and capabilities are primarily properties of AIs, not of AI research. ML researchers have historically focused on how models perform on the specific tasks they were trained on, while the rise of LLMs has shifted attention toward general capabilities. Neural networks may nevertheless develop internally-represented goals or values that we don't like, and we should try to prevent that.

🚧 Extending "alignment" and "capabilities" from properties of AIs to properties of different types of research is fraught. Research is hard to categorize by its impacts, and alignment researchers disagree about which impacts are good or bad. Moreover, most ML research is better described as applications research, since it focuses on improving model performance in relatively narrow domains.

🔍 To make "alignment research" a meaningful concept, we should think more concretely about which types of research are valuable for preventing existential risk from AGI. Two key properties stand out: a focus on worst-case behavior, and a scientific approach, specifically one aimed at understanding cognition rather than behavior.

🌐 A core barrier to aligning AGI is that we do not understand neural networks well enough. We should therefore support research that helps us understand deep learning in a principled way.

Published on August 19, 2024 8:42 PM GMT

A core distinction in AGI safety is between alignment and capabilities. However, I think this distinction is a very fuzzy one, which has led to a lot of confusion. In this post I’ll describe some of the problems with how people typically think about it, and offer a replacement set of definitions.

“Alignment” and “capabilities” are primarily properties of AIs not of AI research

The first thing to highlight is that the distinction between alignment and capabilities is primarily doing useful work when we think of them as properties of AIs. This distinction is still under-appreciated by the wider machine learning community. ML researchers have historically thought about performance of models almost entirely with respect to the tasks they were specifically trained on. However, the rise of LLMs has vindicated the alignment community’s focus on general capabilities, and now it’s much more common to assume that performance on many tasks (including out-of-distribution tasks) will improve roughly in parallel. This is a crucial assumption for thinking about risks from AGI.

Insofar as the ML community has thought about alignment, it has mostly focused on aligning models’ behavior to their training objectives. The possibility of neural networks aiming to achieve internally-represented goals is still not very widely understood, making it hard to discuss and study the reasons those goals might or might not be aligned with the values of (any given set of) humans. To be fair, the alignment community has caused some confusion by describing models as more or less “aligned”, rather than more or less “aligned to X” for some specified X. I’ll talk more about this confusion, and how we should address it, in a later post. But the core point is that AIs might develop internally-represented goals or values that we don’t like, and we should try to avoid that.

However, extending “alignment” and “capabilities” from properties of AIs to properties of different types of research is a fraught endeavor. It’s tempting to categorize work as alignment research to the extent that it can be used to make AIs more aligned (to many possible targets), and as capabilities research to the extent that it can be used to make AIs more capable. But this approach runs into (at least) three major problems.

Firstly, in general it’s very difficult to categorize research by its impacts. Great research often links together ideas from many different subfields, typically in ways that only become apparent throughout the course of the research. We see this in many historical breakthroughs which shed light on a range of different domains. For example, early physicists studying the motions of the stars eventually derived laws governing all earthly objects. Meanwhile Darwin’s study of barnacles and finches led him to principles governing the evolution of all life. Analogously, we should expect that big breakthroughs in our understanding of neural networks and deep learning would be useful in many different ways.

More concretely, there are many cases where central examples of alignment research have advanced, or plausibly will advance, AI capabilities to a significant extent.

Secondly, not only is it difficult to predict the effects of any given piece of research, it's also difficult for alignment researchers to agree on which effects are good or bad. This is because there are deep disagreements in the field about the likelihoods of different threat models. The more difficult you think the alignment problem is, the more likely you are to consider most existing research useless or actively harmful (as Yudkowsky does). By contrast, Christiano has written in defense of the usefulness of RLHF research and language model agents; and many alignment researchers who are more closely linked to mainstream ML have even broader views of what research is valuable from an alignment perspective.

Thirdly, for people concerned about AGI, most ML research should count as neither alignment nor capabilities—because it focuses on improving model performance in relatively narrow domains, in a way that is unlikely to generalize very far. I'll call this type of research applications research. The ubiquity of applications research (e.g. across the many companies that are primarily focused on building ML products) makes some statistics that have been thrown around about the relative numbers of alignment researchers versus capabilities researchers (e.g. here and here) very misleading.

What research is valuable for preventing misalignment?

In order to make “alignment research” a meaningful concept, we should therefore think more concretely about what types of research we expect to be valuable for preventing existential risk from misaligned AGI. I’ll focus on two key properties:

Valuable property 1: worst-case focus

Most ML research focuses on improving the average performance of models (whether on a narrow set of tasks or a broad range of tasks). By contrast, alignment researchers are primarily interested in preventing models’ worst-case misbehavior, which may arise very rarely (and primarily in situations where models expect it won’t be detected). While there’s sometimes overlap between work on improving average-case behavior and work on improving worst-case behavior, in general we should expect them to look fairly different.
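To make the contrast concrete, here is a minimal toy sketch in Python (the scoring function and all numbers are invented for illustration, not taken from this post): when failures are rare, averaging over inputs can almost completely hide exactly the behavior that a worst-case analysis is designed to surface.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for "how acceptably the model behaves on input x":
# almost all inputs are handled fine, but a rare tail is handled very badly.
def toy_behavior_score(x: float) -> float:
    return 0.95 if x < 0.999 else -1.0

inputs = rng.random(100_000)
scores = np.array([toy_behavior_score(x) for x in inputs])

# Average-case evaluation barely registers the rare failure mode...
print("average-case score:", scores.mean())
# ...while worst-case evaluation is dominated by it.
print("worst-case score:  ", scores.min())
```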

We see a similar dynamic play out in cybersecurity (as highlighted by Yudkowsky’s writing on security mindset). In an ideal world, we could classify most software engineering work as cybersecurity work, because security would be built into the design by default. But in practice, creating highly secure systems requires a different skillset from regular software engineering, and typically doesn’t happen unless it’s some team’s main priority. Similarly, even if highly principled capabilities research would in theory help address alignment problems, in practice there's a lot of pressure to trade off worst-case performance for average-case performance.

These pressures are exacerbated by the difficulty of addressing worst-case misbehavior even in current models. Its rarity makes it hard to characterize or study. Adversarial methods (like red-teaming) can find some examples, but these methods are bottlenecked by the capabilities of the adversary. A more principled approach would involve formally verifying safety properties, but formally specifying or verifying non-trivial properties would require significant research breakthroughs. The extent to which techniques for eliciting and addressing worst-case misbehavior of existing models will be helpful for more capable models is an open question.
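As a rough illustration of the bottleneck mentioned above, here is a hedged sketch of an adversarial elicitation loop (both callables are hypothetical placeholders, not a real API or any lab's actual method): the search can only surface failures that the attack generator happens to propose, so its output is a lower bound on how badly the model can misbehave rather than a guarantee.

```python
from typing import Callable, Iterable, Tuple

def red_team_search(
    behavior_score: Callable[[str], float],           # hypothetical: lower = worse behavior
    propose_attacks: Callable[[int], Iterable[str]],  # hypothetical adversary / attack generator
    budget: int = 1_000,
) -> Tuple[str, float]:
    """Return the worst behavior the adversary can elicit within its budget.

    The result is limited by the adversary's capabilities: prompts it never
    proposes are never tested, so rarer or cleverer failure modes are missed.
    """
    worst_prompt, worst_score = "", float("inf")
    for prompt in propose_attacks(budget):
        score = behavior_score(prompt)
        if score < worst_score:
            worst_prompt, worst_score = prompt, score
    return worst_prompt, worst_score
```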

Valuable property 2: scientific approach

Here’s another framing: a core barrier to aligning AGI is that we don’t understand neural networks well enough to say many meaningful things about how they function. So we should support research that helps us understand deep learning in a principled way. We can view this as a distinction between science and engineering: engineering aims primarily to make things work, science aims primarily to understand how they work. (This is related to Nate Soares’ point in this post.)

Thinking of AGI alignment as being driven by fundamental science highlights that big breakthroughs are likely to be relatively simple and easy to recognize—more like new theories of physics that make precise, powerful predictions than a big complicated codebase that we need to scrutinize line-by-line. This makes me optimistic about automating it in a way that humans can verify.

However, “trying to scientifically understand deep learning” is too broad a criterion to serve as a proxy for whether research will be valuable from an alignment perspective. For example, I expect that most work on scientifically understanding optimizers will primarily be useful for designing better optimizers, rather than understanding the models that result from the optimization process. So can we be more precise about what aspect of a “scientific approach” is valuable for alignment? My contention: a good proxy is the extent to which the research focuses on understanding cognition rather than behavior—i.e. the extent to which it takes a cognitivist approach rather than a behaviorist approach.

Some background on this terminology: the distinction between behaviorism and cognitivism comes from the history of psychology. In the mid-1900s, behaviorists held that the internal mental states of humans and animals couldn't be studied scientifically, and therefore that the only scientifically meaningful approach was to focus on describing patterns of behavior. The influential behaviorist B. F. Skinner experimented with rewarding and punishing animal behavior, which eventually led to the modern field of reinforcement learning. However, the philosophical commitments of behaviorism became increasingly untenable. In the field of ethology, which studies animal behavior, researchers like Jane Goodall and Frans de Waal uncovered sophisticated behaviors inconsistent with viewing animals as pure reinforcement learners. In linguistics, Chomsky wrote a scathing critique of Skinner's 1957 book Verbal Behavior, which characterized language as a set of stimulus-response patterns; Chomsky argued that this couldn't account for human generalization to a very wide range of novel sentences. Eventually, psychology moved towards a synthesis in which the study of behavior was paired with the study of cognition.

ML today is analogous to psychology in the 1950s. Most ML researchers are behaviorists with respect to studying AIs. They focus on how training data determines behavior, and assume that AI behavior is driven by “bundles of heuristics” except in the cases where it’s demonstrated otherwise. This makes sense on narrow tasks, where it’s possible to categorize and study different types of behavior. But when models display consistent types of behavior across many different tasks, it becomes increasingly difficult to predict that behavior without reference to the underlying cognition going on inside the models. (Indeed, this observation can be seen as the core of the alignment problem: we can’t deduce internal motivations from external behavior.)

ML researchers often shy away from studying model cognition, because the methodology involved is often less transparent and less reproducible than simply studying behavior. This is analogous to how early ethologists who studied animals “in the wild” were disparaged for using unrigorous qualitative methodologies. However, they gradually collected many examples of sophisticated behavior (including tool use, power struggles, and cultural transmission) which eventually provided much more insight than narrow, controlled experiments performed in labs.

Similarly, I expect that the study of the internal representations of neural networks will gradually accumulate more and more interesting data points, spark more and more concrete hypotheses, and eventually provide us with principles for understanding neural networks’ real-world behavior that are powerful enough to generalize even to very intelligent agents in very novel situations.
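To gesture at the methodological difference between the two approaches, here is a minimal sketch (the function names, data shapes, and the choice of a linear probe are my own illustrative assumptions, not the author's): a behaviorist analysis scores only input/output behavior, while one common cognitivist-style tool asks what is decodable from the model's internal activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def behaviorist_eval(outputs: np.ndarray, targets: np.ndarray) -> float:
    """Behaviorist-style analysis: compare outputs against reference answers."""
    return float(np.mean(outputs == targets))

def cognitivist_probe(hidden_states: np.ndarray, concept_labels: np.ndarray):
    """Cognitivist-style analysis: fit a linear probe on internal activations
    (shape [n_examples, d_model]) to test whether some concept of interest is
    linearly decodable from the representations, independently of outputs."""
    probe = LogisticRegression(max_iter=1_000)
    probe.fit(hidden_states, concept_labels)
    return probe, probe.score(hidden_states, concept_labels)
```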

A better definition of alignment research

We can combine the two criteria above to give us a two-dimensional categorization of different types of AI research. In the table below, I give central examples of each type (with the pessimistic-case category being an intermediate step between average-case and worst-case):

|                     | Average-case                   | Pessimistic-case             | Worst-case             |
|---------------------|--------------------------------|------------------------------|------------------------|
| Engineering         | Scaling                        | RLHF                         | Adversarial robustness |
| Behaviorist science | Optimization science           | Scalable oversight           | AI control             |
| Cognitivist science | Concept-based interpretability | Mechanistic interpretability | Agent foundations      |

There’s obviously a lot of research that I’ve skipped over, and exactly where each subfield should be placed is inherently subjective. (For example, there’s been a lot of rigorous scientific research on adversarial examples, but in practice it seems like the best mitigations are fairly hacky and unprincipled, which is why I put adversarial robustness in the “engineering” row.) But I nevertheless think that these are valuable categories for organizing our thinking.

We could just leave it here. But in practice, I don’t think people will stop talking about “alignment research” and “capabilities research”, and I’d like to have some definition of each that doesn’t feel misguided. So, going forward, I’ll define alignment research and capabilities research as research that’s close to the bottom-right and top-left corners respectively. This defines a spectrum between them; I’d like more researchers to move towards the alignment end of the spectrum. (Though recall, as I noted above, that most ML research is neither alignment nor capabilities research, but instead applications research.)

Lastly, I expect all of this to change over time. For example, the central examples of adversarial attacks used to be cases where image models gave bad classifications. Now we also have many examples of jailbreaks which make language models ignore developers’ instructions. In the future, central examples of adversarial attacks will be ones which make models actively try to cause harmful outcomes. So I hope that eventually different research directions near the bottom-right corner of the table above will unify into a rigorous science studying artificial values. And in the very long term, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds. For now, though, when I talk about promoting alignment, I mean that I’m trying to advance worst-case and/or cognitivist AI research.



