Is weak-to-strong generalization an alignment technique?

The article examines the relationship between weak-to-strong generalization (W2SG) and human oversight of superhuman AI systems. It surveys several definitions of scalable oversight, presents W2SG as a model of the scalable oversight problem, and questions the common framing of the two as distinct, complementary methods.

🧐 W2SG was proposed as an analogy for humans supervising superhuman AI systems

🎯 Scalable oversight has several definitions that point at a similar challenge

🤔 W2SG and scalable oversight are often treated as distinct, complementary methods

❓ The post asks whether W2SG is an alignment technique, and raises related questions

Published on January 31, 2025 7:13 AM GMT

Weak-to-strong generalization (W2SG) was initially presented as an analogy for human supervision of superhuman AI systems. From the paper:

We propose a simple setup for studying the problem of humans supervising superhuman models by considering an analogy: can we use weak models to supervise strong models?

"Humans supervising superhuman models" sounds a lot like achieving scalable oversight, which has at times been defined as:

Despite variation in these definitions, they point at a similar challenge: that of inducing desirable behavior in an AI system when the ability to grade its behavior is limited in quantity or quality.[1] We might say that there is the problem of scalable oversight, and then there are methods that seek to address this problem, which could be termed scalable oversight methods. Informally, these are "methods for humans supervising superhuman models."

So, combining the motivation for W2SG with this definition of scalable oversight, we could then say: W2SG is a model of the problem of scalable oversight. The purpose of the model is to facilitate the development of scalable oversight methods.
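
To make "a model of the problem" concrete, here is a minimal sketch of the W2SG experimental setup, with small scikit-learn classifiers standing in for the weak supervisor and strong student. The dataset, model choices, and split sizes are illustrative assumptions; only the overall recipe and the performance-gap-recovered (PGR) metric follow the original paper.

```python
# Sketch of the W2SG setup: weak supervisor -> weak labels -> strong student,
# compared against a strong model trained on ground truth ("ceiling").
# All models and data here are toy stand-ins, not the paper's configuration.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=20, n_informative=10,
                           random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=1000, random_state=0)
X_stu, X_test, y_stu, y_test = train_test_split(X_rest, y_rest, test_size=2000, random_state=0)

# 1. "Weak supervisor": a deliberately limited model trained on ground truth.
weak = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_sup, y_sup)

# 2. Weak labels: the supervisor's imperfect predictions on fresh data.
weak_labels = weak.predict(X_stu)

# 3. "Strong student": a more capable model trained only on the weak labels.
student = GradientBoostingClassifier(random_state=0).fit(X_stu, weak_labels)

# 4. "Strong ceiling": the same strong model trained on ground-truth labels.
ceiling = GradientBoostingClassifier(random_state=0).fit(X_stu, y_stu)

acc_weak = weak.score(X_test, y_test)
acc_w2s = student.score(X_test, y_test)
acc_ceiling = ceiling.score(X_test, y_test)

# Performance gap recovered: the fraction of the weak-to-ceiling gap that the
# weakly supervised student closes.
pgr = (acc_w2s - acc_weak) / (acc_ceiling - acc_weak)
print(f"weak={acc_weak:.3f}  weak-to-strong={acc_w2s:.3f}  "
      f"ceiling={acc_ceiling:.3f}  PGR={pgr:.2f}")
```

The question the analogy is meant to probe is whether the student recovers more of that gap than naive imitation of the weak labels would suggest, i.e. whether it generalizes beyond its supervisor's errors.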

But wait!

This does not seem to be how people use these terms. Instead, W2SG and scalable oversight are discussed as different and complementary types of methods! For example, this post calls them "two approaches to addressing weak supervision," offering the following definitions:

Interpreted narrowly,[2] SO-R excludes many techniques that might qualify under the broader definitions given above. Notably, it appears[3] to exclude that post's definition of weak-to-strong generalization (W2SG-R), even though W2SG-R would arguably qualify as scalable oversight under the definitions above.

Similarly, this post asks "how can we combine W2SG with scalable oversight?" and talks about W2SG as an alternative alignment method.

In fact, even the original W2SG paper refers to W2SG as a kind of technique in appendix G, although there is no mention of what such a technique might look like, concretely.

Our plan for aligning superintelligence is a work in progress, but we believe that weak-to-strong techniques could serve as a key ingredient. In this section we sketch several illustrative possibilities for how we could use weak-to-strong generalization to help align superintelligent systems.
...

We will align this model using our most scalable techniques available, e.g. RLHF (Christiano et al., 2017; Ouyang et al., 2022), constitutional AI (Bai et al., 2022b), scalable oversight (Saunders et al., 2022; Bowman et al., 2022), adversarial training, or—the focus of this paper—weak-to-strong generalization techniques.
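
For a sense of what a concrete "weak-to-strong generalization technique" might look like, the method studied most in the body of the paper is an auxiliary confidence loss: the strong student is trained partly against the weak labels and partly against its own hardened predictions, with the self-supervised weight ramped up over training. The sketch below is a simplified rendering of that idea in PyTorch; the paper uses a thresholded hardening rule and a particular weighting schedule, so treat this as an approximation rather than the paper's implementation.

```python
# Simplified sketch of an auxiliary confidence loss for W2SG: the student is
# pulled toward the weak labels and toward its own hardened predictions.
# The hardening rule (plain argmax) and fixed alpha are simplifying assumptions.
import torch
import torch.nn.functional as F

def aux_confidence_loss(student_logits: torch.Tensor,
                        weak_labels: torch.Tensor,
                        alpha: float) -> torch.Tensor:
    # Imitation term: cross-entropy against the weak supervisor's labels.
    ce_weak = F.cross_entropy(student_logits, weak_labels)
    # Confidence term: cross-entropy against the student's own hardened
    # (argmax, gradient-stopped) predictions.
    hardened = student_logits.detach().argmax(dim=-1)
    ce_self = F.cross_entropy(student_logits, hardened)
    return (1.0 - alpha) * ce_weak + alpha * ce_self

# Usage: alpha is typically ramped up from 0 during training, so the student
# starts by imitating the weak labels and only gradually trusts itself.
logits = torch.randn(8, 2, requires_grad=True)   # student logits for a batch
weak = torch.randint(0, 2, (8,))                  # weak supervisor's labels
loss = aux_confidence_loss(logits, weak, alpha=0.5)
loss.backward()
```

Note that this intervenes in the student's training objective rather than in the labels themselves, which is part of why it reads as something distinct from scalable oversight when the latter is construed narrowly.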

Referring to "scalable oversight" and "W2SG" as alignment strategies feels like a type error to me, at least according to my preferred definitions. But I'm not interested in debating definitions.

What I want to know is...

    1. Is weak-to-strong generalization an alignment technique?
    2. What even is a W2SG technique?[4]
    3. If a strong model is not deceptively aligned (an assumption), why would it be beneficial to use a weak model to generate labels instead of using the strong model?
  1. ^

    Note: definition SO-4 is broader in that it includes evaluation, not just training.

  2. ^

    The narrow interpretation of SO-R: "scalable oversight methods are methods which intervene only via the labels(/grades) used to train the (primary) AI."

  3. ^

    It's not clear that SO-R actually excludes W2SG-R. A W2SG-R method generates labels to train the student to "generalize better," which would seem to qualify as "making the supervision signal stronger."

  4. ^

    The posts linked above mention combinations of scalable oversight and W2SG, but once the two are combined, it's not clear what the "W2SG" part is. In some cases, W2SG seems to refer merely to the hope that a powerful model will generalize from imperfectly-labeled data, e.g.:

    One unappealing aspect of W2SG is that when we actually apply it to tasks we can’t evaluate and don’t have ground truth for, we are to some extent taking a “leap of faith on generalization.”



