We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

Published on September 19, 2024 10:22 PM GMT

Background: “Learning” vs “Learning About”

Adaptive systems, reinforcement “learners”, etc., “learn” in the sense that their behavior adapts to their environment.

Bayesian reasoners, human scientists, etc., “learn” in the sense that they have some symbolic representation of the environment, and they update those symbols over time to (hopefully) better match the environment (i.e. make the map better match the territory).

These two kinds of “learning” are not synonymous[1]. Adaptive systems “learn” things, but they don’t necessarily “learn about” things; they don’t necessarily have an internal map of the external territory. (Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors “learned” by an adaptive system are not necessarily “about” any particular external thing, and don’t necessarily represent any particular external thing[2].

We Humans Learn About Our Values

“I thought I wanted X, but then I tried it and it was pretty meh.”

“For a long time I pursued Y, but now I think that was more a social script than my own values.”

“As a teenager, I endorsed the view that Z is the highest objective of human existence. … Yeah, it’s a bit embarrassing in hindsight.”

The ubiquity of these sorts of sentiments is the simplest evidence that we do not typically know our own values[3]. Rather, people often (but not always) have some explicit best guess at their own values, and that guess updates over time - i.e. we can learn about our own values.

Note the wording here: we’re not just saying that human values are “learned” in the more general sense of reinforcement learning. We’re saying that we humans have some internal representation of our own values, a “map” of our values, and we update that map in response to evidence. Look again at the examples at the beginning of this section:

Notice that the wording of each example involves beliefs about values. They’re not just saying “I used to feel urge X, but now I feel urge Y”. They’re saying “I thought I wanted X” - a belief about a value! Or “now I think that was more a social script than my own values” - again, a belief about my own values, and how those values relate to my (previous) behavior. Or “I endorsed the view that Z is the highest objective” - an explicit endorsement of a belief about values. That’s how we normally, instinctively reason about our own values. And sure, we could reword everything to avoid talking about our beliefs about values - “learning” is more general than “learning about” - but the fact that it makes sense to us to talk about our beliefs about values is strong evidence that something in our heads in fact works like beliefs about values, not just reinforcement-style “learning”.

Two Puzzles

Puzzle 1: Learning About Our Own Values vs The Is-Ought Gap

Very roughly speaking, an agent could aim to pursue any values regardless of what the world outside it looks like; “how the external world is” does not tell us “how the external world should be”. So when we “learn about” values, where does the evidence about values come from? How do we cross the is-ought gap?

Puzzle 2: The Role of Reward/Reinforcement

It does seem like humans have some kind of physiological “reward”, in a hand-wavy reinforcement-learning-esque sense, which seems to at least partially drive the subjective valuation of things. (Something something mesocorticolimbic circuit.) For instance, Steve Byrnes claims that reward is a signal from which human values are learned/reinforced within lifetime[4]. But clearly the reward signal is not itself our values. So what’s the role of the reward, exactly? What kind of reinforcement does it provide, and how does it relate to our beliefs about our own values?

Using Each Puzzle To Solve The Other

Put these two together, and a natural guess is: reward is the evidence from which we learn about our values. Reward is used just like ordinary epistemic evidence - so when we receive rewards, we update our beliefs about our own values just like we update our beliefs about ordinary facts in response to any other evidence. Indeed, our beliefs-about-values can be integrated into the same system as all our other beliefs, allowing for e.g. ordinary factual evidence to become relevant to beliefs about values in some cases.
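
To make that concrete, here is a minimal sketch under a deliberately toy framing: treat “I value escamoles” as a latent binary hypothesis, treat a reward signal as ordinary evidence about it, and update by Bayes’ rule. The specific probabilities, and the framing of a value as a single binary variable, are assumptions for illustration rather than anything the model above commits to.

```python
# Toy sketch: a reward signal treated as ordinary epistemic evidence
# about a latent "I value X" hypothesis. All numbers are made up.

def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(hypothesis | evidence) via Bayes' rule."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1 - prior))

# Prior belief that I value escamoles, before ever tasting them.
p_value = 0.5

# Evidence: a pleasant reward signal while eating them. Assumed likelihoods:
# a pleasant reward is more probable in worlds where I really do value them.
p_reward_if_value = 0.8
p_reward_if_not = 0.3

p_value = bayes_update(p_value, p_reward_if_value, p_reward_if_not)
print(f"P(I value escamoles | pleasant reward) ≈ {p_value:.2f}")  # ≈ 0.73
```

The same machinery runs in reverse for a negative reward: an “ew, gross” signal is just evidence pointing the other way.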

What This Looks Like In Practice

I eat some escamoles for the first time, and my tongue/mouth/nose send me enjoyable reward signals. I update to believe that escamoles are good - i.e. I assign them high value.

Now someone comes along and tells me that escamoles are ant larvae. An unwelcome image comes into my head, of me eating squirming larvae. Ew, gross! That’s a negative reward signal - I downgrade my estimated value of escamoles.

Now another person comes along and tells me that escamoles are very healthy. I have already cached that "health" is valuable to me. That cache hit doesn’t generate a direct reward signal at all, but it does update my beliefs about the value of escamoles, to the extent that I believe this person. That update just routes through ordinary epistemic machinery, without a reward signal at all.
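
A sketch of that last step under the same toy framing (the trust parameter, the cached credence that I value health, and the ad-hoc value scale are all illustrative assumptions): a purely factual update shifts my estimated value of escamoles with no reward signal anywhere in the loop.

```python
# Toy sketch: a purely factual update (no reward signal anywhere) that still
# shifts my estimated value of escamoles, via a cached belief that health is
# valuable to me. All numbers and the scoring scheme are made up.

P_VALUE_HEALTH = 0.95   # cached credence that I value health
P_HEALTHY_PRIOR = 0.4   # prior credence that escamoles are especially healthy
TRUST = 0.8             # how much I trust this person's claim

# Crude testimony update: with probability TRUST I take the claim at face
# value, otherwise I fall back on my prior.
p_healthy = TRUST * 1.0 + (1 - TRUST) * P_HEALTHY_PRIOR

# Estimated value of escamoles: a baseline plus a health bonus, weighted by
# how likely escamoles are to be healthy and how sure I am that I value health.
BASELINE_VALUE = 0.3
HEALTH_BONUS = 0.4
value_estimate = BASELINE_VALUE + HEALTH_BONUS * p_healthy * P_VALUE_HEALTH

print(f"P(escamoles are healthy | testimony) ≈ {p_healthy:.2f}")       # ≈ 0.88
print(f"estimated value of escamoles         ≈ {value_estimate:.2f}")  # ≈ 0.63
```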

At some point I sit down and think about escamoles. Yeah, ants are kinda gross, but on reflection I don’t think I endorse that reaction to escamoles. I can see why my reward system would generate an “ew, gross” signal, but I model that reward as coming from one of two decoupled sources: either a hardcoded aversion to insects, or my actual values. I know that I am automatically averse to putting insects in my mouth, so it’s less likely that the negative reward is evidence of my values in this case; the signal is explained away, in the usual epistemic sense, by some cause other than my values. So, I partially undo the value-downgrade I had assigned to escamoles in response to the “ew, gross” reaction. I might still feel some disgust, but I consciously override that disgust to some extent.
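
That is just the standard explaining-away pattern from Bayesian networks. A minimal sketch, with the two candidate causes of the disgust signal modeled as independent parents of a noisy-OR node (all priors and noise parameters are made-up illustrative numbers):

```python
# Toy sketch of "explaining away" for the escamoles case. Two independent
# candidate causes of the "ew, gross" signal:
#   A = hardcoded aversion to putting insects in my mouth
#   V = I genuinely disvalue escamoles
# D = the negative-reward ("ew, gross") signal, modeled as a noisy-OR of A and V.
# All probabilities are made up for illustration.
from itertools import product

P_A = 0.5    # prior that the hardwired aversion is present and active
P_V = 0.3    # prior that I genuinely disvalue escamoles
LEAK = 0.05  # chance of a disgust signal even with neither cause present

def p_disgust(a: bool, v: bool) -> float:
    """Noisy-OR: each present cause independently fails to fire with some probability."""
    p_not_fire = 1 - LEAK
    if a:
        p_not_fire *= (1 - 0.8)
    if v:
        p_not_fire *= (1 - 0.7)
    return 1 - p_not_fire

def joint(a: bool, v: bool) -> float:
    """Joint probability of (A=a, V=v, D=True)."""
    return (P_A if a else 1 - P_A) * (P_V if v else 1 - P_V) * p_disgust(a, v)

def posterior_v(keep) -> float:
    """P(V=True | D=True and extra condition), where `keep` filters (a, v) assignments."""
    num = sum(joint(a, v) for a, v in product([False, True], repeat=2) if v and keep(a, v))
    den = sum(joint(a, v) for a, v in product([False, True], repeat=2) if keep(a, v))
    return num / den

# Seeing the disgust signal alone raises my credence that I disvalue escamoles...
print(f"P(V | disgust)           ≈ {posterior_v(lambda a, v: True):.2f}")  # ≈ 0.45
# ...but once I also condition on the hardwired aversion, the signal is
# explained away and my credence drops back toward the 0.30 prior.
print(f"P(V | disgust, aversion) ≈ {posterior_v(lambda a, v: a):.2f}")     # ≈ 0.33
```

With these numbers, seeing the disgust signal raises P(I disvalue escamoles) from a prior of 0.30 to about 0.45, but also conditioning on the known hardwired aversion pulls it back down to about 0.33: the aversion absorbs most of the credit for the signal.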

That last example is particularly interesting, since it highlights a nontrivial prediction of this model. Insofar as reward is treated as evidence about values, and our beliefs about values update in the ordinary epistemic manner, we should expect all the typical phenomena of epistemic updating to carry over to learning about our values. Explaining-away is one such phenomenon. What do other standard epistemic phenomena look like, when carried over to learning about values using reward as evidence?

[Image: Escamoles]
  1. ^
  2. ^

    Of course sometimes a Bayesian reasoner’s beliefs are not “about” any particular external thing either, because the reasoner is so thoroughly confused that it has beliefs about things which don’t exist - like e.g. beliefs about the current location of my dog. I don’t have a dog. But unlike the adaptive system case, for a Bayesian reasoner such confusion is generally considered a failure of some sort.

  3. ^

    Note that, in treating these sentiments as evidence that we don’t know our own values, we’re using stated values as a proxy measure for values. When we talk about a human’s “values”, we are notably not talking about:

      - The human’s stated preferences
      - The human’s revealed preferences
      - The human’s in-the-moment experience of bliss or dopamine or whatever
      - <whatever other readily-measurable notion of “values” springs to mind>

    The thing we’re talking about, when we talk about a human’s “values”, is a thing internal to the human’s mind. It’s a high-level cognitive structure. Things like stated preferences, revealed preferences, etc are useful proxies for human values, but those proxies importantly break down in various cases - even in day-to-day life.

  4. ^

    Steve uses the phrase “model-based RL” here, though that’s a pretty vague term and I’m not sure how well the usual usage fits Steve’s picture.


