We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap

Published on September 19, 2024 10:22 PM GMT

Background: “Learning” vs “Learning About”

Adaptive systems, reinforcement “learners”, etc., “learn” in the sense that their behavior adapts to their environment.

Bayesian reasoners, human scientists, etc., “learn” in the sense that they have some symbolic representation of the environment, and they update those symbols over time to (hopefully) better match the environment (i.e. make the map better match the territory).

These two kinds of “learning” are not synonymous[1]. Adaptive systems “learn” things, but they don’t necessarily “learn about” things; they don’t necessarily have an internal map of the external territory. (Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors “learned” by an adaptive system are not necessarily “about” any particular external thing, and don’t necessarily represent any particular external thing[2].

We Humans Learn About Our Values

“I thought I wanted X, but then I tried it and it was pretty meh.”

“For a long time I pursued Y, but now I think that was more a social script than my own values.”

“As a teenager, I endorsed the view that Z is the highest objective of human existence. … Yeah, it’s a bit embarrassing in hindsight.”

The ubiquity of these sorts of sentiments is the simplest evidence that we do not typically know our own values[3]. Rather, people often (but not always) have some explicit best guess at their own values, and that guess updates over time - i.e. we can learn about our own values.

Note the wording here: we’re not just saying that human values are “learned” in the more general sense of reinforcement learning. We’re saying that we humans have some internal representation of our own values, a “map” of our values, and we update that map in response to evidence. Look again at the examples at the beginning of this section:

Notice that the wording of each example involves beliefs about values. They’re not just saying “I used to feel urge X, but now I feel urge Y”. They’re saying “I thought I wanted X” - a belief about a value! Or “now I think that was more a social script than my own values” - again, a belief about my own values, and how those values relate to my (previous) behavior. Or “I endorsed the view that Z is the highest objective” - an explicit endorsement of a belief about values. That’s how we normally, instinctively reason about our own values. And sure, we could reword everything to avoid talking about our beliefs about values - “learning” is more general than “learning about” - but the fact that it makes sense to us to talk about our beliefs about values is strong evidence that something in our heads in fact works like beliefs about values, not just reinforcement-style “learning”.

Two Puzzles

Puzzle 1: Learning About Our Own Values vs The Is-Ought Gap

Very roughly speaking, an agent could aim to pursue any values regardless of what the world outside it looks like; “how the external world is” does not tell us “how the external world should be”. So when we “learn about” values, where does the evidence about values come from? How do we cross the is-ought gap?

Puzzle 2: The Role of Reward/Reinforcement

It does seem like humans have some kind of physiological “reward”, in a hand-wavy reinforcement-learning-esque sense, which seems to at least partially drive the subjective valuation of things. (Something something mesocorticolimbic circuit.) For instance, Steve Byrnes claims that reward is a signal from which human values are learned/reinforced within lifetime[4]. But clearly the reward signal is not itself our values. So what’s the role of the reward, exactly? What kind of reinforcement does it provide, and how does it relate to our beliefs about our own values?

Using Each Puzzle To Solve The Other

Put these two together, and a natural guess is: reward is the evidence from which we learn about our values. Reward is used just like ordinary epistemic evidence - so when we receive rewards, we update our beliefs about our own values just like we update our beliefs about ordinary facts in response to any other evidence. Indeed, our beliefs-about-values can be integrated into the same system as all our other beliefs, allowing for e.g. ordinary factual evidence to become relevant to beliefs about values in some cases.
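
To make that concrete, here is a minimal sketch under a deliberately toy framing: treat “I value escamoles” as a latent binary hypothesis, treat a reward signal as ordinary evidence about it, and update by Bayes’ rule. The specific probabilities, and the framing of a value as a single binary variable, are assumptions for illustration rather than anything the model above commits to.

```python
# Toy sketch: a reward signal treated as ordinary epistemic evidence
# about a latent "I value X" hypothesis. All numbers are made up.

def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(hypothesis | evidence) via Bayes' rule."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1 - prior))

# Prior belief that I value escamoles, before ever tasting them.
p_value = 0.5

# Evidence: a pleasant reward signal while eating them. Assumed likelihoods:
# a pleasant reward is more probable in worlds where I really do value them.
p_reward_if_value = 0.8
p_reward_if_not = 0.3

p_value = bayes_update(p_value, p_reward_if_value, p_reward_if_not)
print(f"P(I value escamoles | pleasant reward) ≈ {p_value:.2f}")  # ≈ 0.73
```

The same machinery runs in reverse for a negative reward: an “ew, gross” signal is just evidence pointing the other way.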

What This Looks Like In Practice

I eat some escamoles for the first time, and my tongue/mouth/nose send me enjoyable reward signals. I update to believe that escamoles are good - i.e. I assign them high value.

Now someone comes along and tells me that escamoles are ant larvae. An unwelcome image comes into my head, of me eating squirming larvae. Ew, gross! That’s a negative reward signal - I downgrade my estimated value of escamoles.

Now another person comes along and tells me that escamoles are very healthy. I have already cached that "health" is valuable to me. That cache hit doesn’t generate a direct reward signal at all, but it does update my beliefs about the value of escamoles, to the extent that I believe this person. That update just routes through ordinary epistemic machinery, without a reward signal at all.
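
A sketch of that last step under the same toy framing (the trust parameter, the cached credence that I value health, and the ad-hoc value scale are all illustrative assumptions): a purely factual update shifts my estimated value of escamoles with no reward signal anywhere in the loop.

```python
# Toy sketch: a purely factual update (no reward signal anywhere) that still
# shifts my estimated value of escamoles, via a cached belief that health is
# valuable to me. All numbers and the scoring scheme are made up.

P_VALUE_HEALTH = 0.95   # cached credence that I value health
P_HEALTHY_PRIOR = 0.4   # prior credence that escamoles are especially healthy
TRUST = 0.8             # how much I trust this person's claim

# Crude testimony update: with probability TRUST I take the claim at face
# value, otherwise I fall back on my prior.
p_healthy = TRUST * 1.0 + (1 - TRUST) * P_HEALTHY_PRIOR

# Estimated value of escamoles: a baseline plus a health bonus, weighted by
# how likely escamoles are to be healthy and how sure I am that I value health.
BASELINE_VALUE = 0.3
HEALTH_BONUS = 0.4
value_estimate = BASELINE_VALUE + HEALTH_BONUS * p_healthy * P_VALUE_HEALTH

print(f"P(escamoles are healthy | testimony) ≈ {p_healthy:.2f}")       # ≈ 0.88
print(f"estimated value of escamoles         ≈ {value_estimate:.2f}")  # ≈ 0.63
```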

At some point I sit down and think about escamoles. Yeah, ants are kinda gross, but on reflection I don’t think I endorse that reaction to escamoles. I can see why my reward system would generate an “ew, gross” signal, but I model that reward as coming from one of two decoupled sources: either a hardcoded aversion to insects, or my actual values. I know that I am automatically averse to putting insects in my mouth, so it’s less likely that the negative reward is evidence of my values in this case; the signal is explained away, in the usual epistemic sense, by some cause other than my values. So, I partially undo the value-downgrade I had assigned to escamoles in response to the “ew, gross” reaction. I might still feel some disgust, but I consciously override that disgust to some extent.
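
That is just the standard explaining-away pattern from Bayesian networks. A minimal sketch, with the two candidate causes of the disgust signal modeled as independent parents of a noisy-OR node (all priors and noise parameters are made-up illustrative numbers):

```python
# Toy sketch of "explaining away" for the escamoles case. Two independent
# candidate causes of the "ew, gross" signal:
#   A = hardcoded aversion to putting insects in my mouth
#   V = I genuinely disvalue escamoles
# D = the negative-reward ("ew, gross") signal, modeled as a noisy-OR of A and V.
# All probabilities are made up for illustration.
from itertools import product

P_A = 0.5    # prior that the hardwired aversion is present and active
P_V = 0.3    # prior that I genuinely disvalue escamoles
LEAK = 0.05  # chance of a disgust signal even with neither cause present

def p_disgust(a: bool, v: bool) -> float:
    """Noisy-OR: each present cause independently fails to fire with some probability."""
    p_not_fire = 1 - LEAK
    if a:
        p_not_fire *= (1 - 0.8)
    if v:
        p_not_fire *= (1 - 0.7)
    return 1 - p_not_fire

def joint(a: bool, v: bool) -> float:
    """Joint probability of (A=a, V=v, D=True)."""
    return (P_A if a else 1 - P_A) * (P_V if v else 1 - P_V) * p_disgust(a, v)

def posterior_v(keep) -> float:
    """P(V=True | D=True and extra condition), where `keep` filters (a, v) assignments."""
    num = sum(joint(a, v) for a, v in product([False, True], repeat=2) if v and keep(a, v))
    den = sum(joint(a, v) for a, v in product([False, True], repeat=2) if keep(a, v))
    return num / den

# Seeing the disgust signal alone raises my credence that I disvalue escamoles...
print(f"P(V | disgust)           ≈ {posterior_v(lambda a, v: True):.2f}")  # ≈ 0.45
# ...but once I also condition on the hardwired aversion, the signal is
# explained away and my credence drops back toward the 0.30 prior.
print(f"P(V | disgust, aversion) ≈ {posterior_v(lambda a, v: a):.2f}")     # ≈ 0.33
```

With these numbers, seeing the disgust signal raises P(I disvalue escamoles) from a prior of 0.30 to about 0.45, but also conditioning on the known hardwired aversion pulls it back down to about 0.33: the aversion absorbs most of the credit for the signal.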

That last example is particularly interesting, since it highlights a nontrivial prediction of this model. Insofar as reward is treated as evidence about values, and our beliefs about values update in the ordinary epistemic manner, we should expect all the typical phenomena of epistemic updating to carry over to learning about our values. Explaining-away is one such phenomenon. What do other standard epistemic phenomena look like, when carried over to learning about values using reward as evidence?

[Image: Escamoles]
  1. ^
  2. ^

    Of course sometimes a Bayesian reasoner’s beliefs are not “about” any particular external thing either, because the reasoner is so thoroughly confused that it has beliefs about things which don’t exist - like e.g. beliefs about the current location of my dog. I don’t have a dog. But unlike the adaptive system case, for a Bayesian reasoner such confusion is generally considered a failure of some sort.

  3. ^

    Note that, in treating these sentiments as evidence that we don’t know our own values, we’re using stated values as a proxy measure for values. When we talk about a human’s “values”, we are notably not talking about:

      - The human’s stated preferences
      - The human’s revealed preferences
      - The human’s in-the-moment experience of bliss or dopamine or whatever
      - <whatever other readily-measurable notion of “values” springs to mind>

    The thing we’re talking about, when we talk about a human’s “values”, is a thing internal to the human’s mind. It’s a high-level cognitive structure. Things like stated preferences, revealed preferences, etc are useful proxies for human values, but those proxies importantly break down in various cases - even in day-to-day life.

  4. ^

    Steve uses the phrase “model-based RL” here, though that’s a pretty vague term and I’m not sure how well the usual usage fits Steve’s picture.


