Astral Codex Ten Podcast feed, 2024-07-17
Highlights From The Comments On Motivated Reasoning And Reinforcement Learning

This post explores how the brain learns and acts. The author argues that the brain does not learn through reinforcement learning, but instead learns to predict value and acts on the gradient of predicted value. Examples in the post show that we remember events that affected us strongly, whether positively or negatively, far better than ordinary ones, which leave little lasting impression. The post also suggests that the theory could be tested experimentally, by observing how subjects respond to stimuli of differing value to see how the brain learns and acts on value.

🧠 **The brain learns value, not reinforcement:** The author argues that the brain does not learn through reinforcement learning, but by predicting value and acting on the gradient of predicted value. It is as if the brain builds a map of value, learning to predict the value of various things and events, and then uses that map to decide what to do, choosing the higher-value option.

⚡️ **Memory is biased toward high-value events:** Because what the brain learns is value, we remember events that affected us strongly, whether positive or negative, far better than ordinary ones, which leave little impression. High-value events leave deeper memory traces.

🧪 **Experimental test:** The post suggests the theory can be tested experimentally. For example, subjects could be run through a series of trials in which different stimuli are shown and rewarded or penalized according to their value; observing how subjects respond to the different stimuli would show how the brain learns and acts on value.

💡 **Applying the value map:** Understanding how the brain learns value can help us better understand human behavior, with applications in fields such as education, psychotherapy, and artificial intelligence.

🧠 **Value prediction and gradient-based action:** The author argues that the brain predicts value and acts on the gradient of predicted value, a more flexible and efficient mechanism for learning and acting that helps us adapt to complex environments. A minimal code sketch of this idea follows below.
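As a rough illustration of the value-map idea summarized above, here is a minimal Python sketch: a learner that stores a predicted value for each stimulus, nudges each prediction toward observed outcomes of either sign, and acts by choosing whichever available option it predicts to be most valuable. The class, the learning rule, the "grey rock" stimulus, and the learning rate are invented for this example and do not come from the post.

```python
from collections import defaultdict

class ValueMapLearner:
    def __init__(self, learning_rate=0.3):
        self.values = defaultdict(float)  # predicted value per stimulus ("the map")
        self.lr = learning_rate

    def learn(self, stimulus, outcome):
        """Nudge the predicted value toward the observed outcome (any sign, any size)."""
        self.values[stimulus] += self.lr * (outcome - self.values[stimulus])

    def act(self, options):
        """Act on the map: pick the available option with the highest predicted value."""
        return max(options, key=lambda s: self.values[s])

learner = ValueMapLearner()
for _ in range(20):
    learner.learn("ripe banana", +1.0)   # strongly positive event
    learner.learn("hungry lion", -1.0)   # strongly negative event
    learner.learn("grey rock", 0.0)      # mundane event, near-zero value

print(learner.act(["grey rock", "ripe banana", "hungry lion"]))  # -> ripe banana
```

The key feature is that the map stores signed values of any size, so strongly negative and strongly positive events are both represented; acting then only requires comparing predicted values of the available options.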

https://astralcodexten.substack.com/p/highlights-from-the-comments-on-motivated

I. Comments From People Who Actually Know What They’re Talking About

Gabriel writes:

The brain trains on magnitude and acts on sign.

That is to say, there are two different kinds of "module" that are relevant to this problem as you described, but they're not RL and other; they're both other. The learning parts are not precisely speaking reinforcement learning, at least not by the algorithm you described. They're learning the whole map of value, like a topographic map. Then the acting parts find themselves on the map and figure out which way leads upward toward better outcomes.

More precisely then: The brain learns to predict value and acts on the gradient of predicted value.

The learning parts are trying to find both opportunities and threats, but not unimportant mundane static facts. This is why, for example, people are very good at remembering and obsessing over intensely negative events that happened to them -- which they would not be able to do in the RL model the post describes! We're also OK at remembering intensely positive events that happened to us. But ordinary observations of no particular value mostly make no lasting impression. You could test this by a series of 3 experiments, in each of which you have a screen flash several random emoji on screen, and each time a specific emoji is shown to the subject, you either (A) penalize the subject such as with a shock, or (B) reward the subject such as with sweet liquid when they're thirsty, or (C) give the subject a stimulus that has no significant magnitude, whether positive or negative, such as changing the pitch of a quiet ongoing buzz that they were not told was relevant. I'd expect subjects in both conditions A and B to reliably identify the key emoji, whereas I'd expect quite a few subjects in condition C to miss it.

By learning associations with a degree of value, whether positive or negative, it's possible to then act on the gradient in pursuit of whatever available option has the highest value. This works reliably and means we can not only avoid hungry lions and seek nice ripe bananas, but we also do …
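To make the proposed experiment concrete, here is a rough simulation sketch, assuming, as the "trains on magnitude" framing suggests, that the strength of a learned stimulus–outcome association grows with the magnitude of the outcome rather than its sign. The outcome values, pairing count, and detection threshold are illustrative assumptions of mine, not parameters from the comment.

```python
def learned_association(outcome_value, n_pairings=20, rate=0.2):
    """Association strength after repeated pairings, weighted by |outcome|."""
    strength = 0.0
    for _ in range(n_pairings):
        strength += rate * abs(outcome_value) * (1.0 - strength)
    return strength

conditions = {
    "A: key emoji -> shock":              -1.0,   # strongly negative outcome
    "B: key emoji -> sweet liquid":       +1.0,   # strongly positive outcome
    "C: key emoji -> quiet pitch change": +0.02,  # near-zero magnitude
}

THRESHOLD = 0.5  # strength assumed necessary for a subject to report the key emoji
for label, value in conditions.items():
    strength = learned_association(value)
    verdict = "identified" if strength > THRESHOLD else "likely missed"
    print(f"{label}: strength={strength:.2f} -> {verdict}")
```

With these numbers the shock (A) and sweet-liquid (B) conditions both cross the reporting threshold while the near-zero pitch-change condition (C) does not, matching the prediction that subjects in A and B reliably identify the key emoji while many in C miss it.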
