cs.AI updates on arXiv.org
The challenge of hidden gifts in multi-agent reinforcement learning

This article examines the challenge posed by "hidden gifts" in multi-agent reinforcement learning (MARL). A hidden gift is a benefit an individual receives from another individual's action without being aware that the action was taken. The study designs a simple grid-world task in which each agent must unlock its own door to earn an individual reward, while the collective reward requires all agents to cooperate by passing along a single shared key. Experiments show that several reinforcement learning algorithms, including MARL algorithms, struggle to learn how to obtain the collective reward. Independent model-free policy-gradient agents can solve the task when given information about their own action history, but MARL agents still cannot. Finally, the authors derive a correction term that reduces the variance in learning and helps independent agents converge to collective success more reliably.

🔑 A "hidden gift" is a benefit an individual receives from another individual's action without being aware of it. This creates a credit-assignment problem in multi-agent reinforcement learning, because it is hard to determine who should be credited for these hidden benefits.

🚪 The study designs a simple grid-world task in which each agent must unlock its own door to earn an individual reward. If all agents unlock their doors, the whole group receives a larger collective reward. However, there is only one key that opens every door, so the collective reward can only be obtained when agents pass the key on to others after using it. Passing the key is a "hidden gift".
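The reward structure can be illustrated with a short sketch. This is a hypothetical illustration rather than the paper's implementation; the names (AgentState, step_rewards) and the reward values are assumptions made for clarity.

from dataclasses import dataclass

@dataclass
class AgentState:
    just_unlocked: bool   # unlocked its own door on this timestep
    door_unlocked: bool   # door has been unlocked at some point in the episode

def step_rewards(states, individual_reward=1.0, collective_reward=10.0):
    # Each agent earns its individual reward on the step it unlocks its own door.
    rewards = [individual_reward if s.just_unlocked else 0.0 for s in states]
    # The collective reward is paid only once every door is open, which requires
    # each agent to drop the single shared key after using it. Nothing in any
    # agent's observation reveals that another agent dropped the key -- that
    # drop is the "hidden gift".
    if all(s.door_unlocked for s in states):
        rewards = [r + collective_reward for r in rewards]
    return rewards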

🤖 Experiments show that several reinforcement learning algorithms, including MARL algorithms, fail to learn how to obtain the collective reward. However, independent model-free policy-gradient agents can solve the task when given information about their own action history. MARL agents still cannot solve the task even with action-history information.
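To give a sense of what an action-history augmentation might look like, the sketch below concatenates a one-hot window of the agent's own recent actions onto its observation vector. The function name, window size, and encoding are assumptions for illustration, not details taken from the paper.

import numpy as np

def augment_observation(obs, action_history, num_actions, window=8):
    # Keep only the most recent `window` actions taken by this agent.
    history = list(action_history)[-window:]
    one_hot = np.zeros((window, num_actions), dtype=np.float32)
    for i, a in enumerate(history):
        one_hot[i, a] = 1.0
    # Concatenate the flattened history onto the original observation vector.
    return np.concatenate([np.asarray(obs, dtype=np.float32).ravel(), one_hot.ravel()])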

💡 The authors derive a correction term, inspired by learning-aware methods, that reduces the variance in learning and helps independent agents converge to collective success more reliably. These results show that credit assignment in multi-agent settings can be particularly challenging in the presence of "hidden gifts", and demonstrate that learning awareness in independent agents can benefit these settings.
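The paper's derived correction term is not reproduced here. As a loose analogy for how a correction term can lower the variance of a policy-gradient estimate, the sketch below shows the standard baseline-subtraction trick in a REINFORCE-style loss; it illustrates the general variance-reduction mechanism only and is not the learning-aware correction the authors derive.

import torch

def reinforce_loss(log_probs, returns, baselines):
    # log_probs: log pi(a_t | s_t) for the actions taken (requires grad);
    # returns, baselines: 1-D tensors over the episode's timesteps.
    # Subtracting a baseline does not change the gradient's expectation,
    # but it reduces its variance, stabilizing learning.
    advantages = returns - baselines
    return -(log_probs * advantages.detach()).mean()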

arXiv:2505.20579v1 | Announce Type: cross

Abstract: Sometimes we benefit from actions that others have taken even when we are unaware that they took those actions. For example, if your neighbor chooses not to take a parking spot in front of your house when you are not there, you can benefit, even without being aware that they took this action. These "hidden gifts" represent an interesting challenge for multi-agent reinforcement learning (MARL), since assigning credit when the beneficial actions of others are hidden is non-trivial. Here, we study the impact of hidden gifts with a very simple MARL task. In this task, agents in a grid-world environment have individual doors to unlock in order to obtain individual rewards. As well, if all the agents unlock their door the group receives a larger collective reward. However, there is only one key for all of the doors, such that the collective reward can only be obtained when the agents drop the key for others after they use it. Notably, there is nothing to indicate to an agent that the other agents have dropped the key, thus the act of dropping the key for others is a "hidden gift". We show that several different state-of-the-art RL algorithms, including MARL algorithms, fail to learn how to obtain the collective reward in this simple task. Interestingly, we find that independent model-free policy gradient agents can solve the task when we provide them with information about their own action history, but MARL agents still cannot solve the task with action history. Finally, we derive a correction term for these independent agents, inspired by learning aware approaches, which reduces the variance in learning and helps them to converge to collective success more reliably. These results show that credit assignment in multi-agent settings can be particularly challenging in the presence of "hidden gifts", and demonstrate that learning awareness in independent agents can benefit these settings.

Tags: multi-agent reinforcement learning, hidden gifts, credit assignment, learning awareness