MarkTechPost@AI July 31, 2024
CMU Researchers Explore Expert Guidance and Strategic Deviations in Multi-Agent Imitation Learning

Researchers at Carnegie Mellon University have proposed an alternative objective for multi-agent imitation learning (MAIL), called the regret gap, which explicitly accounts for potential deviations by agents in the group. They studied the relationship between the value gap and the regret gap, showing that although the value gap can be minimized with single-agent imitation learning (IL) algorithms, doing so does not prevent the regret gap from becoming arbitrarily large. This finding indicates that achieving regret equivalence in MAIL is more challenging than achieving value equivalence. To address this, they developed two efficient reductions: MALICE, which performs no-regret online convex optimization under an expert coverage assumption, and BLADES, which does so with access to a queryable expert.

🤔 **The regret gap: a new objective for multi-agent imitation learning** Researchers at Carnegie Mellon University have proposed an objective called the "regret gap" as an alternative to the traditional value gap in multi-agent imitation learning (MAIL). The regret gap explicitly accounts for potential deviations by agents in the group, deviations that can keep a learner from faithfully reproducing the expert's coordination behavior. The researchers found that minimizing the value gap does not prevent the regret gap from becoming arbitrarily large, which indicates that achieving regret equivalence in MAIL is harder than achieving value equivalence.

💪 **Closing the regret gap: MALICE and BLADES** To address the regret gap, the researchers developed two efficient reductions. The first, MALICE, performs no-regret online convex optimization under an expert coverage assumption, meaning the expert demonstrations provide recommendations for every situation the learner might encounter. MALICE uses importance sampling over the expert's recommendations to re-weight the behavior cloning loss and thereby minimize the regret gap. The second, BLADES, performs no-regret online convex optimization with access to a queryable expert, i.e., a learner that can ask the expert what it would recommend in a given state. BLADES uses these queried recommendations to guide the learner's policy and minimize the regret gap.
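
For intuition on the queryable-expert access model, here is a minimal DAgger-style interaction loop; it is an illustration of the setting, not the authors' algorithm. The learner rolls out its own policy, asks the expert what it would recommend in the states actually reached, and retrains on the aggregated corrections. The `env`, `expert.query`, and `train_policy` helpers are hypothetical placeholders.

```python
# Sketch of a queryable-expert imitation loop (DAgger-style), illustrating the
# access model that BLADES-type reductions assume. All helpers are hypothetical.
def queryable_expert_imitation(env, expert, policy, n_rounds=10, horizon=100):
    dataset = []  # aggregated (state, expert-recommended joint action) pairs
    for _ in range(n_rounds):
        state = env.reset()
        for _ in range(horizon):
            action = policy.act(state)                     # roll out the current learner
            dataset.append((state, expert.query(state)))   # ask the expert for its recommendation here
            state, done = env.step(action)
            if done:
                break
        policy = train_policy(dataset)                     # supervised fit on all corrections so far
    return policy
```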

📚 **The value gap: a weaker but practical objective** Although the regret gap is the more desirable objective, the value gap remains a reasonable target when agents are non-strategic. The researchers found that standard single-agent imitation learning algorithms can minimize the value gap efficiently, making value equivalence relatively easy to achieve in MAIL. Two common single-agent IL algorithms, Behavior Cloning (BC) and Inverse Reinforcement Learning (IRL), are used to minimize the value gap. Run over joint policies in the multi-agent setting, they become Joint Behavior Cloning (J-BC) and Joint Inverse Reinforcement Learning (J-IRL), and these adaptations achieve the same value gap bounds as in the single-agent setting.

🚀 **Future directions: practical applications and algorithmic refinements** The researchers note that future work will focus on developing and implementing practical approximations of these idealized algorithms. They hope to apply the methods to real-world problems such as traffic flow control and robot collaboration, and they plan to further optimize the algorithms to make them more efficient and robust.

💡 **Conclusion: new progress in multi-agent imitation learning** By introducing the regret gap as a new objective, the Carnegie Mellon researchers offer a fresh perspective on multi-agent imitation learning (MAIL). Their MALICE and BLADES algorithms provide effective reductions for minimizing the regret gap. These results point the field in a new direction and lay the groundwork for applying MAIL methods to real-world problems.

Consider the problem of a mediator learning to coordinate a group of strategic agents via action recommendations, without knowing their underlying utility functions, for example routing drivers through a road network. The challenge lies in the difficulty of manually specifying the quality of these recommendations, making it necessary to provide the mediator with data on the desired coordination behavior. This turns the problem into one of multi-agent imitation learning (MAIL). A fundamental question in MAIL is identifying the right objective for the learner, which the researchers explore through the running example of personalized route recommendations for users.

Current research on multi-agent imitation learning spans several methodologies. Single-agent imitation learning techniques like behavioral cloning reduce imitation to supervised learning but suffer from covariate shift, leading to compounding errors. Interactive approaches like inverse reinforcement learning (IRL) allow learners to observe the consequences of their actions, preventing compounding errors, but they are sample-inefficient. A second line of work on multi-agent imitation learning has explored the concept of the regret gap, though it has not been fully utilized in Markov Games. A third approach, inverse game theory, focuses on recovering utility functions rather than learning coordination from demonstrations.

Researchers from Carnegie Mellon University have proposed an alternative objective for multi-agent imitation learning (MAIL) in Markov Games called the regret gap, which explicitly accounts for potential deviations by agents in the group. They investigated the relationship between the value and regret gaps, showing that while the value gap can be minimized using single-agent imitation learning (IL) algorithms, this does not prevent the regret gap from becoming arbitrarily large. This finding indicates that achieving regret equivalence is more challenging than achieving value equivalence in MAIL. To address this, they developed two efficient reductions to no-regret online convex optimization: (a) MALICE, under a coverage assumption on the expert, and (b) BLADES, with access to a queryable expert.
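
To make the contrast concrete, here is a schematic formulation; the notation is illustrative rather than the paper's exact definitions. The value gap compares agents' returns when everyone follows the mediator's recommendations, while the regret gap compares how much any agent could gain by deviating under the learner versus under the expert.

```latex
% Illustrative notation only (not verbatim from the paper):
%   \sigma_E = expert mediator policy, \sigma = learner mediator policy,
%   J_i      = expected return of agent i when all agents follow recommendations,
%   \Phi_i   = class of strategy deviations available to agent i.
\[
\text{Value gap:}\qquad
\mathcal{V}(\sigma) \;=\; \max_i \bigl( J_i(\sigma_E) - J_i(\sigma) \bigr)
\]
\[
\text{Regret of agent } i:\qquad
\mathcal{R}_i(\sigma) \;=\; \max_{\phi_i \in \Phi_i} J_i(\phi_i, \sigma) - J_i(\sigma)
\]
\[
\text{Regret gap:}\qquad
\mathcal{R}(\sigma) \;=\; \max_i \bigl( \mathcal{R}_i(\sigma) - \mathcal{R}_i(\sigma_E) \bigr)
\]
```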

Although the value gap is a 'weaker' objective, it can be a reasonable learning objective in real-world applications where agents are non-strategic. The natural multi-agent generalization of single-agent imitation learning algorithms can efficiently minimize the value gap, making it relatively easy to achieve in MAIL. Two such single-agent IL algorithms, Behavior Cloning (BC) and Inverse Reinforcement Learning (IRL), are used to minimize the value gap. Run over joint policies in the multi-agent setting, they become Joint Behavior Cloning (J-BC) and Joint Inverse Reinforcement Learning (J-IRL). These adaptations yield the same value gap bounds as in the single-agent setting.
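
As a rough sketch of the joint-policy adaptation, the snippet below treats the joint action profile of all agents as a single categorical label and fits it with ordinary supervised learning, exactly as in single-agent BC. The `JointPolicy` network and the training data are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of Joint Behavior Cloning (J-BC): the joint action of all
# agents is treated as one label, so the update is standard supervised learning.
import torch
import torch.nn as nn

class JointPolicy(nn.Module):
    def __init__(self, state_dim: int, n_agents: int, n_actions: int):
        super().__init__()
        # One categorical head over the joint action space of size n_actions ** n_agents.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions ** n_agents),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # logits over joint actions

def jbc_update(policy, optimizer, states, joint_actions):
    """One supervised step on expert demonstrations (states, joint-action indices)."""
    logits = policy(states)
    loss = nn.functional.cross_entropy(logits, joint_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```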

Multi-agent Aggregation of Losses to Imitate Cached Experts (MALICE) is an efficient algorithm that extends the ALICE algorithm to the multi-agent setting. ALICE is an interactive algorithm that uses importance sampling to re-weight the BC loss based on the density ratio between the current learner policy and that of the expert; it requires full demonstration coverage to ensure finite importance weights. ALICE uses a no-regret algorithm to learn a policy that minimizes the reweighted on-policy error, ensuring a linear-in-H bound on the value gap under a recoverability assumption. MALICE adapts these principles to the multi-agent setting, providing a robust way to minimize the regret gap.
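
A minimal sketch of the re-weighting idea, assuming per-state visitation densities estimated elsewhere: the per-example BC loss on expert data is scaled by the ratio of the learner's visitation density to the expert's, so states the current learner actually reaches are emphasized. The `d_learner` and `d_expert` inputs are hypothetical; the coverage assumption is what keeps the ratio finite.

```python
# Sketch of the importance-weighted behavior cloning loss behind ALICE-style
# algorithms (extended to the multi-agent setting by MALICE, per the article).
import torch
import torch.nn as nn

def reweighted_bc_loss(policy, states, expert_actions, d_learner, d_expert, eps=1e-8):
    """Importance-weighted cross-entropy on expert state-action pairs.

    weight_i = d_learner(s_i) / d_expert(s_i); finite under expert coverage.
    """
    logits = policy(states)
    per_example = nn.functional.cross_entropy(logits, expert_actions, reduction="none")
    weights = d_learner / (d_expert + eps)
    return (weights * per_example).mean()
```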

In conclusion, researchers from Carnegie Mellon University have introduced an alternative objective for MAIL in Markov Games called the regret gap. For strategic agents that are not mere puppets, an additional source of distribution shift arises from deviations by agents within the population. This shift cannot be efficiently controlled through environmental interaction alone, as in inverse RL; instead, it requires estimating the expert's actions in counterfactual states. Using this insight, the researchers derived two reductions that can minimize the regret gap under a coverage or queryable-expert assumption. Future work includes developing and implementing practical approximations of these idealized algorithms.


Check out the Paper. All credit for this research goes to the researchers of this project.

