MarkTechPost@AI, December 17, 2024
UBC Researchers Introduce ‘First-Explore’: A Two-Policy Learning Approach to Rescue Meta-Reinforcement Learning (RL) from Failed Explorations

 

This article introduces “First-Explore,” a method proposed by researchers at the University of British Columbia that tackles the exploration–exploitation dilemma in meta-reinforcement learning (Meta-RL) by learning two distinct policies. Traditional Meta-RL methods tend to get stuck in local optima during exploration; First-Explore separates exploration from exploitation, letting an explore policy gather information that an exploit policy then uses to maximize return, improving overall performance. The method achieves substantial gains over traditional Meta-RL across several RL environments and points the way for future research, though challenges remain, such as the inability to explore the future, the disregard of negative rewards, and long-sequence modeling.

🧭 The core of First-Explore is to decouple exploration from exploitation by learning two independent policies: an explore policy that gathers information and an exploit policy that maximizes return, combined to improve overall performance.

🧠 The explore policy is trained on the context of the current exploration sequence (previous actions, rewards, and observations), and its objective is to produce episodes that lead to subsequent high-return exploit-policy episodes. Exploration is therefore no longer discouraged by low immediate reward.

🏆 First-Explore shows clear advantages across several experimental environments: roughly twice the return of Meta-RL methods on the fixed-arm bandit problem, ten times on the Dark Treasure Rooms environment, and six times on the maze environment.

⚙️ First-Explore is implemented with a GPT-2-style causal Transformer architecture; the two policies share similar parameters and differ only in the final-layer head.

Reinforcement learning (RL) is now applied across nearly every pursuit of science and technology, either as a core methodology or as a way to optimize existing processes and systems. Despite this broad adoption, RL still lags in some fundamental skills. Sample inefficiency is one such problem: RL typically needs thousands of episodes to learn reasonably basic tasks, such as exploration, that humans master in just a few attempts (imagine a child only figuring out basic arithmetic in high school). Meta-RL circumvents this problem by equipping an agent with prior experience: the agent remembers events from previous episodes to adapt to new environments and achieve sample efficiency. Meta-RL also learns to explore and can acquire complex strategies far beyond the ability of standard RL, such as learning new skills or conducting experiments to learn about the current environment.

Having covered what memory-based Meta-RL does well, let us turn to what limits it. Traditional Meta-RL approaches aim to maximize the cumulative reward across all episodes in a given sequence, which requires finding an optimal balance between exploration and exploitation; in practice this means prioritizing exploration in early episodes so that later episodes can exploit what was learned. The problem is that even state-of-the-art methods get stuck in local optima while exploring, especially when an agent must sacrifice immediate reward in pursuit of higher reward later. In this article, we discuss a recent study that claims to remove this problem from Meta-RL.
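One informal way to see why this happens (the notation below is ours, not the paper’s): standard memory-based Meta-RL trains a single policy to maximize the expected cumulative return over an n-episode sequence, so any exploratory episode with a low return directly lowers the training objective, and gradient updates push the policy away from exploration that only pays off later.

```latex
% Standard Meta-RL objective (informal sketch): one policy \pi,
% cumulative return over an n-episode interaction sequence,
% conditioned on the growing context c_{i-1} of earlier episodes.
\max_{\pi}\; \mathbb{E}\!\left[\sum_{i=1}^{n} R_i\big(\pi \mid c_{i-1}\big)\right]
```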

Researchers at the University of British Columbia present “First-Explore, Then Exploit,” a Meta-RL approach that separates exploration from exploitation by learning two distinct policies. The explore policy first gathers information for the exploit policy, which maximizes per-episode return; neither policy attempts to maximize cumulative reward on its own, but the two are combined after training to do so. Because the explore policy is trained solely to inform the exploit policy, poor current exploitation no longer lets low immediate rewards discourage exploration. The explore policy performs successive episodes conditioned on the context of the current exploration sequence, which includes previous actions, rewards, and observations, and it is incentivized to produce episodes that, when added to that context, lead to subsequent high-return exploit-policy episodes. The exploit policy then takes the context produced by the explore policy and runs for n episodes to produce high-return episodes.
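To make the rollout structure concrete, here is a minimal Python sketch of the first-explore-then-exploit loop described above. All names (`run_episode`, `first_explore_then_exploit`, the toy bandit, and the random/greedy policies) are illustrative assumptions, not the authors’ code or API; the training of the two policies is assumed to happen elsewhere.

```python
import random

def run_episode(policy, env_step, context, horizon=10):
    """Roll out one episode; return its transitions and total reward."""
    episode, obs, total = [], 0, 0.0
    for _ in range(horizon):
        action = policy(context + episode, obs)   # policy conditions on in-context history
        obs, reward = env_step(action)
        episode.append((obs, action, reward))
        total += reward
    return episode, total

def first_explore_then_exploit(explore_policy, exploit_policy, env_step,
                               k_explore=3, n_exploit=5):
    """Explore for k episodes to build context, then exploit for n episodes.

    Only exploit episodes count toward the final return; the explore policy
    is (elsewhere) rewarded for producing context that raises those returns.
    """
    context = []
    for _ in range(k_explore):
        episode, _ = run_episode(explore_policy, env_step, context)
        context += episode                        # explore return itself is ignored
    exploit_return = 0.0
    for _ in range(n_exploit):
        episode, ret = run_episode(exploit_policy, env_step, context)
        context += episode
        exploit_return += ret
    return exploit_return

# Toy usage: a 2-armed bandit where arm 1 pays more, a random explorer,
# and a greedy exploiter that picks the arm with the best observed mean.
def bandit_step(action):
    return 0, random.gauss(0.2 if action == 0 else 1.0, 0.1)

def random_explorer(context, obs):
    return random.randint(0, 1)

def greedy_exploiter(context, obs):
    means = {}
    for _, a, r in context:
        means.setdefault(a, []).append(r)
    return max(means, key=lambda a: sum(means[a]) / len(means[a])) if means else 0

print(first_explore_then_exploit(random_explorer, greedy_exploiter, bandit_step))
```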

The official implementation of First-Explore uses a GPT-2-style causal transformer architecture. The two policies share similar parameters and differ only in the final-layer head.
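The article includes no code, but the shared-backbone, two-head idea can be sketched as below. This is an illustrative PyTorch stand-in (a `TransformerEncoder` with a causal mask in place of a true GPT-2 decoder), not the official implementation; all class names, dimensions, and the token encoding of transitions are assumptions.

```python
import torch
import torch.nn as nn

class TwoHeadPolicy(nn.Module):
    """One causal transformer backbone, two action heads (explore / exploit)."""

    def __init__(self, token_dim=32, d_model=128, n_layers=4, n_heads=4, n_actions=5):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)          # embed (obs, action, reward) tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.explore_head = nn.Linear(d_model, n_actions)   # the policies differ only here
        self.exploit_head = nn.Linear(d_model, n_actions)

    def forward(self, history, exploit: bool):
        # history: (batch, seq_len, token_dim) sequence of past transitions
        seq_len = history.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)  # causal attention
        h = self.backbone(self.embed(history), mask=mask)
        head = self.exploit_head if exploit else self.explore_head
        return head(h[:, -1])   # action logits from the latest context position

# Usage: logits = TwoHeadPolicy()(torch.randn(1, 20, 32), exploit=False)
```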

For the experiments, the authors evaluated First-Explore on three RL environments of varying difficulty: Bandits with One Fixed Arm, Dark Treasure Rooms, and Ray Maze. The first is a multi-armed bandit problem designed so that exploration requires forgoing immediate reward: the fixed arm offers a known payoff but no exploratory value. The second is a grid-world environment in which an agent that cannot see its surroundings searches for randomly positioned rewards. The final environment, the most challenging of the three, consists of randomly generated mazes with three reward positions and highlights First-Explore's learning capabilities beyond standard Meta-RL.
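A hypothetical sketch of the fixed-arm bandit setup described above, to show why exploration is costly there; the arm layout, reward values, and noise model are our own assumptions, not the paper’s specification.

```python
import random

class OneFixedArmBandit:
    """Arm 0 always pays a known, fixed amount; the other arms have unknown means.

    Sampling the fixed arm is safe but uninformative; sampling any other arm
    forgoes the known payoff but is the only way to discover a better arm.
    """

    def __init__(self, n_arms=10, fixed_reward=0.5, seed=None):
        rng = random.Random(seed)
        self.fixed_reward = fixed_reward
        self.means = [None] + [rng.uniform(0.0, 1.0) for _ in range(n_arms - 1)]

    def pull(self, arm):
        if arm == 0:
            return self.fixed_reward               # safe, known, no exploratory value
        return random.gauss(self.means[arm], 0.1)  # risky but informative

bandit = OneFixedArmBandit(seed=0)
print(bandit.pull(0), bandit.pull(3))
```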

First-Explore achieved twice the total reward of Meta-RL approaches on the Bandits with One Fixed Arm domain, roughly ten times on Dark Treasure Rooms, and six times on Ray Maze. Beyond Meta-RL approaches, First-Explore also substantially outperformed other RL methods whenever forgoing immediate reward was required.

Conclusion: First-Explore offers an effective solution to the immediate-reward problem that plagues traditional Meta-RL approaches. It splits exploration and exploitation into two independent policies that, combined after training, maximize cumulative reward in a way Meta-RL could not achieve regardless of training method. The approach still faces challenges that pave the way for future research, including the inability to explore the future, the disregard of negative rewards, and long-sequence modeling. It will be interesting to see how these problems are resolved and whether the solutions improve the efficiency of RL more broadly.


Check out the Paper. All credit for this research goes to the researchers of this project.



