MarkTechPost@AI 2024年12月26日
Meet ONI: A Distributed Architecture for Simultaneous Reinforcement Learning Policy and Intrinsic Reward Learning with LLM Feedback

ONI is a novel distributed architecture that leverages feedback from large language models (LLMs) to simultaneously learn a reinforcement learning (RL) policy and an intrinsic reward function. The method uses an asynchronous LLM server to annotate the experience collected by the agent, which is then distilled into an intrinsic reward model. ONI explores several reward-modeling algorithms and achieves superior performance in the NetHack Learning Environment without any pre-collected datasets. Key components of the architecture include the LLM server, asynchronous data transfer, a hash table, and code for dynamically learning the reward model. Experiments show that ONI significantly outperforms existing methods on sparse-reward tasks, laying the groundwork for more autonomous intrinsic reward methods.

💡 ONI adopts a distributed architecture in which an asynchronous LLM server generates intrinsic rewards from the RL agent's experience, enabling online learning of both the policy and the reward function without any pre-collected dataset.

⚙️ Built on the Sample Factory library and asynchronous proximal policy optimization (APPO), the system runs 480 concurrent environment instances and sustains roughly 32k environment interactions per second on a Tesla A100 GPU.

📊 Experiments show large performance gains on several sparse-reward tasks in the NetHack Learning Environment, with the "ONI-retrieval" and "ONI-classification" variants performing especially well in the reward-free setting.

🚀 By exploring multiple reward-modeling approaches, including hashing, classification, and ranking, ONI demonstrates its effectiveness on sparse-reward problems and opens a new research direction for reinforcement learning.

Reward functions play a crucial role in reinforcement learning (RL) systems, but their design presents significant challenges in balancing task definition simplicity with optimization effectiveness. The conventional approach of using binary rewards offers a straightforward task definition but creates optimization difficulties due to sparse learning signals. While intrinsic rewards have emerged as a solution to aid policy optimization, their crafting process requires extensive task-specific knowledge and expertise, placing substantial demands on human experts who must carefully balance multiple factors to create reward functions that accurately represent the desired task and enable efficient learning.
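The interplay described above can be sketched in a few lines. This is a minimal illustration of reward shaping, not the paper's method: a sparse binary task reward is combined with a scaled intrinsic bonus so the agent still receives learning signal on steps where the task reward is zero. The function name and scale are illustrative.

```python
# Illustrative sketch only: a sparse extrinsic reward (0 almost everywhere)
# is combined with a scaled intrinsic bonus to produce a denser signal.
def shaped_reward(extrinsic: float, intrinsic: float, scale: float = 0.1) -> float:
    """Training signal = sparse task reward + scaled intrinsic bonus."""
    return extrinsic + scale * intrinsic

# On a step with no task reward, the intrinsic term keeps learning alive;
# on a success step, the extrinsic reward dominates.
step_without_success = shaped_reward(extrinsic=0.0, intrinsic=0.4)
step_with_success = shaped_reward(extrinsic=1.0, intrinsic=0.0)
```

The scale factor is the usual knob for keeping the intrinsic term from overwhelming the task reward.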

Recent approaches have utilized Large Language Models (LLMs) to automate reward design based on natural language task descriptions, following two main methodologies. The first approach focuses on generating reward function code through LLMs, which has shown success in continuous control tasks. However, this method requires access to environment source code or detailed parameter descriptions and struggles with high-dimensional state representations. The second approach generates reward values directly through LLMs, exemplified by methods like Motif, which ranks observation captions using LLM preferences. However, Motif requires a pre-existing dataset of captioned observations and involves a time-consuming three-stage process.

Researchers from Meta, the University of Texas at Austin, and UCLA have proposed ONI, a novel distributed architecture that simultaneously learns RL policies and intrinsic reward functions using LLM feedback. The method uses an asynchronous LLM server to annotate the agent’s collected experiences, which are then transformed into an intrinsic reward model. The approach explores several algorithmic choices for reward modeling, including hashing, classification, and ranking models, to investigate their effectiveness in addressing sparse reward problems. This unified methodology achieves superior performance on challenging sparse reward tasks in the NetHack Learning Environment, operating solely on the agent’s gathered experience without requiring external datasets.
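The overall loop can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: a stub function stands in for the asynchronous LLM server, and the reward model shown is in the spirit of the hashing variant, which simply remembers the LLM's label for each caption it has seen.

```python
# Toy sketch of an ONI-style loop (illustrative; llm_annotate is a stub
# standing in for the remote LLM server, and the keyword rule is invented).
def llm_annotate(caption: str) -> int:
    """Stub for the LLM server: 1 if the event looks task-relevant, else 0."""
    return int(any(k in caption for k in ("key", "door", "gold")))

intrinsic_model: dict[str, int] = {}   # hashing-style model: caption -> label

def intrinsic_reward(caption: str) -> float:
    # Annotate unseen captions (done asynchronously in the real system),
    # then look the label up in the hash table.
    if caption not in intrinsic_model:
        intrinsic_model[caption] = llm_annotate(caption)
    return float(intrinsic_model[caption])

rollout = ["You see a door.", "You hit the wall.", "You pick up gold."]
bonuses = [intrinsic_reward(c) for c in rollout]   # [1.0, 0.0, 1.0]
```

The classification and ranking variants replace the lookup table with a learned model, which is what lets them generalize to captions the LLM has never labeled.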

ONI is built upon the Sample Factory library and its asynchronous variant of proximal policy optimization (APPO). The system operates 480 concurrent environment instances on a Tesla A100-80GB GPU with 48 CPUs, achieving approximately 32k environment interactions per second. The architecture incorporates four crucial components: an LLM server on a separate node, an asynchronous process that transmits observation captions to the LLM server via HTTP requests, a hash table that stores captions and their LLM annotations, and code for dynamically learning the reward model. This asynchronous design maintains 80-95% of the original system throughput, processing 30k environment interactions per second without reward model training and 26k interactions per second when training a classification-based reward model.
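The hash table's role can be sketched as a cache. This is a hypothetical illustration, assuming the table's purpose is to deduplicate annotation requests: NetHack captions repeat heavily across 480 environments, so each unique caption needs to hit the LLM server only once. The `annotate` callable stands in for the HTTP request to the LLM node; the class and its names are invented for this sketch.

```python
# Hypothetical caption cache: repeated captions are served from the hash
# table, so the (expensive, remote) LLM server is queried once per unique one.
class CaptionCache:
    def __init__(self, annotate):
        self._annotate = annotate   # stands in for the HTTP call to the LLM node
        self._table = {}            # caption -> annotation
        self.server_calls = 0       # how many times we actually hit the "server"

    def get(self, caption: str):
        if caption not in self._table:
            self.server_calls += 1
            self._table[caption] = self._annotate(caption)
        return self._table[caption]

cache = CaptionCache(annotate=lambda c: len(c) % 2)   # dummy annotation rule
for caption in ["You see a door.", "It is locked.", "You see a door.",
                "You see a door.", "It is locked."]:
    cache.get(caption)
# 5 lookups, but only 2 calls reach the annotator
```

Serving repeats from local memory is what lets the design keep 80-95% of the original throughput despite the remote annotation step.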

The experimental results demonstrate significant performance improvements across multiple tasks in the NetHack Learning Environment. While the extrinsic-reward agent performs adequately on the dense Score task, it fails on sparse reward tasks. ‘ONI-classification’ matches or approaches the performance of existing methods like Motif across most tasks, achieving this without pre-collected data or additional dense reward functions. Among the ONI variants, ‘ONI-retrieval’ shows strong performance, while ‘ONI-classification’ consistently improves through its ability to generalize to unseen messages. Moreover, ‘ONI-ranking’ achieves the highest experience levels, while ‘ONI-classification’ leads on the other performance metrics in reward-free settings.

In this paper, the researchers introduced ONI, a distributed system that advances RL by learning intrinsic rewards and agent behaviors simultaneously and online. It shows state-of-the-art performance across challenging sparse reward tasks in the NetHack Learning Environment while eliminating the need for pre-collected datasets or auxiliary dense reward functions that were previously essential. This work establishes a foundation for more autonomous intrinsic reward methods that learn exclusively from agent experience, operate independently of external datasets, and integrate effectively with high-performance reinforcement learning systems.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


