MarkTechPost@AI  July 8, 2024
Policy Learning with Large World Models: Advancing Multi-Task Reinforcement Learning Efficiency and Performance

This research introduces PWM (Policy learning with large World Models), a model-based reinforcement learning (MBRL) algorithm that uses large world models for policy learning and trains policies efficiently in multi-task settings via first-order gradient optimization. Building on pretrained large world models, PWM can handle complex tasks with up to 152 action dimensions. Experiments show that PWM achieves higher rewards than competing methods in multi-task settings, trains faster, and is more robust.

😊 **PWM algorithm overview:** PWM is a model-based reinforcement learning (MBRL) algorithm that uses large world models for policy learning and trains policies efficiently in multi-task settings through first-order gradient optimization. Building on pretrained large world models, it handles complex tasks with up to 152 action dimensions. PWM emphasizes smooth, stable gradients over long horizons rather than prediction accuracy alone. With efficient first-order optimization, it attains higher rewards than other methods in multi-task settings, trains faster, and is more robust.

🤖 **Role of world models:** World models are a core component of MBRL: they simulate the environment and supply the information needed for policy learning. PWM uses large world models pretrained on offline data, which capture complex environment dynamics effectively. These large world models let PWM handle complex tasks in multi-task settings and optimize policies with first-order gradients, yielding an efficient training process.

🏆 **Experimental results:** Experiments show that PWM earns higher rewards than competing methods in multi-task settings, trains faster, and is more robust. PWM was evaluated on complex control tasks including Hopper, Ant, Anymal, Humanoid, and a muscle-actuated Humanoid, and compared against SHAC (which uses ground-truth models) and TD-MPC2 (a model-based method that plans actively at inference time). PWM outperformed both SHAC and TD-MPC2 in reward and in the smoothness of the optimization landscape. Further tests on 30- and 80-task multi-task environments showed that PWM beats TD-MPC2 in reward while also being faster at inference. Ablation studies showed that PWM is robust to stiff contact models and more sample-efficient than TD-MPC2, especially when a better world model is trained.

🤔 **Outlook:** Although PWM shows strong performance in multi-task reinforcement learning, it has limitations. It requires a large amount of pre-existing offline data to train the world model, which restricts its use in data-scarce scenarios, and it must be retrained for each new task, which makes rapid adaptation challenging. Future work could improve world-model training methods and extend PWM to image-based environments and real-world applications.

Reinforcement Learning (RL) excels at tackling individual tasks but struggles with multitasking, especially across different robotic forms. World models, which simulate environments, offer scalable solutions but often rely on inefficient, high-variance optimization methods. While large models trained on vast datasets have advanced generalizability in robotics, they typically need near-expert data and fail to adapt across diverse morphologies. RL can learn from suboptimal data, making it promising for multitask settings. However, methods like zeroth-order planning in world models face scalability issues and become less effective as model size increases, particularly in massive models like GAIA-1 and UniSim.

Researchers from Georgia Tech and UC San Diego have introduced Policy learning with large World Models (PWM), an innovative model-based reinforcement learning (MBRL) algorithm. PWM pretrains world models on offline data and uses them for first-order gradient policy learning, enabling it to solve tasks with up to 152 action dimensions. This approach outperforms existing methods by achieving up to 27% higher rewards without costly online planning. PWM emphasizes the utility of smooth, stable gradients over long horizons rather than mere accuracy. It demonstrates that efficient first-order optimization leads to better policies and faster training than traditional zeroth-order methods.
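
For intuition, the following is a minimal PyTorch-style sketch of first-order policy learning through a frozen, pretrained world model. The module names (`WorldModel`, `policy`), dimensions, horizon, and training loop are illustrative assumptions, not the paper's actual architecture or code.

```python
import torch
import torch.nn as nn

# Hypothetical differentiable world model: predicts next latent state and reward.
# In a PWM-style setup this would be pretrained on offline data and then frozen.
class WorldModel(nn.Module):
    def __init__(self, state_dim=32, action_dim=8):
        super().__init__()
        self.dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 256),
                                      nn.ELU(), nn.Linear(256, state_dim))
        self.reward = nn.Sequential(nn.Linear(state_dim + action_dim, 256),
                                    nn.ELU(), nn.Linear(256, 1))

    def forward(self, s, a):
        sa = torch.cat([s, a], dim=-1)
        return self.dynamics(sa), self.reward(sa).squeeze(-1)

policy = nn.Sequential(nn.Linear(32, 256), nn.ELU(), nn.Linear(256, 8), nn.Tanh())
value = nn.Sequential(nn.Linear(32, 256), nn.ELU(), nn.Linear(256, 1))  # terminal critic
world_model = WorldModel()
for p in world_model.parameters():          # world model stays frozen
    p.requires_grad_(False)

opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
gamma, horizon = 0.99, 16

for step in range(1000):
    s = torch.randn(64, 32)                 # batch of start states (e.g. from a replay buffer)
    total, discount = 0.0, 1.0
    for t in range(horizon):                # imagined rollout inside the learned model
        a = policy(s)
        s, r = world_model(s, a)            # gradients flow through dynamics and reward
        total = total + discount * r
        discount *= gamma
    total = total + discount * value(s).squeeze(-1)   # bootstrap with a terminal value
    loss = -total.mean()                    # first-order (pathwise) policy objective
    opt.zero_grad()
    loss.backward()                         # backprop through the whole imagined trajectory
    opt.step()
```

In practice the critic would be trained alongside the policy and start states would come from real transitions; the point of the sketch is that the return is differentiated directly through the learned dynamics rather than estimated with high-variance zeroth-order methods.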

RL splits into model-based and model-free approaches. Model-free methods like PPO and SAC dominate real-world applications and employ actor-critic architectures. SAC uses First-order Gradients (FoG) for policy learning, offering low variance but facing issues with objective discontinuities. Conversely, PPO relies on zeroth-order gradients, which are robust to discontinuities but prone to high variance and slower optimization. Recently, the focus in robotics has shifted to large multi-task models trained via behavior cloning, such as RT-1 and RT-2 for object manipulation. However, the potential of large models in RL remains largely unexplored. MBRL methods like DreamerV3 and TD-MPC2 leverage large world models, but their scalability remains a concern, particularly as models grow to the size of GAIA-1 and UniSim.
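
To make the distinction concrete, the two gradient estimators discussed above can be written in their standard textbook forms (our notation, not equations taken from the paper); $R(\tau)$ is the trajectory return and $f_\phi$ the (learned or true) dynamics:

```latex
% Zeroth-order (score-function) estimator, used by PPO-style methods;
% robust to discontinuities, but high variance:
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]

% First-order (pathwise) estimator, used by SAC-style FoG updates;
% low variance, but requires a differentiable path through dynamics and reward:
\nabla_\theta J(\theta)
  = \mathbb{E}\!\Big[\nabla_\theta \sum_{t=0}^{T} \gamma^{t}\, r\big(s_t, \pi_\theta(s_t)\big)\Big],
  \qquad s_{t+1} = f_\phi\big(s_t, \pi_\theta(s_t)\big)
```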

The study focuses on discrete-time, infinite-horizon RL scenarios represented by a Markov Decision Process (MDP) involving states, actions, dynamics, and rewards. RL aims to maximize cumulative discounted rewards through a policy. Commonly, this is tackled using actor-critic architectures, which approximate state values and optimize policies. In MBRL, additional components such as learned dynamics and reward models, often called world models, are used. These models can encode true states into latent representations. Leveraging these world models, PWM efficiently optimizes policies using FoG, reducing variance and improving sample efficiency even in complex environments.
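
As a compact reference, the RL objective described here and the kind of model-based surrogate a PWM-style method optimizes can be sketched as follows (our notation, consistent with the description above, not copied from the paper); $f_\phi$ and $r_\phi$ denote the learned dynamics and reward models and $V_\psi$ a learned critic:

```latex
% RL objective over the MDP (states, actions, dynamics, rewards), discount \gamma:
J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]

% Model-based surrogate: roll the policy out for H steps inside the learned
% world model and bootstrap with a terminal value estimate:
\hat{J}(\theta) = \mathbb{E}\Big[\sum_{t=0}^{H-1} \gamma^{t}\, r_\phi(s_t, a_t)
                  + \gamma^{H}\, V_\psi(s_H)\Big],
  \quad a_t = \pi_\theta(s_t),\; s_{t+1} = f_\phi(s_t, a_t)
```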

In evaluating the proposed method, complex control tasks were tackled using the flex simulator, focusing on environments like Hopper, Ant, Anymal, Humanoid, and a muscle-actuated Humanoid. Comparisons were made against SHAC, which uses ground-truth models, and TD-MPC2, a model-based method that actively plans at inference time. Results showed that PWM achieved higher rewards and smoother optimization landscapes than SHAC and TD-MPC2. Further tests on 30 and 80 multi-task environments revealed PWM’s superior reward performance and faster inference time compared to TD-MPC2. Ablation studies highlighted PWM’s robustness to stiff contact models and its higher sample efficiency, especially with better-trained world models.

The study introduced PWM, an MBRL approach that uses large multi-task world models as differentiable physics simulators and leverages first-order gradients for efficient policy training. The evaluations highlighted PWM’s ability to outperform existing methods, including approaches with access to ground-truth simulation models as well as TD-MPC2. Despite its strengths, PWM relies heavily on extensive pre-existing data for world-model training, limiting its applicability in low-data scenarios. Additionally, while PWM offers efficient policy training, it requires re-training for each new task, posing challenges for rapid adaptation. Future research could explore enhancements in world-model training and extend PWM to image-based environments and real-world applications.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Tags

Reinforcement Learning, Multi-Task Learning, World Models, Policy Learning, First-Order Gradient Optimization