MarkTechPost@AI June 18
AREAL: Accelerating Large Reasoning Model Training with Fully Asynchronous Reinforcement Learning

 

AREAL is a fully asynchronous reinforcement learning system designed to improve the training efficiency of large language models (LLMs), particularly on reasoning tasks. Unlike traditional synchronous systems, AREAL separates generation from training so that generation and model updates run in parallel. In this way, AREAL reduces GPU idle time and increases throughput. Experiments show that AREAL trains 2–3 times faster than existing methods on math and coding tasks while maintaining comparable accuracy, marking progress in large-scale RL training of LLMs.

💡 **Core problem:** Traditional synchronous RL systems suffer from low GPU utilization when training large reasoning models, because each batch must wait for its slowest output to finish. AREAL is designed to remove this efficiency bottleneck.

⚙️ **System architecture:** AREAL adopts a fully asynchronous design that decouples generation from training. The system consists of four main components: rollout workers that generate outputs, a reward service that evaluates responses, trainer workers that perform PPO updates, and a controller that coordinates the data flow.

🚀 **Key optimizations:** AREAL introduces staleness-aware training and a modified PPO algorithm to handle stale data and keep learning stable. System-level optimizations, such as pipelined CPU-GPU operations and dynamic sequence packing, further improve training speed and GPU efficiency.

📈 **Experimental results:** On math and coding tasks, AREAL trains 2–3 times faster than existing methods such as DeepScaleR and DeepCoder while maintaining comparable accuracy. AREAL scales efficiently across multiple GPUs and handles long context lengths (up to 32k tokens).

🌟 **Summary:** Through its asynchronous design and key optimizations, AREAL significantly improves the training efficiency of large reasoning models, opening new possibilities for large-scale RL with language models.

Introduction: The Need for Efficient RL in LRMs

Reinforcement learning (RL) is increasingly used to enhance LLMs, especially for reasoning tasks. These models, known as Large Reasoning Models (LRMs), generate intermediate “thinking” steps before providing final answers, thereby improving performance on complex problems such as math and coding. However, training LRMs with RL at scale is challenging due to the need for massive parallelization and efficient system design. Current systems often rely on synchronous batch processing, where generation must wait for the longest output in a batch to finish, leading to GPU underutilization. Even newer methods still face bottlenecks, as they use outdated rollouts but remain batch-based.

Background: Reinforcement Learning’s Impact on LLM Reasoning Abilities

Reinforcement learning (RL) has become a widely used strategy to enhance the reasoning abilities of LLMs, particularly for tasks with clearly defined reward signals, such as math, coding, scientific reasoning, and tool use. These models typically improve by extending their chain-of-thought reasoning during training. Open-source efforts have shown that distilled, smaller models can also perform well on such tasks. Asynchronous RL methods, proven effective in gaming, have recently been explored for LLMs, but primarily in short-context or limited-overlap scenarios. Prior work has also investigated strategies, such as partial rollouts, to improve efficiency while maintaining training stability.

System Overview: Introducing AReaL

Researchers from IIIS, Tsinghua University, Ant Research, and HKUST introduce AReaL, a fully asynchronous reinforcement learning system designed to train large reasoning models more efficiently. Unlike traditional synchronous systems, AReaL separates the generation and training processes; rollout workers continuously produce outputs, while training workers update the model in parallel as new data arrives. This design enhances GPU usage and accelerates training. To handle data staleness, AReaL utilizes a tailored version of PPO and incorporates optimizations such as dynamic batching and parallel reward services. On math and code tasks, AReaL achieves up to 2.77× faster training while maintaining or improving final model performance. 
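
The workflow described above is essentially a producer-consumer pipeline: rollout workers keep generating while the trainer updates the model on whatever data has already arrived. The minimal sketch below illustrates that pattern with Python threads and a queue; ToyPolicy, the reward stand-in, and all other names are illustrative assumptions for this article, not the AReaL API.

```python
# Minimal, self-contained sketch of the asynchronous rollout/training split
# (illustrative only; not the AReaL API). Rollout workers keep generating and
# the trainer consumes whatever has arrived, so neither side waits for the
# longest sequence in a batch.
import queue
import random
import threading
import time


class ToyPolicy:
    """Stand-in for the LLM policy; `version` counts weight updates."""

    def __init__(self):
        self.version = 0

    def generate(self, prompt: str) -> str:
        time.sleep(random.uniform(0.01, 0.1))  # simulate variable-length decoding
        return f"answer to {prompt}"

    def ppo_update(self, batch):
        # A real trainer would run a (decoupled) PPO step on `batch` here.
        self.version += 1


rollouts = queue.Queue(maxsize=64)  # buffers finished rollouts for the trainer
policy = ToyPolicy()


def rollout_worker(prompts):
    """Continuously generate responses, score them, and enqueue the results."""
    for p in prompts:
        out = policy.generate(p)
        reward = float(len(out) % 2)  # stand-in for the reward service
        rollouts.put({"prompt": p, "response": out,
                      "reward": reward, "version": policy.version})


def trainer_worker(num_updates=5, batch_size=4):
    """Consume rollouts as they arrive and update the policy in parallel."""
    for _ in range(num_updates):
        batch = [rollouts.get() for _ in range(batch_size)]
        policy.ppo_update(batch)  # runs while generation keeps going


gen = threading.Thread(target=rollout_worker, args=([f"q{i}" for i in range(20)],))
train = threading.Thread(target=trainer_worker)
gen.start(); train.start()
gen.join(); train.join()
print("final policy version:", policy.version)
```

Because generation and updates overlap, samples in a batch may have been produced by older policy versions; that is exactly the staleness problem the tailored PPO variant addresses.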

Technical Architecture: Key Components and Optimizations

AREAL is designed to decouple generation and training across separate GPU clusters, improving scalability, hardware efficiency, and flexibility for reinforcement learning with large models. The system includes four main components: rollout workers that support interruptible generation and model updates, a reward service that evaluates responses, trainer workers that perform PPO updates, and a controller that coordinates the data flow. To address challenges such as data staleness and inconsistent policy versions, AREAL employs staleness-aware training and a decoupled PPO objective. Additionally, system-level optimizations such as pipelined CPU-GPU operations, non-blocking asynchronous requests, and dynamic sequence packing enhance training speed and GPU efficiency. 
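
As a rough illustration of how staleness-aware filtering and a decoupled PPO-style objective can fit together, here is a hedged PyTorch sketch. The max_staleness rule, the tensor names, and the exact form of the loss are assumptions made for this example; the objective in the AReaL paper may differ in detail.

```python
import torch


def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, advantages,
                       versions, current_version, eps=0.2, max_staleness=4):
    """One way to combine staleness filtering with a decoupled PPO-style ratio.

    logp_new / logp_prox / logp_behav: log-probs of the sampled actions under
    the current policy, a recent "proximal" policy, and the behavior policy
    that actually generated the data. `versions` records which policy version
    produced each sample.
    """
    # Staleness-aware filtering: drop samples generated too many versions ago.
    fresh = (current_version - versions) <= max_staleness
    if fresh.sum() == 0:
        return torch.zeros((), requires_grad=True)

    # Decoupled ratios: clip against the proximal policy, then reweight by how
    # far the behavior policy is from the proximal one (importance correction).
    ratio = torch.exp(logp_new - logp_prox)
    behav_weight = torch.exp(logp_prox - logp_behav).detach()

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    per_sample = behav_weight * torch.minimum(unclipped, clipped)
    return -per_sample[fresh].mean()


# Toy call with made-up numbers, just to show the expected shapes.
n = 6
loss = decoupled_ppo_loss(
    logp_new=torch.randn(n, requires_grad=True), logp_prox=torch.randn(n),
    logp_behav=torch.randn(n), advantages=torch.randn(n),
    versions=torch.tensor([10, 9, 9, 8, 5, 4]), current_version=10)
print(loss)
```

The key design point is that clipping is done against a stable proximal policy rather than directly against whichever stale behavior policy produced each sample, which keeps updates well-behaved even when data arrives out of order.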

Experimental Results: Scaling and Performance

AREAL was tested on math and coding tasks using distilled Qwen2 models of various sizes. It achieved 2–3 times faster training than prior methods, such as DeepScaleR and DeepCoder, while maintaining comparable accuracy. The system scales efficiently across GPUs and handles long context lengths (up to 32k tokens), outperforming synchronous methods. Key design features such as interruptible generation and dynamic microbatching boost training speed and hardware utilization. Algorithmically, AREAL’s decoupled PPO objective allows stable learning even with stale data, unlike standard PPO. Overall, AREAL balances speed and performance effectively, making it well-suited for large-scale RL training of language models.
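
Dynamic microbatching and sequence packing group variable-length rollouts into microbatches under a token budget so that GPU time is not wasted on padding. The helper below is an illustrative greedy packer written for this article, not AREAL's implementation.

```python
def pack_sequences(lengths, max_tokens_per_microbatch=8192):
    """Greedy first-fit-decreasing packing of variable-length sequences.

    Returns lists of sequence indices; each list stays under the token budget,
    so microbatches have similar total length and little padding.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, loads = [], []
    for i in order:
        for b, load in enumerate(loads):
            if load + lengths[i] <= max_tokens_per_microbatch:
                bins[b].append(i)
                loads[b] += lengths[i]
                break
        else:  # no existing microbatch has room; open a new one
            bins.append([i])
            loads.append(lengths[i])
    return bins


# Example: long and short rollouts mixed into balanced microbatches.
print(pack_sequences([7000, 1200, 3000, 5000, 800]))
# -> [[0, 4], [3, 2], [1]]
```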

Conclusion: Advancing Large-Scale RL for Language Models

In conclusion, AREAL is an asynchronous reinforcement learning system designed to enhance the efficiency of training LLMs, particularly for tasks such as coding and mathematical reasoning. Unlike traditional synchronous methods that wait for all outputs before updating, AREAL allows generation and training to run in parallel. This decoupling reduces GPU idle time and boosts throughput. To ensure learning remains stable, AREAL introduces staleness-aware strategies and a modified PPO algorithm that effectively handles older training data. Experiments show that it delivers up to 2.77 times faster training than synchronous systems, without sacrificing accuracy, marking a step forward in scaling up RL for large models. 


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post AREAL: Accelerating Large Reasoning Model Training with Fully Asynchronous Reinforcement Learning appeared first on MarkTechPost.
