AI News, April 25
RAGEN: AI framework tackles LLM agent instability

Researchers have introduced RAGEN, an AI framework designed to address the instability that LLM agents exhibit when handling complex situations. Through the StarPO method, the framework optimises the agent's decision-making across multi-turn interactions. RAGEN is a modular system focused on training and evaluating the reasoning capabilities of LLM agents under reinforcement learning. The researchers tested LLMs in three minimalist symbolic gaming environments, Bandit, Sokoban, and Frozen Lake, to analyse how agents learn decision-making policies through interaction. The results show that multi-turn reinforcement-learning training suffers from an "Echo Trap" problem, which must be addressed through stabilisation strategies and better rollout quality, and that more fine-grained reward mechanisms are needed to cultivate genuine reasoning.

🦺**RAGEN framework and the StarPO method**: RAGEN is a modular system that uses StarPO (State-Thinking-Actions-Reward Policy Optimisation) to tackle the decision instability of LLM agents in multi-turn interactions. It focuses on training and evaluating the reasoning capabilities of LLM agents under reinforcement learning and optimises the entire interaction sequence.

🎲**Testing in minimalist environments**: The researchers tested LLMs in three minimalist symbolic gaming environments: Bandit (a single-turn stochastic task testing risk-sensitive reasoning), Sokoban (a multi-turn deterministic puzzle requiring foresight and planning), and Frozen Lake (a multi-turn stochastic grid-navigation task requiring planning under uncertainty). This made it possible to cleanly analyse how agents learn decision-making policies through interaction.

⚠️**The "Echo Trap" and stabilisation strategies**: The study found that multi-turn reinforcement-learning training suffers from an "Echo Trap": agents initially improve, then collapse, overfitting to locally rewarded reasoning patterns. To address this, the team developed StarPO-S, which improves stability through techniques such as variance-based trajectory filtering, the introduction of a critic model, and decoupled clipping with KL-divergence removal.

🔄**The importance of rollout quality**: The quality of rollouts (the simulated interaction trajectories used for training) has a significant effect on learning. Key factors include task diversity (varied initial states with several responses generated per prompt to promote generalisation), interaction granularity (allowing multiple actions per turn to enable better planning), and rollout frequency (using recent rollouts that reflect the agent's current policy).

🤔**Reasoning and reward design**: The study shows that simply prompting a model to "think" does not guarantee that meaningful reasoning emerges, especially in multi-turn tasks. Standard trajectory-level rewards (typically sparse and outcome-based) are insufficient. Future work should explore rewards that explicitly evaluate the quality of intermediate reasoning steps, such as format-based penalties or rewards for explanation quality.

Researchers have introduced RAGEN, an AI framework designed to counter LLM agent instability when handling complex situations.

Training these AI agents presents significant hurdles, particularly when decisions span multiple steps and involve unpredictable feedback from the environment. While reinforcement learning (RL) has shown promise in static tasks like solving maths problems or generating code, its application to dynamic, multi-turn agent training has been less explored.   

Addressing this gap, a collaborative team from institutions including Northwestern University, Stanford University, Microsoft, and New York University has proposed StarPO (State-Thinking-Actions-Reward Policy Optimisation).

StarPO offers a generalised approach for training agents at the trajectory level (i.e. it optimises the entire sequence of interactions, not just individual actions).
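To make "trajectory level" concrete, the sketch below shows a minimal REINFORCE-style loss computed over one whole interaction sequence at once. It is illustrative only, not the authors' implementation; the discount factor and the normalisation step are assumptions.

```python
import torch

def trajectory_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE-style loss over one whole trajectory (a sketch).

    log_probs: list of scalar tensors, the log-probability of each action taken
    rewards:   list of floats, the per-turn rewards from the environment
    The gradient of this loss raises the likelihood of high-return
    trajectories as a whole, rather than of isolated actions.
    """
    # Discounted return-to-go for each step of the trajectory.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalising across the trajectory reduces gradient variance.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs) * returns).sum()
```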

Accompanying this is RAGEN, a modular system built to implement StarPO. This enables the training and evaluation of LLM agents, particularly focusing on their reasoning capabilities under RL. RAGEN provides the necessary infrastructure for rollouts, reward assignment, and optimisation within multi-turn, stochastic (randomly determined) environments.
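The rollout infrastructure can be pictured as a simple interaction loop. The following is a schematic sketch of how one multi-turn rollout might be collected; `agent.act` and `env.step` are hypothetical placeholders, not RAGEN's actual API.

```python
def collect_rollout(agent, env, max_turns=10):
    """Collect one multi-turn interaction trajectory (a 'rollout').

    agent and env are placeholder objects: agent.act(observation) returns
    an action plus its log-probability, and env.step(action) returns the
    next observation, a reward, and a done flag.
    """
    obs = env.reset()
    trajectory = []
    for _ in range(max_turns):
        action, log_prob = agent.act(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, log_prob, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```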

Minimalist environments, maximum insight

To isolate the core learning challenges from confounding factors like extensive pre-existing knowledge or task-specific engineering, the researchers tested LLMs using RAGEN in three deliberately minimalistic, controllable symbolic gaming environments:   

- Bandit: A single-turn, stochastic task testing risk-sensitive symbolic reasoning. The agent chooses between options (like ‘Phoenix’ or ‘Dragon’ arms) with different, initially unknown, reward profiles.
- Sokoban: A multi-turn, deterministic puzzle requiring foresight and planning, as actions (pushing boxes) are irreversible.
- Frozen Lake: A multi-turn, stochastic grid navigation task where movement attempts can randomly fail, demanding planning under uncertainty.

These environments allow for clear analysis of how agents learn decision-making policies purely through interaction.   
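For a flavour of how small these environments are, here is a hypothetical single-turn bandit in the spirit of the paper's setup. The arm names match the article, but the reward profiles are invented for illustration.

```python
import random

class TwoArmedBandit:
    """Minimal single-turn stochastic environment in the spirit of the
    Bandit task. Reward means and spreads are illustrative, not the
    values used in the study."""

    ARMS = {"Phoenix": (1.0, 2.0), "Dragon": (0.0, 5.0)}  # (mean, std)

    def reset(self):
        return "Choose an arm: Phoenix or Dragon."

    def step(self, action):
        mean, std = self.ARMS[action]
        reward = random.gauss(mean, std)
        return None, reward, True  # single turn: episode ends immediately
```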

Key findings: Stability, rollouts, and reasoning

The study yielded three significant findings concerning the training of self-evolving LLM agents:

The ‘Echo Trap’ and the need for stability

A recurring problem observed during multi-turn RL training was dubbed the “Echo Trap”. Agents would initially improve but then suffer performance collapse, overfitting to locally rewarded reasoning patterns. 

This was marked by collapsing reward variance, falling entropy (a measure of randomness/exploration), and sudden spikes in gradients (indicating training instability). Early signs included drops in reward standard deviation and output entropy.   
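These early-warning signals are cheap to monitor during training. The sketch below is a minimal illustration, not code from the paper, of computing them from a training batch; the input formats are assumptions.

```python
import math

def collapse_signals(batch_rewards, token_dists):
    """Early-warning statistics for the 'Echo Trap' (illustrative).

    batch_rewards: list of floats, one return per trajectory in the batch;
                   a shrinking standard deviation often precedes collapse
    token_dists:   list of probability distributions (lists of floats) from
                   the policy; falling entropy signals shrinking exploration
    """
    mean = sum(batch_rewards) / len(batch_rewards)
    reward_std = math.sqrt(
        sum((r - mean) ** 2 for r in batch_rewards) / len(batch_rewards)
    )
    avg_entropy = -sum(
        p * math.log(p) for dist in token_dists for p in dist if p > 0
    ) / len(token_dists)
    return reward_std, avg_entropy
```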

To combat this, the team developed StarPO-S, a stabilised version of the framework. StarPO-S incorporates:

- Variance-based trajectory filtering: focusing training on rollouts where the agent's behaviour shows greater uncertainty, and discarding low-variance trajectories (sketched below).
- Critic incorporation: introducing a critic model to provide a value baseline rather than relying purely on outcome rewards.
- Decoupled clipping and KL removal: adjusting the policy update by decoupling the clipping thresholds and removing the KL-divergence penalty.

StarPO-S consistently delayed collapse and improved final task performance compared to vanilla StarPO.   
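As one example of these techniques, uncertainty-based filtering might look like the sketch below: keep only the prompts whose sampled rollouts disagree most in reward. The data layout and the retention fraction are assumptions for illustration, not the paper's settings.

```python
def filter_by_variance(groups, keep_fraction=0.25):
    """Keep the prompt groups whose sampled rollouts disagree the most.

    groups: list of (prompt, rewards) pairs, where rewards holds the scalar
    return of each rollout sampled for that prompt. Low-variance groups
    carry little learning signal, so dropping them helps delay collapse.
    """
    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    # Rank groups from most to least uncertain, then keep the top slice.
    ranked = sorted(groups, key=lambda g: variance(g[1]), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]
```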

Rollout quality is crucial

The characteristics of the ‘rollouts’ (simulated interaction trajectories used for training) significantly impact learning. Key factors identified include:

- Task diversity: a varied pool of initial states, with several responses generated per prompt, promotes generalisation.
- Interaction granularity: allowing multiple actions per turn enables better planning.
- Rollout freshness: training on recent rollouts that reflect the agent's current policy keeps the learning signal aligned with its behaviour.

Maintaining freshness, alongside appropriate action budgets and task diversity, is key for stable training.   
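In practice, these factors reduce to a handful of sampling knobs. The values in the sketch below are assumptions for demonstration, not the paper's settings.

```python
from dataclasses import dataclass

@dataclass
class RolloutConfig:
    """Illustrative knobs corresponding to the three factors above."""
    num_initial_states: int = 64    # task diversity: varied starting states
    samples_per_prompt: int = 8     # several responses per prompt
    actions_per_turn: int = 3       # granularity: multi-action turns
    refresh_every_updates: int = 1  # freshness: resample from current policy
```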

Reasoning requires careful reward design

Simply prompting models to ‘think’ doesn’t guarantee that meaningful reasoning emerges, especially in multi-turn tasks. This suggests that standard trajectory-level rewards (often sparse and outcome-based) are insufficient. As the researchers note:

“Without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerge[s] through multi-turn RL.”

The researchers propose that future work should explore rewards that explicitly evaluate the quality of intermediate reasoning steps, perhaps using format-based penalties or rewarding explanation quality, rather than just final outcomes.   
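One way such a format-based penalty could be wired in is sketched below. The `<think>` tag convention and the penalty and bonus magnitudes are assumptions for illustration; the crude length check merely stands in for the finer-grained quality measures the authors call for.

```python
import re

def shaped_reward(response, outcome_reward, penalty=0.5, bonus=0.1):
    """Augment a sparse outcome reward with a format-based reasoning check.

    Assumes (for illustration) that the agent is asked to wrap its
    reasoning in <think>...</think> tags before acting.
    """
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if match is None:
        return outcome_reward - penalty  # no explicit reasoning at all
    # Reward non-trivial explanations; word count is a crude quality proxy.
    words = len(match.group(1).split())
    return outcome_reward + (bonus if words >= 10 else 0.0)
```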

RAGEN and StarPO: A step towards self-evolving AI

The RAGEN system and StarPO framework represent a step towards training LLM agents that can reason and adapt through interaction in complex, unpredictable environments.

This research highlights the unique stability challenges posed by multi-turn RL and offers concrete strategies – like StarPO-S’s filtering and stabilisation techniques – to mitigate them. It also underscores the critical role of rollout generation strategies and the need for more sophisticated reward mechanisms to cultivate genuine reasoning, rather than superficial strategies or hallucinations.

While acknowledging limitations – including the need to test on larger models and optimise for domains without easily verifiable rewards – the work opens “a scalable and principled path for building AI systems” in areas demanding complex interaction and verifiable outcomes, such as theorem proving, software engineering, and scientific discovery.

