MarkTechPost@AI, March 15
Optimizing Test-Time Compute for LLMs: A Meta-Reinforcement Learning Approach with Cumulative Regret Minimization

This article describes a method for enhancing the reasoning ability of large language models (LLMs) by optimizing test-time compute. The approach casts the optimization problem as a meta reinforcement learning (meta RL) challenge and introduces cumulative regret as the key metric. Conventional RL with outcome rewards performs poorly at minimizing regret, especially on novel queries. To address this, the researchers propose Meta Reinforcement Fine-Tuning (MRT), which incorporates a dense reward bonus that encourages stepwise improvement. Experiments show that MRT substantially improves test-time compute efficiency, achieving 2-3x better performance and 1.5x greater token efficiency on mathematical reasoning compared to outcome-reward RL.

🎯 Research challenge: Current approaches to improving LLM reasoning rely mainly on fine-tuning with search traces or on RL with binary outcome rewards, but these methods may not fully exploit test-time compute.

💡 The MRT approach: The researchers propose Meta Reinforcement Fine-Tuning (MRT), which incorporates a dense reward bonus that quantifies progress, balancing exploration and exploitation and ensuring steady movement toward a correct answer. The goal is to maximize the LLM's performance within a given test-time token budget.

📈 Experimental results: On mathematical benchmarks, MRT clearly outperforms existing methods, improving both accuracy and token efficiency. The study also shows that optimizing for progress minimizes computational regret while improving solution discovery without sacrificing accuracy.

Enhancing the reasoning abilities of LLMs by optimizing test-time compute is a critical research challenge. Current approaches rely primarily on fine-tuning models with search traces or on RL with binary outcome rewards, but these methods may not exploit test-time compute efficiently. Recent research suggests that increasing test-time compute can improve reasoning by generating longer solution traces and incorporating structured steps such as reflection, planning, and algorithmic search. Key open questions remain: whether LLMs allocate computational resources effectively according to task complexity, and whether they can discover solutions to harder problems when given a larger test-time compute budget. Addressing these questions is crucial for improving efficiency and generalization in LLM reasoning.

Recent advancements in scaling test-time compute have explored training separate verifiers for selection-based methods like best-of-N or beam search, which can sometimes be more effective than increasing data or model size. However, fine-tuning on unfamiliar search traces may lead to memorization rather than genuine reasoning improvements. RL-based approaches have demonstrated promise in generating chain-of-thought reasoning, enabling models to introspect, plan, and refine their outputs. However, increasing reasoning length does not always correlate with higher accuracy, as models may generate unnecessarily long sequences without meaningful progress. To address this, recent efforts have incorporated structured reward mechanisms and length penalties to encourage efficient reasoning, ensuring that models focus on producing informative, concise solutions rather than excessive computation.
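To make the selection-based scaling concrete, here is a minimal sketch of best-of-N sampling with a separately trained verifier, in the spirit of the methods mentioned above. The `generate` and `score` callables are hypothetical placeholders standing in for a sampler and a verifier model; they are not interfaces from the paper.

```python
from typing import Callable, List, Tuple

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: samples one solution trace
    score: Callable[[str, str], float],   # hypothetical: verifier score for (prompt, trace)
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n candidate solutions and return the one the verifier scores highest.

    Test-time compute grows linearly with n; accuracy gains depend entirely on how
    well the verifier's scores correlate with actual correctness.
    """
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scored = [(trace, score(prompt, trace)) for trace in candidates]
    return max(scored, key=lambda pair: pair[1])
```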

Researchers from Carnegie Mellon University & Hugging Face investigate optimizing test-time compute for LLMs by refining how models allocate computational resources during reasoning. Instead of relying solely on outcome-reward RL, they introduce a fine-tuning approach that balances exploration and exploitation, ensuring steady progress toward correct answers. Their method incorporates a dense reward bonus to quantify progress, improving efficiency. Evaluations on mathematical benchmarks demonstrate that this approach significantly outperforms existing methods, enhancing both accuracy and token efficiency. Their findings also suggest that optimizing for progress minimizes computational regret while improving solution discovery without sacrificing accuracy.
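One way to read the dense reward bonus is as a shaping term added to the usual 0/1 outcome reward: each reasoning episode is credited with the change in estimated success probability it produces. The sketch below illustrates that reading only; `rollout_success_rate`, the Monte Carlo estimator, and the `alpha` weighting are assumptions for illustration, not the paper's exact procedure.

```python
from typing import Callable

def rollout_success_rate(
    prefix: str,
    sample_answer: Callable[[str], str],   # hypothetical: samples a final answer given the prefix
    is_correct: Callable[[str], bool],     # hypothetical: checks an answer against ground truth
    num_rollouts: int = 4,
) -> float:
    """Monte Carlo estimate of how likely the model is to answer correctly from this prefix."""
    hits = sum(is_correct(sample_answer(prefix)) for _ in range(num_rollouts))
    return hits / num_rollouts

def shaped_episode_reward(
    prefix: str,
    episode: str,
    outcome_reward: float,                 # 0/1 reward if the episode ends in a final answer
    sample_answer: Callable[[str], str],
    is_correct: Callable[[str], bool],
    alpha: float = 0.5,                    # assumed weighting between outcome and progress terms
) -> float:
    """Outcome reward plus a dense bonus for the progress the episode makes."""
    before = rollout_success_rate(prefix, sample_answer, is_correct)
    after = rollout_success_rate(prefix + episode, sample_answer, is_correct)
    progress = after - before
    return outcome_reward + alpha * progress
```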

The problem of optimizing test-time compute is framed as a meta reinforcement learning (meta RL) challenge. The goal is to maximize an LLM’s performance within a given test-time token budget by balancing exploration and exploitation. Instead of solely optimizing for outcomes, the proposed Meta Reinforcement Fine-Tuning (MRT) approach minimizes cumulative regret by rewarding progress across sequential episodes. This budget-agnostic strategy allows LLMs to make steady progress regardless of training constraints. By incorporating a reward bonus based on incremental improvements, MRT ensures efficient test-time compute usage, enhancing adaptability and response accuracy within deployment constraints.
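In symbols, and paraphrasing the framing above rather than the paper's exact notation, the cumulative regret over a budget of k episodes and the progress bonus can be written as follows, where J_j(π) denotes the expected success rate achievable after the first j episodes of the output stream and α is an assumed trade-off weight:

```latex
% Cumulative regret against an oracle comparator \pi^{*} over a k-episode budget:
\Delta_k(\pi) \;=\; \sum_{j=0}^{k-1} \Big[ J\big(\pi^{*}\big) - J_j(\pi) \Big]

% MRT's dense bonus credits episode z_j with the improvement it yields,
% added to the sparse outcome reward:
r_{\text{prog}}(z_j) \;=\; J_j(\pi) - J_{j-1}(\pi),
\qquad
r_{\text{MRT}}(z_j) \;=\; r_{\text{outcome}}(z_j) + \alpha \, r_{\text{prog}}(z_j)
```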

The evaluation focuses on whether MRT achieves high accuracy while using test-time compute efficiently. The authors present key findings, compare MRT's efficiency with prior methods, and run ablations over token budget and progress. MRT consistently outperforms baseline models and outcome-reward RL (GRPO), achieving state-of-the-art results in its size category. It also improves out-of-distribution robustness and delivers larger relative gains with weaker base models. Furthermore, MRT significantly improves token efficiency, requiring fewer tokens to reach comparable accuracy. Additional experiments highlight its effectiveness in backtracking search and linearized evaluations.

In conclusion, the study reframes optimizing test-time compute as a meta reinforcement learning (meta RL) problem, introducing cumulative regret as a key metric. State-of-the-art outcome-reward RL models fail to minimize regret and often struggle with novel queries within a token budget. This limitation arises from training solely on outcome rewards, which lack the granularity to guide stepwise progress. To address this, MRT incorporates a dense reward bonus that encourages incremental improvement. MRT enhances test-time compute efficiency, achieving 2-3x better performance and 1.5x greater token efficiency in mathematical reasoning compared to outcome-reward RL, though several open questions remain.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
