LessWrong · May 30
Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims

This post analyzes 7 popular LLM reinforcement-learning (RL) papers that claim to improve models' "reasoning abilities". It finds that the reported gains may not be real, but rather artifacts of various issues in the evaluation setups: for example, pre-RL baseline numbers are severely underreported, making the RL-trained models look better than they actually are. The analysis stresses the importance of correct evaluation and scientific reporting, and asks whether, in a fast-moving RL field, we are really making progress if the results cannot be trusted.

🧐 Many LLM RL papers claim that reinforcement learning improves model reasoning, but these improvements may be misleading.

🤔 In these papers, the pre-RL baseline numbers are often underreported, inflating the apparent gains from RL.

📉 In some cases, the post-RL models actually perform worse than their original pre-RL baselines.

🛠️ The improvements may come merely from fixing prompting and generation hyperparameters, not from the RL algorithms themselves.

⚠️ The analysis underscores the importance of correct evaluation and scientific reporting in LLM RL research to ensure trustworthy results.

Published on May 29, 2025 6:40 PM GMT

There has been a flurry of recent papers proposing new RL methods that claim to improve the “reasoning abilities” of language models. The most recent ones, which show improvements with random or even no external rewards, have led to surprise, excitement, and confusion.

We analyzed 7 popular LLM RL papers (100+ to 3000+ likes, 50k+ to 500k+ views on X), including “Spurious Rewards”, “RL from 1 example”, and 3 papers exploring “Intrinsic Confidence Rewards”. We found that in most of these papers, the improvements could be a mirage caused by accidental issues in the evaluation setups (discussed below). In particular, the baseline numbers of the pre-RL models are massively underreported compared to the official numbers in the Qwen releases, or to other standardized evaluations (for example, in the Sober Reasoning paper). In several cases, the post-RL model performance was actually worse than the (correctly evaluated) pre-RL baseline it starts from. This means that the elicitation these works achieve with RL could also be replicated without any weight updates or finetuning. By this we do not mean non-trivial elicitation of some latent capabilities, just what can be achieved by fixing prompting and generation hyperparameters: using correct formats and better ways to parse answers from responses, using the recommended sampling temperatures, and using few-shot prompting to improve format-following.
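To make that last point concrete, below is a minimal sketch of the kind of evaluation fixes we mean, assuming a Hugging Face transformers setup. The model name, sampling values, and helper functions (`extract_boxed_answer`, `generate_answer`) are illustrative assumptions, not the setup of any specific paper; the point is that answer parsing and sampling settings alone can move a baseline score.

```python
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Math-7B"  # hypothetical pre-RL baseline model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)


def extract_boxed_answer(text: str) -> str | None:
    """Parse the final \\boxed{...} answer instead of exact-matching raw output.

    A lenient parser like this is one of the "fixes" that can raise a
    baseline score on its own, with no training at all.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def generate_answer(question: str) -> str:
    # Explicit format instructions (or few-shot examples) in the prompt
    # improve format-following, which otherwise reads as a capability gap.
    prompt = (
        "Solve the problem. Put your final answer in \\boxed{}.\n\n"
        f"Problem: {question}\nSolution:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.7,  # use the model card's recommended sampling values,
        top_p=0.8,        # not an arbitrary default (values here are assumed)
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


answer = extract_boxed_answer(generate_answer("What is 12 * 13?"))
print(answer)  # compare against the gold answer with a tolerant matcher
```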

Overall, these papers made us wonder whether recent LLM RLVR results carry any signal, and we find that their claims could themselves be noise due to underreported baselines. The proposed methods might have promise, and our goal is not to detract from their potential, but rather to emphasise the importance of correct evaluations and scientific reporting. We understand the pressure to publish RL results quickly given how fast the community seems to be moving. But if the claims cannot be trusted, are we really moving forward?


