MarkTechPost@AI · July 27, 05:51
REST: A Stress-Testing Framework for Evaluating Multi-Problem Reasoning in Large Reasoning Models

Evaluation of today's Large Reasoning Models (LRMs) relies largely on single-question testing, which is no longer sufficient. This article introduces REST (Reasoning Evaluation through Simultaneous Testing), a framework that bundles multiple problems into a single prompt and tests them simultaneously, aiming to more faithfully reflect LRMs' multi-context reasoning abilities. REST exposes how models perform in practical scenarios that demand priority allocation, resistance to cross-problem interference, and cognitive-load management. The study finds that even top models suffer significant accuracy drops under multi-problem pressure, and that existing training methods do not necessarily guarantee robustness. By raising cognitive load to simulate real-world demands, REST points to new directions for model development; for example, the "long2short" training approach is shown to improve performance under stress.

📊 **Limitations of current evaluation methods**: Mainstream single-question testing can no longer effectively distinguish performance differences among Large Reasoning Models (LRMs), nor can it simulate real-world multitasking scenarios such as educational tutoring or technical support, where a model must handle multiple, potentially interfering questions at once, testing its true cognitive-load management and robustness.

🚀 **The REST framework**: REST (Reasoning Evaluation through Simultaneous Testing) is a novel multi-problem simultaneous stress-testing framework. By bundling multiple questions into a single prompt, it comprehensively evaluates how LRMs handle complex, multi-context reasoning tasks, including priority allocation, resistance to cross-problem interference, and dynamic cognitive-load management.

📉 **Significant performance degradation under stress**: The study shows that even the best-performing models suffer significant accuracy drops under REST's simultaneous multi-problem testing. On challenging benchmarks such as AIME24, for example, some models lose nearly 30% accuracy, overturning the prior assumption that models can effortlessly handle multiple tasks.

💡 **Sharper model discrimination and training guidance**: REST dramatically amplifies the gaps between models with near-identical scores, giving developers finer-grained performance insights. The study also finds that training methods such as "long2short", which encourage concise reasoning chains, improve performance under multi-problem stress, pointing the way for future model optimization.

🧐 **Simulating real-world cognitive load**: By presenting multiple problems simultaneously, REST effectively raises the cognitive load on LRMs, simulating complex real-world scenarios in which AI systems must dynamically adjust priorities, avoid overthinking a single problem, and resist interference from concurrent tasks, yielding more practically meaningful evaluation results.

Large Reasoning Models (LRMs) have rapidly advanced, exhibiting impressive performance in complex problem-solving tasks across domains like mathematics, coding, and scientific reasoning. However, current evaluation approaches primarily focus on single-question testing, which reveals significant limitations. This article introduces REST (Reasoning Evaluation through Simultaneous Testing) — a novel multi-problem stress-testing framework designed to push LRMs beyond isolated problem-solving and better reflect their real-world multi-context reasoning capabilities.

Why Current Evaluation Benchmarks Fall Short for Large Reasoning Models

Most current benchmarks, such as GSM8K and MATH, evaluate LRMs by asking one question at a time. While effective for initial model development, this isolated question approach faces two critical drawbacks:

1. Decreasing Discriminative Power: Many state-of-the-art LRMs now achieve near-perfect scores on popular benchmarks (e.g., DeepSeek-R1 reaching 97% accuracy on MATH500). These saturated results make it increasingly difficult to distinguish true model improvements, forcing the expensive, continuous creation of harder datasets to differentiate capabilities.

2. Lack of Real-World Multi-Context Evaluation: Real-world applications, like educational tutoring, technical support, or multitasking AI assistants, require reasoning across multiple, potentially interfering questions simultaneously. Single-question testing does not capture these dynamic, multi-problem challenges that reflect true cognitive load and reasoning robustness.

Introducing REST: Stress-Testing LRMs with Multiple Problems at Once

To address these challenges, researchers from Tsinghua University, OpenDataLab, Shanghai AI Laboratory, and Renmin University developed REST, a simple yet powerful evaluation method that simultaneously tests LRMs on multiple questions bundled into a single prompt.
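
The core mechanic is easy to sketch. Below is a minimal, hypothetical Python illustration of how a REST-style prompt might be assembled from several benchmark questions; the `build_rest_prompt` function name and the prompt wording are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of REST-style prompt construction (illustrative only).
# The instruction wording below is an assumption, not the paper's prompt.

def build_rest_prompt(questions: list[str]) -> str:
    """Bundle several benchmark questions into one prompt,
    asking the model to answer each one in order."""
    header = (
        "Solve the following problems. Reason through each one, "
        "then state the final answer for every problem.\n\n"
    )
    body = "\n\n".join(
        f"Problem {i + 1}: {q}" for i, q in enumerate(questions)
    )
    return header + body

# Example: a "stress level" of three problems in a single prompt.
questions = [
    "What is 17 * 24?",
    "If x + 5 = 12, what is x?",
    "How many primes are there below 20?",
]
print(build_rest_prompt(questions))
```

Because each bundled question comes from an existing benchmark, REST can reuse established datasets rather than requiring new, harder ones; the stress comes from the bundling itself.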

REST Reveals Key Insights About LRM Reasoning Abilities

The REST evaluation uncovers several groundbreaking findings:

1. Significant Performance Degradation Under Multi-Problem Stress

Even state-of-the-art LRMs like DeepSeek-R1 show notable accuracy drops when handling multiple questions together. For example, DeepSeek-R1’s accuracy on challenging benchmarks like AIME24 falls by nearly 30% under REST compared to isolated question testing. This contradicts prior assumptions that large language models are inherently capable of effortlessly multitasking across problems.

2. Enhanced Discriminative Power Among Similar Models

REST dramatically amplifies the differences between models whose single-question scores are nearly identical, as seen on MATH500.

Similarly, among same-sized models like AReaL-boba-RL-7B and OpenThinker2-7B, REST captures significant differences in multi-problem handling abilities that single-question evaluations mask.

3. Post-Training Methods May Not Guarantee Robust Multi-Problem Reasoning

Models fine-tuned with reinforcement learning or supervised tuning on single-problem reasoning often fail to preserve their advantages in REST’s multi-question setting. This calls for rethinking training strategies to optimize reasoning robustness under realistic multi-context scenarios.

4. “Long2Short” Training Enhances Performance Under Stress

Models trained with “long2short” techniques — which encourage concise and efficient reasoning chains — maintain higher accuracy under REST. This suggests a promising avenue for designing models better suited to simultaneous multi-problem reasoning.
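
The article does not detail the long2short recipe, but the underlying idea, rewarding correct answers that use shorter reasoning chains, can be sketched as a length-penalized reward. Everything below (the function, the linear penalty form, and the coefficients) is a hypothetical illustration, not the method from the paper.

```python
# Hypothetical sketch of a "long2short"-style reward: correctness
# dominates, but shorter reasoning chains score higher. The penalty
# form and the 0.5 / 2048 constants are illustrative assumptions.

def long2short_reward(correct: bool, num_tokens: int,
                      budget: int = 2048, alpha: float = 0.5) -> float:
    """Reward = 1 for a correct answer, scaled down linearly as the
    reasoning chain approaches the token budget."""
    if not correct:
        return 0.0
    # Linear length penalty, clipped so reward stays in [1 - alpha, 1].
    length_ratio = max(0.0, min(1.0, num_tokens / budget))
    return 1.0 - alpha * length_ratio

# A correct 512-token solution outscores a correct 2048-token one.
print(long2short_reward(True, 512))   # 0.875
print(long2short_reward(True, 2048))  # 0.5
print(long2short_reward(False, 100))  # 0.0
```

Intuitively, a model that spends fewer tokens per problem leaves more of its context and compute for the remaining problems in the bundle, which is consistent with the observed gains under REST.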

How REST Simulates Realistic Reasoning Challenges

By increasing the cognitive load on LRMs through simultaneous problem presentation, REST simulates real-world demands where reasoning systems must dynamically prioritize, avoid overthinking one problem, and resist interference from concurrent tasks.
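
Scoring such a run requires mapping each extracted answer back to its question. A minimal sketch follows, assuming the model was instructed to end each solution with "Answer k: <value>"; that output format and the regex are illustrative assumptions, not the paper's protocol.

```python
import re

# Illustrative scorer for a multi-problem response (format assumed).

def score_rest_response(response: str, gold: list[str]) -> float:
    """Extract one answer per problem and return accuracy over the batch."""
    answers = {}
    for match in re.finditer(r"Answer\s+(\d+):\s*(.+)", response):
        idx, value = int(match.group(1)), match.group(2).strip()
        answers[idx] = value
    correct = sum(
        1 for i, g in enumerate(gold, start=1)
        if answers.get(i, "").rstrip(".") == g
    )
    return correct / len(gold)

response = "Answer 1: 408\nAnswer 2: 7\nAnswer 3: 8"
print(score_rest_response(response, ["408", "7", "8"]))  # 1.0
```

A missing "Answer k" line counts as a miss, so the metric naturally penalizes a model that gets absorbed in one problem and never reaches the others.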

REST also systematically analyzes error types, surfacing common failure modes that are largely invisible in single-question assessments.

Practical Evaluation Setup and Benchmark Coverage

Conclusion: REST as a Future-Proof, Realistic LRM Evaluation Paradigm

REST constitutes a significant leap forward in evaluating large reasoning models: it restores discriminative power to saturated benchmarks and measures reasoning robustness under realistic multi-problem load. In sum, REST paves the way for more reliable, robust, and application-relevant benchmarking of next-generation reasoning AI systems.


Check out the Paper, Project Page and Code. All credit for this research goes to the researchers of this project.
