MarkTechPost@AI · February 14
Can 1B LLM Surpass 405B LLM? Optimizing Computation for Small LLMs to Outperform Larger Models

This study examines how test-time scaling (TTS) improves the performance of large language models (LLMs), focusing on how the policy model, the process reward model (PRM), and problem difficulty shape TTS. Experiments on MATH-500 and AIME24 show that the compute-optimal TTS strategy depends on these factors, enabling smaller models (e.g., 1B, 3B, 7B) to outperform far larger ones (e.g., 405B, GPT-4o, DeepSeek-R1) at higher efficiency. The study stresses the importance of reward-aware TTS for optimal scaling and demonstrates that strategic test-time computation substantially strengthens LLM reasoning across architectures and task difficulties. It also finds that a 1B model can surpass a 405B model under optimized TTS, and points to improved supervision methods as a direction for future work.

💡Test-time scaling (TTS) is a key technique for boosting large language model (LLM) performance by spending additional compute during inference. TTS comes in two forms: internal TTS, which encourages step-by-step reasoning through extended chain-of-thought (CoT), and external TTS, which improves a fixed model through sampling or search-based methods.

🏆The research shows that the compute-optimal TTS strategy depends on the policy model, the process reward model (PRM), and problem difficulty. Smaller models (e.g., 1B, 3B, 7B) can beat much larger ones (e.g., 405B, GPT-4o, DeepSeek-R1) while using compute more efficiently. Because rewards strongly affect TTS performance, the authors propose a reward-aware strategy that integrates rewards into compute allocation.

🎯Experiments show that small models using compute-optimal TTS outperform large models on MATH-500 and AIME24. Compared with majority voting, TTS is up to 256× more efficient, and it improves reasoning by 154.6% over CoT. TTS also beats several long-CoT-based methods, demonstrating its effectiveness in strengthening LLM reasoning.

🧠The study highlights a shift toward "weak-to-strong" supervision: a 7B PRM can effectively supervise a 72B policy model. Future work should focus on improved supervision methods to further enhance reasoning. While the results are based on mathematical tasks, extending TTS to coding and chemistry remains to be explored.

Test-Time Scaling (TTS) is a crucial technique for enhancing the performance of LLMs by leveraging additional computational resources during inference. Despite its potential, there has been little systematic analysis of how policy models, Process Reward Models (PRMs), and problem complexity influence TTS, limiting its practical application. TTS can be categorized into Internal TTS, which encourages step-by-step reasoning through extended Chain-of-Thought (CoT) processes, and External TTS, which enhances performance using sampling or search-based methods with fixed models. The key challenge in External TTS lies in allocating computation optimally across different tasks. Current methods employ PRMs to guide answer selection and scale test-time computation efficiently. However, a comprehensive evaluation of how these factors jointly affect TTS strategies has been missing, restricting the community's understanding of optimal test-time computation scaling for LLMs.
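
To make the external-TTS idea concrete, here is a minimal Python sketch of best-of-N sampling with a PRM acting as verifier. The callables `sample_solution` and `score_steps` are hypothetical stand-ins for a frozen policy model and a process reward model, not APIs from the paper, and min-aggregation of step scores is just one common choice.

```python
# Minimal sketch of external TTS via best-of-N sampling with a PRM verifier.
# `sample_solution` and `score_steps` are hypothetical stand-ins for a frozen
# policy model and a Process Reward Model; they are not the paper's actual APIs.
from typing import Callable, List

def best_of_n(
    prompt: str,
    sample_solution: Callable[[str], List[str]],           # one solution as a list of CoT steps
    score_steps: Callable[[str, List[str]], List[float]],  # PRM: per-step scores in [0, 1]
    n: int = 16,
) -> List[str]:
    """Sample n candidate solutions and keep the one the PRM scores highest."""
    best, best_score = [], float("-inf")
    for _ in range(n):
        steps = sample_solution(prompt)
        step_scores = score_steps(prompt, steps)
        # Aggregate step-level rewards; min() penalizes any single weak step,
        # a common choice when a PRM is used as a verifier.
        candidate_score = min(step_scores) if step_scores else float("-inf")
        if candidate_score > best_score:
            best, best_score = steps, candidate_score
    return best
```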

Prior research has explored multiple strategies to enhance LLM performance, including majority voting, search-based approaches, and self-refinement techniques. Test-time methods such as CoT prompting, self-verification, and external tool integration have proven effective at improving reasoning without modifying model parameters. PRMs, which outperform Outcome Reward Models (ORMs), significantly refine LLM-generated outputs. Recent advancements in PRMs focus on efficient data collection methods, implicit rewards, and advanced ranking techniques to improve mathematical reasoning. Tools like ProcessBench and PRMBench have been developed to facilitate benchmarking and evaluate PRM effectiveness. The evolution of PRMs and TTS strategies underscores the need for systematic research to optimize inference-time computation and enhance LLM capabilities across diverse tasks.
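
For readers unfamiliar with the ORM/PRM distinction mentioned above, the sketch below contrasts their interfaces and shows typical ways to collapse step-level PRM rewards into one solution score. The type aliases and the aggregation choices are illustrative assumptions, not the paper's specific formulation.

```python
# A sketch of the interface difference between an Outcome Reward Model (ORM)
# and a Process Reward Model (PRM), plus common step-reward aggregations.
# The callables are assumed stand-ins for trained reward models, not a real API.
from typing import Callable, List

# ORM: one scalar for the finished solution.
OrmFn = Callable[[str, str], float]              # (question, final_answer) -> reward
# PRM: one scalar per intermediate reasoning step.
PrmFn = Callable[[str, List[str]], List[float]]  # (question, steps) -> per-step rewards

def aggregate_prm(step_rewards: List[float], how: str = "min") -> float:
    """Common ways to turn per-step rewards into a solution-level score."""
    if not step_rewards:
        return float("-inf")
    if how == "min":      # penalize the weakest step
        return min(step_rewards)
    if how == "last":     # trust the final step's score
        return step_rewards[-1]
    if how == "prod":     # product of step scores
        out = 1.0
        for r in step_rewards:
            out *= r
        return out
    raise ValueError(f"unknown aggregation: {how}")
```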

Researchers from Shanghai AI Laboratory, Tsinghua University, Harbin Institute of Technology, and BUPT investigate the impact of policy models, PRMs, and problem complexity on TTS through extensive experiments on MATH-500 and AIME24 tasks. Their findings show that compute-optimal TTS strategies depend on these factors, allowing smaller models (e.g., 1B, 3B, 7B) to outperform larger ones (e.g., 405B, GPT-4o, DeepSeek-R1) with greater efficiency. The study emphasizes the importance of reward-aware TTS for optimal scaling, demonstrating that strategic test-time computation significantly enhances LLM reasoning abilities across different architectures and task complexities.

Compute-optimal TTS distributes computational resources so that each problem receives the budget that maximizes performance. Prior approaches rely on PRMs as verifiers, trained either on the same policy model (on-policy) or on a different one (offline). On-policy PRMs yield more accurate rewards, while offline PRMs face out-of-distribution challenges. Because training a PRM for every policy model is expensive, a more general approach is needed. Experiments show that rewards significantly influence TTS performance, so the authors propose a reward-aware strategy that integrates rewards into compute allocation. Additionally, problem difficulty is better assessed with absolute thresholds than with quantiles, yielding more effective scaling strategies.
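
The following sketch illustrates the two ideas in this paragraph under assumed interfaces: difficulty buckets defined by absolute thresholds (rather than dataset quantiles) and a per-problem sampling budget that grows with difficulty, with the PRM's rewards doing the selection inside the search routine. The thresholds, budgets, and the `solve_with_budget` callable are hypothetical, not the paper's exact settings.

```python
# Illustrative sketch: absolute-threshold difficulty buckets plus
# reward-aware, per-problem allocation of the test-time sampling budget.
from typing import Callable, List, Tuple

def difficulty_bucket(pass_rate: float) -> str:
    """Bucket by an absolute pass-rate threshold instead of dataset quantiles
    (thresholds here are assumptions for illustration)."""
    if pass_rate >= 0.5:
        return "easy"
    if pass_rate >= 0.1:
        return "medium"
    return "hard"

BUDGET = {"easy": 4, "medium": 16, "hard": 64}  # candidate samples per problem (illustrative)

def reward_aware_tts(
    problems: List[Tuple[str, float]],             # (question, estimated CoT pass rate)
    solve_with_budget: Callable[[str, int], str],  # PRM-guided search under a sample budget
) -> List[str]:
    """Spend more test-time compute on harder problems; the verifier's rewards
    decide which sampled trajectories survive inside solve_with_budget."""
    answers = []
    for question, pass_rate in problems:
        n_samples = BUDGET[difficulty_bucket(pass_rate)]
        answers.append(solve_with_budget(question, n_samples))
    return answers
```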

The study examines the effectiveness of Compute-Optimal TTS in enhancing the performance of small policy models compared to larger ones. Experiments assess whether TTS allows smaller models to outperform larger ones, improve upon CoT and majority voting, and surpass long-CoT methods. Findings reveal that small models using compute-optimal TTS can outperform significantly larger models on MATH-500 and AIME24 tasks. TTS improves efficiency by up to 256× compared to majority voting and boosts reasoning by 154.6% over CoT. Moreover, TTS outperforms several long-CoT-based methods, demonstrating its effectiveness in enhancing LLM reasoning capabilities.
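
As a point of reference for the majority-voting baseline mentioned above, here is a minimal self-consistency sketch: sample N final answers from the same fixed model and return the most frequent one, with no reward model involved. The `sample_answer` callable is a hypothetical policy-model call, not an API from the paper.

```python
# Sketch of the majority-voting (self-consistency) baseline: answer frequency
# only, no verifier. `sample_answer` is a hypothetical policy-model call.
from collections import Counter
from typing import Callable

def majority_vote(
    prompt: str,
    sample_answer: Callable[[str], str],  # one sampled final answer
    n: int = 64,
) -> str:
    """Return the most frequent answer among n independent samples."""
    votes = Counter(sample_answer(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]
```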

In conclusion, the study examines compute-optimal TTS across various policy models, PRMs, and task complexities. Findings highlight that smaller models can surpass larger ones using optimized TTS, with a 1B model outperforming a 405B model. A 7B PRM also effectively supervises a 72B policy model, emphasizing a shift towards “weak-to-strong” supervision. Future work should focus on improving supervision methods for enhanced reasoning. While results are based on mathematical tasks, expanding TTS to coding and chemistry remains unexplored. These insights underscore TTS’s potential to refine LLM efficiency and adaptability across diverse challenges.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


Related tags

Test-Time Scaling, Large Language Models, Process Reward Models, Compute Optimization, Weak-to-Strong Supervision