MarkTechPost@AI · February 7
s1: A Simple Yet Powerful Test-Time Scaling Approach for LLMs

This article introduces a simple yet powerful test-time scaling approach for large language models (LLMs). By carefully curating the s1K dataset of 1,000 questions with reasoning traces and pairing it with a new technique called budget forcing, the method controls the amount of test-time computation: strategically inserted "Wait" instructions lengthen or shorten the model's thinking process, giving it the chance to review and correct its reasoning. Experiments show that after fine-tuning Qwen2.5-32B-Instruct, the resulting s1-32B model delivers clear gains from test-time compute scaling and is more sample-efficient than other models, achieving marked improvement with only 1,000 additional training samples.

🧠**s1K dataset**: 1,000 questions carefully selected for difficulty, diversity, and quality, intended to provide high-quality reasoning training data.

⏱️**Budget forcing**: a technique that controls test-time computation by strategically inserting "Wait" instructions during the model's reasoning, giving the model an opportunity to review and correct its reasoning process.

🏆**s1-32B performance**: fine-tuned from Qwen2.5-32B-Instruct on s1K, s1-32B shows significant gains from test-time compute scaling and outperforms other models in sample efficiency, achieving marked improvement with only 1,000 additional training samples and matching o1-preview's performance.

🎯**Knowledge distillation**: s1-32B approaches Gemini 2.0 Thinking's performance on AIME24, indicating successful distillation and demonstrating that a small amount of high-quality data can activate the latent reasoning abilities a model acquires during pretraining.

Language models (LMs) have progressed significantly through increased computational power during training, primarily via large-scale self-supervised pretraining. While this approach has yielded powerful models, a new paradigm called test-time scaling has emerged, focusing on improving performance by increasing computation at inference time. OpenAI’s o1 model has validated this approach, showing enhanced reasoning capabilities through test-time compute scaling. However, replicating these results has proven challenging, with various attempts using techniques such as Monte Carlo Tree Search (MCTS), multi-agent approaches, and reinforcement learning. Even models like DeepSeek R1 have used millions of samples and complex training stages, yet none has replicated the test-time scaling behavior demonstrated by o1.

Various methods have been developed to tackle the test-time scaling challenge. Sequential scaling approaches enable models to generate successive solution attempts, with each iteration building upon previous outcomes. Tree-based search methods combine sequential and parallel scaling, implementing techniques like MCTS and guided beam search. REBASE has emerged as a notable approach, utilizing a process reward model to optimize tree search through balanced exploitation and pruning, showing superior performance compared to sampling-based methods and MCTS. These approaches heavily rely on reward models, which come in two forms: outcome reward models for evaluating complete solutions in Best-of-N selection, and process reward models for assessing individual reasoning steps in tree-based search methods.
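To make the role of the outcome reward model concrete, here is a minimal sketch of Best-of-N selection; `generate` and `score` are hypothetical stand-ins for a sampling call and an outcome reward model, not APIs from any of the cited works.

```python
def best_of_n(prompt, generate, score, n=8):
    """Sample n candidate solutions in parallel and keep the one that the
    outcome reward model scores highest (Best-of-N selection)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```

Process reward models differ in that they score individual reasoning steps rather than complete solutions, which is what tree-based methods such as REBASE rely on when deciding which branches to expand or prune.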

Researchers from Stanford University, the University of Washington, the Allen Institute for AI, and Contextual AI have proposed a streamlined approach to achieve test-time scaling and enhanced reasoning capabilities. Their method centers on two key innovations: the carefully curated s1K dataset, comprising 1,000 questions with reasoning traces selected for difficulty, diversity, and quality, and a novel technique called budget forcing. This mechanism controls test-time computation by either cutting the model’s thinking process short or extending it through strategic “Wait” insertions, enabling the model to review and correct its reasoning. The approach was implemented by fine-tuning the Qwen2.5-32B-Instruct language model on the s1K dataset.
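A minimal sketch of how budget forcing could be wired into a decoding loop is shown below. It assumes a caller-supplied `generate(prompt, stop=..., max_new_tokens=...)` sampling function and a `count_tokens` helper; the function names and the end-of-thinking delimiter are illustrative placeholders, not the authors' actual implementation.

```python
def budget_forced_generate(generate, count_tokens, prompt,
                           max_thinking_tokens=8192, num_extensions=0,
                           end_think="<|end_think|>", wait="Wait"):
    """Decode a reasoning trace under an explicit token budget.

    Capping compute: thinking stops once max_thinking_tokens is reached and
    the end-of-thinking delimiter is appended so the model moves on to its
    answer. Extending compute: the delimiter is suppressed num_extensions
    times and "Wait" is appended instead, prompting the model to re-check
    its reasoning.
    """
    thinking = generate(prompt, stop=end_think,
                        max_new_tokens=max_thinking_tokens)
    for _ in range(num_extensions):
        remaining = max_thinking_tokens - count_tokens(thinking)
        if remaining <= 0:
            break  # budget exhausted: cut the thinking short
        thinking += wait  # suppress end-of-thinking and force more reasoning
        thinking += generate(prompt + thinking, stop=end_think,
                             max_new_tokens=remaining)
    # Close the thinking section and let the model produce its final answer.
    return generate(prompt + thinking + end_think,
                    stop=None, max_new_tokens=1024)
```

Setting `num_extensions > 0` is what produces the sequential scaling behavior: each appended "Wait" buys the model another pass over its own reasoning before it commits to an answer.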

The data selection process follows a three-stage filtering approach based on quality, difficulty, and diversity criteria. The quality filtering stage begins by removing samples with API errors and formatting issues, reducing the initial dataset to 51,581 examples, from which 384 high-quality samples are initially selected. The difficulty assessment employs two key metrics: model performance evaluation using Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct models, with correctness verified by Claude 3.5 Sonnet, and reasoning trace length measured by the Qwen2.5 tokenizer. For diversity, questions are classified into specific domains using the Mathematics Subject Classification system through Claude 3.5 Sonnet. This comprehensive filtering process results in a final dataset of 1,000 samples spanning 50 domains.
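The selection pipeline can be sketched roughly as follows; the helper functions (`is_well_formed`, `is_unsolved_by_graders`, `classify_domain`, `trace_length`) are hypothetical stand-ins for the quality checks, the Qwen2.5-based grading verified by Claude 3.5 Sonnet, and the MSC-based domain classification described above.

```python
import random
from collections import defaultdict

def select_s1k(pool, is_well_formed, is_unsolved_by_graders,
               classify_domain, trace_length, target=1000):
    """Three-stage selection: quality filter, difficulty filter, then
    domain-balanced sampling that favors longer reasoning traces."""
    # Stage 1 (quality): drop samples with API errors or formatting issues.
    clean = [ex for ex in pool if is_well_formed(ex)]

    # Stage 2 (difficulty): keep questions the grader models get wrong;
    # reasoning-trace length serves as a proxy for problem difficulty.
    hard = [ex for ex in clean if is_unsolved_by_graders(ex)]

    # Stage 3 (diversity): bucket by domain, then repeatedly pick a random
    # domain and take its longest remaining trace until the target is hit.
    by_domain = defaultdict(list)
    for ex in hard:
        by_domain[classify_domain(ex)].append(ex)
    for bucket in by_domain.values():
        bucket.sort(key=trace_length)  # ascending, so pop() takes the longest

    selected = []
    while len(selected) < target and by_domain:
        domain = random.choice(list(by_domain))
        selected.append(by_domain[domain].pop())
        if not by_domain[domain]:
            del by_domain[domain]
    return selected
```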

The s1-32B model demonstrates significant performance improvements through test-time compute scaling with budget forcing. It scales in a superior paradigm compared to the base Qwen2.5-32B-Instruct model using majority voting, validating the effectiveness of sequential scaling over parallel approaches. Moreover, s1-32B is the most sample-efficient open-data reasoning model, showing marked improvement over the base model with just 1,000 additional training samples. While r1-32B achieves better performance, it requires 800 times more training data. Notably, s1-32B approaches Gemini 2.0 Thinking’s performance on AIME24, suggesting successful knowledge distillation.

This paper shows that supervised fine-tuning (SFT) on just 1,000 carefully selected examples can create a competitive reasoning model that matches o1-preview’s performance with remarkable sample efficiency. The introduced budget forcing technique, combined with the reasoning model, successfully reproduces OpenAI’s test-time scaling behavior. The effectiveness of such minimal training data suggests that the model’s reasoning capabilities are largely acquired during pretraining on trillions of tokens, with fine-tuning merely activating these latent abilities. This aligns with the “Superficial Alignment Hypothesis” from the LIMA research, which suggests that a relatively small number of examples can effectively align a model’s behavior with desired outcomes.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.



