MarkTechPost@AI · February 21
This AI Paper Introduces ‘Shortest Majority Vote’: An Improved Parallel Scaling Method for Enhancing Test-Time Performance in Large Language Models

This article introduces an innovative method called "Shortest Majority Vote," designed to optimize the test-time scaling performance of large language models (LLMs). Proposed by researchers from Fudan University and the Shanghai AI Laboratory, the method addresses the limitations of conventional sequential scaling by leveraging parallel computation and factoring in solution length. The research shows that shorter solutions tend to be more accurate than longer ones because they contain fewer self-revisions. By incorporating solution length into the majority-voting process, the method prioritizes answers that are both frequent and concise, improving model performance.

💡**The challenge of test-time scaling**: Large language models use extensive computational resources to process and generate human-like text. Test-time scaling is an emerging technique that aims to improve an LLM's reasoning ability by dynamically allocating compute during inference. However, research shows that lengthening the reasoning chain does not necessarily improve accuracy and can instead introduce inconsistencies.

⚖️**Sequential vs. parallel scaling**: Current test-time scaling methods fall into two main categories: sequential and parallel. Sequential scaling extends the chain of thought (CoT) during inference, expecting longer reasoning to improve accuracy, but studies show that extending the CoT does not always yield better results. Parallel scaling, by contrast, generates multiple solutions simultaneously and selects the best one according to a predetermined criterion; it is more effective at maintaining both accuracy and efficiency.

🗳️**The Shortest Majority Vote method**: Researchers from Fudan University and the Shanghai AI Laboratory proposed an innovative method called "Shortest Majority Vote," which optimizes test-time scaling by leveraging parallel computation while taking solution length into account. Its core insight is that shorter solutions tend to be more accurate than longer ones because they contain fewer self-revisions. By incorporating solution length into the majority-voting process, the method prioritizes answers that are both frequent and concise, improving model performance.

📈**Experimental validation**: Evaluations show that the Shortest Majority Vote method significantly outperforms traditional majority voting across multiple benchmarks. On the AIME dataset, models using the technique achieved higher accuracy than existing test-time scaling approaches; for example, R1-Distill-32b reached 72.88%. QwQ and LIMO also showed improved performance, especially in cases where extended reasoning chains had previously introduced inconsistencies.

Large language models (LLMs) use extensive computational resources to process and generate human-like text. One emerging technique to enhance reasoning capabilities in LLMs is test-time scaling, which dynamically allocates computational resources during inference. This approach aims to improve the accuracy of responses by refining the model’s reasoning process. As models like OpenAI’s o1 series introduced test-time scaling, researchers sought to understand whether longer reasoning chains led to improved performance or if alternative strategies could yield better results.

Scaling reasoning in AI models poses a significant challenge, especially in cases where extended chains of thought do not necessarily translate to better outcomes. The assumption that increasing the length of responses enhances accuracy is being questioned by researchers, who have found that longer explanations can introduce inconsistencies. Errors accumulate over extended reasoning chains, and models often make unnecessary self-revisions, leading to performance degradation rather than improvement. If test-time scaling is to be an effective solution, it must balance reasoning depth with accuracy, ensuring that computational resources are used efficiently without diminishing the model’s effectiveness.

Current approaches to test-time scaling primarily fall into sequential and parallel categories. Sequential scaling extends the chain-of-thought (CoT) during inference, expecting that longer reasoning will lead to improved accuracy. However, studies on models like QwQ, Deepseek-R1 (R1), and LIMO indicate that extending CoTs does not consistently yield better results. These models frequently use self-revision, introducing redundant computations that degrade performance. In contrast, parallel scaling generates multiple solutions simultaneously and selects the best one based on a predetermined criterion. Comparative analyses suggest that parallel scaling is more effective in maintaining accuracy and efficiency.
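As a point of reference for the parallel category, the standard selection criterion is plain majority voting over the final answers of independently sampled solutions. A minimal sketch (the answer-extraction step is assumed to have already happened):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among parallel samples."""
    return Counter(answers).most_common(1)[0][0]

# Five parallel solutions reduced to their extracted final answers:
samples = ["42", "41", "42", "42", "7"]
print(majority_vote(samples))  # -> 42
```

Note that this baseline looks only at answer frequency; it is indifferent to how long or convoluted each underlying solution was, which is exactly the gap the paper's method targets.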

Researchers from Fudan University and the Shanghai AI Laboratory introduced an innovative method called “Shortest Majority Vote” to address the limitations of sequential scaling. This method optimizes test-time scaling by leveraging parallel computation while factoring in solution length. The primary insight behind this approach is that shorter solutions tend to be more accurate than longer ones, as they contain fewer unnecessary self-revisions. By incorporating solution length into the majority voting process, this method enhances models’ performance by prioritizing frequent and concise answers.

The proposed method modifies traditional majority voting by considering the number and length of solutions. Conventional majority voting selects the most frequently occurring answer among generated solutions, whereas Shortest Majority Vote assigns higher priority to answers that appear often but are also shorter. The reasoning behind this approach is that longer solutions tend to introduce more errors due to excessive self-revisions. Researchers found that QwQ, R1, and LIMO generate increasingly longer responses when prompted to refine their solutions, often leading to lower accuracy. The proposed method aims to filter out unnecessary extensions and prioritize more precise answers by integrating length as a criterion.
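The modification described above can be sketched as follows. The paper's exact scoring function is not reproduced in this article, so the ranking rule here (frequency first, shorter average solution length as the tie-breaker) is one illustrative reading of "prioritize frequent and concise answers":

```python
from collections import defaultdict

def shortest_majority_vote(solutions):
    """Length-aware vote over (final_answer, solution_text) pairs.

    Answers are ranked by how often they occur; among equally frequent
    answers, the one whose solutions are shorter on average wins.
    Illustrative sketch only -- the paper's scoring may differ.
    """
    lengths_by_answer = defaultdict(list)
    for answer, text in solutions:
        lengths_by_answer[answer].append(len(text))

    def score(answer):
        lengths = lengths_by_answer[answer]
        # Higher frequency wins; shorter average length breaks ties.
        return (len(lengths), -sum(lengths) / len(lengths))

    return max(lengths_by_answer, key=score)

samples = [
    ("42", "short derivation ... 42"),
    ("41", "a very long chain of thought with many self-revisions ... 41"),
    ("41", "another long chain of thought, repeatedly second-guessed ... 41"),
    ("42", "concise solution ... 42"),
]
print(shortest_majority_vote(samples))  # -> 42
```

With the votes tied two against two, plain majority voting would have to break the tie arbitrarily; the length-aware rule picks "42" because its supporting solutions are shorter on average, matching the intuition that heavily self-revised chains are less trustworthy.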

Experimental evaluations demonstrated that the Shortest Majority Vote method significantly outperformed traditional majority voting across multiple benchmarks. On the AIME dataset, models incorporating this technique showed an increase in accuracy compared to existing test-time scaling approaches. For instance, R1-Distill-32b reached 72.88% accuracy, an improvement over conventional methods. Similarly, QwQ and LIMO also exhibited enhanced performance, particularly in cases where extended reasoning chains previously led to inconsistencies. These findings suggest that the assumption that longer solutions always yield better results is flawed. Instead, a structured and efficient approach that prioritizes conciseness can lead to superior performance.

The results also revealed that sequential scaling suffers from diminishing returns. While initial revisions may contribute to improved responses, excessive revisions often introduce errors rather than correcting them. In particular, models like QwQ and R1-Distill-1.5b tended to change correct answers into incorrect ones rather than improving accuracy. This phenomenon further highlights the limitations of sequential scaling, reinforcing the argument that a more structured approach, such as Shortest Majority Vote, is necessary for optimizing test-time scaling.

The research underscores the need to rethink how test-time scaling is applied in large language models. Rather than assuming that extending reasoning chains leads to better accuracy, the findings demonstrate that prioritizing concise, high-quality solutions through parallel scaling is a more effective strategy. The introduction of Shortest Majority Vote provides a practical and empirically validated improvement over existing methods, offering a refined approach to optimizing computational efficiency in LLMs. By focusing on structured reasoning rather than excessive self-revision, this method paves the way for more reliable and accurate AI-driven decision-making.



