MarkTechPost@AI October 14, 2024
Salesforce AI Research Proposes Dataset-Driven Verifier to Improve LLM Reasoning Consistency

This article presents an innovative method for improving the reasoning consistency of large language models. The researchers propose a framework that generates multiple reasoning paths and trains verifiers on a comprehensive dataset to evaluate and rank outputs, improving accuracy. The method achieves strong results on several benchmarks and points to a new direction for LLM development.

🎯 The researchers note that LLMs struggle with multi-step reasoning, finding it hard to execute complex tasks accurately or to learn from their errors; the new framework addresses this by generating multiple reasoning paths.

📚 To train effective verifiers, the team built a comprehensive dataset covering math and code tasks, containing correct and incorrect answers generated by multiple LLMs, and used it to train two verifiers: Math-Rev and Code-Rev.

🎉 Results show that the Math-Rev and Code-Rev verifiers make marked gains on benchmarks, surpassing models such as GPT-4o and LLaMA3, and that reference-free preference tuning outperforms traditional outcome reward models.

💡 The proposed paradigm, which integrates collaborative verification with multiple reasoning paths and verifiers, not only achieves state-of-the-art results but also highlights the potential of combining different reasoning strategies to make complex problem-solving more accurate and reliable.

Large language models (LLMs) often fail to consistently and accurately perform multi-step reasoning, especially in complex tasks like mathematical problem-solving and code generation. Despite recent advancements, LLMs struggle to detect and learn from errors because they are predominantly trained on correct solutions. This limitation leads to difficulties in verifying and ranking outputs, particularly when subtle flaws are present.

Researchers from the University of Notre Dame and Salesforce AI introduce an innovative framework that scales up inference-time computation by generating multiple reasoning paths for complex tasks. Verifiers assess these paths and rank the generated outputs by correctness to improve accuracy. To train effective verifiers, the team developed a comprehensive dataset of both correct and incorrect solutions for math and code tasks generated by multiple LLMs. This dataset is unique because it includes a diverse range of solution patterns, allowing the verifiers to better distinguish between correct and erroneous answers. By integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) reasoning strategies, the researchers developed a novel collaborative verification approach that leverages both step-by-step human-readable reasoning and executable code validation.
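To make the inference-time scaling idea concrete, here is a minimal sketch of verifier-guided best-of-N selection. The `generate` and `score` callables stand in for a sampling LLM and a trained verifier; they are assumptions for illustration, not the authors' released code.

```python
from typing import Callable, List, Tuple

def best_of_n(
    problem: str,
    generate: Callable[[str], str],      # samples one candidate reasoning path
    score: Callable[[str, str], float],  # verifier score for (problem, solution)
    n: int = 16,
) -> Tuple[str, float]:
    """Sample n candidate solutions and return the one the verifier ranks highest."""
    candidates: List[str] = [generate(problem) for _ in range(n)]
    scored = [(solution, score(problem, solution)) for solution in candidates]
    return max(scored, key=lambda pair: pair[1])
```

The design choice is that extra compute goes into sampling and ranking rather than into a larger model: accuracy improves as n grows, provided the verifier reliably separates correct from flawed paths.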

The dataset introduced is comprehensive, covering both math and code tasks. It consists of solutions generated by various LLMs, encompassing both correct and incorrect answers. For the math tasks, models such as Mistral, Phi, and InternLM2-Math were used, generating over 159,000 correct and 100,000 incorrect solutions. For code reasoning, datasets like MBPP and MagiCoder-75k were used to produce more than 132,000 correct and 145,000 incorrect code solutions. Each problem had multiple sampled solutions, providing a diverse collection of approaches and errors. This dataset was used to train two verifiers: Math Reasoning Ensembled Verifier (Math-Rev) and Code Reasoning Ensembled Verifier (Code-Rev), both developed using SimPO, a reference-free preference-tuning method.
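A hedged sketch of how such a dataset could be turned into training pairs for reference-free preference tuning like SimPO: each correct solution is preferred over each incorrect one for the same problem. The record schema ("problem", "correct", "incorrect") is illustrative, not the released data format.

```python
import itertools
from typing import Dict, Iterator

def preference_pairs(record: Dict) -> Iterator[Dict[str, str]]:
    """Yield one preference-tuning example per (correct, incorrect) solution pair."""
    for chosen, rejected in itertools.product(record["correct"], record["incorrect"]):
        yield {
            "prompt": record["problem"],
            "chosen": chosen,      # a verified-correct solution
            "rejected": rejected,  # a solution with a known error
        }
```

Because each problem has many sampled solutions on both sides, a single problem can yield many contrastive pairs, which is what lets the verifiers learn to spot subtle flaws rather than just surface style.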

The results presented in the paper demonstrate significant improvements over previous methods. The verifiers Math-Rev and Code-Rev achieved state-of-the-art accuracy on benchmarks such as GSM8k and MATH, even surpassing the performance achieved by GPT-4o and LLaMA3. For instance, Math-Rev paired with Qwen-72B-Instruct outperformed LLaMA3.1-405B and GPT-4o on the MATH test set, with notable accuracy improvements. The researchers also compared different training methods for verifiers, finding that reference-free preference tuning, such as SimPO, performed better than traditional outcome reward models (ORM). Moreover, the integration of Chain-of-Thought and Program-of-Thought methods for verification, called CoTnPoT, proved effective in leveraging the strengths of both natural language and executable code to enhance verification accuracy.
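The cross-checking intuition behind CoTnPoT can be illustrated with a small sketch: execute the program-of-thought code and accept a candidate only if its output agrees with the answer extracted from the chain-of-thought text. The answer format, function names, and sandboxing here are assumptions, not the paper's exact implementation.

```python
import re
import subprocess
import sys
from typing import Optional

def extract_cot_answer(cot_text: str) -> Optional[str]:
    """Pull a final numeric answer from a 'the answer is X' style CoT trace (assumed format)."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", cot_text, re.IGNORECASE)
    return match.group(1) if match else None

def run_pot(code: str, timeout: float = 5.0) -> Optional[str]:
    """Execute program-of-thought code in a subprocess and return its stdout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout.strip() or None

def cot_pot_agree(cot_text: str, pot_code: str) -> bool:
    """Accept a candidate only when both reasoning routes yield the same answer."""
    cot_answer = extract_cot_answer(cot_text)
    return cot_answer is not None and cot_answer == run_pot(pot_code)
```

Agreement between a human-readable derivation and an executable one is a stronger signal than either alone, which is why combining the two strategies improves verification accuracy.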

Conclusion

This research introduces a new paradigm for improving the reasoning capabilities of LLMs by integrating collaborative verification with multiple reasoning paths and verifiers. By releasing their comprehensive dataset and verifiers, the researchers aim to foster future advancements in scaling up inference-time computation and enhancing the reliability of LLMs. Their approach not only achieves state-of-the-art results but also highlights the potential of integrating different reasoning strategies to make complex problem-solving more accurate and reliable. This work paves the way for more robust LLMs that can better understand and verify their own outputs, thus increasing the trustworthiness of AI-generated reasoning.


Check out the Paper. All credit for this research goes to the researchers of this project.


