MarkTechPost@AI, September 21, 2024
Google DeepMind Introduced Self-Correction via Reinforcement Learning (SCoRe): A New AI Method Enhancing Large Language Models’ Accuracy in Complex Mathematical and Coding Tasks

Google DeepMind introduces SCoRe, a method designed to improve large language models' self-correction ability and strengthen their performance and reliability on complex tasks.

🎯 Large language models are widely used for complex reasoning, but struggle to correct their own errors. When confronted with mistakes, current models either over-rely on prompt instructions or fail to adjust their responses dynamically.

💡 Various methods have been proposed to address this, but each has notable limitations: supervised fine-tuning tends to amplify biases in the original training data, while multi-model approaches are computationally expensive and may not match the real-world query distribution.

🌟 Google DeepMind's SCoRe uses reinforcement learning on self-generated data to teach the model to improve its own responses, reducing reliance on external supervision. The method consists of two key stages, initialization training and reinforcement learning, and substantially improves the model's self-correction ability.

🎉 SCoRe delivers significant gains on mathematical reasoning and coding tasks, improving self-correction accuracy and reducing cases where a correct answer is revised into an incorrect one, demonstrating broad applicability across domains.

Large language models (LLMs) are increasingly used in domains requiring complex reasoning, such as mathematical problem-solving and coding. These models can generate accurate outputs across several domains. However, a crucial aspect of their development is their ability to self-correct errors without external input, known as intrinsic self-correction. Many LLMs, despite possessing the knowledge needed to solve complex problems, fail to retrieve or apply it accurately when required, resulting in incomplete or incorrect answers. The growing importance of self-correction has led researchers to explore new methods to enhance LLMs’ performance and reliability in real-world applications.

One of the main challenges in improving LLMs is their inability to correct their mistakes consistently. While LLMs may generate partially correct responses, they often struggle to revise incorrect answers when confronted with errors. Current models either over-rely on prompt-based instructions or fail to adjust their responses dynamically when errors arise. This issue is especially pronounced in tasks requiring multi-step reasoning, where the model’s inability to revisit and revise earlier steps leads to cumulative inaccuracies. To address this problem, researchers are exploring techniques that enhance the model’s ability to independently detect and correct its mistakes, significantly improving performance in tasks that involve reasoning and problem-solving.

Various methods have been developed to tackle this issue, but most have significant limitations. Many rely on supervised fine-tuning, where LLMs are trained to follow correction patterns from previous responses. This approach, however, often amplifies biases from the original training data, leading the model to make minimal or ineffective corrections. Other techniques employ separate verifier models to guide corrections; these are computationally expensive and may not be feasible for widespread deployment. They also suffer from a mismatch between the training data and the real-world query distribution, leading to suboptimal results in practice. The need for a method that enables LLMs to self-correct without external supervision has become increasingly clear.

Researchers at Google DeepMind introduced a novel approach called Self-Correction via Reinforcement Learning (SCoRe). This method aims to teach LLMs to improve their responses using self-generated data, eliminating the need for external supervision or verifier models. By employing multi-turn reinforcement learning (RL), SCoRe enables the model to learn from its responses and adjust them in subsequent iterations. This method reduces the reliance on external data and trains the model to handle real-world tasks more effectively by improving the self-correction capability. Using this approach, the researchers addressed the common problem of distribution mismatch in training data, making the model’s corrections more robust and effective.
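To make the multi-turn setup concrete, here is a minimal sketch of the two-attempt, intrinsic self-correction rollout the article describes: the model produces a first answer, then is prompted to revise its own output with no external verifier or hints. The prompt wording and the `generate()` helper are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a two-attempt, intrinsic self-correction rollout.
# `generate` is a placeholder for any LLM completion call; the exact
# correction instruction is an assumption, not the prompt used in SCoRe.

def generate(prompt: str) -> str:
    """Stand-in for a call to the language model."""
    raise NotImplementedError

def self_correction_rollout(problem: str) -> tuple[str, str]:
    # Turn 1: the model answers the problem directly.
    first_attempt = generate(f"Solve the following problem:\n{problem}")

    # Turn 2: the model sees only its own previous answer and is asked
    # to revise it -- no external verifier or feedback is provided.
    correction_prompt = (
        f"Problem:\n{problem}\n\n"
        f"Your previous answer:\n{first_attempt}\n\n"
        "There may be an error in the answer above. "
        "Re-examine it and provide a corrected final answer."
    )
    second_attempt = generate(correction_prompt)
    return first_attempt, second_attempt
```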

SCoRe’s methodology involves two key stages. In the first stage, the model undergoes initialization training and is optimized to produce an effective initial correction strategy; this step helps the model learn to make substantial corrections without collapsing into minor edits. In the second stage, reinforcement learning is employed to amplify the model’s self-correction ability, focusing on a multi-turn setting in which the model is rewarded for generating better corrections on subsequent attempts. Reward shaping in the reinforcement learning process ensures that the model focuses on improving accuracy rather than making minimal changes. Combining these two stages significantly improves the model’s capacity to identify and correct errors, even when confronted with complex queries.
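One way to read the reward-shaping idea, as a hedged sketch rather than the paper's actual objective: reward the correctness of the second attempt and add a bonus for turning a wrong first attempt into a right one (and a penalty for spoiling a correct answer), so that minimal edits earn nothing extra. The coefficient `alpha` and the binary correctness signal are assumptions for illustration.

```python
# Illustrative reward shaping for the multi-turn RL stage, assuming a
# binary correctness signal per attempt. The progress bonus encourages
# genuine corrections rather than minimal edits; `alpha` is a made-up
# hyperparameter, not a value from the paper.

def shaped_reward(first_correct: bool, second_correct: bool,
                  alpha: float = 1.0) -> float:
    # Base reward: correctness of the final (second) attempt.
    base = 1.0 if second_correct else 0.0

    # Progress term: +1 for wrong -> right, -1 for right -> wrong, else 0.
    progress = int(second_correct) - int(first_correct)

    return base + alpha * progress
```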

The results of the SCoRe method demonstrate a significant improvement in the self-correction performance of LLMs. When applied to the Gemini 1.0 Pro and 1.5 Flash models, SCoRe achieved a 15.6% improvement in self-correction accuracy for mathematical reasoning tasks from the MATH dataset and a 9.1% improvement for coding tasks in the HumanEval dataset. These gains highlight the method’s effectiveness compared to traditional supervised fine-tuning methods. The model’s accuracy increased to 60.0% for the first attempt and 64.4% for the second attempt, showcasing its ability to revise its initial response effectively. These results are a significant leap forward, as existing models typically fail to achieve positive self-correction rates.

The performance metrics also underline SCoRe’s success in reducing the number of correct answers that were changed to incorrect answers on the second attempt, a common issue in other self-correction methods. The model improved its correction rate from 4.6% to 5.8% in mathematical reasoning tasks while reducing correct-to-incorrect changes. SCoRe showed similar improvements in coding tasks, achieving a 12.2% self-correction delta on the HumanEval benchmark, underscoring its generalizability across different domains.
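The figures quoted above correspond to standard self-correction metrics: accuracy at the first and second attempt, their difference (the self-correction delta), and the fractions of problems flipped in either direction. Below is a small sketch of how these could be computed from per-problem correctness flags; the function and field names are assumptions, not from the paper.

```python
# Sketch of the self-correction metrics discussed above, computed from
# per-problem correctness flags for the first and second attempts.

def self_correction_metrics(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """`results` holds (first_attempt_correct, second_attempt_correct) pairs."""
    n = len(results)
    acc_t1 = sum(first for first, _ in results) / n
    acc_t2 = sum(second for _, second in results) / n
    wrong_to_right = sum((not f) and s for f, s in results) / n
    right_to_wrong = sum(f and (not s) for f, s in results) / n
    return {
        "accuracy@t1": acc_t1,
        "accuracy@t2": acc_t2,
        "self_correction_delta": acc_t2 - acc_t1,
        "incorrect_to_correct": wrong_to_right,
        "correct_to_incorrect": right_to_wrong,
    }
```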

In conclusion, the development of SCoRe addresses a long-standing problem in the field of large language models. By applying reinforcement learning to self-generated data, the researchers have made substantial progress in enabling LLMs to self-correct effectively. SCoRe improves accuracy and enhances the model’s ability to handle complex, multi-step reasoning tasks. This approach marks a significant shift from previous methods, which relied on external supervision and suffered from data mismatches. The two-stage training process and reward shaping provide a robust framework for improving LLMs’ self-correction capabilities, making them more reliable for practical applications.


Check out the Paper. All credit for this research goes to the researchers of this project.

