MarkTechPost@AI · February 28
Elevating AI Reasoning: The Art of Sampling for Learnability in LLM Training

This article introduces a new reinforcement learning training strategy designed to improve the training efficiency of large language models (LLMs) on reasoning tasks such as mathematical problem-solving. The strategy focuses on selecting samples whose success rates have high variance, avoiding the wasted effort that conventional methods spend on problems that are too easy or too hard. By prioritizing problems on which the model performs inconsistently, the method concentrates training on the cases that provide the most informative learning signals, accelerating convergence, improving generalization, and reducing computational cost. Experimental results show that the strategy significantly improves model performance across multiple datasets.

🎯 Conventional reinforcement learning fine-tuning of large language models (LLMs) is inefficient: on a large fraction of problems the model is either always right or always wrong, so the learning signal is weak.

💡 To address this, the researchers propose a new training strategy that selects samples with high variance in success rate, forcing the model to focus on problems that are neither too easy nor too hard and thereby providing more informative learning signals.

🧪 The strategy operates as a multi-step pipeline: identify candidate problems, estimate their success probabilities, compute the variance, select high-variance problems and store them in a dynamic buffer, and finally form training batches by combining high-variance problems from the buffer with randomly sampled examples.

📊 Experiments show that, compared with conventional methods, models trained with this strategy reach the same accuracy in fewer training steps and achieve better test accuracy and generalization across multiple datasets.

Reinforcement learning (RL) has become a core component in training large language models (LLMs) for tasks that involve reasoning, particularly mathematical problem-solving. A considerable inefficiency arises during training: many questions are either always solved or never solved. This lack of variability in success rates leads to poor learning, because questions that yield no gradient signal cannot improve the model's performance. Traditional RL-based fine-tuning strategies therefore incur high computational costs, increased energy usage, and wasted resources. Correcting this is necessary to improve training efficiency and ensure language models learn from the problems that most improve their reasoning.

The standard training regimen for LLMs uses policy gradient techniques such as Proximal Policy Optimization (PPO), in which the model attempts each query repeatedly and is updated based on success or failure. A major drawback of this approach, however, is that most training examples cluster at the extremes: they are either always answered correctly or always answered incorrectly. When an example is always solved, repeated attempts offer no new learning information; conversely, an impossible query provides no feedback to improve on. As a result, valuable computational resources are wasted on uninformative training scenarios. Curriculum-learning techniques such as Unsupervised Environment Design (UED) have attempted to control training difficulty dynamically, but they rely on heuristics such as regret-based selection, which are poor at anticipating the right problem difficulty and fail to generalize to the reasoning tasks relevant to LLM training.
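To see why the extremes yield no useful update, consider a minimal sketch that assumes a simple group-mean baseline over rollouts (an illustrative choice, not necessarily the exact advantage estimator used in the paper): when every rollout of a question receives the same reward, every advantage is zero and the policy gradient for that question vanishes.

```python
import numpy as np

def advantages(rewards):
    """Advantage of each rollout relative to the group-mean baseline.

    Assumes a plain group-mean baseline purely for illustration.
    """
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

# A question the model always solves: every rollout gets reward 1.
print(advantages([1, 1, 1, 1]))   # [0. 0. 0. 0.] -> no gradient signal

# A question the model never solves: every rollout gets reward 0.
print(advantages([0, 0, 0, 0]))   # [0. 0. 0. 0.] -> no gradient signal

# A question solved about half the time: non-zero advantages drive learning.
print(advantages([1, 0, 1, 0]))   # [ 0.5 -0.5  0.5 -0.5]
```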

To address this inefficiency, a novel training strategy has been proposed that focuses on samples with high variance in success rate, forcing the model to concentrate on questions that are neither too easy nor too hard. By identifying and selecting problems on which the model performs inconsistently, the approach concentrates training on the scenarios that provide the most informative learning signals. Unlike earlier approaches that built training batches by random sampling, this structured selection improves update efficiency by filtering out problems that leave no room for meaningful improvement. The procedure adapts during training, continuously refining question selection to track the model's changing ability. By targeting instances of moderate difficulty, the approach enables faster learning and better generalization to novel tasks.
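The selection criterion described below reduces to the variance of a Bernoulli success indicator, p(1 − p), which is zero when the model always succeeds or always fails and peaks at p = 0.5. A minimal sketch of ranking candidates by this score, with the success counts estimated from a handful of rollouts (the question names and rollout count are illustrative, not from the paper):

```python
def learnability_score(successes: int, rollouts: int) -> float:
    """Variance of a Bernoulli success indicator, p * (1 - p).

    Zero when the model always succeeds (p = 1) or always fails (p = 0),
    and maximal at p = 0.5, i.e. on questions of moderate difficulty.
    """
    p = successes / rollouts
    return p * (1.0 - p)

# Illustrative candidates: (question_id, successful rollouts out of 8).
candidates = [("q1", 8), ("q2", 0), ("q3", 4), ("q4", 6)]

# Rank by learnability: q3 (p = 0.5) comes first, q1 and q2 score zero.
ranked = sorted(candidates, key=lambda c: learnability_score(c[1], 8), reverse=True)
print(ranked)  # [('q3', 4), ('q4', 6), ('q1', 8), ('q2', 0)]
```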

The structured selection process operates as a multi-step pipeline that begins by identifying candidate questions at each training iteration. Multiple rollouts are generated to estimate each problem's probability of success, and the variance of the success rate is computed as p(1 − p), where p is the probability of a correct solution. The most learnable questions, those with moderate success probabilities, are prioritized and stored in a dynamic buffer. Training batches are then formed by combining high-variance problems from this buffer with additional examples sampled randomly from the dataset, and this batch is used to compute policy gradients and update the model parameters. The strategy is validated with two reinforcement learning algorithms, PPO and VinePPO, on two mathematical reasoning datasets: MATH, comprising 12,000 competition-level problems, and GSM8K, comprising 8,000 grade-school-level questions. Additional tests on the CollegeMath and OlympiadBench datasets quantify generalization beyond the original training distribution. The framework combines VinePPO with implementation optimizations such as gradient accumulation, multi-rollout estimation, and DeepSpeed ZeRO for scalable performance.
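A minimal sketch of how such a pipeline might be organized is shown below. All hyperparameters (candidate count, rollout count, buffer capacity, batch mix) are illustrative defaults rather than values reported in the paper, and `estimate_successes` and `policy_gradient_update` stand in for the actual rollout generation and PPO/VinePPO update.

```python
import random

def estimate_successes(model, question, n_rollouts):
    """Placeholder for sampling n_rollouts solutions from the model and grading
    them against the reference answer; here it simulates a success count."""
    return random.randint(0, n_rollouts)

def select_training_batch(dataset, model, buffer, *,
                          n_candidates=64, n_rollouts=8,
                          buffer_size=256, batch_size=32, buffer_fraction=0.5):
    """Form one training batch by mixing high-variance problems with random samples.

    Illustrative sketch of the selection pipeline, not the paper's implementation.
    """
    # 1. Identify candidate questions and estimate their success probabilities.
    candidates = random.sample(dataset, min(n_candidates, len(dataset)))
    scored = []
    for question in candidates:
        p = estimate_successes(model, question, n_rollouts) / n_rollouts
        scored.append((p * (1.0 - p), question))  # Bernoulli variance p(1 - p)

    # 2. Keep the highest-variance questions in a bounded dynamic buffer.
    scored.sort(key=lambda item: item[0], reverse=True)
    buffer.extend(q for score, q in scored if score > 0.0)
    del buffer[:-buffer_size]  # drop the oldest entries beyond capacity

    # 3. Mix buffered high-variance problems with randomly sampled examples.
    n_from_buffer = min(int(batch_size * buffer_fraction), len(buffer))
    batch = random.sample(buffer, n_from_buffer)
    batch += random.sample(dataset, batch_size - n_from_buffer)
    return batch

# Inside the RL loop, the selected batch would feed the policy update:
#   batch = select_training_batch(train_questions, model, buffer=[])
#   policy_gradient_update(model, batch)   # placeholder for the PPO / VinePPO step
```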

The learning-driven selection mechanism greatly improves both the speed and the efficiency of model training. Models trained with this curriculum match the accuracy of conventionally trained models in roughly four times fewer training steps, a marked improvement in convergence rate. Performance improves consistently across datasets, with higher test accuracy on GSM8K and MATH. The structured curriculum also carries over to out-of-distribution tasks, generalizing better to datasets such as CollegeMath and OlympiadBench. Training batch composition is improved by eliminating questions with zero learning signal, making each update more efficient. The approach is also computationally advantageous, since sample generation can be scaled efficiently without redundant model updates. The combination of faster convergence, better generalization, and lower computational overhead makes this adaptive selection process a valuable and efficient tool for reinforcement-learning-based LLM fine-tuning.

This paradigm of targeting questions that offer high-variance learning opportunities effectively addresses the inefficiencies seen in reinforcement-learning-based fine-tuning of language models. Focusing on the problems that produce the most informative training signals maximizes learning efficiency, yielding faster improvement and better adaptability to new samples. Large-scale experiments confirm that the strategy improves training speed, test accuracy, and generalization across multiple datasets. The findings highlight the promise of structured sample selection for refining model training and optimizing computational resources. Future work could investigate the strategy's applicability to other reinforcement learning settings, such as reward model optimization, preference-based fine-tuning, and general decision-making tasks in AI.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.




Tags

Reinforcement Learning · LLM Training · High-Variance Sampling · AI Reasoning