MarkTechPost@AI · February 24
Optimizing Training Data Allocation Between Supervised and Preference Finetuning in Large Language Models

 

This article examines how to optimally split a limited budget between supervised finetuning (SFT) and preference finetuning (PFT) in the post-training stage of large language models (LLMs). The research shows that under a fixed data budget, the allocation between SFT and PFT is critical: a well-balanced training set can outperform one that is 2-5 times larger but poorly allocated. In mathematical tasks in particular, applying preference finetuning directly to a base model runs into a "cold-start problem," which even a small amount of SFT can substantially mitigate. Through experiments spanning multiple tasks, model sizes, and data-annotation costs, the study offers practical guidance for LLM training strategy.

🔥 The study highlights the importance of optimally allocating the training-data budget between SFT and PFT in LLM post-training: a properly balanced dataset can outperform a suboptimally allocated dataset 2-5 times its size.

🥶 In mathematical tasks, applying preference finetuning (PFT) directly to a base model suffers from a "cold-start problem" that degrades performance. Allocating even 10% of the budget to an initial SFT stage effectively mitigates the issue and markedly improves results.

💰 Experiments show that pure SFT excels in low-data regimes, while larger data budgets benefit from a higher proportion of preference data. For example, 5K examples with a 25% SFT allocation match the performance of 20K examples with a 75% SFT allocation on summarization, helpfulness, and grade-school math tasks.

🧪 The methodology is carefully controlled: general-purpose conversational datasets are excluded so that gains from targeted data annotation can be measured precisely. The researchers assume equal labor costs for SFT and PFT annotation and use open-source or synthetic data throughout.

Large Language Models (LLMs) face significant challenges in optimizing their post-training methods, particularly in balancing Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) approaches. While SFT uses direct instruction-response pairs and RL methods like RLHF use preference-based learning, the optimal allocation of limited training resources between these approaches remains unclear. Recent studies have shown that models can achieve task alignment and improved reasoning capabilities without extensive SFT, challenging traditional sequential post-training pipelines. Moreover, the substantial cost of collecting and annotating human data compared to compute costs creates a need to understand the effectiveness of different training methods under fixed data-annotation budgets.

Existing research has explored various trade-offs in language model training under fixed budgets, including comparisons between pretraining versus finetuning and finetuning versus model distillation. Studies have examined the data and compute costs of SFT and RL methods in isolation, along with cost-efficiency considerations in generating human and synthetic data. While some research shows the effects of high-quality preference data on RL methods like Direct Preference Optimization (DPO) and PPO, other studies focus on the relationship between SFT and RL methods regarding model forgetfulness, generalization, and alignment. However, these studies have failed to address optimal resource allocation between SFT and RL-based approaches under strict data-annotation constraints.

Researchers from the Georgia Institute of Technology have proposed a comprehensive study examining the optimal allocation of training data budgets between SFT and Preference Finetuning (PFT) in LLMs. The study investigates this relationship across four diverse tasks, multiple model sizes, and various data annotation costs. It addresses the “cold start problem” in mathematical tasks, where eliminating SFT leads to suboptimal performance due to distribution shifts when applying DPO directly to the base model. Their findings suggest that while larger data budgets benefit from combining both methods, allocating even a small portion of the budget to SFT can significantly improve performance on analytical tasks.
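The cold-start issue is easier to see against the standard DPO objective, which scores the policy by its log-probability ratios to a frozen reference model on chosen versus rejected responses. Below is a minimal per-example sketch; the `beta` value and log-probabilities are illustrative, not taken from the paper.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin
    compares policy-vs-reference log-ratios for the chosen and rejected
    responses. Without SFT, the reference is the raw base model, whose
    response distribution can sit far from the preference data."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization the policy equals the reference, so both ratios are
# zero and the loss starts at log(2) regardless of the responses.
print(round(dpo_loss(-1.0, -2.0, -1.0, -2.0), 4))  # → 0.6931
```

Because the penalty is measured relative to the reference model, a short SFT stage that first moves the reference toward the target response style gives DPO a much better starting point, which matches the mitigation the study reports.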

The study evaluates the cost-effectiveness and optimal resource allocation between SFT and PFT in post-training LLMs under 10 billion parameters. The research methodology measures data budgets through training examples or monetary annotation costs, assuming equal labor costs for both methods and the availability of training prompts. The experimental setup begins with no task-specific labeled data, using open-source datasets or synthetically curated data for each target task. To maintain focus on task-specific improvements, general-purpose conversational datasets commonly used in PFT, such as UltraFeedback and Chatbot Arena preferences, are excluded. This controlled approach allows for precise measurement of the performance improvements resulting from targeted data annotation.

The results reveal that optimal allocation of the training budget between SFT and PFT methods proves crucial, with properly balanced datasets outperforming suboptimally allocated datasets 2-5 times larger in size. Using 5K examples with 25% SFT allocation for tasks like Summarization, Helpfulness, and Grade School Math matches the performance of 20K examples with 75% SFT allocation. The study identifies that pure SFT excels in low-data scenarios, while larger data budgets benefit from higher proportions of preference data. Moreover, direct preference finetuning on base models shows limited success in mathematical tasks, and allocating even a small portion to SFT significantly improves performance by better aligning the reference model’s response style.
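Under the paper's equal-cost assumption, a data budget is simply a count of annotated examples split between the two stages. A tiny sketch of that split (the function name is illustrative, not from the paper):

```python
def allocate_budget(total_examples, sft_fraction):
    """Split a fixed annotation budget between SFT instruction-response
    pairs and PFT preference pairs, assuming equal labeling cost per
    example, as the study does."""
    n_sft = int(total_examples * sft_fraction)
    return n_sft, total_examples - n_sft

# The two configurations the results compare:
print(allocate_budget(5_000, 0.25))   # → (1250, 3750)
print(allocate_budget(20_000, 0.75))  # → (15000, 5000)
```

The striking result is that the first, four-times-smaller configuration matches the second on summarization, helpfulness, and grade-school math.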

In conclusion, this paper provides crucial insights into optimizing LLM post-training under resource constraints, particularly regarding the interplay between SFT and PFT. The study identifies a significant "cold-start problem" when applying PFT directly to base models, which can be mitigated effectively by allocating even 10% of the budget to initial SFT. However, the research acknowledges limitations, including the use of offline methods like DPO and KTO for the RL implementation, and potential biases from using GPT-4 for synthetic data generation and evaluation. Moreover, model size is capped at 10 billion parameters, since running thousands of finetuning runs with larger models such as 70B would be prohibitively compute-intensive.


Check out the Paper. All credit for this research goes to the researchers of this project.


