MarkTechPost@AI · 6 hours ago
Dynamic Fine-Tuning (DFT): Bridging the Generalization Gap in Supervised Fine-Tuning (SFT) for LLMs

The article introduces a new technique called Dynamic Fine-Tuning (DFT), designed to address the limitations of supervised fine-tuning (SFT) in improving the generalization ability of large language models (LLMs). By dynamically adjusting the weight of each token in the objective function, DFT stabilizes gradient updates and markedly improves generalization across a range of benchmarks. The study shows that DFT outperforms standard SFT on tasks such as mathematical reasoning and remains competitive in offline reinforcement learning settings, offering a simpler and more efficient alternative for training LLMs effectively. The method performs strongly on mathematical datasets, but further validation in more domains and on larger models is still needed.

🌟 DFT tackles the weak generalization of SFT models by dynamically rescaling the SFT loss. Through mathematical analysis, the authors show that the standard SFT gradient encodes a flawed reward structure that limits the model's ability to generalize. By dynamically reweighting the objective according to each token's probability, DFT stabilizes gradient updates, improves generalization, and shows consistent advantages across multiple benchmarks and base models.

🚀 In the standard SFT setting, where only expert demonstration data is available and no negative samples or reward model is used, DFT outperforms standard SFT, with notably stronger generalization and robustness on challenging benchmarks. It also shows better learning efficiency and faster convergence, and beats Importance-Weighted SFT (iw-SFT) in most cases.

💡 In the offline reinforcement learning (RL) setting, DFT is equally strong, surpassing all offline and online RL baselines. On mathematical reasoning tasks, its average score clearly exceeds both the best offline method, RFT, and the strong online RL algorithm GRPO, with the advantage most pronounced on harder tasks such as AMC23 and Minerva Math.

🔬 The study, a collaboration between researchers from several well-known universities, validates DFT on mathematical reasoning datasets and on models with up to 7B parameters. The authors acknowledge its limitations: it has so far been tested only on text and has not been thoroughly evaluated in other domains or on larger models. Future work plans to extend DFT to broader benchmarks, larger models, and cross-modal tasks to verify its effectiveness beyond text.

Supervised Fine-Tuning (SFT) is a standard technique for adapting LLMs to new tasks by training them on expert demonstration datasets. It is valued for its simplicity and ability to develop expert-like behavior quickly, but often underperforms in generalization compared to reinforcement learning (RL). RL allows models to explore diverse strategies, which leads to stronger generalization. However, RL demands high computational resources, careful hyperparameter tuning, and access to reward signals, which are not always practical. Although hybrid methods combining SFT and RL exist, the question remains: can SFT itself be fundamentally improved? This is important when datasets lack negative samples or reward models are unavailable.

Existing attempts to address the challenges of SFT and RL have led to a variety of hybrid methods. A common strategy combines an initial SFT phase with subsequent RL refinement, as seen in methods like InstructGPT. Alternative methods like interleaving SFT and RL steps or Direct Preference Optimization (DPO) aim to integrate imitation and reinforcement signals more efficiently. Techniques like Negative-aware Fine-Tuning (NFT) allow models to self-improve by modeling incorrect outputs. Theoretical work has attempted to unify SFT and RL, viewing SFT as a reward-weighted or implicit form of RL. However, these formulations fall short of establishing a precise mathematical equivalence between SFT and offline policy gradients.

A team of researchers from Southeast University, UC Berkeley, Shanghai Jiao Tong University, Nanyang Technological University, and Wuhan University have proposed Dynamic Fine-Tuning (DFT), a method to address the limited generalization of SFT LLMs. Through mathematical analysis, they identify that standard SFT gradients encode a flawed reward structure, limiting the model’s capacity to generalize effectively. DFT addresses this by stabilizing gradient updates through dynamic rescaling of the objective function based on the probability of each token. This modification enhances generalization across multiple benchmarks and base models. Moreover, DFT shows competitive performance in offline RL settings, offering a simpler alternative to traditional RL methods.
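The paper's exact formulation is not reproduced here, but based on the description above, a minimal PyTorch sketch of such a probability-rescaled token loss might look like the following. The function name, signature, and masking details are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dft_style_loss(logits, labels, ignore_index=-100):
    """Hypothetical sketch: per-token cross-entropy rescaled by the (detached)
    probability the model assigns to each target token."""
    vocab_size = logits.size(-1)
    flat_logits = logits.view(-1, vocab_size)          # (B*T, V)
    flat_labels = labels.view(-1)                      # (B*T,)

    # Standard SFT term: token-level negative log-likelihood
    nll = F.cross_entropy(flat_logits, flat_labels,
                          ignore_index=ignore_index, reduction="none")

    # Probability of each target token, detached so it acts as a weight only
    with torch.no_grad():
        probs = torch.softmax(flat_logits, dim=-1)
        safe_labels = flat_labels.clamp(min=0)         # avoid gathering at -100
        p_target = probs.gather(1, safe_labels.unsqueeze(1)).squeeze(1)

    # Rescale each token's loss by its own probability; padding tokens contribute 0
    mask = (flat_labels != ignore_index).float()
    return (p_target * nll * mask).sum() / mask.sum().clamp(min=1.0)
```

In a training loop, a weight like this would simply replace the usual mean cross-entropy over target tokens; everything else about the SFT pipeline stays the same.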

DFT is evaluated in a standard SFT setting, where only expert demonstration data is available, without negative samples, reward models, or verification signals. It is trained using the NuminaMath CoT dataset, which contains 860k mathematical problems and solutions. The dataset covers various sources, including Chinese high school mathematics exercises and U.S. and international mathematical olympiads. In an offline RL setting, DFT is tested using the rejection sampling fine-tuning (RFT) framework. Here, responses are generated for 10k math questions, with correct answers verified and retained, resulting in 140k training examples. Positive-negative preference pairs are also created from the generated responses for DPO training.
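As a rough illustration of that offline data-construction step (not the authors' code), a rejection-sampling pipeline of this kind could be sketched as below; `generate` and `verify` stand in for a sampling LLM and an answer checker and are assumed placeholders.

```python
from dataclasses import dataclass

@dataclass
class SFTExample:
    prompt: str
    response: str

def build_rft_data(questions, generate, verify, n_samples=8):
    """Illustrative rejection-sampling (RFT-style) pipeline: sample several
    responses per question, keep verified-correct ones for fine-tuning, and
    pair correct/incorrect responses for a DPO baseline."""
    sft_examples, dpo_pairs = [], []
    for q in questions:
        graded = [(r, verify(q, r)) for r in (generate(q) for _ in range(n_samples))]
        correct = [r for r, ok in graded if ok]
        wrong = [r for r, ok in graded if not ok]
        # Verified-correct generations become the offline training set
        sft_examples.extend(SFTExample(q, r) for r in correct)
        # One (prompt, chosen, rejected) triple per question when both kinds exist
        if correct and wrong:
            dpo_pairs.append((q, correct[0], wrong[0]))
    return sft_examples, dpo_pairs
```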

In SFT settings, DFT outperforms standard SFT across all evaluated LLMs, and shows superior generalization and robustness on challenging benchmarks where standard SFT yields a minimal or negative impact. It exhibits better learning efficiency and faster convergence characteristics, and outperforms Importance-Weighted SFT (iw-SFT) in most scenarios. In offline RL settings, DFT outperforms both offline and online RL baselines. It scores an average of 35.43, exceeding the best offline method, RFT, by +11.46 points, and outperforms the strongest online RL algorithm, GRPO, by +3.43 points. Moreover, DFT scores 64.71 on Math500, slightly ahead of GRPO, and achieves significant gains on harder tasks like AMC23 (+7.19 over GRPO) and Minerva Math (+6.23 over GRPO).

In this work, researchers address the generalization gap between SFT and RL. They introduce Dynamic Fine-Tuning (DFT), a simple yet powerful method that dynamically reweights the SFT loss using token probabilities. This one-line modification stabilizes learning and enhances generalization, as evidenced by performance gains across mathematical reasoning benchmarks. However, evaluations of DFT are limited to math-focused datasets and models up to 7B parameters, with no testing on other domains or larger models. Moreover, this research is limited to text-only scenarios. Future work aims to extend DFT to broader benchmarks, larger models, and vision-language tasks to validate its cross-modal effectiveness.


