cs.AI updates on arXiv.org · 5 hours ago
GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

This paper proposes a dynamic entropy-weighting method that improves the Group Relative Policy Optimization algorithm to strengthen large language model reasoning. It addresses the coarse-grained credit assignment problem in long-chain reasoning tasks, and experiments show the method significantly outperforms the DAPO baseline.

arXiv:2508.04349v1 Announce Type: cross Abstract: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by coarse-grained credit assignment that applies a uniform reward to every token in a sequence. This is a major flaw in long-chain reasoning tasks. This paper addresses the problem with \textbf{Dynamic Entropy Weighting}. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This lets us create fine-grained reward signals for precise policy updates in two ways: 1) \textbf{Group Token Policy Optimization} (\textbf{GTPO}) assigns an entropy-weighted reward to each token for fine-grained credit assignment. 2) \textbf{Sequence-Level Group Relative Policy Optimization} (\textbf{GRPO-S}) assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that the entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
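The abstract stays at the level of ideas, so here is a minimal PyTorch sketch of what the two reward-shaping variants could look like. Everything below is an illustrative assumption rather than the paper's implementation: the function `entropy_weighted_rewards`, its normalization choices, and placing the shaping before GRPO's group-relative baseline are all guesses reconstructed from the abstract alone.

```python
import torch

def entropy_weighted_rewards(token_logits, seq_rewards, mode="gtpo"):
    """Hypothetical sketch of entropy-weighted reward shaping.

    token_logits: [batch, seq_len, vocab] logits of the sampled responses
    seq_rewards:  [batch] scalar outcome reward per sequence (e.g. 0/1 correctness)
    """
    # Per-token policy entropy H_t = -sum_v p(v) log p(v)
    log_probs = torch.log_softmax(token_logits, dim=-1)
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)           # [batch, seq_len]

    if mode == "gtpo":
        # Token-level (GTPO-style): split the sequence reward across tokens in
        # proportion to their entropy, so high-entropy tokens in correct
        # responses receive larger credit.
        weights = token_entropy / (token_entropy.sum(dim=-1, keepdim=True) + 1e-8)
        return seq_rewards.unsqueeze(-1) * weights             # [batch, seq_len]
    else:
        # Sequence-level (GRPO-S-style): scale the whole sequence reward by its
        # mean token entropy, then broadcast uniformly as in vanilla GRPO.
        mean_entropy = token_entropy.mean(dim=-1)              # [batch]
        shaped = seq_rewards * mean_entropy
        return shaped.unsqueeze(-1).expand_as(token_entropy)   # [batch, seq_len]

if __name__ == "__main__":
    logits = torch.randn(4, 16, 32000)         # 4 sampled responses, 16 tokens each
    rewards = torch.tensor([1., 0., 1., 0.])   # outcome reward per response
    seq_r = entropy_weighted_rewards(logits, rewards, mode="grpo-s")[:, 0]
    # GRPO-style group-relative advantage: normalize within the sampled group.
    adv = (seq_r - seq_r.mean()) / (seq_r.std() + 1e-8)
```

The key design point the abstract implies is that shaping happens before the group-relative baseline: entropy modulates how much reward each token or sequence receives, and GRPO's usual mean/std normalization over the group is then applied to the shaped values.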

Tags: Dynamic Entropy Weighting · LLM Reasoning · Group Relative Policy Optimization · Long-Chain Reasoning · Credit Assignment