arXiv:2508.12604v1 Announce Type: cross Abstract: Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we empirically reveal that incorrect answers partially stem from verbose reasoning processes that lack correct self-correction, where errors accumulate across multiple reasoning steps. To this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process-supervision framework that enables fine-grained optimization of each reasoning step. Specifically, SSPO requires neither auxiliary models nor step-wise manual annotations. Instead, it leverages step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression. Experiments demonstrate that the reasoning sequences generated by SSPO are both accurate and succinct, effectively mitigating overthinking behaviors without compromising model performance across diverse domains and languages.
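The abstract does not spell out the training objective, but the idea of optimizing each reasoning step with self-generated preference signals can be illustrated with a minimal sketch. The example below applies a DPO-style preference loss per reasoning step rather than per response; the function name, argument layout, and the assumption of a frozen reference model are illustrative guesses, not the authors' actual SSPO implementation.

```python
# Minimal sketch: a step-wise preference loss in the DPO style, assuming that
# preferred vs. dispreferred (e.g. verbose) reasoning steps are identified by
# the policy itself, with no auxiliary reward model. Names are hypothetical.
import torch
import torch.nn.functional as F


def stepwise_preference_loss(
    logp_chosen_steps: torch.Tensor,       # (num_steps,) policy log-probs of preferred steps
    logp_rejected_steps: torch.Tensor,     # (num_steps,) policy log-probs of dispreferred steps
    ref_logp_chosen_steps: torch.Tensor,   # (num_steps,) reference-model log-probs of preferred steps
    ref_logp_rejected_steps: torch.Tensor, # (num_steps,) reference-model log-probs of dispreferred steps
    beta: float = 0.1,
) -> torch.Tensor:
    """Average a DPO-style preference objective over individual reasoning steps
    instead of whole responses, so every step receives its own learning signal."""
    chosen_ratio = logp_chosen_steps - ref_logp_chosen_steps
    rejected_ratio = logp_rejected_steps - ref_logp_rejected_steps
    # Push the policy to rank concise, correct steps above verbose alternatives.
    losses = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))
    return losses.mean()


if __name__ == "__main__":
    # Toy example with three reasoning steps; in practice each log-prob would be
    # the sum of token log-probabilities within one self-traced step.
    torch.manual_seed(0)
    lc, lr = torch.randn(3), torch.randn(3)
    rc, rr = torch.randn(3), torch.randn(3)
    print(stepwise_preference_loss(lc, lr, rc, rr).item())
```

Averaging the loss over steps (rather than scoring only the final answer) is one way to deliver the fine-grained, per-step supervision the abstract describes, and using the policy's own scores as preference signals avoids both auxiliary models and manual step annotations.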