MarkTechPost@AI · February 23
TokenSkip: Optimizing Chain-of-Thought Reasoning in LLMs Through Controllable Token Compression

TokenSkip is a new approach to optimizing Chain-of-Thought (CoT) processing in large language models (LLMs). Through controllable token compression, it significantly reduces computational overhead while preserving reasoning accuracy. The method builds on the observation that different reasoning tokens contribute unequally to the final answer, selectively keeping critical tokens and skipping less important ones. Compressed CoT training data is constructed via token pruning and then used for supervised fine-tuning. Experiments show that TokenSkip performs well across multiple models, substantially reducing computational cost while maintaining reasoning ability.

🔑 At the core of TokenSkip is a controllable compression mechanism that selectively retains critical tokens and skips less important ones based on token importance, thereby optimizing CoT processing in LLMs.

⚙️ TokenSkip consists of two main phases: training data preparation and inference. In the training phase, the system generates CoT trajectories with the target LLM and compresses them via token pruning, a process guided by an "importance scoring" mechanism.

📊 Experiments show that larger language models maintain performance better at higher compression rates. For example, Qwen2.5-14B-Instruct loses only 0.4% in performance while cutting token usage by 40%.

🚀 TokenSkip outperforms alternatives such as prompt-based reduction and truncation. On the MATH-500 dataset, it reduces token usage by 30% with less than a 4% drop in performance.

Large Language Models (LLMs) still struggle with complex reasoning tasks, despite the breakthrough advances achieved through Chain-of-Thought (CoT) prompting. The primary difficulty is the computational overhead introduced by longer CoT sequences, which directly impacts inference latency and memory requirements. Because LLM decoding is autoregressive, longer CoT sequences increase processing time proportionally, while costs in the attention layers scale quadratically with sequence length. Balancing reasoning accuracy against computational efficiency has therefore become a critical challenge, as attempts to shorten the reasoning process often compromise the model’s problem-solving capabilities.

Various methodologies have been developed to address the computational challenges of Chain-of-Thought (CoT) reasoning. Some approaches streamline the reasoning process by simplifying or skipping certain thinking steps, while others attempt to generate steps in parallel. A different strategy compresses reasoning steps into continuous latent representations, enabling LLMs to reason without generating explicit word tokens. Moreover, prompt compression techniques for handling complex instructions and long-context inputs more efficiently range from using lightweight language models to generate concise prompts, to employing implicit continuous tokens for task representation, to direct compression that filters for highly informative tokens.

Researchers from The Hong Kong Polytechnic University and the University of Science and Technology of China have proposed TokenSkip, an approach to optimize CoT processing in LLMs. It enables models to skip less important tokens within CoT sequences while maintaining connections between critical reasoning tokens, with adjustable compression ratios. The system works by first constructing compressed CoT training data through token pruning, followed by supervised fine-tuning. Initial testing across multiple models, including LLaMA-3.1-8B-Instruct and the Qwen2.5-Instruct series, shows promising results, particularly in maintaining reasoning capabilities while significantly reducing computational overhead.
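A minimal sketch of this two-stage recipe is shown below, in Python-style pseudocode. The helper names (`generate_cot`, `prune_by_importance`, `supervised_finetune`) and the filtering of incorrect trajectories are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of the TokenSkip training recipe (not the authors' code).
import random

def build_training_data(problems, generate_cot, prune_by_importance,
                        ratios=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Stage 1: have the target LLM write CoT trajectories, keep only those with
    correct answers (an assumption), and compress each kept trajectory to a
    randomly chosen target ratio."""
    examples = []
    for problem in problems:
        cot_tokens, answer, is_correct = generate_cot(problem)  # rollout from the target LLM
        if not is_correct:
            continue                                            # discard incorrect trajectories
        gamma = random.choice(ratios)                           # target compression ratio
        compressed_cot = prune_by_importance(cot_tokens, problem, gamma)
        examples.append((problem, gamma, compressed_cot, answer))
    return examples

def finetune_on_compressed_cot(target_llm, examples, supervised_finetune):
    """Stage 2: supervised fine-tuning of the same LLM to map
    (question, compression ratio) -> compressed CoT + final answer."""
    return supervised_finetune(target_llm, examples)
```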

TokenSkip’s architecture is built on the fundamental principle that different reasoning tokens contribute varying levels of importance to reaching the final answer. It contains two main phases: training data preparation and inference. In the training phase, the system generates CoT trajectories using the target LLM, and each remaining trajectory is pruned at a randomly selected compression ratio. The token pruning process is guided by an “importance scoring” mechanism. During inference, TokenSkip keeps the standard autoregressive decoding approach but improves efficiency by letting the LLM skip less important tokens. In the input format, the question and the compression ratio are separated by end-of-sequence tokens.
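The pruning step assumed above can be sketched as follows. The `score_importance` callable stands in for whatever importance-scoring model is used, and the `<eos>`-delimited prompt in the trailing comment is a schematic rendering of the input format, not the exact template.

```python
# Illustrative token-pruning step (the importance scorer is a placeholder,
# not the authors' exact scoring model).

def prune_by_importance(cot_tokens, problem, gamma, score_importance):
    """Keep the top gamma-fraction of CoT tokens ranked by importance,
    preserving their original order so the pruned chain still reads left to right."""
    keep = max(1, int(gamma * len(cot_tokens)))
    scores = score_importance(cot_tokens, problem)   # one importance score per token
    top_by_score = sorted(range(len(cot_tokens)),
                          key=lambda i: scores[i], reverse=True)[:keep]
    kept_indices = sorted(top_by_score)              # restore original token order
    return [cot_tokens[i] for i in kept_indices]

# At inference time, the fine-tuned model is prompted with the question and the
# desired compression ratio separated by end-of-sequence tokens, e.g. (schematic):
#   "<question> <eos> 0.6 <eos>"
# and it decodes a compressed CoT followed by the final answer.
```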

The results show that larger language models are more adept at maintaining performance while achieving higher compression rates. The Qwen2.5-14B-Instruct model achieves remarkable results with only a 0.4% performance drop while reducing token usage by 40%. TokenSkip shows superior performance when compared with alternative approaches like prompt-based reduction and truncation. While prompt-based reduction fails to achieve target compression ratios and truncation leads to significant performance degradation, TokenSkip maintains the specified compression ratio while preserving reasoning capabilities. On the MATH-500 dataset, it achieves a 30% reduction in token usage with less than a 4% performance drop.

In this paper, the researchers introduced TokenSkip, which represents a significant advancement in optimizing CoT processing for LLMs through a controllable compression mechanism based on token importance. The method’s success lies in maintaining reasoning accuracy while significantly reducing computational overhead by selectively preserving critical tokens and skipping less important ones. The approach has proven effective across LLMs, showing minimal performance degradation even at substantial compression ratios. This research opens new possibilities for efficient reasoning in LLMs, establishing a foundation for future work that improves computational efficiency while maintaining robust reasoning capabilities.


Check out the Paper. All credit for this research goes to the researchers of this project.

