MarkTechPost@AI, November 16, 2024
Apple Researchers Propose Cut Cross-Entropy (CCE): A Machine Learning Method that Computes the Cross-Entropy Loss without Materializing the Logits for all Tokens into Global Memory

 

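Apple researchers propose the Cut Cross-Entropy (CCE) method to address the excessive memory requirements of training large language models. By computing only the necessary logits on the fly and doing that work in on-chip memory, the method significantly reduces memory usage and improves model scalability and training efficiency.

💻 CCE computes only the necessary logits on the fly, reducing memory usage

🚀 Supports larger batch sizes and makes better use of compute resources

🎯 Custom CUDA kernels and gradient filtering keep training speed unaffected

✨ Applies to a variety of architectures and scenarios, with broad potential for use

🌐 Makes training larger models possible and encourages architectural development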

Advancements in large language models (LLMs) have revolutionized natural language processing, with applications spanning text generation, translation, and summarization. These models rely on large amounts of data, large parameter counts, and expansive vocabularies, necessitating sophisticated techniques to manage computational and memory requirements. A critical component of LLM training is the cross-entropy loss computation, which, while central to model accuracy, presents significant memory challenges due to the size and complexity of the vocabulary.

The memory requirements of the cross-entropy loss layer constrain the training of large language models, especially as vocabulary sizes reach hundreds of thousands of tokens. The issue becomes acute in models like Gemma 2 (2B), where the cross-entropy loss computation alone can consume up to 24 GB of memory, accounting for up to 90% of the memory footprint during training. These limitations restrict batch sizes and force trade-offs between model performance and computational feasibility, posing a significant bottleneck for scalability.
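To see why the logit matrix dominates, a back-of-the-envelope calculation helps. The sketch below sizes a single dense logit matrix for a batch of 8,192 tokens and a vocabulary of 256,000 entries (the configuration this article cites for its measurements). The 4-byte float32 element size and the helper name `logit_matrix_bytes` are assumptions for illustration. The result covers one copy of the matrix only, so it is not meant to reproduce the 24 GB figure, which presumably also counts gradients and intermediate buffers.

```python
def logit_matrix_bytes(num_tokens: int, vocab_size: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one dense (num_tokens x vocab_size) logit matrix.

    bytes_per_value=4 assumes float32 storage (an illustrative assumption).
    """
    return num_tokens * vocab_size * bytes_per_value


size = logit_matrix_bytes(num_tokens=8192, vocab_size=256_000)
print(f"{size / 1e9:.2f} GB")  # prints "8.39 GB"
```

Because the size is the product of token count and vocabulary size, doubling either one doubles the memory needed for every copy of the matrix that training keeps around.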

Previous methods aimed at reducing memory usage, such as FlashAttention and hierarchical vocabularies, have addressed specific components like self-attention but fall short in alleviating the burden of the cross-entropy layer. Chunking methods reduce memory requirements but introduce latency trade-offs, limiting their practical use. These approaches also fail to fully exploit the sparsity of gradients or to leverage hardware optimizations, leaving room for improvement.

Researchers at Apple introduced the Cut Cross-Entropy (CCE) method, a novel approach designed to overcome the memory challenges associated with large vocabulary models. Unlike conventional methods that compute and store all logits for tokens in memory, CCE dynamically calculates only the necessary logits and performs log-sum-exp reductions in on-chip memory. This technique eliminates the need to materialize large matrices in GPU memory, significantly reducing the memory footprint. For instance, in the Gemma 2 model, the memory usage for loss computation dropped from 24 GB to just 1 MB, with total classifier head memory consumption reduced from 28 GB to 1 GB.
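The idea of computing the loss without storing all logits can be illustrated in plain Python. The per-token cross-entropy is the log-sum-exp over all logits minus the logit of the correct token, and the log-sum-exp can be accumulated one vocabulary block at a time. The sketch below is not Apple's CUDA implementation: the function names, the block size, and the toy data are assumptions chosen for illustration, and it handles a single token rather than a batch.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_loss(embedding, classifier, target):
    """Reference version: build every logit for the token, then take the loss."""
    logits = [dot(embedding, row) for row in classifier]
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]


def blockwise_loss(embedding, classifier, target, block_size=2):
    """Same loss, but only `block_size` logits exist at any one time."""
    running_max = -math.inf
    running_sum = 0.0  # sum of exp(logit - running_max) over blocks seen so far
    for start in range(0, len(classifier), block_size):
        block = [dot(embedding, row) for row in classifier[start:start + block_size]]
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new maximum, then fold in this block.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(z - new_max) for z in block)
        running_max = new_max
    # The correct token's logit is computed directly from its classifier row.
    target_logit = dot(embedding, classifier[target])
    return running_max + math.log(running_sum) - target_logit


# Toy data (hypothetical): one 3-dimensional embedding, a 5-token vocabulary.
embedding = [0.5, -1.0, 2.0]
classifier = [
    [0.1, 0.2, 0.3],
    [1.0, -0.5, 0.0],
    [-0.3, 0.8, 0.5],
    [0.0, 0.0, 1.0],
    [2.0, 1.0, -1.0],
]
print(naive_loss(embedding, classifier, target=3))
print(blockwise_loss(embedding, classifier, target=3))
```

Both functions return the same value up to floating-point rounding. The blockwise version only ever holds one block of logits plus two running scalars, which is what lets a GPU kernel keep the working set in on-chip memory.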

The core of CCE lies in its efficient computation strategy, which employs custom CUDA kernels to process embeddings and perform reductions. By calculating logits on the fly and avoiding intermediate memory storage, the method capitalizes on shared GPU memory, which is faster and more efficient than traditional global memory usage. Also, gradient filtering selectively skips computations that contribute negligibly to the gradient, leveraging the inherent sparsity of the softmax matrix. Vocabulary sorting optimizes processing by grouping tokens with significant contributions, minimizing wasted computation. Together, these innovations enable a memory-efficient, low-latency loss computation mechanism.
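Gradient filtering can be sketched the same way. The gradient of the cross-entropy loss with respect to the logits is the softmax output minus a one-hot vector for the correct token, and most softmax entries are tiny. In the sketch below, entries under a threshold are treated as exactly zero. The threshold value of 2^-12, the function names, and the toy logits are illustrative assumptions, and the filtering is done per entry for clarity, whereas a GPU kernel would more plausibly skip whole blocks of negligible entries.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filtered_gradient(logits, target, threshold=2.0 ** -12):
    """Gradient of cross-entropy w.r.t. the logits (softmax - one_hot).

    Softmax entries below `threshold` (an assumed cutoff) are treated as zero,
    so the work that depends on them could be skipped.
    """
    probs = softmax(logits)
    grad = []
    skipped = 0
    for j, p in enumerate(probs):
        if p < threshold and j != target:
            grad.append(0.0)
            skipped += 1
        else:
            grad.append(p - (1.0 if j == target else 0.0))
    return grad, skipped


# Toy logits (hypothetical): two likely tokens, six very unlikely ones.
logits = [12.0, 11.0, 0.5, 0.0, -1.0, -2.0, -3.0, -4.0]
grad, skipped = filtered_gradient(logits, target=0)
print(f"skipped {skipped} of {len(logits)} entries")  # prints "skipped 6 of 8 entries"
```

Each skipped entry changes the gradient by less than the threshold, which is why filtering can save computation without visibly changing training. Vocabulary sorting, as described above, would group the non-negligible tokens together so that more whole blocks fall under the cutoff.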

The performance gains from CCE are substantial. Memory reductions enabled a 10-fold increase in batch size for smaller models like GPT-2 and a 1.5-fold increase for larger models like Llama 2 (13B). Training throughput remained unaffected, and experimental results demonstrated stable convergence, matching the performance of traditional methods. For a batch of 8,192 tokens with a vocabulary size of 256,000, CCE achieved a peak memory usage of just 1 MB for the loss computation, compared to 24 GB in baseline methods. Training stability tests on models such as Llama 3 (8B) and Phi 3.5 Mini confirmed the reliability of CCE, with loss curves indistinguishable from those of existing approaches.

This research highlights several key takeaways:

In conclusion, the CCE method represents a significant breakthrough in training large language models by addressing the critical bottleneck of memory-intensive cross-entropy loss layers. Through innovative techniques like dynamic logit computation, gradient filtering, and vocabulary sorting, CCE enables dramatic reductions in memory usage without sacrificing speed or accuracy. This advancement not only enhances the efficiency of current models but also paves the way for more scalable and balanced architectures in the future, opening new possibilities for large-scale machine learning.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


The post Apple Researchers Propose Cut Cross-Entropy (CCE): A Machine Learning Method that Computes the Cross-Entropy Loss without Materializing the Logits for all Tokens into Global Memory appeared first on MarkTechPost.
