MarkTechPost@AI · July 2, 2024
Adam-mini: A Memory-Efficient Optimizer Revolutionizing Large Language Model Training with Reduced Memory Usage and Enhanced Performance

Adam-mini is a memory-efficient optimizer for large language model (LLM) training. It exploits the Hessian structure of the model to partition parameters into blocks and assigns each block a single high-quality learning rate, substantially reducing memory usage. Compared with Adam, Adam-mini cuts memory usage by 45% to 50% while matching or even improving performance. The method performs well across a range of language-model workloads, including pre-training, supervised fine-tuning, and reinforcement learning from human feedback.

🤔 Adam-mini is a memory-efficient optimizer for large language model (LLM) training. By exploiting the model's Hessian structure to partition parameters into blocks and assigning each block one high-quality learning rate, it significantly reduces memory usage: 45% to 50% less than Adam, while matching or even exceeding Adam's performance.

🚀 Adam-mini leverages the Hessian structure of Transformer models to partition parameters into blocks such as the Query, Key, Value, and multi-layer perceptron (MLP) layers. For each block it computes one effective learning rate, taken as the average of Adam's second-order momentum values within that block. This shrinks the memory footprint and simplifies learning-rate assignment.

📈 Adam-mini performs well across a range of language models and training stages, including pre-training, supervised fine-tuning, and reinforcement learning from human feedback. On supervised fine-tuning and reinforcement learning tasks, Adam-mini consistently outperforms AdamW, achieving higher evaluation scores and faster convergence.

💡 By reducing the memory footprint and increasing throughput, Adam-mini gives researchers a more efficient tool, making large-model training more accessible and encouraging broader participation even from researchers with limited GPU resources.

📊 When pre-training Llama2-7B, Adam-mini reached a throughput of 5572.19 tokens per second versus 3725.59 tokens per second for AdamW, a 49.6% improvement, which translates into a 33% reduction in wall-clock time for processing the same number of tokens.

This line of research focuses on optimization algorithms for training large language models (LLMs), which are essential for understanding and generating human language. Such models underpin a wide range of applications in natural language processing and artificial intelligence. Because training LLMs requires significant computational resources and memory, making these processes more efficient is a high-priority area for researchers.

The primary problem addressed by this paper is the high memory demand of the optimization algorithms used to train large language models. Specifically, the Adam optimizer, a de facto standard in the field due to its strong performance, must store optimizer states such as first-order and second-order momentum estimates. These states alone require roughly twice as much memory as the model weights themselves, creating a significant burden. As a result, training large models becomes expensive and less accessible to researchers with limited resources. Alternative methods such as Adafactor attempt to reduce memory usage but often compromise performance, highlighting the need for more efficient solutions.

The Adam optimizer is widely used for training LLMs because it handles a variety of model sizes and tasks effectively. However, its need for extensive memory to store optimizer states, particularly the first-order and second-order momentum estimates, poses a considerable challenge. For instance, training a 7-billion-parameter model with Adam requires about 56 GB per card for these states alone, and roughly 86 GB once gradients are included. This makes training prohibitively expensive even on advanced GPUs such as the A100-80GB. Techniques like CPU offloading and sharding are employed to manage the memory requirement, but they add latency and slow down training.
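As a rough illustration of where these figures come from, here is a back-of-the-envelope estimate assuming 32-bit (4-byte) optimizer states and gradients; the exact numbers depend on precision, parameter count, and framework overhead, so treat this as a sketch rather than an exact accounting.

```python
# Rough memory estimate for Adam on a 7B-parameter model (fp32 states assumed).
params = 7e9                 # number of model parameters
bytes_per_value = 4          # 32-bit floats

m_state = params * bytes_per_value   # first-order momentum (m)
v_state = params * bytes_per_value   # second-order momentum (v)
grads = params * bytes_per_value     # gradients

gb = 1e9
print(f"Adam states (m + v): {(m_state + v_state) / gb:.0f} GB")          # ~56 GB
print(f"States + gradients:  {(m_state + v_state + grads) / gb:.0f} GB")  # ~84 GB

# Adam-mini keeps m but replaces the per-parameter v with one scalar per block,
# which is why it can cut optimizer-state memory by roughly 45-50%.
```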

Researchers from The Chinese University of Hong Kong, Shenzhen, Shenzhen Research Institute of Big Data, Duke University, and Stanford University introduced Adam-mini, an optimizer designed to achieve similar or better performance than Adam while reducing memory usage by 45% to 50%. Adam-mini accomplishes this by partitioning model parameters into blocks based on the Hessian structure of transformers. Each block is then assigned a single high-quality learning rate, significantly reducing the number of learning rates from billions to a manageable number. This approach allows Adam-mini to maintain or even improve performance with a fraction of the memory required by Adam.

Adam-mini works by leveraging the near-block diagonal structure of transformers’ Hessians, partitioning parameters into blocks such as Query, Key, Value, and MLP layers. For each block, a single effective learning rate is calculated using the average of Adam’s second-order momentum values in that block. This method reduces the memory footprint and simplifies the learning rate assignment process. For example, during the pre-training of Llama2-7B on two A800-80GB GPUs, Adam-mini achieved a throughput of 5572.19 tokens per second, compared to 3725.59 tokens per second with AdamW, representing a 49.6% increase. This efficiency results in a 33% reduction in wall-clock time for processing the same number of tokens.
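To make the block-wise update concrete, below is a minimal sketch in PyTorch of the idea described above. It is not the authors' released implementation: the block list, state layout, and function names (`adam_mini_step`, `init_state`) are illustrative assumptions, and the actual partitioning follows the transformer Hessian structure detailed in the paper.

```python
import torch

def init_state(p):
    """Illustrative per-block state: per-parameter m, but a single scalar v."""
    return {"m": torch.zeros_like(p), "v": 0.0, "step": 0}

def adam_mini_step(blocks, states, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update step over a list of parameter blocks (e.g. Query, Key, MLP weights)."""
    for p, s in zip(blocks, states):
        if p.grad is None:
            continue
        g = p.grad
        s["step"] += 1

        # Per-parameter first moment, exactly as in Adam.
        s["m"].mul_(beta1).add_(g, alpha=1 - beta1)

        # One second-moment scalar per block: the mean squared gradient stands in
        # for the per-parameter v that Adam would otherwise store.
        s["v"] = beta2 * s["v"] + (1 - beta2) * g.pow(2).mean().item()

        # Bias corrections, as in Adam.
        m_hat = s["m"] / (1 - beta1 ** s["step"])
        v_hat = s["v"] / (1 - beta2 ** s["step"])

        # A single effective learning rate for the whole block.
        p.data.add_(m_hat, alpha=-lr / (v_hat ** 0.5 + eps))

# Toy usage on a two-block "model".
q_proj = torch.nn.Parameter(torch.randn(16, 16))
mlp_w = torch.nn.Parameter(torch.randn(16, 64))
blocks = [q_proj, mlp_w]
states = [init_state(p) for p in blocks]

loss = q_proj.sum() ** 2 + mlp_w.sum() ** 2
loss.backward()
adam_mini_step(blocks, states)
```

The memory saving comes from storing a single scalar `v` per block instead of a tensor the size of the parameters, which is where the reported 45% to 50% reduction in optimizer-state memory originates. For the throughput figures above, a 49.6% increase corresponds to roughly 1 - 1/1.496, or about 33%, less wall-clock time for a fixed token budget.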

The researchers validated Adam-mini's performance across language models ranging from 125 million to 7 billion parameters, covering pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). The optimizer demonstrated performance on par with or superior to AdamW, with notable improvements in memory efficiency and training speed. For instance, in supervised fine-tuning and reinforcement learning tasks, Adam-mini consistently outperformed AdamW, achieving higher evaluation scores and faster convergence.

In conclusion, the Adam-mini optimizer addresses the significant memory inefficiencies of traditional optimization methods like Adam by introducing a novel partitioning strategy based on the Hessian structure of models. This innovative approach results in substantial memory savings and improved training efficiency, making it a valuable tool for researchers working with large-scale language models. By reducing the memory footprint by up to 50% and increasing throughput by nearly 50%, Adam-mini not only enhances the feasibility of training large models but also encourages broader participation from researchers with limited GPU resources.


Check out the Paper. All credit for this research goes to the researchers of this project.

