MarkTechPost@AI May 17, 14:35
This AI paper from DeepSeek-AI Explores How DeepSeek-V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational Efficiency

DeepSeek-V3 achieves a high-performance large language model on limited hardware through architectural innovation. The model uses Multi-head Latent Attention (MLA) to optimize memory, a Mixture of Experts (MoE) framework to improve computational efficiency, and FP8 mixed-precision training to speed up training. In addition, a custom multi-plane network topology reduces inter-device communication overhead. DeepSeek-V3 reaches state-of-the-art performance on 2,048 NVIDIA H800 GPUs while remaining cost-effective, competing with much larger systems on leaner resources through intelligent design and pointing the way toward broader adoption of large language models.

🧠**MLA compression shrinks the KV cache:** Using Multi-head Latent Attention (MLA), DeepSeek-V3 reduces the KV cache to 70 KB per token, a major memory saving during inference compared with 327 KB for Qwen-2.5 and 516 KB for LLaMA-3.1.

🧮**MoE framework boosts compute efficiency:** DeepSeek-V3 uses a Mixture of Experts (MoE) model with 671 billion total parameters but only 37 billion activated per token, greatly reducing compute and memory relative to dense models that activate every parameter: LLaMA-3.1 needs 2,448 GFLOPS per token, versus just 250 GFLOPS for DeepSeek-V3.

🚀**Multi-token prediction speeds up generation:** DeepSeek-V3 integrates a Multi-Token Prediction (MTP) module that generates several tokens per step, improving generation speed by 1.8x with an 80-90% token acceptance rate and raising inference throughput.

🌐**FP8 mixed precision accelerates training:** DeepSeek-V3 trains in FP8 mixed precision, gaining speed while keeping accuracy loss below 0.25%. The approach relies on block-wise quantization, was validated on smaller models, and was then applied to the 671-billion-parameter model.
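The block-wise quantization idea can be sketched in a few lines. The snippet below is only an illustrative stand-in, not DeepSeek's training kernel: it assigns one scale per 128×128 weight block so each block maps into the FP8 E4M3 range, and uses rounding of the scaled values as a placeholder for the actual FP8 cast (activations in the paper use finer 1×128 tiles with the same per-block scaling pattern).

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def blockwise_quantize(w: np.ndarray, block: int = 128):
    """Assign one scale per (block x block) tile of a 2-D weight matrix.

    Returns scaled-and-rounded values plus per-block scales; multiplying a
    block by its scale recovers an approximation of the original weights.
    The integer rounding here is only a stand-in for the FP8 cast that a
    real kernel would perform.
    """
    rows, cols = w.shape
    n_r, n_c = -(-rows // block), -(-cols // block)  # ceil division
    scales = np.zeros((n_r, n_c), dtype=np.float32)
    q = np.zeros_like(w, dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            scale = np.abs(tile).max() / FP8_E4M3_MAX + 1e-12
            scales[i // block, j // block] = scale
            q[i:i + block, j:j + block] = np.round(tile / scale)
    return q, scales

w = np.random.randn(256, 256).astype(np.float32)
q, s = blockwise_quantize(w)
# Dequantize and check the block-wise reconstruction error.
recon = q * np.repeat(np.repeat(s, 128, axis=0), 128, axis=1)[:256, :256]
print(np.abs(recon - w).max())
```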

The growth in developing and deploying large language models (LLMs) is closely tied to architectural innovations, large-scale datasets, and hardware improvements. Models like DeepSeek-V3, GPT-4o, Claude 3.5 Sonnet, and LLaMA-3 have demonstrated how scaling enhances reasoning and dialogue capabilities. However, as their performance increases, so do computing, memory, and communication bandwidth demands, placing substantial strain on hardware. Without parallel progress in model and infrastructure co-design, these models risk becoming accessible only to organizations with massive resources. This makes optimizing training cost, inference speed, and memory efficiency a critical area of research.

A core challenge is the mismatch between model size and hardware capabilities. LLM memory consumption grows over 1000% annually, while high-speed memory bandwidth increases by less than 50%. During inference, caching prior context in Key-Value (KV) stores adds to memory strain and slows processing. Dense models activate all parameters per token, escalating computational costs, particularly for models with hundreds of billions of parameters. This results in billions of floating-point operations per token and high energy demands. Time Per Output Token (TPOT), a key performance metric, also suffers, impacting user experience. These problems call for solutions beyond simply adding more hardware.
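To make the memory pressure concrete, the sketch below computes the per-token KV cache of a conventional dense attention stack. The layer and head counts are the commonly published LLaMA-3.1-405B configuration and are used here purely as an illustration; they reproduce the roughly 516 KB per-token figure cited later in the article.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Per-token KV cache for a standard attention layer stack.

    Each layer stores one key and one value vector per KV head, so the cost
    is 2 * n_layers * n_kv_heads * head_dim * bytes_per_value.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# LLaMA-3.1-405B-style configuration (126 layers, 8 KV heads via GQA,
# head dimension 128, BF16 cache) -- illustrative numbers.
per_token = kv_cache_bytes_per_token(n_layers=126, n_kv_heads=8, head_dim=128)
print(per_token / 1e3)            # ~516 KB per token
print(128_000 * per_token / 1e9)  # ~66 GB for a single 128K-token context
```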

Techniques like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage by sharing key and value heads across query heads. Windowed KV caching lowers memory use by storing only recent tokens, but can limit long-context understanding. Quantized compression with low-bit formats like 4-bit and 8-bit cuts memory further, though sometimes with trade-offs in accuracy. Precision formats such as BF16 and FP8 improve training speed and efficiency. While useful, these techniques often address individual bottlenecks rather than offering a comprehensive answer to the scaling challenge.
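As a rough illustration of how head sharing saves memory, here is a toy single-token grouped-query attention step in NumPy. This is a generic GQA sketch, not DeepSeek's MLA: queries keep all their heads, but only a smaller set of KV heads is ever cached.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_q_heads, n_kv_heads):
    """Toy single-token GQA step: each group of query heads shares one KV head.

    q: (n_q_heads, d)         query for the current token
    k, v: (n_kv_heads, t, d)  cached keys/values -- only n_kv_heads are stored,
                              which is where the memory saving comes from.
    """
    group = n_q_heads // n_kv_heads
    d = q.shape[-1]
    out = np.zeros_like(q)
    for h in range(n_q_heads):
        kv = h // group                        # query head h reads shared KV head kv
        scores = q[h] @ k[kv].T / np.sqrt(d)   # (t,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()               # softmax over cached positions
        out[h] = weights @ v[kv]               # (d,)
    return out

# 32 query heads sharing 8 KV heads: the KV cache is 4x smaller than full MHA.
q = np.random.randn(32, 64)
k = np.random.randn(8, 100, 64)
v = np.random.randn(8, 100, 64)
print(grouped_query_attention(q, k, v, 32, 8).shape)  # (32, 64)
```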

Researchers from DeepSeek-AI introduced a more integrated and efficient strategy with the development of DeepSeek-V3, designed to scale intelligently rather than excessively. Utilizing 2,048 NVIDIA H800 GPUs, the model achieves state-of-the-art performance while focusing on cost-efficiency. Instead of depending on expansive infrastructure, the team engineered the model architecture to work harmoniously with hardware constraints. Central to this effort are innovations such as Multi-head Latent Attention (MLA) for memory optimization, a Mixture of Experts (MoE) framework for computational efficiency, and FP8 mixed-precision training to accelerate performance without sacrificing accuracy. A custom Multi-Plane Network Topology was also employed to minimize inter-device communication overhead. Collectively, these components make DeepSeek-V3 a scalable and accessible solution, capable of rivaling much larger systems while operating on significantly leaner resources.

The architecture achieves memory efficiency by reducing the KV cache requirement per token to just 70 KB using MLA, compared to 327 KB and 516 KB in Qwen-2.5 and LLaMA-3.1, respectively. This reduction is accomplished by compressing attention heads into a smaller latent vector jointly trained with the model. Computational efficiency is further boosted with the MoE model, which increases total parameters to 671 billion but only activates 37 billion per token. This contrasts sharply with dense models that require full parameter activation. For example, LLaMA-3.1 needs 2,448 GFLOPS per token, while DeepSeek-V3 operates at just 250 GFLOPS. Also, the architecture integrates a Multi-Token Prediction (MTP) module, enabling the generation of multiple tokens in a single step. The system achieves up to 1.8x improvement in generation speed, and real-world measurements show 80-90% token acceptance for speculative decoding.
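The sparse-activation mechanism behind those numbers can be illustrated with a toy top-k router: only the selected experts run for a given token, so per-token compute scales with k rather than with the total expert count. This is a generic MoE sketch, not DeepSeek-V3's exact router, which also uses shared experts and its own gating scheme.

```python
import numpy as np

def topk_moe_layer(x, gate_w, expert_ws, k=2):
    """Toy top-k MoE layer for a single token.

    Only the k highest-scoring experts run, so compute per token grows with k,
    not with the total number of experts -- the mechanism that lets a 671B
    model activate only a fraction of its parameters per token.
    x: (d,) token activation; gate_w: (d, E); expert_ws: list of E (d, d) matrices.
    """
    logits = x @ gate_w                        # (E,) router scores
    top = np.argsort(logits)[-k:]              # indices of the k chosen experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # normalize weights over chosen experts
    return sum(g * (x @ expert_ws[e]) for g, e in zip(gates, top))

d, E = 64, 16
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
out = topk_moe_layer(x, rng.standard_normal((d, E)),
                     [rng.standard_normal((d, d)) for _ in range(E)], k=2)
print(out.shape)  # (64,) -- computed using only 2 of the 16 experts
```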

Using a system interconnected by CX7 400 Gbps InfiniBand NICs, DeepSeek-V3 achieves a theoretical TPOT of 14.76 milliseconds, equal to 67 tokens per second. With higher-bandwidth setups like NVIDIA GB200 NVL72 offering 900 GB/s, this number can be reduced to 0.82 milliseconds TPOT, potentially achieving 1,200 tokens per second. The practical throughput is lower due to compute-communication overlap and memory limitations, but the framework lays the foundation for future high-speed implementations. FP8 precision further adds to the speed gains. The training framework applies tile-wise 1×128 and block-wise 128×128 quantization, with less than 0.25% accuracy loss compared to BF16. These results were validated on smaller 16B and 230B parameter versions before integration into the 671B model.
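For reference, the tokens-per-second figures are simply the inverse of TPOT:

```latex
\text{TPS} \approx \frac{1000\ \text{ms/s}}{\text{TPOT (ms)}}:\qquad
\frac{1000}{14.76} \approx 67.8\ \text{tokens/s},\qquad
\frac{1000}{0.82} \approx 1{,}220\ \text{tokens/s}.
```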

Several key takeaways from the research on DeepSeek-V3 include:

- MLA compression reduces KV cache size per token from 516 KB to 70 KB, significantly lowering memory demands during inference.
- Only 37 billion of the 671 billion total parameters are activated per token, dramatically reducing compute and memory requirements without compromising model performance.
- DeepSeek-V3 requires just 250 GFLOPS per token, compared to 2,448 GFLOPS for dense models like LLaMA-3.1, highlighting its computational efficiency.
- Achieves up to 67 tokens per second (TPS) on a 400 Gbps InfiniBand network, with the potential to scale to 1,200 TPS using advanced interconnects like NVL72.
- Multi-Token Prediction (MTP) improves generation speed by 1.8×, with a token acceptance rate of 80-90%, enhancing inference throughput.
- FP8 mixed-precision training enables faster computation with less than 0.25% accuracy degradation, validated through extensive small-scale ablations.
- Capable of running on a $10,000 server equipped with a consumer-grade GPU, delivering nearly 20 TPS, making high-performance LLMs more accessible.

In conclusion, the research presents a well-rounded framework for building powerful and resource-conscious large-scale language models. By directly addressing fundamental constraints, such as memory limitations, high computational costs, and inference latency, the researchers demonstrate that intelligent architecture-hardware co-design can unlock high performance without relying on vast infrastructure. DeepSeek-V3 is a clear example of how efficiency and scalability can coexist, enabling broader adoption of cutting-edge AI capabilities across diverse organizations. This approach shifts the narrative from scaling through brute force to scaling through smarter engineering.


Check out the Paper. All credit for this research goes to the researchers of this project.

