MarkTechPost@AI · 10 hours ago
OpenBMB Releases MiniCPM4: Ultra-Efficient Language Models for Edge Devices with Sparse Attention and Fast Inference

MiniCPM4 is a highly efficient large language model developed by the OpenBMB team and designed specifically for edge devices. The model is optimized across four dimensions: architecture, training data, training algorithms, and the inference system, with the goal of solving the difficulty of deploying traditional large language models on resource-constrained hardware. MiniCPM4 ships in 0.5B- and 8B-parameter versions and, through an innovative sparse attention mechanism, high-quality training data, and an efficient inference system, delivers strong performance on edge devices; for example, it processes 128K-token documents 7x faster than Qwen3-8B. This makes MiniCPM4 well suited to applications such as offline assistants, real-time mobile AI, and autonomous embedded systems.

💡 **Architectural innovation:** MiniCPM4 adopts the InfLLM v2 sparse attention mechanism, which partitions the key-value cache into blocks and uses semantic kernels to select the relevant blocks for attention computation, cutting attention computation by 60%. It also supports sequences of up to 128K tokens while maintaining speed and coherence.

📚 **High-quality training data:** MiniCPM4 uses UltraClean to generate and filter its training datasets, requiring only 8 trillion training tokens versus the 36 trillion used by competing models such as Qwen3-8B, so it achieves stronger performance with far less data. The resulting UltraFineWeb dataset outperforms FineWeb by 3.61 and 1.98 percentage points on English and Chinese benchmarks, respectively.

🚀 **Efficient inference system:** MiniCPM4's CPM.cu inference system combines CUDA optimizations with speculative sampling for faster decoding (see the sketch after this list). On edge devices such as the Jetson AGX Orin and RTX 4090, MiniCPM4 processes 128K-token documents 7x faster than Qwen3-8B, with an average decoding speed of over 200 tokens/s.

📈 **Strong benchmark results:** On MMLU, the 8B version of MiniCPM4 scores 32.24%, and it reaches 35.67% and 70.62% on ARC-C and ARC-E respectively, surpassing earlier models.
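The article itself contains no code; as a rough illustration of the speculative sampling that CPM.cu pairs with its CUDA kernels, here is a minimal sketch in Python. The `draft_model` and `target_model` interfaces (returning per-position next-token probabilities) are hypothetical stand-ins, not CPM.cu's actual API, and the residual-distribution correction used by full speculative sampling is omitted for brevity.

```python
import torch

def speculative_decode(target_model, draft_model, prompt_ids, max_new_tokens=64, k=4):
    """Toy speculative sampling loop: a small draft model proposes k tokens,
    and the large target model verifies them in one parallel forward pass.

    Both models are assumed (hypothetically) to map a list of token ids to
    next-token probability distributions for every position: [len(ids), vocab].
    """
    ids = list(prompt_ids)
    while len(ids) < len(prompt_ids) + max_new_tokens:
        # 1. Draft k tokens autoregressively with the cheap model.
        draft_ids, draft_dists = list(ids), []
        for _ in range(k):
            p = draft_model(draft_ids)[-1]                 # [vocab]
            draft_dists.append(p)
            draft_ids.append(int(torch.multinomial(p, 1)))
        # 2. Score the whole drafted sequence with the target model at once.
        target_dists = target_model(draft_ids)             # [len(draft_ids), vocab]
        # 3. Accept drafted token i with probability min(1, p_target / p_draft).
        n_accept = 0
        for i in range(k):
            pos = len(ids) + i                             # index of drafted token i
            tok = draft_ids[pos]
            accept_prob = min(1.0, float(target_dists[pos - 1][tok] / draft_dists[i][tok]))
            if float(torch.rand(())) < accept_prob:
                n_accept += 1
            else:
                break
        ids.extend(draft_ids[len(ids):len(ids) + n_accept])
        # 4. Sample one token from the target model at the first unverified
        #    position (simplified: no residual-distribution correction).
        ids.append(int(torch.multinomial(target_model(ids)[-1], 1)))
    return ids
```

The appeal of this scheme is that the expensive target model runs once per batch of drafted tokens instead of once per token, which is where the decoding speedups on long contexts come from.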

The Need for Efficient On-Device Language Models

Large language models have become integral to AI systems, enabling tasks like multilingual translation, virtual assistance, and automated reasoning through transformer-based architectures. While highly capable, these models are typically large, requiring powerful cloud infrastructure for training and inference. This reliance leads to latency, high costs, and privacy concerns, limiting their deployment on resource-constrained edge devices. Models like GPT and LLaMA, with billions of parameters, cannot efficiently run on local hardware due to their size and the complexity of their training and inference processes. Moreover, their dependence on massive datasets and high-performance GPUs makes them unsuitable for mobile or embedded environments. To overcome these challenges, there is a growing need for lightweight, efficient models that can perform well locally without sacrificing reasoning and context-handling capabilities.

Limitations of Existing Solutions

Several methods have been explored to address these challenges. Sparse attention mechanisms, such as NSA and MoBA, aim to reduce memory consumption; however, they either fall short in decoding efficiency or introduce significant architectural overhead. For data handling, previous methods have leaned on large-scale web scraping, resulting in noisy and unstructured corpora. Filtering methods have included fastText classifiers and manual curation, which either lack depth or scalability. On the training side, frameworks such as StepLaw have been used to optimize hyperparameters based on predictable scaling laws; however, they often require extensive experimentation and GPU cycles, creating a barrier to entry. Inference optimizations, such as FlashAttention, reduce computational complexity but still fall short of delivering the speeds required for real-time applications on edge devices.
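As a concrete illustration of the fastText-style quality filtering mentioned above, here is a minimal sketch assuming a supervised fastText classifier has already been trained to separate high- from low-quality documents; the model path, label name, and threshold are hypothetical choices, not values from the article.

```python
import fasttext

# Assumed: a binary quality classifier trained with fastText's supervised mode,
# where the positive label "__label__hq" marks high-quality documents.
model = fasttext.load_model("quality_classifier.bin")  # hypothetical path

def keep_document(text: str, threshold: float = 0.8) -> bool:
    """Return True if the classifier scores the document as high quality."""
    # fastText predicts on a single line, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__hq" and probs[0] >= threshold

corpus = ["A well-written encyclopedia entry about transformers ...",
          "buy cheap pills now !!!"]
filtered = [doc for doc in corpus if keep_document(doc)]
```

Filters of this kind are fast enough to run over web-scale corpora, but, as the paragraph notes, a shallow classifier alone tends to miss the finer quality distinctions that approaches like UltraClean target.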

Introducing MiniCPM4: Efficient Architecture, Data, and Inference

Researchers from OpenBMB introduced MiniCPM4, a suite of highly efficient large language models designed specifically for on-device deployment. The development includes two variants: one with 0.5 billion parameters and another with 8 billion. The model was built with improvements in four core dimensions: model architecture, training data, training algorithms, and inference systems. For architecture, the team introduced InfLLM v2, a sparse attention mechanism that accelerates both prefilling and decoding without sacrificing context comprehension. On the data front, UltraClean was employed to generate and filter training datasets, enabling the use of just 8 trillion training tokens compared to the 36 trillion used by competitive models like Qwen3-8B. ModelTunnel v2 guided the training process with efficient hyperparameter tuning, and CPM.cu handled inference with platform-agnostic CUDA-based execution.

Technical Innovations in MiniCPM4

MiniCPM4’s tech stack is designed to strike a balance between performance and resource utilization. InfLLM v2 partitions key-value caches into blocks and selects top-K relevant blocks using semantic kernels for attention, reducing attention computation by 60% compared to NSA. Its dynamic context block selection and token-level query group processing allow it to support sequences up to 128K tokens while maintaining speed and coherence. UltraClean relies on efficient data verification, utilizing a pre-trained LLM and annealing-based fine-tuning on 10 billion tokens. This results in higher-quality datasets, UltraFineWeb in English and UltraFineWeb-zh in Chinese, which outperform FineWeb by 3.61 and 1.98 percentage points, respectively, in average benchmark performance. UltraChat v2 further supports post-training by generating reasoning-rich, multi-turn dialogues.
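To make the block-selection idea concrete, here is a minimal Python sketch of InfLLM v2-style block-sparse attention. It is not the actual InfLLM v2 kernel: the block size, the mean-pooled block representative standing in for the paper's semantic kernels, and the top-k value are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    """Illustrative block-sparse attention for a single query vector.

    q: [d]         current query
    k, v: [T, d]   cached keys / values for a long context
    Each KV block is summarized by its mean key (a stand-in for the paper's
    semantic kernels); only the top_k highest-scoring blocks are attended to.
    """
    T, d = k.shape
    n_blocks = (T + block_size - 1) // block_size

    # 1. Score each block by the dot product of q with the block's mean key.
    block_scores = []
    for b in range(n_blocks):
        keys = k[b * block_size:(b + 1) * block_size]
        block_scores.append(torch.dot(q, keys.mean(dim=0)))
    block_scores = torch.stack(block_scores)

    # 2. Keep only the most relevant blocks.
    chosen = torch.topk(block_scores, min(top_k, n_blocks)).indices

    # 3. Dense attention restricted to the selected blocks.
    idx = torch.cat([torch.arange(b * block_size, min((b + 1) * block_size, T))
                     for b in chosen.tolist()])
    attn = F.softmax(q @ k[idx].T / d ** 0.5, dim=-1)   # [len(idx)]
    return attn @ v[idx]                                 # [d]

# Usage: with a 128K-token cache, each query touches only top_k * block_size keys.
q = torch.randn(128)
k, v = torch.randn(8192, 128), torch.randn(8192, 128)
out = block_sparse_attention(q, k, v)
```

The point of the block-level scores is that relevance is judged per block rather than per token, so the cost of selection grows with the number of blocks, not with the full 128K-token context.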

Benchmark Performance and Speed Gains

In terms of raw performance, the 8B version achieved MMLU scores of 32.24%, outperforming FineWeb (28.84%) and FineWeb-edu (31.80%). On ARC-C and ARC-E, it scored 35.67% and 70.62% respectively, surpassing competing datasets by over 10 percentage points. Compared to Qwen3-8B, MiniCPM4 used only 22% of the training data yet delivered a 7-fold increase in inference speed on 128K-length documents when tested on end-side GPUs like Jetson AGX Orin and RTX 4090. The average decoding speed reached over 200 tokens/s for long-context inputs, and the architecture degraded gracefully to dense attention for shorter sequences. Additionally, the use of BitCPM4 enabled quantization-aware training, allowing deployment on devices with even stricter memory constraints without losing performance fidelity.

Conclusion: Efficient LLMs for Edge AI Applications

In conclusion, the comprehensive approach taken by the MiniCPM4 team addressed all key inefficiencies associated with current LLMs. By introducing novel architectural, training, and deployment strategies, the model maintains high-quality responses, supports long-context comprehension, and performs well under edge constraints. The success of this work extends beyond raw metrics to demonstrate that state-of-the-art performance is achievable outside the cloud. It enables new application domains, such as secure offline assistants, real-time mobile AI, and autonomous embedded systems, without the traditional computational burden.


Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

