MarkTechPost@AI · January 16
MiniMax-Text-01 and MiniMax-VL-01 Released: Scalable Models with Lightning Attention, 456B Parameters, 4M Token Contexts, and State-of-the-Art Accuracy

The MiniMax-01 series, comprising MiniMax-Text-01 and MiniMax-VL-01, breaks through the bottleneck of long-context processing with several innovations. The models adopt a lightning attention mechanism and a hybrid architecture, substantially improving the efficiency of long-sequence processing and extending the context window to a remarkable 4 million tokens. MiniMax-Text-01 has 456 billion parameters, of which 45.9 billion are activated per token, while MiniMax-VL-01 adds a lightweight Vision Transformer for vision-language data. Through optimized computational strategies, the MiniMax models deliver this long-context capability while maintaining performance on par with leading models such as GPT-4.

💡 MiniMax-Text-01 has 456 billion parameters, with 45.9 billion activated per token. It uses a hybrid attention mechanism, supporting a 1-million-token context during training and extending to 4 million tokens at inference.

👁️‍🗨️ MiniMax-VL-01 integrates a lightweight Vision Transformer module and processes 512 billion vision-language tokens through a four-stage training pipeline, performing strongly on the DocVQA and AI2D benchmarks.

⚡️ The MiniMax models employ a novel lightning attention mechanism that reduces computational complexity to linear, combined with a hybrid attention architecture and an enhanced Linear Attention Sequence Parallelism (LASP+) algorithm for efficient long-context processing.

🚀 With optimized CUDA kernels and parallelization strategies, the MiniMax models achieve over 75% Model FLOPs Utilization on Nvidia H20 GPUs, markedly improving computational efficiency and delivering breakthrough results across multiple benchmarks.

Large Language Models (LLMs) and Vision-Language Models (VLMs) have transformed natural language understanding, multimodal integration, and complex reasoning. Yet one critical limitation remains: current models cannot efficiently handle extremely large contexts. This challenge has prompted researchers to explore new methods and architectures to improve these models’ scalability, efficiency, and performance.

Existing models typically support context lengths between 32,000 and 256,000 tokens, which limits their ability to handle scenarios that require larger context windows, such as extended programming instructions or multi-step reasoning tasks. Increasing context size is computationally expensive because of the quadratic complexity of traditional softmax attention. Researchers have explored alternative attention methods, such as sparse attention, linear attention, and state-space models, to address these challenges, but large-scale implementations remain limited.
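To make the scaling pressure concrete, here is a rough back-of-envelope sketch (not taken from the paper; the head count, head dimension, and precision are illustrative assumptions) comparing the memory needed to materialize a full softmax attention score matrix with the fixed-size state carried by a linear-attention-style layer:

```python
# Illustrative cost comparison: full (n x n) softmax score matrix per head
# versus the (d x d) running state of a kernelized/linear attention layer.
# All sizes below are assumptions for illustration, not MiniMax's configuration.

def softmax_attn_matrix_bytes(seq_len: int, num_heads: int = 64, dtype_bytes: int = 2) -> float:
    """Per-layer memory for explicitly materialized attention scores."""
    return num_heads * seq_len * seq_len * dtype_bytes

def linear_attn_state_bytes(head_dim: int = 128, num_heads: int = 64, dtype_bytes: int = 2) -> float:
    """Per-layer memory for the (head_dim x head_dim) state of linear attention."""
    return num_heads * head_dim * head_dim * dtype_bytes

for n in (32_000, 256_000, 1_000_000, 4_000_000):
    quad_gb = softmax_attn_matrix_bytes(n) / 1e9
    lin_mb = linear_attn_state_bytes() / 1e6          # independent of sequence length
    print(f"{n:>9,} tokens: softmax scores ≈ {quad_gb:,.0f} GB, linear state ≈ {lin_mb:.1f} MB")
```

The quadratic term dominates quickly, which is why long-context work either avoids materializing the full score matrix or replaces it with linear-time formulations.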

Sparse attention focuses on relevant inputs to reduce computational overhead, while linear attention simplifies the attention matrix for scalability. However, adoption has been slow due to compatibility issues with existing architectures and suboptimal real-world performance. For example, state-space models effectively process long sequences but often lack the robustness and accuracy of transformer-based systems in complex tasks.

Researchers from MiniMax have introduced the MiniMax-01 series, including two variants to address these limitations:

    MiniMax-Text-01: comprises 456 billion total parameters, with 45.9 billion activated per token. It leverages a hybrid attention mechanism for efficient long-context processing, and its context window extends to 1 million tokens during training and 4 million tokens during inference.
    MiniMax-VL-01: integrates a lightweight Vision Transformer (ViT) module and processes 512 billion vision-language tokens through a four-stage training pipeline.

The models employ a novel lightning attention mechanism that reduces the computational complexity of processing long sequences, and they integrate a Mixture of Experts (MoE) architecture for scalability and efficiency. The MiniMax models feature 456 billion parameters, of which 45.9 billion are activated for each token. This combination allows the models to process context windows of up to 1 million tokens during training and to extrapolate to 4 million tokens during inference. By leveraging advanced computational strategies, the MiniMax-01 series offers unprecedented long-context capabilities while maintaining performance on par with state-of-the-art models such as GPT-4 and Claude-3.5.
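The 456B-total versus 45.9B-active split is what MoE routing buys: each token is sent to only a few experts, so only those experts' weights participate in its forward pass. The NumPy sketch below shows top-k routing in miniature; the expert count, k, and dimensions are placeholders rather than MiniMax's published configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2      # placeholder sizes, not MiniMax's configuration

# Each "expert" is a small two-layer feed-forward network.
experts = [(rng.standard_normal((d_model, 4 * d_model)) * 0.02,
            rng.standard_normal((4 * d_model, d_model)) * 0.02)
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts; only those experts' weights are used."""
    logits = x @ router                                   # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]         # indices of the chosen experts
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # renormalize over chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                           # per-token dispatch, written for clarity
        for slot in range(top_k):
            w1, w2 = experts[top[t, slot]]
            out[t] += gates[t, slot] * (np.maximum(x[t] @ w1, 0.0) @ w2)
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)   # (4, 64); each token touched only 2 of the 8 experts
```

Scaled up, the same idea lets total parameter count grow with the number of experts while per-token compute tracks only the activated subset.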

The lightning attention mechanism achieves linear computational complexity, enabling the model to scale effectively. The hybrid attention architecture alternates between lightning and softmax attention layers, balancing computational efficiency with retrieval capability. The models also incorporate an enhanced Linear Attention Sequence Parallelism (LASP+) algorithm to handle extensive sequences efficiently. In addition, the vision-language model MiniMax-VL-01 integrates a lightweight Vision Transformer module, enabling it to process 512 billion vision-language tokens through a four-stage training process. These innovations are complemented by optimized CUDA kernels and parallelization strategies, achieving over 75% Model FLOPs Utilization (MFU) on Nvidia H20 GPUs.
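A minimal sketch of the hybrid idea, assuming a periodic pattern of mostly linear-time attention layers with an occasional full softmax layer. The 1-in-8 ratio and the simple kernelized linear attention below are illustrative stand-ins; the actual lightning attention uses a more involved block-wise formulation, and norms, feed-forward blocks, and causal masking are omitted:

```python
import numpy as np

def linear_attention(q, k, v):
    """O(n) attention via the kernel trick: phi(q) @ (phi(k)^T v)."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6          # simple positive feature map
    kv = phi(k).T @ v                                  # (d, d) summary, size independent of sequence length
    z = phi(q) @ phi(k).sum(axis=0, keepdims=True).T   # per-query normalizer, shape (n, 1)
    return (phi(q) @ kv) / z

def softmax_attention(q, k, v):
    """Standard O(n^2) attention, kept for retrieval-heavy layers."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def hybrid_stack(x, n_layers=16, softmax_every=8):
    """Mostly linear-attention layers, with a softmax layer every `softmax_every` layers."""
    for layer in range(n_layers):
        attn = softmax_attention if (layer + 1) % softmax_every == 0 else linear_attention
        x = x + attn(x, x, x)                          # residual connection only; norms/FFN omitted
    return x

x = np.random.default_rng(0).standard_normal((128, 64))  # 128 tokens, 64-dim embeddings
print(hybrid_stack(x).shape)                              # (128, 64)
```

The occasional softmax layers preserve precise retrieval over the window, while the linear layers keep per-token cost roughly independent of sequence length.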

Performance evaluations reveal that the MiniMax models achieve groundbreaking results across various benchmarks, with MiniMax-VL-01 performing strongly on vision-language benchmarks such as DocVQA and AI2D.

These models also offer a 20–32 times longer context window than traditional counterparts, significantly enhancing their utility for long-context applications.
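As a quick sanity check on that multiple, dividing the 4-million-token inference window by baseline windows of roughly 128K–200K tokens (an assumption; the article does not name the comparison models) gives figures in the stated range:

```python
# Quick arithmetic behind a 20-32x multiple; the baseline windows are assumptions.
inference_window = 4_000_000
for baseline in (128_000, 200_000):
    print(f"{baseline:>7,}-token baseline -> {inference_window / baseline:.1f}x longer")
# 128,000 -> 31.2x, 200,000 -> 20.0x
```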

In conclusion, the MiniMax-01 series, comprising MiniMax-Text-01 and MiniMax-VL-01, represents a breakthrough in addressing scalability and long-context challenges. It combines innovative techniques like lightning attention with a hybrid architecture. By leveraging advanced computational frameworks and optimization strategies, researchers have introduced a solution that extends context capabilities to an unprecedented 4 million tokens and matches or surpasses the performance of leading models like GPT-4.


Check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project.


The post MiniMax-Text-01 and MiniMax-VL-01 Released: Scalable Models with Lightning Attention, 456B Parameters, 4M Token Contexts, and State-of-the-Art Accuracy appeared first on MarkTechPost.

