MarkTechPost@AI · November 14, 2024
Meta AI Researchers Introduce Mixture-of-Transformers (MoT): A Sparse Multi-Modal Transformer Architecture that Significantly Reduces Pretraining Computational Costs

Researchers from Meta AI and Stanford University have introduced a new architecture called Mixture-of-Transformers (MoT) to address the high cost of training multi-modal foundation models. MoT is a sparse multi-modal transformer that assigns dedicated parameters to each modality, such as text, image, and speech, enabling modality-specific optimization and significantly reducing compute requirements. The study shows that MoT matches dense-model performance while using only 37.2% to 55.8% of the compute and trains faster, with notable efficiency gains on image and text generation tasks. This breakthrough could enable more accessible, energy-efficient multi-modal AI models for a wide range of complex applications.

🤔**Efficient multi-modal processing:** MoT matches the performance of dense models across text, image, and speech while requiring only 37.2% to 55.8% of the compute.

🚀**Faster training:** In the Chameleon setting, MoT cuts training time for image tasks by 52.8% and for text tasks by 24.4% while maintaining accuracy.

🧩**Adaptive scalability:** MoT handles both discrete and continuous tokens across modalities without extra processing layers, demonstrating strong adaptability.

💰**Lower resource needs for real-time use:** Evaluations on NVIDIA A100 GPUs show that MoT substantially reduces training time, making it a viable option for real-time applications.

💡**Modality-specific parameter optimization:** MoT assigns separate parameters to each modality, including feed-forward networks, attention matrices, and normalization layers, enabling modality-specific optimization that improves processing efficiency and output accuracy.

Advancements in AI have paved the way for multi-modal foundation models that simultaneously process text, images, and speech under a unified framework. These models can potentially transform various applications, from content creation to seamless translation across media types, as they enable the generation and interpretation of complex data. However, achieving this requires immense computational resources, which creates a barrier to scaling and operational efficiency. Training these multi-modal systems is complex: each modality, whether text, image, or audio, introduces unique challenges and requires customized handling while maintaining cohesion within the model’s framework. Balancing this diversity of data types has proven difficult in terms of both processing power and training efficiency.

A primary issue in multi-modal AI research is that traditional language models are optimized for text, and extending them to incorporate images and audio demands substantial computational power. Large language models (LLMs) designed specifically for text-based tasks do not naturally integrate other modalities because of inherent differences in how each modality must be processed. For instance, a text model pretrained on trillions of text tokens cannot simply be extended to image and speech data without introducing conflicts in the training dynamics. Consequently, the computational load escalates, with these models requiring up to five times the data and processing power of text-only models. Researchers therefore aim to find architectures that can accommodate these requirements without a proportional increase in resources.

Various strategies currently address this need for computational efficiency in multi-modal models. One prominent approach is using sparse architectures, such as Mixture-of-Experts (MoE), which activates only specific parts of the model as needed. MoE operates by utilizing “experts” to manage different aspects of the data, reducing the workload of the model at any given moment. However, MoE has limitations, including instability caused by unbalanced expert utilization and difficulty managing training dynamics at scale. Furthermore, MoE’s routing mechanism tends to focus on specific aspects of the data, often leading to an imbalance in training different modalities, thus requiring additional techniques to stabilize the process and maintain efficiency.
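
To make the contrast with MoT concrete, the sketch below shows what token-level MoE routing typically looks like in PyTorch. It is a minimal illustration under assumed names and settings (top-1 gating, four experts), not the configuration studied in the paper:

```python
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Feed-forward layer that routes each token to one of several expert MLPs."""

    def __init__(self, d_model: int = 512, d_hidden: int = 2048, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)   # routing probabilities per token
        top_expert = scores.argmax(dim=-1)      # top-1 expert id per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_expert == i
            if mask.any():                      # run only experts that received tokens
                # weight each expert output by its gate probability
                out[mask] = expert(x[mask]) * scores[mask, i].unsqueeze(-1)
        return out
```

Because the gate is learned, nothing prevents most tokens from collapsing onto a few experts, which is why practical MoE systems add auxiliary load-balancing losses and careful tuning; this is the instability described above.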

Researchers from FAIR at Meta and Stanford University introduced a new architecture called Mixture-of-Transformers (MoT). Built as a sparse, multi-modal transformer, MoT reduces computational demands by incorporating modality-specific parameters. Unlike traditional dense models that rely on uniform processing, MoT uses distinct components for each modality (text, image, and speech), allowing modality-specific optimization without requiring additional model components. For example, MoT assigns unique feed-forward networks, attention matrices, and normalization layers to each modality while maintaining a unified attention mechanism across the entire input sequence, which enhances processing efficiency and output accuracy.
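
The paragraph above can be made concrete with a rough PyTorch (2.0+) sketch of a single MoT-style block: the non-embedding parameter groups (attention projections, feed-forward network, layer norms) are duplicated per modality, while self-attention itself runs globally over the interleaved sequence. Class names, dimensions, and the modality-id convention are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoTBlock(nn.Module):
    """One transformer block with per-modality parameters and global self-attention."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_modalities: int = 3):
        super().__init__()
        self.n_heads = n_heads
        # Modality-specific (untied) parameters: attention projections, FFN, norms.
        self.qkv = nn.ModuleList(nn.Linear(d_model, 3 * d_model) for _ in range(n_modalities))
        self.out_proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_modalities))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        )
        self.norm1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))
        self.norm2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))

    @staticmethod
    def _route(layers: nn.ModuleList, x: torch.Tensor, modality: torch.Tensor, d_out: int) -> torch.Tensor:
        """Apply layers[m] only to the tokens whose modality index is m."""
        out = x.new_zeros((*x.shape[:-1], d_out))
        for m, layer in enumerate(layers):
            mask = modality == m
            if mask.any():
                out[mask] = layer(x[mask])
        return out

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); modality: (batch, seq) integer ids, e.g. 0=text, 1=image, 2=speech.
        B, T, D = x.shape
        h = self._route(self.norm1, x, modality, D)                    # modality-specific pre-norm
        q, k, v = self._route(self.qkv, h, modality, 3 * D).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        # Global self-attention over the full interleaved sequence, shared by all modalities.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(B, T, D)
        x = x + self._route(self.out_proj, attn, modality, D)          # modality-specific output projection
        h = self._route(self.norm2, x, modality, D)
        x = x + self._route(self.ffn, h, modality, D)                  # modality-specific feed-forward
        return x
```

Keeping attention global is what preserves cross-modal context in this design: routing happens only in the parameter lookup, so the interleaved sequence is never split apart.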

The Mixture-of-Transformers framework leverages this sparse design by decoupling the model parameters according to modality, optimizing both training and inference. During a multi-modal task, MoT separates text, image, and speech parameters and applies customized processing layers to each, removing the need for dense layers that must accommodate all modalities simultaneously. As a result, MoT achieves a balance of efficiency and effectiveness that traditional dense models lack. In tests involving text and image generation with the Chameleon 7B model, MoT delivered results comparable to dense baselines using only 55.8% of the FLOPs, and just 37.2% when a third modality, speech, was added. This efficiency gain translates into significant reductions in resource usage, which in large-scale AI models can mean major cost savings.
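
Continuing the illustrative sketch above, a mixed text/image/speech sequence would be driven by a per-token modality-id tensor; the shapes and id assignments here are assumptions for illustration only:

```python
import torch

block = MoTBlock(d_model=512, n_heads=8, n_modalities=3)
tokens = torch.randn(2, 10, 512)                               # batch of 2 sequences, 10 tokens each
modality = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 2, 2, 2]] * 2)  # 0=text, 1=image, 2=speech
out = block(tokens, modality)                                  # shape: (2, 10, 512)
```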

Mixture-of-Transformers showed notable improvements across multiple evaluation criteria. Compared to dense transformer models, the architecture reduced pretraining times for text and image tasks by over 40%. In the Chameleon setting, where the model processes text and images with autoregressive objectives, MoT reached the dense model’s final validation loss using just 55.8% of the compute. Furthermore, MoT accelerated training, matching the dense model’s image quality in 47.2% of the wall-clock time and its text quality in 75.6% of the time. These efficiency gains were further confirmed in the Transfusion setting, where MoT matched the dense baseline’s image performance while using only one-third of the FLOPs, demonstrating its adaptability and resource efficiency on complex multi-modal data.

The research offers several key takeaways, summarized in the highlights above, that underscore the potential of Mixture-of-Transformers to redefine multi-modal AI processing.

In conclusion, Mixture-of-Transformers presents an innovative approach to multi-modal modeling by offering an efficient, scalable solution for integrating diverse data types within a single framework. Through a sparse architecture that leverages modality-specific processing, MoT significantly reduces computational load while delivering robust performance across various tasks. This breakthrough could transform the landscape of AI, enabling more accessible, resource-efficient models for advanced multi-modal applications.


Check out the Paper. All credit for this research goes to the researchers of this project.
