MarkTechPost@AI · January 8
EPFL Researchers Release 4M: An Open-Source Training Framework to Advance Multimodal AI

Researchers at EPFL have released 4M, an open-source framework for training versatile and scalable multimodal foundation models. By integrating diverse data such as images, text, semantic features, and geometric metadata, 4M enables prediction across modalities. Compared with traditional frameworks, it supports 21 modalities, substantially expanding multimodal coverage. Its core innovation is discrete tokenization, which converts different modalities into a unified token sequence for joint training with a Transformer architecture, simplifying training and improving efficiency. 4M performs strongly across a range of tasks and lays a foundation for further progress in multimodal AI.

🧮 The 4M framework uses discrete tokenization to convert images, text, and other modalities into a unified token sequence, allowing the model to be trained jointly with a Transformer architecture.

🎨 The framework supports fine-grained, controllable data generation: outputs can be steered by conditioning on specific modalities (such as human poses or metadata), and it also offers cross-modal retrieval.

🚀 4M is trained on large datasets such as COYO700M and CC12M, with models of up to three billion parameters; by compressing dense data into sparse token sequences, it optimizes memory and compute efficiency.

📊 Experiments show that 4M performs well across 21 modalities; its XL model reaches a 48.1 mIoU on semantic segmentation while handling three times as many tasks as earlier models.

Multimodal foundation models are becoming increasingly relevant in artificial intelligence, enabling systems to process and integrate multiple forms of data—such as images, text, and audio—to address diverse tasks. However, these systems face significant challenges. Existing models often struggle to generalize across a wide variety of modalities and tasks due to their reliance on limited datasets and modalities. Additionally, the architecture of many current models suffers from negative transfer, where performance on certain tasks deteriorates as new modalities are added. These challenges hinder scalability and the ability to deliver consistent results, underscoring the need for frameworks that can unify diverse data representations while preserving task performance.

Researchers at EPFL have introduced 4M, an open-source framework designed to train versatile and scalable multimodal foundation models that extend beyond language. 4M addresses the limitations of existing approaches by enabling predictions across diverse modalities, integrating data from sources such as images, text, semantic features, and geometric metadata. Unlike traditional frameworks that cater to a narrow set of tasks, 4M expands to support 21 modalities, three times more than many of its predecessors.

A core innovation of 4M is its use of discrete tokenization, which converts diverse modalities into a unified sequence of tokens. This unified representation allows the model to leverage a Transformer-based architecture for joint training across multiple data types. By simplifying the training process and removing the need for task-specific components, 4M achieves a balance between scalability and efficiency. As an open-source project, it is accessible to the broader research community, fostering collaboration and further development.
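
To make the idea concrete, here is a minimal sketch (not the 4M codebase) of how discrete tokenization can fold several modalities into one shared token sequence for a single Transformer. The vocabulary ranges, grid size, and the to_shared_ids helper are illustrative assumptions; in 4M itself, the per-modality tokenizers are learned models rather than random stand-ins.

```python
# Minimal sketch: map modality-specific token IDs into disjoint ranges of one
# shared vocabulary, then concatenate them into a single sequence.
import numpy as np

# Assumed, disjoint token-ID ranges per modality (so one embedding table covers all).
VOCAB_RANGES = {
    "rgb":   (0, 8192),       # e.g. codes from an image VQ-VAE codebook
    "text":  (8192, 40000),   # e.g. WordPiece subword IDs, offset into the shared space
    "depth": (40000, 48192),  # e.g. codes from a depth-map tokenizer
}

def to_shared_ids(modality: str, local_ids: np.ndarray) -> np.ndarray:
    """Offset modality-local token IDs into the shared vocabulary."""
    start, end = VOCAB_RANGES[modality]
    assert local_ids.max() < end - start, "local ID outside modality vocabulary"
    return local_ids + start

# Pretend tokenizer outputs (random stand-ins for real VQ-VAE / WordPiece codes).
rng = np.random.default_rng(0)
rgb_tokens   = to_shared_ids("rgb",   rng.integers(0, 8192,  size=196))  # 14x14 grid
text_tokens  = to_shared_ids("text",  rng.integers(0, 31808, size=12))
depth_tokens = to_shared_ids("depth", rng.integers(0, 8192,  size=196))

# One flat sequence: a shared Transformer can now attend across modalities jointly.
joint_sequence = np.concatenate([rgb_tokens, text_tokens, depth_tokens])
print(joint_sequence.shape)  # (404,)
```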

Technical Details and Advantages

The 4M framework utilizes an encoder-decoder Transformer architecture tailored for multimodal masked modeling. During training, modalities are tokenized using specialized encoders suited to their data types. For instance, image data is tokenized with spatial discrete VAEs, while text and structured metadata are processed with a WordPiece tokenizer. This consistent approach to tokenization ensures seamless integration of diverse data types.
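
The masked-modeling objective can be sketched in a few lines: sample a random subset of tokens across all modalities as encoder inputs and a disjoint subset as decoder targets. The token budgets and the sample_input_and_target helper below are assumptions for illustration, not the actual 4M training code.

```python
# Sketch of a multimodal masked-modeling sampling step: pick visible tokens for
# the encoder and held-out tokens for the decoder to predict.
import numpy as np

rng = np.random.default_rng(1)

def sample_input_and_target(tokens_per_modality, input_budget, target_budget):
    """Return (inputs, targets): lists of (modality, position) pairs."""
    all_positions = [(m, i) for m, toks in tokens_per_modality.items()
                     for i in range(len(toks))]
    perm = rng.permutation(len(all_positions))
    chosen = [all_positions[j] for j in perm]
    inputs  = chosen[:input_budget]                              # visible to the encoder
    targets = chosen[input_budget:input_budget + target_budget]  # predicted by the decoder
    return inputs, targets

tokens = {
    "rgb":   list(range(196)),   # stand-in token IDs per modality
    "text":  list(range(12)),
    "depth": list(range(196)),
}
inputs, targets = sample_input_and_target(tokens, input_budget=128, target_budget=128)
print(len(inputs), len(targets))  # 128 128
```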

One notable feature of 4M is its capability for fine-grained and controllable data generation. By conditioning outputs on specific modalities, such as human poses or metadata, the model provides a high degree of control over the generated content. Additionally, 4M’s cross-modal retrieval capabilities allow for queries in one modality (e.g., text) to retrieve relevant information in another (e.g., images).
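
As a generic illustration of cross-modal retrieval (an assumed setup, not 4M's specific implementation), the sketch below embeds a text query and a gallery of images into one shared space and ranks the images by cosine similarity; the embedding dimension and the stand-in embed_text function are hypothetical placeholders for real encoders.

```python
# Generic cross-modal retrieval sketch: text query -> ranked image indices.
import numpy as np

rng = np.random.default_rng(2)
DIM = 512  # assumed shared embedding dimension

def embed_text(query: str) -> np.ndarray:
    """Stand-in for a text encoder; a real system would run the model here."""
    rng_q = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng_q.standard_normal(DIM)
    return v / np.linalg.norm(v)

# Stand-in image embeddings, normalized to unit length.
image_gallery = rng.standard_normal((1000, DIM))
image_gallery /= np.linalg.norm(image_gallery, axis=1, keepdims=True)

query_vec = embed_text("a person riding a bicycle")
scores = image_gallery @ query_vec          # cosine similarity (unit vectors)
top5 = np.argsort(-scores)[:5]              # indices of the best-matching images
print(top5, scores[top5])
```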

The framework’s scalability is another strength. Trained on large datasets like COYO700M and CC12M, 4M incorporates over 0.5 billion samples and scales up to three billion parameters. By compressing dense data into sparse token sequences, it optimizes memory and computational efficiency, making it a practical choice for complex multimodal tasks.
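
A quick back-of-the-envelope calculation shows why this compression matters; the 16x16 patch size and 14x14 token grid below are illustrative assumptions rather than figures reported in the paper.

```python
# Rough comparison: raw pixel values in a dense RGB image vs. discrete tokens.
H, W, C = 224, 224, 3
dense_values = H * W * C                 # 150,528 raw pixel values
patch = 16
tokens = (H // patch) * (W // patch)     # 196 discrete codebook indices
print(dense_values, tokens, dense_values / tokens)  # ~768x fewer elements per image
```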

Results and Insights

The capabilities of 4M are evident in its performance across various tasks. In evaluations, it demonstrated robust performance across 21 modalities without compromising results compared to specialized models. For instance, 4M’s XL model achieved a semantic segmentation mIoU score of 48.1, matching or exceeding benchmarks while handling three times as many tasks as earlier models.

The framework also excels in transfer learning. Tests on downstream tasks, such as 3D object detection and multimodal semantic segmentation, show that 4M’s pretrained encoders maintain high accuracy across both familiar and novel tasks. These results highlight its potential for applications in areas like autonomous systems and healthcare, where integrating multimodal data is critical.

Conclusion

The 4M framework marks a significant step forward in the development of multimodal foundation models. By tackling scalability and cross-modal integration challenges, EPFL’s contribution sets the stage for more flexible and efficient AI systems. Its open-source release encourages the research community to build on this work, pushing the boundaries of what multimodal AI can achieve. As the field evolves, frameworks like 4M will play a crucial role in enabling new applications and advancing the capabilities of AI.


Check out the Paper, Project Page, GitHub Page, Demo, and Blog. All credit for this research goes to the researchers of this project.


The post EPFL Researchers Release 4M: An Open-Source Training Framework to Advance Multimodal AI appeared first on MarkTechPost.
