MarkTechPost@AI · August 27, 2024
Meta presents Transfusion: A Recipe for Training a Multi-Modal Model Over Discrete and Continuous Data

Meta and collaborators propose Transfusion, which unifies language modeling and the diffusion process to tackle multi-modal data and performs strongly across a range of tasks.

🌐 Transfusion is a novel method that integrates language modeling and the diffusion process within a single transformer architecture. It handles both discrete and continuous data without separate architectures or quantization, overcoming the limitations of existing approaches.

📖 The method combines a next-token prediction loss for text with a diffusion process for images in a unified training pipeline. Key innovations include modality-specific encoding and decoding layers and bidirectional attention within images, enabling effective handling of multiple data types.

💻 Transfusion is trained on a balanced mixture of text and image data. The architecture is a transformer with modality-specific components: text is tokenized into discrete sequences, while images are encoded into latent patches by a variational autoencoder. The model performs strongly on multiple benchmarks.

The rapid advancement of AI has led to the development of powerful models for discrete and continuous data modalities, such as text and images, respectively. However, integrating these distinct modalities into a single model remains a significant challenge. Traditional approaches often require separate architectures or compromise on data fidelity by quantizing continuous data into discrete tokens, leading to inefficiencies and performance limitations. This challenge is crucial for the advancement of AI, as overcoming it would enable more versatile models capable of processing and generating both text and images seamlessly, thereby enhancing applications in multi-modal tasks.

Current methods to address multi-modal generation primarily focus on specialized models for either discrete or continuous data. Language models, like transformers, excel at handling sequences of discrete tokens, making them highly effective for tasks involving text. Conversely, diffusion models are the state-of-the-art for generating high-quality images by learning to reverse a noise-adding process. However, these models typically require separate training pipelines for each modality, leading to inefficiencies. Moreover, some approaches attempt to unify these modalities by quantizing images into discrete tokens for processing by language models, but this often results in information loss, limiting the model’s ability to generate high-resolution images or perform complex multi-modal tasks.
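The "noise-adding process" that diffusion models learn to reverse can be sketched in a few lines. This is an illustrative, pure-Python sketch of one forward-diffusion step; the function name and the cumulative schedule value `alpha_bar_t` are assumptions for exposition, not details from the paper:

```python
import math
import random

def forward_noise(x0, alpha_bar_t, rng=None):
    """One step of the diffusion forward process (illustrative sketch):
    blend the clean sample x0 with Gaussian noise according to the
    cumulative schedule value alpha_bar_t in [0, 1]. Returns the noised
    sample and the noise a diffusion model is trained to predict."""
    rng = rng or random.Random(0)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # the noise to be predicted
    xt = [math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e
          for a, e in zip(x0, eps)]
    return xt, eps
```

At `alpha_bar_t = 1` the sample is untouched; as it approaches 0 the sample becomes pure noise, which is why a schedule over many steps gradually destroys the image the reverse model must reconstruct.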

A team of researchers from Meta, Waymo, and the University of Southern California proposes Transfusion, an innovative method that integrates language modeling and diffusion processes within a single transformer architecture. This method addresses the limitations of existing approaches by allowing the model to process and generate both discrete and continuous data without separate architectures or quantization. Transfusion combines the next-token prediction loss for text with the diffusion process for images, enabling a unified training pipeline. The approach includes key innovations, such as modality-specific encoding and decoding layers and the use of bidirectional attention within images, which collectively enhance the model's ability to handle diverse data types efficiently. This integration represents a significant step toward more versatile AI systems capable of performing complex multi-modal tasks.

Transfusion is trained on a balanced mixture of text and image data, with each modality being processed through its specific objective: next-token prediction for text and diffusion for images. The model’s architecture consists of a transformer with modality-specific components, where text is tokenized into discrete sequences and images are encoded as latent patches using a variational autoencoder (VAE). The model employs causal attention for text tokens and bidirectional attention for image patches, ensuring that both modalities are processed effectively. Training is conducted on a large-scale dataset consisting of 2 trillion tokens, including 1 trillion text tokens and 692 million images, each represented by a sequence of patch vectors. The use of U-Net down and up blocks for image encoding and decoding further enhances the model’s efficiency, particularly when compressing images into patches.
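The mixed attention pattern above (causal over the whole sequence, bidirectional within each image's patches) can be sketched as a boolean mask. The encoding of modality labels here is a hypothetical choice for illustration:

```python
def transfusion_attention_mask(modality):
    """Build an attention mask for a mixed text/image sequence (sketch).

    modality[i] is None for a text token, or an image id for a patch.
    All positions attend causally (j <= i); patches of the same image
    additionally attend to each other bidirectionally.
    Returns mask[i][j] == True when position i may attend to position j.
    """
    n = len(modality)
    # Base case: causal attention over the full sequence.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        if modality[i] is not None:  # image patch
            for j in range(n):
                if modality[j] == modality[i]:
                    mask[i][j] = True  # full attention within one image
    return mask
```

For a sequence like `[text, patch_a, patch_a, text]`, the two patches see each other in both directions, while the leading text token still cannot attend to the future patches.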

Transfusion demonstrates superior performance across several benchmarks, particularly in text-to-image and image-to-text generation. It outperforms existing methods by a significant margin on key metrics such as Fréchet Inception Distance (FID) and CLIP score. For example, in a controlled comparison, Transfusion achieves a 2× lower FID than the Chameleon models, demonstrating better scaling and reduced computational cost. The paper's evaluation table highlights these results across various benchmarks. Notably, the 7B-parameter model achieves an FID of 16.8 on the MS-COCO benchmark, outperforming other approaches that require more computational resources to achieve similar results.

In conclusion, Transfusion represents a novel approach to multi-modal learning, effectively combining language modeling and diffusion processes within a single architecture. By addressing the inefficiencies and limitations of existing methods, Transfusion offers a more integrated and efficient solution for processing and generating both text and images. This proposed method has the potential to significantly impact various AI applications, particularly those involving complex multi-modal tasks, by enabling a more seamless and effective integration of diverse data modalities.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.


