MarkTechPost@AI · February 6
Meta AI Introduces MILS: A Training-Free Multimodal AI Framework for Zero-Shot Image, Video, and Audio Understanding

Meta AI has introduced MILS, a training-free multimodal AI framework that gives large language models (LLMs) the ability to handle content across modalities such as images, video, and audio. MILS pairs a generator (an LLM) with a scorer (a pre-trained multimodal model) in an iterative optimization loop, achieving zero-shot generalization without any additional training. The framework performs strongly on image, video, and audio understanding and generation tasks, surpassing prior zero-shot methods and offering a more adaptable and scalable approach to multimodal AI.

🖼️ MILS uses an iterative optimization framework built around a generator (an LLM) and a scorer (a pre-trained multimodal model). The generator proposes candidate solutions for a multimodal task, while the scorer ranks them by relevance, coherence, and alignment with the input data, progressively refining the output.

🎤 MILS performs well across a range of multimodal tasks. For image captioning, it uses Llama 3.1 8B as the generator and a CLIP model as the scorer, producing more accurate and informative captions. For video and audio captioning, it relies on ViCLIP and ImageBind respectively for evaluation, outperforming models trained on large-scale datasets without any task-specific training.

🎨 For text-to-image generation, MILS improves image quality and fidelity by refining the textual prompt, and it also excels at style transfer, producing better editing prompts that guide style-transfer models toward more visually consistent transformations. In addition, MILS enables cross-modal arithmetic, combining information from different modalities into a single coherent output.

Large Language Models (LLMs) are designed primarily for text, which limits their ability to interpret and generate multimodal content such as images, videos, and audio. Multimodal tasks are conventionally handled by task-specific models trained on large amounts of labeled data, which makes them resource-hungry and rigid. Existing zero-shot methods, meanwhile, depend on pretraining with paired multimodal datasets, which limits their flexibility on new tasks. The challenge is to make LLMs perform multimodal reasoning and generation without task-specific training, curated data, or model adaptation. Overcoming this challenge would significantly broaden the applicability of LLMs to multimodal content processing and generation across domains.

Conventional multimodal AI systems are built on models like CLIP for image-text alignment or diffusion models for media generation, but these approaches still depend on extensive training over curated data. Zero-shot captioning models such as ZeroCap and MeaCap try to overcome this, yet they remain tied to fixed architectures and gradient-based optimization, which limits how well they generalize across modalities. These methods share three limitations: they require extensive labeled data, they cannot generalize beyond the training distribution, and their reliance on gradient-based optimization restricts their flexibility on new tasks. Without addressing these limitations, multimodal AI stays confined to fixed tasks and datasets, constraining its broader applicability.

Researchers from Meta propose MILS (Multimodal Iterative LLM Solver), a test-time optimization framework that enhances LLMs with multimodal reasoning capabilities without requiring additional training. Rather than adjusting the LLM or retraining it on multimodal data, MILS uses an iterative optimization cycle with a GENERATOR and a SCORER. The GENERATOR, an LLM, produces candidate solutions for multimodal tasks like image captions, video descriptions, or stylized image prompts, while the SCORER, a pre-trained multimodal model, ranks the generated solutions by relevance, coherence, and alignment with input data. Alternating between the two, MILS repeatedly refines its outputs with real-time feedback, continually improving performance. This enables zero-shot generalization across several modalities, including text, images, videos, and audio, making it an extremely versatile solution for multimodal AI applications.
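The loop itself is simple to express. Below is a minimal, hypothetical sketch of such a test-time optimization cycle in Python; `generator` and `scorer` are placeholder callables standing in for the LLM and the pre-trained multimodal model, and the real MILS implementation may structure candidates and feedback differently.

```python
# Minimal sketch of a MILS-style test-time loop (hypothetical helpers;
# the actual MILS code may differ in structure and hyperparameters).

def mils_optimize(task_input, generator, scorer, n_candidates=32, n_steps=10):
    """Iteratively refine candidate outputs for a multimodal input.

    generator(feedback, n) -> list[str]: an LLM that proposes n candidates,
        conditioned on the top-scoring candidates from the previous step.
    scorer(task_input, candidate) -> float: a pre-trained multimodal model
        that rates how well a candidate matches the input.
    """
    feedback = []   # (score, candidate) pairs fed back to the generator
    best = None
    for _ in range(n_steps):
        candidates = generator(feedback, n_candidates)
        scored = [(scorer(task_input, c), c) for c in candidates]
        scored.sort(key=lambda sc: sc[0], reverse=True)
        feedback = scored[:5]              # keep the top candidates as feedback
        if best is None or scored[0][0] > best[0]:
            best = scored[0]               # track the best candidate seen so far
    return best[1]
```

Because only the candidates are updated, never the model weights, the procedure stays gradient-free and works with any pre-trained generator/scorer pair.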

MILS is implemented as a gradient-free optimization method that employs pre-trained models without tuning their parameters, and it has been applied to a variety of multimodal tasks. For image captioning, MILS uses Llama 3.1 8B as the GENERATOR and CLIP-based models as the SCORER, iteratively refining candidates until the most accurate and descriptive caption emerges. The same iterative process applies to video frames, with ViCLIP used for evaluation, and for audio captioning MILS extends the procedure to sound data using ImageBind as the SCORER, allowing LLMs to produce natural language descriptions of audio. For text-to-image generation, MILS refines textual prompts before passing them to diffusion-based models, yielding higher-quality images. The framework also extends to style transfer, where it generates optimized editing prompts that guide style-transfer models toward more visually consistent transformations. In addition, it enables cross-modal arithmetic, combining heterogeneous modalities, such as an audio caption and an image description, into a single multimodal representation. By using pre-trained models as scoring functions, MILS avoids explicit multimodal training while remaining task-agnostic.
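As a concrete illustration of the SCORER role in the captioning setting, the snippet below uses an off-the-shelf Hugging Face CLIP checkpoint to rank a few candidate captions against an image by image-text similarity. The checkpoint name, image path, and captions are placeholders; this is a sketch of the scoring step only, not the MILS codebase.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint and inputs; MILS's exact scorer setup may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")  # hypothetical input image
candidate_captions = [
    "a dog running on a beach at sunset",
    "a cat sleeping on a sofa",
    "people hiking in the mountains",
]

inputs = processor(text=candidate_captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image's similarity to each candidate caption.
scores = outputs.logits_per_image.squeeze(0)
ranked = sorted(zip(candidate_captions, scores.tolist()),
                key=lambda x: x[1], reverse=True)
for caption, score in ranked:
    print(f"{score:.2f}  {caption}")
```

In the full loop, the top-ranked captions would be fed back to the GENERATOR as context for the next round of proposals.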

MILS achieves robust zero-shot performance on a variety of multimodal tasks and outperforms previous work on both captioning and generation. For image captioning, it is more semantically accurate than previous zero-shot models and produces more natural and informative captions. For video and audio captioning, it outperforms models trained on large-scale datasets despite having zero task-specific training. For text-to-image generation, MILS improves image quality and fidelity, and human evaluators prefer its synthesized images in an overwhelming majority of cases. MILS is also effective for style transfer, discovering prompts that yield better visual transformations. Finally, MILS introduces new cross-modal arithmetic capabilities, combining information from different modalities into coherent outputs. These findings demonstrate the flexibility and efficiency of MILS, making it a paradigm-breaking alternative to multimodal AI systems that depend on carefully curated training data.

MILS offers a new paradigm for multimodal AI by letting LLMs process and generate text, image, video, and audio content without any training or fine-tuning. Its test-time iterative optimization mechanism enables emergent zero-shot generalization, outperforming previous zero-shot methods while remaining simple. By combining pre-trained LLMs and multimodal models in an adaptive feedback loop, MILS sets a new state of the art for multimodal AI, opening the way to more adaptable and scalable systems that can handle multimodal reasoning and generation tasks dynamically.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 75k+ ML SubReddit.


The post Meta AI Introduces MILS: A Training-Free Multimodal AI Framework for Zero-Shot Image, Video, and Audio Understanding appeared first on MarkTechPost.
