MarkTechPost@AI September 30, 2024
Ovis-1.6: An Open-Source Multimodal Large Language Model (MLLM) Architecture Designed to Structurally Align Visual and Textual Embeddings

Ovis 1.6 is a new open-source multimodal large language model (MLLM) designed to resolve the long-standing mismatch between visual and textual representations in multimodal learning by structurally aligning visual and textual embeddings. Unlike traditional connector-based approaches, Ovis adopts a probabilistic method to generate more meaningful visual embeddings, enabling more effective integration of visual and textual information. Experimental results show that Ovis outperforms other open-source MLLMs of similar size across a range of benchmarks and excels at complex tasks that require processing high-resolution images.

🤔 Ovis 1.6 introduces a novel visual embedding table that structurally aligns visual and textual embeddings, strengthening the model's ability to handle multimodal data. The approach uses a lookup table analogous to the textual one to create structured visual representations, allowing the visual encoder to produce embeddings compatible with textual embeddings. In addition, Ovis represents visual patches with probabilistic tokens that index the visual embedding table multiple times, further reinforcing the structural alignment between visual and textual representations.

💪 Compared with traditional connector-based approaches, Ovis's structural alignment delivers notable performance gains across benchmarks. For example, Ovis scores 1808 on the MathVista-Mini benchmark, well above its competitors. On the RealWorldQA benchmark, Ovis also surpasses leading proprietary models such as GPT4V and Qwen-VL-Plus, scoring 2230 versus GPT4V's 2038. These results indicate that Ovis has an edge in handling complex multimodal tasks, making it a promising candidate for future development in the field.

📊 Ovis also performs strongly on a range of general multimodal benchmarks, including MMBench and MMStar, where it consistently outperforms models such as Mini-Gemini-HD and Qwen-VL-Chat by margins of 7.8% to 14.1%, depending on the benchmark. Moreover, Ovis delivers stable performance across parameter tiers (7B, 14B), allowing it to adapt to different model scales and compute budgets.

🚀 Ovis's advanced multimodal capabilities make it applicable to complex, challenging real-world scenarios such as visual question answering and image captioning, where existing models struggle.

🧐 The Ovis results show that structurally aligning visual and textual embeddings can significantly improve the performance of multimodal models, opening up new possibilities for progress in multimodal learning.

Artificial intelligence (AI) is evolving rapidly, particularly in multimodal learning. Multimodal models aim to combine visual and textual information to enable machines to understand and generate content that requires input from both sources. This capability is vital for tasks such as image captioning, visual question answering, and content creation, where more than a single data mode is required. While many models have been developed to address these challenges, few have effectively aligned the disparate representations of visual and textual data, leading to inefficiencies and suboptimal performance in real-world applications.

A significant challenge in multimodal learning arises from how text and image data are encoded and represented. Textual data are typically represented using embeddings drawn from a lookup table, ensuring a structured and consistent format. In contrast, visual data are encoded using vision transformers, which produce unstructured continuous embeddings. This discrepancy in representation makes it difficult for existing multimodal models to fuse visual and textual data seamlessly. As a result, models struggle to interpret complex visual-textual relationships, limiting their capabilities in advanced AI applications that require coherent understanding across multiple data modalities.
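
To make the discrepancy concrete, the minimal sketch below (hypothetical PyTorch code, not taken from the Ovis codebase; all sizes are made up) contrasts a textual embedding lookup, where every token maps to one row of a fixed table, with a ViT-style encoder whose patch embeddings are arbitrary points in continuous space.

```python
# Illustrative sketch only: how text vs. image inputs are typically embedded in an MLLM.
import torch
import torch.nn as nn

vocab_size, dim = 32000, 1024

# Text: discrete token ids index a lookup table, so every textual embedding
# is one of a fixed, structured set of rows.
text_embedding_table = nn.Embedding(vocab_size, dim)
token_ids = torch.tensor([[101, 2054, 2003, 102]])      # (batch, seq_len)
text_embeds = text_embedding_table(token_ids)           # (1, 4, 1024)

# Vision: a transformer-based encoder maps image patches to arbitrary points
# in continuous space; the outputs are not tied to any shared table of rows.
vision_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
patch_features = torch.randn(1, 256, dim)               # 256 image patches
visual_embeds = vision_encoder(patch_features)          # (1, 256, 1024), unstructured
```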

Traditionally, researchers have attempted to mitigate this problem by using a connector, such as a multi-layer perceptron (MLP), to project visual embeddings into a space where they can be aligned with textual embeddings. While effective in standard multimodal tasks, this architecture does not resolve the fundamental misalignment between visual and textual embeddings. Leading models like LLaVA and Mini-Gemini incorporate advanced methods such as cross-attention mechanisms and dual vision encoders to improve performance. However, they still face limitations due to the inherent differences in tokenization and embedding strategies, highlighting the need for a novel approach that addresses these issues at a structural level.
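
A rough sketch of that connector pattern, with assumed dimensions and module names rather than any specific model's actual code, might look like this:

```python
# Hypothetical sketch of the conventional "connector" design: an MLP projects
# continuous visual features into the LLM's embedding space, after which they
# are simply concatenated with the text embeddings.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096

connector = nn.Sequential(          # an MLP connector, LLaVA-style
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

visual_features = torch.randn(1, 256, vision_dim)   # from a vision encoder
text_embeds = torch.randn(1, 32, llm_dim)           # from the LLM's lookup table

projected_visual = connector(visual_features)        # (1, 256, 4096)
# The projection fixes the dimensionality mismatch, but the visual tokens remain
# arbitrary continuous vectors rather than entries of a shared, structured table.
llm_input = torch.cat([projected_visual, text_embeds], dim=1)  # (1, 288, 4096)
```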

A research team from Alibaba Group and Nanjing University introduced a new version of Ovis to address this challenge: Ovis 1.6, a multimodal large language model (MLLM) that structurally aligns visual and textual embeddings. Ovis employs a unique visual embedding look-up table, similar to the one used for textual embeddings, to create structured visual representations. This table enables the visual encoder to produce embeddings compatible with textual embeddings, resulting in more effective integration of visual and textual information. The model also represents visual patches with probabilistic tokens that are mapped into the visual embedding table multiple times. This approach mirrors the structured representation used in textual data, facilitating a coherent combination of visual and textual inputs.

Ovis’s core innovation lies in using a visual embedding table that aligns visual tokens with their textual counterparts. A probabilistic token represents each image patch and indexes the visual embedding table multiple times to generate a final visual embedding. This process captures the rich semantics of each visual patch and results in embeddings structurally similar to textual tokens. In contrast to conventional methods, which rely on linear projections to map visual embeddings into a joint space, Ovis adopts a probabilistic approach to generate more meaningful visual embeddings. This method enables Ovis to overcome the limitations of connector-based architectures and achieve better performance in multimodal tasks.
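
As a rough illustration of that idea, the sketch below (vocabulary size, dimensions, and module names are assumptions, not the authors' implementation) shows one way a probabilistic token could index a shared visual embedding table: each patch is turned into a probability distribution over a visual vocabulary, and its final embedding is the probability-weighted combination of the table's rows.

```python
# Hedged sketch of the structural-alignment idea; sizes and names are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

visual_vocab_size, vision_dim, llm_dim = 16384, 1024, 4096

# A visual embedding table, analogous to the textual lookup table.
visual_embedding_table = nn.Embedding(visual_vocab_size, llm_dim)

# A head that turns each patch feature into a probability distribution over
# the visual vocabulary -- the "probabilistic token" for that patch.
visual_head = nn.Linear(vision_dim, visual_vocab_size)

patch_features = torch.randn(1, 256, vision_dim)                        # from a vision encoder
probabilistic_tokens = F.softmax(visual_head(patch_features), dim=-1)   # (1, 256, 16384)

# Each patch effectively indexes the table multiple times: its final embedding
# is the probability-weighted sum of the table's rows, structurally parallel
# to how a text token selects a row of the textual embedding table.
visual_embeds = probabilistic_tokens @ visual_embedding_table.weight    # (1, 256, 4096)
```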

Empirical evaluations of Ovis demonstrate its superiority over other open-source MLLMs of similar sizes. For instance, in the MathVista-Mini benchmark, Ovis scored 1808, significantly higher than its competitors. Similarly, in the RealWorldQA benchmark, Ovis outperformed leading proprietary models such as GPT4V and Qwen-VL-Plus, scoring 2230, compared to GPT4V’s 2038. These results highlight Ovis’s strength in handling complex multimodal tasks, making it a promising candidate for future advancements in the field. The researchers also evaluated Ovis on a series of general multimodal benchmarks, including MMBench and MMStar, where it consistently surpassed models like Mini-Gemini-HD and Qwen-VL-Chat by a margin of 7.8% to 14.1%, depending on the specific benchmark.


In conclusion, the researchers have successfully addressed the longstanding misalignment between visual and textual embeddings. By introducing a structured visual embedding strategy, Ovis enables more effective multimodal data integration, improving performance across various tasks. The model’s ability to outperform open-source and proprietary models of similar parameter scales, such as Qwen-VL-Max, underscores its potential as a new standard in multimodal learning. The research team’s approach offers a significant step forward in developing MLLMs, providing new avenues for future research and application.


Check out the Paper, GitHub, and HF Model. All credit for this research goes to the researchers of this project.


