MarkTechPost@AI July 14, 2024
InternLM-XComposer-2.5 (IXC-2.5): A Versatile Large Vision-Language Model that Supports Long-Contextual Input and Output

InternLM-XComposer-2.5 (IXC-2.5) is a versatile large vision-language model that supports long-contextual input and output. It performs strongly across a wide range of benchmarks, including free-form text-image conversation, OCR, video understanding, article composition, and webpage crafting. IXC-2.5 supports a 24K interleaved image-text context window, extendable to 96K, enabling long-term human-AI interaction and content creation.

📒 **IXC-2.5 has strong comprehension and composition capabilities, supporting tasks such as text-image conversation, OCR, video understanding, article composition, and webpage crafting.** IXC-2.5 is designed to address the difficulties current open-source LVLMs face with long-context input and output: its 24K interleaved image-text context window (extendable to 96K) supports long-term human-AI interaction and content creation, allowing the model to handle longer text and image sequences and produce more coherent, consistent outputs.

📊 **IXC-2.5 makes notable advances in image understanding, video analysis, and dialogue.** For image understanding, it adopts a unified dynamic image-partition strategy, processing images at 560×560 resolution with 400 tokens per sub-image. For video analysis, it treats videos as concatenated frames, applying a scaled identity strategy for high-resolution inputs. For dialogue, it supports multi-turn, multi-image conversations, understanding and generating responses based on contextual information.

💻 **IXC-2.5 performs strongly across benchmarks, with excellent results in video understanding, high-resolution image analysis, multi-image multi-turn comprehension, general visual question answering, and screenshot-to-code translation.** In video understanding, it outperforms open-source models on 4 of 5 benchmarks and rivals closed-source APIs. On structural high-resolution tasks, it competes with larger models and excels at table and form understanding. On multi-image multi-turn comprehension, it improves on previous models by 13.8% on the MMDU benchmark. On general visual QA, it matches or surpasses both open-source and closed-source models, outperforming GPT-4V and Gemini-Pro on some challenges. On screenshot-to-code translation, it even surpasses GPT-4V in average performance, demonstrating its versatility and effectiveness across diverse multimodal tasks.

📢 **The IXC-2.5 architecture comprises a ViT-L/14 vision encoder, the InternLM2-7B language model, and partial LoRA.** It handles diverse inputs through a unified dynamic image-partition strategy, and it supports audio input/output using Whisper for transcription and MeloTTS for speech synthesis. This versatile architecture enables IXC-2.5 to handle a wide range of input types and complex tasks effectively.

📡 **IXC-2.5 marks a significant advance in large vision-language models and opens avenues for future research, potentially extending to long-context video understanding and interaction-history analysis.** These advances promise to enhance AI's ability to assist humans in diverse real-world applications, a notable step forward for multimodal AI technology.

Large Language Models (LLMs) have made significant strides in recent years, prompting researchers to explore the development of Large Vision Language Models (LVLMs). These models aim to integrate visual and textual information processing capabilities. However, current open-source LVLMs face challenges in matching the versatility of proprietary models like GPT-4, Gemini Pro, and Claude 3. The primary obstacles include limited diversity in training data and difficulties in handling long-context input and output. Researchers are striving to enhance open-source LVLMs’ ability to perform a wide range of vision-language comprehension and composition tasks, bridging the gap between open-source and closed-source leading paradigms in terms of versatility and performance across various benchmarks.

Researchers have made significant efforts to tackle the challenges in developing versatile LVLMs. These approaches include text-image conversation models, high-resolution image analysis techniques, and video understanding methods. For text-image conversations, most existing LVLMs focus on single-image multi-round interactions, with some extending to multi-image inputs. High-resolution image analysis has been tackled through two main strategies: high-resolution visual encoders and image patchification. Video understanding in LVLMs has employed techniques such as sparse sampling, temporal pooling, compressed video tokens, and memory banks.
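As a concrete illustration of the sparse-sampling strategy listed above, the sketch below uniformly picks a fixed number of frames from a clip before they are handed to a vision encoder. This is a minimal, generic example rather than the sampling schedule of any particular model; the frame count and the use of OpenCV are assumptions.

```python
import cv2
import numpy as np

def sample_frames_uniform(video_path: str, num_frames: int = 16) -> list[np.ndarray]:
    """Uniformly sample `num_frames` RGB frames from a video (sparse sampling)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame_bgr = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```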

Also, researchers have explored webpage generation, moving from simple UI-to-code transformations to more complex tasks using large vision-language models trained on synthetic datasets. However, these approaches often lack diversity and real-world applicability. To align model outputs with human preferences, techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been adapted for multimodal LVLMs, focusing on reducing hallucinations and improving response quality.
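For reference, the core Direct Preference Optimization objective mentioned above fits in a few lines of PyTorch. The sketch below is the standard DPO loss computed from pre-summed log-probabilities of a chosen and a rejected response under the policy and a frozen reference model; it is a generic formulation, not the exact multimodal variant used in this line of work, and the beta value is an assumption.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: push the policy to prefer the chosen response
    over the rejected one, relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin; lower loss = stronger preference.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice, the chosen/rejected pairs come from human or model preference annotations over candidate responses.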

Researchers from Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, SenseTime Group, and Tsinghua University have introduced InternLM-XComposer-2.5 (IXC-2.5), representing a significant advancement in LVLMs, offering versatility and long-context capabilities. This model excels in comprehension and composition tasks, including free-form text-image conversations, OCR, video understanding, article composition, and webpage crafting. IXC-2.5 supports a 24K interleaved image-text context window, extendable to 96K, enabling long-term human-AI interaction and content creation.
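For readers who want to try the model, the released checkpoints are distributed through the Hugging Face Hub. The sketch below assumes the Hub ID `internlm/internlm-xcomposer2d5-7b` and the repository's custom `chat()` helper loaded via `trust_remote_code`; the exact argument names and defaults may differ from the official example, so treat this as a starting point rather than the canonical API.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed Hub ID; the custom modeling code is loaded via trust_remote_code.
MODEL_ID = "internlm/internlm-xcomposer2d5-7b"

model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Single-image question answering; the chat() signature here is an assumption
# based on earlier InternLM-XComposer releases -- check the model card.
query = "Describe the chart in this image and summarize the main trend."
images = ["./example_chart.png"]
with torch.no_grad():
    response, history = model.chat(tokenizer, query, images, do_sample=False)
print(response)
```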

The model introduces three key comprehension upgrades: ultra-high resolution understanding, fine-grained video analysis, and multi-turn multi-image dialogue support. For composition tasks, IXC-2.5 incorporates additional LoRA parameters, enabling webpage creation and high-quality text-image article composition. The latter benefits from Chain-of-Thought and Direct Preference Optimization techniques to enhance content quality.
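The additional LoRA parameters mentioned here follow the standard low-rank adaptation pattern: a frozen base projection plus a small trainable low-rank update. The sketch below is a generic LoRA linear layer, not the partial-LoRA variant used in IXC-2.5 (which applies the low-rank update selectively rather than to every token); the rank and scaling values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA adapter: y = W x + (alpha/r) * B(A(x)), with W frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter is trained
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```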

IXC-2.5 enhances its predecessors’ architecture with a ViT-L/14 Vision Encoder, InternLM2-7B Language Model, and Partial LoRA. It handles diverse inputs through a Unified Dynamic Image Partition strategy, processing images at 560×560 resolution with 400 tokens per sub-image. The model employs a scaled identity strategy for high-resolution images and treats videos as concatenated frames. Multi-image inputs are handled with interleaved formatting. IXC-2.5 also supports audio input/output using Whisper for transcription and MeloTTS for speech synthesis. This versatile architecture enables effective processing of various input types and complex tasks.
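The unified dynamic image-partition idea can be illustrated with a small helper that decides how to tile a high-resolution input into 560×560 sub-images plus a coarse global view. The sketch below is a simplified reading of the strategy described above; the exact grid selection, padding, and token-budget handling in IXC-2.5 may differ.

```python
import math
from PIL import Image

TILE = 560  # sub-image resolution reported for IXC-2.5

def partition_image(img: Image.Image, max_tiles: int = 9):
    """Tile a high-resolution image into 560x560 sub-images plus a global view.

    Simplified sketch: choose a grid roughly matching how many 560-pixel tiles
    fit along each side (capped at max_tiles), resize to fill that grid
    exactly, then crop the tiles. A downscaled copy serves as a global view.
    """
    w, h = img.size
    cols = min(math.ceil(w / TILE), max_tiles)
    rows = min(math.ceil(h / TILE), max(1, max_tiles // cols))

    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]
    global_view = img.resize((TILE, TILE))  # coarse whole-image thumbnail
    return global_view, tiles
```

Each tile would then be encoded separately (400 tokens per sub-image in IXC-2.5) and the resulting visual tokens interleaved with the text context.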

IXC-2.5 demonstrates exceptional performance across various benchmarks. In video understanding, it outperforms open-source models in 4 out of 5 benchmarks, matching closed-source APIs. For structural high-resolution tasks, IXC-2.5 competes with larger models, excelling in form and table understanding. It significantly improves multi-image multi-turn comprehension, outperforming previous models by 13.8% on the MMDU benchmark. In general visual QA tasks, IXC-2.5 matches or surpasses both open-source and closed-source models, notably outperforming GPT-4V and Gemini-Pro on some challenges. For screenshot-to-code translation, IXC-2.5 even surpasses GPT-4V in average performance, showcasing its versatility and effectiveness across diverse multimodal tasks.

IXC-2.5 represents a significant advancement in Large Vision-Language Models, offering long-contextual input and output capabilities. This model excels in ultra-high resolution image analysis, fine-grained video comprehension, multi-turn multi-image dialogues, webpage generation, and article composition. Despite utilizing a modest 7B Large Language Model backend, IXC-2.5 demonstrates competitive performance across various benchmarks. This achievement paves the way for future research into more contextual multi-modal environments, potentially extending to long-context video understanding and interaction history analysis. Such advancements promise to enhance AI’s capacity to assist humans in diverse real-world applications, marking a crucial step forward in multimodal AI technology.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

