MarkTechPost@AI, November 25, 2024
Insight-V: Empowering Multi-Modal Models with Scalable Long-Chain Reasoning

Insight-V tackles the hard problems of multimodal reasoning through a distinctive data-generation pipeline and a multi-agent framework. It improves reasoning capability and task-specific performance, delivers notable gains across multiple benchmarks, and lays a foundation for the development of multimodal reasoning models.

🎯 Insight-V combines data generation with a multi-agent framework to address the challenges of multimodal reasoning

📈 A structured dataset of more than 200K reasoning samples and over 1.2 million summarization examples

💪 Multi-granularity path assessment produces diverse, coherent reasoning paths, improving reasoning accuracy

🌟 Strong results across multiple benchmarks, advancing the development of multimodal reasoning models

Enabling multimodal large language models (MLLMs) to perform complex long-chain reasoning over both text and vision remains one of the higher barriers in artificial intelligence. While text-centric reasoning tasks have advanced steadily, multimodal tasks pose additional challenges rooted in the scarcity of rich, comprehensive reasoning datasets and efficient training strategies. Many current models reason inaccurately when presented with complex image-based data, which limits their usefulness in real-world applications such as autonomous systems, medical diagnosis, and educational materials.

Traditional methods for enhancing reasoning capability rely largely on Chain-of-Thought (CoT) prompting or structured datasets. These approaches have significant drawbacks. Crafting annotated datasets for visual reasoning is resource-intensive and demands extensive human effort, and performing reasoning and summarization in a single step often produces fragmented or incoherent reasoning chains. Moreover, given the shortage of suitable datasets and the naive, direct approach to training these systems, they cannot generalize effectively across a variety of tasks. These constraints call for new methodologies to strengthen the reasoning capability of multimodal AI systems.

Researchers from NTU, Tencent, Tsinghua University, and Nanjing University introduced Insight-V to address these challenges through a combination of scalable data generation and a multi-agent framework. Insight-V generates diverse, coherent reasoning paths incrementally, applying multi-granularity assessment to the paths to ensure their quality. A dedicated multi-agent system decomposes the task into two specialized roles: a reasoning agent, which generates detailed logical steps, and a summary agent, which validates and refines those outputs for accuracy. By leveraging iterative Direct Preference Optimization (DPO), a preference-based alignment method, the system aligns its outputs with human-like judgment. This collaborative architecture yields significant gains in reasoning accuracy and task-specific performance.
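The two-role decomposition described above can be sketched as follows. This is a minimal illustration, not Insight-V's actual implementation: the function names and prompt templates are hypothetical, and `call_model` is a stub standing in for what would, in the real system, be two separately fine-tuned multimodal models.

```python
def call_model(prompt: str) -> str:
    """Stub MLLM call: returns canned text so the sketch runs end to end."""
    if "step by step" in prompt:
        return ("Step 1: read the chart axes. Step 2: compare bar heights. "
                "Step 3: the 2023 bar is tallest.")
    return "Final answer: 2023."

def reasoning_agent(question: str, image_desc: str) -> str:
    """Generates a detailed chain of intermediate reasoning steps."""
    return call_model(
        f"Image: {image_desc}\nQuestion: {question}\nThink step by step."
    )

def summary_agent(question: str, reasoning: str) -> str:
    """Validates the reasoning chain and distills it into a final answer."""
    return call_model(
        f"Question: {question}\nReasoning: {reasoning}\nGive the final answer."
    )

question = "Which year has the tallest bar?"
steps = reasoning_agent(question, "a bar chart of yearly sales, 2021-2023")
answer = summary_agent(question, steps)
print(answer)  # Final answer: 2023.
```

The key design point is the separation of concerns: the reasoning agent is free to produce long, detailed chains, while the summary agent acts as a critic that filters errors out of the final answer.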

Insight-V is trained on a structured dataset of more than 200K reasoning samples and over 1.2 million summarization examples, derived from benchmarks such as LLaVA-NeXT and other curated data. The reasoning agent produces detailed step-by-step solutions to logical problems, while the summary agent critically evaluates and refines those steps to reduce errors. Training begins with role-specific supervised fine-tuning and progresses to iterative preference optimization, bringing the outputs closer to actual human decision-making. This staged training supports robust generalization across domains and complex reasoning tasks.
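The preference-optimization stage uses DPO, which trains the policy directly on chosen/rejected response pairs without a separate reward model. A minimal, framework-free sketch of the per-pair loss follows; the β value and the log-probabilities in the example are illustrative, not taken from the paper.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Inputs are summed token log-probabilities of whole responses under the
    trainable policy and under the frozen reference model.
    """
    # Implicit rewards: how much more likely each response is under the
    # policy than under the reference model, scaled by beta.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # -log(sigmoid(margin)): small when the policy favors the chosen response.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss falls as the policy shifts probability toward preferred responses.
loss_aligned = dpo_loss(-10.0, -20.0, -15.0, -15.0)     # policy prefers chosen
loss_misaligned = dpo_loss(-20.0, -10.0, -15.0, -15.0)  # policy prefers rejected
assert loss_aligned < loss_misaligned
```

In the iterative variant the model trained on one round of preference pairs is used to generate responses for the next round, and fresh pairs are collected against it, so alignment improves over successive rounds.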

The system delivers major improvements on multimodal reasoning benchmarks, with a mean relative improvement of 7.0% over LLaVA-NeXT and 2.9% over the baseline model. Insight-V improves performance on tasks such as detailed chart analysis and mathematical reasoning, and it also generalizes well on perception-focused benchmarks such as TextVQA. The consistent gains across these tasks validate the system's utility and establish it as a notable advance in multimodal reasoning models.

Insight-V offers a transformative framework for addressing key challenges in multimodal reasoning by integrating innovative data-generation techniques with a collaborative multi-agent architecture. Its main contributions are improved reasoning over structured datasets, task-specific decomposition, and preference-based optimization. This work equips MLLMs to handle reasoning-intensive tasks effectively while remaining versatile across domains. Insight-V thus serves as a foundation for building systems capable of complex reasoning in challenging visual-linguistic environments.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


