MarkTechPost@AI January 12
ProVision: A Scalable Programmatic Approach to Vision-Centric Instruction Data for Multimodal Language Models

PROVISION is a scalable programmatic system that uses scene graphs as symbolic representations of images to generate vision-centric instruction data for training multimodal language models. By combining human-written programs with automatically or manually created scene graphs, the system ensures interpretability, accuracy, and scalability while avoiding the hallucinations and licensing restrictions common to approaches driven by large language models (LLMs) or multimodal language models (MLMs). Generating over 10 million data points from Visual Genome and DataComp, PROVISION markedly improves MLM performance across a range of vision tasks, with gains of up to 8% on benchmarks such as CVBench, QBench2, and Mantis-Eval.

💡The PROVISION system uses scene graphs as symbolic image representations, combined with human-written programs, to generate visual instruction data, addressing the high cost and hallucination-proneness of conventional methods.

🖼️The system employs 24 single-image and 14 multi-image instruction generators to create diverse question-answer pairs covering object attributes, relations, spatial depth, and other vision tasks, strengthening the visual understanding of multimodal models.

📊The PROVISION-10M dataset contains over 10 million instruction samples; using it in both the pretraining and fine-tuning stages improves MLM performance by up to 8% on benchmarks such as CVBench, demonstrating the method's effectiveness.

⚙️The system adopts a modular framework that supports user customization, allowing tailored data to be generated for different visual reasoning and multimodal AI applications with high flexibility and scalability.

The rise of multimodal applications has highlighted the importance of instruction data in training MLMs to handle complex image-based queries effectively. Current practices for generating such data rely on LLMs or MLMs, which, despite their effectiveness, face several challenges. These include high costs, licensing restrictions, and susceptibility to hallucinations—generating inaccurate or unreliable content. Additionally, the generation process is often opaque, making it difficult to customize or interpret outputs, limiting its scalability and reliability. Visual instruction data is crucial for enabling MLMs to respond effectively to user queries about input images, but existing methods for its collection and generation remain constrained by these issues.

Recent advancements in MLMs, such as the LLaVA and InstructBLIP models, have leveraged multimodal data to achieve remarkable results in visual-language tasks. However, despite significant progress, these models often underperform in vision-specific tasks like depth estimation and localization due to the limited availability of instruction data for such tasks. While most synthetic data methods rely on LLMs, MLMs, or diffusion models, programmatic approaches like those used in GQA and AGQA focus primarily on evaluation. Unlike these methods, newer approaches aim to generate adaptable single- and multi-image instruction data for training, addressing the limitations of existing techniques and broadening the scope of multimodal learning.

Researchers from the University of Washington, Salesforce Research, and the University of Southern California introduced PROVISION. This scalable programmatic system uses scene graphs as symbolic image representations to generate vision-centric instruction data. By combining human-written programs with automatically or manually created scene graphs, PROVISION ensures interpretability, accuracy, and scalability while avoiding hallucinations and licensing constraints common in LLM/MLM-driven methods. The system generates over 10 million data points (PROVISION-10M) from Visual Genome and DataComp, covering diverse tasks like object, attribute, and depth-based queries. This data improves MLM performance, yielding up to 8% gains on benchmarks like CVBench, QBench2, and Mantis-Eval across pretraining and fine-tuning stages.

The study introduces a method for generating vision-centric instruction data using augmented scene graphs enhanced with depth and segmentation labels. For single-image scenarios, 24 generators create diverse question-answer pairs using pre-defined templates, focusing on object attributes, relations, and spatial depth. Multi-image generators enable advanced reasoning tasks like comparison and aggregation across scene graphs. The scene graph generation pipeline integrates object detection (YOLO-world), segmentation (SAM-2), attribute detection (finetuned CoCa and LLaVA-1.5), relation extraction (Osprey), and depth estimation (Depth Anything V2). The modular framework supports customization, enabling users to create diverse data for visual reasoning and multimodal AI applications.
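The generator idea described above can be illustrated with a small sketch. The scene graph schema, the generator functions, and the templates below are hypothetical simplifications for illustration, not ProVision's actual code or data format: each "generator" is a plain program that walks a scene graph annotated with attributes, relations, and depth, and fills a question-answer template.

```python
# Hypothetical toy scene graph: objects with attributes and depth labels,
# plus relation triples. Illustrative only, not ProVision's real schema.
scene_graph = {
    "objects": {
        "cup": {"attributes": ["red", "ceramic"], "depth": 1.2},
        "table": {"attributes": ["wooden"], "depth": 1.5},
    },
    "relations": [("cup", "on", "table")],
}

def gen_attribute_qa(graph):
    """Attribute generator: one QA pair per (object, attribute)."""
    pairs = []
    for name, obj in graph["objects"].items():
        for attr in obj["attributes"]:
            pairs.append((f"What is one attribute of the {name}?", attr))
    return pairs

def gen_relation_qa(graph):
    """Relation generator: one QA pair per relation triple."""
    return [(f"What is the relation between the {s} and the {o}?", r)
            for s, r, o in graph["relations"]]

def gen_depth_qa(graph):
    """Depth generator: compare depth labels to ask which object is closer."""
    objs = sorted(graph["objects"].items(), key=lambda kv: kv[1]["depth"])
    near, far = objs[0][0], objs[-1][0]
    return [(f"Which is closer to the camera, the {near} or the {far}?", near)]

GENERATORS = [gen_attribute_qa, gen_relation_qa, gen_depth_qa]

def synthesize(graph):
    """Run every generator over the scene graph and pool the QA pairs."""
    data = []
    for gen in GENERATORS:
        data.extend(gen(graph))
    return data

for question, answer in synthesize(scene_graph):
    print(f"Q: {question}  A: {answer}")
```

Because each generator is explicit code over symbolic annotations, every answer is traceable to a graph entry, which is what gives the approach its interpretability and freedom from hallucination; scaling to 24 single-image generators (or multi-image generators that take several graphs) is a matter of adding more such functions.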

The experiments involve synthesizing instruction data to improve model performance. Results show that manually annotated scene graphs outperform those generated by models, and both data format (short answer vs. multiple choice) and data scale significantly impact outcomes. Incorporating synthesized data in both pre-training and fine-tuning stages yields optimal results. The PROVISION-10M dataset was constructed using Visual Genome’s manually annotated scene graphs and generated scene graphs from high-resolution images, producing over 10 million instruction samples. These were tested in augmentation and replacement settings across various benchmarks, demonstrating the effectiveness of scene graphs for creating useful instructions, whether real or automatically generated.

In conclusion, the PROVISION system generates vision-centric instruction data for MLMs using scene graph representations and human-written programs. Applied to Visual Genome and DataComp, it creates PROVISION-10M, a dataset with over 10 million instructions, improving MLM performance during pretraining and instruction tuning. The system uses 24 single-image and 14 multi-image instruction generators, producing diverse queries about objects, attributes, and relationships. PROVISION achieves performance gains of up to 8% on benchmarks like CVBench and Mantis-Eval. While limitations include dependency on scene graph quality and human-written programs, future enhancements may improve automation and scalability using LLMs.




