MarkTechPost@AI | July 17, 2024
ChartGemma: A Multimodal Model Instruction-Tuned on Data Generated Directly from a Diverse Range of Real-World Chart Images


Unlike existing chart understanding models, ChartGemma is trained on data generated directly from chart images rather than on underlying data tables, allowing it to better capture the visual information in charts.

ChartGemma is built on the PaliGemma architecture, combining the SigLIP vision encoder with the Gemma-2B language model, enabling it to handle more complex chart structures and produce more accurate summaries and answers.

ChartGemma performs strongly across five benchmarks, including chart question answering, chart fact-checking, ChartCheck, open-ended chart question answering, and chart-to-text, demonstrating its strength in chart understanding and reasoning.

The results show that training on data generated directly from chart images significantly improves model performance and generalization on chart understanding and reasoning tasks.

The work points to directions for future chart understanding and reasoning models, such as building more diverse, human-instructed training datasets and developing more comprehensive benchmarks with metrics suited to evaluating complex visual elements in charts.

Charts are essential tools in various fields, but current models for chart understanding have limitations. They often rely on data tables rather than visual patterns and use weakly aligned vision-language models, limiting their effectiveness with complex charts. Although language-augmented vision models perform well on general tasks, they struggle with specialized chart analysis. Researchers have tried instruction-tuning these models for better chart comprehension, but data quality and model alignment issues persist. A simpler, more robust approach is needed to build a foundation model for effective chart understanding and reasoning in diverse, real-world scenarios.

Researchers from York University, MILA – Quebec AI Institute, Salesforce Research, and Nanyang Technological University developed ChartGemma, an advanced chart understanding and reasoning model. Unlike existing models, ChartGemma is trained on data generated directly from chart images, capturing detailed visual information. Built on the PaliGemma backbone, it is smaller and more efficient than other models. ChartGemma achieves state-of-the-art results in chart summarization, question answering, and fact-checking across five benchmarks. Qualitative studies show it generates realistic and accurate summaries, making it highly effective for real-world chart analysis.

Chart representation learning has evolved from models fine-tuned from language or vision-language bases to those pre-trained with chart-specific objectives. Instruction-tuning of pre-trained vision-language models (VLMs) has been explored to broaden chart applicability, but these methods rely on underlying data tables and weakly aligned VLMs. Benchmarks for chart modeling range from question answering to open-ended tasks such as explanation generation and summarization. Instruction-tuning has proven effective for generalizing language models across tasks and is now standard for multimodal VLMs. However, domain-specific instruction-tuning for charts using data tables fails to capture the complexity of real-world charts, limiting model effectiveness; a hypothetical image-grounded training record is sketched below.
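To make that distinction concrete, here is a hypothetical record of the kind an image-grounded instruction-tuning corpus might contain. The field names and file path are illustrative only, not the paper's published schema:

```python
# Hypothetical schema for one image-grounded instruction-tuning example.
# Both the instruction and the response are generated from the rendered
# chart image itself; no underlying data table is involved.
example = {
    "image": "charts/unemployment_2010_2020.png",  # rendered chart image
    "instruction": "Summarize the main trend shown in this chart.",
    "response": "Unemployment declines steadily from 2010 to 2019, "
                "then spikes sharply in 2020.",
}
```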

ChartGemma uses the PaliGemma architecture, featuring the SigLIP vision encoder and the Gemma-2B language model. The vision encoder processes 448×448-pixel images, converting them into visual tokens mapped into the language model’s embedding space. These tokens are then combined with text embeddings and processed by the Gemma-2B model, which uses full attention over input tokens and causal masking over output tokens to enhance contextual understanding. Unlike existing chart VLLMs that require a two-stage training approach, ChartGemma employs a single-stage method, fine-tuning directly on instruction-tuning data. This is facilitated by PaliGemma’s extensive pre-training on diverse image-text pairs, allowing for better adaptability and generalization.
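For orientation, the minimal sketch below shows how a PaliGemma-style model is typically queried with the Hugging Face transformers classes. The checkpoint name refers to the public PaliGemma base at 448×448 resolution (the paper's backbone), not a ChartGemma release; substitute the actual ChartGemma weights where available.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Public PaliGemma base at 448x448 resolution; swap in the released
# ChartGemma weights if you have access to them.
MODEL_ID = "google/paligemma-3b-pt-448"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("chart.png").convert("RGB")
prompt = "What was the highest value shown in the chart?"

# The processor resizes the image to 448x448; the SigLIP encoder turns it
# into visual tokens that are prepended to the text embeddings.
inputs = processor(text=prompt, images=image, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the echoed prompt tokens before decoding the answer.
answer_ids = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(answer_ids, skip_special_tokens=True))
```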

ChartGemma is compared with various open-source chart-specialist models, VLLMs tuned on chart data, and state-of-the-art closed-source multimodal LLMs. It is evaluated on five benchmarks assessing chart representation and reasoning abilities: ChartQA, ChartFC, ChartCheck, OpenCQA, and Chart2Text, along with a manually curated set of 100 unseen charts. Performance metrics include relaxed accuracy, accuracy, and GPT-4-judged informativeness and factual correctness. ChartGemma outperforms other models on most tasks, demonstrating superior generalization, especially in understanding realistic instructions and complex charts, despite its relatively small size.
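As a reference point, "relaxed accuracy" is the convention popularized by ChartQA: numeric answers count as correct within a small relative tolerance (5%), while textual answers must match exactly. A minimal sketch, assuming that definition:

```python
def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed match: numeric answers pass within a 5%
    relative tolerance; non-numeric answers must match exactly after
    light normalization."""
    pred, gold = prediction.strip().lower(), target.strip().lower()
    try:
        p, g = float(pred.rstrip("%")), float(gold.rstrip("%"))
    except ValueError:
        return pred == gold  # textual answer: exact match
    if g == 0.0:
        return p == 0.0      # avoid division by zero
    return abs(p - g) / abs(g) <= tolerance

# e.g. relaxed_accuracy("41.8", "42") -> True (within 5%)
#      relaxed_accuracy("39", "42")   -> False (off by ~7%)
```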

ChartGemma, a multimodal model instruction-tuned on data generated from diverse real-world chart images using an advanced backbone architecture, addresses key shortcomings of current models. Unlike existing methods that generate instruction-tuning data from underlying tables and use weakly aligned backbones, ChartGemma uses actual chart images, enhancing adaptability and generalizability. The approach significantly improves performance, producing more realistic, informative, and factually correct outputs with a smaller parameter count. Future work includes creating a more diverse, human-instructed tuning dataset and proposing a generalized benchmark for evaluating complex visual elements in charts with relevant metrics.


Check out the Paper. All credit for this research goes to the researchers of this project.

