MarkTechPost@AI · February 8
IBM AI Releases Granite-Vision-3.1-2B: A Small Vision Language Model with Super Impressive Performance on Various Tasks

IBM has released Granite-Vision-3.1-2B, a compact vision-language model designed for document understanding. The model can extract content from diverse visual formats such as tables, charts, and diagrams, making it suitable for applications in data analysis, information retrieval, and decision-making. Trained on a carefully curated dataset, Granite-Vision-3.1-2B integrates image and text modalities to improve its interpretative capabilities. By using SigLIP to process visual data, a two-layer MLP to bridge visual and textual information, and a large language model based on Granite-3.1-2B-Instruct, the model performs well on multiple benchmarks, particularly in document understanding, giving enterprise applications strong visual and textual data-processing capabilities.

🖼️ Granite-Vision-3.1-2B is a compact vision-language model released by IBM that tackles the complex challenge of integrating visual and textual data in AI, and is particularly strong at processing structured visual documents such as tables and charts.

⚙️ The model is built from three key components: a Vision Encoder that uses SigLIP to process visual data efficiently; a Vision-Language Connector, a two-layer MLP that bridges visual and textual information; and a Large Language Model based on Granite-3.1-2B-Instruct with a 128k context length.

📊 In evaluations, Granite-Vision-3.1-2B performs well across multiple benchmarks: it scored 0.86 on ChartQA, surpassing other models in the 1B-4B parameter range, and 0.76 on TextVQA, demonstrating strong interpretation of textual information embedded in images.

☁️ The model natively supports transformers and vLLM, adapts to a wide range of use cases, and can be deployed in cloud environments such as a Colab T4, giving researchers and practitioners a practical tool for AI-driven document processing.

The integration of visual and textual data in artificial intelligence presents a complex challenge. Traditional models often struggle to interpret structured visual documents such as tables, charts, infographics, and diagrams with precision. This limitation affects automated content extraction and comprehension, which are crucial for applications in data analysis, information retrieval, and decision-making. As organizations increasingly rely on AI-driven insights, the need for models capable of effectively processing both visual and textual information has grown significantly.

IBM has addressed this challenge with the release of Granite-Vision-3.1-2B, a compact vision-language model designed for document understanding. This model is capable of extracting content from diverse visual formats, including tables, charts, and diagrams. Trained on a well-curated dataset comprising both public and synthetic sources, it is designed to handle a broad range of document-related tasks. Fine-tuned from a Granite large language model, Granite-Vision-3.1-2B integrates image and text modalities to improve its interpretative capabilities, making it suitable for various practical applications.

The model consists of three key components:

    Vision Encoder: Uses SigLIP to process and encode visual data efficiently.
    Vision-Language Connector: A two-layer multilayer perceptron (MLP) with GELU activation functions, designed to bridge visual and textual information.
    Large Language Model: Built upon Granite-3.1-2B-Instruct, featuring a 128k context length for handling complex and extensive inputs.
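
The flow of data through these three components can be sketched as follows. This is a conceptual illustration rather than IBM's implementation: the module name and the dimensions are illustrative assumptions, not the model's published configuration.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP with GELU that projects vision features into the
    LLM's embedding space. Dimensions are illustrative assumptions."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from SigLIP
        return self.proj(patch_features)

# Conceptual flow:
#   image tiles -> SigLIP encoder -> patch embeddings
#   patch embeddings -> connector -> "image tokens" in the LLM space
#   [image tokens ; text tokens] -> Granite-3.1-2B-Instruct -> answer
```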

The training process builds on LLaVA and incorporates multi-layer encoder features, along with a denser grid resolution in AnyRes. These enhancements improve the model’s ability to understand detailed visual content. This architecture allows the model to perform various visual document tasks, such as analyzing tables and charts, executing optical character recognition (OCR), and answering document-based queries with greater accuracy.
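
The AnyRes idea can be illustrated with a simple tiling routine: a high-resolution page is cut into encoder-sized tiles plus a downscaled overview, so fine print in tables and charts survives encoding. The tile size and grid shape below are assumptions for illustration, not the model's actual settings.

```python
from PIL import Image

def anyres_tiles(image: Image.Image, tile: int = 384, grid: tuple = (2, 2)) -> list:
    """Cut a page image into a grid of encoder-sized tiles plus one
    downscaled overview tile; a denser grid preserves more fine detail."""
    cols, rows = grid
    resized = image.resize((tile * cols, tile * rows))
    tiles = [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows)
        for c in range(cols)
    ]
    overview = image.resize((tile, tile))  # low-res tile for global layout
    return tiles + [overview]  # each tile is encoded separately by the vision encoder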

Evaluations indicate that Granite-Vision-3.1-2B performs well across multiple benchmarks, particularly in document understanding. For example, it achieved a score of 0.86 on the ChartQA benchmark, surpassing other models within the 1B-4B parameter range. On the TextVQA benchmark, it attained a score of 0.76, demonstrating strong performance in interpreting and responding to questions based on textual information embedded in images. These results highlight the model’s potential for enterprise applications requiring precise visual and textual data processing.

IBM’s Granite-Vision-3.1-2B represents a notable advancement in vision-language models, offering a well-balanced approach to visual document understanding. Its architecture and training methodology allow it to efficiently interpret and analyze complex visual and textual data. With native support for transformers and vLLM, the model is adaptable to various use cases and can be deployed in cloud-based environments such as Colab T4. This accessibility makes it a practical tool for researchers and professionals looking to enhance AI-driven document processing capabilities.
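
As a starting point for experimentation, the model can be loaded through the standard Hugging Face transformers vision-to-text pattern. The snippet below is a minimal sketch: the image file and question are placeholders, and the exact class choices and prompt format should be confirmed against the model card.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-vision-3.1-2b-preview"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(device)

# Placeholder inputs: any document image and question will do.
image = Image.open("chart.png")
conversation = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the highest value shown in this chart?"},
    ],
}]

# Build the chat prompt, then pack image and text together.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```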


Check out the ibm-granite/granite-vision-3.1-2b-preview and ibm-granite/granite-3.1-2b-instruct model cards. All credit for this research goes to the researchers of this project.



