MarkTechPost@AI · June 4, 14:55
NVIDIA AI Releases Llama Nemotron Nano VL: A Compact Vision-Language Model Optimized for Document Understanding

NVIDIA has launched Llama Nemotron Nano VL, a vision-language model designed for document-understanding tasks. Built on the Llama 3.1 architecture and paired with a lightweight vision encoder, it aims to process complex document structures such as scanned forms, financial reports, and technical diagrams efficiently and accurately. It supports multimodal input, including visual and textual elements across multi-page documents, and is optimized for token-efficient inference. The model performs strongly on the OCRBench v2 benchmark, outperforming comparable models particularly at extracting structured data and answering layout-dependent queries, and it supports multiple deployment options.

🖼️ Llama Nemotron Nano VL is a vision-language model (VLM) built for efficient document-understanding tasks. Based on the Llama 3.1 architecture and combined with a lightweight vision encoder, it is particularly suited to processing complex documents such as scanned forms, financial reports, and technical diagrams.

⚙️ The model architecture integrates the CRadioV2-H vision encoder with a Llama 3.1 8B Instruct-tuned language model, jointly processing multimodal inputs, including visual and textual elements across multi-page documents. It is optimized for token-efficient inference, supports context lengths of up to 16K, and accepts multiple images alongside text input.

📊 On the OCRBench v2 benchmark, Llama Nemotron Nano VL delivers excellent performance, especially at extracting structured data (such as tables and key-value pairs) and answering layout-dependent queries. The model also generalizes well to non-English documents and degraded scans.

🚀 Llama Nemotron Nano VL supports flexible deployment, covering both server and edge inference. NVIDIA provides a 4-bit quantized version (AWQ) for efficient inference with TinyChat and TensorRT-LLM, compatible with constrained environments such as Jetson Orin. Key technical features include modular NIM (NVIDIA Inference Microservice) support, ONNX and TensorRT export support, and a precomputed vision embeddings option.

NVIDIA has introduced Llama Nemotron Nano VL, a vision-language model (VLM) designed to address document-level understanding tasks with efficiency and precision. Built on the Llama 3.1 architecture and coupled with a lightweight vision encoder, this release targets applications requiring accurate parsing of complex document structures such as scanned forms, financial reports, and technical diagrams.

Model Overview and Architecture

Llama Nemotron Nano VL integrates the CRadioV2-H vision encoder with a Llama 3.1 8B Instruct-tuned language model, forming a pipeline capable of jointly processing multimodal inputs — including multi-page documents with both visual and textual elements.

The architecture is optimized for token-efficient inference, supporting up to 16K context length across image and text sequences. The model can process multiple images alongside textual input, making it suitable for long-form multimodal tasks. Vision-text alignment is achieved via projection layers and rotary positional encoding tailored for image patch embeddings.
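As a rough illustration of that alignment step, the sketch below shows the common pattern of projecting vision-encoder patch embeddings into the language model's token embedding space before interleaving them with text tokens. This is hypothetical PyTorch code, not NVIDIA's implementation; the 1280-dim vision width and the MLP projector shape are assumed values (4096 is Llama 3.1 8B's hidden size).

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Hypothetical projection layer mapping vision-encoder patch
    embeddings into the LLM's token embedding space."""
    def __init__(self, vision_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        # A small MLP is a common choice for vision-text alignment.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, vision_dim)
        return self.proj(patch_embeds)  # (batch, num_patches, llm_dim)

# Projected patches are concatenated with text token embeddings so the
# decoder attends over one mixed sequence within the 16K context window.
vision_tokens = VisionProjector()(torch.randn(1, 256, 1280))
text_tokens = torch.randn(1, 64, 4096)  # stand-in for embedded text
mixed_sequence = torch.cat([vision_tokens, text_tokens], dim=1)
print(mixed_sequence.shape)  # torch.Size([1, 320, 4096])
```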

Training was conducted in three phases, all performed using NVIDIA's Megatron-LLM framework with the Energon dataloader and distributed over clusters of A100 and H100 GPUs.

Benchmark Results and Evaluation

Llama Nemotron Nano VL was evaluated on OCRBench v2, a benchmark designed to assess document-level vision-language understanding across OCR, table parsing, and diagram reasoning tasks. OCRBench v2 includes 10,000+ human-verified QA pairs spanning documents from domains such as finance, healthcare, legal, and scientific publishing.

Results indicate that the model achieves state-of-the-art accuracy among compact VLMs on this benchmark. Notably, its performance is competitive with larger, less efficient models, particularly in extracting structured data (e.g., tables and key-value pairs) and answering layout-dependent queries.
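To make "layout-dependent queries" concrete, here is an invented illustration (not an actual OCRBench v2 item) of the kind of question whose answer depends on document layout, such as locating a totals box relative to a line-item table:

```python
# Invented illustration of a layout-dependent, structured-extraction query
# (not an actual OCRBench v2 item).
document = "scanned invoice: line-item table with a totals box bottom-right"
query = "What is the total amount due, and by what date must it be paid?"

# The model must read the totals box in relation to the table layout,
# then return structured key-value output rather than free text:
expected = {"total_due": "$4,820.00", "due_date": "2025-06-30"}
print(expected)
```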

[Benchmark results table: figures updated as of June 3, 2025.]

The model also generalizes across non-English documents and degraded scan quality, reflecting its robustness under real-world conditions.

Deployment, Quantization, and Efficiency

Designed for flexible deployment, Nemotron Nano VL supports both server and edge inference scenarios. NVIDIA provides a quantized 4-bit version (AWQ) for efficient inference using TinyChat and TensorRT-LLM, with compatibility for Jetson Orin and other constrained environments.
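As a rough sketch of what loading the model could look like: the article names TinyChat and TensorRT-LLM as the intended runtimes for the AWQ checkpoint, so the snippet below instead shows the more generic Hugging Face transformers pattern. The repository id is a placeholder and the exact multimodal preprocessing API (how images are attached) is model-specific, so the official model card, not this sketch, is authoritative.

```python
# Hedged sketch: the repository id below is a PLACEHOLDER, not a confirmed
# id -- consult the official model card for the real id and multimodal API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/llama-nemotron-nano-vl"  # placeholder id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # quantized weights compute in half precision
    device_map="auto",          # single GPU or Jetson-class device
    trust_remote_code=True,     # VLM repos often ship custom processing code
)

# Image attachment is omitted here: the image-preprocessing call is
# defined by the model card's processor class, not by this sketch.
prompt = "Extract all key-value pairs from the attached scanned form."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```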

Key technical features include:

- Modular NIM (NVIDIA Inference Microservice) support, simplifying API integration
- ONNX and TensorRT export support for hardware acceleration compatibility
- A precomputed vision embeddings option, enabling lower latency for static image documents (see the sketch below)
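The precomputed vision embeddings option deserves a concrete illustration: for static document collections, pages can be encoded once and cached so that repeated queries only pay for language-model decoding. Below is a minimal sketch of that caching pattern; `encode_image` and `answer_with_embeddings` are hypothetical stand-ins for whatever the deployed runtime (a TensorRT engine or NIM endpoint) actually exposes.

```python
# Minimal sketch of the precomputed-vision-embeddings pattern. The two
# stand-in functions below are HYPOTHETICAL; a real deployment would call
# the exported TensorRT engine or a NIM endpoint instead.
from functools import lru_cache

def encode_image(page_path: str) -> list:
    """Stand-in for the vision encoder (run once per unique page)."""
    return [0.0] * 1280  # placeholder embedding vector

def answer_with_embeddings(query: str, embeddings: list) -> str:
    """Stand-in for LLM decoding over cached vision embeddings."""
    return f"(answer to {query!r} over {len(embeddings)} cached pages)"

@lru_cache(maxsize=1024)
def cached_page_embedding(page_path: str) -> tuple:
    # Repeated queries against the same page hit the cache, so only the
    # language-model decode is paid per query.
    return tuple(encode_image(page_path))

def answer(query: str, page_paths: tuple) -> str:
    embeddings = [cached_page_embedding(p) for p in page_paths]
    return answer_with_embeddings(query, embeddings)

print(answer("What is the invoice total?", ("invoice_p1.png",)))
```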

Conclusion

Llama Nemotron Nano VL represents a well-engineered tradeoff between performance, context length, and deployment efficiency in the domain of document understanding. Its architecture—anchored in Llama 3.1 and enhanced with a compact vision encoder—offers a practical solution for enterprise applications that require multimodal comprehension under strict latency or hardware constraints.

By leading OCRBench v2 among compact VLMs while maintaining a deployable footprint, Nemotron Nano VL positions itself as a viable model for tasks such as automated document QA, intelligent OCR, and information extraction pipelines.


Check out the Technical details and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.

