MarkTechPost@AI — December 7, 2024
NVIDIA AI Introduces NVILA: A Family of Open Visual Language Models (VLMs) Designed to Optimize Both Efficiency and Accuracy

NVIDIA has introduced NVILA, a family of open visual language models (VLMs) designed to optimize both efficiency and accuracy. NVILA adopts a "scale-then-compress" approach: it raises spatial and temporal resolution to preserve detail in visual inputs, then compresses them into fewer, denser tokens. This enables NVILA to handle high-resolution images and long video sequences efficiently, cutting training costs by 4.5×, reducing fine-tuning memory requirements by 3.4×, and speeding up inference by 1.6 to 2.8×, while performing strongly across multiple benchmarks.

🚀 NVILA's "scale-then-compress" strategy raises image resolution (e.g., to 896×896 pixels) and applies token compression to retain the important information while cutting the token count, letting it handle high-resolution images efficiently.

⏳ For video inputs, NVILA applies temporal compression to process more frames, striking a balance between accuracy and computational efficiency that suits areas such as robotic navigation.

🧠 NVILA also uses techniques such as FP8 mixed precision and dataset pruning to accelerate training and lower memory usage, while adaptive learning rates and parameter-efficient fine-tuning ensure the model can handle domain-specific tasks.

🏎️ For deployment, NVILA applies W8A8 quantization to the vision tower and W4A16 quantization to the language components, speeding up inference while preserving performance.

🏅 NVILA achieves up to 30% higher accuracy on tasks such as DocVQA and TextVQA, and its long-context capabilities outperform proprietary models such as GPT-4o and Gemini 1.5.

Visual language models (VLMs) have come a long way in integrating visual and textual data. Yet, they come with significant challenges. Many of today’s VLMs demand substantial resources for training, fine-tuning, and deployment. For instance, training a 7-billion-parameter model can take over 400 GPU days, which makes it inaccessible to many researchers. Fine-tuning is equally demanding, often requiring over 64GB of GPU memory, far exceeding what consumer hardware can handle. Deploying these models in environments with limited computational resources, such as edge devices or robotics, is another hurdle. These limitations highlight the urgent need for VLMs that are not only powerful but also efficient and scalable.

To tackle these challenges, NVIDIA has introduced NVILA, a family of open VLMs designed with efficiency and accuracy in mind. Building on the VILA model, NVILA adopts a “scale-then-compress” approach. This method increases spatial and temporal resolutions to preserve details in visual inputs and then compresses them into fewer, denser tokens. This combination allows NVILA to handle high-resolution images and long video sequences effectively.

NVILA’s design optimizes every stage of the model lifecycle. It reduces training costs by 4.5×, cuts fine-tuning memory requirements by 3.4×, and improves inference speeds by 1.6 to 2.8× compared to other VLMs. Importantly, these gains do not come at the expense of accuracy. NVILA performs on par with or better than leading models across many benchmarks, excelling in visual question answering, video understanding, and document processing tasks. NVIDIA also plans to release NVILA’s code and models, fostering greater accessibility and reproducibility.

Technical Details

At the heart of NVILA’s efficiency is its “scale-then-compress” strategy. Spatial scaling increases image resolutions to dimensions like 896×896 pixels, compared to the usual 448×448. To mitigate the computational cost of scaling, NVILA uses token compression to retain essential information while reducing the number of tokens. For video inputs, the model processes more frames by applying temporal compression, balancing accuracy and computational efficiency.
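The paper gives the full design of these compression modules; purely as an illustration, here is a minimal sketch in PyTorch, assuming a simple 2×2 space-to-depth merge for the spatial token compression and uniform frame averaging for the temporal compression. The function names, patch size, and dimensions below are hypothetical, not NVILA's actual modules:

```python
# Illustrative sketch of "scale-then-compress" (NOT NVIDIA's implementation).
# Assumes a 2x2 spatial merge (space-to-depth) and uniform temporal pooling.
import torch

def spatial_compress(tokens: torch.Tensor, merge: int = 2) -> torch.Tensor:
    """Merge each (merge x merge) block of visual tokens into one denser token.

    tokens: (batch, height, width, dim) grid of patch embeddings.
    Returns: (batch, height // merge, width // merge, dim * merge**2).
    """
    b, h, w, d = tokens.shape
    assert h % merge == 0 and w % merge == 0
    t = tokens.reshape(b, h // merge, merge, w // merge, merge, d)
    t = t.permute(0, 1, 3, 2, 4, 5)   # gather each merge x merge neighborhood
    return t.reshape(b, h // merge, w // merge, d * merge * merge)

def temporal_compress(frames: torch.Tensor, group: int = 4) -> torch.Tensor:
    """Average every `group` consecutive frames' tokens into one token set.

    frames: (batch, time, tokens_per_frame, dim).
    Returns: (batch, time // group, tokens_per_frame, dim).
    """
    b, t, n, d = frames.shape
    assert t % group == 0
    return frames.reshape(b, t // group, group, n, d).mean(dim=2)

# With (hypothetical) 14x14 patches, an 896x896 image yields a 64x64 token
# grid; the 2x2 merge cuts 4,096 tokens down to 1,024 denser ones.
img_tokens = torch.randn(1, 64, 64, 1152)
print(spatial_compress(img_tokens).shape)   # torch.Size([1, 32, 32, 4608])
```

The merge trades token count for per-token width, which is the "fewer, denser tokens" idea in miniature; scaling the input resolution first is what keeps fine detail available for the merged tokens to carry.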

NVILA incorporates further innovations to streamline training and fine-tuning. Techniques like FP8 mixed precision and dataset pruning accelerate training and lower memory usage. Adaptive learning rates and parameter-efficient fine-tuning ensure the model can handle domain-specific tasks without excessive resource demands. During deployment, NVILA uses advanced quantization—W8A8 for the vision tower and W4A16 for language components—to speed up inference while maintaining performance.
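The exact kernels belong to the deployment stack described in the paper; as a rough, hypothetical sketch of what W8A8 and W4A16 mean, here is simple symmetric per-channel quantization in PyTorch. Real deployments run the matmul in integer or low-precision arithmetic rather than dequantizing first, as done here for clarity:

```python
# Hypothetical sketch of W8A8 / W4A16 quantization, not NVIDIA's kernels.
import torch

def quantize_symmetric(x: torch.Tensor, bits: int, dim: int = -1):
    """Symmetric per-channel quantization; returns (integer levels, scale)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 7 for int4
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q * scale

w = torch.randn(4096, 1024)          # hypothetical weight matrix (out, in)
a = torch.randn(1, 1024)             # one activation vector

# W8A8 (vision tower): 8-bit weights AND 8-bit activations.
w8, ws = quantize_symmetric(w, bits=8, dim=1)
a8, sa = quantize_symmetric(a, bits=8, dim=1)
y_w8a8 = dequantize(a8, sa) @ dequantize(w8, ws).T

# W4A16 (language components): 4-bit weights, activations left in higher
# precision (16-bit in practice; float32 here for simplicity).
w4, s4 = quantize_symmetric(w, bits=4, dim=1)
y_w4a16 = a @ dequantize(w4, s4).T
```

Quantizing the language side's weights more aggressively than its activations is a split often chosen because language-model activations are harder to quantize than weights, though the source does not state NVILA's rationale.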

Performance Highlights

NVILA’s value lies in making advanced VLMs more accessible while addressing the need for efficient AI systems. Some key metrics include:

- Training costs reduced by 4.5× and fine-tuning memory requirements cut by 3.4×.
- Inference speeds improved by 1.6 to 2.8× compared to other VLMs.
- Up to 30% higher accuracy on tasks such as DocVQA and TextVQA, with long-context capabilities that outperform proprietary models such as GPT-4o and Gemini 1.5.

NVILA’s potential spans diverse fields, including robotics and healthcare. For example, its temporal localization capabilities make it ideal for robotic navigation, while its NVILA-M3 framework integrates expert models to improve diagnostic accuracy in medical imaging.

Conclusion

NVILA represents a meaningful step forward in the development of visual language models. By rethinking architecture and optimizing the entire lifecycle, NVIDIA has created a model that balances efficiency and accuracy. NVILA addresses the limitations of traditional VLMs and expands their applicability to resource-constrained and specialized environments. With NVIDIA’s commitment to open access, NVILA is set to inspire further research and innovation in AI.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


