MarkTechPost@AI · July 30, 15:21
Apple Researchers Introduce FastVLM: Achieving State-of-the-Art Resolution-Latency-Accuracy Trade-off in Vision Language Models

Researchers at Apple have introduced FastVLM, an innovative vision language model (VLM) designed to address the challenges of high-resolution image processing. By introducing the FastViTHD hybrid vision encoder, FastVLM significantly reduces the number of visual tokens and the encoding time, optimizing the trade-off between image quality, processing time, and model performance. Compared with existing models, FastVLM is both faster and more accurate across multiple benchmarks, particularly on text-rich and chart-dense data, setting a new bar for high-resolution image understanding in VLMs.

🖼️ FastVLM optimizes high-resolution image processing through the FastViTHD hybrid vision encoder, which is designed to reduce the token count and shorten encoding time, achieving an ideal balance among resolution, latency, and accuracy.

⏱️ FastVLM substantially improves time-to-first-token (TTFT) efficiency, running 3.2× faster than LLaVA-OneVision in the LLaVA-1.5 setup and achieving superior performance with a smaller vision encoder under the same 0.5B LLM.

🚀 Compared with models such as ConvLLaVA, FastVLM shows higher accuracy and faster processing on benchmarks like TextVQA and DocVQA, with the advantage most pronounced at high resolution, where it processes up to 2× faster.

💡 With rapid pretraining on a Qwen2-7B decoder, FastVLM completes stage-1 VLM training in roughly 30 minutes, demonstrating efficient training while achieving better results with fewer visual tokens.

🌟 FastVLM delivers competitive performance across multiple VLM benchmarks while markedly improving efficiency in TTFT and vision-backbone parameter count, and rigorous benchmarking on M1 MacBook Pro hardware shows a state-of-the-art resolution-latency-accuracy trade-off that surpasses current methods.

Vision Language Models (VLMs) combine text inputs with visual understanding, and image resolution is crucial to VLM performance when processing text- and chart-rich data. However, increasing image resolution creates significant challenges. First, pretrained vision encoders often struggle with high-resolution images due to inefficient pretraining requirements, and running inference on high-resolution images increases computational cost and latency during visual token generation, whether through single high-resolution processing or multiple lower-resolution tile strategies. Second, high-resolution images produce more tokens, which increases the LLM prefilling time and therefore the time-to-first-token (TTFT): the sum of the vision encoder latency and the LLM prefilling time.
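The TTFT decomposition above is easy to make concrete. The following is a minimal sketch in Python; the latency figures, the 64-token text prompt, and the stride-14 patch arithmetic are illustrative assumptions, not measurements from the paper.

```python
# Back-of-the-envelope TTFT sketch (all numbers are illustrative assumptions).
# TTFT = vision-encoder latency + LLM prefill time, and both terms grow as
# higher input resolution produces more visual tokens.

def ttft_ms(encoder_ms: float, prefill_ms_per_token: float,
            n_visual_tokens: int, n_text_tokens: int = 64) -> float:
    """Time-to-first-token: encode the image, then prefill all prompt tokens."""
    return encoder_ms + prefill_ms_per_token * (n_visual_tokens + n_text_tokens)

# A stride-14 ViT at 336 px emits (336 / 14) ** 2 = 576 visual tokens; tiling a
# 1008 px image into 3x3 crops emits roughly 9x as many, so both the encoding
# and the prefill terms of TTFT grow sharply with resolution.
print(ttft_ms(encoder_ms=40.0, prefill_ms_per_token=0.2, n_visual_tokens=576))
print(ttft_ms(encoder_ms=360.0, prefill_ms_per_token=0.2, n_visual_tokens=9 * 576))
```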

Large multimodal models such as Frozen and Florence used cross-attention to combine image and text embeddings within intermediate LLM layers, while auto-regressive architectures like LLaVA, mPLUG-Owl, MiniGPT-4, and Cambrian-1 have also proven effective. For efficient image encoding, CLIP-pretrained vision transformers remain widely adopted, with variants such as SigLIP, EVA-CLIP, InternViT, and DFNCLIP. Methods like LLaVA-PruMerge and Matryoshka-based token sampling attempt dynamic token pruning, while hierarchical backbones such as ConvNeXT and FastViT reduce token count through progressive downsampling. More recently, ConvLLaVA was introduced, which uses a pure-convolutional vision encoder to encode images for a VLM.

Researchers from Apple have proposed FastVLM, a model that achieves an optimized trade-off between resolution, latency, and accuracy by analyzing how image quality, processing time, number of tokens, and LLM size affect each other. It utilizes FastViTHD, a hybrid vision encoder designed to output fewer tokens and reduce encoding time for high-resolution images. FastVLM achieves an optimal balance between visual token count and image resolution solely by scaling the input image. It shows a 3.2× improvement in TTFT in the LLaVA-1.5 setup and achieves superior performance on key benchmarks using the same 0.5B LLM when compared to LLaVA-OneVision at its maximum resolution, delivering 85× faster TTFT while using a 3.4× smaller vision encoder.
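To see why scaling a single input image through a high-stride encoder is attractive compared with AnyRes-style tiling, here is a rough arithmetic sketch; the tile size, strides, and the "crops plus one global view" tiling scheme are illustrative assumptions rather than details taken from the paper.

```python
# Token-count arithmetic (illustrative assumptions, not figures from the paper):
# single-scale encoding with a high output stride vs. AnyRes-style tiling with
# a stride-14 ViT over 336 px crops plus one downscaled global view.

def tokens_single_scale(resolution: int, output_stride: int = 32) -> int:
    """Visual tokens when the whole image is encoded once at `resolution`."""
    return (resolution // output_stride) ** 2

def tokens_tiled(resolution: int, tile: int = 336, patch: int = 14) -> int:
    """Visual tokens when the image is split into `tile`-sized crops plus a
    global overview image, each encoded by a stride-`patch` ViT."""
    tiles_per_side = -(-resolution // tile)        # ceiling division
    n_images = tiles_per_side ** 2 + 1             # crops + one global view
    return n_images * (tile // patch) ** 2

if __name__ == "__main__":
    for res in (672, 1024, 1536):
        print(res, tokens_single_scale(res), tokens_tiled(res))
    # Single-scale token counts grow quadratically but stay small (441, 1024,
    # 2304 here), while tiling multiplies a fixed 576-token cost per crop.
```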

All FastVLM models are trained on a single node with 8× NVIDIA H100-80GB GPUs, and stage-1 VLM training is fast, taking around 30 minutes with a Qwen2-7B decoder. Further, FastViTHD extends the base FastViT architecture with an additional stage containing a downsampling layer. This ensures self-attention operates on tensors downsampled by a factor of 32 rather than 16, reducing image encoding latency while generating 4× fewer tokens for the LLM decoder. The FastViTHD architecture contains five stages: the first three utilize RepMixer blocks for efficient processing, while the final two employ multi-headed self-attention blocks, creating an optimal balance between computational efficiency and high-resolution image understanding.
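The stage layout described above can be sketched as a skeleton in PyTorch. This is not Apple's implementation: the channel widths, block contents, and exact strides are assumptions; it only illustrates the pattern of three convolutional (RepMixer-like) stages followed by two self-attention stages, with attention running on a feature map downsampled by a factor of 32.

```python
# Skeleton of a five-stage hybrid encoder in the spirit of FastViTHD
# (an illustrative sketch with assumed widths and strides, not Apple's code).
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Stand-in for a RepMixer stage: cheap convolutional mixing + 2x downsampling."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),  # depthwise mixing
            nn.Conv2d(c_in, c_out, 1),
            nn.GELU(),
        )
        self.down = nn.Conv2d(c_out, c_out, 3, stride=2, padding=1)

    def forward(self, x):
        return self.down(self.mix(x))

class AttnStage(nn.Module):
    """Stand-in for a multi-head self-attention stage over the flattened feature map."""
    def __init__(self, c_in: int, c_out: int, stride: int = 1, heads: int = 8):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1)
        self.norm = nn.LayerNorm(c_out)
        self.attn = nn.MultiheadAttention(c_out, heads, batch_first=True)

    def forward(self, x):
        x = self.proj(x)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)              # (B, H*W, C)
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]
        return t.transpose(1, 2).reshape(b, c, h, w)

class HybridEncoderSketch(nn.Module):
    """Stem + three conv stages reach stride 16; the two attention stages run at stride 32."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, 3, stride=2, padding=1)   # stride 2
        self.stages = nn.Sequential(
            ConvStage(64, 128),              # stride 4
            ConvStage(128, 256),             # stride 8
            ConvStage(256, 512),             # stride 16
            AttnStage(512, 768, stride=2),   # stride 32: attention on a 32x-downsampled map
            AttnStage(768, 1024, stride=1),  # stays at stride 32
        )

    def forward(self, x):
        x = self.stages(self.stem(x))
        return x.flatten(2).transpose(1, 2)   # visual tokens handed to the LLM decoder

tokens = HybridEncoderSketch()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)  # (1, 1024, 1024): 4x fewer tokens than a stride-16 output (4096)
```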

When compared with ConvLLaVA using the same LLM and similar training data, FastVLM achieves 8.4% better performance on TextVQA and 12.5% improvement on DocVQA while operating 22% faster. The performance advantage increases at higher resolutions, where FastVLM maintains 2× faster processing speeds than ConvLLaVA across various benchmarks. FastVLM matches or surpasses MM1 performance across diverse benchmarks by using intermediate pretraining with 15M samples for resolution scaling, while generating 5 times fewer visual tokens. Moreover, FastVLM not only outperforms Cambrian-1 but also runs 7.9 times faster. With scaled instruction tuning, it delivers better results while using 2.3 times fewer visual tokens.

In conclusion, the researchers introduced FastVLM, an advancement in VLMs that uses the FastViTHD vision backbone for efficient high-resolution image encoding. The hybrid architecture, pretrained on reinforced image-text data, reduces visual token output with minimal accuracy sacrifice compared to existing approaches. FastVLM achieves competitive performance across VLM benchmarks while delivering notable efficiency improvements in both TTFT and vision-backbone parameter count. Rigorous benchmarking on M1 MacBook Pro hardware shows that FastVLM offers a state-of-the-art resolution-latency-accuracy trade-off superior to current methods.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.


