MarkTechPost@AI, October 31, 2024
Meta AI Releases LongVU: A Multimodal Large Language Model that can Address the Significant Challenge of Long Video Understanding

Meta AI has released LongVU, a new model designed to address the context-length limitation that traditional multimodal large language models (MLLMs) face when processing long video content. LongVU employs a spatiotemporal adaptive compression mechanism that intelligently reduces the number of video tokens while preserving important visual details. By combining DINOv2 features with cross-modal queries, it effectively reduces spatial and temporal redundancy in video data, enabling the processing of long video sequences without losing critical information.

🗺️ LongVU tackles long video understanding through a spatiotemporal adaptive compression mechanism. It uses DINOv2 features for frame extraction, selective frame feature reduction via text-guided cross-modal queries, and spatial token reduction based on temporal dependencies.

🚀 LongVU outperforms other models. On the VideoMME benchmark, its overall accuracy exceeds LLaVA-OneVision by roughly 5%. Even when scaled down to a lightweight version with the Llama3.2-3B language backbone, LongVU achieves substantial gains on long video tasks, improving on previous state-of-the-art models by 3.4%.

💡 LongVU's strength lies in handling long video content while maintaining high performance. It processes video sampled at one frame per second (1 fps) and reduces the token count per frame to an average of two, fitting hour-long video sequences within an 8k context length.

📊 On the MVBench evaluation set, LongVU is competitive with GPT-4V and even surpasses it in some cases, demonstrating its effectiveness in understanding densely sampled video inputs.

🚀 With its lightweight architecture and efficient compression, LongVU extends advanced video understanding to diverse use cases, including mobile and low-resource environments. By reducing computational costs without compromising accuracy, LongVU sets a new standard for future MLLMs.

Understanding and analyzing long videos has been a significant challenge in AI, primarily due to the vast amount of data and computational resources required. Traditional Multimodal Large Language Models (MLLMs) struggle to process extensive video content because of limited context length. This challenge is especially evident with hour-long videos, which need hundreds of thousands of tokens to represent visual information, often exceeding the memory capacity of even advanced hardware. Consequently, these models fail to provide consistent and comprehensive video understanding, limiting their real-world applications.

Meta AI Releases LongVU

Meta AI has released LongVU, an MLLM designed to address the challenge of long video understanding within a commonly used context length. LongVU employs a spatiotemporal adaptive compression mechanism that intelligently reduces the number of video tokens while preserving essential visual details. By leveraging a combination of DINOv2 features and cross-modal queries, LongVU effectively reduces spatial and temporal redundancies in video data, enabling the processing of long-form video sequences without losing critical information.

LongVU uses a selective frame feature reduction approach guided by text queries and leverages DINOv2’s self-supervised features to discard redundant frames. This method has a significant advantage over traditional uniform sampling techniques, which either lead to the loss of important information by discarding keyframes or become computationally infeasible by retaining too many tokens. The resulting MLLM has a lightweight design, allowing it to operate efficiently and achieve state-of-the-art results on video understanding benchmarks.
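The frame-pruning step can be illustrated with a short sketch. The snippet below is a hypothetical approximation, not LongVU's released code: the single pooled DINOv2 feature per frame, the comparison against the last kept frame, and the `sim_threshold` value are assumptions made purely for illustration.

```python
# Hypothetical sketch of similarity-based redundant-frame removal.
# Assumes `frame_feats` is a (T, D) tensor of one DINOv2 feature per frame.
import torch
import torch.nn.functional as F

def prune_redundant_frames(frame_feats: torch.Tensor, sim_threshold: float = 0.9):
    """Keep a frame only if it differs enough from the last kept frame."""
    feats = F.normalize(frame_feats, dim=-1)
    kept = [0]  # always keep the first frame
    for t in range(1, feats.shape[0]):
        sim = torch.dot(feats[t], feats[kept[-1]]).item()  # cosine similarity
        if sim < sim_threshold:  # frame is sufficiently novel, so keep it
            kept.append(t)
    return kept
```

Pruning only when a frame closely matches the previously kept one tends to preserve keyframes around scene changes, which uniform sampling can drop.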

Technical Details and Benefits of LongVU

LongVU’s architecture combines DINOv2 features for frame extraction, selective frame feature reduction through text-guided cross-modal queries, and spatial token reduction based on temporal dependencies. Initially, DINOv2’s feature similarity objective is used to eliminate redundant frames, reducing the token count. LongVU then applies a cross-modal query to prioritize frames relevant to the input text query. For the remaining frames, a spatial pooling mechanism further reduces the token representation while preserving the most important visual details.
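The later stages, query-guided frame selection and spatial token reduction, can be sketched in the same spirit. The snippet below is an illustrative assumption rather than the actual implementation: the helper name `select_and_pool`, the `top_k` cutoff, and average pooling as the spatial-reduction operator are choices made for the example.

```python
# Hypothetical sketch: keep full tokens for query-relevant frames,
# spatially pool the tokens of the remaining frames.
import torch
import torch.nn.functional as F

def select_and_pool(frame_tokens, frame_keys, text_query_emb, top_k=32, pool=4):
    """
    frame_tokens:   (T, N, D) visual tokens per frame, N a square number
    frame_keys:     (T, D) one pooled feature per frame for query matching
    text_query_emb: (D,) embedding of the user's text query
    """
    # Rank frames by cosine similarity to the text query.
    scores = F.normalize(frame_keys, dim=-1) @ F.normalize(text_query_emb, dim=0)
    keep_full = set(scores.topk(min(top_k, scores.numel())).indices.tolist())

    out = []
    for t in range(frame_tokens.shape[0]):
        toks = frame_tokens[t]                          # (N, D)
        if t in keep_full:
            out.append(toks)                            # query-relevant: keep all tokens
        else:
            side = int(toks.shape[0] ** 0.5)
            grid = toks.T.reshape(1, -1, side, side)    # (1, D, side, side)
            pooled = F.avg_pool2d(grid, pool)           # shrink less-relevant frames
            out.append(pooled.flatten(2).squeeze(0).T)  # (side/pool)^2 tokens
    return out
```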

This approach maintains high performance even when processing hour-long videos. The spatial token reduction mechanism ensures that essential spatial information is retained while redundant data is eliminated. LongVU processes one-frame-per-second (1fps) sampled video input, effectively reducing the number of tokens per frame to an average of two, accommodating hour-long video sequences within an 8k context length—a common limitation for MLLMs. The architecture balances token reduction with the preservation of crucial visual content, making it highly efficient for long video processing.
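The context-budget arithmetic behind that claim is easy to verify; the short calculation below simply restates the 1 fps sampling rate and the roughly two tokens per frame quoted above.

```python
# Back-of-the-envelope check of the context budget quoted above.
seconds = 60 * 60                  # one hour of video
frames = seconds * 1               # sampled at 1 frame per second
avg_tokens_per_frame = 2           # average after spatiotemporal compression
visual_tokens = frames * avg_tokens_per_frame
print(visual_tokens)               # 7200 tokens, within an 8k (8192) context
```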

Importance and Performance of LongVU

LongVU represents a significant breakthrough in long video understanding by overcoming the fundamental issue of limited context length faced by most MLLMs. Through spatiotemporal compression and effective cross-modal querying, LongVU achieves impressive results on key video understanding benchmarks. For example, on the VideoMME benchmark, LongVU outperforms a strong baseline model, LLaVA-OneVision, by approximately 5% in overall accuracy. Even when scaled down to a lightweight version using the Llama3.2-3B language backbone, LongVU demonstrates substantial gains, achieving a 3.4% improvement over previous state-of-the-art models in long video tasks.

LongVU’s robustness is further highlighted by its competitive results against proprietary models like GPT-4V. On the MVBench evaluation set, LongVU not only reduced the performance gap with GPT-4V but also surpassed it in some cases, demonstrating its effectiveness in understanding densely sampled video inputs. This makes LongVU particularly valuable for applications that require real-time video analysis, such as security surveillance, sports analysis, and video-based educational tools.

Conclusion

Meta AI’s LongVU is a major advancement in video understanding, especially for lengthy content. By using spatiotemporal adaptive compression, LongVU effectively addresses the challenges of processing videos with temporal and spatial redundancies, providing an efficient solution for long video analysis. Its superior performance across benchmarks highlights its edge over traditional MLLMs, paving the way for more advanced applications.

With its lightweight architecture and efficient compression, LongVU extends high-level video understanding to diverse use cases, including mobile and low-resource environments. By reducing computational costs without compromising accuracy, LongVU sets a new standard for future MLLMs.


Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.

