MarkTechPost@AI · January 29
InternVideo2.5: Hierarchical Token Compression and Task Preference Optimization for Video MLLMs

InternVideo2.5 is a new video multimodal large language model that substantially improves video understanding through long and rich context modeling. The model uses direct preference optimization to transfer dense visual task annotations into the model, and adaptive hierarchical token compression to build efficient spatiotemporal representations. Experiments show that InternVideo2.5 performs strongly on both short- and long-video question-answering tasks, with notable gains over other models in short-duration spatiotemporal understanding and implicit memory. Although high computational cost remains an issue, the model shows significant potential for the development of multimodal AI.

🚀 InternVideo2.5 significantly improves the performance of video multimodal large language models (MLLMs) through long and rich context (LRC) modeling, especially in handling fine-grained video details and complex temporal structures.

🎯 The model uses direct preference optimization to integrate dense vision task annotations into the MLLM, improving performance on visual tasks such as object tracking.

⏱️ Adaptive hierarchical token compression is used to build compact spatiotemporal representations, allowing the model to process video data more efficiently and at lower computational cost.

🏆 InternVideo2.5 achieves strong results on multiple video understanding benchmarks, with notable gains over other models in short-video spatiotemporal understanding and long-video implicit memory.

Multimodal large language models (MLLMs) have emerged as a promising approach towards artificial general intelligence, integrating diverse sensing signals into a unified framework. However, MLLMs face substantial challenges in fundamental vision-related tasks, significantly underperforming compared to human capabilities. Critical limitations persist in object recognition, localization, and motion recall, presenting obstacles to comprehensive visual understanding. Despite ongoing research and scaling efforts, a clear pathway to achieving human-level visual comprehension remains elusive. The current work highlights the complexity of developing adaptive and intelligent multimodal systems that can interpret and reason across different sensory inputs with human-like precision and flexibility.

Existing research on MLLMs has pursued multiple approaches to address visual understanding challenges. Current methodologies combine vision encoders, language models, and connectors through instruction tuning, enabling complex tasks like image description and visual question answering. Researchers have explored various dimensions, including model architecture, model size, training corpora, and performance optimization. Video-capable MLLMs can process sequential visuals and comprehend spatiotemporal variations. However, existing methods still struggle with fine-grained visual tasks such as precise segmentation and temporal grounding. Two strategies have emerged to tackle these challenges: the pixel-to-sequence (P2S) methodology and the pixel-to-embedding (P2E) approach.

Researchers from Shanghai AI Laboratory, Nanjing University, and Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences have proposed InternVideo2.5, a novel approach that improves video MLLMs through long and rich context (LRC) modeling. It addresses limitations in perceiving fine-grained video details and capturing complex temporal structures. The method integrates dense vision task annotations into MLLMs via direct preference optimization and builds compact spatiotemporal representations through adaptive hierarchical token compression. The researchers aim to expand the model's video understanding capabilities, enabling more robust performance across a range of benchmarks.
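
The preference-optimization step can be read against the standard direct preference optimization (DPO) objective, which contrasts the policy's and a frozen reference model's log-probabilities on a preferred versus a dispreferred response. Below is a minimal sketch of that generic loss, assuming batched summed log-probabilities; it is a reference illustration only, not the exact task-preference objective the authors use, and all names are illustrative.

```python
# Minimal sketch of the standard DPO loss (Rafailov et al., 2023).
# Shown for reference only; the preference objective used to transfer
# dense vision-task annotations in InternVideo2.5 may differ.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Inputs are per-example summed log-probabilities of full responses, shape (batch,)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref for preferred output
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for dispreferred output
    # Push the policy to rank the annotation-consistent ("chosen") response
    # above the inconsistent ("rejected") one, relative to the frozen reference.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```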

The proposed architecture is a multimodal framework that integrates advanced video processing and language modeling techniques. The system uses dynamic video sampling, processing between 64 and 512 frames, with each 8-frame clip compressed to 128 tokens, i.e., a 16-token-per-frame representation. Key architectural components include a Temporal Head based on the CG-DETR architecture and a Mask Head built on SAM2's pre-trained weights. For temporal processing, the framework uses InternVideo2 for video feature extraction, with query features processed through the language model. Two-layer MLPs encode positional prompts and spatial inputs into the multimodal language model to strengthen its spatiotemporal capabilities.
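
To make the token budget concrete, the sketch below works through the arithmetic stated above: each 8-frame clip is reduced to 128 tokens, i.e., 16 tokens per frame, so a 256-frame video yields 32 clips and 4,096 visual tokens. The pooling-based compress_clip function is a simple stand-in for illustration only, not the paper's adaptive hierarchical token compression; the patch count and embedding dimension are assumed.

```python
# Hypothetical sketch of the frame-sampling and token-budget arithmetic
# described above. `compress_clip` and the tensor shapes are illustrative,
# not the authors' implementation.
import torch
import torch.nn.functional as F

FRAMES_MIN, FRAMES_MAX = 64, 512                  # dynamic sampling range reported for the model
CLIP_LEN = 8                                      # frames per clip
TOKENS_PER_CLIP = 128                             # compressed tokens per 8-frame clip
TOKENS_PER_FRAME = TOKENS_PER_CLIP // CLIP_LEN    # -> 16 tokens per frame

def compress_clip(clip_tokens: torch.Tensor) -> torch.Tensor:
    """Reduce a clip's visual tokens to a fixed budget via average pooling (a stand-in
    for the actual compression). clip_tokens: (CLIP_LEN, n_patches, dim) patch embeddings.
    Returns a (TOKENS_PER_CLIP, dim) compact spatiotemporal representation."""
    t, n, d = clip_tokens.shape
    flat = clip_tokens.reshape(1, t * n, d).transpose(1, 2)   # (1, dim, t*n)
    pooled = F.adaptive_avg_pool1d(flat, TOKENS_PER_CLIP)     # (1, dim, 128)
    return pooled.transpose(1, 2).squeeze(0)                  # (128, dim)

# Example: a 256-frame video becomes 256 / 8 = 32 clips -> 32 * 128 = 4096 visual tokens.
video = torch.randn(256 // CLIP_LEN, CLIP_LEN, 196, 1024)     # (clips, frames, patches, dim) -- assumed sizes
compressed = torch.stack([compress_clip(c) for c in video])   # (32, 128, 1024)
print(compressed.shape, "tokens per frame:", TOKENS_PER_FRAME)
```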

InternVideo2.5 demonstrates strong performance across video understanding benchmarks covering both short- and long-video question answering. Compared to its base model InternVL2.5, the proposed approach improves by more than 3 points on MVBench and Perception Test for short-video prediction. InternVideo2.5 also outperforms models such as GPT-4o and Gemini-1.5-Pro in short-duration spatiotemporal understanding. The Needle-In-The-Haystack (NIAH) evaluation further validates the model's enhanced implicit memory, with superior recall on a complex 5,000-frame single-hop task.
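
As a rough illustration of what such a probe involves, the hypothetical sketch below hides a single "needle" frame inside a long frame sequence and records its position so a model's recall can be checked; build_niah_probe and the commented query interface are invented names, and the benchmark's actual protocol may differ.

```python
# Hypothetical sketch of a Needle-In-A-Haystack (NIAH) video probe: hide one
# "needle" frame in a long frame sequence and check whether the model recalls it.
import random

def build_niah_probe(haystack_frames, needle_frame, total_frames: int = 5000):
    """Return a frame list of length `total_frames` with the needle at a random index."""
    frames = (haystack_frames * (total_frames // len(haystack_frames) + 1))[:total_frames]
    pos = random.randrange(total_frames)
    frames[pos] = needle_frame
    return frames, pos

# Illustrative usage (video_mllm and predicted_index are placeholders, not a real API):
# frames, gt = build_niah_probe(background_frames, needle)
# answer = video_mllm.ask(frames, "At which frame does the inserted image appear?")
# correct = (predicted_index(answer) == gt)
```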

In conclusion, the researchers introduced InternVideo2.5, a video MLLM designed to enhance perception and understanding through long and rich context (LRC) modeling. The method uses direct preference optimization to transfer dense visual annotations and adaptive hierarchical token compression for efficient spatiotemporal representation. The work reports significant improvements in visual capabilities, including object tracking, and underscores the importance of multimodal context resolution in advancing MLLM performance. However, the study notes limitations such as high computational cost and the need for further research on extending context processing techniques, leaving open opportunities for future work in multimodal AI.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Related tags

Multimodal Large Language Models · Video Understanding · Spatiotemporal Modeling · InternVideo2.5 · Artificial Intelligence