Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

cs.AI updates on arXiv.org, July 11, 12:04

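This paper proposes STTM, a training-free spatio-temporal token merging method that exploits local spatio-temporal redundancy in video data to reduce the computational cost of video large language models (LLMs) while largely preserving video understanding accuracy.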

arXiv:2507.07990v1 Announce Type: cross

Abstract: Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2$\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3$\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.
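To make the two-stage idea in the abstract concrete, here is a minimal pure-Python sketch of a quadtree-based spatial merge followed by a directed temporal merge. It is an illustration under stated assumptions, not the authors' implementation: the function names (`quadtree_merge`, `sttm_sketch`), the cosine-similarity test against the region mean, the thresholds `tau_s` and `tau_t`, and the rule of merging a node only into the node covering the same region in the immediately preceding frame are all hypothetical choices. The sketch also assumes each frame is a square token grid whose side is a power of two.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def quadtree_merge(frame, y, x, size, tau_s):
    """Coarse-to-fine search: keep ONE token for a square region if every
    token in it is similar to the region mean, else split into quadrants.
    Returns {(y, x, size): merged_vector}. The similarity rule is an assumption."""
    block = [frame[i][j] for i in range(y, y + size) for j in range(x, x + size)]
    center = mean(block)
    if size == 1 or all(cosine(t, center) >= tau_s for t in block):
        return {(y, x, size): center}
    half = size // 2
    nodes = {}
    for dy in (0, half):
        for dx in (0, half):
            nodes.update(quadtree_merge(frame, y + dy, x + dx, half, tau_s))
    return nodes

def sttm_sketch(frames, tau_s=0.9, tau_t=0.9):
    """Return, per frame, the tokens that survive both merging stages."""
    kept, prev = [], {}
    for frame in frames:
        nodes = quadtree_merge(frame, 0, 0, len(frame), tau_s)
        # Directed temporal merge (assumed rule): a node is absorbed into the
        # previous frame's node covering the same region when they are similar.
        survivors = [(key, vec) for key, vec in nodes.items()
                     if key not in prev or cosine(vec, prev[key]) < tau_t]
        kept.append(survivors)
        prev = nodes
    return kept

# Toy input: three 4x4 frames of 2-d tokens; one token changes after frame 0.
base = [[[1.0, 0.0] for _ in range(4)] for _ in range(4)]
changed = [[list(t) for t in row] for row in base]
changed[0][0] = [0.0, 1.0]
kept = sttm_sketch([base, changed, changed])
print([len(k) for k in kept])  # [1, 7, 0]
```

In this toy run, the uniform first frame collapses to a single coarse token, the second frame is refined only around the changed token (seven nodes), and the third frame is absorbed entirely into the second, so 8 of the original 48 tokens remain. Because neither stage looks at a question, the kept tokens depend only on the video, which is the property the abstract describes as query-agnostic.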

Related tags

Video LLM, spatio-temporal merging, STTM, computational efficiency, video understanding
