Generalist Forecasting with Frozen Video Models via Latent Diffusion

cs.AI updates on arXiv.org, July 21, 12:06

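This paper finds that a vision model's perceptual ability is closely tied to its forecasting performance over short time horizons. It builds a new forecasting framework that trains latent diffusion models to predict future features, and applies it to a variety of pretrained models and levels of abstraction to advance video understanding.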

arXiv:2507.13942v1 Announce Type: cross

Abstract: Forecasting what will happen next is a critical skill for general-purpose systems that plan or act in the world at different levels of abstraction. In this paper, we identify a strong correlation between a vision model's perceptual ability and its generalist forecasting performance over short time horizons. This trend holds across a diverse set of pretrained models, including those trained generatively, and across multiple levels of abstraction, from raw pixels to depth, point tracks, and object motion. The result is made possible by a novel generalist forecasting framework that operates on any frozen vision backbone: we train latent diffusion models to forecast future features in the frozen representation space, which are then decoded via lightweight, task-specific readouts. To enable consistent evaluation across tasks, we introduce distributional metrics that compare distributional properties directly in the space of downstream tasks, and we apply this framework to nine models and four tasks. Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.
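To make the pipeline in the abstract more concrete, here is a minimal sketch of one denoising training step in a frozen feature space. It is not the authors' implementation. It assumes the standard noise-prediction (DDPM-style) objective, which the abstract does not confirm, and every name in it (`frozen_encode`, `add_noise`, `denoising_loss`, `linear_readout`, the toy two-dimensional features) is hypothetical and chosen only for illustration.

```python
import math
import random


def frozen_encode(frame):
    """Hypothetical stand-in for a frozen backbone: a fixed, non-trainable feature map."""
    return [sum(frame) / len(frame), max(frame) - min(frame)]


def add_noise(z0, alpha_bar, rng):
    """Forward diffusion: z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [
        math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
        for z, e in zip(z0, eps)
    ]
    return z_t, eps


def denoising_loss(predict_eps, past_feats, future_feat, alpha_bar, rng):
    """Mean squared error between the predicted and the true noise (assumed objective).

    predict_eps is the trainable forecaster; it sees the noisy future feature,
    the clean past features as conditioning, and the noise level.
    """
    z_t, eps = add_noise(future_feat, alpha_bar, rng)
    eps_hat = predict_eps(z_t, past_feats, alpha_bar)
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps)) / len(eps)


def linear_readout(feat, weights, bias=0.0):
    """Hypothetical lightweight task readout applied to a (forecast) feature."""
    return sum(w * f for w, f in zip(weights, feat)) + bias


if __name__ == "__main__":
    rng = random.Random(0)
    past = [frozen_encode([0.0, 1.0, 2.0, 3.0])]   # features of observed frames
    future = frozen_encode([1.0, 2.0, 3.0, 4.0])   # feature of the frame to forecast

    def zero_predictor(z_t, past_feats, alpha_bar):
        # An untrained forecaster that always predicts zero noise.
        return [0.0 for _ in z_t]

    print("loss:", denoising_loss(zero_predictor, past, future, 0.5, rng))
    print("readout:", linear_readout(future, [1.0, -0.5]))
```

In this sketch only `predict_eps` would be trained. The encoder stays fixed, and the readout is fit separately per task, which mirrors the split between frozen backbone, latent forecaster, and task-specific readout that the abstract describes.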

Related tags

vision models, forecasting performance, video understanding
