MarkTechPost@AI April 29, 04:25
ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets

 

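ViSMaP is an unsupervised method for summarizing long videos. It uses short-form video datasets and a meta-prompting strategy to summarize videos up to an hour long without costly manual annotation. ViSMaP iteratively generates and refines pseudo-summaries with multiple large language models and reaches performance comparable to fully supervised models. The method performs well on cross-domain video summarization, adapts well across domains, and offers a new approach to understanding and processing long-form video content.

💡 ViSMaP uses large language models (LLMs) and a meta-prompting strategy to iteratively generate and refine pseudo-summaries from clip descriptions produced by short-form video models, enabling unsupervised summarization of long videos.

🧪 The study evaluates ViSMaP in several scenarios: long-video summarization on Ego4D-HCap, cross-domain generalization on MSRVTT, MSVD, and YouCook2, and adaptation to short videos on EgoSchema.

🎯 ViSMaP first trains a model to summarize 3-minute videos using TimeSformer features, a visual-language alignment module, and a text decoder. It then splits long videos into 3-minute clips and generates pseudo-captions, and finally fine-tunes the model with a symmetric cross-entropy loss to handle noisy labels and improve adaptation.

📊 Experiments show that ViSMaP performs on par with fully supervised methods and adapts effectively to a range of video datasets; ablations highlight the contribution of meta-prompting and of components such as contrastive learning.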

Video captioning models are typically trained on datasets consisting of short videos, usually under three minutes in length, paired with corresponding captions. While this enables them to describe basic actions like walking or talking, these models struggle with the complexity of long-form videos, such as vlogs, sports events, and movies that can last over an hour. When applied to such videos, they often generate fragmented descriptions focused on isolated actions rather than capturing the broader storyline. Efforts like MA-LMM and LaViLa have extended video captioning to 10-minute clips using LLMs, but hour-long videos remain a challenge due to a shortage of suitable datasets. Although Ego4D introduced a large dataset of hour-long videos, its first-person perspective limits its broader applicability. Video ReCap addressed this gap by training on hour-long videos with multi-granularity annotations, yet this approach is expensive and prone to annotation inconsistencies. In contrast, annotated short-form video datasets are widely available and more user-friendly.

Advancements in visual-language models have significantly enhanced the integration of vision and language tasks, with early works such as CLIP and ALIGN laying the foundation. Subsequent models, such as LLaVA and MiniGPT-4, extended these capabilities to images, while others adapted them for video understanding by focusing on temporal sequence modeling and constructing more robust datasets. Despite these developments, the scarcity of large, annotated long-form video datasets remains a significant hindrance to progress. Traditional short-form video tasks, like video question answering, captioning, and grounding, primarily require spatial or temporal understanding, whereas summarizing hour-long videos demands identifying key frames amidst substantial redundancy. While some models, such as LongVA and LLaVA-Video, can perform VQA on long videos, they struggle with summarization tasks due to data limitations.

Researchers from Queen Mary University of London and Spotify introduce ViSMaP, an unsupervised method for summarizing hour-long videos without requiring costly annotations. Traditional models perform well on short, pre-segmented videos but struggle with longer content where important events are scattered. ViSMaP bridges this gap by using LLMs and a meta-prompting strategy to iteratively generate and refine pseudo-summaries from clip descriptions created by short-form video models. The process involves three LLMs working in sequence: one generates a summary, one evaluates it, and one optimizes the prompt for the next round. ViSMaP achieves performance comparable to fully supervised models across multiple datasets while maintaining domain adaptability and eliminating the need for extensive manual labeling.
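The generate, evaluate, optimize loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function `meta_prompt_summarize`, the stopping rule (`max_iters`, `target_score`), and the three `toy_*` stand-ins for the LLMs are all hypothetical, since the article does not give the actual prompts, scoring scheme, or stopping criterion.

```python
from typing import Callable, List, Tuple

# Hypothetical role signatures; in practice each would wrap an LLM call.
Generator = Callable[[str, List[str]], str]   # (prompt, clip captions) -> summary
Evaluator = Callable[[str, List[str]], float]  # (summary, clip captions) -> score
Optimizer = Callable[[str, str, float], str]   # (prompt, summary, score) -> new prompt


def meta_prompt_summarize(
    captions: List[str],
    generate: Generator,
    evaluate: Evaluator,
    optimize: Optimizer,
    init_prompt: str,
    max_iters: int = 5,         # assumed iteration budget
    target_score: float = 0.9,  # assumed early-stopping threshold
) -> Tuple[str, float]:
    """Iteratively refine a pseudo-summary; return the best one seen."""
    prompt = init_prompt
    best_summary, best_score = "", float("-inf")
    for _ in range(max_iters):
        summary = generate(prompt, captions)
        score = evaluate(summary, captions)
        if score > best_score:
            best_summary, best_score = summary, score
        if score >= target_score:
            break
        # The optimizer rewrites the prompt based on the evaluator's feedback.
        prompt = optimize(prompt, summary, score)
    return best_summary, best_score


# Toy stand-ins so the sketch runs without any LLM (purely illustrative).
def toy_generate(prompt: str, captions: List[str]) -> str:
    # Covers one more caption for every "more" in the prompt.
    return "; ".join(captions[: prompt.count("more") + 1])


def toy_evaluate(summary: str, captions: List[str]) -> float:
    # Fraction of clip captions mentioned in the summary.
    return sum(c in summary for c in captions) / len(captions)


def toy_optimize(prompt: str, summary: str, score: float) -> str:
    return prompt + " more"


captions = ["opens fridge", "chops onions", "fries rice"]
summary, score = meta_prompt_summarize(
    captions, toy_generate, toy_evaluate, toy_optimize, "Summarize"
)
```

With these toy roles, the prompt is revised twice before the summary covers all three clip captions. Keeping the best-scoring summary, rather than the last one, is a design choice of this sketch that guards against a prompt revision making things worse.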

The study addresses cross-domain video summarization by training on a labeled short-form video dataset and adapting to unlabeled, hour-long videos from a different domain. First, a model is trained to summarize 3-minute videos using TimeSformer features, a visual-language alignment module, and a text decoder, optimized with cross-entropy and contrastive losses. Each long video is then segmented into 3-minute clips, and the trained model generates a pseudo-caption for each clip. An iterative meta-prompting approach with multiple LLMs (generator, evaluator, optimizer) refines these captions into a pseudo-summary. Finally, the model is fine-tuned on the pseudo-summaries using a symmetric cross-entropy loss to manage noisy labels and improve adaptation.
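Two steps in this pipeline are easy to make concrete: cutting a long video into 3-minute clips, and the symmetric cross-entropy (SCE) loss used against noisy pseudo-labels. The sketch below uses plain Python and follows the common formulation of SCE, a weighted sum of standard cross-entropy and reverse cross-entropy with log(0) clamped to a constant. The function names and the values of `alpha`, `beta`, and `log_zero` are assumptions for illustration; the article does not state the settings ViSMaP uses.

```python
import math
from typing import List, Tuple


def segment_clips(duration_s: float, clip_len_s: float = 180.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into consecutive clips of at most clip_len_s seconds."""
    clips = []
    start = 0.0
    while start < duration_s:
        clips.append((start, min(start + clip_len_s, duration_s)))
        start += clip_len_s
    return clips


def symmetric_cross_entropy(
    probs: List[float],      # predicted distribution over classes or tokens
    target: int,             # index of the (possibly noisy) label
    alpha: float = 1.0,      # assumed weight on standard cross-entropy
    beta: float = 1.0,       # assumed weight on reverse cross-entropy
    log_zero: float = -4.0,  # assumed stand-in for log(0) on the one-hot label
    eps: float = 1e-12,
) -> float:
    """SCE for one example with a one-hot label: alpha * CE + beta * RCE."""
    # CE = -log p(target)
    ce = -math.log(max(probs[target], eps))
    # RCE = -sum_k p(k) * log q(k); log q(target) = 0, log q(other) = log_zero
    rce = -sum(p * log_zero for k, p in enumerate(probs) if k != target)
    return alpha * ce + beta * rce


clips = segment_clips(3600.0)  # a one-hour video
loss = symmetric_cross_entropy([0.7, 0.2, 0.1], target=0)
```

Here a one-hour video yields 20 clips. The reverse term grows only linearly with the probability mass placed off the label, so a confidently wrong pseudo-label is penalized less steeply than under cross-entropy alone, which is the usual motivation for using SCE with noisy labels.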

The study evaluates ViSMaP across three scenarios: summarization of long videos using Ego4D-HCap, cross-domain generalization on the MSRVTT, MSVD, and YouCook2 datasets, and adaptation to short videos using EgoSchema. ViSMaP, trained on hour-long videos, is compared against supervised and zero-shot methods, such as Video ReCap and LaViLa+GPT3.5, and demonstrates competitive or superior performance without supervision. Evaluations use CIDEr, ROUGE-L, and METEOR scores, along with QA accuracy. Ablation studies highlight the benefits of meta-prompting and of component modules such as contrastive learning and the SCE loss. Implementation details include the use of TimeSformer, DistilBERT, and GPT-2, with training conducted on an NVIDIA A100 GPU.
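Of the metrics listed, ROUGE-L is the simplest to illustrate: it scores a candidate summary by its longest common subsequence (LCS) with a reference. The sketch below is a simplified version that assumes whitespace tokenization, lowercasing, and a balanced F1; standard evaluation toolkits differ in tokenization, stemming, and F-measure weighting, so its numbers will not match published scores exactly. The function names are illustrative.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

In this example the LCS is "the cat on the mat" (5 tokens), and both sentences have 6 tokens, so precision, recall, and F1 all equal 5/6.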

In conclusion, ViSMaP is an unsupervised approach for summarizing long videos by utilizing annotated short-video datasets and a meta-prompting strategy. It first creates high-quality summaries through meta-prompting and then trains a summarization model, reducing the need for extensive annotations. Experimental results demonstrate that ViSMaP performs on par with fully supervised methods and adapts effectively across various video datasets. However, its reliance on pseudo labels from a source-domain model may impact performance under significant domain shifts. Additionally, ViSMaP currently relies solely on visual information. Future work could integrate multimodal data, introduce hierarchical summarization, and develop more generalizable meta-prompting techniques.



The post ViSMaP: Unsupervised Summarization of Hour-Long Videos Using Meta-Prompting and Short-Form Datasets appeared first on MarkTechPost.
