MarkTechPost@AI · October 7, 2024
Vinoground: A Temporal Counterfactual Large Multimodal Model (LMM) Evaluation Benchmark Encompassing 1000 Short and Natural Video-Caption Pairs

Vinoground is an evaluation benchmark proposed by researchers at the University of Wisconsin, comprising 1000 short, natural videos and their captions, used to assess models' ability to understand video. The benchmark exposes the shortcomings of current LLMs in video comprehension and examines their performance and directions for improvement.

🌐Vinoground is a challenging dataset of natural, consecutive actions and transformations that exposes the deficiencies of current LLMs in video understanding. Its data falls into the major categories of object, action, and viewpoint, and the minor categories of interaction, cyclical, spatial, and contextual.

📝For evaluation, GPT-4 was used to generate counterfactual captions, which were matched against VATEX captions through feature extraction with the FAISS library; the untrained portions of the VATEX dataset were used for video curation.

💪Models differ widely on Vinoground: GPT-4o performs best among generative models, some CLIP-based models score even below random chance, and open-source models also vary in performance.

Generative intelligence has remained a hot topic for some time, with the world witnessing an unprecedented boom in AI-related innovation and research, especially after the introduction of Large Language Models. A significant amount of funding is flowing into LLM-related research in academia and industry, in the hope of producing the next breakthrough that disrupts the field. Scrutinizing Large Multimodal Models, the prevailing sentiment today holds that they have successfully addressed the challenges of video content, particularly short videos. Consequently, these models are moving on to more challenging multimodal tasks, including longer videos. But is this claim authentic, and have state-of-the-art models actually achieved human-parity performance on short videos? Vinoground scrutinizes this claim and assesses whether we are ready to level up or whether these models need to revisit their foundations in video comprehension.

Vinoground is a temporal counterfactual LMM evaluation benchmark from researchers at the University of Wisconsin. It consists of 1000 short, natural videos along with their captions. This challenging dataset assesses models' ability to comprehend videos with dense temporal information. What distinguishes Vinoground from its contemporaries is its naturalness: real-life consecutive actions and transformations that truly test, and expose, the weaknesses of current models on video. Few existing benchmarks replicate a practically realistic test ground; many are temporally sparse, allowing models to get by on single-frame biases, while other temporal counterfactual benchmarks are unnatural. State-of-the-art proprietary and open-source models exhibited poor performance on Vinoground, indicating that they have yet to achieve reliable video comprehension.

The data is categorized into three major categories: object, action, and viewpoint. There are also four minor categories: interaction, cyclical, spatial, and contextual. Models are assessed against each of these categories. Next comes caption generation, where the authors chose GPT-4 to generate counterfactual captions rather than relying on costly human annotation. Each counterfactual caption had to contain exactly the same words as the original in a different order (illustratively, a pair like "the man claps before he stands up" versus "the man stands up before he claps"). Video curation was perhaps the most crucial task; Vinoground drew on the untrained test and validation splits of the VATEX dataset. VATEX captions were matched against the GPT-generated ones through feature extraction with the FAISS library; when no suitable match was found, the authors searched YouTube for videos fitting their GPT captions (a sketch of this matching step follows the list below). Finally, the dataset was divided according to the following criteria:

Object- Videos showing a transformation in the object's state.
Action- A set of tasks occurring in different orders, testing whether the model can catch the swap.
Viewpoint- Changes in the camera angle, perspective, or focus.
Interaction- Videos where a human changes their way of interacting with an object.
Cyclical- Videos with procedural, temporally dependent activities.
Spatial- Object movements across space.
Contextual- Changes in the background or the general information of the entire video.
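
To make the caption-matching step concrete, here is a minimal sketch. This is an illustration, not the authors' code: it assumes captions are embedded with a sentence-transformers encoder, and both the encoder choice and the similarity threshold are hypothetical.

```python
# Minimal sketch of FAISS-based caption matching (encoder and threshold
# are illustrative assumptions, not details from the paper).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

def build_index(vatex_captions):
    """Embed the VATEX captions and store them in a FAISS index."""
    vecs = encoder.encode(vatex_captions, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])  # inner product == cosine on normalized vectors
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def match_caption(gpt_caption, index, vatex_captions, threshold=0.8):
    """Return the best-matching VATEX caption, or None if nothing clears the threshold."""
    q = encoder.encode([gpt_caption], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k=1)
    if scores[0][0] >= threshold:
        return vatex_captions[ids[0][0]]
    return None  # fall back to a manual YouTube search, as the authors did
```

Whatever encoder is used, the key design point is the same: nearest-neighbor search over caption embeddings lets the authors reuse existing VATEX videos instead of filming or annotating new ones.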

Vinoground exposed the claims of both proprietary and open-source models. CLIP-based models like VideoCLIP and LanguageBind performed even worse than random chance. GPT-4o performed best among the generative models, scoring 54% on the text-score metric; this was achieved with Chain-of-Thought (CoT) prompting, though at a tradeoff in group-score performance. Open-source models like LLaVA-OneVision and Qwen2-VL performed comparably to one another, and their performance did not change with CoT.
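
For context on these metrics: Vinoground follows Winoground-style scoring, where a model earns the text score by matching each video to its true caption, the video score by matching each caption to its true video, and the group score only when both succeed. A minimal sketch, with `pick_caption` and `pick_video` as hypothetical stand-ins for querying the model under test:

```python
# Sketch of Winoground-style text/video/group scoring over counterfactual
# pairs (video_a, video_b, caption_a, caption_b). `pick_caption(video, c1, c2)`
# and `pick_video(caption, v1, v2)` are hypothetical callables that return
# the model's choice.

def score(pairs, pick_caption, pick_video):
    text = video = group = 0
    for va, vb, ca, cb in pairs:
        # Text score: for each video, the model must prefer its true caption.
        t = pick_caption(va, ca, cb) == ca and pick_caption(vb, ca, cb) == cb
        # Video score: for each caption, the model must prefer its true video.
        v = pick_video(ca, va, vb) == va and pick_video(cb, va, vb) == vb
        text += t
        video += v
        group += t and v  # Group score: both directions must be correct.
    n = len(pairs)
    return {"text": text / n, "video": video / n, "group": group / n}
```

Under this scheme, random guessing yields 25% on the text and video scores (two independent binary choices per pair), which puts the below-random CLIP-based results in context.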

Experts say we need AI that is (1) accurate and reliable, (2) energy-efficient, and (3) customizable, in that order of priority. Developers claim their models are reliable and perform at human parity, but benchmarks like Vinoground give the AI community and LLM developers a reality check and reason to reconsider those claims.


Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project.


