MarkTechPost@AI · January 16
ByteDance Researchers Introduce Tarsier2: A Large Vision-Language Model (LVLM) with 7B Parameters, Designed to Address the Core Challenges of Video Understanding

ByteDance researchers have introduced Tarsier2, a large vision-language model with 7 billion parameters designed to address the core challenges of video understanding. Tarsier2 excels at generating detailed video descriptions, surpassing models such as GPT-4o and Gemini-1.5-Pro, and also delivers strong performance on tasks such as question-answering, grounding, and embodied intelligence. By scaling the pre-training dataset to 40 million video-text pairs and applying techniques such as fine-grained temporal alignment and Direct Preference Optimization (DPO), Tarsier2 achieves substantial performance gains. On the DREAM-1K dataset, it outperforms GPT-4o by 2.8% and Gemini-1.5-Pro by 5.8% in F1 score.

🎬 Tarsier2's core architecture consists of a vision encoder, a vision adaptor, and a large language model, trained in a three-stage pipeline: pre-training, supervised fine-tuning, and Direct Preference Optimization.

⏱️ The pre-training stage uses a dataset of 40 million video-text pairs, including commentary videos that capture both low-level actions and high-level plot details, laying the foundation for learning. During supervised fine-tuning, fine-grained temporal alignment enables the model to accurately associate events with the corresponding video frames, reducing hallucination and improving precision.

🎯 The Direct Preference Optimization (DPO) stage uses automatically generated preference data to refine the model's decision-making and further reduce hallucination. These advances not only improve video description generation but also increase the model's overall versatility across video tasks.

🏆 Tarsier2 achieves notable results on multiple benchmarks: on DREAM-1K it is the first model to exceed 40% overall recall, and it sets new performance records on 15 public benchmarks. On E.T. Bench-Grounding, Tarsier2 reaches a mean F1 score of 35.5%, highlighting its temporal understanding capability.

Video understanding has long presented unique challenges for AI researchers. Unlike static images, videos involve intricate temporal dynamics and spatial-temporal reasoning, making it difficult for models to generate meaningful descriptions or answer context-specific questions. Issues like hallucination, where models fabricate details, further compromise the reliability of existing systems. Despite advancements with models such as GPT-4o and Gemini-1.5-Pro, achieving human-level video comprehension remains a complex task. Accurate event perception and sequence understanding, coupled with reducing hallucination, are crucial hurdles to overcome.

ByteDance researchers have introduced Tarsier2, a large vision-language model (LVLM) with 7 billion parameters, designed to address the core challenges of video understanding. Tarsier2 excels in generating detailed video descriptions, surpassing models like GPT-4o and Gemini-1.5-Pro. Beyond video descriptions, it demonstrates strong performance in tasks such as question-answering, grounding, and embodied intelligence. With an expanded pre-training dataset of 40 million video-text pairs, fine-grained temporal alignment, and Direct Preference Optimization (DPO) during training, Tarsier2 achieves noteworthy improvements. For example, on the DREAM-1K dataset, it outperforms GPT-4o by 2.8% and Gemini-1.5-Pro by 5.8% in F1 scores.

Technical Innovations and Benefits

Tarsier2 integrates several technical advancements to enhance performance. The model’s architecture includes a vision encoder, vision adaptor, and a large language model, combined in a three-stage training process:

1. Pre-training: A dataset of 40 million video-text pairs, enriched with commentary videos that capture both low-level actions and high-level plot details, provides a solid foundation for learning.
2. Supervised Fine-Tuning (SFT): Fine-grained temporal alignment during this stage ensures the model accurately associates events with corresponding video frames, reducing hallucination and improving precision.
3. Direct Preference Optimization (DPO): This phase employs automatically generated preference data to refine the model's decision-making and minimize hallucinations (see the sketch below).
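To make the DPO stage more concrete, here is a minimal sketch of the standard Direct Preference Optimization objective applied to pairs of preferred and dispreferred video descriptions. The function name and the preference-pair format are illustrative assumptions; this is not the released Tarsier2 training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over summed sequence log-probabilities.

    'chosen' is the preferred description (e.g., fewer hallucinated events),
    'rejected' is the dispreferred one, each scored by the trainable policy
    and by a frozen reference model.
    """
    # How much more (or less) likely each response is under the policy vs. the reference
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to widen the margin between chosen and rejected responses
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Dummy batch of 4 preference pairs
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```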

These advancements not only improve the generation of detailed video descriptions but also enhance the model’s overall versatility across video-centric tasks.
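As a rough illustration of the vision encoder → vision adaptor → language model stack described above, here is a minimal, hypothetical forward pass in PyTorch. The class name, layer choices, and dimensions are stand-ins for illustration only, not the actual Tarsier2 implementation (which pairs a full vision transformer with a 7B-parameter language model).

```python
import torch
import torch.nn as nn

class MiniLVLM(nn.Module):
    """Toy vision-encoder -> vision-adaptor -> language-model pipeline."""

    def __init__(self, frame_dim=768, vis_dim=256, llm_dim=512, vocab_size=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(frame_dim, vis_dim)    # stand-in for a ViT
        self.vision_adaptor = nn.Linear(vis_dim, llm_dim)      # projects visual tokens into LLM space
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the 7B language model
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, frame_feats, text_embeds):
        # frame_feats: (batch, num_frames, frame_dim); text_embeds: (batch, seq_len, llm_dim)
        visual_tokens = self.vision_adaptor(self.vision_encoder(frame_feats))
        sequence = torch.cat([visual_tokens, text_embeds], dim=1)  # prepend visual tokens to the text
        return self.lm_head(self.llm(sequence))                    # next-token logits

model = MiniLVLM()
logits = model(torch.randn(2, 16, 768), torch.randn(2, 32, 512))  # 2 clips, 16 frames, 32 text tokens
```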

Results and Insights

Tarsier2 achieves impressive results across multiple benchmarks. Human evaluations reveal an 8.6% performance advantage over GPT-4o and a 24.9% improvement over Gemini-1.5-Pro. On the DREAM-1K benchmark, it becomes the first model to exceed a 40% overall recall score, highlighting its ability to detect and describe dynamic actions comprehensively. Furthermore, it sets new performance records on 15 public benchmarks, including tasks like video question-answering and temporal reasoning. In the E.T. Bench-Grounding test, Tarsier2 achieves the highest mean F1 score of 35.5%, underlining its capabilities in temporal understanding. Ablation studies further underscore the critical role of the expanded pre-training dataset and DPO phase in enhancing performance metrics like F1 scores and accuracy.
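For context on the recall and F1 numbers above, event-level description benchmarks score a generated description by how many reference events it covers (recall) versus how many of its claimed events are actually present (precision). The sketch below uses simple set matching of event labels purely for illustration; it is an assumption, not the actual scoring protocol of DREAM-1K or E.T. Bench.

```python
def precision_recall_f1(predicted_events: set, reference_events: set):
    """Event-level precision/recall/F1 given sets of matched event labels."""
    true_positives = len(predicted_events & reference_events)
    precision = true_positives / len(predicted_events) if predicted_events else 0.0
    recall = true_positives / len(reference_events) if reference_events else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# A description covering 3 of 4 reference events, plus 1 hallucinated event
p, r, f1 = precision_recall_f1(
    {"open door", "pick up cup", "wave", "jump"},      # events the model described
    {"open door", "pick up cup", "wave", "sit down"})  # ground-truth events
print(p, r, f1)  # 0.75 0.75 0.75
```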

Conclusion

Tarsier2 marks a significant step forward in video understanding by addressing key challenges such as temporal alignment, hallucination reduction, and data scarcity. ByteDance researchers have delivered a model that not only outperforms leading alternatives in key metrics but also provides a scalable framework for future advancements. As video content continues to dominate digital media, models like Tarsier2 hold immense potential for applications ranging from content creation to intelligent surveillance.


Check out the Paper. All credit for this research goes to the researchers of this project.


