cs.AI updates on arXiv.org, April 1
Transfer-learning for video classification: Video Swin Transformer on multiple domains

This paper studies how well Video Swin Transformer (VST) generalizes in out-of-domain settings. Using a transfer-learning approach from Kinetics-400, VST is evaluated on two large-scale datasets, FCVID and Something-Something. On FCVID, VST reaches 85% top-1 accuracy without retraining the whole model, matching the state of the art for the dataset; on Something-Something, accuracy is only 21%. The results show that VST generalizes well when the target classes are of the same type as the classes used to train the model, as in the transfer from Kinetics-400 to FCVID, but that accuracy after transfer learning drops when the class types differ, as in the transfer from Kinetics-400 to Something-Something.

🎬 Video Swin Transformer (VST) is a pure-transformer model for video classification that achieves state-of-the-art accuracy and efficiency on several datasets.

🔄 The study uses a transfer-learning approach: VST is pretrained on Kinetics-400 and then applied to two large-scale datasets, FCVID and Something-Something. This approach requires around 4x less memory than training from scratch.

🎯 On FCVID, VST reaches 85% top-1 accuracy without retraining the whole model, on par with the state of the art for the dataset. This indicates that VST generalizes well to video classes similar to those in its training data.

📉 On Something-Something, VST reaches only 21% accuracy. The study suggests that when the target classes differ in type from the classes the model was trained on, accuracy after transfer learning is poor: Kinetics-400 classes mostly represent objects, while Something-Something classes mostly represent actions.
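The transfer-learning setup summarized above keeps the pretrained backbone frozen and trains only a new classification head on the target dataset, which is why the whole model never needs retraining. A minimal sketch of that idea follows; the backbone here is a toy stand-in (the real one would be a Kinetics-400-pretrained VST), so all names, dimensions, and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_backbone(videos):
    """Stand-in for a pretrained, frozen feature extractor (a real VST
    would go here). Its weights are fixed and never updated."""
    w = np.linspace(-1.0, 1.0, videos.shape[1] * 16).reshape(videos.shape[1], 16)
    return np.tanh(videos @ w)

def train_linear_head(features, labels, n_classes, lr=0.5, epochs=200):
    """Train only a softmax classification head on the frozen features."""
    W = np.zeros((features.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(labels)          # cross-entropy gradient
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Toy "videos": two separable clusters standing in for two target classes.
videos = np.vstack([rng.normal(-1, 0.3, (50, 8)), rng.normal(1, 0.3, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)

feats = frozen_backbone(videos)                 # backbone stays frozen
W, b = train_linear_head(feats, labels, n_classes=2)
preds = (feats @ W + b).argmax(axis=1)
top1 = (preds == labels).mean()
```

Because only `W` and `b` are optimized while the backbone's activations can be precomputed once, this style of transfer learning is far cheaper in memory and compute than end-to-end training, consistent with the ~4x memory saving the paper reports.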

arXiv:2210.09969v2 Announce Type: replace-cross Abstract: The computer vision community has seen a shift from convolutional-based to pure transformer architectures for both image and video tasks. Training a transformer from zero for these tasks usually requires a lot of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification which achieves state-of-the-art results in accuracy and efficiency on several datasets. In this paper, we aim to understand if VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, namely FCVID and Something-Something using a transfer learning approach from Kinetics-400, which requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show an 85% top-1 accuracy on FCVID without retraining the whole model which is equal to the state-of-the-art for the dataset and a 21% accuracy on Something-Something. The experiments also suggest that the performance of the VST decreases on average when the video duration increases which seems to be a consequence of a design choice of the model. From the results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are from the same type as the classes used to train the model. We observed this effect when we performed transfer-learning from Kinetics-400 to FCVID, where most datasets target mostly objects. On the other hand, if the classes are not from the same type, then the accuracy after the transfer-learning approach is expected to be poor. We observed this effect when we performed transfer-learning from Kinetics-400, where the classes represent mostly objects, to Something-Something, where the classes represent mostly actions.
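The top-1 accuracy figures quoted throughout (85% on FCVID, 21% on Something-Something) measure the fraction of videos whose single highest-scoring predicted class matches the ground-truth label. A minimal sketch of the metric, with made-up scores for illustration:

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of samples whose argmax prediction equals the label."""
    return float((np.asarray(logits).argmax(axis=1) == np.asarray(labels)).mean())

# Illustrative class scores for 4 videos over 3 classes.
logits = [
    [2.0, 0.1, 0.3],   # argmax -> class 0
    [0.2, 1.5, 0.9],   # argmax -> class 1
    [0.1, 0.4, 2.2],   # argmax -> class 2
    [1.1, 0.2, 0.5],   # argmax -> class 0
]
labels = [0, 1, 2, 1]  # last prediction is wrong -> 3/4 correct
acc = top1_accuracy(logits, labels)  # 0.75
```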

