cs.AI updates on arXiv.org, April 1
Transfer-learning for video classification: Video Swin Transformer on multiple domains

This paper studies how well Video Swin Transformer (VST) generalizes in out-of-domain settings. Using a transfer-learning approach from Kinetics-400, VST is evaluated on two large-scale datasets, FCVID and Something-Something. On FCVID, VST reaches 85% top-1 accuracy without retraining the whole model, matching the state of the art for the dataset; on Something-Something, accuracy is only 21%. The results show that VST generalizes well when the target classes are of the same type as the classes used to train the model, as in the transfer from Kinetics-400 to FCVID, but that accuracy after transfer learning drops when the class types differ, as in the transfer from Kinetics-400 to Something-Something.

🎬 Video Swin Transformer (VST) is a pure-transformer model for video classification that achieves state-of-the-art accuracy and efficiency on several datasets.

🔄 The study uses a transfer-learning approach: VST is pretrained on Kinetics-400 and then applied to two large-scale datasets, FCVID and Something-Something. This approach requires around 4x less memory than training from scratch.

🎯 On FCVID, VST reaches 85% top-1 accuracy without retraining the whole model, on par with the state of the art for the dataset. This indicates that VST generalizes well to video classes similar to those in its training data.

📉 On Something-Something, VST reaches only 21% accuracy. The study suggests that when the target classes differ in type from the classes the model was trained on, accuracy after transfer learning is poor: Kinetics-400 classes mostly represent objects, while Something-Something classes mostly represent actions.
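The transfer-learning setup summarized above keeps the pretrained backbone frozen and trains only a new classification head on the target dataset, which is why the whole model never needs retraining. A minimal sketch of that idea follows; the backbone here is a toy stand-in (the real one would be a Kinetics-400-pretrained VST), so all names, dimensions, and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_backbone(videos):
    """Stand-in for a pretrained, frozen feature extractor (a real VST
    would go here). Its weights are fixed and never updated."""
    w = np.linspace(-1.0, 1.0, videos.shape[1] * 16).reshape(videos.shape[1], 16)
    return np.tanh(videos @ w)

def train_linear_head(features, labels, n_classes, lr=0.5, epochs=200):
    """Train only a softmax classification head on the frozen features."""
    W = np.zeros((features.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(labels)          # cross-entropy gradient
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Toy "videos": two separable clusters standing in for two target classes.
videos = np.vstack([rng.normal(-1, 0.3, (50, 8)), rng.normal(1, 0.3, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)

feats = frozen_backbone(videos)                 # backbone stays frozen
W, b = train_linear_head(feats, labels, n_classes=2)
preds = (feats @ W + b).argmax(axis=1)
top1 = (preds == labels).mean()
```

Because only `W` and `b` are optimized while the backbone's activations can be precomputed once, this style of transfer learning is far cheaper in memory and compute than end-to-end training, consistent with the ~4x memory saving the paper reports.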

arXiv:2210.09969v2 Announce Type: replace-cross Abstract: The computer vision community has seen a shift from convolutional-based to pure transformer architectures for both image and video tasks. Training a transformer from zero for these tasks usually requires a lot of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification which achieves state-of-the-art results in accuracy and efficiency on several datasets. In this paper, we aim to understand if VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, namely FCVID and Something-Something using a transfer learning approach from Kinetics-400, which requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show an 85% top-1 accuracy on FCVID without retraining the whole model which is equal to the state-of-the-art for the dataset and a 21% accuracy on Something-Something. The experiments also suggest that the performance of the VST decreases on average when the video duration increases which seems to be a consequence of a design choice of the model. From the results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are from the same type as the classes used to train the model. We observed this effect when we performed transfer-learning from Kinetics-400 to FCVID, where most datasets target mostly objects. On the other hand, if the classes are not from the same type, then the accuracy after the transfer-learning approach is expected to be poor. We observed this effect when we performed transfer-learning from Kinetics-400, where the classes represent mostly objects, to Something-Something, where the classes represent mostly actions.
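The top-1 accuracy figures quoted throughout (85% on FCVID, 21% on Something-Something) measure the fraction of videos whose single highest-scoring predicted class matches the ground-truth label. A minimal sketch of the metric, with made-up scores for illustration:

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of samples whose argmax prediction equals the label."""
    return float((np.asarray(logits).argmax(axis=1) == np.asarray(labels)).mean())

# Illustrative class scores for 4 videos over 3 classes.
logits = [
    [2.0, 0.1, 0.3],   # argmax -> class 0
    [0.2, 1.5, 0.9],   # argmax -> class 1
    [0.1, 0.4, 2.2],   # argmax -> class 2
    [1.1, 0.2, 0.5],   # argmax -> class 0
]
labels = [0, 1, 2, 1]  # last prediction is wrong -> 3/4 correct
acc = top1_accuracy(logits, labels)  # 0.75
```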

