MarkTechPost@AI · January 13
This AI Paper Introduces Toto: Autoregressive Video Models for Unified Image and Video Pre-Training Across Diverse Tasks
A research team from Meta FAIR and UC Berkeley has introduced the Toto family of autoregressive video models, which addresses the limitations of traditional approaches by treating videos as sequences of discrete visual tokens and applying a causal transformer architecture to predict subsequent tokens. The models are trained on a unified dataset of more than one trillion image and video tokens, allowing image and video training to be combined seamlessly. Toto uses dVAE tokenization together with techniques such as RMSNorm and RoPE embeddings to improve performance. Experiments show that the models perform well across image classification, action recognition, video tracking, and robot manipulation, demonstrating their versatility and scalability.

🎬 Toto's core innovation is to treat video as a sequence of discrete visual tokens and process it with a causal Transformer architecture, effectively capturing the spatiotemporal relationships within video.

🖼️ The model uses dVAE tokenization to convert images and video frames into token sequences drawn from an 8k vocabulary; each frame is resized and tokenized into 256 tokens, letting the model handle image and video data uniformly.

🚀 Trained on the ImageNet and HowTo100M datasets, Toto achieves strong results across image classification, action recognition, and video tracking, surpassing previous generative models.

🤖 Toto is more efficient on robot manipulation tasks: for example, the Toto-base model completes a cube-picking task on a Franka robot with 63% accuracy, demonstrating its potential in real-world applications.

Autoregressive pre-training has proved revolutionary in machine learning, especially for sequential data. Predicting the next element of a sequence has been highly effective in natural language processing and is increasingly being explored in computer vision. Video modeling remains comparatively underexplored, leaving room for extensions into action recognition, object tracking, and robotics applications. These developments are driven by growing datasets and by innovations in transformer architectures that treat visual inputs as structured tokens suitable for autoregressive training.

Modeling videos poses unique challenges due to their temporal dynamics and redundancy. Unlike text, which has a clear sequential order, consecutive video frames often carry redundant information, making it difficult to tokenize them and learn useful representations. An effective video model must overcome this redundancy while capturing spatiotemporal relationships across frames. Most frameworks have focused on image-based representations, leaving video-specific architectures under-optimized. The task demands new methods that balance efficiency and performance, particularly for video forecasting and robotic manipulation.

Visual representation learning via convolutional networks and masked autoencoders has been effective for image tasks, but such approaches often fall short on video because they cannot fully express temporal dependencies. Tokenization methods such as dVAE and VQGAN convert visual information into discrete tokens. These have shown effectiveness, but scaling them becomes challenging for mixed datasets of images and videos, and patch-based tokenization does not generalize efficiently across the variety of tasks that video entails.
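The core idea behind dVAE- and VQGAN-style tokenizers mentioned above is vector quantization: each continuous patch embedding is replaced by the index of its nearest codebook vector. Below is a minimal numpy sketch of that lookup step; the sizes, the random codebook, and the `quantize` helper are illustrative assumptions, not the paper's actual tokenizer (whose vocabulary is 8k).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 8, 4                      # toy sizes; real dVAE vocab is 8k
codebook = rng.normal(size=(vocab_size, dim))

def quantize(patch_embeddings: np.ndarray) -> np.ndarray:
    """Map (N, dim) continuous embeddings to (N,) discrete token ids."""
    # squared L2 distance from every embedding to every codebook entry
    d2 = ((patch_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                # nearest codebook index per patch

patches = rng.normal(size=(16, dim))        # e.g. one frame as 16 patch embeddings
tokens = quantize(patches)
print(tokens.shape)                         # (16,) — one token id per patch
```

Once frames are reduced to such integer ids, images and videos become ordinary token sequences that a language-model-style transformer can consume.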

A research team from Meta FAIR and UC Berkeley has introduced the Toto family of autoregressive video models. The models address the limitations of traditional methods by treating videos as sequences of discrete visual tokens and applying causal transformer architectures to predict subsequent tokens. By training on a unified dataset containing more than one trillion image and video tokens, the researchers were able to combine image and video training seamlessly. This unified approach let the team exploit the strengths of autoregressive pretraining in both domains.
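The unified objective described above can be sketched very simply: once every frame is a run of discrete tokens, an image contributes one frame and a video several, and both train under the same next-token prediction target. The helper below is a hedged illustration of that sequence construction, not the paper's actual data pipeline; the vocabulary size (8192) and 256 tokens per frame follow the figures quoted in this article.

```python
import numpy as np

TOKENS_PER_FRAME = 256          # per the article: each frame -> 256 tokens

def make_training_pair(frame_tokens: list[np.ndarray]):
    """Concatenate per-frame tokens and shift by one for next-token targets."""
    seq = np.concatenate(frame_tokens)
    inputs, targets = seq[:-1], seq[1:]   # predict token t+1 from tokens <= t
    return inputs, targets

rng = np.random.default_rng(0)
image = [rng.integers(0, 8192, TOKENS_PER_FRAME)]                     # 1 frame
video = [rng.integers(0, 8192, TOKENS_PER_FRAME) for _ in range(4)]   # 4 frames

for name, sample in [("image", image), ("video", video)]:
    x, y = make_training_pair(sample)
    print(name, x.shape, y.shape)
```

Because both modalities reduce to the same `(inputs, targets)` pairs, a single causal transformer can be trained on a mixed corpus without modality-specific losses.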

The Toto models use dVAE tokenization with an 8k-token vocabulary to process images and video frames. Each frame is resized and tokenized separately, yielding sequences of 256 tokens. These tokens are then processed by a causal transformer that uses RMSNorm and RoPE embeddings to improve model performance. Training was done on the ImageNet and HowTo100M datasets, tokenizing at a resolution of 128×128 pixels. The researchers also optimized the models for downstream tasks by replacing average pooling with attention pooling, which yields higher-quality representations.
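The attention-pooling swap mentioned above can be illustrated in a few lines: instead of averaging all token features uniformly, a learned query attends over them and returns a weighted sum. The numpy sketch below is a simplified single-head version under assumed shapes (256 tokens of dimension 32); the real model applies this to transformer hidden states.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
num_tokens, dim = 256, 32
features = rng.normal(size=(num_tokens, dim))   # one frame's token features
query = rng.normal(size=(dim,))                 # learned pooling query

# attention pooling: score every token against the query, weight, and sum
weights = softmax(features @ query / np.sqrt(dim))   # (256,), sums to 1
pooled = weights @ features                          # (32,) weighted summary

mean_pooled = features.mean(axis=0)   # the average-pooling baseline it replaces
print(pooled.shape, mean_pooled.shape)
```

The design intuition is that for recognition tasks only a few tokens carry the decisive evidence, and a learned query can up-weight them where a uniform mean cannot.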

The models perform well across the benchmarks. On ImageNet classification, the largest Toto model achieved a top-1 accuracy of 75.3%, outperforming other generative models such as MAE and iGPT. On the Kinetics-400 action recognition task, the models achieve a top-1 accuracy of 74.4%, demonstrating their ability to capture complex temporal dynamics. On the DAVIS dataset for semi-supervised video tracking, the models obtain J&F scores of up to 62.4, improving over previous state-of-the-art results from DINO and MAE. On robotics tasks such as object manipulation, Toto models learn much faster and are more sample-efficient: the Toto-base model, for example, completes a real-world cube-picking task on a Franka robot with 63% accuracy. Overall, these results speak to the versatility and scalability of the proposed models across diverse applications.

The work marks a significant advance in video modeling by addressing redundancy and tokenization challenges. The researchers showed "through unified training on both images and videos, that this form of autoregressive pretraining is generally effective across a range of tasks." The innovative architecture and tokenization strategies provide a baseline for further research on dense prediction and recognition, and a meaningful step toward unlocking the full potential of video modeling in real-world applications.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.




Tags: Autoregressive models · Video modeling · Transformer · Artificial intelligence · Toto