MarkTechPost@AI · January 18
Researchers from Meta AI and UT Austin Explored Scaling in Auto-Encoders and Introduced ViTok: A ViT-Style Auto-Encoder to Perform Exploration

Researchers from Meta and UT Austin have introduced ViTok, a Vision Transformer (ViT)-based auto-encoder designed to address the scaling limitations that traditional CNN encoders face in image and video processing. Built on a Transformer architecture and the Llama framework, ViTok supports large-scale image and video encoding and overcomes dataset constraints. The study focuses on bottleneck scaling, encoder scaling, and decoder scaling, aiming to optimize visual encoding and improve reconstruction accuracy and generative performance. ViTok excels at image and video reconstruction while remaining competitive on generation tasks, offering an efficient and accurate new approach to visual encoding.

🖼️ ViTok adopts a Vision Transformer (ViT)-based auto-encoder architecture. Unlike traditional CNN encoders, it leverages the strengths of Transformers for large-scale image and video encoding.

⚙️ ViTok examines three scaling axes (bottleneck scaling, encoder scaling, and decoder scaling) to optimize visual encoding. The study finds that enlarging the bottleneck improves reconstruction quality, but an overly large bottleneck can complicate generation tasks; scaling up the encoder yields little reconstruction benefit and may hurt generative performance; and scaling up the decoder improves reconstruction quality, though its benefits for generation are mixed.

📹 ViTok achieves state-of-the-art performance on image reconstruction (ImageNet-1K, COCO) and video reconstruction (UCF-101) benchmarks while remaining competitive at reduced computational cost, demonstrating its strengths in handling spatiotemporal data.

Modern image and video generation methods rely heavily on tokenization to encode high-dimensional data into compact latent representations. While advancements in scaling generator models have been substantial, tokenizers—primarily based on convolutional neural networks (CNNs)—have received comparatively less attention. This raises questions about how scaling tokenizers might improve reconstruction accuracy and generative tasks. Challenges include architectural limitations and constrained datasets, which affect scalability and broader applicability. There is also a need to understand how design choices in auto-encoders influence performance metrics such as fidelity, compression, and generation.

Researchers from Meta and UT Austin have addressed these issues by introducing ViTok, a Vision Transformer (ViT)-based auto-encoder. Unlike traditional CNN-based tokenizers, ViTok employs a Transformer-based architecture enhanced by the Llama framework. This design supports large-scale tokenization for images and videos, overcoming dataset constraints by training on extensive and diverse data.

ViTok focuses on three aspects of scaling:

- Bottleneck scaling: Examining the relationship between latent code size and performance.
- Encoder scaling: Evaluating the impact of increasing encoder complexity.
- Decoder scaling: Assessing how larger decoders influence reconstruction and generation.

These efforts aim to optimize visual tokenization for both images and videos by addressing inefficiencies in existing architectures.
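The three scaling axes above can be sketched as a small configuration object. This is an illustrative sketch only: the field names and example numbers are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class TokenizerScalingConfig:
    """Hypothetical knobs for the three scaling axes studied for ViTok."""

    latent_floats: int       # bottleneck scaling: total floating-point values E in the latent code
    encoder_params_m: int    # encoder scaling: encoder parameter budget (millions)
    decoder_params_m: int    # decoder scaling: decoder parameter budget (millions)

    def compression_ratio(self, input_floats: int) -> float:
        """How many input values map onto one latent value."""
        return input_floats / self.latent_floats


# Example: a 256x256 RGB image holds 256 * 256 * 3 = 196,608 values.
cfg = TokenizerScalingConfig(latent_floats=4096, encoder_params_m=86, decoder_params_m=300)
print(cfg.compression_ratio(256 * 256 * 3))  # 48.0
```

An asymmetric setup like this (small encoder, large decoder) mirrors the design trade-off the article describes: the bottleneck size `E` governs the compression/reconstruction balance independently of either network's capacity.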

Technical Details and Advantages of ViTok

ViTok uses an asymmetric auto-encoder framework with several distinctive features:

- Patch and Tubelet Embedding: Inputs are divided into patches (for images) or tubelets (for videos) to capture spatial and spatiotemporal details.
- Latent Bottleneck: The size of the latent space, defined by the number of floating points (E), determines the balance between compression and reconstruction quality.
- Encoder and Decoder Design: ViTok employs a lightweight encoder for efficiency and a more computationally intensive decoder for robust reconstruction.

By leveraging Vision Transformers, ViTok improves scalability. Its enhanced decoder incorporates perceptual and adversarial losses to produce high-quality outputs. Together, these components let ViTok compress visual data aggressively while reconstructing it with high fidelity.
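The patch and tubelet embedding step can be illustrated with a minimal NumPy sketch. The function names, patch sizes, and shapes below are illustrative assumptions, not the paper's implementation; a real ViT tokenizer would follow each split with a learned linear projection.

```python
import numpy as np


def patchify_image(img: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping p x p patches.

    Returns an (N, p*p*C) array: one flattened token per patch.
    """
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image dims must be divisible by patch size"
    patches = img.reshape(h // p, p, w // p, p, c).swapaxes(1, 2)
    return patches.reshape(-1, p * p * c)


def tubeletify_video(vid: np.ndarray, t: int, p: int) -> np.ndarray:
    """Split a (T, H, W, C) video into t x p x p spatiotemporal tubelets.

    Returns an (N, t*p*p*C) array: one flattened token per tubelet.
    """
    frames, h, w, c = vid.shape
    assert frames % t == 0 and h % p == 0 and w % p == 0
    tubes = vid.reshape(frames // t, t, h // p, p, w // p, p, c)
    tubes = tubes.transpose(0, 2, 4, 1, 3, 5, 6)  # group tubelets, then their contents
    return tubes.reshape(-1, t * p * p * c)


img = np.zeros((256, 256, 3))
print(patchify_image(img, 16).shape)        # (256, 768): 16x16 patches of 16*16*3 values
vid = np.zeros((16, 128, 128, 3))
print(tubeletify_video(vid, 4, 16).shape)   # (256, 3072): tubelets of 4*16*16*3 values
```

Tubelets extend patches along the time axis, which is how a single architecture can tokenize both images and videos, as the article notes.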

Results and Insights

ViTok’s performance was evaluated on benchmarks including ImageNet-1K and COCO for images and UCF-101 for videos. Key findings, highlighting ViTok’s strengths in efficiency and accuracy, include:

- Enlarging the bottleneck improves reconstruction quality, but an overly large bottleneck can complicate downstream generation.
- Scaling up the encoder yields little reconstruction benefit and can hurt generative performance.
- Scaling up the decoder improves reconstruction quality, though its benefits for generation are mixed.
- ViTok achieves state-of-the-art reconstruction on ImageNet-1K, COCO, and UCF-101 while remaining competitive at reduced computational cost.

Conclusion

ViTok offers a scalable, Transformer-based alternative to traditional CNN tokenizers, addressing key challenges in bottleneck design, encoder scaling, and decoder optimization. Its robust performance across reconstruction and generation tasks highlights its potential for a wide range of applications. By effectively handling both image and video data, ViTok underscores the importance of thoughtful architectural design in advancing visual tokenization.


Check out the Paper. All credit for this research goes to the researchers of this project.


