MarkTechPost@AI · December 26, 2024
CoordTok: A Scalable Video Tokenizer that Learns a Mapping from Co-ordinate-based Representations to the Corresponding Patches of Input Videos

CoordTok is a new video tokenization approach designed to address the challenges of processing long videos. It learns a mapping from coordinate-based representations to video patches, encoding a video into factorized triplane representations and reconstructing the patches that correspond to randomly sampled spatio-temporal coordinates. This significantly reduces memory and computational costs while preserving video quality. With a hierarchical architecture, CoordTok captures both local and global features of a video and handles long videos more effectively. Experiments show that CoordTok encodes long videos with far fewer tokens than conventional methods while maintaining high reconstruction quality, providing strong support for the efficient training of video generation models.

💡 CoordTok achieves efficient video tokenization by learning a mapping from coordinate-based representations to video patches, encoding a video into factorized triplane representations and reconstructing the patches corresponding to randomly sampled spatio-temporal coordinates.

⚙️ CoordTok adopts a hierarchical architecture that captures both local and global features of a video, making long-video processing more efficient and reducing memory and computational costs.

📉 CoordTok needs far fewer tokens than conventional methods when encoding long videos: for a 128-frame video at 128×128 resolution, it uses only 1,280 tokens, whereas conventional methods require 6,144 or 8,192, substantially reducing the demand for computational resources.

🖼️ Fine-tuning with a combination of ℓ2 loss and LPIPS loss further improves reconstruction quality; the CoordTok-L model, for example, reaches a PSNR of 26.9, ensuring high-quality video reconstruction.

Breaking down videos into smaller, meaningful parts for vision models remains challenging, particularly for long videos. Vision models rely on these smaller parts, called tokens, to process and understand video data, but creating these tokens efficiently is difficult. While recent tools achieve better video compression than older methods, they struggle to handle large video datasets effectively. A key issue is their inability to fully exploit temporal coherence, the natural pattern where video frames are often similar over short periods, which video codecs use for efficient compression. These tools are also computationally expensive to train and limited to short clips, making them poorly suited to capturing long-range patterns and processing longer videos.

Current video tokenization methods have high computational costs and struggle to handle long video sequences efficiently. Early approaches used image tokenizers to compress videos frame by frame but ignored the natural continuity between frames, reducing their effectiveness. Later methods introduced spatiotemporal layers, reduced redundancy, and used adaptive encoding, but they still required rebuilding entire video frames during training, which limited them to short clips. Video generation models like autoregressive methods, masked generative transformers, and diffusion models are also limited to short sequences. 

To solve this, researchers from KAIST and UC Berkeley proposed CoordTok, which learns a mapping from coordinate-based representations to the corresponding patches of input videos. Motivated by recent advances in 3D generative models, CoordTok encodes a video into factorized triplane representations and reconstructs patches corresponding to randomly sampled (x, y, t) coordinates. This approach allows large tokenizer models to be trained directly on long videos without requiring excessive resources. The video is divided into space-time patches and processed using transformer layers, with the decoder mapping sampled (x, y, t) coordinates to corresponding pixels. This reduces both memory and computational costs while preserving video quality.
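For intuition, here is a minimal PyTorch sketch of the coordinate-based decoding idea: features are looked up in three factorized planes at a sampled (x, y, t) coordinate and decoded into a small patch. The class name, plane resolutions, feature dimensions, and the tiny MLP decoder are illustrative assumptions rather than the authors' implementation, and for brevity the triplane here is a free parameter, whereas in CoordTok it is produced by a transformer encoder from the video's space-time patches.

```python
# Minimal sketch of coordinate-based decoding from a factorized triplane.
# Assumption: plane sizes, feature dims, and the MLP decoder are illustrative;
# CoordTok's encoder/decoder use transformer layers over space-time patches.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TriplaneCoordDecoder(nn.Module):
    def __init__(self, feat_dim=64, spatial=32, temporal=32, patch_pixels=4 * 4 * 3):
        super().__init__()
        # Three factorized feature planes: (x, y), (x, t), (y, t).
        self.plane_xy = nn.Parameter(torch.randn(1, feat_dim, spatial, spatial) * 0.01)
        self.plane_xt = nn.Parameter(torch.randn(1, feat_dim, temporal, spatial) * 0.01)
        self.plane_yt = nn.Parameter(torch.randn(1, feat_dim, temporal, spatial) * 0.01)
        # A small MLP maps the fused coordinate feature to an RGB patch.
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, 256), nn.GELU(), nn.Linear(256, patch_pixels)
        )

    @staticmethod
    def _sample(plane, u, v):
        # u, v in [-1, 1]; grid_sample expects a (N, H_out, W_out, 2) grid.
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)
        feat = F.grid_sample(plane, grid, align_corners=True)  # (1, C, N, 1)
        return feat.squeeze(-1).squeeze(0).t()  # (N, C)

    def forward(self, coords):
        # coords: (N, 3) with columns (x, y, t), each normalized to [-1, 1].
        x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
        fused = torch.cat(
            [
                self._sample(self.plane_xy, x, y),
                self._sample(self.plane_xt, x, t),
                self._sample(self.plane_yt, y, t),
            ],
            dim=-1,
        )
        return self.mlp(fused)  # (N, patch_pixels): one reconstructed patch per coordinate


# Reconstruct patches only at randomly sampled (x, y, t) coordinates,
# so no full video frame has to be rebuilt during a training step.
decoder = TriplaneCoordDecoder()
coords = torch.rand(1024, 3) * 2 - 1  # 1024 random space-time coordinates
patches = decoder(coords)             # (1024, 48)
```

Because only the patches at sampled coordinates are reconstructed, a training step never has to rebuild entire frames, which is what keeps memory and compute manageable on long videos.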

Building on this, the researchers introduced a hierarchical architecture that captures both local and global features of a video. The hierarchy processes space-time patches with transformer layers to produce the factorized triplane representation, making long videos tractable without excessive memory or compute. For example, CoordTok encodes a 128-frame video at 128×128 resolution into 1,280 tokens, whereas baselines require 6,144 or 8,192 tokens to reach similar reconstruction quality. Fine-tuning with a combination of ℓ2 loss and LPIPS loss further improves the fidelity of the reconstructed frames. Together, these strategies cut memory usage by up to 50% and lower computational costs while maintaining high-quality video reconstruction, with models such as CoordTok-L reaching a PSNR of 26.9.
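To make the fine-tuning objective concrete, the sketch below combines a pixel-wise ℓ2 term with a perceptual LPIPS term via the `lpips` package. The VGG backbone, equal loss weighting, and 64×64 patch size are assumptions for illustration; the article only states that both losses are used.

```python
# Sketch of the fine-tuning objective: pixel-wise l2 reconstruction loss plus
# a perceptual LPIPS term. Assumptions: the `lpips` package (pip install lpips),
# a VGG backbone, equal weighting, and 64x64 RGB patches in [-1, 1].
import torch
import lpips

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance computed from VGG features


def finetune_loss(recon, target, lambda_lpips=1.0):
    # recon, target: (B, 3, H, W) reconstructed and ground-truth patches in [-1, 1].
    l2 = torch.mean((recon - target) ** 2)
    perceptual = lpips_fn(recon, target).mean()
    return l2 + lambda_lpips * perceptual


# Dummy tensors standing in for decoder outputs and ground-truth patches.
recon = (torch.rand(4, 3, 64, 64) * 2 - 1).requires_grad_()
target = torch.rand(4, 3, 64, 64) * 2 - 1
loss = finetune_loss(recon, target)
loss.backward()  # gradients would flow back into the tokenizer that produced `recon`
```

The ℓ2 term keeps reconstructions close to the ground-truth pixels, while the LPIPS term penalizes perceptual differences that plain pixel error misses, which is the stated reason the fine-tuning stage improves frame quality.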

In conclusion, CoordTok proves to be an efficient video tokenizer that uses coordinate-based representations to reduce computational costs and memory requirements when encoding long videos.

It enables memory-efficient training of video generation models, making it possible to handle long videos with far fewer tokens. However, the approach struggles with highly dynamic videos, and the authors point to potential improvements such as using multiple content planes or adaptive methods. This work can serve as a starting point for future research on scalable video tokenizers and generation, benefiting the understanding and generation of long videos.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.


