MarkTechPost@AI, October 23, 2024
This AI Paper Introduces a Unified Perspective on the Relationship between Latent Space and Generative Models

🤔 DiGIT is a new generative model that achieves a more stable latent space, and thereby better image generation quality, by decoupling the training of the encoder and decoder and training the encoder with a discriminative self-supervised model. Unlike traditional autoencoder approaches, DiGIT adopts a novel strategy: it converts the encoder's latent feature space into discrete tokens and uses a causal Transformer to predict the next token, achieving excellent performance on ImageNet.

💡 The study shows that image autoregressive models can work much like GPT models in natural language processing, underscores the importance of a stable latent space, and points to a new direction for the generative pre-training of image autoregressive models.

🚀 DiGIT's results show that a smaller token grid yields higher accuracy and that increasing the number of K-means clusters improves model performance, further demonstrating the advantage of a larger vocabulary in autoregressive modeling.

🧐 The results challenge the common belief that strong reconstruction implies a latent space that is effective for autoregressive generation.

💪 The work aims to rekindle interest in the generative pre-training of image autoregressive models, encourage a reevaluation of the fundamental components that define the latent space of generative models, and take a step toward new techniques and methods.

In recent years, the field of image generation has changed dramatically, largely driven by latent-based generative models such as Latent Diffusion Models (LDMs) and Mask Image Models (MIMs). Reconstructive autoencoders such as VQGAN and VAE compress images into a low-dimensional latent space, which allows these models to produce highly realistic images. Given the major influence of autoregressive (AR) generative models, such as Large Language Models in natural language processing (NLP), it is natural to ask whether similar approaches can work for images. Yet even when autoregressive models operate on the same latent space as LDMs and MIMs, they still lag behind in image generation. This stands in sharp contrast to NLP, where the autoregressive GPT family has achieved clear dominance.

Current methods like LDMs and MIMs use reconstructive autoencoders, such as VQGAN and VAE, to transform images into a latent space. However, these approaches face challenges with stability and performance. In the VQGAN model, for example, as image reconstruction quality improves (indicated by a lower FID score), overall generation quality can actually decline. To address these issues, researchers have proposed a new method called Discriminative Generative Image Transformer (DiGIT). Unlike traditional autoencoder approaches, DiGIT separates the training of encoders and decoders, starting with encoder-only training through a discriminative self-supervised model.

A team of researchers from the School of Data Science and the School of Computer Science and Technology at the University of Science and Technology of China, together with the State Key Laboratory of Cognitive Intelligence and Zhejiang University, propose the Discriminative Generative Image Transformer (DiGIT). The method separates the training of encoders and decoders, beginning with encoder-only training through a discriminative self-supervised model. This strategy enhances the stability of the latent space, making it more robust for autoregressive modeling. Inspired by VQGAN, they convert the encoder's latent feature space into discrete tokens using K-means clustering. The research suggests that image autoregressive models can operate similarly to GPT models in natural language processing. The main contributions of this work include a unified perspective on the relationship between latent space and generative models, emphasizing the importance of stable latent spaces; a novel method that separates the training of encoders and decoders to stabilize the latent space; and an effective discrete image tokenizer that enhances the performance of image autoregressive models.
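To make the tokenizer idea concrete, here is a minimal sketch (not the authors' code): patch features from a frozen self-supervised encoder are clustered with K-means, and each patch is then represented by the index of its nearest centroid. The feature dimension, cluster count, and the random stand-in features below are illustrative assumptions only.

```python
# Sketch of a K-means "codebook" tokenizer over frozen encoder features.
# In the paper, patch features come from a discriminative self-supervised
# encoder; here random vectors stand in purely to show the mechanics.
import numpy as np
from sklearn.cluster import KMeans

feature_dim = 768      # assumed patch-feature dimension (e.g., a ViT encoder)
num_clusters = 256     # assumed codebook size (vocabulary of discrete tokens)

# Stand-in for patch features collected from many training images.
patch_features = np.random.randn(10_000, feature_dim).astype(np.float32)

# Fit K-means once; the centroids play the role of a fixed codebook.
codebook = KMeans(n_clusters=num_clusters, n_init=1, random_state=0).fit(patch_features)

# Tokenize one image: each patch is assigned the id of its nearest centroid.
image_patches = np.random.randn(196, feature_dim).astype(np.float32)  # e.g., a 14x14 grid
tokens = codebook.predict(image_patches)   # shape (196,), integer token ids
```

The resulting integer grid is what the autoregressive model is trained on, while a decoder that maps tokens back to pixels is trained separately.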

Figure: The architecture of DiGIT

During inference, each image patch is matched to the nearest token in the codebook. After training a causal Transformer to predict the next token over these sequences, the researchers obtained strong results on ImageNet. The DiGIT model surpasses previous techniques in both image understanding and image generation, demonstrating that a smaller token grid can lead to higher accuracy. The experiments also highlight the effectiveness of the proposed discriminative tokenizer, which boosts model performance significantly as the number of parameters increases. Finally, the study found that increasing the number of K-means clusters improves accuracy, reinforcing the advantage of a larger vocabulary in autoregressive modeling.
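The next-token objective itself is the standard GPT-style one. The sketch below (again an illustration, not the authors' implementation; the model size, vocabulary, and sequence length are assumed values) trains a small causal Transformer so that each position predicts the token of the following patch.

```python
# Sketch: GPT-style next-token prediction over discrete patch tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, seq_len, d_model = 256, 196, 512        # assumed values, not the paper's

embed = nn.Embedding(vocab_size, d_model)                  # embedding for patch token ids
pos = nn.Parameter(torch.zeros(1, seq_len - 1, d_model))   # learned positions for the inputs
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=6)
head = nn.Linear(d_model, vocab_size)

params = [pos, *embed.parameters(), *backbone.parameters(), *head.parameters()]
optimizer = torch.optim.AdamW(params, lr=3e-4)

# Additive causal mask: each position may only attend to itself and earlier patches.
causal_mask = torch.triu(torch.full((seq_len - 1, seq_len - 1), float("-inf")), diagonal=1)

def training_step(tokens):
    """tokens: (batch, seq_len) integer ids produced by the K-means tokenizer."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # each position predicts the next token
    h = backbone(embed(inputs) + pos, mask=causal_mask)
    loss = F.cross_entropy(head(h).reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with a dummy batch of token sequences:
loss = training_step(torch.randint(0, vocab_size, (8, seq_len)))
```

Generation would then proceed by sampling tokens autoregressively and mapping them back to pixels with the separately trained decoder.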

In conclusion, this paper presents a unified view of how latent space and generative models are related, highlighting the importance of a stable latent space in image generation and introducing a simple yet effective image tokenizer together with an autoregressive generative model called DiGIT. The results also challenge the common belief that strong reconstruction implies an effective latent space for autoregressive generation. Through this work, the researchers aim to rekindle interest in the generative pre-training of image autoregressive models, encourage a reevaluation of the fundamental components that define latent space for generative models, and take a step towards new technologies and methods.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
