MarkTechPost@AI December 27, 2024
Microsoft and Tsinghua University Researchers Introduce Distilled Decoding: A New Method for Accelerating Image Generation in Autoregressive Models without Quality Loss

Researchers from Tsinghua University and Microsoft Research have proposed a new method called Distilled Decoding (DD) that accelerates image generation in autoregressive (AR) models while preserving image quality. Traditional AR models generate images slowly; DD uses flow matching to map noise inputs to the output distribution of a pre-trained AR model, training a lightweight network to predict the final data sequence directly. This drastically reduces the number of generation steps, achieving speed-ups of up to 217.8x. The method not only accelerates generation but also maintains image quality, opening new possibilities for AR models in real-time applications.

🚀 Autoregressive models excel at image generation, but their token-by-token generation process is slow, limiting deployment in real-time applications.

💡 Distilled Decoding (DD) uses flow matching to map noise inputs directly to the output of a pre-trained AR model, sharply reducing the number of generation steps.

✨ DD preserves image quality while accelerating generation, overcoming the speed-versus-quality trade-off of earlier acceleration methods, and performs consistently across different models such as VAR and LlamaGen.

⚙️ DD does not require the original AR model's training data, making it more practical to deploy, and lets users choose one-step or multi-step generation paths to flexibly balance speed and quality.

🎯 Experiments on ImageNet-256 show that DD achieves a 6.3x speed-up for VAR models and up to 217.8x for LlamaGen, with only a modest drop in image quality.

Autoregressive (AR) models have changed the field of image generation, setting new benchmarks in producing high-quality visuals. These models break down the image creation process into sequential steps, each token generated based on prior tokens, creating outputs with exceptional realism and coherence. Researchers have widely adopted AR techniques for computer vision, gaming, and digital content creation applications. However, the potential of AR models is often constrained by their inherent inefficiencies, particularly their slow generation process, which remains a significant hurdle in real-time applications.

A critical concern for AR models is speed. The token-by-token generation process is inherently sequential: each new token must wait for its predecessor to complete. This limits scalability and results in high latency during image generation. For instance, generating a 256×256 image with a traditional AR model like LlamaGen requires 256 steps, translating to approximately five seconds on modern GPUs. Such delays hinder deployment in applications that demand instantaneous results. Moreover, while AR models excel at maintaining output fidelity, they struggle to meet the growing demand for both speed and quality in large-scale implementations.
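The sequential bottleneck described above can be sketched in a few lines. This is a toy stand-in, not the actual LlamaGen sampler: a seeded random draw replaces the neural network, but the loop structure shows why the 256 steps cannot be parallelized.

```python
import random

def generate_tokens_ar(num_tokens=256, vocab_size=1024, seed=0):
    """Toy autoregressive loop: each token is produced only after all
    previous tokens exist, so the steps are strictly sequential.
    (Illustrative stand-in for a real AR image model such as LlamaGen.)"""
    rng = random.Random(seed)
    tokens = []
    for _ in range(num_tokens):
        # A real model would run a forward pass conditioned on `tokens` here;
        # we sample uniformly just to show the sequential dependency.
        next_token = rng.randrange(vocab_size)
        tokens.append(next_token)
    return tokens

tokens = generate_tokens_ar()
print(len(tokens))  # 256 sequential steps for one 16x16 token grid
```

Each iteration of a real sampler costs one full network forward pass, which is where the ~five-second latency comes from.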

Efforts to accelerate AR models have yielded various methods, such as predicting multiple tokens simultaneously or adopting masking strategies during generation. These approaches aim to reduce the required steps but often compromise the quality of the generated images. For example, in multi-token generation techniques, the assumption of conditional independence among tokens introduces artifacts, undermining the cohesiveness of the output. Similarly, masking-based methods allow for faster generation by training models to predict specific tokens based on others, but their effectiveness diminishes when generation steps are drastically reduced. These limitations highlight the need for a new approach to enhance AR model efficiency.

Researchers from Tsinghua University and Microsoft Research have introduced a solution to these challenges: Distilled Decoding (DD). The method builds on flow matching, a deterministic mapping that connects Gaussian noise to the output distribution of pre-trained AR models. Unlike conventional methods, DD does not require access to the AR model's original training data, making it more practical to deploy. The research demonstrated that DD can reduce the generation process from hundreds of steps to as few as one or two while preserving output quality. For example, on ImageNet-256, DD achieved a 6.3x speed-up for VAR models and an impressive 217.8x for LlamaGen, reducing generation steps from 256 to just one.

The technical foundation of DD is its ability to create a deterministic trajectory for token generation. Using flow matching, DD maps noisy inputs to token sequences whose distribution aligns with that of the pre-trained AR model. During training, this mapping is distilled into a lightweight network that can predict the final data sequence directly from a noise input. The result is faster generation with the flexibility to balance speed and quality by allowing intermediate steps when needed. Unlike existing methods, DD avoids the trade-off between speed and fidelity, enabling scalable implementations across diverse tasks.
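The one-step versus multi-step trade-off can be illustrated with a minimal sketch. Everything here is hypothetical scaffolding, not the paper's implementation: a seeded RNG stands in for the distilled lightweight network, and the chunking scheme is a simplified illustration of fixing part of the sequence per step.

```python
import random

VOCAB, NUM_TOKENS = 1024, 16

def student_predict(noise, prefix):
    """Hypothetical distilled network: given the noise input and any
    already-fixed prefix tokens, emit all remaining tokens in one shot.
    A seeded RNG stands in for the learned deterministic mapping."""
    rng = random.Random(hash((tuple(noise), tuple(prefix))) & 0xFFFFFFFF)
    return [rng.randrange(VOCAB) for _ in range(NUM_TOKENS - len(prefix))]

def distilled_decode(noise, num_steps=1):
    """DD-style generation: instead of NUM_TOKENS sequential AR steps,
    call the student `num_steps` times, fixing a chunk of tokens per call.
    num_steps=1 maps noise straight to the full token sequence."""
    tokens = []
    chunk = -(-NUM_TOKENS // num_steps)  # ceiling division
    for _ in range(num_steps):
        remaining = student_predict(noise, tokens)
        tokens.extend(remaining[:chunk])
        if len(tokens) >= NUM_TOKENS:
            break
    return tokens[:NUM_TOKENS]

one_step = distilled_decode([0.1, -0.7], num_steps=1)  # 1 network call
two_step = distilled_decode([0.1, -0.7], num_steps=2)  # 2 network calls
print(len(one_step), len(two_step))
```

The design point this illustrates: because the mapping from noise to output is deterministic, the same noise input always yields the same sequence, and adding intermediate steps trades extra network calls for quality without changing the interface.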

In experiments, DD demonstrates clear advantages over traditional methods. Using VAR-d16 models, DD achieved one-step generation with an FID score increase from 4.19 to 9.96, a modest quality cost for a 6.3x speed-up. For LlamaGen models, reducing the steps from 256 to one yielded an FID score of 11.35, compared to 4.11 for the original model, with a remarkable 217.8x speed improvement. DD showed similar efficiency in text-to-image tasks, reducing generation steps from 256 to two while maintaining a comparable FID score of 28.95 against 25.70. These results underline DD's ability to drastically improve speed without significant loss in image quality, a feat unmatched by baseline methods.

Several key takeaways from the research on DD include:

- DD reduces generation steps by orders of magnitude, achieving up to 217.8x faster generation than traditional AR models.
- Despite the accelerated process, DD maintains acceptable quality levels, with FID score increases remaining within manageable ranges.
- DD demonstrated consistent performance across different AR models, including VAR and LlamaGen, regardless of their token sequence definitions or model sizes.
- The approach allows users to balance quality and speed by choosing one-step, two-step, or multi-step generation paths based on their requirements.
- The method eliminates the need for the original AR model training data, making it feasible for practical applications in scenarios where such data is unavailable.
- Due to its efficient distillation approach, DD can potentially impact other domains, such as text-to-image synthesis, language modeling, and image generation.

In conclusion, with the introduction of Distilled Decoding, researchers have successfully addressed the longstanding speed-quality trade-off that has plagued AR generation processes by leveraging flow matching and deterministic mappings. The method accelerates image synthesis by reducing steps drastically and preserves the outputs’ fidelity and scalability. With its robust performance, adaptability, and practical deployment advantages, Distilled Decoding opens new frontiers in real-time applications of AR models. It sets the stage for further innovation in generative modeling.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


