MarkTechPost@AI June 7, 14:35
ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation
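ByteDance researchers have introduced DetailFlow, a 1D autoregressive image generation framework designed to make image generation faster and more efficient. The framework takes a coarse-to-fine approach, lowering the computational burden by reducing the number of tokens while preserving image quality. On the ImageNet 256x256 benchmark, DetailFlow achieves faster generation and better image quality, marking a step forward for image synthesis research.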

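💡 Autoregressive image generation: DetailFlow builds on autoregressive models which, like language models, generate image tokens one at a time to keep the image structurally coherent and to support image manipulation and multimodal translation.

📉 The token-count challenge: Conventional methods need a large number of tokens to represent high-resolution images, which makes them computationally expensive and slow. DetailFlow aims to solve this by reducing the token count.

🚀 DetailFlow's core mechanism: DetailFlow uses 1D tokenization and orders the token sequence from global structure to fine detail. It builds images by predicting progressively finer detail, which reduces the number of tokens required and gives the sequence a semantic ordering.

✨ Experimental results and advantages: On the ImageNet 256x256 benchmark, DetailFlow achieves higher image quality and faster generation than VAR and FlexVAR while using fewer tokens. A self-correction mechanism further improves output quality.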

Autoregressive image generation grew out of advances in sequential modeling that first appeared in natural language processing. It generates an image one token at a time, much as a language model builds a sentence. The approach is appealing because it maintains structural coherence across the image while allowing a high degree of control during generation. As researchers applied these techniques to visual data, they found that structured prediction not only preserved spatial integrity but also supported tasks such as image manipulation and multimodal translation.

Despite these benefits, generating high-resolution images remains computationally expensive and slow. A primary issue is the number of tokens needed to represent complex visuals. Raster-scan methods that flatten 2D images into linear sequences require thousands of tokens for detailed images, which leads to long inference times and high memory consumption. Models like Infinity need over 10,000 tokens for a 1024×1024 image. That cost becomes unsustainable for real-time applications or when scaling to larger datasets, so reducing the token burden while preserving or improving output quality has become a pressing challenge.
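To make the token burden concrete, here is a minimal sketch of how sequence length grows with resolution for a single flattened grid. The downsampling factor of 16 is an assumption chosen for illustration, not a figure from the article, and the helper names are hypothetical. The per-model counts quoted above come from specific tokenizers and will not match this toy calculation exactly.

```python
def raster_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Tokens in a flattened 2D grid where each token covers a
    downsample x downsample patch (downsample=16 is an assumed value)."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (height // downsample) * (width // downsample)


def attention_pairs(num_tokens: int) -> int:
    """Number of (query, key) pairs under causal self-attention,
    a rough proxy for how cost grows with sequence length."""
    return num_tokens * (num_tokens + 1) // 2


for side in (256, 512, 1024):
    n = raster_token_count(side, side)
    print(f"{side}x{side}: {n} tokens, {attention_pairs(n)} causal attention pairs")
```

Doubling the image side quadruples the token count, and the attention-pair count grows roughly with the square of that, which is why shorter sequences matter so much for inference time.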

Efforts to curb this token inflation have led to innovations such as the next-scale prediction used in VAR and FlexVAR. These models build images by predicting progressively finer scales, mirroring the human tendency to sketch rough outlines before adding detail. However, they still rely on hundreds of tokens: both VAR and FlexVAR use 680 for 256×256 images. Other approaches, such as TiTok and FlexTok, use 1D tokenization to compress spatial redundancy, but they often fail to scale efficiently. For example, FlexTok's gFID (generation FID, where lower is better) rises from 1.9 at 32 tokens to 2.5 at 256 tokens, meaning output quality degrades as the token count grows.
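The 680-token figure can be reproduced with a short calculation. The sketch below assumes the side-length schedule (1, 2, 3, 4, 5, 6, 8, 10, 13, 16) from VAR's public 256×256 configuration. The article itself does not list the scales, so treat the schedule as an assumption.

```python
# Assumed scale schedule: side length of the token map at each scale.
VAR_SCALES_256 = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def multiscale_token_count(scales) -> int:
    """Total tokens when every scale contributes a full s x s token map."""
    return sum(s * s for s in scales)


total = multiscale_token_count(VAR_SCALES_256)
finest_only = VAR_SCALES_256[-1] ** 2
print(f"all scales: {total} tokens, finest scale alone: {finest_only} tokens")
```

Under this assumption the finest 16×16 map accounts for 256 tokens, and the coarser scales add the remaining 424.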

Researchers from ByteDance introduced DetailFlow, a 1D autoregressive image generation framework. This method arranges token sequences from global to fine detail using a process called next-detail prediction. Unlike traditional 2D raster-scan or scale-based techniques, DetailFlow employs a 1D tokenizer trained on progressively degraded images. This design allows the model to prioritize foundational image structures before refining visual details. By mapping tokens directly to resolution levels, DetailFlow significantly reduces token requirements, enabling images to be generated in a semantically ordered, coarse-to-fine manner.
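As a rough illustration of tying token count to resolution, the sketch below uses a simple linear mapping snapped to a multiple of 16. The function name, the linear form, and every default value are hypothetical placeholders. This is not the mapping used in the paper, only a monotone stand-in that shows the idea that a longer token prefix corresponds to a sharper target image.

```python
def target_resolution(num_tokens: int, max_tokens: int = 128,
                      min_res: int = 32, max_res: int = 256,
                      step: int = 16) -> int:
    """Hypothetical mapping from a token-prefix length to the image
    resolution that prefix is expected to reconstruct."""
    if not 1 <= num_tokens <= max_tokens:
        raise ValueError("num_tokens must be between 1 and max_tokens")
    frac = num_tokens / max_tokens
    raw = min_res + (max_res - min_res) * frac
    # Snap down to a multiple of `step`, never going below min_res.
    return max(min_res, int(raw // step) * step)


for n in (1, 32, 64, 128):
    print(f"first {n:3d} tokens -> {target_resolution(n)}px target")
```

During tokenizer training, a mapping of this kind would decide how strongly the target image is degraded for a given prefix length, so that early tokens are pushed toward global structure.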

The mechanism in DetailFlow centers on a 1D latent space in which each token contributes incrementally more detail. Earlier tokens encode global features, while later tokens refine specific visual aspects. To train it, the researchers created a resolution mapping function that links token count to target resolution. During training, the model is exposed to images of varying quality levels and learns to predict progressively higher-resolution outputs as more tokens are introduced. DetailFlow also supports parallel token prediction by grouping the sequence and predicting each group of tokens at once. Because parallel prediction can introduce sampling errors, the researchers added a self-correction mechanism: it perturbs certain tokens during training and teaches subsequent tokens to compensate, so that final images maintain structural and visual integrity.
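The grouping and perturbation steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' training code: the function names, group size, vocabulary size, and perturbation probability are all hypothetical. It covers only how a sequence might be split into groups and how tokens might be corrupted. How the targets for later tokens are recomputed to compensate is not described in the article and is left out.

```python
import random


def split_into_groups(tokens, group_size):
    """Split a 1D token sequence into consecutive groups; in parallel
    decoding, each group would be predicted in a single step."""
    return [tokens[i:i + group_size] for i in range(0, len(tokens), group_size)]


def perturb_tokens(tokens, vocab_size, perturb_prob, rng):
    """Return a noisy copy of `tokens` plus a mask of changed positions.
    Each token is swapped for a different random id with probability
    `perturb_prob`, imitating sampling errors seen at inference time."""
    noisy, changed = [], []
    for t in tokens:
        if rng.random() < perturb_prob:
            r = rng.randrange(vocab_size - 1)
            if r >= t:
                r += 1  # skip the original id so the token really changes
            noisy.append(r)
            changed.append(True)
        else:
            noisy.append(t)
            changed.append(False)
    return noisy, changed


rng = random.Random(0)
sequence = [rng.randrange(4096) for _ in range(16)]
groups = split_into_groups(sequence, group_size=4)
noisy_first, mask = perturb_tokens(groups[0], vocab_size=4096,
                                   perturb_prob=0.5, rng=rng)
print("clean first group:", groups[0])
print("noisy first group:", noisy_first, "changed:", mask)
```

In a training loop built on this idea, the later groups would be conditioned on the noisy group rather than the clean one, so the model learns to repair earlier mistakes instead of propagating them.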

On the ImageNet 256×256 benchmark, DetailFlow achieved a gFID of 2.96 using only 128 tokens, outperforming VAR (3.3) and FlexVAR (3.05), both of which used 680 tokens. DetailFlow-64 went further, reaching a gFID of 2.62 with 512 tokens. In terms of speed, DetailFlow delivered nearly double the inference rate of VAR and FlexVAR. An ablation study confirmed that self-correction training and the semantic ordering of tokens substantially improved output quality; for example, enabling self-correction lowered the gFID from 4.11 to 3.68 in one setting. Taken together, these results show both higher quality and faster generation than established models.
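A quick calculation puts these figures in proportion. The numbers are the ones reported in the text; the helper names are hypothetical.

```python
def token_ratio(baseline_tokens: int, model_tokens: int) -> float:
    """How many times fewer tokens the model uses than the baseline."""
    return baseline_tokens / model_tokens


def relative_drop(before: float, after: float) -> float:
    """Relative reduction of a lower-is-better metric such as gFID."""
    return (before - after) / before


print(f"128 vs 680 tokens: {token_ratio(680, 128):.2f}x fewer tokens")
print(f"self-correction gFID 4.11 -> 3.68: {relative_drop(4.11, 3.68):.1%} lower")
print(f"gFID vs VAR, 3.3 -> 2.96: {relative_drop(3.3, 2.96):.1%} lower")
```

The 128-token model uses a little over five times fewer tokens than VAR and FlexVAR, and the self-correction ablation corresponds to a gFID reduction of roughly ten percent.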

By focusing on semantic structure and reducing redundancy, DetailFlow presents a viable solution to long-standing issues in autoregressive image generation. The method’s coarse-to-fine approach, efficient parallel decoding, and ability to self-correct highlight how architectural innovations can address performance and scalability limitations. Through their structured use of 1D tokens, the researchers from ByteDance have demonstrated a model that maintains high image fidelity while significantly reducing computational load, making it a valuable addition to image synthesis research.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation appeared first on MarkTechPost.
