MarkTechPost@AI 2024年11月26日
Researchers from NVIDIA and MIT Present SANA: An Efficient High-Resolution Image Synthesis Pipeline that Could Generate 4K Images from a Laptop

Researchers from NVIDIA and MIT have developed SANA, an efficient text-to-image generation framework that can rapidly synthesize high-quality images at resolutions up to 4096×4096. SANA generates high-quality images with only 590M parameters, requires no large servers, and can even run on a laptop GPU. Through an improved autoencoder, a linear DiT and decoder, and novel text-encoder and training strategies, the framework cuts training and inference costs while surpassing existing models such as PixArt-Σ in both image quality and generation speed. SANA also adopts Flow-DPM-Solver, which reduces inference to 14-20 sampling steps while improving performance, and applies edge-deployment optimizations that significantly boost image generation efficiency.

🎨 **Efficient autoencoder:** SANA raises the autoencoder's compression ratio to 32, reducing latent token consumption while preserving image reconstruction quality, which cuts resource use and redundant information during image generation.

🚀 **Improved DiT:** SANA adopts linear attention and Mix-FFNs in the DiT, reducing complexity from O(N²) to O(N) and improving token aggregation and model efficiency.

📚 **Gemma text encoder:** SANA uses the small decoder-only model Gemma as its text encoder; its stronger instruction following, reasoning, and in-context learning outperform large encoder models such as T5.

🖼️ **Multi-caption auto-labeling and CLIP-score sampling:** SANA uses multiple vision-language models to generate several captions per training image, then applies a CLIP-score-based sampling strategy to select high-quality text for training, improving text-image consistency.

💡 **Flow-DPM-Solver inference optimization:** SANA proposes Flow-DPM-Solver, which uses a rectified-flow formulation to lower the signal-to-noise ratio and predicts the velocity field, reducing inference to 14-20 sampling steps while improving performance.

💻 **Edge deployment optimization:** Through quantization and a mixed-precision strategy, SANA runs efficiently on a laptop with a 2.4× speedup, bringing high-resolution image generation to more users.

Diffusion models have pulled ahead of other approaches in text-to-image generation. With continuous research over the past year, we can now generate high-resolution, realistic images that are nearly indistinguishable from authentic photographs. However, as these hyperrealistic models improve, their parameter counts keep escalating, and this trend drives up training and inference costs. Ever-increasing computational expense and model complexity push image models further from consumers' reach. What is needed is a high-quality, high-resolution image generator that is computationally efficient and runs fast on both cloud and edge devices.

Researchers from NVIDIA and MIT have created SANA, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. SANA synthesizes high-resolution, high-quality images with strong text-image alignment remarkably fast: SANA-0.6B needs just 590M parameters to generate quality images. The model does not require massive servers; it can be deployed even on a laptop GPU. SANA outperforms its competitors in both quality and serving time, beating PixArt-Σ, which generates images at 3840×2160 at a comparatively slow rate. SANA mitigates training and inference costs with an improved autoencoder, a linear DiT, and a small decoder-only LLM, Gemma, as the text encoder. The authors further propose automatic labeling and training strategies to improve consistency between text and images: multiple VLMs generate candidate captions, and a CLIP-score-based strategy dynamically samples high-scoring captions with higher probability. Finally, a Flow-DPM-Solver reduces inference from 28-50 sampling steps to 14-20 while outperforming current solvers.

To understand this paper, we must look at the innovations sequentially:

Efficient autoencoders: The authors increased the autoencoder's compression ratio from the previously used 8 to 32, reducing latent token consumption by 4×. High-resolution images contain high redundancy, so the higher compression ratio does not hurt reconstruction quality. This redundancy is more of a bane in image generation: besides consuming resources, it leads to substandard image quality.
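The token savings can be checked with simple arithmetic. A minimal sketch, assuming the common baseline of an F=8 autoencoder paired with patch size 2 in the DiT versus SANA's F=32 autoencoder with patch size 1 (the exact patch sizes are an assumption consistent with the 4× figure above):

```python
def num_latent_tokens(height, width, ae_factor, patch_size):
    """Tokens the DiT processes: spatial dims shrink by ae_factor * patch_size."""
    eff = ae_factor * patch_size
    assert height % eff == 0 and width % eff == 0
    return (height // eff) * (width // eff)

# Prior setup: F=8 autoencoder + patch size 2 (effective 16x downsampling).
prev = num_latent_tokens(4096, 4096, ae_factor=8, patch_size=2)
# SANA setup: F=32 autoencoder + patch size 1 (effective 32x downsampling).
sana = num_latent_tokens(4096, 4096, ae_factor=32, patch_size=1)
print(prev, sana, prev // sana)  # 65536 16384 4
```

Since attention cost grows quadratically in the token count, a 4× token reduction compounds into much larger savings downstream.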

A better DiT: Next in the framework, the authors replace the vanilla self-attention mechanism with linear attention blocks in the DiT (Diffusion Transformer), decreasing the complexity from O(N²) to O(N). They also replace the original MLP feed-forward networks with Mix-FFNs that incorporate a 3×3 depthwise convolution, leading to better token aggregation.
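The O(N) trick is reordering the matrix products: with a non-negative feature map in place of softmax, attention becomes φ(Q)(φ(K)ᵀV), where φ(K)ᵀV is a small d×d matrix independent of sequence length. A minimal NumPy sketch with a ReLU feature map (one common choice for linear attention; the exact kernel in SANA's blocks is not reproduced here):

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """O(N) attention: ReLU feature map, then reorder the matmuls.

    Instead of materializing the N x N matrix softmax(q @ k.T), compute
    phi(k).T @ v once (d x d) and multiply by phi(q): cost O(N * d^2).
    """
    q, k = np.maximum(q, 0), np.maximum(k, 0)      # ReLU feature map
    kv = k.T @ v                                    # (d, d), independent of N
    z = q @ k.sum(axis=0, keepdims=True).T          # (N, 1) normalizer
    return (q @ kv) / (z + eps)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((1024, 64)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # (1024, 64)
```

The reordered product is numerically identical to first forming the full φ(Q)φ(K)ᵀ attention matrix and normalizing its rows; only the cost changes, from O(N²d) to O(Nd²).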

Triton acceleration: The authors use Triton to speed up training and inference by fusing the forward and backward passes of the linear attention blocks. Fusing activation functions, precision conversions, padding operations, and divisions into the matrix multiplications reduces data-transfer overhead.

Text-encoder design: The authors use Gemma-2, a small decoder-only large language model. Despite its compact architecture, its stronger instruction following, chain-of-thought reasoning, and in-context learning deliver better performance than huge encoder-based models like T5.

Multi-caption auto-labelling and CLIP-score-based caption sampler: The authors use four vision-language models to label each training image. Multiple captions per image increase the accuracy and diversity of the labels. A CLIP-score-based sampler then selects high-quality captions with greater probability.
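One way to realize "sample high-quality text with greater probability" is a temperature-scaled softmax over the CLIP scores. A sketch, with the softmax rule and temperature value as assumptions rather than the paper's exact recipe:

```python
import math
import random

def sample_caption(captions, clip_scores, temperature=0.5, rng=random):
    """Pick one caption per draw, favoring higher CLIP scores.

    Probabilities come from a temperature-scaled softmax over the scores;
    a lower temperature concentrates mass on the best-aligned caption.
    """
    m = max(clip_scores)  # subtract max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in clip_scores]
    return rng.choices(captions, weights=weights, k=1)[0]

# Hypothetical captions from several VLMs with their CLIP scores.
captions = ["a tabby cat sleeping on a mat", "a photo of a cat", "animal"]
scores = [0.31, 0.28, 0.12]
print(sample_caption(captions, scores))
```

Sampling (rather than always taking the argmax) keeps caption diversity while still biasing training toward well-aligned text.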

Flow-based training and inference: SANA proposes Flow-DPM-Solver, a modification of DPM-Solver++ that uses the rectified-flow formulation to achieve a lower signal-to-noise ratio. Unlike DPM-Solver++, it also predicts the velocity field. Consequently, Flow-DPM-Solver converges in 14-20 steps with better performance.
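In velocity-prediction sampling, the model outputs dx/dt and the sampler integrates the ODE from noise (t=1) back to data (t=0). A toy sketch using a plain Euler integrator as a stand-in for Flow-DPM-Solver (the real solver reuses DPM-Solver++ machinery; the analytic "model" below is purely illustrative):

```python
import numpy as np

def euler_flow_sampler(velocity_fn, x1, num_steps=14):
    """Integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 (data)."""
    x = x1
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t) * velocity_fn(x, t)  # dt is negative going backward
    return x

# Toy "model": for the straight-line flow x_t = (1 - t) * x0 + t * z,
# the true velocity is v = z - x0, constant in t, so Euler is exact here.
x0_true = np.array([2.0, -1.0])
z = np.array([0.3, 0.7])          # the "noise" sample at t = 1
velocity = lambda x, t: z - x0_true
x0_rec = euler_flow_sampler(velocity, x1=z, num_steps=14)
print(x0_rec)  # recovers x0_true exactly for this linear flow
```

Real diffusion trajectories are curved, which is why a higher-order solver like Flow-DPM-Solver reaches good samples in far fewer steps than naive Euler would on an actual model.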

Edge deployment: SANA is quantized with per-token symmetric 8-bit integers for activations and weights. Moreover, to preserve high semantic similarity to the 16-bit variant while incurring minimal runtime overhead, the authors keep certain sensitive layers at full precision. This deployment optimization yields a 2.4× speedup on a laptop.
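Per-token symmetric int8 quantization assigns each token (row) one scale, `max(|x|) / 127`, then rounds to int8. A minimal NumPy sketch of the scheme described above, not SANA's actual kernels:

```python
import numpy as np

def quantize_per_token_int8(x):
    """Symmetric 8-bit quantization with one scale per token (row)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
q, s = quantize_per_token_int8(x)
err = np.abs(dequantize(q, s) - x).max()
print(q.dtype, err)  # int8 with small per-row reconstruction error
```

Per-token (rather than per-tensor) scales matter because activation magnitudes vary sharply between tokens; a single global scale would crush small-magnitude rows to zero.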

To sum up, SANA's framework introduces many innovations that reach new heights in image generation, producing 4K images with throughput up to 100× better than the state of the art. A further challenge will be seeing how SANA can be adapted to the video paradigm.


Check out the Paper, GitHub Page, and Demo. All credit for this research goes to the researchers of this project.



