MarkTechPost@AI July 21, 2024
DiT-MoE: A New Version of the DiT Architecture for Image Generation

DiT-MoE is a new version of the DiT architecture for image generation that replaces some of DiT's dense feed-forward layers with sparse mixture-of-experts (MoE) layers, enabling parameter-efficient training of MoE diffusion models. DiT-MoE performs strongly across metrics: on ImageNet 256×256 it achieves an FID of 1.72, outperforming models built on all other architectures.

🤔 DiT-MoE is a new DiT architecture that enables parameter-efficient training of MoE diffusion models by replacing some of DiT's dense feed-forward layers with sparse MoE layers. Its core idea is sparse mixture-of-experts (MoE) computation: capacity is split across expert sub-networks and each input token is processed by only a small subset of them, so the compute and active parameters per example stay low while performance is preserved. DiT-MoE also introduces shared experts and an expert load-balancing loss to improve efficiency and stability: shared experts capture knowledge common to different inputs, while the balancing loss reduces redundancy among the routed experts.

🤖 DiT-MoE delivers strong results on image generation: on the ImageNet 256×256 dataset it reaches an FID of 1.72, outperforming models built on all other architectures. This advantage comes from the sparse MoE architecture and an efficient training recipe: sparse conditional computation lets large diffusion Transformers be trained and run with fewer computational resources while maintaining quality, and simple designs exploit input-dependent sparsity to further improve efficiency.

📈 The results open a new avenue for exploring conditional computation in large diffusion models and show that MoE techniques hold considerable potential for improving the efficiency and performance of diffusion models. Future directions include training more stable and faster heterogeneous expert architectures and improving knowledge distillation.

🚀 DiT-MoE represents a fresh breakthrough for image generation and should help drive the field toward more efficient, higher-performing generative models.

🎯 The results demonstrate the effectiveness of sparse MoE architectures in diffusion models and offer a new direction for future model design.

🌟 The work also points toward further progress in image generation and provides a valuable reference for research in other domains.

Recently, diffusion models have become powerful tools in fields such as image and 3D object generation. Their success comes from their ability to handle denoising tasks with different types of noise, efficiently turning random noise into the target data distribution through repeated denoising steps. With Transformer-based backbones, adding more parameters has generally been shown to improve performance. However, training and running these models is costly, because the networks are dense: every example uses all parameters, so computational cost grows quickly as the models scale up.
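
To make the "repeated denoising steps" concrete, here is a minimal, generic sampling sketch (a deterministic DDIM-style loop, not the paper's sampler); `model`, `alphas_cumprod`, and `shape` are placeholder names, and `model(x, t)` is assumed to predict the added noise.

```python
import torch

@torch.no_grad()
def denoise_loop(model, alphas_cumprod, shape, device="cuda"):
    """Start from pure Gaussian noise and iteratively denoise toward data."""
    x = torch.randn(shape, device=device)                    # pure noise
    T = alphas_cumprod.shape[0]
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                               # predicted noise at step t
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.ones((), device=device)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()        # estimate the clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps    # step from t to t-1
    return x
```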

Conditional computation is a promising scaling technique that aims to increase model capacity while keeping training and inference costs roughly constant by activating only a subset of parameters for each example. A Mixture of Experts (MoE) combines the outputs of sub-models, or experts, through an input-dependent router, and has been used successfully in many fields. In NLP, top-k gating was introduced for LSTMs, together with auxiliary losses that keep the experts balanced. For diffusion models, prior MoE work has used multiple expert models, each specializing in a specific range of timesteps.
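
The sketch below illustrates the general top-k gating idea with an auxiliary load-balancing term; the class name, loss formulation, and hyperparameters are illustrative, in the spirit of the gating described above rather than any specific paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Top-k gating: each token is routed to its k highest-scoring experts,
    and an auxiliary loss encourages balanced expert usage."""

    def __init__(self, dim, num_experts, k=2):
        super().__init__()
        self.w_gate = nn.Linear(dim, num_experts, bias=False)
        self.num_experts, self.k = num_experts, k

    def forward(self, x):                          # x: (num_tokens, dim)
        probs = self.w_gate(x).softmax(dim=-1)     # routing probabilities
        weights, idx = probs.topk(self.k, dim=-1)  # each token picks k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)

        # Auxiliary balance loss: penalize mismatch between how often an
        # expert is chosen (load) and its average routing probability
        # (importance), pushing the router to use all experts.
        importance = probs.mean(dim=0)                                     # (num_experts,)
        load = F.one_hot(idx, self.num_experts).float().mean(dim=(0, 1))   # (num_experts,)
        aux_loss = self.num_experts * (importance * load).sum()
        return weights, idx, aux_loss
```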

Researchers from Kunlun Inc., Beijing, China, have proposed DiT-MoE, a new version of the DiT architecture for image generation. DiT-MoE replaces some of the dense feed-forward layers in DiT with sparse MoE layers, in which each image token is routed to a small subset of experts implemented as MLPs. The architecture includes two main designs: shared experts that capture common knowledge across inputs, and an expert load-balancing loss that reduces redundancy among the routed experts. The paper analyzes in detail how these choices make it possible to train a parameter-efficient MoE diffusion model, and observes interesting patterns in expert routing from several perspectives (a sketch of such a layer follows below).
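
Here is a self-contained sketch of what such an MoE feed-forward block could look like. Layer sizes, expert counts, and names are assumptions, not the paper's configuration; the shared experts run on every token, while routed experts run only on the tokens assigned to them, and the balancing loss would be computed from the routing probabilities as in the previous sketch.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Illustrative MoE feed-forward block with shared and routed experts."""

    def __init__(self, dim=1152, hidden_dim=4608, num_routed=8, num_shared=2, k=2):
        super().__init__()
        make_mlp = lambda: nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
        self.routed_experts = nn.ModuleList(make_mlp() for _ in range(num_routed))
        self.shared_experts = nn.ModuleList(make_mlp() for _ in range(num_shared))
        self.gate = nn.Linear(dim, num_routed, bias=False)
        self.k = k

    def forward(self, x):                                    # x: (num_tokens, dim)
        probs = self.gate(x).softmax(dim=-1)                 # routing probabilities
        weights, idx = probs.topk(self.k, dim=-1)            # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)

        shared_out = sum(e(x) for e in self.shared_experts)  # dense shared path
        routed_out = torch.zeros_like(x)
        for slot in range(self.k):                           # sparse routed path
            for e_id, expert in enumerate(self.routed_experts):
                mask = idx[:, slot] == e_id                  # tokens sent to this expert
                if mask.any():
                    routed_out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return shared_out + routed_out
```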

The AdamW optimizer is used without weight decay across all datasets, with a constant learning rate. An exponential moving average (EMA) of the DiT-MoE weights is maintained during training with a decay rate of 0.9999, and the reported results use this EMA model. The proposed models are trained on Nvidia A100 GPUs using the ImageNet dataset at various resolutions. Classifier-free guidance is also employed, along with a pre-trained variational autoencoder from Stable Diffusion hosted on Hugging Face. Image-generation quality is evaluated with the Fréchet Inception Distance (FID), a common metric for assessing the quality of generated images.
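
A minimal sketch of that optimizer and EMA setup is shown below. The AdamW settings and the 0.9999 decay come from the text; the learning-rate value and function names are placeholders.

```python
import copy
import torch

def make_optimizer_and_ema(model, lr=1e-4):
    """AdamW without weight decay, constant learning rate, plus an EMA copy
    of the weights used for evaluation (lr value is a placeholder)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)
    ema_model = copy.deepcopy(model).eval()
    for p in ema_model.parameters():
        p.requires_grad_(False)
    return opt, ema_model

@torch.no_grad()
def update_ema(ema_model, model, decay=0.9999):
    """EMA update applied after every optimizer step."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```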

The evaluation results on conditional image generation across various metrics show excellent performance compared to dense competitors. On the class-conditional ImageNet 256×256 dataset, the DiT-MoE model achieves an FID score of 1.72, outperforming all previous models with different architectures. Moreover, DiT-MoE uses only 1.5 billion parameters and significantly outperforms Transformer-based competitors like Large-DiT-3B, Large-DiT-7B, and LlamaGen-3B. This shows the potential of MoE in diffusion models. Similar improvements are seen in almost all evaluation metrics on the class-conditional ImageNet 512×512 dataset.

In summary, the researchers have developed DiT-MoE, an updated version of the DiT architecture for image generation that replaces some of the dense feed-forward layers in DiT with sparse MoE layers. The method uses sparse conditional computation to train large diffusion Transformer models, leading to efficient inference and significant improvements in image-generation tasks, and relies on simple designs to exploit input-dependent model sparsity. This work is a first step in exploring large-scale conditional computation for diffusion models; future directions include training more stable and faster heterogeneous expert architectures and improving knowledge distillation.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.


