Last Updated on June 3, 2024 by Editorial Team
Author(s): Suhaib Arshad
Originally published on Towards AI.

Background Knowledge

Essentially, there are three common types of generative models: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and flow-based models. Although they have earned their place as high-quality image-generating models, each falls short in a few respects. GANs suffer from unstable training, which makes them harder to converge, and they also lack diversity in their outputs. VAEs face a significant tradeoff between how accurately they can reconstruct the input data and how well they can organize the hidden (latent) space. Flow-based models require high computational power because they must use reversible (invertible) steps to generate data.

fig 1: Comparing diffusion models with other generative models (GANs and VAE) based on KPIs (source: owned)

How do diffusion models come into play?

Diffusion models originated as a response to these challenges. In deep learning, they have overtaken state-of-the-art generative frameworks such as GANs and VAEs for many image-generation tasks, showcasing the following capabilities:

They outperform GANs in producing high-quality, high-fidelity image samples.
Their training process is far more stable than that of GANs.
They build on a probabilistic framework similar to VAEs, ensuring a structured latent space.

fig 2: Architecture comparison between Diffusion and other generative image models (source: Diffusion Model Clearly Explained! | by Steins | Medium)

Real World Applications of Diffusion models

Alongside ChatGPT, there has been a skyrocketing pace of development in generative AI, and diffusion models are one of its biggest byproducts. Many published papers have demonstrated the extraordinary capability of diffusion models, such as diffusion models beating GANs on image synthesis. Some of the most popular diffusion models are DALL-E 3 by OpenAI, Stable Diffusion 3 by Stability AI, and Midjourney. DALL-E 3 is an image generation model by OpenAI (the images below were generated using DALL-E 3).

fig 3: Image generated by DALL-E 3 based on prompt (source: owned)
fig 4: Image generated by DALL-E 3 based on prompt (source: owned)
fig 5: Image generated by DALL-E 3 based on prompt (source: owned)

With the current wave of success of diffusion models, many AI enthusiasts may be interested in understanding the diffusion algorithm under the hood. Here are the top three use cases where diffusion models can be used:

Image Generation: Diffusion models have a proven ability to generate high-quality images from random noise by iteratively refining them.
Resolution Enhancer: These models can take a low-resolution image and turn it into a higher-resolution one.
Image Inpainting: An interesting use case where a diffusion model fills in (in-paints) the part of an image we want to change or remove.

What Exactly are Diffusion Models?

The concept of the diffusion model is a little over nine years old. In 2015, the paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" [1] was published. The name "diffusion" comes from how particles change state, or "diffuse", when in contact with a medium, similar to how molecules behave when placed in different environments: a freezing space (solid state) or a boiling space (gaseous state).
Similar to non-equilibrium statistical physics, the idea behind diffusion models in deep learning is to slowly and iteratively destroy the structure of a data distribution through a forward diffusion process, and then gradually learn to reverse that diffusion process to restore the structure in the data, resulting in a high-diversity, high-quality image-generating model. Diffusion models go through a series of sequential steps that slowly add random noise to the data/image, and then learn to reverse the diffusion process to recover the desired data/image samples from the noise.

fig 6: Addition of noise to image progressively at each timestep (forward diffusion) (source: owned)
fig 7: Gif of forward diffusion process (source: owned)
fig 8: Removal of noise from image iteratively at each timestep (reverse diffusion) (source: owned)
fig 9: Gif of reverse diffusion process (source: owned)

Forward diffusion process

Going beyond the definition, let me explain how the forward process works in simple words. Basically, every image follows some non-random distribution, which we don't know. The core idea is to destroy that distribution by sequentially adding noise, until we end up with pure noise.

(a) Original Image (source: owned)
(b) Pure noise (source: owned)

How do we add noise to the image? We use something called a variance scheduler to add noise to the data. The scheduler controls the rate at which noise is added at each time step. Its values can range from 0 to 1, and they are usually kept small to prevent the variance from exploding. A 2020 paper from UC Berkeley [2] uses a linear schedule, which produces an output like this:

fig 10: linear schedule with 1000 time steps. (source: owned)

However, a 2021 paper by OpenAI [3] found that using a linear schedule is inefficient: as you can see in the figure above, a vast majority of the information is lost halfway through the total steps. So they came up with their own schedule, called a cosine schedule (Fig. 11). This change in how the noise is scheduled helped reduce the number of steps down to 50.

fig 11: Cosine schedule with 50 time steps. (source: owned)

Reverse diffusion process

The whole objective of the reverse diffusion process is to retrieve images back from pure noise (also called denoising). For that, we use neural networks. As with GANs, we train a neural network to generate samples, much like the generator in a GAN model. The distinction is that instead of doing all the work in a single step, as GAN networks do, we split the process into multiple timesteps and remove noise gradually. Although this approach takes more time, it is more efficient and easier to train.

fig 12: The directed graphical model. (source: Deep Unsupervised Learning using Nonequilibrium Thermodynamics, arXiv:1503.03585, 2015.)

A common misconception related […]
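To make the forward process described above concrete, here is a minimal NumPy sketch (not from the original article) of the closed-form noising step, together with the linear variance schedule used in the 2020 DDPM paper [2] and the cosine schedule from the 2021 OpenAI paper [3]. The function names and default values (beta_start=1e-4, beta_end=0.02, s=0.008) are illustrative choices taken from those papers, not from this article:

```python
import numpy as np


def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    # Linear variance schedule, as used in the 2020 DDPM paper [2].
    return np.linspace(beta_start, beta_end, timesteps)


def cosine_beta_schedule(timesteps, s=0.008):
    # Cosine schedule proposed in the 2021 OpenAI paper [3]:
    # alpha_bar(t) follows a squared cosine, and betas come from its successive ratios.
    steps = np.arange(timesteps + 1)
    alphas_cumprod = np.cos(((steps / timesteps) + s) / (1 + s) * np.pi / 2) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1.0 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return np.clip(betas, 0.0, 0.999)


def forward_diffuse(x0, t, betas, rng=None):
    # Closed-form sample from q(x_t | x_0):
    #   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    rng = rng or np.random.default_rng(0)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)           # cumulative product of (1 - beta) up to step t
    noise = rng.standard_normal(x0.shape)    # Gaussian noise with the same shape as the image
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


# Example: noise a dummy 32x32 "image" halfway through a 1000-step linear schedule.
x0 = np.random.default_rng(1).uniform(-1.0, 1.0, size=(32, 32))
betas = linear_beta_schedule(timesteps=1000)
x_500 = forward_diffuse(x0, t=500, betas=betas)
print(x_500.shape)  # (32, 32) -- by this point most of the original structure is gone
```

In the reverse process, a trained neural network would then predict and remove this noise one timestep at a time, which is exactly the gradual denoising described in the section above.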