Diffusion Model
An AI model that generates images or other media by learning to reverse a gradual noise-adding process.
Diffusion models are the technology behind most modern AI image generators — including Stable Diffusion, DALL-E 3, Midjourney, and Flux. They are built on two processes: a fixed forward process that gradually adds random noise to training images until they're pure static, and a learned reverse process that undoes this corruption — denoising a noisy image step by step until a clean, coherent image emerges.
At generation time, the model starts with pure random noise and progressively denoises it, guided by a text prompt or other conditioning signal. Each denoising step slightly refines the image. After typically 20–50 steps, a photorealistic or artistic image matching the prompt emerges.
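The two processes above can be sketched numerically. This is a minimal, illustrative NumPy sketch (not a real trained model): the forward step uses the standard closed-form noising formula, and the reverse loop runs a deterministic DDIM-style sampler. Since no network exists here, the "predicted" noise is just the true noise we added — a stand-in for what a trained U-Net would output.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)      # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative signal retention per step

# Forward process: q(x_t | x_0) in closed form — mix clean data with noise.
x0 = np.sin(np.linspace(0, 2 * np.pi, 64))   # a clean 1-D "image"
eps = rng.standard_normal(64)                # the noise we inject
xT = np.sqrt(alpha_bars[-1]) * x0 + np.sqrt(1 - alpha_bars[-1]) * eps

# Reverse process: step back from (near) pure noise to a clean sample.
x = xT
for t in reversed(range(T)):
    eps_hat = eps  # stand-in: a trained U-Net would predict this from (x, t)
    # Estimate the clean image implied by the current sample and noise guess.
    x0_hat = (x - np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    if t > 0:
        # Deterministic (DDIM-style) move to the previous noise level.
        x = np.sqrt(alpha_bars[t - 1]) * x0_hat + np.sqrt(1 - alpha_bars[t - 1]) * eps_hat
    else:
        x = x0_hat  # final step: output the clean estimate
```

With a perfect noise prediction the loop recovers the original signal exactly; in a real model, `eps_hat` is only approximate, so each step makes a small correction and many steps are needed.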
Diffusion Model Architecture
- U-Net / DiT — the denoising network that predicts noise at each step
- VAE — encodes images to latent space for efficiency (Latent Diffusion)
- CLIP encoder — converts text prompts to conditioning vectors
- Scheduler — controls the denoising process (DDPM, DDIM, DPM++)
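The way these four components fit together can be shown as a data-flow sketch. Everything below is an illustrative stub (random projections and toy sizes, no trained weights) — the function names and shapes are assumptions, chosen to mirror the structure of a latent diffusion pipeline like Stable Diffusion, not any real library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_encode(prompt: str) -> np.ndarray:
    """Text encoder: prompt -> conditioning vectors (stub: seeded random)."""
    g = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return g.standard_normal((77, 8))            # (tokens, dim), toy sizes

def unet(latents, t, cond):
    """Denoising network: predicts the noise in `latents` at step t (stub)."""
    return 0.1 * latents + 0.01 * cond.mean()    # placeholder prediction

def scheduler_step(latents, noise_pred, t, T):
    """Scheduler: one denoising update (stub: simple linear removal)."""
    return latents - noise_pred / T

def vae_decode(latents):
    """VAE decoder: latents -> pixels (stub: upsample by repetition)."""
    return np.repeat(np.repeat(latents, 8, axis=0), 8, axis=1)

# Pipeline: noise -> conditioned denoising loop -> decoded image.
cond = clip_encode("a photo of an astronaut riding a horse")
latents = rng.standard_normal((8, 8))            # start from pure noise, in latent space
T = 25
for t in reversed(range(T)):
    noise_pred = unet(latents, t, cond)
    latents = scheduler_step(latents, noise_pred, t, T)
image = vae_decode(latents)                      # 8x8 latent -> 64x64 "pixels"
```

The key design point this illustrates is why the VAE matters: the denoising loop runs in a small latent space (here 8×8) rather than full pixel resolution, which is what makes latent diffusion efficient.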
Diffusion models have largely displaced GANs for high-quality image generation due to their training stability and output quality. They've been extended to text-to-video, 3D generation, audio synthesis, and protein structure prediction. The main limitation is generation speed — each sample requires multiple forward passes — though distillation techniques have reduced this to 1–4 steps.