Diffusion Model
An AI model that generates images or other media by learning to reverse a gradual noise-adding process.
Diffusion models are the technology behind most modern AI image generators — including Stable Diffusion, DALL-E 3, Midjourney, and Flux. They are built on two processes: a fixed forward process that gradually adds random noise to training images until they're pure static, and a learned reverse process that undoes this corruption — denoising a noisy image step by step until a clean, coherent image emerges.
At generation time, the model starts with pure random noise and progressively denoises it, guided by a text prompt or other conditioning signal. Each denoising step slightly refines the image. After typically 20–50 steps, a photorealistic or artistic image matching the prompt emerges.
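The two processes above can be sketched numerically. This is a minimal, illustrative NumPy sketch (not a real trained model): the forward step uses the standard closed-form noising formula, and the reverse loop runs a deterministic DDIM-style sampler. Since no network exists here, the "predicted" noise is just the true noise we added — a stand-in for what a trained U-Net would output.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)      # noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative signal retention per step

# Forward process: q(x_t | x_0) in closed form — mix clean data with noise.
x0 = np.sin(np.linspace(0, 2 * np.pi, 64))   # a clean 1-D "image"
eps = rng.standard_normal(64)                # the noise we inject
xT = np.sqrt(alpha_bars[-1]) * x0 + np.sqrt(1 - alpha_bars[-1]) * eps

# Reverse process: step back from (near) pure noise to a clean sample.
x = xT
for t in reversed(range(T)):
    eps_hat = eps  # stand-in: a trained U-Net would predict this from (x, t)
    # Estimate the clean image implied by the current sample and noise guess.
    x0_hat = (x - np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    if t > 0:
        # Deterministic (DDIM-style) move to the previous noise level.
        x = np.sqrt(alpha_bars[t - 1]) * x0_hat + np.sqrt(1 - alpha_bars[t - 1]) * eps_hat
    else:
        x = x0_hat  # final step: output the clean estimate
```

With a perfect noise prediction the loop recovers the original signal exactly; in a real model, `eps_hat` is only approximate, so each step makes a small correction and many steps are needed.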
Diffusion Model Architecture
- U-Net / DiT — the denoising network that predicts noise at each step
- VAE — encodes images to latent space for efficiency (Latent Diffusion)
- CLIP encoder — converts text prompts to conditioning vectors
- Scheduler — controls the denoising process (DDPM, DDIM, DPM++)
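The way these four components fit together can be shown as a data-flow sketch. Everything below is an illustrative stub (random projections and toy sizes, no trained weights) — the function names and shapes are assumptions, chosen to mirror the structure of a latent diffusion pipeline like Stable Diffusion, not any real library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_encode(prompt: str) -> np.ndarray:
    """Text encoder: prompt -> conditioning vectors (stub: seeded random)."""
    g = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return g.standard_normal((77, 8))            # (tokens, dim), toy sizes

def unet(latents, t, cond):
    """Denoising network: predicts the noise in `latents` at step t (stub)."""
    return 0.1 * latents + 0.01 * cond.mean()    # placeholder prediction

def scheduler_step(latents, noise_pred, t, T):
    """Scheduler: one denoising update (stub: simple linear removal)."""
    return latents - noise_pred / T

def vae_decode(latents):
    """VAE decoder: latents -> pixels (stub: upsample by repetition)."""
    return np.repeat(np.repeat(latents, 8, axis=0), 8, axis=1)

# Pipeline: noise -> conditioned denoising loop -> decoded image.
cond = clip_encode("a photo of an astronaut riding a horse")
latents = rng.standard_normal((8, 8))            # start from pure noise, in latent space
T = 25
for t in reversed(range(T)):
    noise_pred = unet(latents, t, cond)
    latents = scheduler_step(latents, noise_pred, t, T)
image = vae_decode(latents)                      # 8x8 latent -> 64x64 "pixels"
```

The key design point this illustrates is why the VAE matters: the denoising loop runs in a small latent space (here 8×8) rather than full pixel resolution, which is what makes latent diffusion efficient.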
Diffusion models have largely displaced GANs for high-quality image generation due to their training stability and output quality. They've been extended to text-to-video, 3D generation, audio synthesis, and protein structure prediction. The main limitation is generation speed — each sample requires multiple forward passes — though distillation techniques have reduced this to 1–4 steps.