diffusion-models · image-generation · generative-ai · stable-diffusion · dalle

Diffusion Models: How AI Generates Stunning Images (and Video)

From noise to art—how Stable Diffusion, DALL-E 3, and Midjourney create images from text

AI Resources Team · 13 min read

In August 2022, Stability AI released Stable Diffusion, an open-source text-to-image model. For the first time, regular people could generate photorealistic images by typing a description.

A few weeks later, OpenAI dropped the waitlist for DALL-E 2; Midjourney's open beta was already live. By 2024, AI-generated images were mainstream. Artists weren't just using them—they were winning contests with them. Photographers were panicking.

All of this is powered by diffusion models, a surprisingly elegant approach to image generation.


The Core Idea: Learning to Denoise

Here's the insight that makes diffusion models work:

Imagine you have a clear image. You gradually add random noise to it, over and over. After many steps, you end up with pure noise—no recognizable image left.

Now reverse the process: start with pure noise, and gradually remove noise. If you can learn to reverse that process, you can generate images.

This is diffusion: noise → [gradually denoise] → image.


The Forward Process: Adding Noise

Let's say you start with a real image.

Step 0: Original image (clear, no noise)

Step 1: Add a tiny bit of random noise. Still mostly recognizable.

Step 2: Add more noise. Getting blurrier.

Step 3: Even more noise.

...

Step 50: Pure noise. No image remaining.

Mathematically, each step is:

x_t = sqrt(α_t) * x_{t-1} + sqrt(1 - α_t) * ε

where ε is random Gaussian noise

This is the "forward diffusion process." You're gradually corrupting the image.

This forward process is not learned. The noise schedule (the α_t values) is fixed in advance; each step just adds fresh random noise. You always know how to turn an image into noise.
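The forward process above can be sketched in a few lines. This is a toy illustration with an arbitrary noise schedule, not a tuned one:

```python
# Toy forward diffusion: repeatedly mix an "image" with Gaussian noise,
# following x_t = sqrt(alpha)*x_{t-1} + sqrt(1-alpha)*eps.
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev, alpha):
    """One forward step: scale the signal down, mix fresh noise in."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(alpha) * x_prev + np.sqrt(1 - alpha) * eps

x = np.ones((8, 8))                    # stand-in for a real image
alphas = np.linspace(0.95, 0.90, 50)   # noise schedule: fixed, not learned
for alpha in alphas:
    x = forward_step(x, alpha)

# After 50 steps the original signal is mostly gone; x is close to
# standard Gaussian noise.
print(round(float(x.std()), 2))
```

After the loop, the fraction of the original image that survives is sqrt of the product of all the alphas, which here is about 0.14, so the result is dominated by noise.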


The Reverse Process: Removing Noise

Now the clever part: learn to reverse this.

Step 50: You have pure noise.

Step 49: Predict what the noise looks like in this pure-noise image. Remove it. You're left with "slightly less noisy" noise.

Step 48: Predict the noise at this step. Remove it. Even less noisy.

...

Step 0: Predict the final noise. Remove it. You have an image.

The reverse process is learned. You train a neural network to predict: "Given this noisy image at step t, what noise was added?"

ε_predicted = Network(noisy_image, t)
x_{t-1} = (x_t - sqrt(1 - α_t) * ε_predicted) / sqrt(α_t)

Once you've learned to denoise at every step, you can generate images by starting with noise and iteratively denoising.
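That sampling loop looks like the sketch below. The noise predictor is a stub here (a real one is a trained network), so this shows the control flow of generation, not actual image synthesis:

```python
# Sketch of the reverse (sampling) loop using the update rule above.
import numpy as np

rng = np.random.default_rng(0)
alphas = np.linspace(0.95, 0.90, 50)

def predict_noise(x_t, t):
    """Stand-in for Network(noisy_image, t); a trained model goes here."""
    return np.zeros_like(x_t)

x = rng.standard_normal((8, 8))          # step 50: start from pure noise
for t in reversed(range(len(alphas))):
    alpha = alphas[t]
    eps_hat = predict_noise(x, t)
    # x_{t-1} = (x_t - sqrt(1 - alpha_t) * eps_predicted) / sqrt(alpha_t)
    x = (x - np.sqrt(1 - alpha) * eps_hat) / np.sqrt(alpha)

print(x.shape)    # the final x plays the role of the generated image
```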


Training a Diffusion Model

How do you train this denoising network?

  1. Start with a real image
  2. Randomly pick a noise level t (1 to 50)
  3. Add noise up to that level (forward process)
  4. Ask the network to predict the noise
  5. Compare prediction to actual noise (loss function: mean squared error)
  6. Update network to predict better
  7. Repeat on millions of images

After training on millions of images and billions of (image, noise_level, noise) triplets, the network becomes very good at predicting noise at any level.

The training is simple: just predicting noise. The magic is that this enables generation.
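The seven-step recipe above can be sketched as a training loop. A single weight matrix stands in for the UNet here, and the jump straight to noise level t uses the cumulative product of the alphas (often written ᾱ_t), a standard closed form of the forward process:

```python
# One toy training iteration per the recipe above; not a real model.
import numpy as np

rng = np.random.default_rng(0)
alphas = np.linspace(0.95, 0.90, 50)
alpha_bar = np.cumprod(alphas)     # cumulative product: jump straight to step t

W = np.zeros((64, 64))             # "network" parameters (a linear map)
lr = 1e-3

def train_step(image):
    global W
    x0 = image.reshape(-1)                        # 1. a real 8x8 image, flattened
    t = rng.integers(len(alphas))                 # 2. random noise level
    eps = rng.standard_normal(64)                 #    the actual noise added
    # 3. closed-form forward process up to level t
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    eps_hat = W @ x_t                             # 4. predict the noise
    loss = np.mean((eps_hat - eps) ** 2)          # 5. mean squared error
    W -= lr * 2 * np.outer(eps_hat - eps, x_t) / 64   # 6. gradient update
    return loss                                   # 7. caller repeats on more images

losses = [train_step(rng.standard_normal((8, 8))) for _ in range(200)]
print(round(losses[-1], 3))
```

A real run replaces the linear map with a UNet, the toy images with a billion-scale dataset, and the hand-written gradient with an autodiff framework; the loop structure is the same.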


Text Conditioning: From Text to Image

So far, we've been generating images from random noise. To get "a cat wearing sunglasses," we need to guide the generation process.

Text conditioning: Include the text description as input to the denoising network.

predicted_noise = Network(noisy_image, noise_level, text_embedding)

During training:

  • Random images paired with captions
  • Train network to denoise conditioned on caption
  • Network learns "when caption mentions 'cat,' the image should look like a cat"

During generation:

  • Start with noise
  • At each step, denoise conditioned on your text prompt
  • The text guides the denoising toward matching your description

For this to work well, you need:

  1. A huge dataset of images with captions (LAION has 5+ billion image-text pairs)
  2. A good way to encode text (using models like CLIP)
  3. A robust denoising network (usually a UNet architecture)
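The conditioned call from the equation above can be sketched as follows. All names are illustrative, and the simple concatenation is a stand-in: real models fuse the text embedding into the UNet via cross-attention, not by appending it to the input:

```python
# Placeholder conditioned denoiser: Network(noisy_image, noise_level, text_embedding).
import numpy as np

def predict_noise(noisy_image, t, text_embedding, weights):
    """Toy conditioned denoiser: flatten everything into one feature vector."""
    features = np.concatenate([noisy_image.ravel(), [t], text_embedding])
    return (weights @ features).reshape(noisy_image.shape)

rng = np.random.default_rng(0)
x_t = rng.standard_normal((8, 8))
text_emb = rng.standard_normal(16)       # e.g. an embedding of the prompt
W = rng.standard_normal((64, 64 + 1 + 16)) * 0.01

eps_hat = predict_noise(x_t, t=25, text_embedding=text_emb, weights=W)
print(eps_hat.shape)                     # same shape as the noisy image
```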

CLIP: Aligning Images and Text

Most modern diffusion models use CLIP (Contrastive Language-Image Pre-training from OpenAI) to encode both text and images into the same representation space.

CLIP works by:

  1. Training on image-caption pairs
  2. Using contrastive loss: match captions to their correct image, push away incorrect matches
  3. Result: text and images are represented in the same space

Now you can:

  • Encode your text prompt with CLIP
  • Pass that encoding to the diffusion model
  • The model learns to generate images matching that text representation

This is why text prompts work so well. CLIP already understands language-image alignment.
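The contrastive objective in step 2 can be shown with a toy example. Random vectors stand in for real encoder outputs; the loss rewards each caption embedding for being closest to its own image embedding:

```python
# Toy CLIP-style contrastive loss: match captions to their own images,
# push away the other images in the batch.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 32                               # 4 image-caption pairs, 32-dim space

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

img = normalize(rng.standard_normal((n, d)))
txt = normalize(img + 0.1 * rng.standard_normal((n, d)))  # matched pairs are similar

logits = img @ txt.T                       # n x n cosine similarities
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
# Loss: negative log-probability of the correct (diagonal) match.
loss = -np.log(np.diag(probs)).mean()
print(round(float(loss), 3))
```

Because the matched pairs were built to be similar, the loss comes out below log(4), the value you would get from random guessing over four candidates.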


The Architecture: UNet

The denoising network is typically a UNet, an architecture originally designed for medical image segmentation.

Structure:

Input (noisy image) → Downsampling path → Bottleneck → Upsampling path → Output (noise prediction)
                         ↓                                      ↑
                    (Skip connections connecting downsampling to upsampling)

Why UNet?

  • Skip connections preserve spatial information (important for images)
  • Downsampling extracts features
  • Upsampling reconstructs spatial detail
  • It's efficient and works well in practice

The UNet is conditioned on:

  • The noise level t (which diffusion step)
  • The text embedding (what to generate)
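The UNet's data flow can be traced at the shape level. Convolutions and attention are omitted here (pooling and nearest-neighbor upsampling stand in), so this shows only how skip connections carry resolution back up:

```python
# Shape-level sketch of UNet data flow: down, bottleneck, up with skips.
import numpy as np

def down(x):     # halve spatial resolution (2x2 average pool)
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def up(x):       # double spatial resolution (nearest neighbor)
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.default_rng(0).standard_normal((32, 32))   # noisy input

d1 = down(x)       # 16x16  -- downsampling path
d2 = down(d1)      # 8x8
b  = d2 * 0.5      # bottleneck (placeholder transform)
u1 = up(b) + d1    # 16x16  -- skip connection from d1
u2 = up(u1) + x    # 32x32  -- skip connection from the input
print(u2.shape)    # noise prediction, same shape as the input
```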

Why Diffusion Works Better Than GANs

Before diffusion, GANs (Generative Adversarial Networks) were the state-of-the-art for image generation.

| Aspect | GANs | Diffusion |
|---|---|---|
| Training stability | Tricky (mode collapse) | Stable |
| Diversity | Sometimes limited | Very diverse |
| Quality | Good but varying | Consistently high |
| Speed | Fast inference | Slow (100+ denoising steps) |
| Scalability | Doesn't scale as well | Scales to billions of parameters |

Why diffusion is better:

  1. Stability — Simple loss function (predict noise), no adversarial dynamics, easier to train
  2. Diversity — Each generation starts from a different random noise sample, producing diverse outputs
  3. Scalability — Scales better to large models and datasets
  4. Quality — Seems to produce higher visual quality at scale

Why GANs are faster: GANs generate in one forward pass. Diffusion requires many steps (100–1000 denoising steps).

For speed-critical applications, GANs are still used. But for quality, diffusion dominates.


Real-World Models

Stable Diffusion (2022)

Released by Stability AI, open-source and free. Runs on consumer GPUs.

  • Model size: ~1 billion parameters for SD 1.x (SDXL is around 3.5 billion)
  • Training data: LAION-5B (about 5 billion image-text pairs)
  • Speed: seconds to tens of seconds per image on consumer GPUs
  • Quality: Good for most purposes

Stable Diffusion has gone through several major versions (SD 1.5, SDXL, and newer releases), with community fine-tuning creating countless variants.

DALL-E 3 (OpenAI, 2023)

Proprietary, available via API or ChatGPT.

  • Quality: Excellent coherence, handles text in images, detailed understanding
  • Speed: Few seconds (optimized inference)
  • Cost: Pay-per-image

DALL-E 3 has better understanding of nuance and text rendering than Stable Diffusion.

Midjourney (2022–Present)

Proprietary, Discord-based interface.

  • Quality: Aesthetically stunning, great artistic style
  • Speed: Fast (behind the scenes, probably batched inference)
  • Community: Very active, lots of shared prompts and techniques

Midjourney's aesthetic is distinctive—users love the look or find it too "stylized" depending on preference.

Other Players

  • Runway — Text-to-video diffusion
  • Pika — Text-to-video, video-to-video
  • Google's Imagen — Proprietary, not widely released
  • Adobe Firefly — Integrated into Creative Cloud
  • Microsoft Designer — Integrated with Bing/Edge

Image-to-Image: More Than Text-to-Image

Diffusion doesn't stop at text-to-image. You can also do image-to-image:

  1. Start with a real image (not pure noise)
  2. Add noise up to level t
  3. Denoise from that point, conditioned on text

This lets you:

  • Modify existing images
  • Change style
  • Repose subjects
  • Inpaint (fill in masked areas)

The amount of noise you add controls the amount of change:

  • Lots of noise → Major modifications
  • Little noise → Subtle tweaks
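The noise-to-change relationship above is the "strength" knob in image-to-image tools. A toy sketch (with a stub denoiser, so it shows the mechanics only):

```python
# Image-to-image: noise a real image up to level t, then denoise from there.
import numpy as np

rng = np.random.default_rng(0)
alphas = np.linspace(0.95, 0.90, 50)
alpha_bar = np.cumprod(alphas)

def img2img(image, strength):
    t = int(strength * (len(alphas) - 1))        # higher strength = more noise
    eps = rng.standard_normal(image.shape)
    x = np.sqrt(alpha_bar[t]) * image + np.sqrt(1 - alpha_bar[t]) * eps
    for s in reversed(range(t + 1)):             # denoise from level t back down
        eps_hat = np.zeros_like(x)               # placeholder noise predictor
        x = (x - np.sqrt(1 - alphas[s]) * eps_hat) / np.sqrt(alphas[s])
    return x

original = np.ones((8, 8))
subtle = img2img(original, strength=0.1)   # little noise -> close to original
major  = img2img(original, strength=0.9)   # lots of noise -> big changes
print(np.abs(subtle - original).mean() < np.abs(major - original).mean())
```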

Video Generation: The Next Frontier

Text-to-video is the frontier in 2024–2025. Sora (OpenAI), Runway, Pika, and others can generate videos from text.

How it works:

Video is essentially 2D diffusion extended to 3D (spatial + temporal dimensions).

  1. Temporal information: Instead of just spatial features, include temporal coherence. Frame t should be related to frame t-1.
  2. Generation strategy: Generate frames autoregressively (one at a time), or generate all frames jointly with temporal consistency.
  3. Conditioning: Condition on text, existing frames, or camera movements.

Challenges:

  • Temporal consistency — All frames must be coherent together
  • Physics — Motion must follow laws of physics (this is hard for neural networks)
  • Computation — Videos are many frames, so much more compute than images

Sora (previewed in early 2024) showed impressive results, but motion sometimes looks off and objects sometimes violate physics. This will improve.


Prompt Engineering for Diffusion

Generating good images requires good prompts. Some principles:

Be specific:

❌ "A cat"
✅ "A tabby cat wearing sunglasses, sitting on a beach, photorealistic, 8k"

Include style:

"In the style of Monet" / "Oil painting" / "Anime" / "Cyberpunk"

Add quality descriptors:

"Highly detailed" / "Sharp focus" / "Professional photography" / "Intricate"

Specify artist or influence:

"In the style of Studio Ghibli" / "Inspired by Beeple"

Use negative prompts:

In Stable Diffusion and similar tools, the negative prompt is a separate input, not words in the main prompt:

Prompt: "A cat wearing sunglasses" / Negative prompt: "blurry, ugly, deformed"

Negative prompts tell the model what to avoid.

Most advanced models (DALL-E 3, newer Midjourney) handle natural language better, so prompts can be more conversational.


Training Data and Bias

Diffusion models are trained on internet images. Internet images reflect human biases.

This means:

  • Gender bias: Women more likely portrayed in certain roles
  • Racial bias: Different skin tones represented differently
  • Copyright issues: Training on copyrighted images without permission
  • Style bias: Certain artistic styles overrepresented

Mitigations:

  • Filtering data: Remove copyrighted or harmful content
  • Balanced sampling: Oversample underrepresented groups
  • Fine-tuning: Train on curated datasets to reduce bias
  • User controls: Let users select style or avoid certain aesthetics

This is ongoing work. No perfect solution yet.


Computational Cost

Training a diffusion model is expensive:

| Model | Parameters | Training Cost |
|---|---|---|
| Stable Diffusion | ~860M UNet + ~123M text encoder | ~$100K–$1M |
| DALL-E 2 | Unreleased | Millions of dollars |
| Midjourney | Proprietary | Estimated $10M+ |

This creates a barrier to entry. You essentially need the resources of an OpenAI or a Stability AI, or comparable funding, to train from scratch.

Inference is much cheaper. Running Stable Diffusion on a consumer GPU costs <$0.01 per image (electricity). Paid services (DALL-E, Midjourney) charge $0.02–$0.20 per image.


The Comparison: Diffusion vs. Other Approaches

Diffusion vs. GANs

Diffusion: Stable training, high quality, slow generation
GANs: Fast generation, unstable training, lower consistency

Winner: Diffusion for quality, GANs for speed

Diffusion vs. Autoregressive Models

Some models (like PixelCNN) generate images one pixel at a time. This works but is slow: millions of pixels means millions of sequential steps.

Diffusion operates on the full image at every step, making it faster.

Winner: Diffusion

Diffusion vs. Variational Autoencoders (VAEs)

VAEs are generative models using variational inference. They work okay but tend to produce blurrier images than diffusion.

Winner: Diffusion


The Future of Diffusion

Efficiency

Current diffusion models require many steps (100–1000) to generate high quality. Researchers are working on faster versions:

  • Fewer steps — Distillation techniques let you generate in 4–10 steps
  • Faster UNet architectures — Smaller, more efficient denoising networks
  • Approximations — Approximate attention, quantization, pruning

Multi-Modal

Generating images from text is cool. Generating images from other images, audio, or 3D point clouds is coming.

Real-Time Generation

Currently, generation takes seconds to minutes. Real-time generation (for interactive design tools) is a frontier.

Video and 3D

Video generation (Sora, Runway) is nascent. 3D asset generation is coming. Imagine describing a 3D scene and getting a usable 3D model.

Personalization

Fine-tuning diffusion models on personal photos lets you generate images in your style or with your likeness. Legal/ethical implications TBD.


FAQs

Q: Can diffusion models generate any image? A: Anything they've seen patterns for in training. Novel combinations work, but truly unprecedented things are hard. And certain content (explicit, private) is filtered.

Q: Why is diffusion generation slow? A: Because it requires many denoising steps (100–1000). Each step is a neural network forward pass. Each pass takes time. Newer approaches (distillation, fewer steps) are improving this.

Q: Are AI-generated images copyrighted? A: Legally murky. The image is probably not copyrighted (no human author), so you might own it. But training data includes copyrighted images, raising questions.

Q: Can you fine-tune diffusion models? A: Yes, using techniques like LoRA. Train on a dataset of your style/subject, and the model learns to generate similar images.

Q: How do you prevent misuse (deepfakes, explicit content)? A: Filtering training data, filtering prompts, and detection systems. It's an arms race. No perfect prevention.


The Takeaway

Diffusion models are a breakthrough in generative AI. The core idea—learning to reverse a noise process—is elegant and practical.

By conditioning on text (or images, or other data), you can guide generation toward what you want. By training on billions of images, you get models that understand visual concepts deeply.

The result? Anyone can now generate images. Some are photorealistic. Some are artistic. Some are useful for design. Some are just fun.

The technology is maturing. Video generation (Sora, Runway, Pika) shows diffusion scales beyond images. The future probably includes interactive AI tools where you describe what you want and the AI generates it, all in real-time.

This is one of the most tangible examples of AI capability. While language models (ChatGPT, Claude) are impressive, image generation let people literally see what AI can do. That had huge cultural impact.


And That's the Journey

You've now learned:

  1. What is AI — The basics and history
  2. How machines learn — Supervised, unsupervised, reinforcement learning
  3. Neural networks — The building blocks of modern AI
  4. Generative AI — The breakthroughs in text, image, and video generation
  5. Synthetic data — Training data that's AI-generated
  6. Transfer learning — Reusing learned knowledge
  7. Transformers — The architecture behind ChatGPT and modern AI
  8. Attention mechanism — How AI learns what to focus on
  9. Diffusion models — How AI generates images

You now understand the foundations of modern AI. The field is moving fast, but these concepts form the bedrock.

Where do you go from here? Build something. Fine-tune a model. Generate images. Write with ChatGPT. Use Claude to code. Apply these tools to your problem.

The best way to learn is by doing.

Welcome to the AI era.


You've finished the series! If you want to dive deeper into specific topics, check out research papers on arxiv.org, follow researchers on Twitter/X, or contribute to open-source AI projects.


Keep Learning