Remember when GPUs were just for gaming? Those days are ancient history. Today, GPUs are the reason we can train language models with trillions of parameters, run real-time image recognition on your phone, and make AI accessible at all. This is the story of how video game processors became the secret weapon of AI.
What's a GPU (And Why It Matters)
A GPU is a processor designed to do one thing incredibly well: run thousands of simple operations in parallel.
CPUs are like a chef working alone: incredibly skilled, can do anything, but only one dish at a time.
GPUs are like a restaurant kitchen: lots of simpler workers doing the same task simultaneously.
CPU: sequential, complex decisions, general-purpose.
GPU: parallel, simple operations, special-purpose.
Neural networks are, at their core, massive matrix multiplications. Multiplying two 10,000×10,000 matrices? That's roughly 2 trillion operations (about 2n³ multiply-adds). CPUs work through them a handful at a time. GPUs do thousands simultaneously. The result: 100-500x speedups on exactly this kind of math.
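The arithmetic is worth making concrete. A quick pure-Python sketch (the matrix size is just the example above):

```python
# Multiplying two n×n matrices produces n*n output elements, each a dot
# product of n multiply-add pairs: roughly 2*n^3 floating-point operations.
def matmul_flops(n: int) -> int:
    return 2 * n ** 3

n = 10_000
flops = matmul_flops(n)
print(f"{n}x{n} matmul: {flops:.2e} FLOPs")  # 2.00e+12
```

Spread across ~16,000 cores instead of a handful, that workload is exactly what GPUs were built for.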
Inside a GPU: The Hardware
CUDA Cores (NVIDIA) or Stream Processors (AMD)
Thousands of tiny processors. An NVIDIA H100 has 16,896 CUDA cores. Each core isn't powerful—but doing 16,896 things at once is.
VRAM (Video Random Access Memory)
GPU's high-speed memory. Critical spec:
- Consumer GPUs: 8-24GB (RTX 4090)
- Enterprise GPUs: 40-80GB (A100, H100)
- Largest: 192GB (AMD MI300X with HBM3)
More VRAM = larger models you can load = bigger batches = faster training.
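A rough rule of thumb for "does it fit," sketched in Python. The bytes-per-parameter figures are standard for each precision; the 7B and 70B model sizes are illustrative:

```python
# Rough VRAM needed just to LOAD a model's weights:
# parameter count × bytes per parameter.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(params_billions: float, dtype: str) -> float:
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(7, "fp16"))   # 14.0  -> fits a 24GB RTX 4090
print(weight_memory_gb(70, "fp16"))  # 140.0 -> needs multiple GPUs
```

Note this is weights only; training needs several times more (gradients, optimizer state, activations), as covered later.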
Tensor Cores
Special circuits inside modern GPUs optimized for neural network math. They don't multiply one number at a time; they multiply entire 4×4 matrices in one cycle. Gaming? Barely used (DLSS upscaling aside). AI? Indispensable.
Memory Bandwidth
How fast data moves between memory and compute. It matters hugely:
- VRAM itself: ~3.35TB/s on an H100 (HBM3)
- Host-to-GPU: PCIe (~16-64GB/s depending on generation)
- GPU-to-GPU: NVLink (900GB/s between H100s)
Why? Training a model involves billions of memory transfers. Slow bandwidth = compute units sitting idle, waiting for data.
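To see why the interconnect matters, consider moving a 40GB set of model weights (a hypothetical but realistic size). Transfer time is just bytes divided by bandwidth:

```python
# Time to move data = size / bandwidth.
def transfer_seconds(gigabytes: float, gb_per_s: float) -> float:
    return gigabytes / gb_per_s

weights_gb = 40
print(transfer_seconds(weights_gb, 16))   # 2.5 s over PCIe 3.0 x16
print(transfer_seconds(weights_gb, 900))  # ~0.044 s over NVLink
```

A transfer that stalls training for seconds over PCIe takes tens of milliseconds over NVLink, which is why multi-GPU servers pay for the faster fabric.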
Clock Speed & Power
Higher speed = faster computation but more heat and power draw. A modern H100:
- Power: 700W
- Requires proper cooling
- Temperature limits (>80°C and it throttles)
Types of GPUs You Actually Encounter
Consumer Gaming GPUs
NVIDIA RTX 4090, RTX 4080, AMD RX 7900 XTX
- VRAM: 12-24GB
- Cost: $1500-2000
- Performance: Excellent for learning, serious hobby work
- Real-world: Many startups start here
Professional/Data Center GPUs
NVIDIA H100, A100, AMD MI300
- VRAM: 40-192GB
- Cost: $10,000-40,000 per GPU
- Performance: Required for production, large-scale training
- Enterprise standard
Mobile GPUs
Apple Neural Engine, Qualcomm Adreno, Google Tensor
- VRAM: Shared with system RAM (4-12GB typical)
- Cost: Built-in (not separate)
- Performance: Optimized for inference, power-efficient
- Use case: Your phone running AI
Older GPUs Still Useful
NVIDIA V100, A40, RTX 6000
- VRAM: 16-32GB
- Cost: $500-2000 used
- Performance: Still competitive, great bang-for-buck
- Reality: Many companies still use these
GPUs vs. CPUs: The Real Tradeoffs
| Aspect | CPU | GPU |
|---|---|---|
| Best at | Complex logic, general tasks | Parallel math, same operation repeated |
| Cores | 8-64 | 2,000-16,000 |
| Memory | 16-256GB RAM | 8-192GB VRAM |
| Speed (single task) | Fast | Slower |
| Speed (parallel) | Slow | 100-500x faster |
| Power/Watt | Efficient | Power-hungry |
| Cost | $500-5,000 | $1,500-40,000 |
| Best for | Web servers, logic | Matrix ops, AI, graphics |
Truth: You need both. CPU orchestrates, GPU computes.
GPU Compute in Real Scenarios (2025)
Training GPT-Like Models
Baseline: Train on CPU → 1 token/second
GPU (single RTX 4090) → 100 tokens/second (100x faster)
GPU Cluster (8× H100s) → 100,000 tokens/second
Cost to train GPT-3 scale model:
CPU cluster: $5-10 million
GPU cluster: $500K-2 million
Real-Time Inference
Sentiment analysis: 1000 requests/second
CPU (16-core): Can handle ~10 RPS
GPU (RTX 4090): Can handle ~500 RPS
GPU (H100): Can handle ~5000 RPS
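Capacity planning from these throughput figures is simple division, rounded up. A sketch using the illustrative RPS numbers above:

```python
import math

# How many devices are needed to serve a target request rate,
# given per-device throughput?
def devices_needed(target_rps: int, rps_per_device: int) -> int:
    return math.ceil(target_rps / rps_per_device)

target = 1000
print(devices_needed(target, 10))    # 100 16-core CPUs
print(devices_needed(target, 500))   # 2 RTX 4090s
print(devices_needed(target, 5000))  # 1 H100
```

The hardware bill for the same service differs by two orders of magnitude in device count.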
Mobile Vision
Recognize objects in a photo
CPU (Apple M2): 500ms (too slow, feels sluggish)
Neural Engine (Apple M2): 50ms (feels instant)
Result: Neural Engine (GPU-like) is 10x faster
The GPU Shortage Problem (Lessons from 2023)
In 2023, everyone wanted H100s. They couldn't get them. Why?
- NVIDIA was the sole supplier, itself dependent on limited TSMC fab capacity
- Supply couldn't keep up with demand (everyone training large language models)
- Lead times: 6+ months
- Prices: 3x MSRP on secondary market
The shortage spawned whole industries:
- GPU rental cloud (Lambda Labs, RunPod)
- GPU sharing platforms
- Incentives to use TPUs and other accelerators (available, but with smaller ecosystems)
- Push toward model efficiency (LoRA, pruning, distillation)
Lesson: GPU availability constrains AI development. Diversification matters (AMD, custom chips).
Choosing the Right GPU
For Learning/Research
Pick: RTX 4090 or RTX 3090
- Cost: $1500-2000
- VRAM: 24GB
- Performance: Excellent
- Real-world: plenty of startups prototype on consumer GPUs
For Serious Training
Pick: A100 or H100 (rental or owned)
- Cost: Rental $2-4/hour, owning $20K+
- VRAM: 40-80GB
- Performance: Industry standard
- Scaling: Add more for distributed training
For Inference/Deployment
Pick: Varies by use
- High-volume: Dedicated serving GPU (cheaper/quieter A10)
- Mobile: Built-in accelerator
- Edge: Specialized hardware (Intel Movidius, TPU)
For Mixed Workloads
Pick: Multiple GPU types
- Training: H100 or A100
- Inference: A10 or T4 (cheaper, sufficient)
- Development: RTX 4090 (buy once)
Scaling Across Multiple GPUs
One GPU is simple. Multiple GPUs are complex:
Data Parallelism
Batch of 1000 images
↓
Split into 8 batches (125 each)
↓
GPU 1: Process 125 → Gradients
GPU 2: Process 125 → Gradients
...
GPU 8: Process 125 → Gradients
↓
Synchronize, average gradients
↓
Apply the same averaged update on every GPU
Works well up to ~8-16 GPUs. Communication becomes bottleneck beyond.
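The steps above can be sketched in miniature. This is pure Python standing in for real GPUs: the "model" is a single weight w with loss (w·x − y)², and each shard's gradient is averaged before one shared update (the all-reduce step):

```python
# Data parallelism in miniature: each "GPU" computes a gradient on its
# shard of the batch; gradients are averaged before a single update.
def shard_gradient(w, shard):
    # Gradient of (w*x - y)^2 with respect to w is 2*(w*x - y)*x.
    grads = [2 * (w * x - y) * x for x, y in shard]
    return sum(grads) / len(grads)

def data_parallel_step(w, batch, n_gpus, lr=0.01):
    size = len(batch) // n_gpus
    shards = [batch[i * size:(i + 1) * size] for i in range(n_gpus)]
    grads = [shard_gradient(w, s) for s in shards]  # runs in parallel on real GPUs
    avg = sum(grads) / n_gpus                       # the "all-reduce" synchronization
    return w - lr * avg                             # identical update everywhere

batch = [(x, 3 * x) for x in range(1, 9)]  # data generated with true w = 3
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, batch, n_gpus=4)
print(round(w, 3))  # converges toward 3.0
```

Every "GPU" ends each step with the same weights, which is what makes the scheme correct; the communication cost is that averaging step, which is why it stops scaling past a point.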
Model Parallelism
Model too big for one GPU
↓
Split layers across GPUs
GPU1: Layers 1-10
GPU2: Layers 11-20
GPU3: Layers 21-30
↓
Forward pass: GPU1 → GPU2 → GPU3
Backward pass: GPU3 → GPU2 → GPU1
Complex but necessary for models >70B params.
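A toy sketch of the same idea, with plain functions standing in for layers and a log standing in for device-to-device transfers (the layer functions and device names are invented for illustration):

```python
# Model parallelism in miniature: layers live on different devices, and
# activations flow from one device to the next during the forward pass.
log = []

def run_on(gpu, layer, x):
    log.append(gpu)  # where a real framework would transfer the activation tensor
    return layer(x)

layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
placement = ["gpu0", "gpu1", "gpu2"]  # one layer slice per device

def pipeline_forward(x):
    for gpu, layer in zip(placement, layers):
        x = run_on(gpu, layer, x)
    return x

print(pipeline_forward(5))  # ((5+1)*2)-3 = 9
print(log)                  # ['gpu0', 'gpu1', 'gpu2']
```

Note the serial dependency: gpu1 cannot start until gpu0 finishes. Real systems add pipeline scheduling (micro-batches) to keep all devices busy.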
GPU Memory Management (Critical Skill)
Memory fills up fast:
Model weights: 40GB (H100 has 80GB)
Activations (forward pass): 20GB
Gradients (backward pass): 40GB
Optimizer state: 80GB
Total: 180GB > 80GB available
→ Out of memory!
Solutions:
- Gradient checkpointing: Recompute instead of store (save memory, use compute)
- Mixed precision: Use FP16 where possible, FP32 where needed (half memory)
- Batch size reduction: Smaller batches = less memory (but slower)
- Offloading: Store on CPU RAM, load to GPU as needed (slow)
Libraries like DeepSpeed (with its ZeRO optimizer-state sharding) exist specifically to solve this problem.
GPU vs. TPU vs. Other Accelerators
| Hardware | Best At | Drawbacks | When to Use |
|---|---|---|---|
| GPU | Everything | Power-hungry, expensive | Default choice, always works |
| TPU | Matrix math, inference | Only Google ecosystem | Google Cloud AI |
| CPU | Logic, general | Slow at math | Everything else |
| FPGA | Specialized | Hard to program | Custom hardware |
| ASIC | Ultra-specific | Can't reprogram | Bitcoin mining, specialized |
Rule: 99% of AI projects use GPUs. Specialize only if you must.
Real GPU Economics
Small Startup Scenario
Team: 5 ML engineers
Budget: $50K
Option A: Buy 2× RTX 4090
Cost: ~$4K upfront
Downside: Only two jobs at a time; experiments queue up
Option B: Rent GPUs on Lambda Labs
Cost: ~$1/hour per RTX 4090
One 50-hour training run: $50
A $1,000/month budget buys ~1,000 GPU-hours ≈ 20 concurrent 50-hour runs
Upside: Scales elastically, no capex
Timeline: Same speed per model, but 20 experiments can run in parallel
Verdict: Rent. Most startups rent.
Enterprise Scenario
Team: 100+ ML engineers
Training volume: 10,000 GPU-hours/month
Option A: Cloud rental (Lambda, AWS, GCP)
Cost: $2/hour × 10,000 = $20K/month
Option B: Own a cluster
Cost: 80 GPUs × $10K = $800K upfront
Running cost: $5K/month (power, cooling, staff)
Break-even: ~4.5 years ($800K ÷ $15K/month in savings ≈ 53 months)
Upside: Predictable costs, full control
Verdict: Own, if the volume is sustained. (Past break-even, much cheaper.)
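The break-even math, using the enterprise numbers above:

```python
# Rent vs. own: months until the upfront hardware cost pays for itself.
def breakeven_months(capex: float, rent_per_month: float,
                     own_running_per_month: float) -> float:
    monthly_savings = rent_per_month - own_running_per_month
    return capex / monthly_savings

months = breakeven_months(capex=800_000, rent_per_month=20_000,
                          own_running_per_month=5_000)
print(round(months, 1))  # 53.3 months, i.e. roughly 4.5 years
```

The result is sensitive to utilization: if the cluster sits half idle, the effective rental comparison halves and break-even stretches toward a decade.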
GPU Programming (For Developers)
You usually don't need to write GPU code (CUDA) yourself. High-level frameworks handle it:

PyTorch:

```python
model = model.to("cuda")    # Move model weights to the GPU
input = input.to("cuda")    # Inputs must live on the same device
output = model(input)       # Computation now runs on the GPU
```

TensorFlow:

```python
with tf.device('/GPU:0'):
    result = model(input)
```

The framework dispatches your Python calls to pre-built CUDA kernels. You only write CUDA yourself when optimizing something specific.
Cooling, Power, Physical Setup
GPUs need serious cooling:
- Air cooling: Fans, heatsinks, proper ventilation
- Liquid cooling: More efficient, quieter, for datacenters
- Operating temps: 60-80°C (above 80°C = throttling)
Power consumption:
- Single RTX 4090: 450W (plus the rest of the system; a 1000W+ power supply is typical)
- 8× H100 cluster: 5600W (industrial power, dedicated circuit)
- Cooling: 10-30% additional power (air conditioning needed)
Real cost of GPU cluster:
- Hardware: 50%
- Power/cooling: 30%
- Infrastructure: 20%
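As a rough sketch, the monthly electricity bill for the 8× H100 cluster above. The 30% cooling overhead matches the range given earlier; the $0.12/kWh rate is an illustrative assumption:

```python
# Monthly electricity cost: compute power plus cooling overhead,
# converted to kWh and multiplied by a utility rate.
def monthly_power_cost(watts: float, cooling_overhead: float = 0.30,
                       usd_per_kwh: float = 0.12) -> float:
    kw = watts / 1000 * (1 + cooling_overhead)
    kwh_per_month = kw * 24 * 30  # running flat-out all month
    return kwh_per_month * usd_per_kwh

print(round(monthly_power_cost(5600)))  # ~$629/month just for power
```

Modest next to the hardware cost of one cluster, but it compounds across hundreds of nodes and years of operation, which is how power and cooling reach ~30% of total spend.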
The GPU Hierarchy (2025)
Best for Learning: NVIDIA RTX 4090 (24GB, great price/performance)
Best for Production: NVIDIA H100 (80GB, the current workhorse for large-scale training)
Best for Enterprise: NVIDIA A100 (40GB, proven, slightly older but reliable)
Best Value: Used V100/A40 (older but still powerful, $500-2000)
Best for Mobile: Apple Neural Engine or Qualcomm Adreno (power-efficient)
Best for Budget: T4 or RTX 3060 12GB (cheaper, slower but works)
FAQs
Do I need GPU for learning ML? For deep learning: yes. For traditional ML: CPU is fine. Start with GPU to avoid limits.
Can I use my gaming GPU for training? Absolutely. RTX 4090 is great for deep learning. Just needs NVIDIA drivers + CUDA.
What's the memory rule of thumb? 8GB minimum, 24GB comfortable, 40GB+ for serious training. Generally: more VRAM = bigger models = faster iteration.
Should I buy or rent? If training <100 GPU-hours/month: rent. If training >1000 GPU-hours/month: buy.
Will AMD GPUs replace NVIDIA? Maybe eventually. Currently: NVIDIA dominates (80% market). AMD improving. Most tools optimize for NVIDIA first.
How often do GPUs become obsolete? A new generation ships every 1-2 years, but older GPUs stay useful 5+ years. The H100 (2022) is still near the top of the stack. The V100 (2017) is still capable.
Next up: Explore Edge Computing to see how to deploy efficient models on devices without powerful GPUs.