What is tokenization?
Tokenization is the process of breaking text into smaller pieces called tokens. These could be words, subwords, or even individual characters. It's the critical first step that lets AI systems understand language.
Think of it like breaking a sentence into words. "I am learning about AI" becomes ["I", "am", "learning", "about", "AI"]. The AI can't understand full text—it needs chunks.
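That word-by-word split can be sketched in one line of Python (a toy illustration, not a production tokenizer):

```python
# Toy word-level tokenization: split on whitespace.
sentence = "I am learning about AI"
tokens = sentence.split()
print(tokens)  # ['I', 'am', 'learning', 'about', 'AI']
```

Real tokenizers do far more (handling punctuation, casing, and unknown words), but the core idea is exactly this: text in, list of pieces out.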
Why tokenization matters
Language models like ChatGPT, Claude, and Gemini can't process raw text directly. They process numbers. Tokenization bridges the gap—converting text into a format machines can actually use.
It's foundational. Bad tokenization = misunderstood meaning. Good tokenization = better language understanding.
How tokenization works
- Input - Raw text arrives
- Breaking - Text gets split into tokens
- Encoding - Each token gets converted to a number (token ID)
- Processing - The model processes these numbers
- Output - Predictions or responses are generated
The model sees something like [501, 1524, 13829, 100] (illustrative IDs) where you see "I am learning AI."
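The breaking and encoding steps can be sketched with a tiny hand-built vocabulary (the IDs below are made up for illustration; real vocabularies hold tens of thousands of entries):

```python
# Toy pipeline: tokenize, then encode each token as a number (hypothetical vocabulary).
vocab = {"I": 501, "am": 1524, "learning": 13829, "AI": 100}
id_to_token = {i: t for t, i in vocab.items()}

text = "I am learning AI"
tokens = text.split()                   # breaking: text -> tokens
token_ids = [vocab[t] for t in tokens]  # encoding: tokens -> IDs
print(token_ids)                        # [501, 1524, 13829, 100]

# The reverse mapping turns the model's numeric output back into text.
decoded = " ".join(id_to_token[i] for i in token_ids)
print(decoded)                          # I am learning AI
```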
Types of tokenization
Word Tokenization
Split text by spaces and punctuation. "Hello world!" becomes ["Hello", "world"]. Simple and intuitive but inefficient—requires a huge vocabulary to cover all possible words.
Subword Tokenization
Breaking words into smaller pieces. "tokenization" becomes ["token", "ization"]. More efficient—handles new words by combining known pieces. Used in modern models like ChatGPT and BERT.
Character Tokenization
Breaking into individual letters. "Hello" becomes ["H", "e", "l", "l", "o"]. Extremely granular, handles any text but creates very long sequences. Usually too verbose for modern use.
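The word-level vs. character-level trade-off is easy to see in code: character tokenization covers any input, but sequence lengths balloon.

```python
# Character-level tokenization: maximal coverage, much longer sequences.
word = "Hello"
print(list(word))  # ['H', 'e', 'l', 'l', 'o']

sentence = "I am learning about AI"
print(len(sentence.split()))           # 5 word tokens
print(len(sentence.replace(" ", "")))  # 18 character tokens (spaces dropped)
```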
Byte Pair Encoding (BPE)
Starts with characters, learns common byte pairs, merges them. Builds an efficient vocabulary by identifying patterns. Popular in modern language models.
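One BPE training step can be sketched in plain Python: count every adjacent symbol pair across the corpus, then merge the most frequent pair into a single new symbol. The corpus and frequencies below are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words as character tuples, with occurrence counts.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
pair = most_frequent_pair(corpus)  # ('w', 'e') occurs 5 + 2 + 6 = 13 times
corpus = merge_pair(corpus, pair)  # "lower" is now ('l', 'o', 'we', 'r')
```

A real BPE trainer repeats this merge step thousands of times, and the learned merge list becomes the tokenizer's vocabulary.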
WordPiece
Similar to BPE but adds word boundaries. Used by BERT and other transformers. Balances efficiency with meaning preservation.
SentencePiece
Language-agnostic approach. Works across languages without needing space separation. Critical for multilingual models.
Why tokenization is tricky
Vocabulary Size
Too small = can't represent all words. Too large = inefficient processing. Finding the sweet spot is an art.
Language Differences
English relies on spaces. Chinese, Japanese don't. Different languages require different tokenization strategies.
Punctuation and Special Characters
Handling contractions ("don't"), hyphens, emojis—all add complexity.
Context Matters
"New York" could be two tokens or one (if treated as a special token). The right tokenization depends on the task.
Real examples
ChatGPT's Tokenization
Uses a modified byte-level BPE approach. The word "Hello" might be a single token, while "Hmm, let me think" becomes several: ["Hmm", ",", " let", " me", " think"]. Note the leading spaces: whitespace is folded into the tokens rather than discarded.
BERT
Uses WordPiece. "Unbelievable" might be split as ["Un", "##believ", "##able"]. The "##" marks a continuation piece, so the tokens concatenate back to the original word.
Multilingual Models
Google's mBERT handles 100+ languages with a single shared WordPiece vocabulary; SentencePiece fills the same role in multilingual models like XLM-R and mT5.
Why this matters for AI quality
Token limits
Language models have context windows, commonly anywhere from 4K to 128K or more tokens depending on the model. Long text gets cut off. Efficient tokenization fits more meaning into fewer tokens.
Translation accuracy
Poor tokenization = misunderstandings. Proper tokenization preserves meaning across languages.
Instruction following
Models follow instructions better when instructions are well-tokenized. A poorly tokenized input confuses the model.
Cost efficiency
Many APIs charge per token, so efficient tokenization means lower costs. A 1,000-word essay might be ~1,200 tokens with an efficient tokenizer and ~1,800 with an inefficient one.
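The cost difference is simple arithmetic. A back-of-envelope sketch, using a hypothetical per-token price (check your provider's actual pricing):

```python
# Back-of-envelope API cost for a 1,000-word essay (hypothetical rate).
price_per_1k_tokens = 0.002  # dollars per 1,000 tokens; assumed, not a real quote

efficient_tokens = 1200    # ~1.2 tokens per word
inefficient_tokens = 1800  # ~1.8 tokens per word

cost_efficient = efficient_tokens / 1000 * price_per_1k_tokens
cost_inefficient = inefficient_tokens / 1000 * price_per_1k_tokens
print(f"${cost_efficient:.4f} vs ${cost_inefficient:.4f}")  # $0.0024 vs $0.0036
```

Fractions of a cent per essay, but at millions of requests the 50% overhead compounds into real money.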
Your tokenization questions, answered
What is tokenization in NLP?
Breaking text into smaller units (tokens) that language models can process. Essential preprocessing step.
Why can't models just read text directly?
Models work with numbers. They need a mapping from text to numerical representations. Tokenization creates that bridge.
What's the difference between tokenization and vectorization?
Tokenization splits text. Vectorization converts tokens to numerical vectors. They're related but different steps.
How many tokens is "Hello world"?
Depends on the tokenizer. Word tokenization gives 2 (one per word). Most subword tokenizers also produce 2 (e.g., "Hello" and " world"), but rare words or unusual spacing can split into more.
Can I use the same tokenizer for all languages?
Not optimal. Languages have different structures. Multilingual tokenizers like SentencePiece work across many languages but may not be optimal for specific ones.
Why do models have token limits?
Computational constraints. Longer sequences require more memory and processing power. Context windows balance capability with cost.
Does better tokenization always mean better performance?
Usually yes, but it depends on the task. Task-specific tokenization often outperforms general approaches.
The future of tokenization
As models evolve, tokenization gets smarter. Researchers experiment with:
- Semantic tokenization - Tokens represent meaning, not just surface patterns
- Dynamic tokenization - Different tokenization for different inputs
- Multilingual optimization - Better handling of code-switching and mixed languages
- Efficient encoding - Using fewer tokens to represent more meaning
The goal: make tokenization transparent and efficient so models can focus on understanding.
Next up: explore Natural Language Processing to see how tokenization enables all the cool AI language tricks you see today.