What is tokenization?
Tokenization is the process of breaking text into smaller pieces called tokens. These could be words, subwords, or even individual characters. It's the critical first step that lets AI systems understand language.
Think of it like breaking a sentence into words. "I am learning about AI" becomes ["I", "am", "learning", "about", "AI"]. The AI can't understand full text—it needs chunks.
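That word-by-word split can be sketched in one line of Python (a toy illustration, not a production tokenizer):

```python
# Toy word-level tokenization: split on whitespace.
sentence = "I am learning about AI"
tokens = sentence.split()
print(tokens)  # ['I', 'am', 'learning', 'about', 'AI']
```

Real tokenizers do far more (handling punctuation, casing, and unknown words), but the core idea is exactly this: text in, list of pieces out.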
Why tokenization matters
Language models like ChatGPT, Claude, and Gemini can't process raw text directly. They process numbers. Tokenization bridges the gap—converting text into a format machines can actually use.
It's foundational. Bad tokenization = misunderstood meaning. Good tokenization = better language understanding.
How tokenization works
- Input - Raw text arrives
- Breaking - Text gets split into tokens
- Encoding - Each token gets converted to a number (token ID)
- Processing - The model processes these numbers
- Output - Predictions or responses are generated
The model sees something like [501, 1524, 13829, 100] (illustrative IDs) where you see "I am learning AI."
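The breaking and encoding steps can be sketched with a tiny hand-built vocabulary (the IDs below are made up for illustration; real vocabularies hold tens of thousands of entries):

```python
# Toy pipeline: tokenize, then encode each token as a number (hypothetical vocabulary).
vocab = {"I": 501, "am": 1524, "learning": 13829, "AI": 100}
id_to_token = {i: t for t, i in vocab.items()}

text = "I am learning AI"
tokens = text.split()                   # breaking: text -> tokens
token_ids = [vocab[t] for t in tokens]  # encoding: tokens -> IDs
print(token_ids)                        # [501, 1524, 13829, 100]

# The reverse mapping turns the model's numeric output back into text.
decoded = " ".join(id_to_token[i] for i in token_ids)
print(decoded)                          # I am learning AI
```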
Types of tokenization
Word Tokenization
Split text by spaces and punctuation. "Hello world!" becomes ["Hello", "world"]. Simple and intuitive but inefficient—requires a huge vocabulary to cover all possible words.
Subword Tokenization
Breaking words into smaller pieces. "tokenization" becomes ["token", "ization"]. More efficient—handles new words by combining known pieces. Used in modern models like ChatGPT and BERT.
Character Tokenization
Breaking into individual letters. "Hello" becomes ["H", "e", "l", "l", "o"]. Extremely granular, handles any text but creates very long sequences. Usually too verbose for modern use.
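The word-level vs. character-level trade-off is easy to see in code: character tokenization covers any input, but sequence lengths balloon.

```python
# Character-level tokenization: maximal coverage, much longer sequences.
word = "Hello"
print(list(word))  # ['H', 'e', 'l', 'l', 'o']

sentence = "I am learning about AI"
print(len(sentence.split()))           # 5 word tokens
print(len(sentence.replace(" ", "")))  # 18 character tokens (spaces dropped)
```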
Byte Pair Encoding (BPE)
Starts with characters, learns common byte pairs, merges them. Builds an efficient vocabulary by identifying patterns. Popular in modern language models.
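One BPE training step can be sketched in plain Python: count every adjacent symbol pair across the corpus, then merge the most frequent pair into a single new symbol. The corpus and frequencies below are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words as character tuples, with occurrence counts.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
pair = most_frequent_pair(corpus)  # ('w', 'e') occurs 5 + 2 + 6 = 13 times
corpus = merge_pair(corpus, pair)  # "lower" is now ('l', 'o', 'we', 'r')
```

A real BPE trainer repeats this merge step thousands of times, and the learned merge list becomes the tokenizer's vocabulary.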
WordPiece
Similar to BPE but adds word boundaries. Used by BERT and other transformers. Balances efficiency with meaning preservation.
SentencePiece
Language-agnostic approach. Works across languages without needing space separation. Critical for multilingual models.
Why tokenization is tricky
Vocabulary Size
Too small = can't represent all words. Too large = inefficient processing. Finding the sweet spot is an art.
Language Differences
English relies on spaces. Chinese, Japanese don't. Different languages require different tokenization strategies.
Punctuation and Special Characters
Handling contractions ("don't"), hyphens, emojis—all add complexity.
Context Matters
"New York" could be two tokens or one (if treated as a special token). The right tokenization depends on the task.
Real examples
ChatGPT's Tokenization
Uses a modified byte-level BPE approach. The word "Hello" might be a single token, while "Hmm, let me think" becomes several: ["Hmm", ",", " let", " me", " think"]. Note the leading spaces: whitespace is folded into the tokens rather than discarded.
BERT
Uses WordPiece. "Unbelievable" might be split as ["Un", "##believ", "##able"]. The "##" marks a continuation piece, so the tokens concatenate back to the original word.
Multilingual Models
Google's mBERT handles 100+ languages with a single shared WordPiece vocabulary; SentencePiece fills the same role in multilingual models like XLM-R and mT5.
Why this matters for AI quality
Token limits
Language models have context windows, commonly anywhere from 4K to 128K or more tokens depending on the model. Long text gets cut off. Efficient tokenization fits more meaning into fewer tokens.
Translation accuracy
Poor tokenization = misunderstandings. Proper tokenization preserves meaning across languages.
Instruction following
Models follow instructions better when instructions are well-tokenized. A poorly tokenized input confuses the model.
Cost efficiency
Many APIs charge per token, so efficient tokenization means lower costs. A 1,000-word essay might be ~1,200 tokens with an efficient tokenizer and ~1,800 with an inefficient one.
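The cost difference is simple arithmetic. A back-of-envelope sketch, using a hypothetical per-token price (check your provider's actual pricing):

```python
# Back-of-envelope API cost for a 1,000-word essay (hypothetical rate).
price_per_1k_tokens = 0.002  # dollars per 1,000 tokens; assumed, not a real quote

efficient_tokens = 1200    # ~1.2 tokens per word
inefficient_tokens = 1800  # ~1.8 tokens per word

cost_efficient = efficient_tokens / 1000 * price_per_1k_tokens
cost_inefficient = inefficient_tokens / 1000 * price_per_1k_tokens
print(f"${cost_efficient:.4f} vs ${cost_inefficient:.4f}")  # $0.0024 vs $0.0036
```

Fractions of a cent per essay, but at millions of requests the 50% overhead compounds into real money.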
Your tokenization questions, answered
What is tokenization in NLP?
Breaking text into smaller units (tokens) that language models can process. Essential preprocessing step.
Why can't models just read text directly?
Models work with numbers. They need a mapping from text to numerical representations. Tokenization creates that bridge.
What's the difference between tokenization and vectorization?
Tokenization splits text. Vectorization converts tokens to numerical vectors. They're related but different steps.
How many tokens is "Hello world"?
Depends on the tokenizer. Word tokenization gives 2 (one per word). Most subword tokenizers also produce 2 (e.g., "Hello" and " world"), but rare words or unusual spacing can split into more.
Can I use the same tokenizer for all languages?
Not optimal. Languages have different structures. Multilingual tokenizers like SentencePiece work across many languages but may not be optimal for specific ones.
Why do models have token limits?
Computational constraints. Longer sequences require more memory and processing power. Context windows balance capability with cost.
Does better tokenization always mean better performance?
Usually yes, but it depends on the task. Task-specific tokenization often outperforms general approaches.
The future of tokenization
As models evolve, tokenization gets smarter. Researchers experiment with:
- Semantic tokenization - Tokens represent meaning, not just surface patterns
- Dynamic tokenization - Different tokenization for different inputs
- Multilingual optimization - Better handling of code-switching and mixed languages
- Efficient encoding - Using fewer tokens to represent more meaning
The goal: make tokenization transparent and efficient so models can focus on understanding.
Next up: explore Natural Language Processing to see how tokenization enables all the cool AI language tricks you see today.