🧠 All Things AI
Advanced

Tokenization & Vocabulary Design

A language model never sees text. It sees integers — indices into a fixed vocabulary of subword units. Tokenization is the algorithm that maps raw Unicode strings to those integers before the model is trained or queried, and back again when reading its output. Because every model capability ultimately operates on tokens, the tokenization algorithm is not a mundane pre-processing step. It is an architectural choice that shapes sequence lengths, multilingual fairness, arithmetic ability, and the cost of every inference call.

Why Subwords, Not Words or Characters?

Two extremes exist. Character-level tokenization gives a tiny vocabulary (256 bytes or ~26 letters) but produces very long sequences, and the model must learn to compose characters into words entirely from the training signal — an enormous burden. Word-level tokenization gives compact sequences but requires a vocabulary large enough to cover every surface form; rare words become out-of-vocabulary tokens and the model has no way to handle morphological variation ("run", "runs", "running" are all different entries).

Subword tokenization sits between the two extremes. Frequent words get their own token; rare or compound words are split into pieces that the model has seen more often. "unhappiness" might become ["un", "happi", "ness"]. The model learns to assemble meaning from parts, generalising to unseen forms, while keeping sequence lengths manageable. This approach has dominated transformer pretraining since GPT-2 (2019).

Byte Pair Encoding (BPE)

BPE was originally a data compression algorithm (Gage 1994) that iteratively replaces the most frequent byte pair in a corpus with a single new symbol. Neural machine translation researchers adapted it for subword segmentation (Sennrich et al., 2016), and OpenAI's GPT-2 popularised a byte-level variant that became the basis for the GPT family, LLaMA, Mistral, and most modern decoder-only models.

The algorithm runs at vocabulary construction time, not inference time:

  1. Initialise: Start with a vocabulary of individual characters (or individual bytes in byte-level BPE). Every word in the training corpus is represented as a sequence of these base symbols, with a special end-of-word marker if needed.
  2. Count pairs: Scan the full training corpus and count how often each adjacent pair of symbols co-occurs.
  3. Merge the top pair: The most frequent pair (e.g. "e" + "s" → "es") is merged into a single new symbol and added to the vocabulary. All occurrences in the corpus are updated.
  4. Repeat: Steps 2–3 are repeated until the target vocabulary size is reached. The merge rules are stored in order.

At inference time the tokenizer applies the learned merge rules deterministically — the same text always produces the same token sequence. Byte-level BPE (used by GPT-2 and its descendants) starts from the 256 raw bytes of UTF-8, guaranteeing that any Unicode input can be tokenized without an unknown-token fallback.
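Steps 1–4 above can be sketched as a toy Python trainer over a word-frequency dictionary. The four-word corpus is the classic illustration from Sennrich et al.; real implementations work on bytes, handle end-of-word markers, and cache pair counts incrementally rather than rescanning.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a word-frequency corpus.

    `corpus` maps whitespace-split words to their counts; each word is
    represented internally as a tuple of single-character symbols.
    """
    vocab = {tuple(word): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Step 3: merge the most frequent pair into one new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        vocab = merged
    # Step 4: the ordered merge list is what the tokenizer replays at inference.
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(train_bpe(corpus, 3))  # first merge is ("e", "s"): it appears 9 times
```

The stored merge order is the whole model: applying the rules in sequence to new text reproduces the training-time segmentation deterministically.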

WordPiece

WordPiece (Schuster & Nakajima, 2012; refined for BERT by Devlin et al., 2018) follows the same general structure as BPE but uses a different merge criterion. Instead of picking the most frequent pair, WordPiece picks the pair that maximises the likelihood of the training data under a unigram language model of subword units. Concretely, it selects the pair that maximises:

score(A, B) = freq(AB) / (freq(A) × freq(B))

This normalises by how often A and B appear independently — preferring pairs whose joint frequency exceeds what chance would predict. The effect is a vocabulary that is slightly more semantically coherent than BPE: common words stay whole because their pair frequency relative to component frequency is high, while rare strings are aggressively split.
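The effect of the normalisation is easy to see with hypothetical counts (invented for illustration, not measured from a real corpus): a rare-but-predictive pair like "q"+"u" outranks a frequent-but-expected pair like "t"+"h".

```python
def wordpiece_score(freq_pair, freq_a, freq_b):
    """Pair score used by WordPiece: joint frequency normalised by
    the independent frequencies of each part."""
    return freq_pair / (freq_a * freq_b)

# "qu" occurs almost every time "q" does, so it scores far higher than
# the much more frequent but unsurprising "th". (Illustrative counts.)
score_qu = wordpiece_score(freq_pair=95, freq_a=100, freq_b=5_000)
score_th = wordpiece_score(freq_pair=40_000, freq_a=300_000, freq_b=400_000)
assert score_qu > score_th
```

A pure frequency criterion (plain BPE) would merge "th" long before "qu"; the likelihood-based score reverses that ordering.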

WordPiece also marks continuation pieces with a "##" prefix (e.g. ["happi", "##ness"]) to distinguish a piece that starts a word from one that continues it. BERT, DistilBERT, and Electra all use WordPiece.

SentencePiece

SentencePiece (Kudo & Richardson 2018) is a library, not just an algorithm. It implements both BPE and a unigram language model variant directly on raw Unicode text — treating the input as a sequence of Unicode codepoints rather than pre-tokenized words. This makes it language-agnostic: it does not require whitespace to be a meaningful word boundary, which matters enormously for languages like Japanese, Chinese, or Thai that do not delimit words with spaces.

SentencePiece Strengths

  • Processes raw text — no language-specific pre-tokenization
  • Whitespace is treated as a normal character (represented as ▁)
  • Reproducible: same model file gives identical tokens on any platform
  • Supports both BPE and unigram LM algorithms under one API
  • Used by T5, mT5, LLaMA, Gemma, and many multilingual models
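The ▁ whitespace convention above can be illustrated with plain string operations. This is a simplification: real SentencePiece also applies Unicode normalisation and handles edge cases like leading whitespace, which this sketch ignores.

```python
def sp_encode_ws(text):
    """Mark word boundaries SentencePiece-style: whitespace becomes the
    visible symbol ▁, so it survives as an ordinary vocabulary character."""
    return "▁" + text.replace(" ", "▁")

def sp_decode_ws(marked):
    """Invert the marking to recover the original string."""
    text = marked.replace("▁", " ")
    return text[1:] if text.startswith(" ") else text

s = "Hello tokenized world"
assert sp_decode_ws(sp_encode_ws(s)) == s  # lossless round trip
```

Because whitespace is just another symbol, detokenization is a pure string concatenation — no language-specific rules about where spaces belong.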

Unigram LM Variant

The unigram LM algorithm starts with a large candidate vocabulary and iteratively prunes the tokens whose removal decreases corpus likelihood least, until the vocabulary shrinks to the target size. Unlike BPE (which is greedy-constructive), this is greedy-destructive. At inference time it uses the Viterbi algorithm to find the most likely segmentation under the unigram model, so it can produce different splits than BPE at the same vocabulary size.
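The inference-time step can be sketched as a standard Viterbi dynamic program over character positions. The piece log-probabilities below are toy values chosen by hand, not learned; single-character fallbacks guarantee every input is coverable.

```python
import math

def viterbi_segment(text, logprobs):
    """Most likely segmentation of `text` under a unigram model of pieces.

    `logprobs` maps each vocabulary piece to its log probability; the best
    segmentation maximises the sum of piece log probabilities.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i] = best score over text[:i]
    back = [0] * (n + 1)            # back[i] = start index of the last piece
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in logprobs and best[j] + logprobs[piece] > best[i]:
                best[i] = best[j] + logprobs[piece]
                back[i] = j
    # Recover the pieces by walking the backpointers.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Toy vocabulary: multi-character pieces plus single-character fallbacks.
vocab = {"un": -2.0, "happi": -3.0, "ness": -2.5}
vocab.update({c: -4.0 for c in "unhappiness"})
print(viterbi_segment("unhappiness", vocab))  # ["un", "happi", "ness"]
```

The three-piece split scores −7.5, while falling back to eleven single characters scores −44, so Viterbi picks the coarser segmentation.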

Vocabulary Size Tradeoffs

Vocabulary size is a key hyperparameter with cascading effects. Modern LLMs typically use 32,000–128,000 tokens.

How the choice plays out, comparing a smaller vocabulary (e.g. 32K) with a larger one (e.g. 128K+):

  • Sequence length: 32K gives longer sequences (more tokens per sentence); 128K+ gives shorter ones (more meaning per token)
  • Embedding table size: 32K keeps the embedding and unembedding layers small; at 128K+ the embedding matrix can dominate small models
  • Rare word coverage: 32K is poor — rare terms fall back to many small pieces; 128K+ is better — technical terms and proper nouns get their own tokens
  • Training cost: 32K costs more per token of content (longer sequences); 128K+ costs less per semantic unit (denser sequences)
  • Multilingual performance: with 32K, non-English languages fragment heavily; 128K+ can allocate dedicated capacity to multiple scripts

GPT-2 used 50,257 tokens. GPT-4 uses ~100,000. LLaMA 1/2 used 32,000; LLaMA 3 expanded to 128,000 specifically to improve multilingual and code performance. Gemma 2 uses 256,128 — unusually large — trading a bigger embedding table for dense coverage of many scripts and rare strings while keeping common English text efficient.
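The embedding-table tradeoff above is easy to quantify: the table holds vocab_size × d_model parameters, doubled when input and output embeddings are untied. The d_model of 4096 below is an illustrative value, not taken from any specific model.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters consumed by the embedding (and, if untied, unembedding)
    matrices: one row of d_model floats per vocabulary entry."""
    table = vocab_size * d_model
    return table if tied else 2 * table

# Illustrative hidden size; exact figures vary by architecture.
small = embedding_params(32_000, 4096)    # 131,072,000 parameters
large = embedding_params(128_000, 4096)   # 524,288,000 parameters
```

At this hidden size, the 128K table alone costs roughly half a billion parameters — a significant fraction of a small model's entire budget, and a rounding error for a very large one. That is why vocabulary growth tracks model scale.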

How Tokenization Affects Model Capability

Tokenization is not neutral. The way numbers, code, and non-English text are segmented has measurable effects on model behaviour.

Arithmetic Failures

GPT-2/3 tokenizers split numbers at arbitrary boundaries — "127" might be one token while "128" is two tokens ["12", "8"]. This creates inconsistent positional representations for digits. Models must learn arithmetic across tokenization noise. Larger vocabularies that tokenize numbers as individual digits (consistent boundary) dramatically improve multi-step arithmetic.
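One mitigation is a pre-tokenization pass that forces every digit into its own piece, so numbers always split at the same boundaries regardless of their value. A minimal sketch with a regular expression:

```python
import re

def split_digits(text):
    """Pre-tokenize so every digit becomes its own piece, giving numbers
    consistent token boundaries; non-digit runs are kept whole for the
    downstream subword tokenizer to handle."""
    return re.findall(r"\d|\D+", text)

print(split_digits("127 + 128 = 255"))
# ['1', '2', '7', ' + ', '1', '2', '8', ' = ', '2', '5', '5']
```

With per-digit pieces, "127" and "128" get structurally identical representations that differ only in the final token — exactly the alignment a model needs to learn column-wise arithmetic.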

Code Tokenization

Source code contains long identifier names, indentation (often 2–4 spaces per level), and repeated structural patterns. Models trained with vocabularies that include common code keywords as single tokens (e.g. "def", "return", "class") and that encode indentation compactly use context windows more efficiently. Code-focused models like DeepSeek Coder and StarCoder tune their tokenizers for this.

Multilingual Bias

If a tokenizer is trained predominantly on English text, non-English languages receive fewer vocabulary slots. A sentence in Turkish or Telugu may require 5–10× more tokens than the same content in English. This inflates inference cost, reduces effective context length, and disadvantages these languages in perplexity and generation quality — even if the model is otherwise capable.

Fertility and Token Efficiency

Fertility is the ratio of tokens to words (or characters) for a given piece of text under a given tokenizer. It is the primary measure of how efficiently a tokenizer represents a language.

A fertility of 1.0 means each word is exactly one token. English text with a well-matched tokenizer typically achieves fertility around 1.3–1.5 (some words split, punctuation adds tokens). The same tokenizer applied to Arabic, Korean, or Hindi often shows fertility of 4–8 — meaning a 128K-token context window effectively becomes a 16K–32K context window for users of those languages.
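Measuring fertility needs nothing more than a token count divided by a word count. The sketch below uses a stand-in tokenizer (any callable from string to token list works — in practice you would pass a real tokenizer's encode function):

```python
def fertility(tokenize, text):
    """Tokens per whitespace-delimited word for `text` under `tokenize`.
    Lower is better; 1.0 means exactly one token per word."""
    return len(tokenize(text)) / len(text.split())

# Hypothetical stand-in tokenizer: fixed two-character chunks.
def char_bigrams(s):
    return [s[i:i + 2] for i in range(0, len(s), 2)]

print(fertility(char_bigrams, "tokenizers shape model behaviour"))  # 4.0
```

To audit a candidate model, run this over parallel text (the same content translated into each target language) and compare the ratios; a large gap directly predicts the context-length and cost penalty described above.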

This is not a niche concern. As LLMs are deployed globally, English-centric tokenizer design creates structural price and quality discrimination against non-English users. Organisations building multilingual products should benchmark tokenizer fertility across their target languages before selecting a base model, since different models at the same parameter count can have wildly different effective capacities for non-Latin scripts.

Special Tokens

Every tokenizer vocabulary includes a set of special tokens that serve structural roles during training and inference. They are never generated from normal text; they are inserted programmatically.

  • BOS (<s>, <|begin_of_text|>): beginning of sequence. Prepended to every input; gives the model a clean initial state.
  • EOS (</s>, <|end_of_text|>, <|eot_id|>): end of sequence. The model learns to generate it when a response is complete; inference stops here.
  • PAD (<pad>): pads batched sequences to the same length; attention masks exclude pad positions from the loss.
  • SEP ([SEP]): separator between two segments (e.g. question and passage in BERT); signals a context boundary.
  • MASK ([MASK]): MLM placeholder that replaces the tokens the model must predict during BERT-style training.
  • CLS ([CLS]): classification token. BERT prepends it; its final hidden state serves as a sequence-level representation for classification tasks.
  • UNK (<unk>): fallback for inputs a character- or word-level vocabulary cannot represent. Byte-level BPE eliminates the need for it.

Modern chat-tuned models add further special tokens for conversation structure: role markers like <|im_start|> and <|im_end|> (OpenAI's ChatML format), or LLaMA 3's <|start_header_id|>/<|end_header_id|> markers that delimit system, user, and assistant turns. These are trained into the model's understanding of conversation format during instruction tuning.
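How role markers frame a conversation can be sketched with ChatML-style string templating. This is purely illustrative: a real chat tokenizer maps each marker to a single special token ID rather than spelling it out character by character.

```python
def to_chatml(messages):
    """Render a list of {"role", "content"} dicts in ChatML-style framing.
    The trailing open assistant header cues the model to generate its turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
             for m in messages]
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Define token fertility."},
])
print(prompt)
```

Because the markers are special tokens the model only ever sees in this structural role, user text that happens to contain the literal string "<|im_start|>" tokenizes to ordinary pieces and cannot forge a turn boundary — provided the tokenizer is configured not to parse special tokens from untrusted input.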

Checklist: Do You Understand This?

  • Walk through the BPE merge algorithm step by step for a tiny corpus of three words. What is the first merge likely to be?
  • WordPiece and BPE both build subword vocabularies bottom-up from characters. What is the difference in how they choose which pair to merge?
  • Why does SentencePiece not need whitespace-based pre-tokenization, and why does that matter for non-Latin scripts?
  • A tokenizer has vocabulary size 32K. What are two concrete downsides compared to a 128K vocabulary, and one advantage?
  • Why does tokenizing numbers as multi-character strings (e.g. "127" as one token) hurt arithmetic reasoning, and how do modern models mitigate this?
  • What is token fertility, and how would you measure whether a given tokenizer disadvantages a non-English language?
  • What is the difference in the role of [MASK] during training vs [PAD] — and which would appear at inference time?