Tokenization for language modeling: BPE vs. Unigram Language Modeling (2020)
A human reading 'dest-igmat-ize' would have trouble understanding it until they recognized that the 'st' belongs in the next group. Indeed, hyphenation dictionaries for e-readers disallow hyphens that break syllables for exactly this reason.
Byte pair encoding (BPE)
The tokenizer used by GPT-2 (and most variants of BERT) is built using byte pair encoding (BPE). BERT itself uses some proprietary heuristics to learn its vocabulary, but it tokenizes with the same greedy algorithm as BPE. BPE comes from information theory: the objective is to maximally compress a dataset by replacing common substrings with tokens, morphology be damned.
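To make that compression objective concrete, here is a minimal sketch of BPE vocabulary learning: start from characters and repeatedly merge the most frequent adjacent pair of symbols into a new token. This is a toy illustration, not the GPT-2 implementation (it skips byte-level handling, end-of-word markers, and pre-tokenization), and the `learn_bpe` name and example corpus are just for demonstration.

```python
# Minimal BPE learning sketch: greedily merge the most frequent adjacent
# symbol pair until the merge budget is exhausted.
from collections import Counter

def learn_bpe(corpus, num_merges):
    # Represent each word as a tuple of symbols, starting from its characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new token
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges

# Example: learn_bpe("low lower lowest slow slowly".split(), num_merges=10)
```

Note that nothing in the objective cares where morpheme boundaries fall; a merge is kept purely because it saves symbols in the training data.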
Are we disadvantaging language models by partitioning words in ways that obscure the relationships between them? Perhaps this doesn't matter. Traditional NLP models used whole words as tokens, which by definition have to be learned independently, so we're better off than we used to be! BERT and GPT-2 are able to achieve astounding results with these 'misleading' tokenizations.
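If you want to see the splits for yourself, the Hugging Face transformers library exposes GPT-2's learned BPE vocabulary; the snippet below is a quick sketch for inspecting how it partitions a word (the word chosen is just the running example from above).

```python
# Inspect how GPT-2's learned BPE vocabulary partitions a word.
# Requires: pip install transformers
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# The leading space matters: GPT-2's byte-level BPE treats word-initial
# positions differently from word-internal ones.
print(tokenizer.tokenize(" destigmatize"))
```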