
The Bitter Lesson is coming for Tokenization


a world of LLMs without tokenization is desirable and increasingly possible

Published on 24/06/2025 • ⏱️ 29 min read

In this post, we highlight the desire to replace tokenization with a general method that better leverages compute and data. We'll examine tokenization's role and its fragility, and build a case for removing it. After understanding the design space, we'll explore the potential impact of a promising recent candidate, the Byte Latent Transformer, and build strong intuitions around its new core mechanics.

As has been pointed out countless times, if the trend of ML research could be summarised in one idea, it would be adherence to The Bitter Lesson: prefer general-purpose methods that leverage large amounts of compute and data over methods hand-crafted by domain experts. Ilya Sutskever put it more succinctly: "the models, they just want to learn". In recent years, model capability has continued to be blessed with the influx of talent, hardware upgrades, architectural advances and the initial ubiquity of data needed to make this a reality.

the pervasive tokenization problem

However, one of the documented bottlenecks in the text transformer world that has received less optimisation effort is the very mechanism that shapes its world view - tokenization.

If you're not aware, one of the popular text tokenization methods for transformers, Byte-Pair Encoding (BPE), is a learned procedure that extracts an effectively compressed vocabulary (of a desired size) from a dataset by iteratively merging the most frequent pairs of existing tokens.
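To make the merge procedure concrete, here is a minimal sketch of BPE training on a toy corpus. The function name, the word-frequency representation and the simplified merge loop are illustrative choices of mine, not taken from the post or from any particular tokenizer library:

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules by repeatedly fusing the most frequent adjacent token pair."""
    # Represent each word as a tuple of single-character tokens, weighted by frequency.
    words = Counter(tuple(word) for word in corpus)
    merges: list[tuple[str, str]] = []

    for _ in range(num_merges):
        # Count every adjacent token pair across the weighted corpus.
        pair_counts: Counter = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break

        # Greedily merge the single most frequent pair into one new token.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        new_words: Counter = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words

    return merges

# Frequent substrings like "lo" and "low" emerge as single tokens after a few merges.
print(train_bpe(["low", "low", "lower", "lowest", "newest"], num_merges=4))
```

Production tokenizers add byte-level fallbacks, pre-tokenization rules and special tokens on top of this, but the core learned procedure is this greedy pair-merging loop.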

It's worth remembering that this form of tokenization is not a strict requirement of the transformer. In practice, it means we're able to represent more bytes given a fixed number of entries in the transformer's embedding table. In our earlier definition, "effectively" is doing some heavy lifting. Ideally, the token vocabulary is constructed for the task at hand such that it strikes the optimal trade-off: enough byte compression to reduce the transformer's FLOPs while retaining a representation granular enough to achieve the lowest possible loss. Another ideal attribute is that the tokens that do get merged end up being well-modelled during training.
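To get a feel for why that compression matters, here's a rough back-of-the-envelope sketch; the sequence length, bytes-per-token ratio and model width below are illustrative numbers of my own, not figures from the post:

```python
def attention_flops(seq_len: int, d_model: int) -> int:
    # Rough per-layer cost of self-attention: the QK^T and attention-times-V matmuls,
    # each about seq_len^2 * d_model multiply-adds (projections ignored for simplicity).
    return 2 * seq_len ** 2 * d_model

text_bytes = 4096        # length of the input in raw bytes
bytes_per_token = 4.0    # a typical-ish BPE compression ratio (illustrative assumption)
d_model = 1024           # model width (illustrative assumption)

byte_seq_len = text_bytes
token_seq_len = int(text_bytes / bytes_per_token)

ratio = attention_flops(byte_seq_len, d_model) / attention_flops(token_seq_len, d_model)
print(f"{ratio:.0f}x more attention FLOPs at byte level")  # ~16x, since cost scales with seq_len^2
```

This quadratic blow-up, plus the longer sequences fed through every layer, is the main cost any tokenizer-free approach has to claw back.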
