Finding Optimal Tokenizers

In this post, I will present an algorithm that was able to compute an optimal tokenizer in some settings. This result is cool because optimal tokenization is intractable in theory, yet it seems to be solvable in practice. The situation resembles the Traveling Salesman Problem (TSP), where even difficult instances can be solved to optimality using cutting-plane techniques.

I'll highlight that, while this result is cool, there are a few reasons why it isn't necessarily useful. First, the existing state of the art was already fairly close to optimal (often within 1%). Second, even if a tokenizer is optimal on the training data, it may not generalize as well as other tokenizers when evaluated on held-out test data. Finally, inefficient tokenizers are basically fine: you can make up for a less efficient tokenizer by slightly increasing your vocabulary size.

Despite the above caveats, I had a really fun time working on this project, and I hope others will be interested in pushing the frontier of this problem as well.

Background: Tokenizers

Frontier LLMs are typically trained on sequences of integers known as tokens. Each token refers to some sequence of bytes, and these byte sequences often correspond to common words. For example, in the GPT-5 tokenizer, the token 290 corresponds to the bytes “ the”, and 6602 corresponds to “ token”, so the text “ the token” can be encoded as the sequence [290, 6602].
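To make the token-to-bytes mapping concrete, here is a minimal sketch of decoding. The dictionary holds only the two entries quoted above; the names `vocab` and `decode` are mine and are not taken from any tokenizer library.

```python
# Toy vocabulary holding only the two entries mentioned in the text.
# A real vocabulary has many more entries; this dict is illustrative.
vocab = {290: b" the", 6602: b" token"}


def decode(tokens, vocab):
    """Concatenate the byte sequence that each token ID refers to."""
    return b"".join(vocab[t] for t in tokens)


decoded = decode([290, 6602], vocab)
print(decoded)
```

Decoding is just concatenation. The interesting direction is encoding, where many different token sequences can spell the same bytes.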

The mapping from tokens to bytes, known as the “vocabulary”, is fixed before the LLM is even trained. Typically, we try to find a vocabulary that compresses a slice of training data. In particular, we would like to pick a vocabulary of a fixed size that minimizes the number of tokens required to encode the data. The dominant technique for finding such a vocabulary is byte-pair encoding (BPE), a decades-old greedy compression algorithm.
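For readers who haven't seen BPE, here is a rough sketch of the greedy training loop: start from single bytes and repeatedly merge the most frequent adjacent pair. This is a simplified illustration rather than any production implementation; the function name `train_bpe`, the tie-breaking rule, and the lack of any pre-splitting of the text are my own assumptions.

```python
from collections import Counter


def train_bpe(data: bytes, num_merges: int):
    """Greedy BPE sketch: merge the most frequent adjacent pair, num_merges times."""
    seq = [bytes([b]) for b in data]  # start from single-byte tokens
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs (overlapping occurrences are counted too,
        # which is a simplification).
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        # Most frequent pair; ties are broken by the pair itself (an assumption).
        (left, right), _count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merged = left + right
        merges.append(merged)
        # Rewrite the sequence left to right, replacing each occurrence of the pair.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq


merges, encoded = train_bpe(b"abababab", num_merges=2)
print(merges, encoded)
```

Each merge is chosen by looking only at the current pair counts, which is what makes the procedure greedy: nothing guarantees that the resulting vocabulary minimizes the final token count.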

Tokenization as integer linear programming

In a recent paper, Tempus et al. connected tokenization to integer linear programming. The basic idea of their approach is to represent the entire dataset's tokenization as a set of integer variables.

In this formulation, there's a “color” variable for each possible vocabulary entry. In particular, we create one color variable for every unique substring of the dataset. A color variable is 1 if the corresponding byte sequence is in the vocabulary, or 0 otherwise. We add a single constraint to force the sum of color variables to equal the target vocabulary size.
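As a small illustration of the color variables, the sketch below enumerates every unique substring of a tiny dataset and checks the vocabulary-size constraint for one 0/1 assignment. The function names and the toy input are assumptions of mine, and a real dataset would need a far more economical representation than this quadratic enumeration.

```python
def color_variables(data: bytes):
    """Return the unique substrings of `data`, one per color variable, sorted."""
    n = len(data)
    return sorted({data[i:j] for i in range(n) for j in range(i + 1, n + 1)})


def satisfies_size_constraint(assignment, vocab_size):
    """The single constraint: the color variables must sum to the target size."""
    return sum(assignment.values()) == vocab_size


colors = color_variables(b"abab")

# One candidate assignment: pick "a", "b", and "ab"; every other color is 0.
assignment = {c: 0 for c in colors}
for chosen in (b"a", b"b", b"ab"):
    assignment[chosen] = 1

print(len(colors), satisfies_size_constraint(assignment, vocab_size=3))
```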

A color corresponds to some sequence of bytes, but a given sequence of bytes may occur many times throughout the dataset. For each occurrence of a color, there's a separate “edge” variable. The edges work together to encode an actual tokenization of the dataset. If an edge is 1, then the edge's corresponding token is used in this particular place. The objective of our linear program is to minimize the sum of all the edge variables, i.e. the number of tokens used to encode our dataset.
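To illustrate the edges and the objective, here is a brute-force sketch for tiny inputs. It lists one edge per occurrence of a vocabulary entry, finds the fewest edges that chain from the start of the data to the end, and then tries every vocabulary of a given size. This is not the integer programming approach itself, only a check of what it optimizes. I assume that single bytes are always available and that the size `k` counts only multi-byte entries; all function names are hypothetical.

```python
from itertools import combinations


def edges(data, vocab):
    """One edge (start, end) per occurrence of a vocabulary entry in `data`."""
    n = len(data)
    return [(i, j) for i in range(n) for j in range(i + 1, n + 1) if data[i:j] in vocab]


def min_tokens(data, vocab):
    """Fewest edges chaining from position 0 to len(data), via a forward DP."""
    n = len(data)
    best = [0] + [float("inf")] * n
    # Edges always point forward, so visiting them in order of start position
    # means best[i] is final before any edge leaving i is used.
    for i, j in sorted(edges(data, vocab)):
        best[j] = min(best[j], best[i] + 1)
    return best[n]


def optimal_vocab(data, k):
    """Try every set of k multi-byte entries (assumes at least k candidates exist)."""
    n = len(data)
    singles = {data[i:i + 1] for i in range(n)}  # assumption: always in the vocabulary
    candidates = sorted({data[i:j] for i in range(n) for j in range(i + 2, n + 1)})
    best = min(
        combinations(candidates, k),
        key=lambda extra: min_tokens(data, singles | set(extra)),
    )
    return set(best), min_tokens(data, singles | set(best))


print(min_tokens(b"abcabcab", {b"a", b"b", b"c", b"abc"}))
print(optimal_vocab(b"abcabcab", k=1))
```

The number of candidate vocabularies grows combinatorially, so this only works on toy inputs. On a single short string like this one, the best single entry is simply the whole string.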
