
From Text to Token: How Tokenization Pipelines Work

By James Blackwood-Sewell on October 10, 2025

When you type a sentence into a search box, it’s easy to imagine the search engine seeing the same thing you do. In reality, search engines (or search databases) don’t store blobs of text, and they don’t store sentences. They don’t even store words in the way we think of them. They dismantle input text (both the text being indexed and the query itself), scrub it clean, and reassemble it into something slightly more abstract and far more useful: tokens. These tokens are what you search with, and what your inverted indexes store and search over.
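To make that concrete, here is a minimal sketch in plain Python (not any particular engine's storage format) of what "storing tokens in an inverted index" means: each token maps back to the documents that contain it. The naive lowercase-and-split stands in for the full pipeline described below.

```python
from collections import defaultdict

docs = {
    1: "The full-text database jumped over the lazy café dog",
    2: "The quick brown fox jumps over the lazy dog",
}

# Inverted index: token -> set of document IDs containing that token.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    # Stand-in tokenizer: real pipelines do much more, as described below.
    for token in text.lower().split():
        inverted_index[token].add(doc_id)

print(inverted_index["lazy"])   # {1, 2}
print(inverted_index["café"])   # {1} -- accents still matter at this point
```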

Let’s slow down and watch that pipeline in action, pausing at each stage to see how language is broken apart and remade, and how that affects results.

We’ll use a twist on "The quick brown fox jumps over the lazy dog" as our test case. It has everything that makes tokenization interesting: capitalization, punctuation, an accent, and words that change as they move through the pipeline. By the end, it’ll look different, but be perfectly prepared for search.

The full-text database jumped over the lazy café dog

This isn’t a complete pipeline, just a look at some of the common filters you’ll find in lexical search systems. Different databases and search engines expose many of these filters as composable building blocks that you can enable, disable, or reorder to suit your needs. The same general ideas apply whether you're using Lucene/Elasticsearch, Tantivy/ParadeDB, or Postgres full-text search.
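As a rough illustration of that composability (plain Python with made-up filter names, not any engine's actual API), you can think of a pipeline as an ordered list of token filters that you can enable, disable, or reorder:

```python
from typing import Callable, List

# A filter takes a token stream and returns a transformed token stream.
Filter = Callable[[List[str]], List[str]]

def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]

def strip_punctuation(tokens: List[str]) -> List[str]:
    return [t.strip(".,!?") for t in tokens]

def run_pipeline(text: str, filters: List[Filter]) -> List[str]:
    tokens = text.split()   # stand-in tokenizer
    for f in filters:       # each filter transforms the whole token stream
        tokens = f(tokens)
    return tokens

print(run_pipeline(
    "The full-text database jumped over the lazy café dog.",
    [lowercase, strip_punctuation],
))
```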

Filtering Text With Case and Character Folding

Before we even think about breaking our text down, we need to filter out anything that isn’t useful. This usually means auditing the characters that make up our text string: transforming all letters to lowercase and, if we know we might have them, folding any diacritics (like those in résumé, façade, or Noël) to their base letters.

This step ensures that characters are normalized and consistent before tokenization begins. Café becomes cafe, and résumé becomes resume, allowing searches to match regardless of accents. Lowercasing ensures that database matches Database, though it can introduce quirks, like matching Olive (the name) with olive (the snack). Most systems accept this trade-off: false positives are better than missed results. Code search is a notable exception, since it often needs to preserve symbols and respect casing like camelCase or PascalCase.
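Here is one way to sketch this folding step in Python, using Unicode NFD decomposition to drop combining marks after lowercasing. Production engines typically do this with ICU or an asciifolding-style token filter rather than hand-rolled code:

```python
import unicodedata

def fold(text: str) -> str:
    """Lowercase the text and strip diacritics (one common folding approach)."""
    text = text.lower()
    # Decompose accented characters (é -> e + combining acute accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold("Café"))     # cafe
print(fold("Résumé"))   # resume
print(fold("Noël"))     # noel
```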
