Slightly better named character reference tokenization than Chrome, Safari, and Firefox
2025-06-26

Note: I am not a 'browser engine' person, nor a 'data structures' person. I'm certain that an even better implementation than what I came up with is very possible.

A while back, for no real reason, I tried writing an implementation of a data structure tailored to the specific use case of the Named character reference state of HTML tokenization (here's the link to that experiment). Recently, I took that implementation, ported it to C++, and used it to make some efficiency gains and fix some spec compliance issues in the Ladybird browser.

Throughout this, I never actually looked at the implementations used in any of the major browser engines (no reason for this, just me being dumb). However, now that I have looked at Blink/WebKit/Gecko (Chrome/Safari/Firefox, respectively), I've realized that my implementation seems to be either on-par or better across the metrics that the browser engines care about:

Efficiency (at least as fast, if not slightly faster)
Compactness of the data (uses ~60% of the data size of Chrome's/Firefox's implementation)
Ease of use

Note: I'm singling out these metrics because the python script that generates the data structures used for named character reference tokenization in Blink (the browser engine of Chrome/Chromium) contains this docstring (emphasis mine):

"""This python script creates the raw data that is our entity database. The representation is one string database containing all strings we could need, and then a mapping from offset+length -> entity data. That is compact, easy to use and efficient."""

So, I thought I'd take you through what I came up with and how it compares to the implementations in the major browser engines. Mostly, though, I just think the data structure I used is neat and want to tell you about it (fair warning: it's not novel).

What is a named character reference?🔗

A named character reference is an HTML entity specified using an ampersand (&) followed by an ASCII alphanumeric name. A predefined set of names will get transformed during HTML parsing into particular code point(s). For example, &bigcirc; is a valid named character reference that gets transformed into the symbol ◯, while &amp; will get transformed into &.

Note: The ◯ symbol is the Unicode code point U+25EF, which means it could also be specified as a numeric character reference using either &#x25EF; or &#9711;. We're only focusing on named character references, though.

Here are a few properties of named character references that are relevant for what we'll ultimately be aiming to implement:

Always start with &
Only contain characters in the ASCII range
Case-sensitive
Usually, but not always, end with ;
Are transformed into either 1 or 2 code points

Irrelevant side note: those code point(s) usually make up one grapheme (i.e. most second code points are combining code points), but not always (e.g. &fjlig; maps to U+0066 U+006A, which are just the ASCII letters fj)

Most crucially, though, the mappings of named character references are fixed. The HTML standard contains this note about named character references:

Note: This list is static and will not be expanded or changed in the future.

This means that it's now safe to represent the data in the minimum amount of bits possible without any fear of needing to accommodate more named character reference mappings in the future.

Note: This is a big part of why I think better solutions than mine should be very possible.
I feel like I've only scratched the surface in terms of a purpose-built data structure for this particular task.

Named character reference tokenization overview🔗

I'm specifically going to be talking about the Named character reference state of HTML tokenization. You can read the spec for it if you want (it's fairly short) but there's really only one sentence we're interested in:

Consume the maximum number of characters possible, where the consumed characters are one of the identifiers in the first column of the named character references table.

This sentence is really the crux of the implementation, but the spec is quite hand-wavy here and conceals a lot of complexity. The example given a bit later in the spec hints at why the above sentence is not so straightforward (edited to exclude irrelevant details; for context, &not is a valid named character reference, as is &notin;):

Example: If the markup contains the string I'm &notit; I tell you, the character reference is parsed as "not", as in, I'm ¬it; I tell you. But if the markup was I'm &notin; I tell you, the character reference would be parsed as "notin;", resulting in I'm ∉ I tell you.

That is, with the string &notit;, the characters up to and including &noti can still lead to a valid named character reference (&notin;, among others), so we only know that &notit; is invalid once we've reached &notit, which can no longer lead to a valid named character reference.

What this means is that there needs to be some amount of backtracking involved, as the goal is to consume only the characters that are part of the longest valid named character reference. That's not the only complication, though...

The spectre of document.write🔗

Due to document.write, the characters of a named character reference can arrive in the input stream in pieces, like so:

<script>document.write("&not");</script>in;

The expected result after parsing is ∉, meaning the resolved character reference is &notin;. Let's go through why that is the case.

Note: I'm not super familiar with the tree construction side of HTML parsing, so don't expect my explanation below to be fully accurate from that point of view. My explanation is solely focused on the tokenizer's side of things.

After the closing script tag is tokenized, the parser adds an "insertion point" after it, and then the script within the tag itself is executed. So, right before the script is executed, the tokenizer input can be visualized as (where ▾ is the insertion point):

<script>document.write("&not");</script>▾in;

And then after the tag, &not comes next in the input stream (which was inserted by document.write), and then the characters after &not are in;, so ultimately a tokenizer going character-by-character should see an unbroken &notin;, recognize that as a valid character reference, and translate it to ∉.

Note: The ordering here matters. For example, if you put &not first and use document.write to add a trailing in; like so:

&not<script>document.write("in;");</script>

then it does not result in ∉ (the result of &notin;), but instead ¬in;, since from the tokenizer's point of view it sees &not<, and therefore treats &not as a character reference since the < cannot lead to any other valid character references.

Then there's the possibility of a script that emits the character reference across multiple document.write calls, for example writing &, n, o, and t one-at-a-time before the trailing in;. This, too, is expected to result in ∉ (i.e. &notin;). Keep in mind that the tokenizer will advance after each document.write call, so if the tokenizer tries to lookahead past the insertion point at any point before the full script is run, it will resolve the wrong string as a character reference (&in;—which is itself a valid reference for ∈—&nin;, or &noin;).

Here's a visualization of the insertion point and the various states of the input stream after each document.write call:

&▾in;  →  &n▾in;  →  &no▾in;  →  &not▾in;

Therefore, while HTML tokenizers can theoretically look ahead, they can never look past the end of an insertion point.
What this all means, implementation-wise🔗

All of this is to say that a "consume the longest valid named character reference" implementation probably needs to use one of two strategies:

Lookahead (but never beyond an insertion point) until we're certain that we have enough characters to rule out a longer named character reference. If we do not yet have enough characters to be certain, backtrack and try again until we can be certain.
Never lookahead, and instead match character-by-character until we're certain it's no longer possible to match a longer valid named character reference. Backtrack to the end of the longest full match found.

The second strategy seems like the better approach to me, so that's what my implementation will be focused on. We'll see both strategies later on, though.

Note: I'm not sure if it's feasible for an implementation using the first (lookahead) strategy to correctly handle the case where document.write emits a named character reference a few characters at a time, like the earlier example that writes &, n, o, and t before a trailing in; (this is expected to be parsed into ∉, the same as the other examples above).

Trie implementation🔗

So, we want an implementation that can iterate character-by-character and (at any point) efficiently determine if it's possible for the next character to lead to a longer valid named character reference. A data structure that seems pretty good for this sort of thing is a trie. A trie is a specialized tree where each node contains a character that can come after the character of its parent node in a set of words. Below is a representation of a trie containing this small subset of named character references:

&not → ¬
&notinva; → ∉
&notinvb; → ⋷
&notinvc; → ⋶
&notniva; → ∌
&notnivb; → ⋾
&notnivc; → ⋽

Note the lack of a semicolon at the end of &not. This is a real variant, and it was chosen over &not; to simplify the example.

[Trie diagram: n → o → t, then branching into i → n → v → {a, b, c} → ; and n → i → v → {a, b, c} → ;]

Notes:
The & is excluded from the trie since it's the first character of every named character reference.
The nodes with a red outline are those marked as the end of a valid word in the set.
Typically, trie visualizations put the letters on the connections between nodes rather than on the nodes themselves. I'm putting them on the nodes themselves for reasons that will be discussed later.

With such a trie, you search for the next character within the list of the current node's children (starting from the root). If the character is found within the children, you then set that child node as the current node and continue on for the character after that, etc. For invalid words, this means that you naturally stop searching as soon as possible (after the first character that cannot lead to a longer match). For valid words, you trace a path through the trie and end up on an end-of-word node (and you may also pass end-of-word nodes on the way there). Here's what the path through the trie would look like for the named character reference &notinvc;:

[The same trie diagram, with the path n → o → t → i → n → v → c → ; highlighted and the end node labeled with ⋶]

You'll notice that the mapped code point (⋶) is present on the diagram above as well. This is because it is trivial to use a trie as a map to look up an associated value, since each word in the set must end at a distinct node in the trie (e.g. no two words can share an end-of-word node). Conveniently, using the trie as a map is exactly what we want to be able to do for named character references, since ultimately we need to convert the longest matched named character reference into the relevant code point(s).

A brief detour: Representing a trie in memory🔗

Note: The code examples in this section will be using Zig syntax.
One way to represent a trie node is to use an array of optional pointers for its children (where each index into the array represents a child node with that byte value as its character), like so: const Node = struct { children : [ 256 ] ?* Node , end_of_word : bool , }; Earlier, I said that trie visualizations typically put the letters on the connections between nodes rather than the nodes themselves, and, with this way of representing the trie, I think that makes a lot of sense, since the connections are the information being stored on each node. Note: For the examples in this section, we'll use a trie that only contains the words GG , GL , and HF . So, this representation can be visualized like so: G H G L F With this, checking if a character can come after the current node is a straightforward O(1) array access: if ( node . children [ c ] != null ) { } but it comes at the cost of a lot of potentially wasted space, since most nodes will have many null children. One way to mitigate the wasted space would be to switch from an array of children to a linked list of children, where the parent stores an optional pointer to its first child, and each child stores an optional pointer to its next sibling: const Node = struct { char : u8 , first_child : ?* Node , next_sibling : ?* Node , end_of_word : bool , }; Now that char is stored on each node directly, I (in turn) think it makes sense to visualize the trie with the characters shown on the nodes themselves, like so: G H G L F While this 'linked list' representation saves on memory, it transforms the search for a particular child into a O(n) linear scan across the children: var node = starting_node . first_child orelse return null ; while ( true ) { if ( node . char == c ) { } node = node . next_sibling orelse return null ; } This linear search can be slow, especially if the nodes are individually heap-allocated and therefore could be very spread out in memory, leading to a lot of random memory accesses and cache misses. Additionally, pointers themselves take up quite a bit of space (8 bytes on 64-bit machines). If we ultimately want to decrease the size of the node, getting rid of the pointer fields would be helpful as well. We can solve multiple of these problems at once by: Enforcing that all nodes are proximate in memory by storing them all in one array Replace all pointers with indexes into that array Ensure that children are always contiguous (i.e. to access a sibling you just increment the index by 1) With this approach, Node could look like this: const Node = packed struct { char : u8 , first_child_index : u3 , last_sibling : bool , end_of_word : bool , }; And the array of nodes for this particular example trie would look like this: const nodes = [ 6 ] Node { .{ . first_child_index = 1 , . char = 0 , . last_sibling = true , . end_of_word = false }, .{ . first_child_index = 3 , . char = 'G' , . last_sibling = false , . end_of_word = false }, .{ . first_child_index = 5 , . char = 'H' , . last_sibling = true , . end_of_word = false }, .{ . first_child_index = 0 , . char = 'G' , . last_sibling = false , . end_of_word = true }, .{ . first_child_index = 0 , . char = 'L' , . last_sibling = true , . end_of_word = true }, .{ . first_child_index = 0 , . char = 'F' , . last_sibling = true , . end_of_word = true }, }; Note: first_child_index having the value 0 doubles as a 'no children' indicator, since index 0 is always the root node and therefore can never be a valid child node. 
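To make the 'flattened' representation concrete, here's a sketch of what searching a child list and checking a whole word could look like, reusing the Node and nodes definitions above. findChild and contains are illustrative helpers of my own naming, not code from the linked experiment:

```zig
// Sketch: search the contiguous child list of `parent` for the character `c`.
fn findChild(nodes: []const Node, parent: Node, c: u8) ?Node {
    if (parent.first_child_index == 0) return null; // no children
    var i: usize = parent.first_child_index;
    while (true) : (i += 1) {
        if (nodes[i].char == c) return nodes[i];
        if (nodes[i].last_sibling) return null; // end of the child list
    }
}

// Sketch: walk the trie one character at a time.
fn contains(nodes: []const Node, word: []const u8) bool {
    var node = nodes[0]; // index 0 is always the root node
    for (word) |c| {
        node = findChild(nodes, node, c) orelse return false;
    }
    return node.end_of_word;
}
```

For example, contains(&nodes, "GL") walks root → 'G' → 'L' and returns true, while contains(&nodes, "GX") scans the 'G' node's child list without finding 'X' and returns false.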
This representation can be visualized like so: 0 1 2 3 4 5 G H G L F You might find it interesting to note that this diagram is functionally the same as the previous ('children as a linked list') one; the connections are exactly the same, the nodes have just been rearranged. This still means that searching a child list uses a O(n) linear scan, but this representation makes those searches much more friendly to the CPU, and greatly reduces the size of each node. Some hard numbers🔗 To get an idea of how the different representations compare, here's a breakdown for a trie containing the full set of named character references (2,231 words). The code I'm using for the benchmarks below is available here. Data size🔗 Note: The sizes below assume a 64-bit architecture, i.e. pointers are 8 bytes wide. Representation 1 ('connections'): Each node contains a fixed-size array of optional child node pointers Each node is 2056 bytes wide (using [256]?*Node as the children field) There are 9,854 nodes in the trie, so 2,056 * 9,854 = 20,259,824 bytes total for the full trie ( 19.32 MiB ) ('connections'): Representation 2 ('linked list'): Each node contains a pointer to its first child and its next sibling Each node is 24 bytes wide There are 9,854 nodes in the trie, so 24 * 9,854 = 236,496 bytes total for the full trie ( 230.95 KiB ) ('linked list'): Representation 3 ('flattened'): Each node contains the index of its first child, and all nodes are allocated in one contiguous array Each node is 4 bytes wide There are 9,854 nodes in the trie, so 4 * 9,854 = 39,416 bytes total for the full trie ( 38.49 KiB ) ('flattened'): That is, the 'flattened' version is 1/ 514 the size of the 'connections' version, and 1/ 6 the size of the 'linked list' version. Note: I went with [256]?*Node to keep the representations similar in what information they are capable of storing. Instead, you could make the trie only support ASCII characters and use [128]?*Node for the children field, which would roughly cut the size of the trie in half (from 19.32 MiB to 9.70 MiB). In the 'linked list' and 'flattened' representations, restricting the characters to the ASCII set would only decrease the size of the nodes by 1 bit ( char field would go from a u8 to a u7 ). As mentioned, the 'linked list' and 'flattened' versions sacrifice the O(1) lookup of the 'connections' version in favor of reducing the data size, so while the 'flattened' version claws some performance back from the 'linked list' version, the 'connections' version is the fastest: Representation 1 ('connections'): 501.596ms ( 50ns per contains call) ('connections'): Representation 2 ('linked list'): 965.138ms ( 96ns per contains call) ('linked list'): Representation 3 ('flattened'): 609.215ms ( 60ns per contains call) ('flattened'): One interesting thing to note is that the above results for representations 1 & 2 rely on a friendly allocation pattern for the nodes (i.e. 
the memory addresses of the nodes happening to end up fairly close to eachother) This is admittedly pretty likely when constructing a trie all at once, but, if we intentionally force a horrendous allocation pattern, where each allocated node gets put on an entirely separate page, we can see the effects very clearly: Representation 1 ('connections'): 1.025s ( 102ns per contains call) (each contains call takes ~2x longer than it did) ('connections'): Representation 2 ('linked list'): 4.372s ( 437ns per contains call) (each contains call takes ~4x longer than it did) ('linked list'): Representation 3 ('flattened'): No difference since it always allocates one contiguous chunk of memory ('flattened'): If we run the relevant benchmarks through poop , we can confirm the cause of the slowdown (these results are from the 'connections' version): Benchmark 1 : ./trie-friendly-allocations measurement mean ± σ min … max outliers delta wall_time 496 ± 24.1 440 … 518 cpu_cycles 2.01 ± 100 1.78 … 2.11 instructions 1.94 ± 26.3 1.94 … 1.94 cache_references 107 ± 3.40 98.8 … 109 2 (18%) cache_misses 33.1 ± 1.15 30.3 … 34.0 2 (18%) branch_misses 21.7 ± 20.7 21.7 … 21.7 Benchmark 2 : ./trie-horrendous-allocations measurement mean ± σ min … max outliers delta wall_time 1.07 ± 21.9 1.05 … 1.11 💩+116.4% ± 5.5% cpu_cycles 4.17 ± 92.4 4.09 … 4.33 💩+106.9% ± 5.6% instructions 1.92 ± 38.4 1.92 … 1.92 cache_references 145 ± 1.63 144 … 147 💩+ 35.3% ± 3.3% cache_misses 62.2 ± 839 61.8 … 63.7 💩+ 88.3% ± 3.7% branch_misses 21.5 ± 51.6 21.4 … 21.5 Note that the instruction counts are roughly the same between the 'friendly' and 'horrendous' versions, so the increased cpu_cycles and wall_time can presumably be attributed to the increase in cache misses and pointer chasing (the trie code itself is identical between the two versions). When using a DAFSA for this particular task, we're trading off a ~20% difference in lookup speed for 2-3 orders of magnitude difference in data size. This seems pretty okay, especially for what we're ultimately interested in implementing: a fully static and unchanging data structure, so there's no need to worry about how easy it is to modify after construction. For completeness, I'll also note that it's possible to eliminate pointers while still using the 'connections' representation. For example, it could be done by allocating all nodes into an array, and then making the children field something like [256]u16 where the values are indexes into the array of nodes ( u16 because it needs to be able to store an index to one of the 9,854 nodes in the trie, but it's not ?u16 because the index 0 can double as null since the root can never be a child) This would keep the O(1) complexity to find a particular child and decrease the data size, but it would still be 2 orders of magnitude larger than the 'flattened' version (using [256]u16 would make the trie take up 4.81 MiB, [128]u16 would be 2.41 MiB). An important note moving forward🔗 In the next section, I will show diagrams that look like this in an effort to make them easier to understand: G H G L F but keep in mind that really the 'flattened' representation (representation 3) is being used, i.e. the most accurate visualization of the representation would look like this: 012345 G H G L F DAFSA implementation🔗 A while back at the same Zig meetup where I gave a talk about my Windows resource compiler, Niles Salter aka Validark gave a talk titled Better data structures and where to find them. 
It was nominally about his novel autocomplete data structure, but the stated purpose of the talk was to get people interested in learning about data structures and potentially inventing their own. Unfortunately, the recording/audio quality didn't end up being good enough to warrant uploading the talk anywhere (see the recording of my talk to get a sense of that). During the talk, I thought back to when I contributed to an HTML parser implementation and had to leave proper named character reference tokenization as a TODO because I wasn't sure how to approach it. I can't remember if a deterministic acyclic finite state automaton (DAFSA) was directly mentioned in the talk, or if I heard about it from talking with Niles afterwards, or if I learned of it while looking into trie variations later on (since the talk was about a novel trie variation), but, in any case, after learning about the DAFSA, it sounded like a pretty good tool for the job of named character references, so I revisited named character reference tokenization with that tool in hand. In other words, the talk (at least partially) served its purpose for me in particular. I didn't come up with anything novel, but it got me to look into data structures more and I have Niles to thank for that. What is a DAFSA?🔗 Note: There are a few names for a DAFSA: DAWG, MA-FSA, etc. A DAFSA is essentially the 'flattened' representation of a trie, but, more importantly, certain types of redundant nodes are eliminated during its construction (the particulars of this aren't too relevant here so I'll skip them; see here if you're interested). Going back to the same subset of named character references as the example in the "Trie implementation" section above, a DAFSA would represent that set of words like so: n o t i n n i v a b c ; As you can see, the v , a , b , c and ; nodes are now shared between all the words that use them. This takes the number of nodes down to 13 in this example (compared to 22 for the trie). The downside of this node consolidation is that we lose the ability to associate a given end-of-word node with a particular value. In this DAFSA example, all words except not end on the exact same node, so how can we know where to look for the associated value(s) for those words? Here's an illustration of the problem when matching the word ⋶ : n o t i n n i v a b c ; To get around this downside, it'd be possible to use something like a separate hash map or devise some minimal perfect hashing scheme to lookup the associated code point(s) for a matching word after the fact, but, luckily, we don't have to worry too much about that because it turns out it is possible to do... Minimal perfect hashing using a DAFSA🔗 First detailed in Applications of finite automata representing large vocabularies (Cláudio L. Lucchesi, Tomasz Kowaltowski, 1993) (pdf), the technique for minimal perfect hashing using a DAFSA is actually rather simple/elegant: Within each node, store a count of all possible valid words from that node. 
For the example we've been using, those counts would look like this:

[The same DAFSA diagram as above, with each node annotated with its count: 7, 7, 7, 3, 3, 3, 3, 3, 1, 1, 1, 1]

Then, to get a unique index for a given word, traverse the DAFSA as normal, but:

For any non-matching node that is iterated when searching a list of children, add their number to the unique index
For nodes that match the current character, if the node is a valid end-of-word, add 1 to the unique index

Note that the "non-matching node that is iterated when searching a list of children" part of this algorithm effectively relies on the DAFSA using the O(n) search of the 'flattened' representation of a trie discussed earlier. For example, if we had a DAFSA with a, b, c, and d as possible first characters (in that order), and the word we're looking for starts with c, then we'll iterate over a and b when looking for c in the list of children, so we add the numbers of the a and b nodes (whatever they happen to be) to the unique index. Here's an illustration:

[Diagram: a list of sibling nodes a, b, c, d, where a and b are iterated over before c is found]

Note: If the a or b nodes were marked as valid end-of-words, we ignore that bit of information—we only add 1 to the unique index if a matching node (in this case c) is marked as an end-of-word.

For a given word in the set, applying this algorithm will produce a number between 1 and the total number of words in the DAFSA (inclusive), and it's guaranteed that each word will end up with a unique number (i.e. this is a minimal perfect hash). Here's what that looks like for the example we've been using:

[Animated illustration: traversing the DAFSA for a chosen word and accumulating its unique index step-by-step]

After you have the unique index of a word, you can then use a lookup array for the associated values and index into it using unique_index - 1.

Note: It's not relevant here, but I think it's pretty cool that it's also possible to reconstruct the associated word if you have its unique index. You basically do the same thing as when you are calculating a unique index, but instead of adding to the unique index you're subtracting from it as you traverse the DAFSA. A tidbit that I accidentally ran into is that this 'reverse lookup' only works if you include the end-of-word nodes themselves in their own 'counts of possible words from that node'.

Trie vs DAFSA for named character references🔗

Ok, so now that we have two different data structures that seem pretty well suited for named character reference matching—a trie and a DAFSA—how do they compare? It's now (finally) time to start using the full set of named character references and all of their mapped code point(s) to see how they stack up. Some numbers to keep in mind upfront:

There are 2,231 named character references total
A trie will use 9,854 nodes to encode the set of named character references
A DAFSA will use 3,872 nodes to encode the set of named character references

Another brief detour: representing the mapped value(s)🔗

As mentioned earlier, each named character reference is mapped to either one or two code points. Unicode code points have a range of 0x0 - 0x10FFFF, but if we actually look at the set of code points used by named character references, there are a few properties worth noting:

The maximum value of the first code point is U+1D56B, which takes 17 bits to encode, so all first code point values can fit into a 17 bit wide unsigned integer.
The set of distinct second code point values is actually very small, with only 8 unique code points.
This means that an enum that's only 4 bits wide (3 bits for the 8 different values, 1 additional bit to encode 'no second code point') can be used to store all the information about the second code point. With both of these properties taken together, it's possible to encode the mapped code point values for any given named character reference in 21 bits. With padding between the elements of an array of 21-bit integers, though, that will round up to 4 bytes per element (11 bits of padding), so it ends up being the same as if 32 bit integers were used. Here's a diagram of a possible memory layout of an array using this representation, where is the bits of the first code point, is the bits of the second code point, and is the padding bits between elements: byte index element index 0 1 2 3 4 5 6 7 8 9 10 11 0 1 2 However, while using 21 bits to represent the mapped code point(s) does not automatically lead to any saved bytes over a 32 bit integer, it opens up the possibility to tightly pack an array of 21-bit elements in order to actually save some bytes. Yet, doing so means that storing/loading elements from the tightly packed array becomes trickier (both computationally and implementation-wise). Here's the same diagram as before, but with the elements tightly packed (no padding bits between elements): byte index element index 0 1 2 3 4 5 6 7 8 9 10 11 0 1 2 3 4 You'll notice that no elements past the first start or end on byte boundaries, meaning in order to load an element, a fair bit of bitwise operations are required (bit shifting, etc). This makes array accesses more expensive, but that isn't necessarily a big deal for our use case, since we only ever access the array of values once per named character reference, and only after we're certain we have a match. So, tightly bitpacking the value array is a viable way to save some extra bytes for our purposes. Note: This is just context for the next section where I'll mention data sizes for versions that use the "regular array" representation or the "tightly bitpacked array" representation for the values. Data size🔗 For the DAFSA, the size calculation is pretty straightforward: The data of each node can fit into 4 bytes with a few bits to spare (expand below if you're interested in the details), and there are 3,872 nodes in the DAFSA, so that's 15,488 bytes total Nitty-gritty DAFSA node size details Ultimately, the goal is to keep the node size less than or equal to 32 bits while storing the following data on each node: An ASCII character This can technically be represented in 6 bits, since the actual alphabet of characters used in the list of named character references only includes 61 unique characters ('1'...'8', ';', 'a'...'z', 'A'...'Z'). However, to do so you'd need to convert between the 6 bit representation and the actual ASCII value of each character to do comparisons. We aren't desperate to save bits, though, so we can get away with representing this value as 8 bits, which makes comparisons with any byte value trivial. A "count of all possible valid words from that node" Empirically, the highest value within our particular DAFSA for this field is 168 , which can fit into 8 bits. An "end of word" flag 1 bit A "last sibling" flag 1 bit An "index of first child" field There are 3,872 nodes in our DAFSA, so all child indexes can fit into 12 bits. In total, that's 8 + 8 + 1 + 1 + 12 = 30 bits, so we have 2 bits to spare while remaining within the 32 bit target size with this representation. 
There are 2,231 named character references, and the mapped code points for each of them need either 4 bytes (if using a regular array) or 21 bits (if using a tightly bitpacked array) For the regular array, that's 8,924 bytes total For the tightly bitpacked array, that's 5,857 bytes total So, the DAFSA and the lookup array for the values (together) will use either 24,412 bytes (23.84 KiB) or 21,345 bytes (20.84 KiB) total. For the trie, there's slightly more to discuss around data representation before we can get to the data size calculations. It was glossed over in the Trie implementation section, but when using the 'flattened' representation of a trie there are effectively two ways to handle value lookups for each word: Store an array of 2,231 values (one for each named character reference) and then also store an index into that array on each end-of-word node. This increases the bit size needed for each node by 12 bits (since 2,231 can be encoded in 12 bits) Store an array of 9,854 values (one for each node in the trie) and then index into the value array by re-using the index of the end-of-word node. This makes the size of the value array 4.42 times larger, but does not affect the node size Beyond that, the details aren't super relevant. Suffice it to say that each node will either take up 5 bytes or 3 bytes depending on which of the two 'value array' strategies you choose (note: I actually mean 5 bytes and 3 bytes, as they can be represented as one array of 2- or 4-byte values and one array of 1-byte values so padding between elements doesn't factor in). The summary is that, depending on the particular representation, the trie will use between 57,993 bytes and 68,777 bytes (56.63 KiB to 67.16 KiB) total, or, if the values array is tightly bitpacked, between 54,926 bytes and 55,227 bytes (53.64 KiB to 53.93 KiB) total. Ultimately, the data size of the trie is going to be at least 2x larger than the equivalent DAFSA. Luckily, there's an existing HTML parser implementation written in Zig called rem that uses a trie for its named character reference tokenization, so getting some relevant benchmark results from actual HTML parsing for a trie vs DAFSA comparison was pretty easy. Note: The trie in rem uses the 'flattened' representation of a trie, with a value array that re-uses the index of the end-of-word node (so there are a lot of empty slots in the value array). 
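Before getting to the benchmark numbers, here's a rough sketch of what the DAFSA node layout and the unique-index calculation from the earlier sections could look like in Zig. The names (DafsaNode, uniqueIndex) and the exact field layout are illustrative—this is an approximation of the idea, not the exact code used in the linked experiment or the Ladybird pull request:

```zig
// A possible 32-bit layout for a DAFSA node (30 bits used, 2 spare).
const DafsaNode = packed struct(u32) {
    char: u8, // the character stored on this node
    number: u8, // count of all possible valid words from this node
    end_of_word: bool,
    last_sibling: bool,
    first_child_index: u12, // 0 means "no children"
    _spare: u2 = 0,
};

// Sketch of the minimal perfect hash lookup: walk the DAFSA one character at
// a time, accumulating a unique index along the way. Returns a number between
// 1 and the total word count if `word` is in the set, or null otherwise.
fn uniqueIndex(dafsa: []const DafsaNode, word: []const u8) ?u16 {
    var index: u16 = 0;
    var child_index: usize = dafsa[0].first_child_index; // node 0 is the root
    var ends_word = false;
    for (word) |c| {
        if (child_index == 0) return null; // no children left to search
        ends_word = false;
        while (true) {
            const node = dafsa[child_index];
            if (node.char == c) {
                // A matching end-of-word node contributes 1.
                if (node.end_of_word) {
                    index += 1;
                    ends_word = true;
                }
                child_index = node.first_child_index;
                break;
            }
            // Every non-matching sibling iterated past contributes its count.
            index += node.number;
            if (node.last_sibling) return null; // no child matches `c`
            child_index += 1;
        }
    }
    return if (ends_word) index else null;
}
```

A real matcher for the tokenizer also needs to remember the unique index at the most recent end-of-word node it passed so that it can backtrack to the longest full match, but the core of the traversal is the same. The additions while scanning past non-matching siblings are the "extra work to build up the unique index" that shows up in the instruction counts below.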
From my benchmarking, it turns out that the DAFSA implementation uses more instructions than the trie implementation because it needs to do extra work to build up the unique index during iteration, but the DAFSA saves on cache misses (presumably due to the smaller overall size of the DAFSA and its node re-use) and everything just about evens out in terms of wall clock time: Benchmark 1 : ./trie measurement mean ± σ min … max outliers delta wall_time 44.4 ± 1.57 43.0 … 61.8 peak_rss 61.8 ± 62.5 61.6 … 61.9 cpu_cycles 49.5 ± 463 48.5 … 51.9 instructions 76.2 ± 2.95 76.2 … 76.2 cache_references 2.54 ± 21.6 2.48 … 2.63 cache_misses 119 ± 1.64 115 … 128 branch_misses 322 ± 1.02 319 … 328 Benchmark 2 : ./dafsa measurement mean ± σ min … max outliers delta wall_time 44.3 ± 561 43.8 … 48.4 peak_rss 61.6 ± 66.6 61.5 … 61.7 cpu_cycles 53.0 ± 566 52.3 … 54.9 💩+ 7.0% ± 0.1% instructions 78.7 ± 2.59 78.7 … 78.7 💩+ 3.2% ± 0.0% cache_references 2.49 ± 30.0 2.43 … 2.60 ⚡- 2.0% ± 0.1% cache_misses 90.9 ± 1.32 86.4 … 95.4 ⚡- 23.5% ± 0.2% branch_misses 331 ± 730 330 … 337 💩+ 2.9% ± 0.0% Note: Bitpacking the value array of the trie doesn't seem to affect the overall performance, and we see the same sort of pattern: more instructions (due to the added bit shifting during value lookup), but fewer cache misses. Benchmark results Benchmark 1: ./trie measurement mean ± σ min … max outliers delta wall_time 43.3 ± 249 43.1 … 45.9 peak_rss 61.8 ± 63.5 61.6 … 61.9 cpu_cycles 49.4 ± 159 49.1 … 50.3 instructions 76.2 ± 2.61 76.2 … 76.2 cache_references 2.52 ± 13.6 2.48 … 2.57 cache_misses 118 ± 933 116 … 122 branch_misses 321 ± 548 319 … 325 Benchmark 2: ./trie-bitpacked-values measurement mean ± σ min … max outliers delta wall_time 43.3 ± 504 43.0 … 46.2 peak_rss 61.6 ± 64.0 61.5 … 61.7 cpu_cycles 48.9 ± 180 48.6 … 50.0 instructions 77.4 ± 2.56 77.4 … 77.4 💩+ 1.6% ± 0.0% cache_references 2.53 ± 23.7 2.48 … 2.68 cache_misses 104 ± 1.18 101 … 107 ⚡- 12.5% ± 0.2% branch_misses 335 ± 1.05 333 … 344 💩+ 4.3% ± 0.0% For the use-case of named character references, using a DAFSA instead of a trie cuts the size of the data at least in half while performing about the same. The Ladybird implementation🔗 First, let's take a look at what the Ladybird implementation looked like before my changes: state implementation, matching implementation. Here's a rough summary of the approach, in pseudo-code: var match = null ; var remaining_input = input . substring ( current_offset , input . length - current_offset ); for ( var entity : entities ) { if ( remaining_input . starts_with ( entity )) { if ( match == null or entity . length > match . length ) { match = entity ; } } } if ( match != null ) { consume_and_advance ( match . length - 1 ); } This has two major problems: It is inefficient, since the input string is being compared against the entire list of named character references one-at-a-time It does not handle document.write correctly, as previously discussed in The spectre of document.write . It's doing lookahead, but it does not account for insertion points, as it makes the mistake of looking past insertion points. So, if document.write is used to write one-character-at-a-time, it will attempt to resolve the named character reference before all the characters are available (e.g. in the case of in; , it will erroneously try matching against ∈ and then exit the named character reference state) My pull request focused on fixing both of those problems. 
The data structure I used is exactly the DAFSA implementation as described so far, with a value array that is not bitpacked, because: As we've seen, it doesn't affect performance, so complicating the implementation didn't seem worth it Using an enum for the second code point would either mean using an extra bit (since enums are signed integers in C++) or using some workaround to keep it using the minimal number of bits The last piece of the puzzle that I haven't mentioned yet is the NamedCharacterReferenceMatcher , which handles DAFSA traversal while providing an API well-tailored to the named character reference state, specifically. The details aren't too important, so here are the relevant bits of the exposed API: bool try_consume_code_point ( u32 c ); u8 overconsumed_code_points (); Optional < NamedCharacterReferenceCodepoints > code_points (); So, with all that context, here's what the Ladybird implementation looks like after my changes (slightly simplified for clarity; here's the full implementation): BEGIN_STATE ( NamedCharacterReference ) { if ( matcher . try_consume_code_point ( current_code_point )) { temporary_buffer . append ( current_code_point ); continue ; } else { DONT_CONSUME_CHARACTER ; } auto overconsumed_code_points = matcher . overconsumed_code_points (); if ( overconsumed_code_points > 0 ) { backtrack_to ( current_offset - overconsumed_code_points ); temporary_buffer . shrink_by ( overconsumed_code_points ); } auto mapped_code_points = matcher . code_points (); if ( mapped_code_points ) { } else { FLUSH_CODEPOINTS_CONSUMED_AS_A_CHARACTER_REFERENCE ; SWITCH_TO ( AmbiguousAmpersand ); } } Note also that Ladybird follows 'spec-driven development', meaning that the goal is for its code to be implemented to match the text of the relevant specification as closely as possible. Here's what the named character reference state specification looks like for reference: 13.2.5.73 Named character reference state Consume the maximum number of characters possible, where the consumed characters are one of the identifiers in the first column of the named character references table. Append each character to the temporary buffer when it's consumed. If there is a match ... Otherwise Flush code points consumed as a character reference. Switch to the ambiguous ampersand state. Overall, these changes made the Ladybird tokenizer: About 1.23x faster on an arbitrary set of HTML files from real websites (albeit an old set of files) on an arbitrary set of HTML files from real websites (albeit an old set of files) About 8x faster on a very named-character-reference-specific benchmark (tokenizing an HTML file with nothing but tens of thousands of valid and invalid named character references) on a very named-character-reference-specific benchmark (tokenizing an HTML file with nothing but tens of thousands of valid and invalid named character references) Roughly 95 KiB smaller (very crude estimate, solely judged by the difference in the final binary size) (very crude estimate, solely judged by the difference in the final binary size) Handle document.write emitting one-character-at-a-time correctly But that's all pretty low-hanging fruit, as the previous Ladybird implementation had some obvious problems. In fact, we can actually improve on this some more (and will later on), but I think it's worth looking at the Firefox/Chrome/Safari implementations now to see how this DAFSA version stacks up against them. 
Comparison to the major browser engines🔗 Before we get to the actual comparisons, there's (unfortunately) a lot that has to be discussed. First, you'll notice from the Ladybird benchmarks above that an 8x improvement in a very named-character-reference-specific benchmark only led to a 1.23x improvement in the average case. This points to the fact that named character reference matching is not something that HTML tokenizers typically do very often, and that named character reference matching being fast enough is likely just fine, all things considered. Note: I realize that this makes this whole endeavor rather pointless, but hopefully it's still interesting enough without the possibility of making a huge performance impact. Second, instead of going the route of putting my DAFSA implementation into the other browsers' engines to compare, I went with taking the other browsers' implementations and putting them into Ladybird. Not only that, though, I also made the Firefox/Chrome/Safari implementations conform to the API of NamedCharacterReferenceMatcher (for reasons that will be discussed soon). So, in order for my benchmarking to be accurate you'll have to trust that: I faithfully integrated the Firefox/Chrome/Safari implementations into Ladybird The performance characteristics exhibited would hold when going the other direction (putting my implementation into their tokenizer) The benchmarks I'm using can actually give useful/meaningful results in the first place For the first point, the only real assurance I can give you is that the same number of web platform tests within the html/syntax/parsing category were passing with each browser's implementation integrated. The second point will be discussed more later on. For the third point, we have to go on yet another detour... On the difficulty of benchmarking🔗 My initial benchmarking setup was straightforward: Have a separate branch of the codebase for each implementation Compile separate benchmark binaries for each branch Run each version and compare the results However, this approach ultimately left me with some inexplicable results. Here's an example of such results, where the benchmark that exclusively tests the relevant parts of the code shows the Blink (Chrome) implementation being slightly faster than mine: Benchmark 1 : ./TestHTMLTokenizerBlink --bench named_character_references measurement mean ± σ min … max outliers delta wall_time 118 ± 1.68 115 … 121 Benchmark 2 : ./TestHTMLTokenizerDafsa --bench named_character_references measurement mean ± σ min … max outliers delta wall_time 120 ± 1.50 117 … 123 💩+ 2.1% ± 0.4% Yet, when I ran a benchmark that only occasionally exercises the code that I changed (a benchmark using a sample of real HTML files), I got unexpected results. What we should expect is either a very slight difference in the same direction, or (more realistically) no discernible difference, as this effect size should not be noticeable in the average case. Instead, I got the opposite result and a larger effect: Benchmark 1 : ./TestHTMLTokenizerBlink --bench benchfiles measurement mean ± σ min … max outliers delta wall_time 1.85 ± 28.4 1.79 … 1.88 2 (17%) Benchmark 2 : ./TestHTMLTokenizerDafsa --bench benchfiles measurement mean ± σ min … max outliers delta wall_time 1.79 ± 20.7 1.76 … 1.84 ⚡- 3.2% ± 1.1% Taken together, the only explanation for these sorts of results would be that the parts of the code that I didn't change got faster in one version, and not the other. 
The elephant demands attention🔗 Well, this explanation—that the code I didn't change got faster—is actually likely to be the correct one, and I've known about this possibility ever since I watched the excellent talk "Performance Matters" by Emery Berger. I recommend watching the talk, but the short explanation is that changes to one part of the codebase may inadvertently cause the compiler to reorganize unrelated parts of the compiled binary in ways that affect performance ('layout'). However, while I was aware of this possible confounder, up until now I have basically ignored it whenever doing benchmarking (and I'm assuming that's true for most programmers). This is the first time I've come face-to-face with the problem and had to reckon with its effects, and I think a large part of that is because I'm dealing with a much larger codebase than I typically would when benchmarking, so there's a lot of room for inadvertent layout changes to cascade and cause noticeable performance differences. Note: Reducing the amount of code that's compiled in order to benchmark the tokenizer is not as easy as one might think, as the HTML tokenizer has a lot of interconnected dependencies (HTML parser, JavaScript library, countless other things), so when benchmarking the HTML tokenizer I'm pulling in all of Ladybird's LibWeb which is rather large. Elephant mitigation🔗 Unfortunately, the solution presented in the talk (Stabilizer, which allows you to constantly randomize a binary's layout during runtime to control for layout-based performance differences) has bitrotted and only works with an ancient version of LLVM. So, instead, I thought I'd try a different benchmarking setup to counteract the problem: Only compile one binary, and have it contain the code for all the different named character reference matching implementations Choose the named character reference matching implementation to use at runtime This introduces some dynamic dispatch overhead into the mix which may muddy the results slightly, but, in theory, this should eliminate the effects of layout differences, as whatever the binary layout happens to be, all implementations will share it. In practice, this did indeed work, but introduced another unexplained anomaly that we'll deal with afterwards. After moving all the implementations into the same binary, I got these results for the 'average case' benchmark: Note: Remember that we're expecting this benchmark to show no meaningful difference across the board, as it mostly tests code that hasn't been changed (i.e. named character reference matching has little effect on this benchmark, as mentioned earlier). 
Benchmark 1 : ./BenchHTMLTokenizerFiles dafsa measurement mean ± σ min … max outliers delta wall_time 1.78 ± 29.8 1.75 … 1.87 peak_rss 88.7 ± 59.2 88.7 … 88.8 cpu_cycles 6.71 ± 127 6.62 … 7.10 instructions 16.1 ± 99.2 16.1 … 16.1 cache_references 324 ± 2.87 319 … 331 2 (17%) cache_misses 10.2 ± 66.4 10.1 … 10.3 branch_misses 8.69 ± 2.99 7.82 … 18.2 Benchmark 3 : ./BenchHTMLTokenizerFiles blink measurement mean ± σ min … max outliers delta wall_time 1.79 ± 24.8 1.73 … 1.82 peak_rss 89.1 ± 234 89.0 … 89.8 cpu_cycles 6.70 ± 57.7 6.62 … 6.79 instructions 16.1 ± 128 16.1 … 16.1 cache_references 325 ± 1.90 321 … 328 cache_misses 10.3 ± 54.3 10.2 … 10.4 branch_misses 7.86 ± 24.6 7.79 … 7.89 Benchmark 4 : ./BenchHTMLTokenizerFiles gecko measurement mean ± σ min … max outliers delta wall_time 1.72 ± 9.79 1.70 … 1.74 ⚡- 3.2% ± 1.1% peak_rss 88.8 ± 83.8 88.7 … 89.0 cpu_cycles 6.69 ± 41.1 6.63 … 6.76 instructions 16.1 ± 72.3 16.1 … 16.1 cache_references 323 ± 1.53 320 … 325 cache_misses 10.2 ± 35.8 10.1 … 10.2 branch_misses 7.79 ± 49.1 7.76 … 7.95 2 (17%) The remaining Gecko (Firefox) wall_time difference is consistently reproducible, but not readily explainable using the other metrics measured, as there is no significant difference in CPU cycles, instructions, cache usage, etc. After attempting some profiling and trying to use strace to understand the difference, my guess is that this comes down to coincidental allocation patterns being friendlier when choosing the Gecko version. Note: There is no heap allocation in any of the named character reference matching implementations, but the size of each NamedCharacterReferenceMatcher subclass is different (the Gecko version is 32 bytes while the others are either 16 or 48 bytes), and an instance of the chosen subclass is heap allocated at the start of the program. 
If we use strace -e %memory -c , the dafsa and blink versions consistently use more brk / mmap / munmap syscalls (especially brk ): Dafsa % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 47.18 0.012463 113 110 brk 31.90 0.008427 443 19 munmap 18.25 0.004822 9 508 mmap 2.66 0.000703 5 134 mprotect ------ ----------- ----------- --------- --------- ---------------- 100.00 0.026415 34 771 total Blink % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 55.95 0.018094 138 131 brk 28.81 0.009318 490 19 munmap 12.93 0.004181 8 508 mmap 2.32 0.000749 5 134 mprotect ------ ----------- ----------- --------- --------- ---------------- 100.00 0.032342 40 792 total The Gecko version, even though it uses roughly the same amount of memory overall, consistently has fewer of these syscalls: Gecko % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 37.07 0.006560 385 17 munmap 31.73 0.005615 75 74 brk 26.50 0.004689 9 506 mmap 4.70 0.000831 6 134 mprotect ------ ----------- ----------- --------- --------- ---------------- 100.00 0.017695 24 731 total I don't know enough about the glibc / libstdc++ allocator implementation(s) to know why this would be the case, and the magnitude of the difference reported by strace doesn't seem large enough to explain the results, but I'm at least a little bit confident that this is the cause, since, after inserting padding to each NamedCharacterReferenceMatcher subclass to ensure they are all the same size, the wall_time difference went away: Benchmark 1 : ./BenchHTMLTokenizerFiles dafsa measurement mean ± σ min … max outliers delta wall_time 1.81 ± 29.8 1.73 … 1.85 peak_rss 89.7 ± 203 89.5 … 90.3 cpu_cycles 6.74 ± 58.0 6.68 … 6.87 instructions 16.1 ± 142 16.1 … 16.1 cache_references 322 ± 2.45 316 … 325 cache_misses 10.2 ± 25.1 10.1 … 10.2 branch_misses 7.84 ± 26.1 7.77 … 7.87 Benchmark 2 : ./BenchHTMLTokenizerFiles blink measurement mean ± σ min … max outliers delta wall_time 1.80 ± 23.0 1.74 … 1.83 peak_rss 89.6 ± 205 89.2 … 90.1 2 (17%) cpu_cycles 6.74 ± 37.0 6.68 … 6.82 instructions 16.1 ± 194 16.1 … 16.1 cache_references 321 ± 1.93 317 … 324 cache_misses 10.3 ± 47.5 10.2 … 10.3 branch_misses 7.87 ± 23.3 7.82 … 7.91 Benchmark 3 : ./BenchHTMLTokenizerFiles gecko measurement mean ± σ min … max outliers delta wall_time 1.80 ± 29.0 1.71 … 1.82 peak_rss 89.7 ± 265 89.5 … 90.5 cpu_cycles 6.73 ± 43.3 6.65 … 6.78 instructions 16.1 ± 156 16.1 … 16.1 2 (17%) cache_references 321 ± 3.29 316 … 329 cache_misses 10.2 ± 52.9 10.1 … 10.2 2 (17%) branch_misses 7.83 ± 22.2 7.76 … 7.86 2 (17%) No meaningful differences across the board, which is what we expect. What this (tentatively) means is that heap allocation is another potential confounder, and that something as inconsequential as a single allocation being a different size (in our case the NamedCharacterReferenceMatcher instance) may have knock-on effects that last for the rest of the program (or I'm wrong about the cause and this is a red herring). Note: I also wrote a version of the same benchmark using a (very simple and janky) bump allocator and that, too, got rid of the difference (even when the NamedCharacterReferenceMatcher implementations have different sizes). 
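The note above mentions a (very simple and janky) bump allocator; details and results follow below, but in the abstract the idea is just the following sketch (written in Zig to match the other code in this post—the actual benchmark harness is C++):

```zig
// Sketch of the idea, not the actual benchmark code: allocate one big chunk
// up front and satisfy every subsequent allocation by handing out the next
// slice of it, so every run makes the same memory syscalls.
const BumpAllocator = struct {
    buffer: []u8,
    end: usize = 0,

    // `alignment` must be a power of two. Returns null when the buffer is
    // exhausted instead of asking the OS for more memory.
    fn alloc(self: *BumpAllocator, len: usize, alignment: usize) ?[]u8 {
        const start = (self.end + alignment - 1) & ~(alignment - 1);
        if (start + len > self.buffer.len) return null;
        self.end = start + len;
        return self.buffer[start..self.end];
    }
};
```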
Bump allocation details/results Bump allocation in this case means that a single large chunk of memory is allocated upfront and then all heap allocations afterward are satisfied by doling out a portion of that initial chunk. This means that each named character reference implementation will use the same memory syscalls. Here's the results: Benchmark 1: ./BenchHTMLTokenizerFilesBumpAlloc dafsa measurement mean ± σ min … max outliers delta wall_time 1.87 ± 39.4 1.84 … 1.99 peak_rss 338 ± 168 338 … 338 cpu_cycles 7.02 ± 158 6.91 … 7.48 instructions 16.1 ± 77.1 16.1 … 16.1 cache_references 334 ± 1.89 332 … 338 cache_misses 10.8 ± 110 10.7 … 11.0 branch_misses 8.25 ± 3.12 7.29 … 17.7 Benchmark 2: ./BenchHTMLTokenizerFilesBumpAlloc blink measurement mean ± σ min … max outliers delta wall_time 1.86 ± 9.05 1.84 … 1.87 peak_rss 338 ± 113 338 … 338 cpu_cycles 6.97 ± 31.8 6.92 … 7.03 instructions 16.1 ± 66.1 16.1 … 16.1 cache_references 335 ± 966 333 … 337 cache_misses 10.8 ± 39.9 10.8 … 10.9 branch_misses 7.35 ± 75.4 7.30 … 7.51 2 (17%) Benchmark 3: ./BenchHTMLTokenizerFilesBumpAlloc gecko measurement mean ± σ min … max outliers delta wall_time 1.85 ± 6.66 1.84 … 1.86 peak_rss 338 ± 158 338 … 338 cpu_cycles 6.96 ± 27.3 6.91 … 7.01 instructions 16.1 ± 96.6 16.1 … 16.1 cache_references 334 ± 1.35 332 … 337 cache_misses 10.7 ± 60.7 10.7 … 10.9 branch_misses 7.29 ± 22.6 7.27 … 7.35 Consequently, this means that I won't bother with the 'average case' benchmarking moving forward. In other words, spoiler alert: nothing in this article will move the needle on this 'average case' benchmark. Side note: conforming to one API🔗 Something worth mentioning here is that I've made the choice to convert the Firefox/Chrome/Safari implementations to conform to the NamedCharacterReferenceMatcher API used by Ladybird (instead of porting the full named character reference tokenizer state implementation from the other browsers' into Ladybird). This was done for two reasons: First, to rule out differences in the tokenizer state implementation itself (tangential to the matching strategy) affecting the matching speed. This might seem strange now, but the logic behind this will be discussed in detail later. Second, it made it so compiling one binary that can switch between all the different implementations at runtime (for the purposes of removing the confounding effect of layout differences) was very convenient. I'm mentioning this now because it means that I've introduced another possible source of error into my benchmarks; the Firefox/Chrome/Safari implementations that I'm testing are not 1:1 ports, as they had to be transformed to conform to the NamedCharacterReferenceMatcher API (Firefox much more than Chrome/Safari). My converted implementations that I'll be using for benchmarking are available in this branch. Lessons learned🔗 I think the big takeaway here is that there is a lot that can go wrong when benchmarking. Aside from the more esoteric stuff mentioned above, there are also countless simple/dumb mistakes that can be made that can completely ruin a benchmark's integrity. As an example, for a good while when writing this article, I accidentally left a loop in the Chrome version that I only put there for debugging purposes. That loop was just eating CPU cycles for no reason and skewed my benchmark results pretty significantly. Luckily, I found and fixed that particular mistake, but that sort of thing could have easily gone unnoticed and caused me to draw totally invalid conclusions. 
Beyond that, there's other stuff I haven't mentioned like CPU architecture, compiler flags, etc, etc, etc. What I'm really trying to get across is something like: You should definitely be skeptical of the benchmarking results I'm providing throughout this article. You might want to be skeptical of all benchmarking results, generally. With all that out of the way, let's get into it. Comparison with Gecko (Firefox)🔗 The current implementation of named character reference tokenization in the Gecko engine (Firefox's browser engine) was introduced in 2010, and refined during the rest of 2010. It has remained unchanged since then. Note: Firefox's HTML tokenizer is actually written in Java which is then translated to C++ It does not use any form of a trie, but instead uses a number of arrays (48 in total) to progressively narrow down the possible candidates within the set of named character references until there's no more possible candidates remaining. Here's an overview: The first character is checked to ensure that it is within the a-z or A-Z range and the first character is saved [src] Note: This is one property of all named character references that the Firefox implementation takes advantage of: all named character references start with a character within the a-z or A-Z range—no exceptions. The second character is then used as an index into a HILO_ACCEL array in order to get the 'row' to use for the first character (there are 44 possible rows; the second character is also always within a-z and A-Z , but there happens to be 8 missing characters from the A-Z range) [src] array in order to get the 'row' to use for the first character (there are 44 possible rows; the second character is also always within and , but there happens to be 8 missing characters from the range) [src] If a valid row exists, the first character is then transformed into an index between 0 and 51 (inclusive) and that is used as an index into the 'row' that was retrieved from the second character [src] and (inclusive) and that is used as an index into the 'row' that was retrieved from the second character [src] The value obtained by the combination of the first two characters contains a 32-bit number: The "lo" bits (the least significant 16 bits) gives you an index into the NAMES array starting at the first possible matching name [src] The "hi" bits (the most significant 16 bits) gives you an index into the NAMES array starting at the last possible matching name [src] The values in the NAMES array are struct 's that contain two pieces of information [src]: An index to the start of the remaining characters in the name, within the ALL_NAMES array (an array of bytes) The length of the remaining characters in the name array are 's that contain two pieces of information [src]: The "lo" and "hi" indexes are then incremented/decremented as candidates get ruled out, while taking note of any fully matching candidates. This happens until there are no possible candidates left ( hi < lo or ; is seen). [src] or is seen). 
The most recently matching candidate's index (if any) is then re-used to look up the mapped code point(s) within the VALUES array (the NAMES and VALUES arrays are the same length) [src]

In the very likely scenario that the above description is hard to take in, here's my best attempt at visually illustrating how it works, matching against the valid named character reference ⋶ :

notinvc;
^ ☑ the first character ( n ) is within the a-z or A-Z range

notinvc;
 ^ Use the second character ( o ) to get the array to use with the first character ( n )

HILO_ACCEL
0   '\x00'  N/A
...
65  'A'     HILO_ACCEL_65
...
110 'n'     HILO_ACCEL_110
111 'o'     HILO_ACCEL_111
112 'p'     HILO_ACCEL_112
...
122 'z'     HILO_ACCEL_122

HILO_ACCEL_111
0   'A'  0x 0011 0010
...
25  'Z'  ...
26  'a'  ...
...
39  'n'  0x 0602 05F6
...
51  'z'  0x 08B3 08B3

The value 0x 0602 05F6 splits into hi = 0x0602 (1538) and lo = 0x05F6 (1526), so any possible matches must be between indexes 1526 and 1538 (inclusive). The possible matches are:

NAMES
1526 pf;
1527 t
1528 t;
1529 tin;
1530 tinE;
1531 tindot;
1532 tinva;
1533 tinvb;
1534 tinvc;
1535 tni;
1536 tniva;
1537 tnivb;
1538 tnivc;

Now we start to narrow those possibilities down:

[Animated visualization: as each subsequent character of notinvc; is consumed, the 'lo' and 'hi' cursors move inward and rule out candidates, until only index 1534 ( tinvc; , i.e. the full match for ⋶ ) remains.]

I'm glossing over how exactly the possibilities are narrowed down because it's not super relevant (if you're interested, here's the responsible tokenizer code), but I will note that the 'lo' and 'hi' cursors always move linearly (i.e. each possible match is ruled out one-by-one; there's no binary search or anything like that going on). This approach works well because the first two characters alone fairly reliably narrow down the possibilities to a pretty small range. Out of 2288 possible combinations of the first two characters, 1658 of them (72.5%) lead to zero possible matches. Out of the remaining combinations (those with ≥ 1 possible match), the mean number of matches is 3.54 with a standard deviation of 29.8, and the median number of possible matches is 2. Here's what the full distribution looks like (with the combinations that lead to zero matches included):

[Histogram: frequency of first-two-character combinations (y-axis, 0 to 1,500) by number of possible matches after the first two characters (x-axis, 0 to 55).]

Note: There are 2 first-two-character-combinations that lead to 55 possible matches: No and su .
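To make that first-two-character lookup a bit more concrete, here's a rough sketch of it in C++. This is my own illustration rather than Gecko's actual code: the array shapes follow the description above, but the use of 0 as a "no candidates" sentinel is an assumption on my part.

#include <cstdint>
#include <optional>

// Inclusive range of indexes into the NAMES array.
struct CandidateRange { uint16_t lo; uint16_t hi; };

// hilo_accel: 123 pointers indexed by the *second* character (nullptr = no row).
// Each row holds 52 32-bit values indexed by the *first* character mapped to
// 0..25 for 'A'..'Z' and 26..51 for 'a'..'z' (matching the visualization above).
std::optional<CandidateRange> initial_range(char first, char second,
                                            const uint32_t* const hilo_accel[123])
{
    unsigned char row_index = static_cast<unsigned char>(second);
    const uint32_t* row = (row_index < 123) ? hilo_accel[row_index] : nullptr;
    if (row == nullptr)
        return std::nullopt;

    int col = (first >= 'A' && first <= 'Z') ? (first - 'A')
            : (first >= 'a' && first <= 'z') ? (first - 'a' + 26)
            : -1;
    if (col < 0)
        return std::nullopt;

    uint32_t hilo = row[col];
    if (hilo == 0) // assumed sentinel for "no names start with these two characters"
        return std::nullopt;

    return CandidateRange{
        static_cast<uint16_t>(hilo & 0xFFFF), // "lo": first possible match in NAMES
        static_cast<uint16_t>(hilo >> 16),    // "hi": last possible match in NAMES
    };
}

For the 'n', 'o' example above, this would return lo = 1526 and hi = 1538, the range shown in the visualization.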
Now that we have an understanding of how the Firefox implementation works, let's see how it compares using the three metrics that were mentioned at the start. Performance between the Firefox version and the Ladybird DAFSA version is basically a wash in the primary benchmark I'm using (tokenizing a file with tens of thousands of valid and invalid named character references): Benchmark 1 : ./BenchHTMLTokenizer gecko measurement mean ± σ min … max outliers delta wall_time 113 ± 1.12 111 … 115 peak_rss 83.4 ± 93.7 83.0 … 83.5 cpu_cycles 226 ± 877 224 … 230 instructions 438 ± 10.6 438 … 438 cache_references 9.54 ± 130 9.40 … 10.5 cache_misses 427 ± 11.1 406 … 458 branch_misses 578 ± 1.79 575 … 585 Benchmark 2 : ./BenchHTMLTokenizer dafsa measurement mean ± σ min … max outliers delta wall_time 114 ± 1.38 110 … 116 peak_rss 83.3 ± 94.6 83.0 … 83.5 cpu_cycles 229 ± 856 227 … 232 💩+ 1.4% ± 0.1% instructions 450 ± 10.4 450 … 450 💩+ 2.7% ± 0.0% cache_references 9.42 ± 128 9.25 … 10.5 cache_misses 418 ± 9.03 400 … 443 ⚡- 2.0% ± 0.7% branch_misses 575 ± 3.01 570 … 600 However, if we tailor some benchmarks to test the scenarios where each should theoretically perform the worst, we can see some clearer differences. I believe the worst case for the Firefox implementation is successfully matching the named character reference ⫌ . As mentioned earlier, su as the first two characters narrows down the possibilities the least, with 55 remaining possibilities, and ⫌ should take the longest to match out of the remaining possibilities. Here are the results for tokenizing a file with nothing but 30,000 ⫌ sequences in a row: Benchmark 1 : ./BenchHTMLTokenizer gecko gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 50.6 ± 1.11 48.3 … 53.2 peak_rss 53.0 ± 85.4 52.7 … 53.2 cpu_cycles 137 ± 816 135 … 140 instructions 278 ± 9.06 278 … 278 cache_references 3.27 ± 58.4 3.13 … 3.55 cache_misses 361 ± 10.0 342 … 396 branch_misses 314 ± 5.29 306 … 335 Benchmark 2 : ./BenchHTMLTokenizer dafsa gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 45.9 ± 805 43.6 … 47.5 22 (10%) ⚡- 9.3% ± 0.4% peak_rss 53.0 ± 83.9 52.7 … 53.2 cpu_cycles 117 ± 635 116 … 119 ⚡- 14.5% ± 0.1% instructions 259 ± 5.16 259 … 259 ⚡- 7.0% ± 0.0% cache_references 3.27 ± 128 3.15 … 4.10 cache_misses 357 ± 7.06 344 … 384 branch_misses 183 ± 1.95 179 … 193 21 (10%) ⚡- 41.8% ± 0.2% On the flipside, the scenario where the Firefox implementation likely outperforms the Ladybird implementation the most is an invalid named character reference that can be rejected from the first two characters alone (I've arbitrarily chosen &cz ). 
Here are the results for tokenizing a file with nothing but 30,000 &cz sequences in a row: Benchmark 1 : ./BenchHTMLTokenizer gecko ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 61.4 ± 958 59.2 … 63.6 peak_rss 65.1 ± 85.5 64.9 … 65.3 24 (15%) cpu_cycles 104 ± 525 102 … 106 instructions 194 ± 4.18 194 … 194 cache_references 5.89 ± 125 5.79 … 6.92 cache_misses 374 ± 4.44 367 … 385 branch_misses 163 ± 1.24 160 … 166 Benchmark 2 : ./BenchHTMLTokenizer dafsa ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 63.0 ± 1.00 60.1 … 65.0 💩+ 2.6% ± 0.3% peak_rss 65.1 ± 74.3 64.8 … 65.2 cpu_cycles 112 ± 673 111 … 117 💩+ 7.6% ± 0.1% instructions 214 ± 4.26 214 … 214 💩+ 10.4% ± 0.0% cache_references 5.87 ± 54.9 5.77 … 6.19 cache_misses 375 ± 4.52 364 … 391 branch_misses 164 ± 751 161 … 166 However, neither of these scenarios are likely to be all that common in reality. My hunch (but I don't have any data to back this up) is that there are two scenarios that are common in real HTML: Valid and complete named character references Invalid named character references that come from accidentally putting & directly in the markup instead of & (and therefore the & is probably surrounded by whitespace) In the second scenario where an & character is surrounded by whitespace, the tokenizer will never actually enter the named character reference state, since that requires & to be followed by an ASCII alphanumeric character, so all implementations will perform the same there. We can test the first scenario, though. Here are the results for a file with nothing but 30,000 valid named character references (chosen at random) in a row: Benchmark 1 : ./BenchHTMLTokenizer gecko all-valid measurement mean ± σ min … max outliers delta wall_time 43.6 ± 713 41.3 … 45.2 peak_rss 54.4 ± 84.8 54.1 … 54.5 cpu_cycles 103 ± 545 102 … 105 instructions 188 ± 10.6 188 … 188 cache_references 3.63 ± 91.6 3.54 … 4.35 cache_misses 361 ± 8.92 347 … 386 branch_misses 383 ± 1.57 379 … 387 Benchmark 2 : ./BenchHTMLTokenizer dafsa all-valid measurement mean ± σ min … max outliers delta wall_time 43.0 ± 770 40.8 … 44.3 30 (13%) ⚡- 1.3% ± 0.3% peak_rss 54.4 ± 80.6 54.0 … 54.5 cpu_cycles 102 ± 452 101 … 104 ⚡- 1.3% ± 0.1% instructions 191 ± 8.04 191 … 191 💩+ 2.0% ± 0.0% cache_references 3.54 ± 63.5 3.45 … 4.09 ⚡- 2.6% ± 0.4% cache_misses 355 ± 5.23 344 … 370 ⚡- 1.7% ± 0.4% branch_misses 344 ± 1.24 341 … 348 ⚡- 10.1% ± 0.1% Something worth noting about why Firefox fares better here than in the ⫌ benchmark above is that the distribution of possible matches after two characters is very uneven (as mentioned previously). In 42.7% of cases (269 out of 630) where there is at least 1 possible match after the first two characters, there is only 1 possible match. This makes the 'narrowing down the matches' portion of the Firefox implementation essentially just a string comparison. Something else we can look at is raw matching speed in isolation (i.e. not within the context of HTML tokenization). 
In my benchmarking, the Firefox implementation wins out in this category:

Benchmark 1 : ./BenchMatcherGecko measurement mean ± σ min … max outliers delta wall_time 105 ± 1.14 102 … 108 peak_rss 4.56 ± 72.7 4.33 … 4.72 cpu_cycles 426 ± 1.40 424 … 430 instructions 745 ± 81.5 745 … 745 cache_references 8.03 ± 81.6 7.89 … 8.54 cache_misses 28.0 ± 5.70 21.2 … 46.5 branch_misses 5.41 ± 2.49 5.41 … 5.42

Benchmark 2 : ./BenchMatcherDafsa measurement mean ± σ min … max outliers delta wall_time 120 ± 1.46 116 … 123 💩+ 13.8% ± 0.4% peak_rss 4.50 ± 61.6 4.46 … 4.59 cpu_cycles 487 ± 4.39 477 … 496 💩+ 14.3% ± 0.2% instructions 1.02 ± 66.2 1.02 … 1.02 14 (17%) 💩+ 36.7% ± 0.0% cache_references 6.11 ± 90.5 5.98 … 6.81 ⚡- 23.9% ± 0.3% cache_misses 26.0 ± 4.02 20.3 … 39.8 ⚡- 7.1% ± 5.2% branch_misses 6.01 ± 20.5 5.98 … 6.06 💩+ 11.1% ± 0.1%

Two things to note:
It's unclear how applicable this 'raw matching speed' benchmark is for HTML tokenization (this will be discussed more later).
We'll make some improvements to the DAFSA implementation later that will flip these results.

Data size🔗

As noted earlier, Firefox uses a total of 48 arrays for its named character reference data:
HILO_ACCEL is an array of 123 pointers, so that's 984 bytes on a 64-bit architecture
There are 44 HILO_ACCEL_n arrays, each containing 52 32-bit integers, so that's 9,152 bytes
ALL_NAMES is an array of bytes containing the complete set of characters in all named character references, excluding the first two characters of each. This adds up to 12,183 bytes in total
NAMES is an array of 2,231 32-bit structs, so that's 8,924 bytes
VALUES is also an array of 2,231 32-bit structs, so that's 8,924 bytes

Note: The VALUES array does something pretty clever. As mentioned earlier, the largest mapped first code point requires 17 bits to encode, and the largest second code point is U+FE00 so that needs 16 bits to encode. Therefore, to encode both code points directly you'd need at minimum 33 bits. However, all mappings that have a first code point with a value > the u16 max also happen to be mappings that do not have a second code point. So, if you store all the mapped values encoded as UTF-16, you can get away with using two 16-bit integers to store all the possible values (i.e. the > u16 max code points get stored as a surrogate pair, while all the other mappings get stored as two UTF-16 code units). (this is not an improvement over how I store this data, but I thought it was still worth noting)

So, in total, the Firefox implementation uses 40,167 bytes (39.23 KiB) for its named character reference data, while Ladybird uses 24,412 bytes (23.84 KiB). That's a difference of 15,755 bytes (15.39 KiB), or, in other words, the Ladybird implementation uses 60.8% of the data size of the Firefox implementation. Note: As mentioned previously, an additional 3,067 bytes (2.99 KiB) could be saved if the values array in the Ladybird implementation were bitpacked.
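To make that UTF-16 trick concrete, here's a small sketch of how a VALUES-style entry could be decoded back into the mapped code point(s). This is my own illustration rather than Gecko's code, and treating a zero second code unit as "no second code point" is an assumption made purely for the sake of the example.

#include <cstdint>
#include <vector>

// Decode one two-code-unit entry into the code point(s) it represents.
std::vector<uint32_t> decode_value(uint16_t first_unit, uint16_t second_unit)
{
    // A high surrogate means the two units encode a single code point above
    // U+FFFF (these mappings never have a second code point).
    if (first_unit >= 0xD800 && first_unit <= 0xDBFF) {
        uint32_t code_point = 0x10000
            + ((static_cast<uint32_t>(first_unit) - 0xD800) << 10)
            + (static_cast<uint32_t>(second_unit) - 0xDC00);
        return { code_point };
    }
    // Otherwise each unit is its own code point; zero is assumed here to mean
    // "no second code point" (the real sentinel may differ).
    if (second_unit == 0)
        return { first_unit };
    return { first_unit, second_unit };
}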
Since, for the purposes of my benchmarking, the Firefox implementation was made to conform to the API of the NamedCharacterReferenceMatcher that is used in the Ladybird implementation, there's no meaningful difference in terms of ease-of-use. However, I'll take this opportunity to talk about what changes were made to make that happen and how the real Firefox implementation differs.
The real Firefox implementation uses 3 different tokenizer states and multiple complicated loops with goto statements to move between them
The NamedCharacterReferenceMatcher version moves the tokenizer states into the Matcher and replaces the complicated loops with 2 simple loops
Very crudely approximated, the implementation went from ~200 SLoC to ~110 SLoC. So, the NamedCharacterReferenceMatcher abstraction may represent a marginal improvement in terms of ease-of-use.

Overall, the Firefox implementation fares quite well in this comparison.
It's at least as fast, with some potential performance benefits and some potential weaknesses
It uses more data to store the named character reference mappings, but saving 15 KiB might not be much of a concern
The implementation is a fair bit more complicated, but not necessarily in a meaningful way

Comparison with Blink/WebKit (Chrome/Safari)🔗

Blink (the browser engine of Chrome/Chromium) started as a fork of WebKit (the browser engine of Safari), which itself started as a fork of KHTML. There are some differences that have emerged between the two since Blink was forked from WebKit, but for the purposes of this article I'm only going to benchmark against the Blink implementation and assume the results would be roughly the same for the WebKit implementation (the difference mostly comes down to data size, which I'll mention in the "Data size" section later). Note: I'll double check that the Safari implementation has the same performance characteristics and update this article with my findings once I do.

Like Firefox, the Chrome/Safari named character reference tokenization does not use a trie. For the 'matching' portion, the Chrome/Safari implementation is actually quite similar in concept to the Firefox implementation:
Use the first character to look up the initial range of possible matches within a sorted array of all named character references [src]
For each character after that, use std::ranges::equal_range to narrow the possibilities within the current range ( std::ranges::equal_range uses binary searches to get both the std::ranges::lower_bound and std::ranges::upper_bound ) [src] (see the sketch after this list)
If the first possible match in the resulting range is the same length as the current number of characters being matched, mark it as the most recent match [src]
Continue until there are no more possible matches (the range is empty) [src]
The main differences from the Firefox implementation are that the Chrome/Safari version (a) only uses the first character to narrow the initial possible matches, and (b) uses binary searches to narrow the possibilities after that instead of linearly moving the lo and hi indexes.
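Here's a simplified sketch of that narrowing step. It's my own illustration rather than Blink's code: it operates on plain std::string names and uses separate std::lower_bound/std::upper_bound calls where Blink uses std::ranges::equal_range on its entry structs, but the shape of the search is the same idea.

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Half-open range [first, last) of candidates in a lexicographically sorted
// list of entity names, all of which already match the first `depth` characters.
struct NarrowResult { std::size_t first; std::size_t last; };

NarrowResult narrow(const std::vector<std::string>& sorted_names,
                    std::size_t first, std::size_t last,
                    std::size_t depth, char next_char)
{
    auto begin = sorted_names.begin() + first;
    auto end = sorted_names.begin() + last;
    // Candidates that are too short to have a character at `depth` sort before
    // those that do, so comparing on that character keeps the range contiguous.
    auto lower = std::lower_bound(begin, end, next_char,
        [depth](const std::string& name, char c) {
            return depth >= name.size() || name[depth] < c;
        });
    auto upper = std::upper_bound(begin, end, next_char,
        [depth](char c, const std::string& name) {
            return depth < name.size() && c < name[depth];
        });
    return { static_cast<std::size_t>(lower - sorted_names.begin()),
             static_cast<std::size_t>(upper - sorted_names.begin()) };
}

An empty result range means no candidates remain; a non-empty range whose first entry has length depth + 1 corresponds to the "mark it as the most recent match" step above.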
Similar to Firefox, the performance difference in the 'tens of thousands of valid and invalid named character references' benchmark is basically a wash: Benchmark 1 : ./BenchHTMLTokenizer blink measurement mean ± σ min … max outliers delta wall_time 115 ± 943 113 … 117 peak_rss 83.3 ± 99.6 83.0 … 83.5 cpu_cycles 232 ± 754 230 … 234 instructions 461 ± 4.73 461 … 461 cache_references 9.95 ± 299 9.71 … 12.3 cache_misses 412 ± 5.68 401 … 425 branch_misses 747 ± 1.78 744 … 757 Benchmark 2 : ./BenchHTMLTokenizer dafsa measurement mean ± σ min … max outliers delta wall_time 114 ± 1.26 110 … 117 peak_rss 83.3 ± 77.5 83.1 … 83.5 cpu_cycles 228 ± 882 227 … 233 ⚡- 1.4% ± 0.1% instructions 450 ± 7.33 450 … 450 ⚡- 2.4% ± 0.0% cache_references 9.48 ± 402 9.29 … 13.1 ⚡- 4.7% ± 1.1% cache_misses 412 ± 7.21 398 … 432 branch_misses 575 ± 6.48 571 … 633 ⚡- 23.0% ± 0.2% The DAFSA is faster than Chrome in the all- ⫌ benchmark, but the difference is not as big as it was with Firefox, since Chrome uses binary searches to narrow down the possible matches whereas Firefox uses linear scans: Benchmark 1 : ./BenchHTMLTokenizer blink gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 47.3 ± 782 45.0 … 49.6 peak_rss 53.0 ± 94.4 52.7 … 53.2 cpu_cycles 123 ± 807 122 … 126 instructions 290 ± 7.46 290 … 290 cache_references 3.31 ± 200 3.19 … 5.79 cache_misses 358 ± 8.15 345 … 401 branch_misses 182 ± 1.82 178 … 194 Benchmark 2 : ./BenchHTMLTokenizer dafsa gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 45.9 ± 663 43.6 … 47.3 ⚡- 3.0% ± 0.3% peak_rss 53.0 ± 82.6 52.7 … 53.2 cpu_cycles 117 ± 677 116 … 120 ⚡- 5.0% ± 0.1% instructions 259 ± 7.78 259 … 259 ⚡- 10.8% ± 0.0% cache_references 3.26 ± 99.4 3.17 … 4.00 cache_misses 358 ± 8.14 342 … 386 branch_misses 183 ± 3.07 179 … 214 The DAFSA is once again worse at detecting &cz as invalid, and the results are similar to what they were with Firefox: Benchmark 1 : ./BenchHTMLTokenizer blink ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 62.6 ± 1.08 60.6 … 64.3 peak_rss 65.1 ± 79.4 64.8 … 65.2 cpu_cycles 108 ± 640 107 … 111 instructions 203 ± 11.3 203 … 203 cache_references 5.92 ± 51.5 5.82 … 6.30 cache_misses 384 ± 8.88 366 … 409 branch_misses 164 ± 1.44 160 … 169 Benchmark 2 : ./BenchHTMLTokenizer dafsa ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 63.9 ± 1.10 61.8 … 65.5 💩+ 2.0% ± 0.4% peak_rss 65.1 ± 92.8 64.8 … 65.2 cpu_cycles 113 ± 707 112 … 117 💩+ 4.4% ± 0.1% instructions 214 ± 10.9 214 … 214 💩+ 5.6% ± 0.0% cache_references 5.89 ± 63.4 5.76 … 6.31 cache_misses 387 ± 9.20 367 … 412 branch_misses 165 ± 2.77 162 … 197 For the '30,000 valid named character references' benchmark, the DAFSA is slightly faster (similar results as Firefox): Benchmark 1 : ./BenchHTMLTokenizer blink all-valid measurement mean ± σ min … max outliers delta wall_time 43.9 ± 844 41.7 … 47.0 peak_rss 54.3 ± 89.2 54.0 … 54.5 cpu_cycles 105 ± 959 104 … 112 instructions 204 ± 6.18 204 … 204 cache_references 3.78 ± 137 3.67 … 5.37 cache_misses 359 ± 11.2 345 … 457 branch_misses 466 ± 1.43 463 … 472 Benchmark 2 : ./BenchHTMLTokenizer dafsa all-valid measurement mean ± σ min … max outliers delta wall_time 43.1 ± 592 41.1 … 44.3 ⚡- 1.8% ± 0.3% peak_rss 54.4 ± 79.6 54.1 … 54.5 cpu_cycles 102 ± 487 101 … 104 ⚡- 3.3% ± 0.1% instructions 191 ± 5.33 191 … 191 ⚡- 6.0% ± 0.0% cache_references 3.55 ± 74.9 3.46 … 4.21 ⚡- 6.3% ± 0.5% cache_misses 355 ± 5.67 344 … 373 branch_misses 345 ± 2.12 342 … 374 ⚡- 26.0% ± 0.1% In 
the raw matching speed benchmark, the DAFSA comes out ahead:

Benchmark 1 : ./BenchHTMLTokenizerBlink measurement mean ± σ min … max outliers delta wall_time 144 ± 1.33 141 … 148 peak_rss 4.55 ± 65.8 4.33 … 4.72 cpu_cycles 590 ± 2.33 585 … 604 instructions 1.07 ± 83.4 1.07 … 1.07 cache_references 12.3 ± 162 12.1 … 13.0 cache_misses 27.6 ± 5.79 21.1 … 67.6 branch_misses 8.71 ± 21.1 8.68 … 8.79

Benchmark 2 : ./BenchMatcherDafsa measurement mean ± σ min … max outliers delta wall_time 119 ± 1.47 116 … 122 ⚡- 17.6% ± 0.3% peak_rss 4.52 ± 65.7 4.46 … 4.59 cpu_cycles 484 ± 4.33 477 … 496 ⚡- 17.8% ± 0.2% instructions 1.02 ± 80.6 1.02 … 1.02 ⚡- 4.4% ± 0.0% cache_references 6.10 ± 177 5.99 … 7.66 ⚡- 50.5% ± 0.4% cache_misses 25.4 ± 3.42 20.4 … 38.8 ⚡- 7.9% ± 5.3% branch_misses 6.02 ± 21.7 5.98 … 6.06 ⚡- 30.9% ± 0.1%

Data size🔗

The Chrome implementation uses four arrays:
kStaticEntityStringStorage , an array of all the bytes in every named character reference, with some de-duplication techniques (e.g. the sequence 'b', 'n', 'o', 't', ';' in the array is used for ⌐ , ¬ , and ¬ ). It uses 14,485 bytes total.
kStaticEntityTable , an array of 2,231 12-byte wide structs containing information about each named character reference (its location in the kStaticEntityStringStorage array, the length of its name, the code point(s) it should be transformed into). It uses 26,722 bytes.
kUppercaseOffset and kLowercaseOffset are each arrays of offsets into kStaticEntityTable , and both are used as lookup tables for the first character. Getting kUppercaseOffset[char - 'A'] gives you the initial lower bound's offset and kUppercaseOffset[char - 'A' + 1] gives you the initial upper bound's offset (and the same sort of thing for kLowercaseOffset ). Each uses 54 bytes, so that's 108 bytes total.

All together, the Chrome implementation uses 41,167 bytes (40.39 KiB) for its named character reference data, while Ladybird uses 24,412 bytes (23.84 KiB). That's a difference of 16,953 bytes (16.56 KiB); in other words, the Ladybird implementation uses 59.0% of the data size of the Chrome implementation.

The Safari implementation uses the same four arrays, but has made a few more data size optimizations:
kStaticEntityStringStorage does not include semicolons, and instead that information was moved to a boolean flag within the elements of the kStaticEntityTable array. This brings down the total bytes used by this array to 11,127 (-3,358 compared to the Chrome version)
The HTMLEntityTableEntry struct (used in the kStaticEntityTable array) was converted to use a bitfield to reduce the size of the struct from 12 bytes to 8 bytes (57 bits). However, Clang seems to insert padding bits into the struct which brings it back up to 12 bytes anyway (it wants to align the optionalSecondCharacter and nameLengthExcludingSemicolon fields).
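To illustrate the kind of layout in question, here's a hedged sketch: the field names optionalSecondCharacter and nameLengthExcludingSemicolon come from the description above, but every other name and all of the bit widths are guesses of mine chosen only to demonstrate how alignment and padding can undo the intended size reduction; this is not WebKit's actual struct.

#include <cstdint>

struct HTMLEntityTableEntrySketch {
    uint32_t firstValue : 17;                  // largest first code point needs 17 bits
    uint32_t nameStartOffset : 16;             // offset into the string storage array
    uint32_t nameLengthExcludingSemicolon : 6;
    uint32_t hasTrailingSemicolon : 1;
    uint16_t optionalSecondCharacter;          // plain (non-bitfield) 16-bit member
};

// On common ABIs the bitfields occupy two 32-bit allocation units (the 16-bit
// field doesn't fit in the bits left over after the 17-bit field), and the
// trailing uint16_t plus struct alignment pushes the total to 12 bytes rather
// than the hoped-for 8.
static_assert(sizeof(HTMLEntityTableEntrySketch) <= 12, "sketch should not exceed the unpacked size");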
So, this data size optimization may or may not actually have an effect (I'm not very familiar with the rules around C++ bitfield padding, so I feel like I can't say anything definitive). If the size is reduced to 8 bytes, then kStaticEntityTable uses 8,924 fewer bytes (17,798 instead of 26,722). So, the Safari implementation uses either 30,040 bytes (29.34 KiB) if HTMLEntityTableEntry uses 12 bytes or 21,116 bytes (20.62 KiB) if HTMLEntityTableEntry uses 8 bytes. This means that Safari's data size optimizations (or at least their intended effect) make its data size smaller than Ladybird's (even if the Ladybird implementation tightly bitpacked its values array, it'd still use 229 bytes more than the 8-byte- HTMLEntityTableEntry Safari version). This also shows that the larger data size of the Chrome implementation is not inherent to the approach that it uses.

In terms of ease of use, for now I'll just say there's no meaningful difference, but there's a caveat that will be discussed later.

Overall, the Chrome implementation as-it-is-now fares about as well as the Firefox implementation in this comparison, but has some potential strengths/weaknesses of its own. That is, it covers one weakness of the Firefox implementation by using binary searches instead of linear scans, but it always has to narrow down the possibilities from a larger initial range (since it only uses the first character to get the range of possible matches whereas Firefox uses the first two characters). The Safari version fares much better in terms of data size (potentially beating out my DAFSA version), and its size optimizations could be applied to the Chrome version as well since the core approach is the same between them. At this point, though, you might be asking yourself, why don't we try...

Combining Firefox, Chrome, and Safari together🔗

In theory, the best ideas from the Firefox and Chrome/Safari implementations could be combined into one new implementation:
Use the combination of the first two characters to get the initial range of possible matches (like Firefox)
Use binary searches to narrow down the possible matches (like Chrome/Safari)
Don't store the first two characters in the kStaticEntityStringStorage / ALL_NAMES array (like Firefox)
Re-use indexes into kStaticEntityStringStorage / ALL_NAMES when possible (like Chrome/Safari, see the ⌐ / ¬ / ¬ example above)
Don't store semicolons in the kStaticEntityStringStorage / ALL_NAMES array (like Safari)
Reduce the size of the HTMLEntityTableEntry / nsHtml5CharacterName struct (like Safari intends to do)
I haven't tested this combination to see how exactly it stacks up, but I would assume it'd be quite good overall.

Something I didn't mention about the Chrome implementation🔗

Since I converted the Chrome implementation to use Ladybird's NamedCharacterReferenceMatcher API in an effort to improve the accuracy of my benchmarking, one major aspect of the Chrome implementation was lost in translation: the Chrome implementation doesn't actually use the character-by-character tokenization strategy we've discussed so far. Instead, it uses the "lookahead (but never beyond an insertion point) until we're certain we have enough characters ..." strategy mentioned back in the Named character reference tokenization overview.
The very high-level summary of the approach (as it is actually implemented in Chrome) is very similar to the description of it in that section:
Starting after the ampersand, look ahead as far as possible without looking beyond the end of an insertion point
Try to match a full named character reference
If you run out of characters because you hit the end of an insertion point while matching, backtrack and try again on the next tokenizer iteration (always starting the lookahead from just after the ampersand, i.e. no state is saved between attempts)

Note: It's only possible to 'run out of characters while matching' when there is an active insertion point. If there isn't one (the common case), then this difference in strategy doesn't matter since 'backtrack and try again' will never come into play. Note also that this strategy doesn't inherently require any particular implementation for the 'try to match a full named character reference' part; a trie, or a DAFSA, or Firefox's HILO_ACCEL implementation, or any other approach could be slotted in there with no change in functionality.

The downside of the Chrome implementation in particular is actually a choice that was made that's not inherent to the overall approach: they don't preserve any matching state between tokenizer iterations and always backtrack to the & before trying to match again. For example, when matching against something like ∉ that's being written one-character-at-a-time (via document.write ), the matching algorithm described in the "Comparison with Blink/WebKit (Chrome/Safari)" section will be executed for each of:
&n , &no , ¬ , ¬i , and ¬in , each resulting in the 'not enough characters' flag being set
Finally, ∉ will be matched fully (the semicolon acts as a definitive delimiter)

In theory, the redundant work that's performed in these sorts of scenarios should have a noticeable effect on performance, but, in practice, I wasn't able to prove that out with benchmarking.

Details and results of my benchmarking
The test file (inserting lots of valid named character references one-character-at-a-time using document.write )
The branch containing the code under test (a faithful adaptation of Blink's lookahead strategy and a runtime flag to switch to an implementation that preserves matching state between retry attempts due to 'not enough characters')

Benchmark 1 : ./headless-browser --dump-text pathological-write-valid.html measurement mean ± σ min … max outliers delta wall_time 1.46 ± 8.11 1.45 … 1.48 peak_rss 62.1 ± 123 61.9 … 62.3 cpu_cycles 5.90 ± 33.5 5.88 … 5.99 instructions 29.1 ± 562 29.1 … 29.1 cache_references 210 ± 988 208 … 211 cache_misses 8.53 ± 322 7.92 … 9.09 branch_misses 5.03 ± 19.2 5.01 … 5.07

Benchmark 2 : ./headless-browser --blink-preserve-state --dump-text pathological-write-valid.html measurement mean ± σ min … max outliers delta wall_time 1.46 ± 4.52 1.45 … 1.47 peak_rss 62.0 ± 164 61.7 … 62.4 cpu_cycles 5.88 ± 18.1 5.86 … 5.93 instructions 29.1 ± 422 29.1 … 29.1 cache_references 209 ± 1.26 207 … 212 cache_misses 8.55 ± 425 8.19 … 9.78 branch_misses 5.00 ± 27.0 4.97 … 5.07

So, despite HTMLEntitySearch::Advance being called 4.5x more in the 'no state preserved' benchmark, no difference shows up in the results.
I believe this is because the actual matching is a small part of the work being done in this benchmark, or, in other words, there is a difference but it's being drowned out by all the work being done elsewhere (JavaScript being run, tokenizer input being updated, etc). I have a hunch that Ladybird in particular might be greatly exacerbating this effect and making all this tangential work slower than it theoretically needs to be, especially in the case of updating the tokenizer input. For example, Chrome uses a rope-like SegmentedString to mitigate the cost of inserting into the middle of the input, while Ladybird currently reallocates the entire modified input on each insertion.

To sum up what I'm trying to say:
The Chrome implementation demonstrably does more work than necessary in the ' document.write one-character-at-a-time' scenario because it doesn't preserve matching state between retries due to 'not enough characters'
I am unable to create a relevant benchmark using Ladybird, likely due to some inefficiencies in the Ladybird tokenizer implementation, but a clear difference might show up when benchmarking the Chrome tokenizer itself (i.e. if you made the Chrome implementation preserve state in this scenario and benchmarked its tokenizer, I expect it might show some measurable difference)

This is important to note because it is the reason I included the following point in my 'list of things you'll need to trust in order for my benchmarking to be accurate' in the intro to the "Comparison to the major browser engines" section:
The performance characteristics exhibited would hold when going the other direction (putting my implementation into their tokenizer)
That is, I have some reason to believe 'going the other direction' may actually be slightly more favorable to my DAFSA implementation, as all the code outside of the named character reference tokenization state itself would likely do less work that muddies the benchmarking results.

A benefit of the 'lookahead' strategy🔗

An upside of the 'lookahead' strategy is that, when you know there's no active insertion point, you can always match against the full remaining input in one go. This is potentially an improvement over the 'tokenize-one-character-at-a-time' strategy if there is any significant amount of work that's done between tokenizer iterations. Here are some simple diagrams to illustrate the point:

[Diagram: in the character-by-character strategy, the "work done to move to the next character" is repeated between each of a, b, c, …; in the lookahead strategy, the "work done to skip to the end of the match" happens once for the whole run a, b, c, ….]

There are two theoretical benefits I can think of with this:
Avoiding the work done between each character may lead to better CPU cache usage/branch prediction
Moving the cursor ahead by N in one go may be more efficient than moving it ahead by one, N times
Luckily, we don't actually need to change much about our 'character-by-character tokenization' approach to get these benefits, as we only need to make it so we use lookahead whenever there's no active insertion point.
In Ladybird, that might look something like this (full implementation): BEGIN_STATE ( NamedCharacterReference ) { if ( stop_at_insertion_point == StopAtInsertionPoint :: No ) { } else { } } In practice, this seems to give up to a 1.13x speedup in our Ladybird benchmarks essentially for free: Benchmark results Benchmark 1 : ./BenchHTMLTokenizer dafsa measurement mean ± σ min … max outliers delta wall_time 114 ± 934 112 … 115 peak_rss 83.5 ± 74.9 83.3 … 83.6 cpu_cycles 226 ± 964 223 … 229 instructions 452 ± 6.81 452 … 452 cache_references 9.64 ± 86.1 9.51 … 9.86 cache_misses 418 ± 10.5 399 … 448 branch_misses 573 ± 2.51 570 … 584 Benchmark 2 : ./BenchHTMLTokenizer dafsa-lookahead measurement mean ± σ min … max outliers delta wall_time 108 ± 1.05 106 … 111 ⚡- 5.1% ± 0.4% peak_rss 83.5 ± 81.5 83.3 … 83.6 cpu_cycles 203 ± 869 201 … 205 ⚡- 10.5% ± 0.2% instructions 377 ± 10.8 377 … 377 ⚡- 16.6% ± 0.0% cache_references 9.57 ± 71.3 9.42 … 9.77 cache_misses 415 ± 7.59 401 … 429 branch_misses 553 ± 2.04 547 … 556 ⚡- 3.5% ± 0.2% Benchmark 1 : ./BenchHTMLTokenizer dafsa gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 45.8 ± 866 43.7 … 47.2 11 (10%) peak_rss 53.2 ± 67.6 53.0 … 53.4 cpu_cycles 117 ± 601 116 … 118 instructions 261 ± 6.10 261 … 261 cache_references 3.27 ± 62.6 3.19 … 3.69 cache_misses 354 ± 4.17 345 … 368 branch_misses 182 ± 4.28 178 … 213 Benchmark 2 : ./BenchHTMLTokenizer dafsa-lookahead gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 40.4 ± 756 38.3 … 42.0 13 (10%) ⚡- 11.8% ± 0.5% peak_rss 53.2 ± 77.2 52.9 … 53.4 cpu_cycles 93.4 ± 546 92.5 … 95.8 ⚡- 19.9% ± 0.1% instructions 190 ± 5.41 190 … 190 ⚡- 27.1% ± 0.0% cache_references 3.31 ± 42.2 3.23 … 3.48 cache_misses 354 ± 5.07 345 … 372 branch_misses 153 ± 10.5 146 … 215 17 (14%) ⚡- 16.1% ± 1.2% Benchmark 1 : ./BenchHTMLTokenizer dafsa ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 63.2 ± 964 61.6 … 65.7 peak_rss 65.3 ± 79.1 65.0 … 65.4 cpu_cycles 112 ± 606 111 … 115 instructions 215 ± 8.83 215 … 215 cache_references 5.91 ± 107 5.78 … 6.50 cache_misses 372 ± 4.43 363 … 390 branch_misses 162 ± 1.17 160 … 165 Benchmark 2 : ./BenchHTMLTokenizer dafsa-lookahead ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 62.8 ± 1.08 61.3 … 64.3 peak_rss 65.2 ± 88.6 64.9 … 65.4 cpu_cycles 111 ± 454 110 … 112 ⚡- 1.4% ± 0.1% instructions 209 ± 7.60 209 … 209 ⚡- 2.7% ± 0.0% cache_references 5.91 ± 68.1 5.78 … 6.09 cache_misses 374 ± 5.24 365 … 397 branch_misses 164 ± 964 162 … 168 Benchmark 1 : ./BenchHTMLTokenizer dafsa all-valid measurement mean ± σ min … max outliers delta wall_time 43.3 ± 843 40.9 … 45.6 peak_rss 54.5 ± 83.6 54.3 … 54.6 cpu_cycles 100 ± 744 98.4 … 103 instructions 193 ± 11.0 193 … 193 12 (10%) cache_references 3.58 ± 40.3 3.47 … 3.70 cache_misses 363 ± 11.3 344 … 398 branch_misses 344 ± 1.99 341 … 349 Benchmark 2 : ./BenchHTMLTokenizer dafsa-lookahead all-valid measurement mean ± σ min … max outliers delta wall_time 39.4 ± 849 37.2 … 40.8 14 (11%) ⚡- 9.0% ± 0.5% peak_rss 54.5 ± 84.3 54.3 … 54.6 cpu_cycles 84.8 ± 521 84.0 … 87.4 ⚡- 15.5% ± 0.2% instructions 148 ± 6.20 148 … 148 ⚡- 23.2% ± 0.0% cache_references 3.56 ± 51.3 3.47 … 3.82 cache_misses 356 ± 7.44 342 … 384 ⚡- 2.0% ± 0.7% branch_misses 316 ± 3.08 313 … 347 ⚡- 8.1% ± 0.2% This also explains a few things that I glossed over or deferred until later throughout the article: It's the primary reason I made all the browsers' implementations conform to the 
NamedCharacterReferenceMatcher API and thereby converted them all to use the 'character-by-character tokenization' strategy (to rule out stuff like this from affecting the results without me realizing it).
It's the reason that the inexplicable benchmark results at the start of the "On the difficulty of benchmarking" section showed the Chrome implementation being faster than the DAFSA implementation. I was using a faithful 1:1 port of the Chrome named character reference state at that point, and so the difference was due to the 'lookahead' strategy rather than the matching implementation.
It's a partial explanation for why the results of the 'raw matching speed' benchmark (the one that just tests NamedCharacterReferenceMatcher directly without involving the tokenizer) don't fully translate to equivalent differences in the tokenizer benchmarks.

Another difference I didn't mention🔗

Something else that I've failed to mention until now regards exactly how backtracking is performed. Note: This tangent on backtracking is not really related to anything else; I'm only mentioning it now because I had nowhere else to put it.

In Ladybird, backtracking is very straightforward: modify the input cursor's position by subtracting N from the current position (where N is the number of overconsumed code points). This is safe to do from a code point boundary perspective because the input is always UTF-8 encoded, and named character reference matching will only ever consume ASCII characters, so going back N bytes is guaranteed to be equivalent to going back N code points.

In all the other major browsers, though, backtracking is done by re-inserting the overconsumed code points back into the input stream, at the position just after the current code point. That is, it modifies the input stream such that the next code points to-be-tokenized will be those that were just re-inserted.

As of now, I'm unsure if the way that Firefox/Chrome/Safari do it is necessary (and therefore whether or not Ladybird will need to adopt the same strategy to be fully compliant with the spec). If it is necessary, I'm either unaware of the relevant web platform test, or there is a missing web platform test that checks whatever it's necessary for. If it's not necessary, then there may be an opportunity in the other browser engines to simplify backtracking when dealing with named character references.

Further improvements to the DAFSA implementation🔗

As of the writing of this article, the DAFSA implementation that's been described so far is exactly what's in Ladybird. During the process of writing this, though, I came up with some ideas (largely inspired by the Firefox/Chrome/Safari implementations) to improve my implementation by utilizing every last possible bit I could get my hands on. This came in the form of two independent optimizations, with one of them just so happening to barely make the other possible. Note: The code examples in this section will be using Zig syntax.

A property of named character references that I missed, but that the major browser engines all take advantage of, is that the first character of a named character reference is always within the range of a-z or A-Z (inclusive).
In terms of the DAFSA, we can take advantage of this property to accelerate the search for the first character: instead of linearly scanning across the child nodes, we can:
Check if the character is an alphabetic ASCII character, and immediately reject any that aren't
Create a lookup table for alphabetic ASCII characters that has the resulting DAFSA state pre-computed
This would turn the O(n) search within the 'first layer' of the DAFSA into an O(1) lookup.

As for what needs to be stored in the lookup table, remember that we build up a 'unique index' when traversing a list of children, with any child that we iterate over adding its number field to the total:

[Diagram: a list of four sibling child nodes a, b, c, d]

So, we can pre-compute the accumulated unique index that would result when matching a character in the 'first layer,' and then store that in the lookup table. For example, the relevant data for the first four nodes in the first layer of our DAFSA looks like this:

.{ .char = 'A', .number = 27 },
.{ .char = 'B', .number = 12 },
.{ .char = 'C', .number = 36 },
.{ .char = 'D', .number = 54 },

so the corresponding lookup table entries could look like this:

0, 27, 39, 75,

We can then