Slightly better named character reference tokenization than Chrome, Safari, and Firefox

2025-06-26

Note: I am not a 'browser engine' person, nor a 'data structures' person. I'm certain that an even better implementation than what I came up with is very possible.

A while back, for no real reason, I tried writing an implementation of a data structure tailored to the specific use case of the Named character reference state of HTML tokenization (here's the link to that experiment). Recently, I took that implementation, ported it to C++, and used it to make some efficiency gains and fix some spec compliance issues in the Ladybird browser.

Throughout this, I never actually looked at the implementations used in any of the major browser engines (no reason for this, just me being dumb). However, now that I have looked at Blink/WebKit/Gecko (Chrome/Safari/Firefox, respectively), I've realized that my implementation seems to be either on par or better across the metrics that the browser engines care about:

Efficiency (at least as fast, if not slightly faster)

Compactness of the data (uses ~60% of the data size of Chrome's/Firefox's implementation)

Ease of use

Note: I'm singling out these metrics because the Python script that generates the data structures used for named character reference tokenization in Blink (the browser engine of Chrome/Chromium) contains this docstring (emphasis mine): """This python script creates the raw data that is our entity database. The representation is one string database containing all strings we could need, and then a mapping from offset+length -> entity data. That is compact, easy to use and efficient."""
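To make the representation described in that docstring a bit more concrete, here is a minimal sketch of "one string database plus a mapping from offset+length to entity data". It is not Blink's actual code: the five-entry entity subset and the names `build_database` and `lookup` are assumptions made up for illustration, and the linear scan is only there to keep the example short.

```python
# Illustrative subset of named character references (not the full HTML list).
ENTITIES = {
    "amp;": "&",
    "lt;": "<",
    "gt;": ">",
    "not": "\u00ac",      # legacy form, valid without a trailing semicolon
    "notin;": "\u2209",
}


def build_database(entities):
    """Pack every name into one string and record (offset, length, value) rows."""
    names = sorted(entities)
    database = "".join(names)
    table = []
    offset = 0
    for name in names:
        table.append((offset, len(name), entities[name]))
        offset += len(name)
    return database, table


def lookup(database, table, name):
    """Find a name by comparing it against each slice of the string database.

    A linear scan keeps the sketch short; it says nothing about how a real
    engine searches its table.
    """
    for offset, length, value in table:
        if database[offset:offset + length] == name:
            return value
    return None


DATABASE, TABLE = build_database(ENTITIES)
```

Here every name lives exactly once in `DATABASE`, and each table row only stores where its name starts, how long it is, and what it maps to.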

So, I thought I'd take you through what I came up with and how it compares to the implementations in the major browser engines. Mostly, though, I just think the data structure I used is neat and want to tell you about it (fair warning: it's not novel).
