Slightly better named character reference tokenization than Chrome, Safari, and Firefox
2025-06-26

Note: I am not a 'browser engine' person, nor a 'data structures' person. I'm certain that an even better implementation than what I came up with is very possible.

A while back, for no real reason, I tried writing an implementation of a data structure tailored to the specific use case of the Named character reference state of HTML tokenization (here's the link to that experiment). Recently, I took that implementation, ported it to C++, and used it to make some efficiency gains and fix some spec compliance issues in the Ladybird browser.

Throughout this, I never actually looked at the implementations used in any of the major browser engines (no reason for this, just me being dumb). However, now that I have looked at Blink/WebKit/Gecko (Chrome/Safari/Firefox, respectively), I've realized that my implementation seems to be either on-par or better across the metrics that the browser engines care about:

Efficiency (at least as fast, if not slightly faster)
Compactness of the data (uses ~60% of the data size of Chrome's/Firefox's implementation)
Ease of use

Note: I'm singling out these metrics because the python script that generates the data structures used for named character reference tokenization in Blink (the browser engine of Chrome/Chromium) contains this docstring (emphasis mine):

"""This python script creates the raw data that is our entity database. The representation is one string database containing all strings we could need, and then a mapping from offset+length -> entity data. That is compact, easy to use and efficient."""

So, I thought I'd take you through what I came up with and how it compares to the implementations in the major browser engines. Mostly, though, I just think the data structure I used is neat and want to tell you about it (fair warning: it's not novel).

What is a named character reference?🔗

A named character reference is an HTML entity specified using an ampersand (&) followed by an ASCII alphanumeric name. A predefined set of names will get transformed during HTML parsing into particular code point(s). For example, &bigcirc; is a valid named character reference that gets transformed into the symbol ◯, while &amp; will get transformed into &.

Note: The ◯ symbol is the Unicode code point U+25EF, which means it could also be specified as a numeric character reference using either &#x25EF; or &#9711;. We're only focusing on named character references, though.

Here are a few properties of named character references that are relevant for what we'll ultimately be aiming to implement:

Always start with &
Only contain characters in the ASCII range
Case-sensitive
Usually, but not always, end with ;
Are transformed into either 1 or 2 code points

Irrelevant side note: those code point(s) usually make up one grapheme (i.e. most second code points are combining code points), but not always (e.g. &fjlig; maps to U+0066 U+006A, which are just the ASCII letters fj)

Most crucially, though, the mappings of named character references are fixed. The HTML standard contains this note about named character references:

Note: This list is static and will not be expanded or changed in the future.

This means that it's now safe to represent the data in the minimum amount of bits possible without any fear of needing to accommodate more named character reference mappings in the future.

Note: This is a big part of why I think better solutions than mine should be very possible.
I feel like I've only scratched the surface in terms of a purpose-built data structure for this particular task.

Named character reference tokenization overview🔗

I'm specifically going to be talking about the Named character reference state of HTML tokenization. You can read the spec for it if you want (it's fairly short) but there's really only one sentence we're interested in:

Consume the maximum number of characters possible, where the consumed characters are one of the identifiers in the first column of the named character references table.

This sentence is really the crux of the implementation, but the spec is quite hand-wavy here and conceals a lot of complexity. The example given a bit later in the spec hints at why the above sentence is not so straightforward (edited to exclude irrelevant details; for context, &not is a valid named character reference, as is &notin;):

Example: If the markup contains the string I'm &notit; I tell you, the character reference is parsed as "not", as in, I'm ¬it; I tell you. But if the markup was I'm &notin; I tell you, the character reference would be parsed as "notin;", resulting in I'm ∉ I tell you.

That is, with the string &notit;, the characters up to and including &noti can still lead to a valid named character reference (&notin;, among others), so we only know that &notit; is invalid once we've reached &notit, which can no longer lead to a valid named character reference.

What this means is that there needs to be some amount of backtracking involved, as the goal is to consume only the characters that are part of the longest valid named character reference. That's not the only complication, though...

The spectre of document.write🔗

Due to document.write, the characters of a named character reference can arrive in the input stream in pieces, like so:

<script>document.write("&not");</script>in;

The expected result after parsing is ∉, meaning the resolved character reference is &notin;. Let's go through why that is the case.

Note: I'm not super familiar with the tree construction side of HTML parsing, so don't expect my explanation below to be fully accurate from that point of view. My explanation is solely focused on the tokenizer's side of things.

After the closing script tag is tokenized, the parser adds an "insertion point" after it, and then the script within the tag itself is executed. So, right before the script is executed, the tokenizer input can be visualized as (where ▾ is the insertion point):

<script>document.write("&not");</script>▾in;

And then after the tag, &not comes next in the input stream (which was inserted by document.write), and then the characters after &not are in;, so ultimately a tokenizer going character-by-character should see an unbroken &notin;, recognize that as a valid character reference, and translate it to ∉.

Note: The ordering here matters. For example, if you put &not first and use document.write to add a trailing in; like so:

&not<script>document.write("in;");</script>

then it does not result in ∉ (the result of &notin;), but instead ¬in;, since from the tokenizer's point of view it sees &not<, and therefore treats &not as a character reference since the < cannot lead to any other valid character references.

Then there's the possibility of a script that emits the character reference across multiple document.write calls, for example writing &, n, o, and t one-at-a-time before the trailing in;. This, too, is expected to result in ∉ (i.e. &notin;). Keep in mind that the tokenizer will advance after each document.write call, so if the tokenizer tries to lookahead past the insertion point at any point before the full script is run, it will resolve the wrong string as a character reference (&in;—which is itself a valid reference for ∈—&nin;, or &noin;).

Here's a visualization of the insertion point and the various states of the input stream after each document.write call:

&▾in;  →  &n▾in;  →  &no▾in;  →  &not▾in;

Therefore, while HTML tokenizers can theoretically look ahead, they can never look past the end of an insertion point.
What this all means, implementation-wise🔗

All of this is to say that a "consume the longest valid named character reference" implementation probably needs to use one of two strategies:

Lookahead (but never beyond an insertion point) until we're certain that we have enough characters to rule out a longer named character reference. If we do not yet have enough characters to be certain, backtrack and try again until we can be certain.
Never lookahead, and instead match character-by-character until we're certain it's no longer possible to match a longer valid named character reference. Backtrack to the end of the longest full match found.

The second strategy seems like the better approach to me, so that's what my implementation will be focused on. We'll see both strategies later on, though.

Note: I'm not sure if it's feasible for an implementation using the first (lookahead) strategy to correctly handle the case where document.write emits a named character reference a few characters at a time, like the earlier example that writes &, n, o, and t before a trailing in; (this is expected to be parsed into ∉, the same as the other examples above).

Trie implementation🔗

So, we want an implementation that can iterate character-by-character and (at any point) efficiently determine if it's possible for the next character to lead to a longer valid named character reference. A data structure that seems pretty good for this sort of thing is a trie. A trie is a specialized tree where each node contains a character that can come after the character of its parent node in a set of words. Below is a representation of a trie containing this small subset of named character references:

&not → ¬
&notinva; → ∉
&notinvb; → ⋷
&notinvc; → ⋶
&notniva; → ∌
&notnivb; → ⋾
&notnivc; → ⋽

Note the lack of a semicolon at the end of &not. This is a real variant, and it was chosen over &not; to simplify the example.

[Trie diagram: n → o → t, then branching into i → n → v → {a, b, c} → ; and n → i → v → {a, b, c} → ;]

Notes:
The & is excluded from the trie since it's the first character of every named character reference.
The nodes with a red outline are those marked as the end of a valid word in the set.
Typically, trie visualizations put the letters on the connections between nodes rather than on the nodes themselves. I'm putting them on the nodes themselves for reasons that will be discussed later.

With such a trie, you search for the next character within the list of the current node's children (starting from the root). If the character is found within the children, you then set that child node as the current node and continue on for the character after that, etc. For invalid words, this means that you naturally stop searching as soon as possible (after the first character that cannot lead to a longer match). For valid words, you trace a path through the trie and end up on an end-of-word node (and you may also pass end-of-word nodes on the way there). Here's what the path through the trie would look like for the named character reference &notinvc;:

[The same trie diagram, with the path n → o → t → i → n → v → c → ; highlighted and the end node labeled with ⋶]

You'll notice that the mapped code point (⋶) is present on the diagram above as well. This is because it is trivial to use a trie as a map to look up an associated value, since each word in the set must end at a distinct node in the trie (e.g. no two words can share an end-of-word node). Conveniently, using the trie as a map is exactly what we want to be able to do for named character references, since ultimately we need to convert the longest matched named character reference into the relevant code point(s).

A brief detour: Representing a trie in memory🔗

Note: The code examples in this section will be using Zig syntax.
One way to represent a trie node is to use an array of optional pointers for its children (where each index into the array represents a child node with that byte value as its character), like so: const Node = struct { children : [ 256 ] ?* Node , end_of_word : bool , }; Earlier, I said that trie visualizations typically put the letters on the connections between nodes rather than the nodes themselves, and, with this way of representing the trie, I think that makes a lot of sense, since the connections are the information being stored on each node. Note: For the examples in this section, we'll use a trie that only contains the words GG , GL , and HF . So, this representation can be visualized like so: G H G L F With this, checking if a character can come after the current node is a straightforward O(1) array access: if ( node . children [ c ] != null ) { } but it comes at the cost of a lot of potentially wasted space, since most nodes will have many null children. One way to mitigate the wasted space would be to switch from an array of children to a linked list of children, where the parent stores an optional pointer to its first child, and each child stores an optional pointer to its next sibling: const Node = struct { char : u8 , first_child : ?* Node , next_sibling : ?* Node , end_of_word : bool , }; Now that char is stored on each node directly, I (in turn) think it makes sense to visualize the trie with the characters shown on the nodes themselves, like so: G H G L F While this 'linked list' representation saves on memory, it transforms the search for a particular child into a O(n) linear scan across the children: var node = starting_node . first_child orelse return null ; while ( true ) { if ( node . char == c ) { } node = node . next_sibling orelse return null ; } This linear search can be slow, especially if the nodes are individually heap-allocated and therefore could be very spread out in memory, leading to a lot of random memory accesses and cache misses. Additionally, pointers themselves take up quite a bit of space (8 bytes on 64-bit machines). If we ultimately want to decrease the size of the node, getting rid of the pointer fields would be helpful as well. We can solve multiple of these problems at once by: Enforcing that all nodes are proximate in memory by storing them all in one array Replace all pointers with indexes into that array Ensure that children are always contiguous (i.e. to access a sibling you just increment the index by 1) With this approach, Node could look like this: const Node = packed struct { char : u8 , first_child_index : u3 , last_sibling : bool , end_of_word : bool , }; And the array of nodes for this particular example trie would look like this: const nodes = [ 6 ] Node { .{ . first_child_index = 1 , . char = 0 , . last_sibling = true , . end_of_word = false }, .{ . first_child_index = 3 , . char = 'G' , . last_sibling = false , . end_of_word = false }, .{ . first_child_index = 5 , . char = 'H' , . last_sibling = true , . end_of_word = false }, .{ . first_child_index = 0 , . char = 'G' , . last_sibling = false , . end_of_word = true }, .{ . first_child_index = 0 , . char = 'L' , . last_sibling = true , . end_of_word = true }, .{ . first_child_index = 0 , . char = 'F' , . last_sibling = true , . end_of_word = true }, }; Note: first_child_index having the value 0 doubles as a 'no children' indicator, since index 0 is always the root node and therefore can never be a valid child node. 
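To make the 'flattened' representation concrete, here's a sketch of what searching a child list and checking a whole word could look like, reusing the Node and nodes definitions above. findChild and contains are illustrative helpers of my own naming, not code from the linked experiment:

```zig
// Sketch: search the contiguous child list of `parent` for the character `c`.
fn findChild(nodes: []const Node, parent: Node, c: u8) ?Node {
    if (parent.first_child_index == 0) return null; // no children
    var i: usize = parent.first_child_index;
    while (true) : (i += 1) {
        if (nodes[i].char == c) return nodes[i];
        if (nodes[i].last_sibling) return null; // end of the child list
    }
}

// Sketch: walk the trie one character at a time.
fn contains(nodes: []const Node, word: []const u8) bool {
    var node = nodes[0]; // index 0 is always the root node
    for (word) |c| {
        node = findChild(nodes, node, c) orelse return false;
    }
    return node.end_of_word;
}
```

For example, contains(&nodes, "GL") walks root → 'G' → 'L' and returns true, while contains(&nodes, "GX") scans the 'G' node's child list without finding 'X' and returns false.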
This representation can be visualized like so: 0 1 2 3 4 5 G H G L F You might find it interesting to note that this diagram is functionally the same as the previous ('children as a linked list') one; the connections are exactly the same, the nodes have just been rearranged. This still means that searching a child list uses a O(n) linear scan, but this representation makes those searches much more friendly to the CPU, and greatly reduces the size of each node. Some hard numbers🔗 To get an idea of how the different representations compare, here's a breakdown for a trie containing the full set of named character references (2,231 words). The code I'm using for the benchmarks below is available here. Data size🔗 Note: The sizes below assume a 64-bit architecture, i.e. pointers are 8 bytes wide. Representation 1 ('connections'): Each node contains a fixed-size array of optional child node pointers Each node is 2056 bytes wide (using [256]?*Node as the children field) There are 9,854 nodes in the trie, so 2,056 * 9,854 = 20,259,824 bytes total for the full trie ( 19.32 MiB ) ('connections'): Representation 2 ('linked list'): Each node contains a pointer to its first child and its next sibling Each node is 24 bytes wide There are 9,854 nodes in the trie, so 24 * 9,854 = 236,496 bytes total for the full trie ( 230.95 KiB ) ('linked list'): Representation 3 ('flattened'): Each node contains the index of its first child, and all nodes are allocated in one contiguous array Each node is 4 bytes wide There are 9,854 nodes in the trie, so 4 * 9,854 = 39,416 bytes total for the full trie ( 38.49 KiB ) ('flattened'): That is, the 'flattened' version is 1/ 514 the size of the 'connections' version, and 1/ 6 the size of the 'linked list' version. Note: I went with [256]?*Node to keep the representations similar in what information they are capable of storing. Instead, you could make the trie only support ASCII characters and use [128]?*Node for the children field, which would roughly cut the size of the trie in half (from 19.32 MiB to 9.70 MiB). In the 'linked list' and 'flattened' representations, restricting the characters to the ASCII set would only decrease the size of the nodes by 1 bit ( char field would go from a u8 to a u7 ). As mentioned, the 'linked list' and 'flattened' versions sacrifice the O(1) lookup of the 'connections' version in favor of reducing the data size, so while the 'flattened' version claws some performance back from the 'linked list' version, the 'connections' version is the fastest: Representation 1 ('connections'): 501.596ms ( 50ns per contains call) ('connections'): Representation 2 ('linked list'): 965.138ms ( 96ns per contains call) ('linked list'): Representation 3 ('flattened'): 609.215ms ( 60ns per contains call) ('flattened'): One interesting thing to note is that the above results for representations 1 & 2 rely on a friendly allocation pattern for the nodes (i.e. 
the memory addresses of the nodes happening to end up fairly close to eachother) This is admittedly pretty likely when constructing a trie all at once, but, if we intentionally force a horrendous allocation pattern, where each allocated node gets put on an entirely separate page, we can see the effects very clearly: Representation 1 ('connections'): 1.025s ( 102ns per contains call) (each contains call takes ~2x longer than it did) ('connections'): Representation 2 ('linked list'): 4.372s ( 437ns per contains call) (each contains call takes ~4x longer than it did) ('linked list'): Representation 3 ('flattened'): No difference since it always allocates one contiguous chunk of memory ('flattened'): If we run the relevant benchmarks through poop , we can confirm the cause of the slowdown (these results are from the 'connections' version): Benchmark 1 : ./trie-friendly-allocations measurement mean ± σ min … max outliers delta wall_time 496 ± 24.1 440 … 518 cpu_cycles 2.01 ± 100 1.78 … 2.11 instructions 1.94 ± 26.3 1.94 … 1.94 cache_references 107 ± 3.40 98.8 … 109 2 (18%) cache_misses 33.1 ± 1.15 30.3 … 34.0 2 (18%) branch_misses 21.7 ± 20.7 21.7 … 21.7 Benchmark 2 : ./trie-horrendous-allocations measurement mean ± σ min … max outliers delta wall_time 1.07 ± 21.9 1.05 … 1.11 💩+116.4% ± 5.5% cpu_cycles 4.17 ± 92.4 4.09 … 4.33 💩+106.9% ± 5.6% instructions 1.92 ± 38.4 1.92 … 1.92 cache_references 145 ± 1.63 144 … 147 💩+ 35.3% ± 3.3% cache_misses 62.2 ± 839 61.8 … 63.7 💩+ 88.3% ± 3.7% branch_misses 21.5 ± 51.6 21.4 … 21.5 Note that the instruction counts are roughly the same between the 'friendly' and 'horrendous' versions, so the increased cpu_cycles and wall_time can presumably be attributed to the increase in cache misses and pointer chasing (the trie code itself is identical between the two versions). When using a DAFSA for this particular task, we're trading off a ~20% difference in lookup speed for 2-3 orders of magnitude difference in data size. This seems pretty okay, especially for what we're ultimately interested in implementing: a fully static and unchanging data structure, so there's no need to worry about how easy it is to modify after construction. For completeness, I'll also note that it's possible to eliminate pointers while still using the 'connections' representation. For example, it could be done by allocating all nodes into an array, and then making the children field something like [256]u16 where the values are indexes into the array of nodes ( u16 because it needs to be able to store an index to one of the 9,854 nodes in the trie, but it's not ?u16 because the index 0 can double as null since the root can never be a child) This would keep the O(1) complexity to find a particular child and decrease the data size, but it would still be 2 orders of magnitude larger than the 'flattened' version (using [256]u16 would make the trie take up 4.81 MiB, [128]u16 would be 2.41 MiB). An important note moving forward🔗 In the next section, I will show diagrams that look like this in an effort to make them easier to understand: G H G L F but keep in mind that really the 'flattened' representation (representation 3) is being used, i.e. the most accurate visualization of the representation would look like this: 012345 G H G L F DAFSA implementation🔗 A while back at the same Zig meetup where I gave a talk about my Windows resource compiler, Niles Salter aka Validark gave a talk titled Better data structures and where to find them. 
It was nominally about his novel autocomplete data structure, but the stated purpose of the talk was to get people interested in learning about data structures and potentially inventing their own. Unfortunately, the recording/audio quality didn't end up being good enough to warrant uploading the talk anywhere (see the recording of my talk to get a sense of that). During the talk, I thought back to when I contributed to an HTML parser implementation and had to leave proper named character reference tokenization as a TODO because I wasn't sure how to approach it. I can't remember if a deterministic acyclic finite state automaton (DAFSA) was directly mentioned in the talk, or if I heard about it from talking with Niles afterwards, or if I learned of it while looking into trie variations later on (since the talk was about a novel trie variation), but, in any case, after learning about the DAFSA, it sounded like a pretty good tool for the job of named character references, so I revisited named character reference tokenization with that tool in hand. In other words, the talk (at least partially) served its purpose for me in particular. I didn't come up with anything novel, but it got me to look into data structures more and I have Niles to thank for that. What is a DAFSA?🔗 Note: There are a few names for a DAFSA: DAWG, MA-FSA, etc. A DAFSA is essentially the 'flattened' representation of a trie, but, more importantly, certain types of redundant nodes are eliminated during its construction (the particulars of this aren't too relevant here so I'll skip them; see here if you're interested). Going back to the same subset of named character references as the example in the "Trie implementation" section above, a DAFSA would represent that set of words like so: n o t i n n i v a b c ; As you can see, the v , a , b , c and ; nodes are now shared between all the words that use them. This takes the number of nodes down to 13 in this example (compared to 22 for the trie). The downside of this node consolidation is that we lose the ability to associate a given end-of-word node with a particular value. In this DAFSA example, all words except not end on the exact same node, so how can we know where to look for the associated value(s) for those words? Here's an illustration of the problem when matching the word ⋶ : n o t i n n i v a b c ; To get around this downside, it'd be possible to use something like a separate hash map or devise some minimal perfect hashing scheme to lookup the associated code point(s) for a matching word after the fact, but, luckily, we don't have to worry too much about that because it turns out it is possible to do... Minimal perfect hashing using a DAFSA🔗 First detailed in Applications of finite automata representing large vocabularies (Cláudio L. Lucchesi, Tomasz Kowaltowski, 1993) (pdf), the technique for minimal perfect hashing using a DAFSA is actually rather simple/elegant: Within each node, store a count of all possible valid words from that node. 
For the example we've been using, those counts would look like this:

[The same DAFSA diagram as above, with each node annotated with its count: 7, 7, 7, 3, 3, 3, 3, 3, 1, 1, 1, 1]

Then, to get a unique index for a given word, traverse the DAFSA as normal, but:

For any non-matching node that is iterated when searching a list of children, add their number to the unique index
For nodes that match the current character, if the node is a valid end-of-word, add 1 to the unique index

Note that the "non-matching node that is iterated when searching a list of children" part of this algorithm effectively relies on the DAFSA using the O(n) search of the 'flattened' representation of a trie discussed earlier. For example, if we had a DAFSA with a, b, c, and d as possible first characters (in that order), and the word we're looking for starts with c, then we'll iterate over a and b when looking for c in the list of children, so we add the numbers of the a and b nodes (whatever they happen to be) to the unique index. Here's an illustration:

[Diagram: a list of sibling nodes a, b, c, d, where a and b are iterated over before c is found]

Note: If the a or b nodes were marked as valid end-of-words, we ignore that bit of information—we only add 1 to the unique index if a matching node (in this case c) is marked as an end-of-word.

For a given word in the set, applying this algorithm will produce a number between 1 and the total number of words in the DAFSA (inclusive), and it's guaranteed that each word will end up with a unique number (i.e. this is a minimal perfect hash). Here's what that looks like for the example we've been using:

[Animated illustration: traversing the DAFSA for a chosen word and accumulating its unique index step-by-step]

After you have the unique index of a word, you can then use a lookup array for the associated values and index into it using unique_index - 1.

Note: It's not relevant here, but I think it's pretty cool that it's also possible to reconstruct the associated word if you have its unique index. You basically do the same thing as when you are calculating a unique index, but instead of adding to the unique index you're subtracting from it as you traverse the DAFSA. A tidbit that I accidentally ran into is that this 'reverse lookup' only works if you include the end-of-word nodes themselves in their own 'counts of possible words from that node'.

Trie vs DAFSA for named character references🔗

Ok, so now that we have two different data structures that seem pretty well suited for named character reference matching—a trie and a DAFSA—how do they compare? It's now (finally) time to start using the full set of named character references and all of their mapped code point(s) to see how they stack up. Some numbers to keep in mind upfront:

There are 2,231 named character references total
A trie will use 9,854 nodes to encode the set of named character references
A DAFSA will use 3,872 nodes to encode the set of named character references

Another brief detour: representing the mapped value(s)🔗

As mentioned earlier, each named character reference is mapped to either one or two code points. Unicode code points have a range of 0x0 - 0x10FFFF, but if we actually look at the set of code points used by named character references, there are a few properties worth noting:

The maximum value of the first code point is U+1D56B, which takes 17 bits to encode, so all first code point values can fit into a 17 bit wide unsigned integer.
The set of distinct second code point values is actually very small, with only 8 unique code points.
This means that an enum that's only 4 bits wide (3 bits for the 8 different values, 1 additional bit to encode 'no second code point') can be used to store all the information about the second code point. With both of these properties taken together, it's possible to encode the mapped code point values for any given named character reference in 21 bits. With padding between the elements of an array of 21-bit integers, though, that will round up to 4 bytes per element (11 bits of padding), so it ends up being the same as if 32 bit integers were used. Here's a diagram of a possible memory layout of an array using this representation, where is the bits of the first code point, is the bits of the second code point, and is the padding bits between elements: byte index element index 0 1 2 3 4 5 6 7 8 9 10 11 0 1 2 However, while using 21 bits to represent the mapped code point(s) does not automatically lead to any saved bytes over a 32 bit integer, it opens up the possibility to tightly pack an array of 21-bit elements in order to actually save some bytes. Yet, doing so means that storing/loading elements from the tightly packed array becomes trickier (both computationally and implementation-wise). Here's the same diagram as before, but with the elements tightly packed (no padding bits between elements): byte index element index 0 1 2 3 4 5 6 7 8 9 10 11 0 1 2 3 4 You'll notice that no elements past the first start or end on byte boundaries, meaning in order to load an element, a fair bit of bitwise operations are required (bit shifting, etc). This makes array accesses more expensive, but that isn't necessarily a big deal for our use case, since we only ever access the array of values once per named character reference, and only after we're certain we have a match. So, tightly bitpacking the value array is a viable way to save some extra bytes for our purposes. Note: This is just context for the next section where I'll mention data sizes for versions that use the "regular array" representation or the "tightly bitpacked array" representation for the values. Data size🔗 For the DAFSA, the size calculation is pretty straightforward: The data of each node can fit into 4 bytes with a few bits to spare (expand below if you're interested in the details), and there are 3,872 nodes in the DAFSA, so that's 15,488 bytes total Nitty-gritty DAFSA node size details Ultimately, the goal is to keep the node size less than or equal to 32 bits while storing the following data on each node: An ASCII character This can technically be represented in 6 bits, since the actual alphabet of characters used in the list of named character references only includes 61 unique characters ('1'...'8', ';', 'a'...'z', 'A'...'Z'). However, to do so you'd need to convert between the 6 bit representation and the actual ASCII value of each character to do comparisons. We aren't desperate to save bits, though, so we can get away with representing this value as 8 bits, which makes comparisons with any byte value trivial. A "count of all possible valid words from that node" Empirically, the highest value within our particular DAFSA for this field is 168 , which can fit into 8 bits. An "end of word" flag 1 bit A "last sibling" flag 1 bit An "index of first child" field There are 3,872 nodes in our DAFSA, so all child indexes can fit into 12 bits. In total, that's 8 + 8 + 1 + 1 + 12 = 30 bits, so we have 2 bits to spare while remaining within the 32 bit target size with this representation. 
There are 2,231 named character references, and the mapped code points for each of them need either 4 bytes (if using a regular array) or 21 bits (if using a tightly bitpacked array) For the regular array, that's 8,924 bytes total For the tightly bitpacked array, that's 5,857 bytes total So, the DAFSA and the lookup array for the values (together) will use either 24,412 bytes (23.84 KiB) or 21,345 bytes (20.84 KiB) total. For the trie, there's slightly more to discuss around data representation before we can get to the data size calculations. It was glossed over in the Trie implementation section, but when using the 'flattened' representation of a trie there are effectively two ways to handle value lookups for each word: Store an array of 2,231 values (one for each named character reference) and then also store an index into that array on each end-of-word node. This increases the bit size needed for each node by 12 bits (since 2,231 can be encoded in 12 bits) Store an array of 9,854 values (one for each node in the trie) and then index into the value array by re-using the index of the end-of-word node. This makes the size of the value array 4.42 times larger, but does not affect the node size Beyond that, the details aren't super relevant. Suffice it to say that each node will either take up 5 bytes or 3 bytes depending on which of the two 'value array' strategies you choose (note: I actually mean 5 bytes and 3 bytes, as they can be represented as one array of 2- or 4-byte values and one array of 1-byte values so padding between elements doesn't factor in). The summary is that, depending on the particular representation, the trie will use between 57,993 bytes and 68,777 bytes (56.63 KiB to 67.16 KiB) total, or, if the values array is tightly bitpacked, between 54,926 bytes and 55,227 bytes (53.64 KiB to 53.93 KiB) total. Ultimately, the data size of the trie is going to be at least 2x larger than the equivalent DAFSA. Luckily, there's an existing HTML parser implementation written in Zig called rem that uses a trie for its named character reference tokenization, so getting some relevant benchmark results from actual HTML parsing for a trie vs DAFSA comparison was pretty easy. Note: The trie in rem uses the 'flattened' representation of a trie, with a value array that re-uses the index of the end-of-word node (so there are a lot of empty slots in the value array). 
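Before getting to the benchmark numbers, here's a rough sketch of what the DAFSA node layout and the unique-index calculation from the earlier sections could look like in Zig. The names (DafsaNode, uniqueIndex) and the exact field layout are illustrative—this is an approximation of the idea, not the exact code used in the linked experiment or the Ladybird pull request:

```zig
// A possible 32-bit layout for a DAFSA node (30 bits used, 2 spare).
const DafsaNode = packed struct(u32) {
    char: u8, // the character stored on this node
    number: u8, // count of all possible valid words from this node
    end_of_word: bool,
    last_sibling: bool,
    first_child_index: u12, // 0 means "no children"
    _spare: u2 = 0,
};

// Sketch of the minimal perfect hash lookup: walk the DAFSA one character at
// a time, accumulating a unique index along the way. Returns a number between
// 1 and the total word count if `word` is in the set, or null otherwise.
fn uniqueIndex(dafsa: []const DafsaNode, word: []const u8) ?u16 {
    var index: u16 = 0;
    var child_index: usize = dafsa[0].first_child_index; // node 0 is the root
    var ends_word = false;
    for (word) |c| {
        if (child_index == 0) return null; // no children left to search
        ends_word = false;
        while (true) {
            const node = dafsa[child_index];
            if (node.char == c) {
                // A matching end-of-word node contributes 1.
                if (node.end_of_word) {
                    index += 1;
                    ends_word = true;
                }
                child_index = node.first_child_index;
                break;
            }
            // Every non-matching sibling iterated past contributes its count.
            index += node.number;
            if (node.last_sibling) return null; // no child matches `c`
            child_index += 1;
        }
    }
    return if (ends_word) index else null;
}
```

A real matcher for the tokenizer also needs to remember the unique index at the most recent end-of-word node it passed so that it can backtrack to the longest full match, but the core of the traversal is the same. The additions while scanning past non-matching siblings are the "extra work to build up the unique index" that shows up in the instruction counts below.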
From my benchmarking, it turns out that the DAFSA implementation uses more instructions than the trie implementation because it needs to do extra work to build up the unique index during iteration, but the DAFSA saves on cache misses (presumably due to the smaller overall size of the DAFSA and its node re-use) and everything just about evens out in terms of wall clock time: Benchmark 1 : ./trie measurement mean ± σ min … max outliers delta wall_time 44.4 ± 1.57 43.0 … 61.8 peak_rss 61.8 ± 62.5 61.6 … 61.9 cpu_cycles 49.5 ± 463 48.5 … 51.9 instructions 76.2 ± 2.95 76.2 … 76.2 cache_references 2.54 ± 21.6 2.48 … 2.63 cache_misses 119 ± 1.64 115 … 128 branch_misses 322 ± 1.02 319 … 328 Benchmark 2 : ./dafsa measurement mean ± σ min … max outliers delta wall_time 44.3 ± 561 43.8 … 48.4 peak_rss 61.6 ± 66.6 61.5 … 61.7 cpu_cycles 53.0 ± 566 52.3 … 54.9 💩+ 7.0% ± 0.1% instructions 78.7 ± 2.59 78.7 … 78.7 💩+ 3.2% ± 0.0% cache_references 2.49 ± 30.0 2.43 … 2.60 ⚡- 2.0% ± 0.1% cache_misses 90.9 ± 1.32 86.4 … 95.4 ⚡- 23.5% ± 0.2% branch_misses 331 ± 730 330 … 337 💩+ 2.9% ± 0.0% Note: Bitpacking the value array of the trie doesn't seem to affect the overall performance, and we see the same sort of pattern: more instructions (due to the added bit shifting during value lookup), but fewer cache misses. Benchmark results Benchmark 1: ./trie measurement mean ± σ min … max outliers delta wall_time 43.3 ± 249 43.1 … 45.9 peak_rss 61.8 ± 63.5 61.6 … 61.9 cpu_cycles 49.4 ± 159 49.1 … 50.3 instructions 76.2 ± 2.61 76.2 … 76.2 cache_references 2.52 ± 13.6 2.48 … 2.57 cache_misses 118 ± 933 116 … 122 branch_misses 321 ± 548 319 … 325 Benchmark 2: ./trie-bitpacked-values measurement mean ± σ min … max outliers delta wall_time 43.3 ± 504 43.0 … 46.2 peak_rss 61.6 ± 64.0 61.5 … 61.7 cpu_cycles 48.9 ± 180 48.6 … 50.0 instructions 77.4 ± 2.56 77.4 … 77.4 💩+ 1.6% ± 0.0% cache_references 2.53 ± 23.7 2.48 … 2.68 cache_misses 104 ± 1.18 101 … 107 ⚡- 12.5% ± 0.2% branch_misses 335 ± 1.05 333 … 344 💩+ 4.3% ± 0.0% For the use-case of named character references, using a DAFSA instead of a trie cuts the size of the data at least in half while performing about the same. The Ladybird implementation🔗 First, let's take a look at what the Ladybird implementation looked like before my changes: state implementation, matching implementation. Here's a rough summary of the approach, in pseudo-code: var match = null ; var remaining_input = input . substring ( current_offset , input . length - current_offset ); for ( var entity : entities ) { if ( remaining_input . starts_with ( entity )) { if ( match == null or entity . length > match . length ) { match = entity ; } } } if ( match != null ) { consume_and_advance ( match . length - 1 ); } This has two major problems: It is inefficient, since the input string is being compared against the entire list of named character references one-at-a-time It does not handle document.write correctly, as previously discussed in The spectre of document.write . It's doing lookahead, but it does not account for insertion points, as it makes the mistake of looking past insertion points. So, if document.write is used to write one-character-at-a-time, it will attempt to resolve the named character reference before all the characters are available (e.g. in the case of in; , it will erroneously try matching against ∈ and then exit the named character reference state) My pull request focused on fixing both of those problems. 
The data structure I used is exactly the DAFSA implementation as described so far, with a value array that is not bitpacked, because: As we've seen, it doesn't affect performance, so complicating the implementation didn't seem worth it Using an enum for the second code point would either mean using an extra bit (since enums are signed integers in C++) or using some workaround to keep it using the minimal number of bits The last piece of the puzzle that I haven't mentioned yet is the NamedCharacterReferenceMatcher , which handles DAFSA traversal while providing an API well-tailored to the named character reference state, specifically. The details aren't too important, so here are the relevant bits of the exposed API: bool try_consume_code_point ( u32 c ); u8 overconsumed_code_points (); Optional < NamedCharacterReferenceCodepoints > code_points (); So, with all that context, here's what the Ladybird implementation looks like after my changes (slightly simplified for clarity; here's the full implementation): BEGIN_STATE ( NamedCharacterReference ) { if ( matcher . try_consume_code_point ( current_code_point )) { temporary_buffer . append ( current_code_point ); continue ; } else { DONT_CONSUME_CHARACTER ; } auto overconsumed_code_points = matcher . overconsumed_code_points (); if ( overconsumed_code_points > 0 ) { backtrack_to ( current_offset - overconsumed_code_points ); temporary_buffer . shrink_by ( overconsumed_code_points ); } auto mapped_code_points = matcher . code_points (); if ( mapped_code_points ) { } else { FLUSH_CODEPOINTS_CONSUMED_AS_A_CHARACTER_REFERENCE ; SWITCH_TO ( AmbiguousAmpersand ); } } Note also that Ladybird follows 'spec-driven development', meaning that the goal is for its code to be implemented to match the text of the relevant specification as closely as possible. Here's what the named character reference state specification looks like for reference: 13.2.5.73 Named character reference state Consume the maximum number of characters possible, where the consumed characters are one of the identifiers in the first column of the named character references table. Append each character to the temporary buffer when it's consumed. If there is a match ... Otherwise Flush code points consumed as a character reference. Switch to the ambiguous ampersand state. Overall, these changes made the Ladybird tokenizer: About 1.23x faster on an arbitrary set of HTML files from real websites (albeit an old set of files) on an arbitrary set of HTML files from real websites (albeit an old set of files) About 8x faster on a very named-character-reference-specific benchmark (tokenizing an HTML file with nothing but tens of thousands of valid and invalid named character references) on a very named-character-reference-specific benchmark (tokenizing an HTML file with nothing but tens of thousands of valid and invalid named character references) Roughly 95 KiB smaller (very crude estimate, solely judged by the difference in the final binary size) (very crude estimate, solely judged by the difference in the final binary size) Handle document.write emitting one-character-at-a-time correctly But that's all pretty low-hanging fruit, as the previous Ladybird implementation had some obvious problems. In fact, we can actually improve on this some more (and will later on), but I think it's worth looking at the Firefox/Chrome/Safari implementations now to see how this DAFSA version stacks up against them. 
Comparison to the major browser engines🔗 Before we get to the actual comparisons, there's (unfortunately) a lot that has to be discussed. First, you'll notice from the Ladybird benchmarks above that an 8x improvement in a very named-character-reference-specific benchmark only led to a 1.23x improvement in the average case. This points to the fact that named character reference matching is not something that HTML tokenizers typically do very often, and that named character reference matching being fast enough is likely just fine, all things considered. Note: I realize that this makes this whole endeavor rather pointless, but hopefully it's still interesting enough without the possibility of making a huge performance impact. Second, instead of going the route of putting my DAFSA implementation into the other browsers' engines to compare, I went with taking the other browsers' implementations and putting them into Ladybird. Not only that, though, I also made the Firefox/Chrome/Safari implementations conform to the API of NamedCharacterReferenceMatcher (for reasons that will be discussed soon). So, in order for my benchmarking to be accurate you'll have to trust that: I faithfully integrated the Firefox/Chrome/Safari implementations into Ladybird The performance characteristics exhibited would hold when going the other direction (putting my implementation into their tokenizer) The benchmarks I'm using can actually give useful/meaningful results in the first place For the first point, the only real assurance I can give you is that the same number of web platform tests within the html/syntax/parsing category were passing with each browser's implementation integrated. The second point will be discussed more later on. For the third point, we have to go on yet another detour... On the difficulty of benchmarking🔗 My initial benchmarking setup was straightforward: Have a separate branch of the codebase for each implementation Compile separate benchmark binaries for each branch Run each version and compare the results However, this approach ultimately left me with some inexplicable results. Here's an example of such results, where the benchmark that exclusively tests the relevant parts of the code shows the Blink (Chrome) implementation being slightly faster than mine: Benchmark 1 : ./TestHTMLTokenizerBlink --bench named_character_references measurement mean ± σ min … max outliers delta wall_time 118 ± 1.68 115 … 121 Benchmark 2 : ./TestHTMLTokenizerDafsa --bench named_character_references measurement mean ± σ min … max outliers delta wall_time 120 ± 1.50 117 … 123 💩+ 2.1% ± 0.4% Yet, when I ran a benchmark that only occasionally exercises the code that I changed (a benchmark using a sample of real HTML files), I got unexpected results. What we should expect is either a very slight difference in the same direction, or (more realistically) no discernible difference, as this effect size should not be noticeable in the average case. Instead, I got the opposite result and a larger effect: Benchmark 1 : ./TestHTMLTokenizerBlink --bench benchfiles measurement mean ± σ min … max outliers delta wall_time 1.85 ± 28.4 1.79 … 1.88 2 (17%) Benchmark 2 : ./TestHTMLTokenizerDafsa --bench benchfiles measurement mean ± σ min … max outliers delta wall_time 1.79 ± 20.7 1.76 … 1.84 ⚡- 3.2% ± 1.1% Taken together, the only explanation for these sorts of results would be that the parts of the code that I didn't change got faster in one version, and not the other. 
The elephant demands attention🔗 Well, this explanation—that the code I didn't change got faster—is actually likely to be the correct one, and I've known about this possibility ever since I watched the excellent talk "Performance Matters" by Emery Berger. I recommend watching the talk, but the short explanation is that changes to one part of the codebase may inadvertently cause the compiler to reorganize unrelated parts of the compiled binary in ways that affect performance ('layout'). However, while I was aware of this possible confounder, up until now I have basically ignored it whenever doing benchmarking (and I'm assuming that's true for most programmers). This is the first time I've come face-to-face with the problem and had to reckon with its effects, and I think a large part of that is because I'm dealing with a much larger codebase than I typically would when benchmarking, so there's a lot of room for inadvertent layout changes to cascade and cause noticeable performance differences. Note: Reducing the amount of code that's compiled in order to benchmark the tokenizer is not as easy as one might think, as the HTML tokenizer has a lot of interconnected dependencies (HTML parser, JavaScript library, countless other things), so when benchmarking the HTML tokenizer I'm pulling in all of Ladybird's LibWeb which is rather large. Elephant mitigation🔗 Unfortunately, the solution presented in the talk (Stabilizer, which allows you to constantly randomize a binary's layout during runtime to control for layout-based performance differences) has bitrotted and only works with an ancient version of LLVM. So, instead, I thought I'd try a different benchmarking setup to counteract the problem: Only compile one binary, and have it contain the code for all the different named character reference matching implementations Choose the named character reference matching implementation to use at runtime This introduces some dynamic dispatch overhead into the mix which may muddy the results slightly, but, in theory, this should eliminate the effects of layout differences, as whatever the binary layout happens to be, all implementations will share it. In practice, this did indeed work, but introduced another unexplained anomaly that we'll deal with afterwards. After moving all the implementations into the same binary, I got these results for the 'average case' benchmark: Note: Remember that we're expecting this benchmark to show no meaningful difference across the board, as it mostly tests code that hasn't been changed (i.e. named character reference matching has little effect on this benchmark, as mentioned earlier). 
Benchmark 1 : ./BenchHTMLTokenizerFiles dafsa measurement mean ± σ min … max outliers delta wall_time 1.78 ± 29.8 1.75 … 1.87 peak_rss 88.7 ± 59.2 88.7 … 88.8 cpu_cycles 6.71 ± 127 6.62 … 7.10 instructions 16.1 ± 99.2 16.1 … 16.1 cache_references 324 ± 2.87 319 … 331 2 (17%) cache_misses 10.2 ± 66.4 10.1 … 10.3 branch_misses 8.69 ± 2.99 7.82 … 18.2 Benchmark 3 : ./BenchHTMLTokenizerFiles blink measurement mean ± σ min … max outliers delta wall_time 1.79 ± 24.8 1.73 … 1.82 peak_rss 89.1 ± 234 89.0 … 89.8 cpu_cycles 6.70 ± 57.7 6.62 … 6.79 instructions 16.1 ± 128 16.1 … 16.1 cache_references 325 ± 1.90 321 … 328 cache_misses 10.3 ± 54.3 10.2 … 10.4 branch_misses 7.86 ± 24.6 7.79 … 7.89 Benchmark 4 : ./BenchHTMLTokenizerFiles gecko measurement mean ± σ min … max outliers delta wall_time 1.72 ± 9.79 1.70 … 1.74 ⚡- 3.2% ± 1.1% peak_rss 88.8 ± 83.8 88.7 … 89.0 cpu_cycles 6.69 ± 41.1 6.63 … 6.76 instructions 16.1 ± 72.3 16.1 … 16.1 cache_references 323 ± 1.53 320 … 325 cache_misses 10.2 ± 35.8 10.1 … 10.2 branch_misses 7.79 ± 49.1 7.76 … 7.95 2 (17%) The remaining Gecko (Firefox) wall_time difference is consistently reproducible, but not readily explainable using the other metrics measured, as there is no significant difference in CPU cycles, instructions, cache usage, etc. After attempting some profiling and trying to use strace to understand the difference, my guess is that this comes down to coincidental allocation patterns being friendlier when choosing the Gecko version. Note: There is no heap allocation in any of the named character reference matching implementations, but the size of each NamedCharacterReferenceMatcher subclass is different (the Gecko version is 32 bytes while the others are either 16 or 48 bytes), and an instance of the chosen subclass is heap allocated at the start of the program. 
If we use strace -e %memory -c , the dafsa and blink versions consistently use more brk / mmap / munmap syscalls (especially brk ): Dafsa % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 47.18 0.012463 113 110 brk 31.90 0.008427 443 19 munmap 18.25 0.004822 9 508 mmap 2.66 0.000703 5 134 mprotect ------ ----------- ----------- --------- --------- ---------------- 100.00 0.026415 34 771 total Blink % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 55.95 0.018094 138 131 brk 28.81 0.009318 490 19 munmap 12.93 0.004181 8 508 mmap 2.32 0.000749 5 134 mprotect ------ ----------- ----------- --------- --------- ---------------- 100.00 0.032342 40 792 total The Gecko version, even though it uses roughly the same amount of memory overall, consistently has fewer of these syscalls: Gecko % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 37.07 0.006560 385 17 munmap 31.73 0.005615 75 74 brk 26.50 0.004689 9 506 mmap 4.70 0.000831 6 134 mprotect ------ ----------- ----------- --------- --------- ---------------- 100.00 0.017695 24 731 total I don't know enough about the glibc / libstdc++ allocator implementation(s) to know why this would be the case, and the magnitude of the difference reported by strace doesn't seem large enough to explain the results, but I'm at least a little bit confident that this is the cause, since, after inserting padding to each NamedCharacterReferenceMatcher subclass to ensure they are all the same size, the wall_time difference went away: Benchmark 1 : ./BenchHTMLTokenizerFiles dafsa measurement mean ± σ min … max outliers delta wall_time 1.81 ± 29.8 1.73 … 1.85 peak_rss 89.7 ± 203 89.5 … 90.3 cpu_cycles 6.74 ± 58.0 6.68 … 6.87 instructions 16.1 ± 142 16.1 … 16.1 cache_references 322 ± 2.45 316 … 325 cache_misses 10.2 ± 25.1 10.1 … 10.2 branch_misses 7.84 ± 26.1 7.77 … 7.87 Benchmark 2 : ./BenchHTMLTokenizerFiles blink measurement mean ± σ min … max outliers delta wall_time 1.80 ± 23.0 1.74 … 1.83 peak_rss 89.6 ± 205 89.2 … 90.1 2 (17%) cpu_cycles 6.74 ± 37.0 6.68 … 6.82 instructions 16.1 ± 194 16.1 … 16.1 cache_references 321 ± 1.93 317 … 324 cache_misses 10.3 ± 47.5 10.2 … 10.3 branch_misses 7.87 ± 23.3 7.82 … 7.91 Benchmark 3 : ./BenchHTMLTokenizerFiles gecko measurement mean ± σ min … max outliers delta wall_time 1.80 ± 29.0 1.71 … 1.82 peak_rss 89.7 ± 265 89.5 … 90.5 cpu_cycles 6.73 ± 43.3 6.65 … 6.78 instructions 16.1 ± 156 16.1 … 16.1 2 (17%) cache_references 321 ± 3.29 316 … 329 cache_misses 10.2 ± 52.9 10.1 … 10.2 2 (17%) branch_misses 7.83 ± 22.2 7.76 … 7.86 2 (17%) No meaningful differences across the board, which is what we expect. What this (tentatively) means is that heap allocation is another potential confounder, and that something as inconsequential as a single allocation being a different size (in our case the NamedCharacterReferenceMatcher instance) may have knock-on effects that last for the rest of the program (or I'm wrong about the cause and this is a red herring). Note: I also wrote a version of the same benchmark using a (very simple and janky) bump allocator and that, too, got rid of the difference (even when the NamedCharacterReferenceMatcher implementations have different sizes). 
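The note above mentions a (very simple and janky) bump allocator; details and results follow below, but in the abstract the idea is just the following sketch (written in Zig to match the other code in this post—the actual benchmark harness is C++):

```zig
// Sketch of the idea, not the actual benchmark code: allocate one big chunk
// up front and satisfy every subsequent allocation by handing out the next
// slice of it, so every run makes the same memory syscalls.
const BumpAllocator = struct {
    buffer: []u8,
    end: usize = 0,

    // `alignment` must be a power of two. Returns null when the buffer is
    // exhausted instead of asking the OS for more memory.
    fn alloc(self: *BumpAllocator, len: usize, alignment: usize) ?[]u8 {
        const start = (self.end + alignment - 1) & ~(alignment - 1);
        if (start + len > self.buffer.len) return null;
        self.end = start + len;
        return self.buffer[start..self.end];
    }
};
```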
Bump allocation details/results Bump allocation in this case means that a single large chunk of memory is allocated upfront and then all heap allocations afterward are satisfied by doling out a portion of that initial chunk. This means that each named character reference implementation will use the same memory syscalls. Here's the results: Benchmark 1: ./BenchHTMLTokenizerFilesBumpAlloc dafsa measurement mean ± σ min … max outliers delta wall_time 1.87 ± 39.4 1.84 … 1.99 peak_rss 338 ± 168 338 … 338 cpu_cycles 7.02 ± 158 6.91 … 7.48 instructions 16.1 ± 77.1 16.1 … 16.1 cache_references 334 ± 1.89 332 … 338 cache_misses 10.8 ± 110 10.7 … 11.0 branch_misses 8.25 ± 3.12 7.29 … 17.7 Benchmark 2: ./BenchHTMLTokenizerFilesBumpAlloc blink measurement mean ± σ min … max outliers delta wall_time 1.86 ± 9.05 1.84 … 1.87 peak_rss 338 ± 113 338 … 338 cpu_cycles 6.97 ± 31.8 6.92 … 7.03 instructions 16.1 ± 66.1 16.1 … 16.1 cache_references 335 ± 966 333 … 337 cache_misses 10.8 ± 39.9 10.8 … 10.9 branch_misses 7.35 ± 75.4 7.30 … 7.51 2 (17%) Benchmark 3: ./BenchHTMLTokenizerFilesBumpAlloc gecko measurement mean ± σ min … max outliers delta wall_time 1.85 ± 6.66 1.84 … 1.86 peak_rss 338 ± 158 338 … 338 cpu_cycles 6.96 ± 27.3 6.91 … 7.01 instructions 16.1 ± 96.6 16.1 … 16.1 cache_references 334 ± 1.35 332 … 337 cache_misses 10.7 ± 60.7 10.7 … 10.9 branch_misses 7.29 ± 22.6 7.27 … 7.35 Consequently, this means that I won't bother with the 'average case' benchmarking moving forward. In other words, spoiler alert: nothing in this article will move the needle on this 'average case' benchmark. Side note: conforming to one API🔗 Something worth mentioning here is that I've made the choice to convert the Firefox/Chrome/Safari implementations to conform to the NamedCharacterReferenceMatcher API used by Ladybird (instead of porting the full named character reference tokenizer state implementation from the other browsers' into Ladybird). This was done for two reasons: First, to rule out differences in the tokenizer state implementation itself (tangential to the matching strategy) affecting the matching speed. This might seem strange now, but the logic behind this will be discussed in detail later. Second, it made it so compiling one binary that can switch between all the different implementations at runtime (for the purposes of removing the confounding effect of layout differences) was very convenient. I'm mentioning this now because it means that I've introduced another possible source of error into my benchmarks; the Firefox/Chrome/Safari implementations that I'm testing are not 1:1 ports, as they had to be transformed to conform to the NamedCharacterReferenceMatcher API (Firefox much more than Chrome/Safari). My converted implementations that I'll be using for benchmarking are available in this branch. Lessons learned🔗 I think the big takeaway here is that there is a lot that can go wrong when benchmarking. Aside from the more esoteric stuff mentioned above, there are also countless simple/dumb mistakes that can be made that can completely ruin a benchmark's integrity. As an example, for a good while when writing this article, I accidentally left a loop in the Chrome version that I only put there for debugging purposes. That loop was just eating CPU cycles for no reason and skewed my benchmark results pretty significantly. Luckily, I found and fixed that particular mistake, but that sort of thing could have easily gone unnoticed and caused me to draw totally invalid conclusions. 
Beyond that, there's other stuff I haven't mentioned like CPU architecture, compiler flags, etc, etc, etc. What I'm really trying to get across is something like: You should definitely be skeptical of the benchmarking results I'm providing throughout this article. You might want to be skeptical of all benchmarking results, generally. With all that out of the way, let's get into it. Comparison with Gecko (Firefox)🔗 The current implementation of named character reference tokenization in the Gecko engine (Firefox's browser engine) was introduced in 2010, and refined during the rest of 2010. It has remained unchanged since then. Note: Firefox's HTML tokenizer is actually written in Java which is then translated to C++ It does not use any form of a trie, but instead uses a number of arrays (48 in total) to progressively narrow down the possible candidates within the set of named character references until there's no more possible candidates remaining. Here's an overview: The first character is checked to ensure that it is within the a-z or A-Z range and the first character is saved [src] Note: This is one property of all named character references that the Firefox implementation takes advantage of: all named character references start with a character within the a-z or A-Z range—no exceptions. The second character is then used as an index into a HILO_ACCEL array in order to get the 'row' to use for the first character (there are 44 possible rows; the second character is also always within a-z and A-Z , but there happens to be 8 missing characters from the A-Z range) [src] array in order to get the 'row' to use for the first character (there are 44 possible rows; the second character is also always within and , but there happens to be 8 missing characters from the range) [src] If a valid row exists, the first character is then transformed into an index between 0 and 51 (inclusive) and that is used as an index into the 'row' that was retrieved from the second character [src] and (inclusive) and that is used as an index into the 'row' that was retrieved from the second character [src] The value obtained by the combination of the first two characters contains a 32-bit number: The "lo" bits (the least significant 16 bits) gives you an index into the NAMES array starting at the first possible matching name [src] The "hi" bits (the most significant 16 bits) gives you an index into the NAMES array starting at the last possible matching name [src] The values in the NAMES array are struct 's that contain two pieces of information [src]: An index to the start of the remaining characters in the name, within the ALL_NAMES array (an array of bytes) The length of the remaining characters in the name array are 's that contain two pieces of information [src]: The "lo" and "hi" indexes are then incremented/decremented as candidates get ruled out, while taking note of any fully matching candidates. This happens until there are no possible candidates left ( hi < lo or ; is seen). [src] or is seen). 
The most recently matching candidate's index (if any) is then re-used to look up the mapped code point(s) within the VALUES array (the NAMES and VALUES arrays are the same length) [src]

In the very likely scenario that the above description is hard to take in, here's my best attempt at visually illustrating how it works, matching against the valid named character reference ⋶ :

notinvc;
^ ☑ the first character ( n ) is within the a-z or A-Z range

notinvc;
 ^ Use the second character ( o ) to get the array to use with the first character ( n )

HILO_ACCEL
0   '\x00'  N/A
...
65  'A'     HILO_ACCEL_65
...
110 'n'     HILO_ACCEL_110
111 'o'     HILO_ACCEL_111
112 'p'     HILO_ACCEL_112
...
122 'z'     HILO_ACCEL_122

HILO_ACCEL_111
0   'A'  0x 0011 0010
...
25  'Z'  ...
26  'a'  ...
...
39  'n'  0x 0602 05F6
...
51  'z'  0x 08B3 08B3

The value 0x 0602 05F6 splits into hi = 0x0602 (1538) and lo = 0x05F6 (1526), so any possible matches must be between indexes 1526 and 1538 (inclusive). The possible matches are:

NAMES
1526 pf;
1527 t
1528 t;
1529 tin;
1530 tinE;
1531 tindot;
1532 tinva;
1533 tinvb;
1534 tinvc;
1535 tni;
1536 tniva;
1537 tnivb;
1538 tnivc;

Now we start to narrow those possibilities down:

[Animated visualization: as each subsequent character of notinvc; is consumed, the 'lo' and 'hi' cursors move inward and rule out candidates, until only index 1534 ( tinvc; , i.e. the full match for ⋶ ) remains.]

I'm glossing over how exactly the possibilities are narrowed down because it's not super relevant (if you're interested, here's the responsible tokenizer code), but I will note that the 'lo' and 'hi' cursors always move linearly (i.e. each possible match is ruled out one-by-one; there's no binary search or anything like that going on). This approach works well because the first two characters alone fairly reliably narrow down the possibilities to a pretty small range. Out of 2288 possible combinations of the first two characters, 1658 of them (72.5%) lead to zero possible matches. Out of the remaining combinations (those with ≥ 1 possible match), the mean number of matches is 3.54 with a standard deviation of 29.8, and the median number of possible matches is 2. Here's what the full distribution looks like (with the combinations that lead to zero matches included):

[Histogram: frequency of first-two-character combinations (y-axis, 0 to 1,500) by number of possible matches after the first two characters (x-axis, 0 to 55).]

Note: There are 2 first-two-character-combinations that lead to 55 possible matches: No and su .
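To make that first-two-character lookup a bit more concrete, here's a rough sketch of it in C++. This is my own illustration rather than Gecko's actual code: the array shapes follow the description above, but the use of 0 as a "no candidates" sentinel is an assumption on my part.

#include <cstdint>
#include <optional>

// Inclusive range of indexes into the NAMES array.
struct CandidateRange { uint16_t lo; uint16_t hi; };

// hilo_accel: 123 pointers indexed by the *second* character (nullptr = no row).
// Each row holds 52 32-bit values indexed by the *first* character mapped to
// 0..25 for 'A'..'Z' and 26..51 for 'a'..'z' (matching the visualization above).
std::optional<CandidateRange> initial_range(char first, char second,
                                            const uint32_t* const hilo_accel[123])
{
    unsigned char row_index = static_cast<unsigned char>(second);
    const uint32_t* row = (row_index < 123) ? hilo_accel[row_index] : nullptr;
    if (row == nullptr)
        return std::nullopt;

    int col = (first >= 'A' && first <= 'Z') ? (first - 'A')
            : (first >= 'a' && first <= 'z') ? (first - 'a' + 26)
            : -1;
    if (col < 0)
        return std::nullopt;

    uint32_t hilo = row[col];
    if (hilo == 0) // assumed sentinel for "no names start with these two characters"
        return std::nullopt;

    return CandidateRange{
        static_cast<uint16_t>(hilo & 0xFFFF), // "lo": first possible match in NAMES
        static_cast<uint16_t>(hilo >> 16),    // "hi": last possible match in NAMES
    };
}

For the 'n', 'o' example above, this would return lo = 1526 and hi = 1538, the range shown in the visualization.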
Now that we have an understanding of how the Firefox implementation works, let's see how it compares using the three metrics that were mentioned at the start. Performance between the Firefox version and the Ladybird DAFSA version is basically a wash in the primary benchmark I'm using (tokenizing a file with tens of thousands of valid and invalid named character references): Benchmark 1 : ./BenchHTMLTokenizer gecko measurement mean ± σ min … max outliers delta wall_time 113 ± 1.12 111 … 115 peak_rss 83.4 ± 93.7 83.0 … 83.5 cpu_cycles 226 ± 877 224 … 230 instructions 438 ± 10.6 438 … 438 cache_references 9.54 ± 130 9.40 … 10.5 cache_misses 427 ± 11.1 406 … 458 branch_misses 578 ± 1.79 575 … 585 Benchmark 2 : ./BenchHTMLTokenizer dafsa measurement mean ± σ min … max outliers delta wall_time 114 ± 1.38 110 … 116 peak_rss 83.3 ± 94.6 83.0 … 83.5 cpu_cycles 229 ± 856 227 … 232 💩+ 1.4% ± 0.1% instructions 450 ± 10.4 450 … 450 💩+ 2.7% ± 0.0% cache_references 9.42 ± 128 9.25 … 10.5 cache_misses 418 ± 9.03 400 … 443 ⚡- 2.0% ± 0.7% branch_misses 575 ± 3.01 570 … 600 However, if we tailor some benchmarks to test the scenarios where each should theoretically perform the worst, we can see some clearer differences. I believe the worst case for the Firefox implementation is successfully matching the named character reference ⫌ . As mentioned earlier, su as the first two characters narrows down the possibilities the least, with 55 remaining possibilities, and ⫌ should take the longest to match out of the remaining possibilities. Here are the results for tokenizing a file with nothing but 30,000 ⫌ sequences in a row: Benchmark 1 : ./BenchHTMLTokenizer gecko gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 50.6 ± 1.11 48.3 … 53.2 peak_rss 53.0 ± 85.4 52.7 … 53.2 cpu_cycles 137 ± 816 135 … 140 instructions 278 ± 9.06 278 … 278 cache_references 3.27 ± 58.4 3.13 … 3.55 cache_misses 361 ± 10.0 342 … 396 branch_misses 314 ± 5.29 306 … 335 Benchmark 2 : ./BenchHTMLTokenizer dafsa gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 45.9 ± 805 43.6 … 47.5 22 (10%) ⚡- 9.3% ± 0.4% peak_rss 53.0 ± 83.9 52.7 … 53.2 cpu_cycles 117 ± 635 116 … 119 ⚡- 14.5% ± 0.1% instructions 259 ± 5.16 259 … 259 ⚡- 7.0% ± 0.0% cache_references 3.27 ± 128 3.15 … 4.10 cache_misses 357 ± 7.06 344 … 384 branch_misses 183 ± 1.95 179 … 193 21 (10%) ⚡- 41.8% ± 0.2% On the flipside, the scenario where the Firefox implementation likely outperforms the Ladybird implementation the most is an invalid named character reference that can be rejected from the first two characters alone (I've arbitrarily chosen &cz ). 
Here are the results for tokenizing a file with nothing but 30,000 &cz sequences in a row: Benchmark 1 : ./BenchHTMLTokenizer gecko ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 61.4 ± 958 59.2 … 63.6 peak_rss 65.1 ± 85.5 64.9 … 65.3 24 (15%) cpu_cycles 104 ± 525 102 … 106 instructions 194 ± 4.18 194 … 194 cache_references 5.89 ± 125 5.79 … 6.92 cache_misses 374 ± 4.44 367 … 385 branch_misses 163 ± 1.24 160 … 166 Benchmark 2 : ./BenchHTMLTokenizer dafsa ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 63.0 ± 1.00 60.1 … 65.0 💩+ 2.6% ± 0.3% peak_rss 65.1 ± 74.3 64.8 … 65.2 cpu_cycles 112 ± 673 111 … 117 💩+ 7.6% ± 0.1% instructions 214 ± 4.26 214 … 214 💩+ 10.4% ± 0.0% cache_references 5.87 ± 54.9 5.77 … 6.19 cache_misses 375 ± 4.52 364 … 391 branch_misses 164 ± 751 161 … 166 However, neither of these scenarios are likely to be all that common in reality. My hunch (but I don't have any data to back this up) is that there are two scenarios that are common in real HTML: Valid and complete named character references Invalid named character references that come from accidentally putting & directly in the markup instead of & (and therefore the & is probably surrounded by whitespace) In the second scenario where an & character is surrounded by whitespace, the tokenizer will never actually enter the named character reference state, since that requires & to be followed by an ASCII alphanumeric character, so all implementations will perform the same there. We can test the first scenario, though. Here are the results for a file with nothing but 30,000 valid named character references (chosen at random) in a row: Benchmark 1 : ./BenchHTMLTokenizer gecko all-valid measurement mean ± σ min … max outliers delta wall_time 43.6 ± 713 41.3 … 45.2 peak_rss 54.4 ± 84.8 54.1 … 54.5 cpu_cycles 103 ± 545 102 … 105 instructions 188 ± 10.6 188 … 188 cache_references 3.63 ± 91.6 3.54 … 4.35 cache_misses 361 ± 8.92 347 … 386 branch_misses 383 ± 1.57 379 … 387 Benchmark 2 : ./BenchHTMLTokenizer dafsa all-valid measurement mean ± σ min … max outliers delta wall_time 43.0 ± 770 40.8 … 44.3 30 (13%) ⚡- 1.3% ± 0.3% peak_rss 54.4 ± 80.6 54.0 … 54.5 cpu_cycles 102 ± 452 101 … 104 ⚡- 1.3% ± 0.1% instructions 191 ± 8.04 191 … 191 💩+ 2.0% ± 0.0% cache_references 3.54 ± 63.5 3.45 … 4.09 ⚡- 2.6% ± 0.4% cache_misses 355 ± 5.23 344 … 370 ⚡- 1.7% ± 0.4% branch_misses 344 ± 1.24 341 … 348 ⚡- 10.1% ± 0.1% Something worth noting about why Firefox fares better here than in the ⫌ benchmark above is that the distribution of possible matches after two characters is very uneven (as mentioned previously). In 42.7% of cases (269 out of 630) where there is at least 1 possible match after the first two characters, there is only 1 possible match. This makes the 'narrowing down the matches' portion of the Firefox implementation essentially just a string comparison. Something else we can look at is raw matching speed in isolation (i.e. not within the context of HTML tokenization). 
In my benchmarking, the Firefox implementation wins out in this category:

Benchmark 1 : ./BenchMatcherGecko measurement mean ± σ min … max outliers delta wall_time 105 ± 1.14 102 … 108 peak_rss 4.56 ± 72.7 4.33 … 4.72 cpu_cycles 426 ± 1.40 424 … 430 instructions 745 ± 81.5 745 … 745 cache_references 8.03 ± 81.6 7.89 … 8.54 cache_misses 28.0 ± 5.70 21.2 … 46.5 branch_misses 5.41 ± 2.49 5.41 … 5.42

Benchmark 2 : ./BenchMatcherDafsa measurement mean ± σ min … max outliers delta wall_time 120 ± 1.46 116 … 123 💩+ 13.8% ± 0.4% peak_rss 4.50 ± 61.6 4.46 … 4.59 cpu_cycles 487 ± 4.39 477 … 496 💩+ 14.3% ± 0.2% instructions 1.02 ± 66.2 1.02 … 1.02 14 (17%) 💩+ 36.7% ± 0.0% cache_references 6.11 ± 90.5 5.98 … 6.81 ⚡- 23.9% ± 0.3% cache_misses 26.0 ± 4.02 20.3 … 39.8 ⚡- 7.1% ± 5.2% branch_misses 6.01 ± 20.5 5.98 … 6.06 💩+ 11.1% ± 0.1%

Two things to note:
It's unclear how applicable this 'raw matching speed' benchmark is for HTML tokenization (this will be discussed more later).
We'll make some improvements to the DAFSA implementation later that will flip these results.

Data size🔗

As noted earlier, Firefox uses a total of 48 arrays for its named character reference data:
HILO_ACCEL is an array of 123 pointers, so that's 984 bytes on a 64-bit architecture
There are 44 HILO_ACCEL_n arrays, each containing 52 32-bit integers, so that's 9,152 bytes
ALL_NAMES is an array of bytes containing the complete set of characters in all named character references, excluding the first two characters of each. This adds up to 12,183 bytes in total
NAMES is an array of 2,231 32-bit structs, so that's 8,924 bytes
VALUES is also an array of 2,231 32-bit structs, so that's 8,924 bytes

Note: The VALUES array does something pretty clever. As mentioned earlier, the largest mapped first code point requires 17 bits to encode, and the largest second code point is U+FE00 so that needs 16 bits to encode. Therefore, to encode both code points directly you'd need at minimum 33 bits. However, all mappings that have a first code point with a value > the u16 max also happen to be mappings that do not have a second code point. So, if you store all the mapped values encoded as UTF-16, you can get away with using two 16-bit integers to store all the possible values (i.e. the > u16 max code points get stored as a surrogate pair, while all the other mappings get stored as two UTF-16 code units). (this is not an improvement over how I store this data, but I thought it was still worth noting)

So, in total, the Firefox implementation uses 40,167 bytes (39.23 KiB) for its named character reference data, while Ladybird uses 24,412 bytes (23.84 KiB). That's a difference of 15,755 bytes (15.39 KiB), or, in other words, the Ladybird implementation uses 60.8% of the data size of the Firefox implementation. Note: As mentioned previously, an additional 3,067 bytes (2.99 KiB) could be saved if the values array in the Ladybird implementation were bitpacked.
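To make that UTF-16 trick concrete, here's a small sketch of how a VALUES-style entry could be decoded back into the mapped code point(s). This is my own illustration rather than Gecko's code, and treating a zero second code unit as "no second code point" is an assumption made purely for the sake of the example.

#include <cstdint>
#include <vector>

// Decode one two-code-unit entry into the code point(s) it represents.
std::vector<uint32_t> decode_value(uint16_t first_unit, uint16_t second_unit)
{
    // A high surrogate means the two units encode a single code point above
    // U+FFFF (these mappings never have a second code point).
    if (first_unit >= 0xD800 && first_unit <= 0xDBFF) {
        uint32_t code_point = 0x10000
            + ((static_cast<uint32_t>(first_unit) - 0xD800) << 10)
            + (static_cast<uint32_t>(second_unit) - 0xDC00);
        return { code_point };
    }
    // Otherwise each unit is its own code point; zero is assumed here to mean
    // "no second code point" (the real sentinel may differ).
    if (second_unit == 0)
        return { first_unit };
    return { first_unit, second_unit };
}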
Since, for the purposes of my benchmarking, the Firefox implementation was made to conform to the API of the NamedCharacterReferenceMatcher that is used in the Ladybird implementation, there's no meaningful difference in terms of ease-of-use. However, I'll take this opportunity to talk about what changes were made to make that happen and how the real Firefox implementation differs.
The real Firefox implementation uses 3 different tokenizer states and multiple complicated loops with goto statements to move between them
The NamedCharacterReferenceMatcher version moves the tokenizer states into the Matcher and replaces the complicated loops with 2 simple loops
Very crudely approximated, the implementation went from ~200 SLoC to ~110 SLoC. So, the NamedCharacterReferenceMatcher abstraction may represent a marginal improvement in terms of ease-of-use.

Overall, the Firefox implementation fares quite well in this comparison.
It's at least as fast, with some potential performance benefits and some potential weaknesses
It uses more data to store the named character reference mappings, but saving 15 KiB might not be much of a concern
The implementation is a fair bit more complicated, but not necessarily in a meaningful way

Comparison with Blink/WebKit (Chrome/Safari)🔗

Blink (the browser engine of Chrome/Chromium) started as a fork of WebKit (the browser engine of Safari), which itself started as a fork of KHTML. There are some differences that have emerged between the two since Blink was forked from WebKit, but for the purposes of this article I'm only going to benchmark against the Blink implementation and assume the results would be roughly the same for the WebKit implementation (the difference mostly comes down to data size, which I'll mention in the "Data size" section later). Note: I'll double check that the Safari implementation has the same performance characteristics and update this article with my findings once I do.

Like Firefox, the Chrome/Safari named character reference tokenization does not use a trie. For the 'matching' portion, the Chrome/Safari implementation is actually quite similar in concept to the Firefox implementation:
Use the first character to look up the initial range of possible matches within a sorted array of all named character references [src]
For each character after that, use std::ranges::equal_range to narrow the possibilities within the current range ( std::ranges::equal_range uses binary searches to get both the std::ranges::lower_bound and std::ranges::upper_bound ) [src] (see the sketch after this list)
If the first possible match in the resulting range is the same length as the current number of characters being matched, mark it as the most recent match [src]
Continue until there are no more possible matches (the range is empty) [src]
The main differences from the Firefox implementation are that the Chrome/Safari version (a) only uses the first character to narrow the initial possible matches, and (b) uses binary searches to narrow the possibilities after that instead of linearly moving the lo and hi indexes.
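Here's a simplified sketch of that narrowing step. It's my own illustration rather than Blink's code: it operates on plain std::string names and uses separate std::lower_bound/std::upper_bound calls where Blink uses std::ranges::equal_range on its entry structs, but the shape of the search is the same idea.

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Half-open range [first, last) of candidates in a lexicographically sorted
// list of entity names, all of which already match the first `depth` characters.
struct NarrowResult { std::size_t first; std::size_t last; };

NarrowResult narrow(const std::vector<std::string>& sorted_names,
                    std::size_t first, std::size_t last,
                    std::size_t depth, char next_char)
{
    auto begin = sorted_names.begin() + first;
    auto end = sorted_names.begin() + last;
    // Candidates that are too short to have a character at `depth` sort before
    // those that do, so comparing on that character keeps the range contiguous.
    auto lower = std::lower_bound(begin, end, next_char,
        [depth](const std::string& name, char c) {
            return depth >= name.size() || name[depth] < c;
        });
    auto upper = std::upper_bound(begin, end, next_char,
        [depth](char c, const std::string& name) {
            return depth < name.size() && c < name[depth];
        });
    return { static_cast<std::size_t>(lower - sorted_names.begin()),
             static_cast<std::size_t>(upper - sorted_names.begin()) };
}

An empty result range means no candidates remain; a non-empty range whose first entry has length depth + 1 corresponds to the "mark it as the most recent match" step above.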
Similar to Firefox, the performance difference in the 'tens of thousands of valid and invalid named character references' benchmark is basically a wash: Benchmark 1 : ./BenchHTMLTokenizer blink measurement mean ± σ min … max outliers delta wall_time 115 ± 943 113 … 117 peak_rss 83.3 ± 99.6 83.0 … 83.5 cpu_cycles 232 ± 754 230 … 234 instructions 461 ± 4.73 461 … 461 cache_references 9.95 ± 299 9.71 … 12.3 cache_misses 412 ± 5.68 401 … 425 branch_misses 747 ± 1.78 744 … 757 Benchmark 2 : ./BenchHTMLTokenizer dafsa measurement mean ± σ min … max outliers delta wall_time 114 ± 1.26 110 … 117 peak_rss 83.3 ± 77.5 83.1 … 83.5 cpu_cycles 228 ± 882 227 … 233 ⚡- 1.4% ± 0.1% instructions 450 ± 7.33 450 … 450 ⚡- 2.4% ± 0.0% cache_references 9.48 ± 402 9.29 … 13.1 ⚡- 4.7% ± 1.1% cache_misses 412 ± 7.21 398 … 432 branch_misses 575 ± 6.48 571 … 633 ⚡- 23.0% ± 0.2% The DAFSA is faster than Chrome in the all- ⫌ benchmark, but the difference is not as big as it was with Firefox, since Chrome uses binary searches to narrow down the possible matches whereas Firefox uses linear scans: Benchmark 1 : ./BenchHTMLTokenizer blink gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 47.3 ± 782 45.0 … 49.6 peak_rss 53.0 ± 94.4 52.7 … 53.2 cpu_cycles 123 ± 807 122 … 126 instructions 290 ± 7.46 290 … 290 cache_references 3.31 ± 200 3.19 … 5.79 cache_misses 358 ± 8.15 345 … 401 branch_misses 182 ± 1.82 178 … 194 Benchmark 2 : ./BenchHTMLTokenizer dafsa gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 45.9 ± 663 43.6 … 47.3 ⚡- 3.0% ± 0.3% peak_rss 53.0 ± 82.6 52.7 … 53.2 cpu_cycles 117 ± 677 116 … 120 ⚡- 5.0% ± 0.1% instructions 259 ± 7.78 259 … 259 ⚡- 10.8% ± 0.0% cache_references 3.26 ± 99.4 3.17 … 4.00 cache_misses 358 ± 8.14 342 … 386 branch_misses 183 ± 3.07 179 … 214 The DAFSA is once again worse at detecting &cz as invalid, and the results are similar to what they were with Firefox: Benchmark 1 : ./BenchHTMLTokenizer blink ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 62.6 ± 1.08 60.6 … 64.3 peak_rss 65.1 ± 79.4 64.8 … 65.2 cpu_cycles 108 ± 640 107 … 111 instructions 203 ± 11.3 203 … 203 cache_references 5.92 ± 51.5 5.82 … 6.30 cache_misses 384 ± 8.88 366 … 409 branch_misses 164 ± 1.44 160 … 169 Benchmark 2 : ./BenchHTMLTokenizer dafsa ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 63.9 ± 1.10 61.8 … 65.5 💩+ 2.0% ± 0.4% peak_rss 65.1 ± 92.8 64.8 … 65.2 cpu_cycles 113 ± 707 112 … 117 💩+ 4.4% ± 0.1% instructions 214 ± 10.9 214 … 214 💩+ 5.6% ± 0.0% cache_references 5.89 ± 63.4 5.76 … 6.31 cache_misses 387 ± 9.20 367 … 412 branch_misses 165 ± 2.77 162 … 197 For the '30,000 valid named character references' benchmark, the DAFSA is slightly faster (similar results as Firefox): Benchmark 1 : ./BenchHTMLTokenizer blink all-valid measurement mean ± σ min … max outliers delta wall_time 43.9 ± 844 41.7 … 47.0 peak_rss 54.3 ± 89.2 54.0 … 54.5 cpu_cycles 105 ± 959 104 … 112 instructions 204 ± 6.18 204 … 204 cache_references 3.78 ± 137 3.67 … 5.37 cache_misses 359 ± 11.2 345 … 457 branch_misses 466 ± 1.43 463 … 472 Benchmark 2 : ./BenchHTMLTokenizer dafsa all-valid measurement mean ± σ min … max outliers delta wall_time 43.1 ± 592 41.1 … 44.3 ⚡- 1.8% ± 0.3% peak_rss 54.4 ± 79.6 54.1 … 54.5 cpu_cycles 102 ± 487 101 … 104 ⚡- 3.3% ± 0.1% instructions 191 ± 5.33 191 … 191 ⚡- 6.0% ± 0.0% cache_references 3.55 ± 74.9 3.46 … 4.21 ⚡- 6.3% ± 0.5% cache_misses 355 ± 5.67 344 … 373 branch_misses 345 ± 2.12 342 … 374 ⚡- 26.0% ± 0.1% In 
the raw matching speed benchmark, the DAFSA comes out ahead:

Benchmark 1 : ./BenchHTMLTokenizerBlink measurement mean ± σ min … max outliers delta wall_time 144 ± 1.33 141 … 148 peak_rss 4.55 ± 65.8 4.33 … 4.72 cpu_cycles 590 ± 2.33 585 … 604 instructions 1.07 ± 83.4 1.07 … 1.07 cache_references 12.3 ± 162 12.1 … 13.0 cache_misses 27.6 ± 5.79 21.1 … 67.6 branch_misses 8.71 ± 21.1 8.68 … 8.79

Benchmark 2 : ./BenchMatcherDafsa measurement mean ± σ min … max outliers delta wall_time 119 ± 1.47 116 … 122 ⚡- 17.6% ± 0.3% peak_rss 4.52 ± 65.7 4.46 … 4.59 cpu_cycles 484 ± 4.33 477 … 496 ⚡- 17.8% ± 0.2% instructions 1.02 ± 80.6 1.02 … 1.02 ⚡- 4.4% ± 0.0% cache_references 6.10 ± 177 5.99 … 7.66 ⚡- 50.5% ± 0.4% cache_misses 25.4 ± 3.42 20.4 … 38.8 ⚡- 7.9% ± 5.3% branch_misses 6.02 ± 21.7 5.98 … 6.06 ⚡- 30.9% ± 0.1%

Data size🔗

The Chrome implementation uses four arrays:
kStaticEntityStringStorage , an array of all the bytes in every named character reference, with some de-duplication techniques (e.g. the sequence 'b', 'n', 'o', 't', ';' in the array is used for ⌐ , ¬ , and ¬ ). It uses 14,485 bytes total.
kStaticEntityTable , an array of 2,231 12-byte wide structs containing information about each named character reference (its location in the kStaticEntityStringStorage array, the length of its name, the code point(s) it should be transformed into). It uses 26,722 bytes.
kUppercaseOffset and kLowercaseOffset are each arrays of offsets into kStaticEntityTable , and both are used as lookup tables for the first character. Getting kUppercaseOffset[char - 'A'] gives you the initial lower bound's offset and kUppercaseOffset[char - 'A' + 1] gives you the initial upper bound's offset (and the same sort of thing for kLowercaseOffset ). Each uses 54 bytes, so that's 108 bytes total.

All together, the Chrome implementation uses 41,167 bytes (40.39 KiB) for its named character reference data, while Ladybird uses 24,412 bytes (23.84 KiB). That's a difference of 16,953 bytes (16.56 KiB); in other words, the Ladybird implementation uses 59.0% of the data size of the Chrome implementation.

The Safari implementation uses the same four arrays, but has made a few more data size optimizations:
kStaticEntityStringStorage does not include semicolons, and instead that information was moved to a boolean flag within the elements of the kStaticEntityTable array. This brings down the total bytes used by this array to 11,127 (-3,358 compared to the Chrome version)
The HTMLEntityTableEntry struct (used in the kStaticEntityTable array) was converted to use a bitfield to reduce the size of the struct from 12 bytes to 8 bytes (57 bits). However, Clang seems to insert padding bits into the struct which brings it back up to 12 bytes anyway (it wants to align the optionalSecondCharacter and nameLengthExcludingSemicolon fields).
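To illustrate the kind of layout in question, here's a hedged sketch: the field names optionalSecondCharacter and nameLengthExcludingSemicolon come from the description above, but every other name and all of the bit widths are guesses of mine chosen only to demonstrate how alignment and padding can undo the intended size reduction; this is not WebKit's actual struct.

#include <cstdint>

struct HTMLEntityTableEntrySketch {
    uint32_t firstValue : 17;                  // largest first code point needs 17 bits
    uint32_t nameStartOffset : 16;             // offset into the string storage array
    uint32_t nameLengthExcludingSemicolon : 6;
    uint32_t hasTrailingSemicolon : 1;
    uint16_t optionalSecondCharacter;          // plain (non-bitfield) 16-bit member
};

// On common ABIs the bitfields occupy two 32-bit allocation units (the 16-bit
// field doesn't fit in the bits left over after the 17-bit field), and the
// trailing uint16_t plus struct alignment pushes the total to 12 bytes rather
// than the hoped-for 8.
static_assert(sizeof(HTMLEntityTableEntrySketch) <= 12, "sketch should not exceed the unpacked size");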
So, this data size optimization may or may not actually have an effect (I'm not very familiar with the rules around C++ bitfield padding, so I feel like I can't say anything definitive). If the size is reduced to 8 bytes, then kStaticEntityTable uses 8,924 fewer bytes (17,798 instead of 26,722). So, the Safari implementation uses either 30,040 bytes (29.34 KiB) if HTMLEntityTableEntry uses 12 bytes or 21,116 bytes (20.62 KiB) if HTMLEntityTableEntry uses 8 bytes. This means that Safari's data size optimizations (or at least their intended effect) make its data size smaller than Ladybird's (even if the Ladybird implementation tightly bitpacked its values array, it'd still use 229 bytes more than the 8-byte- HTMLEntityTableEntry Safari version). This also shows that the larger data size of the Chrome implementation is not inherent to the approach that it uses.

In terms of ease of use, for now I'll just say there's no meaningful difference, but there's a caveat that will be discussed later.

Overall, the Chrome implementation as-it-is-now fares about as well as the Firefox implementation in this comparison, but has some potential strengths/weaknesses of its own. That is, it covers one weakness of the Firefox implementation by using binary searches instead of linear scans, but it always has to narrow down the possibilities from a larger initial range (since it only uses the first character to get the range of possible matches whereas Firefox uses the first two characters). The Safari version fares much better in terms of data size (potentially beating out my DAFSA version), and its size optimizations could be applied to the Chrome version as well since the core approach is the same between them. At this point, though, you might be asking yourself, why don't we try...

Combining Firefox, Chrome, and Safari together🔗

In theory, the best ideas from the Firefox and Chrome/Safari implementations could be combined into one new implementation:
Use the combination of the first two characters to get the initial range of possible matches (like Firefox)
Use binary searches to narrow down the possible matches (like Chrome/Safari)
Don't store the first two characters in the kStaticEntityStringStorage / ALL_NAMES array (like Firefox)
Re-use indexes into kStaticEntityStringStorage / ALL_NAMES when possible (like Chrome/Safari, see the ⌐ / ¬ / ¬ example above)
Don't store semicolons in the kStaticEntityStringStorage / ALL_NAMES array (like Safari)
Reduce the size of the HTMLEntityTableEntry / nsHtml5CharacterName struct (like Safari intends to do)
I haven't tested this combination to see how exactly it stacks up, but I would assume it'd be quite good overall.

Something I didn't mention about the Chrome implementation🔗

Since I converted the Chrome implementation to use Ladybird's NamedCharacterReferenceMatcher API in an effort to improve the accuracy of my benchmarking, one major aspect of the Chrome implementation was lost in translation: the Chrome implementation doesn't actually use the character-by-character tokenization strategy we've discussed so far. Instead, it uses the "lookahead (but never beyond an insertion point) until we're certain we have enough characters ..." strategy mentioned back in the Named character reference tokenization overview.
The very high-level summary of the approach (as it is actually implemented in Chrome) is very similar to the description of it in that section:
Starting after the ampersand, look ahead as far as possible without looking beyond the end of an insertion point
Try to match a full named character reference
If you run out of characters because you hit the end of an insertion point while matching, backtrack and try again on the next tokenizer iteration (always starting the lookahead from just after the ampersand, i.e. no state is saved between attempts)

Note: It's only possible to 'run out of characters while matching' when there is an active insertion point. If there isn't one (the common case), then this difference in strategy doesn't matter since 'backtrack and try again' will never come into play. Note also that this strategy doesn't inherently require any particular implementation for the 'try to match a full named character reference' part; a trie, or a DAFSA, or Firefox's HILO_ACCEL implementation, or any other approach could be slotted in there with no change in functionality.

The downside of the Chrome implementation in particular is actually a choice that was made that's not inherent to the overall approach: they don't preserve any matching state between tokenizer iterations and always backtrack to the & before trying to match again. For example, when matching against something like ∉ that's being written one-character-at-a-time (via document.write ), the matching algorithm described in the "Comparison with Blink/WebKit (Chrome/Safari)" section will be executed for each of:
&n , &no , ¬ , ¬i , and ¬in , each resulting in the 'not enough characters' flag being set
Finally, ∉ will be matched fully (the semicolon acts as a definitive delimiter)

In theory, the redundant work that's performed in these sorts of scenarios should have a noticeable effect on performance, but, in practice, I wasn't able to prove that out with benchmarking.

Details and results of my benchmarking
The test file (inserting lots of valid named character references one-character-at-a-time using document.write )
The branch containing the code under test (a faithful adaptation of Blink's lookahead strategy and a runtime flag to switch to an implementation that preserves matching state between retry attempts due to 'not enough characters')

Benchmark 1 : ./headless-browser --dump-text pathological-write-valid.html measurement mean ± σ min … max outliers delta wall_time 1.46 ± 8.11 1.45 … 1.48 peak_rss 62.1 ± 123 61.9 … 62.3 cpu_cycles 5.90 ± 33.5 5.88 … 5.99 instructions 29.1 ± 562 29.1 … 29.1 cache_references 210 ± 988 208 … 211 cache_misses 8.53 ± 322 7.92 … 9.09 branch_misses 5.03 ± 19.2 5.01 … 5.07

Benchmark 2 : ./headless-browser --blink-preserve-state --dump-text pathological-write-valid.html measurement mean ± σ min … max outliers delta wall_time 1.46 ± 4.52 1.45 … 1.47 peak_rss 62.0 ± 164 61.7 … 62.4 cpu_cycles 5.88 ± 18.1 5.86 … 5.93 instructions 29.1 ± 422 29.1 … 29.1 cache_references 209 ± 1.26 207 … 212 cache_misses 8.55 ± 425 8.19 … 9.78 branch_misses 5.00 ± 27.0 4.97 … 5.07

So, despite HTMLEntitySearch::Advance being called 4.5x more in the 'no state preserved' benchmark, no difference shows up in the results.
I believe this is because the actual matching is a small part of the work being done in this benchmark, or, in other words, there is a difference but it's being drowned out by all the work being done elsewhere (JavaScript being run, tokenizer input being updated, etc). I have a hunch that Ladybird in particular might be greatly exacerbating this effect and making all this tangential work slower than it theoretically needs to be, especially in the case of updating the tokenizer input. For example, Chrome uses a rope-like SegmentedString to mitigate the cost of inserting into the middle of the input, while Ladybird currently reallocates the entire modified input on each insertion.

To sum up what I'm trying to say:
The Chrome implementation demonstrably does more work than necessary in the ' document.write one-character-at-a-time' scenario because it doesn't preserve matching state between retries due to 'not enough characters'
I am unable to create a relevant benchmark using Ladybird, likely due to some inefficiencies in the Ladybird tokenizer implementation, but a clear difference might show up when benchmarking the Chrome tokenizer itself (i.e. if you made the Chrome implementation preserve state in this scenario and benchmarked its tokenizer, I expect it might show some measurable difference)

This is important to note because it is the reason I included the following point in my 'list of things you'll need to trust in order for my benchmarking to be accurate' in the intro to the "Comparison to the major browser engines" section:
The performance characteristics exhibited would hold when going the other direction (putting my implementation into their tokenizer)
That is, I have some reason to believe 'going the other direction' may actually be slightly more favorable to my DAFSA implementation, as all the code outside of the named character reference tokenization state itself would likely do less work that muddies the benchmarking results.

A benefit of the 'lookahead' strategy🔗

An upside of the 'lookahead' strategy is that, when you know there's no active insertion point, you can always match against the full remaining input in one go. This is potentially an improvement over the 'tokenize-one-character-at-a-time' strategy if there is any significant amount of work that's done between tokenizer iterations. Here are some simple diagrams to illustrate the point:

[Diagram: in the character-by-character strategy, the "work done to move to the next character" is repeated between each of a, b, c, …; in the lookahead strategy, the "work done to skip to the end of the match" happens once for the whole run a, b, c, ….]

There are two theoretical benefits I can think of with this:
Avoiding the work done between each character may lead to better CPU cache usage/branch prediction
Moving the cursor ahead by N in one go may be more efficient than moving it ahead by one, N times
Luckily, we don't actually need to change much about our 'character-by-character tokenization' approach to get these benefits, as we only need to make it so we use lookahead whenever there's no active insertion point.
In Ladybird, that might look something like this (full implementation): BEGIN_STATE ( NamedCharacterReference ) { if ( stop_at_insertion_point == StopAtInsertionPoint :: No ) { } else { } } In practice, this seems to give up to a 1.13x speedup in our Ladybird benchmarks essentially for free: Benchmark results Benchmark 1 : ./BenchHTMLTokenizer dafsa measurement mean ± σ min … max outliers delta wall_time 114 ± 934 112 … 115 peak_rss 83.5 ± 74.9 83.3 … 83.6 cpu_cycles 226 ± 964 223 … 229 instructions 452 ± 6.81 452 … 452 cache_references 9.64 ± 86.1 9.51 … 9.86 cache_misses 418 ± 10.5 399 … 448 branch_misses 573 ± 2.51 570 … 584 Benchmark 2 : ./BenchHTMLTokenizer dafsa-lookahead measurement mean ± σ min … max outliers delta wall_time 108 ± 1.05 106 … 111 ⚡- 5.1% ± 0.4% peak_rss 83.5 ± 81.5 83.3 … 83.6 cpu_cycles 203 ± 869 201 … 205 ⚡- 10.5% ± 0.2% instructions 377 ± 10.8 377 … 377 ⚡- 16.6% ± 0.0% cache_references 9.57 ± 71.3 9.42 … 9.77 cache_misses 415 ± 7.59 401 … 429 branch_misses 553 ± 2.04 547 … 556 ⚡- 3.5% ± 0.2% Benchmark 1 : ./BenchHTMLTokenizer dafsa gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 45.8 ± 866 43.7 … 47.2 11 (10%) peak_rss 53.2 ± 67.6 53.0 … 53.4 cpu_cycles 117 ± 601 116 … 118 instructions 261 ± 6.10 261 … 261 cache_references 3.27 ± 62.6 3.19 … 3.69 cache_misses 354 ± 4.17 345 … 368 branch_misses 182 ± 4.28 178 … 213 Benchmark 2 : ./BenchHTMLTokenizer dafsa-lookahead gecko-worst-case measurement mean ± σ min … max outliers delta wall_time 40.4 ± 756 38.3 … 42.0 13 (10%) ⚡- 11.8% ± 0.5% peak_rss 53.2 ± 77.2 52.9 … 53.4 cpu_cycles 93.4 ± 546 92.5 … 95.8 ⚡- 19.9% ± 0.1% instructions 190 ± 5.41 190 … 190 ⚡- 27.1% ± 0.0% cache_references 3.31 ± 42.2 3.23 … 3.48 cache_misses 354 ± 5.07 345 … 372 branch_misses 153 ± 10.5 146 … 215 17 (14%) ⚡- 16.1% ± 1.2% Benchmark 1 : ./BenchHTMLTokenizer dafsa ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 63.2 ± 964 61.6 … 65.7 peak_rss 65.3 ± 79.1 65.0 … 65.4 cpu_cycles 112 ± 606 111 … 115 instructions 215 ± 8.83 215 … 215 cache_references 5.91 ± 107 5.78 … 6.50 cache_misses 372 ± 4.43 363 … 390 branch_misses 162 ± 1.17 160 … 165 Benchmark 2 : ./BenchHTMLTokenizer dafsa-lookahead ladybird-worst-case measurement mean ± σ min … max outliers delta wall_time 62.8 ± 1.08 61.3 … 64.3 peak_rss 65.2 ± 88.6 64.9 … 65.4 cpu_cycles 111 ± 454 110 … 112 ⚡- 1.4% ± 0.1% instructions 209 ± 7.60 209 … 209 ⚡- 2.7% ± 0.0% cache_references 5.91 ± 68.1 5.78 … 6.09 cache_misses 374 ± 5.24 365 … 397 branch_misses 164 ± 964 162 … 168 Benchmark 1 : ./BenchHTMLTokenizer dafsa all-valid measurement mean ± σ min … max outliers delta wall_time 43.3 ± 843 40.9 … 45.6 peak_rss 54.5 ± 83.6 54.3 … 54.6 cpu_cycles 100 ± 744 98.4 … 103 instructions 193 ± 11.0 193 … 193 12 (10%) cache_references 3.58 ± 40.3 3.47 … 3.70 cache_misses 363 ± 11.3 344 … 398 branch_misses 344 ± 1.99 341 … 349 Benchmark 2 : ./BenchHTMLTokenizer dafsa-lookahead all-valid measurement mean ± σ min … max outliers delta wall_time 39.4 ± 849 37.2 … 40.8 14 (11%) ⚡- 9.0% ± 0.5% peak_rss 54.5 ± 84.3 54.3 … 54.6 cpu_cycles 84.8 ± 521 84.0 … 87.4 ⚡- 15.5% ± 0.2% instructions 148 ± 6.20 148 … 148 ⚡- 23.2% ± 0.0% cache_references 3.56 ± 51.3 3.47 … 3.82 cache_misses 356 ± 7.44 342 … 384 ⚡- 2.0% ± 0.7% branch_misses 316 ± 3.08 313 … 347 ⚡- 8.1% ± 0.2% This also explains a few things that I glossed over or deferred until later throughout the article: It's the primary reason I made all the browsers' implementations conform to the 
NamedCharacterReferenceMatcher API and thereby converted them all to use the 'character-by-character tokenization' strategy (to rule out stuff like this from affecting the results without me realizing it).
It's the reason that the inexplicable benchmark results at the start of the "On the difficulty of benchmarking" section showed the Chrome implementation being faster than the DAFSA implementation. I was using a faithful 1:1 port of the Chrome named character reference state at that point, and so the difference was due to the 'lookahead' strategy rather than the matching implementation.
It's a partial explanation for why the results of the 'raw matching speed' benchmark (the one that just tests NamedCharacterReferenceMatcher directly without involving the tokenizer) don't fully translate to equivalent differences in the tokenizer benchmarks.

Another difference I didn't mention🔗

Something else that I've failed to mention until now regards exactly how backtracking is performed. Note: This tangent on backtracking is not really related to anything else; I'm only mentioning it now because I had nowhere else to put it.

In Ladybird, backtracking is very straightforward: modify the input cursor's position by subtracting N from the current position (where N is the number of overconsumed code points). This is safe to do from a code point boundary perspective because the input is always UTF-8 encoded, and named character reference matching will only ever consume ASCII characters, so going back N bytes is guaranteed to be equivalent to going back N code points.

In all the other major browsers, though, backtracking is done by re-inserting the overconsumed code points back into the input stream, at the position just after the current code point. That is, it modifies the input stream such that the next code points to-be-tokenized will be those that were just re-inserted.

As of now, I'm unsure if the way that Firefox/Chrome/Safari do it is necessary (and therefore whether or not Ladybird will need to adopt the same strategy to be fully compliant with the spec). If it is necessary, I'm either unaware of the relevant web platform test, or there is a missing web platform test that checks whatever it's necessary for. If it's not necessary, then there may be an opportunity in the other browser engines to simplify backtracking when dealing with named character references.

Further improvements to the DAFSA implementation🔗

As of the writing of this article, the DAFSA implementation that's been described so far is exactly what's in Ladybird. During the process of writing this, though, I came up with some ideas (largely inspired by the Firefox/Chrome/Safari implementations) to improve my implementation by utilizing every last possible bit I could get my hands on. This came in the form of two independent optimizations, with one of them just so happening to barely make the other possible. Note: The code examples in this section will be using Zig syntax.

A property of named character references that I missed, but that the major browser engines all take advantage of, is that the first character of a named character reference is always within the range of a-z or A-Z (inclusive).
In terms of the DAFSA, we can take advantage of this property to accelerate the search for the first character: instead of linearly scanning across the child nodes, we can:
Check if the character is an alphabetic ASCII character, and immediately reject any that aren't
Create a lookup table for alphabetic ASCII characters that has the resulting DAFSA state pre-computed
This would turn the O(n) search within the 'first layer' of the DAFSA into an O(1) lookup.

As for what needs to be stored in the lookup table, remember that we build up a 'unique index' when traversing a list of children, with any child that we iterate over adding its number field to the total:

[Diagram: a list of four sibling child nodes a, b, c, d]

So, we can pre-compute the accumulated unique index that would result when matching a character in the 'first layer,' and then store that in the lookup table. For example, the relevant data for the first four nodes in the first layer of our DAFSA looks like this:

.{ .char = 'A', .number = 27 },
.{ .char = 'B', .number = 12 },
.{ .char = 'C', .number = 36 },
.{ .char = 'D', .number = 54 },

so the corresponding lookup table entries could look like this:

0, 27, 39, 75,

We can then