QRS: Epsilon Wrangling

I haven’t shipped any new features for Quamina in many months, partly due to a flow of real-life distractions, but also I’m up against tough performance problems in implementing Regular Expressions at massive scale. I’m still looking for a breakthrough, but have learned things about building and executing finite automata that I think are worth sharing. This piece has to do with epsilons; anyone who has studied finite automata will know about them already, but I’ll offer background for those people to skip.

I’ve written about this before in Epsilon Love. A commenter pointed out that the definition of “epsilon” in that piece is not quite right per standard finite-automata theory, but it’s still a useful in that it describes how epsilons support constructs like the shell-style “ * ”.

Background · Finite automata come in two flavors: Deterministic (DFA) and Nondeterministic (NFA). DFAs move from state to state one input symbol at a time: it’s simple and easy to understand and to implement. NFAs have two distinguishing characteristics: First, when you’re in a state and an input symbol arrives, you can transfer to more than one other state. Second, a state can have “epsilon transitions” (let’s say “ε” for epsilon), which can happen any time at all while you’re in that state, input or no input.

NFAs are more complicated to traverse (will discuss below) but you need them if you want to implement regular expressions with . and ? and * and so on. You can turn any NFA into a DFA, and I’ll come back to that subject in a future piece.

For implementing NFAs, I’ve been using Thompson's construction, where “Thompson” is Ken Thompson, co-parent of Unix. This technique is also nicely described by Russ Cox in Regular Expression Matching Can Be Simple And Fast. You don’t need to learn it to understand this piece, but I’ll justify design choices by saying “per Thompson”.

I’m going to discuss two specific issues today, ε-closures and a simpler NFA definition.

ε-closures · To set the stage, consider this regexp: A?B?C?X

It should match “X” and “BX” and “ACX” and so on, but not “CAX” or “XX”. Thompson says that you implement A? with a transition to the next state on “A” and another ε-transition to that next state; because if you see an “A” you should transition, but then you can transition anyhow even if you don’t.

The resulting NFA looks like this:

In finite-automaton math, states are usually represented by the letter “q” followed by a number (usually italicized and subscripted, like q 0 , but not here, sorry). Note q4 ’s double circle which means it’s a goal state, i.e. if we get here we’ve matched the regexp. I should add that this was produced with draw.io, which seems to make this sort of thing easy.

... continue reading