Adding lookbehinds to rust-lang/regex

In a previous blogpost, Erik wrote about how he implemented the linear time matching algorithm from [RegElk+PLDI24] in the popular regex engine RE2. In this one, we're looking at how to do the same thing for the official regex engine from the Rust language. Namely, we add support for unbounded captureless lookbehinds. First, let's discover the newly supported feature and its limitations. Lookbehinds allow regexes to make assertions about things preceding some part of the regex pattern without considering it as part of the match. Similarly, negative lookbehinds, which we also support, assert that something is not preceding. Consider the regex (?<=Title:\s+)\w+ which would match the following strings (matches underlined with ~ ): But does not match: As seen in the example, lookbehind expression can be unbounded. This means we do not know ahead of time how many characters a lookbehind will match. This is an improvement over many existing regex engines which support lookbehinds but only if they are of a bounded (like the ubiquitous PCRE) or sometimes even fixed (like Python's re) length. However, as a downside our lookbehinds do not support containing capture groups which are a feature allowing to extract a substring that matched a part of the regex pattern. The actual implementation can be found in the PR to the rust-lang/regex repository. However, there is one other engine in this crate that is of great importance: the meta-regex. This engine is what is used by the wrapping top-level "regex" crate and its job is to select the "best" of the other engines for performing a search. This decision procedure is based on some heuristics that don't matter much for understanding our work, but one important detail is that the engine first tries to find a match without using any regex engine if possible. This is the case when the regex consists of only a literal character sequence or a disjunction of multiple such sequences or if it can be transformed into such a disjunction. A match can then be searched by a "simple" substring search. This crate features many different implementations of a regex engine, each with different capabilities and runtime characteristics. All engines promise to have linear runtime complexity (in the length of the haystack and size of the regex), but the constant factors differ. The main engine that we modified for our implementation is the PikeVM which performs extended NFA simulation. This crate takes care of processing the raw regex input into a representation that can be easily used by the regex compiler. The first step is to parse the string representing a regex into the AST (abstract syntax tree) representation. At this stage all unsupported regex features are rejected, such as backreferences. Then, the AST is translated into a more normalized representation called HIR (high-level intermediate representation). Each node in the HIR stores properties which help the compiler and the engines to optimize or guarantee correctness of the regex matching process. An example of such a property is the length of the shortest and longest string that can be matched by the HIR node. Under the hood, the regex crate is only a thin wrapper around the much more elaborate "regex-automata" crate, which provides several different engine implementations to match regexes. Furthermore, the code for parsing regexes is split into yet another crate called "regex-syntax". In the following, we will give an overview of the architecture of both these crates. In the Rust ecosystem, libraries that are published are called crates. The language team maintains a handful of official crates, among which there is one called "regex" that provides exactly what you would expect: a regex matching engine. Implementation work Most of the initial work to get lookbehinds up and running consisted of implementing the two new NFA state types "WriteLookAround" and "CheckLookAround" (code). The first is essentially the equivalent of a "match" state but for a lookbehind, while the second instructs the regex engine to check whether or not a certain lookbehind matches at the current string position. We named them "*LookAround" instead of "*LookBehind" because, in principle, they are enough to implement lookaheads as well. In this section and the following sub-sections, we will dive deeper into the used lookbehind algorithm and the issues that arise from it. Once these states were properly produced by the regex-parser and compiler pipeline, and once we told the PikeVM how to interpret these instructions, we had an implementation that was basically working. The kind of changes to the code base were very similar in spirit to the changes done to RE2, albeit a bit more verbose, mainly due to Rust's strict type system and the way the crate is structured. However, there were certain challenges worth looking at in detail. Some of them showed up during the initial implementation phase while others surfaced only once we started benchmarking the resulting engine. Frontend Updated frontend pipeline: from raw input to an HIR. To support lookbehinds, the parser had to accept them while still rejecting lookaheads. Additionally, if a lookbehind contained a capture group, the parser would also need to reject it, due to it being unsupported currently by our implementation. The error messages indicate to the user that the syntax is recognized but not supported. During the AST to HIR translation, the stored properties are updated to account for the different semantics of lookbehinds. For example, a lookbehind does not contribute to the length of the match, so the length properties of the HIR node are set to zero. A new property is added to the HIR node that indicates whether or not the node contains a lookbehind. This is used in places where one has to behave differently depending on whether or not the regex contains a lookbehind. The Meta-Engine As explained above, the meta-engine (or meta-regex) bundles all other engines into one and chooses the "most efficient" of them for any search executed. Since we only implemented lookbehind support for the PikeVM, we had to make sure the meta-engine only considers the PikeVM as soon as there are any lookbehind expressions in the regex. This includes forbidding of skipping all engines and using a substring search. This was explicitly disabled, since lookbehinds are not part of the match, yet contribute to what is considered a match. This turned out to be easier than anticipated. The selection strategy essentially boils down to trying to build each of the engines in order of efficiency and using the first one that succeeds. Therefore, we simply had to make sure the build process of all engines other than the PikeVM fails with a proper error, whenever there is a lookbehind in the regex. There was one caveat though: the meta-regex expected the build process of the bounded backtracker to always succeed, because a backtracking engine supports all regex features. The only reason the backtracking engine would not be used is if the configured max memory usage is not sufficient to guarantee linear runtime. Therefore, we added a manual condition that checks for lookbehinds explicitly and skips the backtracker in that case (code). Lookbehind Threads When we started benchmarking, it became evident that something caused matching with lookbehinds to run longer than needed. Once we figured out the cause, it wasn't too hard to fix, but let's first take a look at how we implemented the problematic part initially. The concept of the algorithm is to run multiple NFA simulations in lockstep, one automaton for each lookbehind and one for the main regex. However, in practice it would have taken quite a bit of refactoring to account for the possibility of actually running multiple simulations in lockstep. So how did we do it instead? Well, in the case of the PikeVM, running multiple NFAs is equivalent to running the automaton that consists of a union of all individual sub-automata, since the PikeVM does not perform a depth first search like the backtracking engine. Therefore, that is exactly what we did: compile the automaton for each lookbehind and for the main regex and then patch them together in one big union state from which the simulation is then started. Now for the algorithm to work correctly, it is essential that the NFAs corresponding to lookbehinds that are more deeply nested are executed first. This is because we need the results of inner lookbehinds in order to correctly determine the results of outer ones. To make sure this is the case, the live threads of the inner lookbehinds must have higher priority than the ones of the outer lookbehinds. For the particular implementation of the PikeVM we were dealing with, this means that the threads of the inner lookbehinds must be added to the ordered set of live threads first, which is easy to guarantee by ordering the corresponding sub-automaton earlier in the top-level union state. So far so good, but now comes the problematic part: priority not only determines which threads are processed first, it also determines what match is returned when multiple are found. This is important to uphold the same match semantics as a backtracking engine. Consider, for example, the regex "a*b|a" on the haystack "aaab". At the beginning of the search, the PikeVM registers a match immediately after consuming the first character. However, because the first alternative in a regex takes priority over the second, the PikeVM cannot return that match yet. It must continue scanning the haystack until it reaches the end, at which point it registers that the first alternative actually matches as well and this is the match that is finally returned. Importantly, even on the haystack "aaaa", where the first alternative does not match, the PikeVM must still scan to the end of the string to realize that the first alternative does not match and it is ok to return the match of the second alternative; there is simply no way to know beforehand. What does this have to do with lookbehinds? Remember we compile the NFA in a way where all sub-automata have higher priority than the main regex. What's more, every lookbehind sub-automaton has a ".*?" prefix, in order to make sure we capture all lookbehind matches during our search and not only one at the beginning of the haystack. This in combination means that there are always live threads that have higher priority than any thread that could produce a match from the main regex, which in turn means that the PikeVM will scan to the end of the haystack regardless of found matches. Because this behavior arises as soon as any lookbehind is present, no matter the complexity of the lookbehind or the main regex, this was quite problematic performance wise. As already stated, the solution to this problem was more or less straight forward: we needed to keep the live threads of the lookbehinds in a place separate from the threads that belong to the main regex. That way, as soon as the ordered set of live threads of the main regex runs empty, we can be sure there won't be any more matches and similarly, if there is a match of the highest priority main regex thread, this will actually be registered as such and the search can be concluded immediately. To make this separation work required to refactor the NFA compilation a bit. We still compile all sub-automata into a single NFA struct, albeit without patching them together with a union anymore. Additionally, the NFA struct now has a field that contains the index of all the sub-automata starting states. This is necessary because we now need to explicitly initialize the starting thread for each sub-automaton. Resulting NFA structure after separating the lookbehinds and removing the union. It should be noted that we also considered the idea of tagging the live threads with the information of whether or not they belong to a lookbehind (either by storing this information in the thread itself or in the state of the NFA that is referenced by the thread). While this would most likely have worked too, the code for detecting whether there are still live threads from the main regex would, in comparison, be much more complicated and most likely less performant.

Adding lookbehinds to rust-lang/regex

Share this article

Related Articles