In a previous blogpost, Erik wrote about how he implemented the linear time matching algorithm from [RegElk+PLDI24] in the popular regex engine RE2. In this one, we're looking at how to do the same thing for the official regex engine from the Rust language. Namely, we add support for unbounded captureless lookbehinds.
First, let's discover the newly supported feature and its limitations. Lookbehinds allow regexes to make assertions about things preceding some part of the regex pattern without considering it as part of the match. Similarly, negative lookbehinds, which we also support, assert that something is not preceding. Consider the regex
(?<=Title:\s+)\w+
which would match the following strings (matches underlined with ~ ):
But does not match:
As seen in the example, lookbehind expression can be unbounded. This means we do not know ahead of time how many characters a lookbehind will match. This is an improvement over many existing regex engines which support lookbehinds but only if they are of a bounded (like the ubiquitous PCRE) or sometimes even fixed (like Python's re) length. However, as a downside our lookbehinds do not support containing capture groups which are a feature allowing to extract a substring that matched a part of the regex pattern.
The actual implementation can be found in the PR to the rust-lang/regex repository.
However, there is one other engine in this crate that is of great importance: the meta-regex. This engine is what is used by the wrapping top-level "regex" crate and its job is to select the "best" of the other engines for performing a search. This decision procedure is based on some heuristics that don't matter much for understanding our work, but one important detail is that the engine first tries to find a match without using any regex engine if possible. This is the case when the regex consists of only a literal character sequence or a disjunction of multiple such sequences or if it can be transformed into such a disjunction. A match can then be searched by a "simple" substring search.
This crate features many different implementations of a regex engine, each with different capabilities and runtime characteristics. All engines promise to have linear runtime complexity (in the length of the haystack and size of the regex), but the constant factors differ. The main engine that we modified for our implementation is the PikeVM which performs extended NFA simulation.
This crate takes care of processing the raw regex input into a representation that can be easily used by the regex compiler. The first step is to parse the string representing a regex into the AST (abstract syntax tree) representation. At this stage all unsupported regex features are rejected, such as backreferences. Then, the AST is translated into a more normalized representation called HIR (high-level intermediate representation). Each node in the HIR stores properties which help the compiler and the engines to optimize or guarantee correctness of the regex matching process. An example of such a property is the length of the shortest and longest string that can be matched by the HIR node.
... continue reading