Tech News
← Back to articles

RE#: how we built the fastest regex engine in F#

read original related products more articles

around a year ago, we built a regex engine in F# that not only outperformed the ones in dotnet, but went above and beyond competing with every other industrial regex engine on a large set of industry-standard benchmarks. additionally, it supports the full set of boolean operators (union, intersection, complement) and even a form of context-aware lookarounds, which no other engine has while preserving O(n) search-time complexity. the paper was published at POPL 2025, and i figured it’s time to open source the engine and share the story behind it. consider it a much more casual and chatty version of the paper, with more focus on engineering aspects that went into it.

#r " nuget: resharp " // find all matches Resharp.Regex ( " hello.*world " ) .Matches ( " hello world! " ) // intersection: contains "cat" AND "dog" AND is 5-15 chars long Resharp.Regex ( " _*cat_*&_*dog_*&_{5,15} " ) .Matches ( input ) // complement: does not contain "1" Resharp.Regex ( " ~(_*1_*) " ) .Matches ( input )

brief introduction

almost every regex engine today descends from one of two approaches: Thompson’s NFA construction (1968) or backtracking (1994). Thompson-style engines (grep, RE2, Rust’s regex) give you linear-time guarantees but only support the “standard” fragment - | and * . backtracking engines (the rest, 95% chance the one you’re using) give you a mix of advanced features like backreferences, lookarounds.., but are unreliable, and can blow up to exponential time on adversarial inputs, which is a real security concern known as ReDoS. to be more precise, this exponential behavior is not the only problem with backtracking engines - they also handle the OR ( | ) operator much slower, but let’s try to start with the big picture.

neither camp supports intersection (&) or complement (~), which is a shame because it’s been known as early as 1964, but forgotten since then, before being brought to attention again in 2009, or as Owens, Reppy and Turon put it, it was “lost in the sands of time”.. and subsequently forgotten again until 2019, when it first saw industrial use for credential scanning in SRM, albeit these operators were not exposed to the user back then. RE# is inspired by existing implementations of SRM and .NET NonBacktracking engine (2023), but in a way that hadn’t been done before, and with a lot of engineering work to make it fast and practical for real-world use cases.

a big goal of our paper was to really highlight how useful these operators really are. making it fast was more of a side-goal, so there would be less to complain about, as there was already a lot of disbelief, and we got comments such as it being a “theoretical curiosity”. i spent over a year experimenting with different approaches (there’s an initial draft on arXiv from 2023 where lookarounds had much worse complexity), but finally we found a combination that works extremely well in practice, and it helped me find a deeper intuition for what works both in theory and in the real world.

the result is RE#, the first general-purpose regex engine to support intersection and complement with linear-time guarantees, and also the overall fastest regex engine on a large set of benchmarks. we also wanted to address some more things along the way, like returning the correct matches (the ones you meant, leftmost-longest), when the default semantics are given by the PCRE implementation. or put another way, “it’s correct if it does whatever the one that blows up and causes denial of service does”.

Brzozowski derivatives

a core building block of RE# is the concept of regex derivatives, which is equally old, and equally forgotten until 2009 when it was rediscovered and 10 years later implemented in .NET for credential scanning. both derivatives and intersection/complement operators originally came from a 1964 paper by Janusz Brzozowski (below).

don’t let the complex notation scare you - the core idea is extremely simple. the derivative of a regex R and a character c is simply whatever is left to match after removing the first character.

... continue reading