
Regular expressions that work "everywhere"

Why This Matters

This article highlights the challenges of using regular expressions across different tools due to inconsistent implementations and syntax. By identifying a common subset of regex features supported universally in tools like sed, awk, grep, and Emacs, developers can write more portable and reliable code, especially in constrained environments. This approach enhances interoperability and reduces frustration for both developers and users.

Key Takeaways

The most frustrating aspect of regular expressions is that implementations vary. Features supported in one tool may not be supported at all in another tool, or they may be supported with slightly different syntax.

I learned regular expressions in the context of Perl, a maximalist regex environment. This leads to frustration when features I expect to work are missing in other tools [1]. One way around this is to use Perl analogs of other tools, but this is very non-standard. I want to be able to send colleagues and clients code that works out of the box.

As I mentioned in my post on computational survivalism, I occasionally need to work on computers that I cannot install software on. So a better approach is to identify a subset of regex features that work everywhere. The stricter your definition of “everywhere,” the smaller this subset becomes. The strictest subset would be

literals

character classes […]

the special characters . * ^ $
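As a rough illustration of the strictest subset, the Python sketch below checks whether a pattern stays inside it. The function name `in_strict_subset` and the set of rejected characters are assumptions made for this example, not something the list above specifies: it rejects `+ ? { } ( ) |` outside brackets because their meaning differs between basic and extended syntax, and it allows a backslash only in front of the subset's own special characters.

```python
import re

# Characters treated as non-portable outside a bracket expression.
# This set is an assumption of the sketch: these characters switch between
# literal and special depending on basic vs. extended regex syntax.
AMBIGUOUS = set("+?{}()|")

def in_strict_subset(pattern: str) -> bool:
    """Return True if pattern uses only literals, [...] classes, and . * ^ $"""
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\":
            # Allow a backslash only to escape the subset's own special characters.
            if i + 1 >= n or pattern[i + 1] not in ".*^$[]\\":
                return False
            i += 2
        elif c == "[":
            # Skip over the bracket expression. A ']' directly after '[' or '[^'
            # is a literal. POSIX named classes are not parsed specially.
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":
                j += 1
            while j < n and pattern[j] != "]":
                j += 1
            if j >= n:
                return False  # unterminated bracket expression
            i = j + 1
        elif c in AMBIGUOUS:
            return False
        else:
            i += 1
    return True

# A pattern written in the subset: one or more digits without using '+'.
digits = "^[0-9][0-9]*$"
print(in_strict_subset(digits), bool(re.search(digits, "2024")))
print(in_strict_subset("[0-9]+"), in_strict_subset(r"\d"))
```

The idiom `[0-9][0-9]*` stands in for `[0-9]+`, which is the kind of rewrite the strict subset forces.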

A more relaxed definition of “everywhere” would be the tools you most care about. Currently the tools I most want to use with regular expressions are sed, awk, grep, and Emacs.

Awk as lowest common denominator

If you use the GNU versions of sed, awk, and grep, and use the -E option with sed and grep, then the list of common features is bigger. The regular expression features of the three tools are similar, and awk’s features are supported in the other tools, with one exception: word boundaries in awk are \< and \> rather than \b and \B .
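To make the word-boundary difference concrete, here is a minimal sketch that rewrites the `\<` and `\>` anchors for an engine that only has `\b`. Python's `re` module stands in for such an engine; it is not one of the tools discussed above, and the helper name `gnu_word_anchors_to_b` is hypothetical. The translation assumes that "start of word" means a boundary followed by a word character and "end of word" means a boundary preceded by one.

```python
import re

def gnu_word_anchors_to_b(pattern: str) -> str:
    r"""Rewrite \< and \> in terms of \b plus a lookaround (hypothetical helper).

    Limitation of this sketch: it does not treat bracket expressions
    specially, so a \< inside [...] would also be rewritten.
    """
    def swap(match: re.Match) -> str:
        ch = match.group(1)
        if ch == "<":
            return r"\b(?=\w)"    # boundary with a word character to the right
        if ch == ">":
            return r"\b(?<=\w)"   # boundary with a word character to the left
        return match.group(0)     # leave every other escape pair untouched

    # Scanning escape pairs left to right means an escaped backslash
    # followed by '<' is not mistaken for the \< anchor.
    return re.sub(r"\\(.)", swap, pattern, flags=re.DOTALL)

translated = gnu_word_anchors_to_b(r"\<cat\>")
print(translated)
print(re.findall(translated, "cat concat catalog cat."))
```

With this input, only the two standalone occurrences of "cat" match; "concat" and "catalog" are skipped because the anchors fail inside a word.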

I wrote about Awk’s regex features here.
