Today, I'll be writing about the aegraph, or acyclic egraph, the data structure at the heart of Cranelift's mid-end optimizer. I introduced this approach in 2022 and, after a somewhat circuitous path involving one full rewrite, a number of interesting realizations and "patches" to the initial idea, various discussions with the wider e-graph community (including a talk (slides) at the EGRAPHS workshop at PLDI 2023 and a recent talk and discussions at the e-graphs Dagstuhl seminar), and a whole bunch of contributed rewrite rules over the past three years, it is time that I describe the why (why an e-graph? what benefits does it bring?), the how (how did we escape the pitfalls of full equality saturation? how did we make this efficient enough to productionize in Cranelift?), and the how much (does it help? how can we evaluate it against alternatives?).
For those who are already familiar with Cranelift's mid-end and its aegraph, note that I'm taking a slightly different approach in this post. I've come to the viewpoint that the "sea-of-nodes" aspect of our aegraph, and the translation passes we've designed to translate into and out of it (with optimizations fused in along the way), are actually more fundamental than the "multi-representation" part of the aegraph, or in other words, the "equivalence class" part itself. I'm choosing to introduce the ideas sea-of-nodes-first in this post, so we will see a "trivial eclass of one enode" version of the aegraph first (no union nodes), then motivate unions later. In actuality, when I was experimenting with and then building this functionality in Cranelift in 2022, the desire to integrate e-graphs came first, and aegraphs were created to make them practical; the pedagogy and design taxonomy have only become clear to me over time. With that, let's jump in!
Initial context: Fixpoint Loops and the Pass-Ordering Problem
Around May of 2022, I had introduced a simple alias analysis and related optimizations (removing redundant loads, and doing store-to-load forwarding). It worked fine on all of the expected test cases, and we saw real speedups on a few benchmarks (e.g., 5% on meshoptimizer here), but it also raised a new question: how should we integrate this pass with our other optimization passes, which at the time included GVN (global value numbering), LICM (loop-invariant code motion), constant propagation, and some algebraic rewrites?
To see why this is an interesting question, consider how GVN, which canonicalizes values, and redundant load elimination interact, on the following IR snippet:
```
v2 = load.i64 v0+8
v3 = iadd v2, v1   ;; e.g., array indexing
v4 = load.i8 v3
;; ... (no stores or other side effects here) ...
v10 = load.i64 v0+8
v11 = iadd v10, v1
v12 = load.i8 v11
```
Redundant load elimination (RLE) will be able to see that the load defining `v10` can be removed, and `v10` can be made an alias of `v2`, in a single pass. In a perfect world, we should then be able to see that `v11` becomes the same as `v3` by means of GVN's canonicalization, and subsequently, `v12` becomes an alias of `v4`. But those last two steps imply a tight cooperation between two different optimization passes: we need to run one full pass of RLE (result: `v10` rewritten), then one full pass of GVN (result: `v11` rewritten), then one additional full pass of RLE (result: `v12` rewritten). One can see that an arbitrarily long chain of such reasoning steps, bouncing through different passes, might require an arbitrarily long sequence of pass invocations to fully simplify. Not good!
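To make the alternation concrete, here is a toy model of the snippet above. All names here (`Expr`, `run_pass`, the alias map) are invented for illustration; this is not Cranelift's actual IR or code, and it assumes, as in the snippet, that there are no intervening stores. Each call to `run_pass` is one whole-function pass of either RLE (loads) or GVN (pure ops), and we count how many passes make progress:

```rust
use std::collections::HashMap;

// Hypothetical value expressions: a load from a symbolic address, or a
// pure integer add of two values.
#[derive(Clone, PartialEq, Eq, Hash)]
enum Expr {
    Load(String),
    Iadd(String, String),
}

// Follow alias links to a value's canonical representative.
fn resolve(aliases: &HashMap<String, String>, mut v: String) -> String {
    while let Some(next) = aliases.get(&v) {
        v = next.clone();
    }
    v
}

// Canonicalize an expression's operands through the alias map.
fn canon(aliases: &HashMap<String, String>, e: &Expr) -> Expr {
    match e {
        Expr::Load(a) => Expr::Load(resolve(aliases, a.clone())),
        Expr::Iadd(a, b) => {
            Expr::Iadd(resolve(aliases, a.clone()), resolve(aliases, b.clone()))
        }
    }
}

// One whole-function pass: alias any value whose canonicalized expression
// matches an earlier one. `loads_only` selects RLE (loads) vs. GVN (pure
// ops). Returns whether anything changed.
fn run_pass(
    defs: &[(String, Expr)],
    aliases: &mut HashMap<String, String>,
    loads_only: bool,
) -> bool {
    let mut seen: HashMap<Expr, String> = HashMap::new();
    let mut changed = false;
    for (name, expr) in defs {
        if aliases.contains_key(name) {
            continue;
        }
        if matches!(expr, Expr::Load(_)) != loads_only {
            continue;
        }
        let c = canon(aliases, expr);
        if let Some(prev) = seen.get(&c) {
            aliases.insert(name.clone(), prev.clone());
            changed = true;
        } else {
            seen.insert(c, name.clone());
        }
    }
    changed
}

fn main() {
    let s = |x: &str| x.to_string();
    // The IR snippet from the text, in definition order.
    let defs = vec![
        (s("v2"), Expr::Load(s("v0+8"))),
        (s("v3"), Expr::Iadd(s("v2"), s("v1"))),
        (s("v4"), Expr::Load(s("v3"))),
        (s("v10"), Expr::Load(s("v0+8"))),
        (s("v11"), Expr::Iadd(s("v10"), s("v1"))),
        (s("v12"), Expr::Load(s("v11"))),
    ];
    let mut aliases = HashMap::new();
    // Alternate RLE and GVN until neither makes progress.
    let (mut useful_passes, mut idle, mut i) = (0, 0, 0);
    while idle < 2 {
        let rle_turn = i % 2 == 0;
        if run_pass(&defs, &mut aliases, rle_turn) {
            useful_passes += 1;
            idle = 0;
        } else {
            idle += 1;
        }
        i += 1;
    }
    assert_eq!(aliases.get("v10"), Some(&s("v2")));
    assert_eq!(aliases.get("v11"), Some(&s("v3")));
    assert_eq!(aliases.get("v12"), Some(&s("v4")));
    // Three productive whole-function passes: RLE, then GVN, then RLE again.
    assert_eq!(useful_passes, 3);
}
```

Each pass only unlocks one link of the chain, so fully simplifying takes three productive whole-function traversals; a deeper chain would take proportionally more.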
This is known as the pass-ordering problem in the study of compilers and is a classical heuristic question with no easy answers as long as the passes remain separate, coarse-grained algorithms (i.e., not interwoven). To permit some interesting cases to work in the initial Cranelift integration of alias analysis-based rewrites, I made a somewhat ad-hoc choice to invoke GVN once after the alias-analysis rewrite pass.
But this choice is clearly arbitrary and wastes compilation effort in the common case; we should be able to do better. In general, the solution should reason about all passes' possible rewrites in a unified framework, and interleave them in a fine-grained way: so, for example, if we can apply RLE then GVN five times in a row just for one localized expression, we should be able to do that, without running each of these passes over the whole function body. In other words, we want a "single fixpoint loop" that iterates until optimization is done at a fine granularity.
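Here is a minimal sketch of that fused idea, under the same assumptions as before (invented names, straight-line code, no intervening stores; not Cranelift's actual implementation): a single forward sweep that canonicalizes each instruction's operands first and then applies both the RLE-style and GVN-style lookup at every instruction, so each rewrite immediately enables the next.

```rust
use std::collections::HashMap;

// Hypothetical value expressions, as in the earlier example.
#[derive(Clone, PartialEq, Eq, Hash)]
enum Expr {
    Load(String),
    Iadd(String, String),
}

// Follow alias links to a value's canonical representative.
fn resolve(aliases: &HashMap<String, String>, mut v: String) -> String {
    while let Some(next) = aliases.get(&v) {
        v = next.clone();
    }
    v
}

fn main() {
    let s = |x: &str| x.to_string();
    let defs = vec![
        (s("v2"), Expr::Load(s("v0+8"))),
        (s("v3"), Expr::Iadd(s("v2"), s("v1"))),
        (s("v4"), Expr::Load(s("v3"))),
        (s("v10"), Expr::Load(s("v0+8"))),
        (s("v11"), Expr::Iadd(s("v10"), s("v1"))),
        (s("v12"), Expr::Load(s("v11"))),
    ];
    let mut aliases: HashMap<String, String> = HashMap::new();
    let mut seen: HashMap<Expr, String> = HashMap::new();
    // Single sweep: operands are rewritten to their canonical values
    // *before* we look the instruction up, so the RLE result for v10
    // feeds the GVN match for v11, which feeds the RLE match for v12.
    for (name, expr) in &defs {
        let c = match expr {
            Expr::Load(a) => Expr::Load(resolve(&aliases, a.clone())),
            Expr::Iadd(a, b) => {
                Expr::Iadd(resolve(&aliases, a.clone()), resolve(&aliases, b.clone()))
            }
        };
        if let Some(prev) = seen.get(&c) {
            aliases.insert(name.clone(), prev.clone());
        } else {
            seen.insert(c, name.clone());
        }
    }
    // All three redundancies resolved in one traversal.
    assert_eq!(aliases.get("v10"), Some(&s("v2")));
    assert_eq!(aliases.get("v11"), Some(&s("v3")));
    assert_eq!(aliases.get("v12"), Some(&s("v4")));
}
```

Of course, this single-sweep trick is only this easy for straight-line, forward-propagating rewrites; the real setting has control flow, stores, and rewrites with no obvious best direction, which is where the machinery introduced below comes in.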
Three Building Blocks: Rewrites, Code Motion, and Canonicalization