5 years ago, @nikic wrote:
> I can’t say a 10% improvement is making LLVM fast again, we would need a 10x improvement for it to deserve that label.
We recently open-sourced TPDE and our fast LLVM baseline back-end (TPDE-LLVM), which is 10-20x faster than the LLVM -O0 back-end while delivering similar run-time performance at 10-30% larger code size. We support a typical subset of LLVM-IR and only target x86-64 and AArch64. Posting this here, as it might be interesting to the LLVM community; questions/comments welcome!
Data on SPEC CPU 2017 (x86-64, compile-time speedup and code size relative to LLVM 19 -O0 back-end):
| Benchmark | Comp-Time (O0 IR) | Code Size (O0 IR) | Comp-Time (O1 IR) | Code Size (O1 IR) |
|---|---|---|---|---|
| 600.perl | 11.39x | 1.27x | 15.06x | 0.97x |
| 602.gcc | 12.54x | 1.32x | 17.55x | 1.01x |
| 605.mcf | 9.72x | 1.27x | 12.47x | 0.92x |
| 620.omnetpp | 21.46x | 1.24x | 26.49x | 1.03x |
| 623.xalanc | 18.99x | 1.24x | 24.80x | 0.98x |
| 625.x264 | 10.52x | 1.26x | 15.19x | 0.97x |
| 631.deepsjeng | 9.60x | 1.25x | 17.56x | 0.97x |
| 641.leela | 21.44x | 1.24x | 18.36x | 0.95x |
| 657.xz | 10.95x | 1.30x | 15.15x | 0.92x |
| geomean | 13.34x | 1.27x | 17.58x | 0.97x |
Results for AArch64 are similar (speedups are a bit higher there, since the -O0 baseline uses GlobalISel). Obviously, on optimized IR, LLVM’s optimizing back-ends fare much better in terms of run time (~2x better) and code size (~2x better), but we don’t intend to compete there.
How Does It Work?
See this paper. In essence, we do three passes: an IR cleanup/preparation pass, an analysis pass (loops + liveness), and a single codegen pass that performs lowering, register allocation, and machine-code encoding in combination.
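As a rough illustration of that structure (all types and names here are hypothetical stand-ins, not the actual TPDE API), a driver could look like this:

```cpp
// Hypothetical sketch of the three-pass pipeline described above.
// All names are illustrative; this is not the actual TPDE API.
#include <cstdint>
#include <vector>

struct Function {}; // stand-in for an IR function

// Pass 1: IR cleanup/preparation, e.g. rewriting ConstantExprs to
// instructions, so later passes only see a restricted IR subset.
void prepareIR(Function &F) { (void)F; }

// Pass 2: one analysis walk computing loop nesting and value liveness.
struct Analysis {
  std::vector<uint32_t> blockLoopDepth; // per basic block
  std::vector<uint32_t> valueLastUse;   // per SSA value (liveness)
};
Analysis analyze(const Function &F) { (void)F; return {}; }

// Pass 3: a single codegen pass that lowers instructions, allocates
// registers, and encodes machine code in one traversal (no MIR stage).
std::vector<uint8_t> codegen(const Function &F, const Analysis &A) {
  (void)F; (void)A;
  return {};
}

std::vector<uint8_t> compile(Function &F) {
  prepareIR(F);                  // pass 1
  const Analysis A = analyze(F); // pass 2
  return codegen(F, A);          // pass 3
}
```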
Features & Future Plans
Currently, the goal is to support typical Clang O0/O1 IR, so there are still many unsupported features. Flang-generated code often works, but sometimes hits missing floating-point operations. Rust code works in principle, but some popular crates use illegal (i.e., not natively supported) vector types, which are not implemented yet. (Legalizing vector types/operations is very annoying.)
Besides more IR features, our current plans include DWARF support (we already have a prototype) and better register allocation than “spill everything”. If someone provides sufficient motivation, other big items are non-ELF platforms, non-small-PIC code models, and other targets.
Speculatively Answered Questions
How to use TPDE-LLVM?
The LLVM back-end is usable as a library (e.g., for JIT compilation, including with ORC JIT), as an llc-like tool, and can be integrated into Clang (this needs a patch; plugins can’t provide a custom back-end right now). Some more details here.
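For illustration only, library use could look roughly like the sketch below; the entry-point name and signature are assumptions, not the verified API, so consult the linked details for the real interface.

```cpp
// Hypothetical library-use sketch. tpde_llvm::compileToObject is an
// assumed name and signature, not the verified TPDE-LLVM API.
#include <memory>
#include "llvm/IR/Module.h"
#include "llvm/Support/MemoryBuffer.h"

namespace tpde_llvm {
// Assumed shape: compile a module for the host target and return the
// ELF object in memory (nullptr if the IR uses unsupported features).
std::unique_ptr<llvm::MemoryBuffer> compileToObject(llvm::Module &M);
} // namespace tpde_llvm

// Compile M and hand the object to a JIT linker (e.g. ORC's
// ObjectLinkingLayer) or write it out as a .o file.
std::unique_ptr<llvm::MemoryBuffer> emitObject(llvm::Module &M) {
  return tpde_llvm::compileToObject(M);
}
```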
Why not make LLVM faster?
We did; LLVM 18 -> 20 got 18% faster on x86-64. I think another 10-20% might be reasonable to achieve, but I’d expect going beyond that to require deeper changes. And even then, a 10x speedup is unlikely to be achievable.
Which changes to LLVM-IR could allow even faster compilation?
- No `ConstantExpr` inside functions. These are hard to handle, so before compilation we rewrite them to instructions by iterating over all instruction operands to find such constants (and also walking through constant aggregates); a minimal sketch of this rewrite follows the list.
- No arbitrarily-sized struct/array values. The only (IMHO) valid reasons for struct/array values inside functions are arguments (e.g., AArch64 HFA/HVA) and multiple return values; there is no point in loading a `[6000 x {i32, i64}]`. (We currently have quadratic run time in the number of elements per value.)
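A minimal sketch of the `ConstantExpr` rewrite, written against LLVM’s public C++ API; the actual TPDE preparation pass additionally handles nested expressions, constant aggregates, and PHI operands, which this sketch skips:

```cpp
// Sketch: replace ConstantExpr operands of instructions with real
// instructions. Simplified: does not recurse into the materialized
// instruction's own operands, and skips PHI nodes, whose operands would
// need the new instruction placed in the incoming block instead.
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

void expandConstantExprs(Function &F) {
  for (BasicBlock &BB : F)
    for (Instruction &I : BB) {
      if (isa<PHINode>(I))
        continue; // see comment above
      for (Use &U : I.operands())
        if (auto *CE = dyn_cast<ConstantExpr>(U.get())) {
          Instruction *NewI = CE->getAsInstruction();
          NewI->insertBefore(&I); // materialize right before the user
          U.set(NewI);
        }
    }
}
```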
Not directly related to performance, but would make things easier:
- No direct access to thread-local globals (see e.g. the proposal here). We cannot easily generate a function call at arbitrary places, so we rewrite all accesses to go through the `llvm.threadlocal.address` intrinsic; a sketch of this rewrite follows the list.
- No arbitrary bit-width arithmetic. `i260` already doesn’t work well, yet Clang occasionally generates such types (e.g., for bitfield structs); we don’t support multi-word integers other than `i128`.
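A minimal sketch of that thread-local rewrite against LLVM’s public C++ API, simplified to one global and one function; the real pass must cover every access, including ones buried in constant expressions:

```cpp
// Sketch: route direct uses of a thread-local global inside one function
// through the llvm.threadlocal.address intrinsic. Simplified: only
// handles uses that are instruction operands within F.
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

void rewriteTLSUses(GlobalVariable &GV, Function &F) {
  if (!GV.isThreadLocal())
    return;
  // Materialize the address once at the function entry.
  IRBuilder<> B(&*F.getEntryBlock().getFirstInsertionPt());
  Value *Addr = B.CreateThreadLocalAddress(&GV);
  GV.replaceUsesWithIf(Addr, [&](Use &U) {
    auto *I = dyn_cast<Instruction>(U.getUser());
    // Rewrite uses in F, but not the intrinsic call itself.
    return I && I->getFunction() == &F && I != Addr;
  });
}
```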
Random “Fun” Facts