5 years ago, @nikic wrote:
> I can’t say a 10% improvement is making LLVM fast again, we would need a 10x improvement for it to deserve that label.
We recently open-sourced TPDE and our fast LLVM baseline back-end (TPDE-LLVM), which is 10-20x faster than the LLVM -O0 back-end while delivering similar run-time performance at 10-30% larger code size. We support a typical subset of LLVM-IR and only target x86-64 and AArch64. Posting this here, as it might be interesting to the LLVM community; questions/comments welcome!
Data on SPEC CPU 2017 (x86-64, compile-time speedup and code size relative to LLVM 19 -O0 back-end):
| Benchmark | Comp-Time (O0 IR) | Code Size (O0 IR) | Comp-Time (O1 IR) | Code Size (O1 IR) |
|---|---|---|---|---|
| 600.perl | 11.39x | 1.27x | 15.06x | 0.97x |
| 602.gcc | 12.54x | 1.32x | 17.55x | 1.01x |
| 605.mcf | 9.72x | 1.27x | 12.47x | 0.92x |
| 620.omnetpp | 21.46x | 1.24x | 26.49x | 1.03x |
| 623.xalanc | 18.99x | 1.24x | 24.80x | 0.98x |
| 625.x264 | 10.52x | 1.26x | 15.19x | 0.97x |
| 631.deepsjeng | 9.60x | 1.25x | 17.56x | 0.97x |
| 641.leela | 21.44x | 1.24x | 18.36x | 0.95x |
| 657.xz | 10.95x | 1.30x | 15.15x | 0.92x |
| geomean | 13.34x | 1.27x | 17.58x | 0.97x |
Results for AArch64 are similar (speedups are a bit higher there, since the -O0 baseline uses GlobalISel). Obviously, on optimized IR, LLVM’s optimizing back-ends fare much better in terms of run time (~2x better) and code size (~2x better), but we don’t intend to compete there.
How Does It Work?
See this paper. In essence, we do three passes: an IR cleanup/preparation pass, an analysis pass (loops + liveness), and a single codegen pass that performs lowering, register allocation, and machine-code encoding in combination.
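As a rough illustration of that structure (all types and names here are hypothetical stand-ins, not the actual TPDE API), a driver could look like this:

```cpp
// Hypothetical sketch of the three-pass pipeline described above.
// All names are illustrative; this is not the actual TPDE API.
#include <cstdint>
#include <vector>

struct Function {}; // stand-in for an IR function

// Pass 1: IR cleanup/preparation, e.g. rewriting ConstantExprs to
// instructions, so later passes only see a restricted IR subset.
void prepareIR(Function &F) { (void)F; }

// Pass 2: one analysis walk computing loop nesting and value liveness.
struct Analysis {
  std::vector<uint32_t> blockLoopDepth; // per basic block
  std::vector<uint32_t> valueLastUse;   // per SSA value (liveness)
};
Analysis analyze(const Function &F) { (void)F; return {}; }

// Pass 3: a single codegen pass that lowers instructions, allocates
// registers, and encodes machine code in one traversal (no MIR stage).
std::vector<uint8_t> codegen(const Function &F, const Analysis &A) {
  (void)F; (void)A;
  return {};
}

std::vector<uint8_t> compile(Function &F) {
  prepareIR(F);                  // pass 1
  const Analysis A = analyze(F); // pass 2
  return codegen(F, A);          // pass 3
}
```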
Features & Future Plans
Currently, the goal is to support typical Clang O0/O1 IR, so there are still many unsupported features. Flang-generated code often works, but sometimes hits missing floating-point operations. Rust code works in principle, but some popular crates use illegal (i.e., not natively supported) vector types, which are not implemented yet. (Legalizing vector types/operations is very annoying.)
Besides more IR features, our current plans include DWARF support (we already have a prototype) and better register allocation than “spill everything”. If someone provides sufficient motivation, other big items are non-ELF platforms, non-small-PIC code models, and other targets.
Speculatively Answered Questions
How to use TPDE-LLVM?
The LLVM back-end is usable as a library (e.g., for JIT compilation, including with ORC JIT), as an llc-like tool, and can be integrated into Clang (this needs a patch; plugins can’t provide a custom back-end right now). Some more details here.
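For illustration only, library use could look roughly like the sketch below; the entry-point name and signature are assumptions, not the verified API, so consult the linked details for the real interface.

```cpp
// Hypothetical library-use sketch. tpde_llvm::compileToObject is an
// assumed name and signature, not the verified TPDE-LLVM API.
#include <memory>
#include "llvm/IR/Module.h"
#include "llvm/Support/MemoryBuffer.h"

namespace tpde_llvm {
// Assumed shape: compile a module for the host target and return the
// ELF object in memory (nullptr if the IR uses unsupported features).
std::unique_ptr<llvm::MemoryBuffer> compileToObject(llvm::Module &M);
} // namespace tpde_llvm

// Compile M and hand the object to a JIT linker (e.g. ORC's
// ObjectLinkingLayer) or write it out as a .o file.
std::unique_ptr<llvm::MemoryBuffer> emitObject(llvm::Module &M) {
  return tpde_llvm::compileToObject(M);
}
```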
Why not make LLVM faster?
We did; LLVM 18 -> 20 got 18% faster on x86-64. I think another 10-20% might be reasonable to achieve, but I’d expect going beyond that to require deeper changes. And even then, a 10x speedup is unlikely to be achievable.
Which changes to LLVM-IR could allow even faster compilation?
- No `ConstantExpr` inside functions. These are hard to handle, so before compilation we rewrite them to instructions by iterating over all instruction operands to find such constants (and also walking through constant aggregates); a minimal sketch of this rewrite follows the list.
- No arbitrarily-sized struct/array values. The only (IMHO) valid reasons for struct/array values inside functions are arguments (e.g., AArch64 HFA/HVA) and multiple return values; there is no point in loading a `[6000 x {i32, i64}]`. (We currently have quadratic run time in the number of elements per value.)
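A minimal sketch of the `ConstantExpr` rewrite, written against LLVM’s public C++ API; the actual TPDE preparation pass additionally handles nested expressions, constant aggregates, and PHI operands, which this sketch skips:

```cpp
// Sketch: replace ConstantExpr operands of instructions with real
// instructions. Simplified: does not recurse into the materialized
// instruction's own operands, and skips PHI nodes, whose operands would
// need the new instruction placed in the incoming block instead.
#include "llvm/IR/Constants.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

void expandConstantExprs(Function &F) {
  for (BasicBlock &BB : F)
    for (Instruction &I : BB) {
      if (isa<PHINode>(I))
        continue; // see comment above
      for (Use &U : I.operands())
        if (auto *CE = dyn_cast<ConstantExpr>(U.get())) {
          Instruction *NewI = CE->getAsInstruction();
          NewI->insertBefore(&I); // materialize right before the user
          U.set(NewI);
        }
    }
}
```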
Not directly related to performance, but would make things easier:
- No direct access to thread-local globals (see e.g. the proposal here). We cannot easily generate a function call at arbitrary places, so we rewrite all accesses to go through the `llvm.threadlocal.address` intrinsic; a sketch of this rewrite follows the list.
- No arbitrary bit-width arithmetic. `i260` already doesn’t work well, yet Clang occasionally generates such types (e.g., for bitfield structs); we don’t support multi-word integers other than `i128`.
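A minimal sketch of that thread-local rewrite against LLVM’s public C++ API, simplified to one global and one function; the real pass must cover every access, including ones buried in constant expressions:

```cpp
// Sketch: route direct uses of a thread-local global inside one function
// through the llvm.threadlocal.address intrinsic. Simplified: only
// handles uses that are instruction operands within F.
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

void rewriteTLSUses(GlobalVariable &GV, Function &F) {
  if (!GV.isThreadLocal())
    return;
  // Materialize the address once at the function entry.
  IRBuilder<> B(&*F.getEntryBlock().getFirstInsertionPt());
  Value *Addr = B.CreateThreadLocalAddress(&GV);
  GV.replaceUsesWithIf(Addr, [&](Use &U) {
    auto *I = dyn_cast<Instruction>(U.getUser());
    // Rewrite uses in F, but not the intrinsic call itself.
    return I && I->getFunction() == &F && I != Addr;
  });
}
```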
Random “Fun” Facts