
Tracking down a 25% Regression on LLVM RISC-V

Why This Matters

This article highlights a significant performance regression in LLVM's RISC-V backend caused by a recent commit, which impacted floating-point optimizations and resulted in slower execution compared to GCC. The author successfully developed a fix that restores critical narrowing optimizations, closing the performance gap. This work underscores the importance of continuous testing and optimization in compiler development to ensure efficient code generation for emerging architectures.


Similar to the previous post, this post covers my analysis of a benchmark on RISC-V targets. Unlike the previous post, I was able to land a patch to eliminate the performance gap to GCC (for this benchmark)!

TL;DR: A recent LLVM commit improved isKnownExactCastIntToFP so that fpext(sitofp x to float) to double is folded into a direct uitofp x to double cast. This inadvertently broke a downstream narrowing optimization in visitFPTrunc that relied on the fpext to narrow a double to a float, causing a ~24% performance regression on RISC-V targets: fdiv.d (33-cycle latency) was emitted instead of fdiv.s (19-cycle latency). My fix extends getMinimumFPType with range analysis so it recognizes that fptrunc(uitofp x to double) to float can be reduced to uitofp x to float, restoring the narrowing optimization.

Analysis

I was looking at Igalia’s site comparing the performance of LLVM to GCC on RISC-V targets, and I noticed this particular benchmark.

As shown in the image below, LLVM requires about 8% more cycles than GCC for this benchmark on the SiFive P550 CPU.

I have included snippets of the relevant basic block assembly. Practically all the cycles were spent on the assembly below.

LLVM

GCC

From the two assembly snippets, it wasn’t immediately obvious to me why GCC was doing better. They seemed almost identical, and if anything, LLVM had done a better job optimizing the branch logic. The big difference I did notice was that LLVM was doing an fdiv.d, a division on double-precision floating point (f64).
