Tuning LLVM's SLP Vectorizer Cost Model

Similar to my last post, this writeup covers how I solved a performance regression in LLVM by analyzing a benchmark on a RISC-V target.

TLDR: A recent LLVM patch introduced ordered vector reductions to replace a chain of scalar fadds, but it triggered a performance regression on a benchmark by failing to account for the cost of building the initial vector on each iteration. This in turn caused unprofitable code to be deemed “profitable.”
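To make that failure mode concrete, here is a toy cost comparison. It is only a sketch: the names ToyCosts, scalarChainCost, vectorCost, and isProfitable, as well as every cost value, are invented for illustration. None of this is LLVM's actual SLP vectorizer API or the real cost of any instruction on any target.

```cpp
#include <cassert>

// Hypothetical per-operation costs. Real values come from the target's
// cost model and will differ from these.
struct ToyCosts {
    int scalarFAdd;     // cost of one scalar fadd
    int orderedReduce;  // cost of one ordered vector reduction
    int buildVector;    // cost of assembling the input vector from scalars
};

// Cost of the original code: one scalar fadd per element in the chain.
int scalarChainCost(const ToyCosts &c, int numElts) {
    return c.scalarFAdd * numElts;
}

// Cost of the vectorized form. With includeBuild == false the cost of
// building the input vector is ignored, which mimics the omission
// described in the TLDR.
int vectorCost(const ToyCosts &c, bool includeBuild) {
    return c.orderedReduce + (includeBuild ? c.buildVector : 0);
}

// Vectorization counts as profitable only if it is strictly cheaper
// than the scalar chain.
bool isProfitable(const ToyCosts &c, int numElts, bool includeBuild) {
    return vectorCost(c, includeBuild) < scalarChainCost(c, numElts);
}
```

With the invented costs ToyCosts{2, 6, 8} and four elements, the scalar chain costs 8, the reduction alone costs 6, and the reduction plus the vector build costs 14. The decision flips from "profitable" to "not profitable" as soon as the build cost is counted.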

The Regression

Looking at Igalia’s LNT instance for the BPI-F3, I noticed this particular benchmark with a delta of 89%. Specifically, there was a ~26% increase in issued instructions and a ~48% increase in cycles.

I have attached two more pictures right below: the first shows the assembly of a basic block from the older build, and the second shows the corresponding assembly from the newer build.

Info: Bn here refers to billions of cycles. This basic block takes about twice as many cycles to execute.

We can see that the newer build of LLVM is performing a sequence of fsd instructions (floating-point store double). They store the floating-point values from those registers onto the stack. Specifically, they store to the address s1 + 0x80.

From a preceding basic block that I have not included here, I know that register a5 holds s1 + 0x80, set by this instruction:

addi a5, s1, 0x80
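As a rough illustration of why stores like these can matter, the sketch below contrasts a chain of scalar additions with a version that first writes the scalars to a memory buffer and then reduces them in order. This is my reading of the TLDR rather than something the assembly above proves: the four-element width, the function names, and the assumption that the fsd stores feed the reduction's input vector are all hypothetical. The loop is only a stand-in for a single ordered vector reduction.

```cpp
#include <array>
#include <cstddef>

// The original shape: a chain of scalar fadds, strictly left to right.
double scalarChain(double acc, double a, double b, double c, double d) {
    acc = acc + a;
    acc = acc + b;
    acc = acc + c;
    acc = acc + d;
    return acc;
}

// The vectorized shape, modeled in scalar code: the inputs are first
// stored to a buffer, then accumulated in the same order as the chain.
double orderedReductionViaBuffer(double acc, double a, double b,
                                 double c, double d) {
    // Stand-in for the stack memory at s1 + 0x80 (assumed, see lead-in).
    std::array<double, 4> buf{};
    buf[0] = a;  // each store plays the role of one fsd
    buf[1] = b;
    buf[2] = c;
    buf[3] = d;

    // Ordered reduction: the additions happen in element order, so the
    // result matches the scalar chain exactly.
    for (std::size_t i = 0; i < buf.size(); ++i)
        acc = acc + buf[i];
    return acc;
}
```

Both functions perform the same additions in the same order, so they return the same value. The buffered version simply pays for four extra stores (and the later load) before the reduction can start, which is the kind of overhead a cost model has to count.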
