
Performance Debugging with LLVM-mca: Simulating the CPU


Some time ago I had a performance problem that wasn't easy to explain just by looking at the code: the version I expected to be faster was actually slower. Since the problem is simple yet illustrative, I am using it as a showcase of how to debug performance issues with llvm-mca.

According to its documentation, llvm-mca is a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance of machine code on a specific CPU. In other words, you feed it some instructions and it simulates how the CPU executes them.
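To make that concrete, a typical invocation (a sketch; the file name conv.c and the -mcpu value are assumptions, substitute the CPU you are targeting) compiles the kernel to assembly and feeds it to the tool:

```shell
# Compile the kernel to AArch64 assembly (conv.c is an assumed file name),
# then ask llvm-mca to simulate it against a specific CPU's scheduling model.
# --timeline prints a cycle-by-cycle view of each instruction.
clang -O3 --target=aarch64-linux-gnu -S conv.c -o conv.s
llvm-mca --mtriple=aarch64-linux-gnu --mcpu=cortex-a72 --timeline conv.s
```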

The problem we are debugging is a very simple convolution kernel. The plain-old C version of this kernel looks like this:

```c
for (size_t i = 0; i < (n - kernel_size); i++) {
    out[i] = 0.0;
    for (size_t k = 0; k < kernel_size; k++) {
        out[i] += input[i + k] * kernel[k];
    }
}
```

We want to vectorize this loop for ARM NEON using outer loop vectorization. This means the outer loop runs four instances of the inner loop in parallel, something like this:

```c
for (size_t i = 0; i < (n - kernel_size); i += 4) {
    out[i] = 0.0;
    out[i + 1] = 0.0;
    out[i + 2] = 0.0;
    out[i + 3] = 0.0;
    for (size_t k = 0; k < kernel_size; k++) {
        out[i] += input[i + k] * kernel[k];
        out[i + 1] += input[i + 1 + k] * kernel[k];
        out[i + 2] += input[i + 2 + k] * kernel[k];
        out[i + 3] += input[i + 3 + k] * kernel[k];
    }
}
```

After this, each group of four repeated statements (out[i+x] …) is collapsed into one vector intrinsic. The vectorized code looks like this:

```c
for (size_t i = 0; i < (in_size - kernel_size); i += 4) {
    // The original four statements out[i+x] = 0.0; are now one.
    float32x4_t out_v = vdupq_n_f32(0.0f);
    for (size_t k = 0; k < kernel_size; k++) {
        // Load 4 inputs starting at address input[i+k];
        // originally these were four accesses input[i+k+x].
        float32x4_t in_v = vld1q_f32(in + i + k);
        // Load kernel[k] into all lanes of the vector register.
        float32x4_t kernel_v = vld1q_dup_f32(kernel + k);
        // Four multiplications and additions in parallel.
        out_v = vaddq_f32(out_v, vmulq_f32(in_v, kernel_v));
    }
    vst1q_f32(out + i, out_v);
}
```

For simplicity, let's assume kernel_size is five. In that case we don't have to reload kernel_v over and over inside the loop; we can load the whole kernel once before the loop. Instead of the combination of vaddq_f32 and vmulq_f32, we will use vmlaq_laneq_f32 – fused multiply-add. And since kernel_size is 5, we can completely unroll the inner loop. The resulting code looks like this:

```c
float32x4_t const kernel_0 = vld1q_f32(kernel);
// Loads kernel[4..7]; only lane 0 (kernel[4]) is used below.
float32x4_t const kernel_1 = vld1q_f32(kernel + 4);
for (size_t i = 0; i < (in_size - 5); i += 4) {
    float const *in_i = in + i;
    float32x4_t in_0 = vld1q_f32(in_i);
    float32x4_t in_1 = vld1q_f32(in_i + 1);
    float32x4_t in_2 = vld1q_f32(in_i + 2);
    float32x4_t in_3 = vld1q_f32(in_i + 3);
    float32x4_t in_4 = vld1q_f32(in_i + 4);
    float32x4_t out_v = vmulq_laneq_f32(in_0, kernel_0, 0);
    out_v = vmlaq_laneq_f32(out_v, in_1, kernel_0, 1);
    out_v = vmlaq_laneq_f32(out_v, in_2, kernel_0, 2);
    out_v = vmlaq_laneq_f32(out_v, in_3, kernel_0, 3);
    out_v = vmlaq_laneq_f32(out_v, in_4, kernel_1, 0);
    vst1q_f32(out + i, out_v);
}
```
