Coding Neon Kernels for the Cortex-A53
Published on: 2025-08-22 03:34:57
A few weeks ago at FOSDEM, I presented qsdr, my work-in-progress high-performance SDR runtime. I showed a hand-written NEON assembly implementation of a kernel that computes \(y[n] = ax[n] + b\), which I used as the basic math block for benchmarks on a Kria KV260 board (which has a quad-core ARM Cortex-A53 at 1.33 GHz). In that talk I glossed over the details of how I implemented this NEON kernel. There are enough tricks and considerations that I could fill a whole talk just explaining how to write it. That is the topic of this post.
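For reference, here is a minimal scalar sketch in C of what the kernel computes. It is not the NEON implementation discussed in this post. The function name `affine_kernel_ref` and the choice of single-precision `float` samples are assumptions made for illustration only.

```c
#include <stddef.h>

/* Scalar reference for y[n] = a*x[n] + b.
 * Assumption: samples are single-precision floats (illustrative choice).
 * Hypothetical name; this is not part of qsdr. */
void affine_kernel_ref(float a, float b, const float *x, float *y, size_t len)
{
    for (size_t n = 0; n < len; n++) {
        y[n] = a * x[n] + b;
    }
}
```

A scalar version like this is useful mainly as a correctness check: the output of a hand-written vector kernel can be compared against it sample by sample.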
Note: this post assumes familiarity with the aarch64 assembly syntax, particularly with the way that NEON registers are denoted depending on the context. For example, you should understand that v0.4s and q0 refer to the same 128-bit NEON register, and that v0.2s and d0 denote the 64 LSBs of this same NEON register. It might be worth reviewing the syntax if this doesn’t immediately make sense to you.
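The register aliasing described in the note can be modelled with a small C sketch. The type `vreg_model` and the helpers `view_d` and `view_s` are hypothetical names invented for this illustration. They model one 128-bit register as four 32-bit lanes, with lane 0 the least significant, and show which lanes each narrower view covers.

```c
#include <stdint.h>

/* Toy model of one 128-bit NEON register (the v0.4s / q0 view).
 * lane[0] holds the least significant 32 bits. Hypothetical type. */
typedef struct {
    uint32_t lane[4];
} vreg_model;

/* The d0 / v0.2s view: the 64 least significant bits, i.e. lanes 0 and 1. */
uint64_t view_d(const vreg_model *v)
{
    return (uint64_t)v->lane[0] | ((uint64_t)v->lane[1] << 32);
}

/* The s0 view: the 32 least significant bits, i.e. lane 0 only. */
uint32_t view_s(const vreg_model *v)
{
    return v->lane[0];
}
```

The point of the model is that these are different views of the same storage, not separate registers: changing lane 0 changes what every view reads.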
Cortex-A53 characteristics
Something pec