Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks

This post is a high-level explainer for my Master’s thesis, which involves designing hardware architectures for ultrafast inference and online learning using the Kolmogorov-Arnold Network (KAN) architecture. I’ll assume familiarity with standard machine learning concepts, as well as some understanding of hardware and digital circuits; read my previous post here for the latter.

Please read the two papers below for more information, particularly for details on benchmarks and notable results.

[FPGA 2026 Best Paper]

Duc Hoang*, Aarush Gupta*, and Philip C. Harris. “KANELÉ: Kolmogorov–Arnold Networks for Efficient LUT-based Evaluation.” Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 2026. https://dx.doi.org/10.1145/3748173.3779202

[ICML 2026]

Duc Hoang*, Aarush Gupta*, and Philip Harris. “Ultrafast on-FPGA Online Learning via Spline Locality in Kolmogorov-Arnold Networks.” arXiv preprint arXiv:2602.02056, 2026. https://arxiv.org/abs/2602.02056

*equal contribution

The case for machine learning on FPGAs

Most modern machine learning workloads, whether training or inference, run on graphics processing units (GPUs). Through hardware architectures that support a highly parallel execution model, GPUs can perform simple operations on large amounts of data with extremely high throughput. This makes them ideal for machine learning problems involving large architectures or batch-style training and inference.

However, complex GPU architectures cannot meet the demands of applications that require ultra-low (e.g. sub-microsecond) latency and high hardware efficiency. Processors such as CPUs and GPUs incur significant overhead from scheduling and optimizing instructions, dynamically accessing memory, and so on. Extremely specialized workloads with latency budgets on the order of nanoseconds, along with strict efficiency requirements, are instead better served by custom hardware accelerators.

Field-programmable gate arrays, or FPGAs, are reconfigurable digital logic devices that are extremely well-suited for this style of custom hardware acceleration. FPGAs contain lookup tables (LUTs), which represent digital functions by enumerating the output value for every combination of binary inputs; flip-flops (FFs), which store state; and other memory and computation primitives. These components and the connections between them are reconfigured to design a custom digital circuit, allowing for low-level hardware architecture and algorithm co-design that enables ultrafast machine learning. Importantly, neural networks are implemented directly as digital logic, rather than as instructions that are sequentially executed on a processor.
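
To make the LUT idea concrete, here is a minimal software sketch of how a lookup table represents a digital function by enumerating its outputs. The helper names (`build_lut`, `lut_lookup`) and the 3-input majority function are hypothetical choices for illustration, not anything from the thesis or the papers, and a Python list is only a stand-in for what is configured memory inside a real FPGA.

```python
from itertools import product


def build_lut(func, num_inputs):
    """Enumerate func over every combination of binary inputs (a truth table).

    Entries are ordered so that the first input is the most significant bit
    of the table index.
    """
    return [func(*bits) for bits in product((0, 1), repeat=num_inputs)]


def lut_lookup(table, *bits):
    """Evaluate the function by indexing the table; no logic is recomputed."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return table[index]


# Hypothetical example function: 3-input majority vote.
def majority(a, b, c):
    return int(a + b + c >= 2)


# 2**3 = 8 entries, one per input combination.
MAJORITY_LUT = build_lut(majority, 3)

if __name__ == "__main__":
    print(MAJORITY_LUT)
    print(lut_lookup(MAJORITY_LUT, 1, 0, 1))
```

Once the table is filled in, evaluating the function is a single indexed read regardless of how complicated `func` was. The trade-off is that the table size doubles with every additional input, so it grows as 2^k for k inputs.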
