Intel and AMD unveil new x86 standard to make CPUs better at running AI models

First look: The AI hardware discussion has centered on GPUs for so long that CPUs can feel like an afterthought. Intel and AMD are now trying to tilt that balance back, at least a bit, with a new CPU-focused specification. The effort signals that both companies still see room for CPUs to play a bigger role in certain kinds of machine learning workloads.

The specification, called Advanced Compute Extensions, or ACE, lays out a way to handle AI operations more efficiently on x86 processors. It is not aimed at replacing GPUs in large-scale training environments. Instead, the focus is on smaller models, latency-sensitive tasks, and systems where a GPU is either unavailable or not worth the overhead.

That last point matters more than it might seem. Moving data back and forth between a CPU and GPU is not free. For some workloads, especially those that need quick responses or run on limited hardware, that back-and-forth can become a bottleneck. Keeping the work on the CPU avoids that entirely.
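
The trade-off can be put in rough terms with a back-of-the-envelope model. The sketch below is illustrative only: the function name and all of its parameters are hypothetical, and it assumes transfer cost is a fixed latency plus bytes divided by link bandwidth, which ignores overlap between copying and computing.

```python
def offload_is_faster(bytes_moved, link_bytes_per_s, fixed_latency_s,
                      cpu_compute_s, gpu_compute_s):
    """Return True if sending the work to a GPU beats staying on the CPU.

    Hypothetical cost model: the GPU path pays for its own compute time
    plus the time to move the data across the CPU-GPU link.
    """
    # Time spent moving data: a fixed setup cost plus a bandwidth-bound part.
    transfer_s = fixed_latency_s + bytes_moved / link_bytes_per_s
    # Offloading only pays off if compute plus transfer is still quicker.
    return gpu_compute_s + transfer_s < cpu_compute_s
```

For small or latency-sensitive jobs the transfer term can dominate, which is the case the specification appears to be targeting.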

At a technical level, ACE is built around matrix multiplication, which sits at the heart of most AI operations. CPUs have always been able to handle this kind of math, but not particularly efficiently. The industry has leaned on AVX instructions to bridge that gap, even though those instructions were never designed with matrix-heavy workloads in mind.

ACE takes a different approach. It keeps the existing AVX10 register structure but adds dedicated hardware for matrix operations. That decision avoids forcing developers into entirely new data formats or programming models. The extensions still use 512-bit inputs, which helps them fit into existing software and hardware workflows with minimal changes.

The performance gains show up most clearly at the instruction level. For a given set of input vectors, ACE can carry out far more operations than AVX10 – up to sixteen times as many. That does not mean applications will suddenly run sixteen times faster, since real-world performance depends on a range of factors. But it does point to a more efficient use of instructions, which can translate into lower power use and less strain on memory bandwidth.
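
The article does not spell out where the sixteen-fold figure comes from. One plausible reading, sketched below, assumes 32-bit elements (16 per 512-bit input) and an outer-product style of matrix instruction; neither assumption is confirmed by the source, and the function names are made up for illustration. Under those assumptions, the same two input vectors feed 16 multiply-adds in an element-wise vector step but 256 in an outer-product step.

```python
LANES = 16  # assumption: 512-bit input / 32-bit elements


def elementwise_fma(acc, a, b):
    """Vector-style step: acc[i] += a[i] * b[i], i.e. 16 multiply-adds."""
    return [acc[i] + a[i] * b[i] for i in range(LANES)]


def outer_product_accumulate(tile, a, b):
    """Matrix-style step: tile[i][j] += a[i] * b[j], i.e. 256 multiply-adds."""
    return [[tile[i][j] + a[i] * b[j] for j in range(LANES)]
            for i in range(LANES)]


def multiply_adds_per_step():
    """Return (vector_ops, matrix_ops) for one step on the same two inputs."""
    return LANES, LANES * LANES
```

The ratio between the two counts is 16, which matches the upper bound quoted above, though real instructions and data types may divide the work differently.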

Power efficiency is one of the more practical benefits here. GPUs are powerful, but they are also energy-intensive, and the data movement they require adds overhead of its own. By comparison, a CPU handling these operations directly can be more economical, particularly for edge use cases or single-user applications.

Another piece of the ACE design is consistency. The specification is meant to be implementation-agnostic, which should make life easier for developers working with frameworks like PyTorch and TensorFlow. Rather than juggling different code paths for varying AVX support, developers can aim at a single, consistent target.
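
To give a sense of the juggling involved today, here is a minimal sketch of runtime kernel selection. The flag strings follow the names Linux reports in /proc/cpuinfo, but the kernel names and the function itself are hypothetical, and real frameworks handle many more cases than this.

```python
def pick_matmul_kernel(cpu_flags):
    """Return the name of the most capable kernel the CPU supports.

    cpu_flags is a set of feature-flag strings; the kernel names returned
    here are placeholders, not functions from any real library.
    """
    # Ordered from most to least capable; the first match wins.
    preference = [
        ("avx512_vnni", "matmul_avx512_vnni"),
        ("avx_vnni", "matmul_avx_vnni"),
        ("avx2", "matmul_avx2"),
    ]
    for flag, kernel in preference:
        if flag in cpu_flags:
            return kernel
    # No vector extension detected: fall back to plain scalar code.
    return "matmul_scalar"
```

Every entry in that list is a separate code path to write, test, and tune, which is the burden a single consistent target is meant to reduce.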

The extensions also support a wide range of data types used in machine learning, including INT8, INT32, FP8, FP16, FP32, and BF16. In addition, ACE includes native support for Open Compute Project MX block-scaled formats, which are not part of AVX10. That flexibility reflects how varied model requirements have become, particularly on the inference side.
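
For readers unfamiliar with block scaling, the idea behind the MX formats is that a block of 32 elements shares a single power-of-two scale, so each element can be stored in very few bits. The sketch below illustrates the principle only. It is simplified on purpose: it uses signed 8-bit integers with plain rounding rather than the bit-exact MX encodings, and the function names are made up for this example.

```python
import math

BLOCK_SIZE = 32  # MX formats group elements into blocks of 32


def quantize_block(values):
    """Return (shared_exponent, elements) for one block of floats.

    Simplified illustration: elements are integers in [-127, 127] and the
    shared scale is 2 ** shared_exponent.
    """
    assert len(values) == BLOCK_SIZE
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * BLOCK_SIZE
    # Smallest power-of-two scale that keeps the largest value within range.
    exponent = math.ceil(math.log2(max_abs / 127))
    scale = 2.0 ** exponent
    # Round each value to the nearest step and clamp as a safety net.
    elements = [max(-127, min(127, round(v / scale))) for v in values]
    return exponent, elements


def dequantize_block(exponent, elements):
    """Rebuild approximate floats from the shared exponent and elements."""
    scale = 2.0 ** exponent
    return [e * scale for e in elements]
```

Each reconstructed value lands within half a scale step of the original, and the storage cost is one shared exponent plus 32 small integers instead of 32 full-width floats.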

There is also a more subtle advantage when it comes to heterogeneous computing. NPUs are becoming more common, but they are far from standardized. Moving a workload onto an NPU can introduce its own complications depending on the hardware. ACE offers a way to keep certain tasks on the CPU when speed and simplicity matter more than absolute efficiency.
