
China bypasses US GPU bans with 1.54-exaflops 'LineShine' supercomputer — CPU-only monster packs 2.4 million Huawei-designed Armv9 cores

Why This Matters

China's deployment of the 1.54-exaflops 'LineShine' supercomputer, built with custom Huawei-designed Armv9 cores, highlights a strategic shift towards CPU-only supercomputing to bypass US GPU export bans. This development underscores China's focus on developing indigenous high-performance computing solutions, potentially reducing reliance on US technology and reshaping global HPC competitiveness.

Key Takeaways

The vast majority of leading supercomputers and AI clusters today pair CPUs, which handle general-purpose tasks and orchestration, with AI GPUs that run massively parallel workloads to reach ExaFLOPS-class performance. China, however, is following a different path: in recent years the country has deployed a number of CPU-only supercomputers for AI and HPC workloads, largely because US export bans on GPUs prevent it from sourcing enough of them. For example, China's National Supercomputing Center recently deployed a 1.54 ExaFLOPS-class machine that uses 20,480 Armv9-based CPUs.

The LineShine LX2 processor

The LineShine supercomputer is built around custom Armv9-based LX2 processors designed specifically for large-scale AI and HPC workloads. China's National Supercomputing Center (NSCC) in Shenzhen does not disclose who developed the LX2, though Jon Peddie of Jon Peddie Research outright calls it the 'Huawei LX2' processor. In practice, the chip could be a custom Huawei HPC CPU, a joint NSCC/Huawei design, or the work of an entirely separate Chinese government-backed HPC processor developer.

(Image credit: China's National Supercomputing Center)

Each LX2 processor uses two compute chiplets and has a total of 304 CPU cores organized into eight CPU clusters of 38 cores each. Every core includes Arm SVE (Scalable Vector Extension) and SME (Scalable Matrix Extension) units that accelerate the vector and matrix operations used in AI training and scientific computing, with support for the FP64, FP32, BF16, FP16, and INT8 data formats. Each core is equipped with a 32 KB L1 instruction cache and a 32 KB L1 data cache, while every cluster shares a 28.5 MB L2 cache.
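
The LX2's compiler toolchain isn't public, but any core with SVE units can be targeted through the standard Arm C Language Extensions (ACLE). As a minimal sketch of the vector-length-agnostic style such cores run (the function name and build flags here are illustrative, not details from the article), a predicated FP64 multiply-accumulate loop looks like this:

```c
#include <arm_sve.h>
#include <stdint.h>

/* Vector-length-agnostic daxpy (y[i] += a * x[i]) using standard Arm SVE
 * intrinsics; nothing here is LX2-specific, and the same source builds for
 * any SVE-capable Armv9 core, e.g. with gcc -O2 -march=armv9-a. */
void daxpy_sve(int64_t n, double a, const double *x, double *y)
{
    for (int64_t i = 0; i < n; i += svcntd()) {   /* svcntd() = FP64 lanes per vector */
        svbool_t pg = svwhilelt_b64(i, n);        /* predicate masks off the loop tail */
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_x(pg, vy, vx, a);        /* fused vy += vx * a */
        svst1_f64(pg, &y[i], vy);
    }
}
```

Because the loop queries the hardware vector width at run time, the same binary scales across implementations regardless of how wide the LX2's SVE units actually are.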

The processor uses a highly unusual memory subsystem that combines 32 GB of on-package HBM, delivering up to 4 TB/s of bandwidth, with up to 256 GB of off-package DDR5 memory. A similar memory subsystem was used by Fujitsu's Arm-based A64FX processor that powers the Fugaku supercomputer, though the LX2 is probably the industry's first Armv9-based CPU for AI and HPC to use such an arrangement.
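
Scaled across the full machine, those per-socket figures add up quickly. A back-of-the-envelope sketch, assuming every one of the 20,480 CPUs quoted above carries the maximum memory configuration (the article gives no per-node breakdown, so treat these as upper bounds):

```c
#include <stdio.h>

/* Aggregate memory implied by the per-socket figures above; assumes all
 * 20,480 sockets carry the full 32 GB HBM + 256 GB DDR5 configuration. */
int main(void)
{
    const double sockets = 20480.0;
    printf("HBM  total: %.2f PB\n", sockets * 32.0 / 1e6);   /* ~0.66 PB */
    printf("DDR5 total: %.2f PB\n", sockets * 256.0 / 1e6);  /* ~5.24 PB */
    return 0;
}
```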

Each chiplet contains four HBM domains and four DDR domains, for a total of 16 NUMA domains per processor. HBM access is highly sensitive to locality, whereas DDR access is more uniform within a die and is shared between clusters. This behavior forced developers to design topology-aware memory placement and scheduling techniques (particularly handy for AI training), with a dedicated SDMA engine handling the resulting data movement between DDR and HBM.
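
The article doesn't describe the software interface to these domains, but on Linux systems that expose HBM and DDR as separate NUMA nodes (as Fujitsu's A64FX and Intel's Xeon Max deployments do), topology-aware placement can be sketched with libnuma. Everything below, including the node numbering and helper names, is a hypothetical illustration rather than documented LX2 behavior:

```c
#include <numa.h>      /* link with -lnuma */
#include <stdio.h>

/* Hypothetical layout: the 8 HBM domains appear as NUMA nodes 0-7 and the
 * 8 DDR domains as nodes 8-15; the real LX2 numbering is not public. */
#define HBM_NODE(cluster) ((cluster) % 8)
#define DDR_NODE(cluster) (8 + (cluster) % 8)

int main(void)
{
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

    size_t sz = 64 << 20;  /* 64 MB tensor */
    /* The bandwidth-critical working set goes to the HBM domain local to the
     * cluster that will compute on it; a cold copy stays in nearby DDR. */
    double *hot  = numa_alloc_onnode(sz, HBM_NODE(3));
    double *cold = numa_alloc_onnode(sz, DDR_NODE(3));

    /* ... compute on hot, stage cold -> hot as the schedule demands ... */

    numa_free(hot, sz);
    numa_free(cold, sz);
    return 0;
}
```

A production runtime would layer the SDMA-driven staging described above on top of this, asynchronously migrating tensors from DDR into the HBM domain local to the cluster about to use them.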

When it comes to performance, a single LX2 processor delivers 60.3 TFLOPS of FP64 performance, 240 TFLOPS of BF16/FP16 throughput, and 960 TOPS of INT8 performance. Unlike conventional server CPUs, the architecture appears heavily optimized for dense AI and matrix workloads despite remaining a CPU-centric design. The paper describing the system notes that sustaining high utilization of the SME matrix engines required extensive co-design of kernels, runtime scheduling, cache-residency management, and tensor placement across the HBM and DDR hierarchy.
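
A couple of useful ratios fall straight out of those numbers. The sketch below simply divides the figures quoted above; "machine balance" is the standard roofline ratio of peak compute to peak memory bandwidth, assuming both peaks are as stated:

```c
#include <stdio.h>

/* Per-core and roofline figures derived purely from the quoted specs. */
int main(void)
{
    const double fp64_tflops = 60.3;   /* per LX2 socket */
    const double bf16_tflops = 240.0;
    const double hbm_tbs     = 4.0;    /* on-package HBM bandwidth, TB/s */
    const int    cores       = 304;

    printf("FP64 per core:   %.0f GFLOPS\n", fp64_tflops * 1e3 / cores);  /* ~198 */
    printf("BF16 per core:   %.0f GFLOPS\n", bf16_tflops * 1e3 / cores);  /* ~789 */
    printf("Machine balance: %.1f FP64 FLOPs per HBM byte\n",
           fp64_tflops / hbm_tbs);                                        /* ~15.1 */
    return 0;
}
```

A balance of roughly 15 FP64 operations per HBM byte is exactly why the co-design effort centers on cache residency and tensor placement: any kernel with lower arithmetic intensity is bandwidth-bound no matter how many SME engines are available.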

The LineShine supercomputer

... continue reading