It has taken nearly two decades and an immense amount of work by millions of people for high performance computing to go mainstream with GenAI. And now, we live in a world where AI servers crammed with accelerators account for half of the money spent on systems worldwide.
There is no law anywhere that says that accelerator has to be a GPU, although the GPU has been the accelerator of choice by far. GPUs are, like CPUs, general purpose processors, and they are explicitly designed to support various kinds of workloads where high throughput vector processing is highly prized, as is tensor processing for GenAI and for some traditional HPC simulations that have been altered to use it.
There is still room for something other than a GPU to accelerate HPC and AI applications, and Pezy Computing KK, whose very name is short for peta, exa, zetta, and yotta, like it is part of some kind of football chant for HPC and AI fans, has spent a decade and a half creating math accelerators that can do the same kinds of work as GPUs, but with a different architecture that aims to drive energy efficiency to its limits. This is exactly what you would expect for a company that was funded by Japan’s New Energy and Industrial Technology Development Organization (NEDO), which is also funding the development of the “Monaka” Arm server CPU designed by Fujitsu that will be used in the “FugakuNext” supercomputer.
The wonder is why FugakuNext doesn’t at least have some portions of its compute coming from Pezy SC accelerators . . . . Perhaps it will when FugakuNext is installed in 2029 or so.
Naoya Hatta, a hardware engineer at Pezy Computing, presented the latest in a line of number-crunching accelerators that have been delivered since the Pezy-1 chip was launched in April 2012 after two years of development. Here is the table Hatta presented at Hot Chips 2025:
And here is an expanded table with more features and analysis by us:
That Pezy-1 chip, which is not shown in Hatta’s table above, had 512 RISC cores for calculations and image processing and two baby Arm cores, all etched in a 40 nanometer process from Taiwan Semiconductor Manufacturing Co. It ran at 533 MHz and was rated at 266 gigaflops at double precision floating point and 533 gigaflops at single precision.
In 2013, the SC family of accelerators (SC is short for Super Computer) debuted, and the chips were used in a number of supercomputers in 2014 that made their way onto the Top500 and Green500 supercomputer rankings. With the first SC variant, the RISC cores were given simultaneous multithreading with eight threads per core, which meant its 1,024 cores running at 733 MHz could present a total of 8,192 threads to applications. This chip, etched with TSMC’s 28 nanometer process, could drive 750 gigaflops at FP64 and 1.5 teraflops at FP32 precision. The RISC cores that do calculations, which are called processor elements, or PEs, each have small instruction and data caches that aggregate out to 2 MB of L2 instruction and 1 MB of L2 data cache across those cores, which works out to 2 KB for instructions and 1 KB for data per PE. Each PE also has a 16 KB scratchpad memory, which aggregates out to 16 MB across the chip.
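The thread and memory totals quoted for the first Pezy-SC are simple multiples of the PE count, and a few lines of Python make the arithmetic easy to check. This is only a sketch: the variable names are ours, and the per-PE cache sizes are derived from the aggregate figures on the assumption that the cache is divided evenly across all 1,024 PEs.

```python
# Figures quoted in the article for the first Pezy-SC chip.
PES = 1024                 # processor elements (RISC cores)
THREADS_PER_PE = 8         # simultaneous multithreading
SCRATCHPAD_KB_PER_PE = 16  # private scratchpad per PE
ICACHE_TOTAL_MB = 2        # aggregate instruction cache
DCACHE_TOTAL_MB = 1        # aggregate data cache

KB_PER_MB = 1024

# Totals presented to applications and across the chip.
total_threads = PES * THREADS_PER_PE
scratchpad_total_mb = PES * SCRATCHPAD_KB_PER_PE / KB_PER_MB

# Assumption: aggregate cache is split evenly across PEs.
icache_kb_per_pe = ICACHE_TOTAL_MB * KB_PER_MB / PES
dcache_kb_per_pe = DCACHE_TOTAL_MB * KB_PER_MB / PES

print(f"threads:            {total_threads}")
print(f"scratchpad total:   {scratchpad_total_mb:.0f} MB")
print(f"I-cache per PE:     {icache_kb_per_pe:.0f} KB")
print(f"D-cache per PE:     {dcache_kb_per_pe:.0f} KB")
```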
In the Pezy-SC designs, PEs are organized into blocks of four called “villages,” and four villages are aggregated into “cities” that have shared L2 data and instruction caches, and sixteen cities (or 256 PEs) are aggregated into “prefectures” that have 2 MB of shared L3 cache in the center of each prefecture. The Pezy-SC had four DDR4 memory channels and two PCI-Express 3.0 x8 ports, and had a peak power draw of 100 watts.
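The village, city, and prefecture hierarchy is easier to picture with the multiplication written out. The sketch below uses the grouping sizes from the text and the 1,024 PE count of the first Pezy-SC. The prefecture count and total L3 capacity are derived values, not figures Pezy stated here, and the `locate` helper uses a hypothetical linear PE numbering that we made up purely for illustration.

```python
# Grouping sizes as described in the article.
PES_PER_VILLAGE = 4
VILLAGES_PER_CITY = 4
CITIES_PER_PREFECTURE = 16
L3_MB_PER_PREFECTURE = 2
TOTAL_PES = 1024  # first Pezy-SC chip

pes_per_city = PES_PER_VILLAGE * VILLAGES_PER_CITY
pes_per_prefecture = pes_per_city * CITIES_PER_PREFECTURE

# Derived, assuming the PEs divide evenly into prefectures.
prefectures = TOTAL_PES // pes_per_prefecture
l3_total_mb = prefectures * L3_MB_PER_PREFECTURE


def locate(pe_id: int) -> tuple[int, int, int, int]:
    """Map a PE index to (prefecture, city, village, slot).

    Hypothetical numbering: PEs are counted linearly, filling each
    village, then each city, then each prefecture in turn.
    """
    if not 0 <= pe_id < TOTAL_PES:
        raise ValueError(f"pe_id must be in [0, {TOTAL_PES})")
    prefecture, rest = divmod(pe_id, pes_per_prefecture)
    city, rest = divmod(rest, pes_per_city)
    village, slot = divmod(rest, PES_PER_VILLAGE)
    return prefecture, city, village, slot


print(pes_per_city, pes_per_prefecture, prefectures, l3_total_mb)
print(locate(0), locate(1023))
```

Under these assumptions, a 1,024 PE chip works out to four prefectures, each with its own L3 slice.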
With the Pezy-SC2 design that came to market in 2017, the L3 cache was shared across the whole complex and weighed in at 40 MB, significantly helping performance. FP16 half precision math was also added to the RISC cores that comprise the PEs, twice as many PEs were added to the complex, and clock speeds rose by 36.4 percent to 1 GHz. The combined effects of this drove floating point throughput in FP64 and FP32 formats up by 5.5X.
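It is worth separating how much of that 5.5X comes from PE count and clock speed alone. The sketch below uses the Pezy-SC figures quoted earlier (1,024 PEs at 733 MHz, 750 gigaflops at FP64) and the SC2 changes described above. The leftover factor is our own back-of-the-envelope inference, presumably more floating point work per PE per clock, and not something stated in the presentation.

```python
# Pezy-SC figures quoted in the article.
SC_PES = 1024
SC_CLOCK_MHZ = 733
SC_FP64_GFLOPS = 750

# Pezy-SC2 changes quoted in the article.
SC2_PES = 2 * SC_PES       # "twice as many PEs"
SC2_CLOCK_MHZ = 1000       # 1 GHz
QUOTED_SPEEDUP = 5.5       # FP64 and FP32 throughput gain

pe_ratio = SC2_PES / SC_PES
clock_ratio = SC2_CLOCK_MHZ / SC_CLOCK_MHZ

# Gain explained by more PEs running at a higher clock.
structural_speedup = pe_ratio * clock_ratio

# Whatever is left over must come from each PE doing more per clock
# (our inference, not a stated figure).
per_pe_per_clock_gain = QUOTED_SPEEDUP / structural_speedup

# Implied SC2 peak if the quoted speedup is applied to the SC rating.
sc2_fp64_gflops = SC_FP64_GFLOPS * QUOTED_SPEEDUP

print(f"clock uplift:          {100 * (clock_ratio - 1):.1f} percent")
print(f"PEs x clock:           {structural_speedup:.2f}X")
print(f"implied per-PE gain:   {per_pe_per_clock_gain:.2f}X")
print(f"implied SC2 FP64 peak: {sc2_fp64_gflops:.0f} gigaflops")
```

More PEs and faster clocks account for roughly 2.7X, so about another 2X has to come from the PEs themselves.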