This is a guest post from Jiebin Sun, Zhiguo Zhou, Wangyang Guo and Tianyou Li, performance optimization engineers at Intel Shanghai.
Intel's latest processor generations are pushing the number of cores in a server to unprecedented levels - from 128 P-cores per socket in Granite Rapids to 288 E-cores per socket in Sierra Forest, with future roadmaps pushing per-socket core counts even higher. These numbers multiply on multi-socket systems: a dual-socket server can exceed 400 cores. The paradigm of "more, not faster cores" is driven by physical limitations: since the end of Dennard scaling in the mid-2000s, power density concerns have made it increasingly difficult to push single-thread performance further.
For analytical databases like ClickHouse, ultra-high core counts are a huge opportunity and a complex challenge at the same time. While more cores theoretically mean more power to process tasks in parallel, most databases struggle to fully utilize the available hardware. Bottlenecks for parallel processing - lock contention, cache coherence, non-uniform memory access (NUMA), memory bandwidth, and coordination overhead - become significantly worse as the core count increases.
Optimizing for ultra-high core counts #
Over the past three years, I have dedicated part of my professional life to understanding and optimizing ClickHouse's scalability on Intel Xeon ultra-high core count processors. My work focused on using various profiling and analysis tools - including perf, emon, and Intel VTune - to systematically analyze all 43 ClickBench queries on ultra-high core count servers, identify bottlenecks, and optimize ClickHouse accordingly.
The results have been exciting: individual optimizations routinely deliver speedups of several times for specific queries, in some cases up to 10x. The geometric mean over all 43 ClickBench queries consistently improved by 2% to 10% per optimization. These results demonstrate that ClickHouse can be made to scale very well on ultra-high core count systems.
The core scaling challenge #
Beyond single-thread performance, several key challenges must be addressed to achieve good performance on ultra-high core count systems:

- **Cache coherence overhead:** Cache lines bouncing between cores cost CPU cycles (see the sketch after this list).
- **Lock contention:** Amdahl's Law becomes brutal: serialized code sections as small as 1% of the overall runtime cap the achievable speedup (see the worked example below).
- **Memory bandwidth:** Utilizing the available memory bandwidth effectively is a persistent challenge for data-intensive systems; proper memory reuse, management, and caching become critical.
- **Thread coordination:** The cost of synchronizing threads grows super-linearly with the number of threads.
- **NUMA effects:** On multi-socket systems, memory latency and bandwidth differ between local and remote memory.
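To make the lock contention point concrete, here is the standard Amdahl's Law calculation; the 1% serial fraction is the illustrative figure from the list above, not a measurement:

$$
S(N) = \frac{1}{s + \frac{1 - s}{N}}, \qquad S(400)\Big|_{s = 0.01} = \frac{1}{0.01 + \frac{0.99}{400}} \approx 80
$$

In other words, if just 1% of a query's work is serialized, a 400-core server can deliver at most a ~80x speedup - 80% of its cores are effectively idle.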
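To illustrate the cache coherence point, below is a minimal, self-contained C++ sketch (illustrative only, not ClickHouse code; thread and iteration counts are made up) contrasting a single shared atomic counter, whose cache line ping-pongs between cores on every update, with per-thread counters padded to separate cache lines:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

constexpr size_t num_threads = 16;       // made-up thread count
constexpr size_t iterations = 1'000'000; // made-up iteration count

std::atomic<uint64_t> shared_counter{0}; // contended: its cache line bounces between cores

struct alignas(64) PaddedCounter         // padded so each counter owns a full cache line
{
    std::atomic<uint64_t> value{0};
};

int main()
{
    std::vector<PaddedCounter> local_counters(num_threads);

    std::vector<std::thread> workers;
    for (size_t t = 0; t < num_threads; ++t)
        workers.emplace_back([t, &local_counters]
        {
            for (size_t i = 0; i < iterations; ++i)
            {
                shared_counter.fetch_add(1, std::memory_order_relaxed);          // coherence traffic on every add
                local_counters[t].value.fetch_add(1, std::memory_order_relaxed); // stays core-local
            }
        });
    for (auto & w : workers)
        w.join();

    uint64_t total = 0;
    for (const auto & c : local_counters)
        total += c.value.load(std::memory_order_relaxed);

    std::printf("shared=%llu, sharded=%llu\n",
                static_cast<unsigned long long>(shared_counter.load()),
                static_cast<unsigned long long>(total));
}
```

On a high core count machine, the sharded updates typically run many times faster than the shared ones because only the final aggregation touches shared state - the same pattern behind per-thread data structures that are merged at the end of a query.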
This blog post summarizes our optimizations for ClickHouse on ultra-high core count servers. All of them have been merged into the main codeline, and they now help speed up queries in ClickHouse deployments around the globe.