With AMD having attained more than 40 percent revenue share and more than 27 percent shipment share in the X86 server CPU market in the first half of 2025, two things follow. First, AMD is selling some big, fat X86 CPUs compared to Intel. Second, Intel, despite all of its many woes, is still getting nearly 60 percent of revenues and north of 72 percent of shipments for X86 server CPUs here in 2025. No, that is not the share Intel is used to, but that's life sometimes.

And with the rollout of its "Diamond Rapids" Xeon 7 P-core processor and the "Clearwater Forest" Xeon 7 E-core processor in 2026, everything hinges on the Intel 18A manufacturing process (what might otherwise be called 1.8 nanometers) as well as its 2.5D EMIB interposer and Foveros 3D chip stacking and bonding technologies, both of which saw their initial use in the datacenter on the ill-fated and much-delayed "Ponte Vecchio" Xe Max Series GPU accelerator. To say that a lot is hanging on these two Xeon 7 processors is an understatement.

With the hyperscalers and cloud builders ramping up the use of their homegrown Arm server CPUs, every X86 server socket in the datacenter is in contention, and AMD is a fierce competitor that has been metronomic in the regularity of its Epyc server CPU launches and dominant because of the ability of Taiwan Semiconductor Manufacturing Co to leapfrog over Intel Foundry's processes and packaging. But with 18A and the Xeon 7 next year, there is a chance for Intel to hold back the tide a little and perhaps reach an equilibrium with X86 server CPUs. While the E-core variants – energy-efficient, throughput-oriented processors – are somewhat niche in their adoption, that is a good thing inasmuch as they will help Intel ramp the 18A process as well as the 2.5D and 3D packaging techniques that are also expected with the P-core variants of the Xeon 7.
Those packaging challenges were enough for Intel to never promise Diamond Rapids for 2025 and to push out Clearwater Forest to the first half of 2026, which it did in January, after letting Pat Gelsinger go but before it had hired a new chief executive officer. This delay may once again give AMD a chance to stay ahead of Intel. Back in April, AMD was the first maker of a high end chip – in this case, a future "Venice" Epyc 9006 processor based on the Zen 6 core – to tape out on TSMC's 2 nanometer N2 process. But Venice is not expected until next year, so there is no benefit for Intel to rush a product out the door at possibly low yields, which is more costly than just waiting a bit until yields are better. There are easier businesses to be in than semiconductor design and manufacturing. . . .

In any event, at the Hot Chips conference this week, Don Soltis, an Intel Fellow and the Xeon processor architect, walked through the Clearwater Forest E-core processor. Soltis even had an early sample of the Xeon 7 E-core CPU back from Intel Foundry, which he had tucked into his shirt pocket. (We did not get a good zoom in on the chip, since we are attending Hot Chips remotely this year.) Here is a mockup of the Clearwater Forest socket, which will have to tide us all over:

Clearwater Forest starts with the 18A process, of course. The 18A process uses gate-all-around 3D transistors, which Intel refers to as RibbonFET and which are a big improvement over the FinFET transistor design. Intel pioneered FinFET 3D tri-gate transistors back in 2011 with its 22 nanometer process, and all processes between then and 18A – 14 nanometer, 10 nanometer (including the Intel 7 refinement), all the way down to Intel 3 (3 nanometer) – used FinFET transistors. The "Sierra Forest" E-core Xeon 6 processor launched in June 2024 was made using Intel 3 as well as EMIB to link chiplets on a socket interposer, but it did not use Foveros 3D stacking.
The 18A process delivers 15 percent better performance at the same power and 30 percent better chip density than the Intel 3 process. The 18A process is married to a backside power delivery technique called PowerVia, which uses both sides of the silicon wafer – data signals on the front side and power to the transistors on the back side. (Prior CPUs from Intel and others delivered both power and signal on the front side.) The net result is that transistors are smaller and use less power than even their shrinkage would account for. The 3D construction of the Clearwater Forest CPU is also contributing to its technical efficiency (although its economic efficiency remains to be seen).

"Every single circuit we build needs to get power and ground," Soltis explained in his Hot Chips presentation. "A great place to put your power distribution is right where you need it, where it does not interfere with all of the routing of signals between elements. That's where you get some of the power efficiency that I wanted to highlight. One of those is increased cell density, or utilization of cells, which means we get more stuff packed into a smaller area, which is great from an area perspective, and cost and those sorts of things.

"However, there is a power efficiency benefit because your average trace length is shorter, and a shorter trace is fundamentally more power efficient. Similarly, when you have data paths or larger constructs, you have more routing resources because you do not have to route the power delivery using the same metal you use to route those signals, so those signals now are able to provide interconnect with lower capacitance and lower resistance for better power efficiency.

"The final one, which is also extremely important, is that there are IR drops – there's resistance in the power delivery, and you lose some power in that power delivery.
With backside metal, we have wire sizes that are much more appropriate for power delivery and less appropriate for general signal integrity, and we have lower losses in our power delivery. Think about it as the resistance being a lot lower than wandering its way down through that metal stack back and forth and coming right up from the transistors."

If you build up from the foundation, you have a base package substrate that is pin-for-pin and socket compatible with the LGA 7529 socket shared by both the Granite Rapids and Sierra Forest Xeon 6 processors. As the name suggests, it has 7,529 pins for power and signaling. Atop this substrate, Intel lays down a pair of its existing I/O chiplets, which were used in the Xeon 6 CPUs and which are etched using its refined 10 nanometer Intel 7 process. The I/O tiles are linked to EMIB bridges, and then three base chiplets, etched with Intel 3, are set down. These are the same I/O tiles and EMIB bridges that were used in Sierra Forest. The base tiles are different because they have cores stacked on top of them, so they have to have the wiring for that. The base tiles include the L3 cache, the fabric to link the cores, and the memory controllers for the cores; the remaining I/O functions live on the I/O tiles. Four EMIB bridges hook together these five chiplets.

Each base chiplet then has four CPU core chiplets, etched in 18A, stacked on top of it using the Foveros hybrid bonding invented by Intel to link the wires under the cores to the wires atop the base tile into a 3D processing complex. The whole shebang across EMIB and Foveros wires is what Soltis called a "monolithic mesh coherent interconnect," but really, the mostly 2D layout of a monolithic die could also be called that. The point is that, logically speaking (meaning according to the logic embodied in the design, not the logic of an argument), this looks like a much faster mesh interconnect, and the 3D nature of it does not really affect that logic.
Things sometimes go up or down instead of going far over there.

Drilling down into the Clearwater Forest "Darkmont" E-cores, there are four cores in a module, and they wrap around 4 MB of unified L2 cache, which is 17 cycles away from the cores. Each core has 200 GB/sec of bandwidth into the L2 cache, which is twice as much bandwidth as the "Sierra Glen" cores used in the Sierra Forest CPUs had. The L2 cache has a fabric port on it with 35 GB/sec of bandwidth, which is how the cores talk to the outside world; the cores within a module link to each other through the L2 cache ports.

Based on the SPECint_rate_2017 throughput test, the Darkmont core can do 17 percent more instructions per clock than the Sierra Glen core used in the Sierra Forest CPU. So, how did Intel do that? Well, by doubling up the cores and by boosting many of the features in the microarchitecture by somewhere between 1.5X and 2X.

It all starts with the front end: The Darkmont core has 64 KB of instruction cache and 32 KB of data cache, just like its Sierra Glen predecessor, which was itself a variant of the "Crestmont" core used in PCs. Soltis said that the new E-cores can decode nine instructions per cycle, based on three decoders that can handle three instructions each. The Sierra Glen cores could decode six instructions per cycle, so that is a 1.5X bump there. As usual, the branch predictor has been made better, and is more accurate thanks to a deeper branch history and its ability to handle larger data structures.

The out of order engine behind the front end is now eight instructions wide (it was five with Sierra Glen, so that is a 1.6X boost), and the OOO engine can retire 16 instructions per cycle (up from eight with Sierra Glen, a 2X bump). The out of order window is now 416 instructions, up 1.6X from Sierra Glen, and the OOO engine in Darkmont has 26 execution ports, up 1.5X.
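Those generation-over-generation front end and out of order multipliers are easy to sanity check. Here is a quick Python snippet, using only the width figures cited above, that reproduces the 1.5X, 1.6X, and 2X bumps:

```python
# Darkmont versus Sierra Glen pipeline widths, as cited by Soltis.
darkmont = {"decode width": 9, "allocation width": 8, "retire width": 16}
sierra_glen = {"decode width": 6, "allocation width": 5, "retire width": 8}

# Compute the generation-over-generation multipliers.
for feature, new_width in darkmont.items():
    ratio = new_width / sierra_glen[feature]
    print(f"{feature}: {ratio:.1f}X")  # 1.5X, 1.6X, 2.0X
```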
There are twice as many integer, vector, and store address generation units in this Darkmont core, and 1.5X as many load address generation units. (The wonder is that IPC is not a lot higher, really.) The memory subsystem in the core can do three loads per cycle (up 1.5X) and two stores per cycle (1X, or the same). The buffering on the L2 cache is 128 outstanding misses (up 2X over Sierra Glen).

Add it all up and you have 72 core modules, each with four cores and 8 MB of L3 cache, for a total of 288 cores and 576 MB of L3 cache in a single Clearwater Forest Xeon 7 E-core CPU.

Of course, what really matters here is performance, and Soltis gave us a hint of where a Clearwater Forest platform might end up: Compared to the 288-core Sierra Forest server platform, the two-socket Clearwater Forest platform, with 576 cores, will be a beast. Soltis says that on a read benchmark test (he did not say which one) the Xeon 7 E-core platform delivered 1,300 GB/sec of memory bandwidth. This was helped by the fact that the Clearwater Forest socket has twelve DDR5 memory channels, and they run regular DDR5 memory (not Intel's MRDIMMs) at 8 GT/sec speeds.

The Clearwater Forest platform has 96 lanes of PCI-Express 5.0 I/O coming off those two processors, for a total of 1,000 GB/sec of measured bandwidth; 64 of those lanes can be allocated to CXL devices, including extended memory. There are also 144 UltraPath Interconnect NUMA links between the two Clearwater Forest CPUs, which have 576 GB/sec of bandwidth to create a shared memory cluster across those two sockets.

The chart above says 576 cores with 1,152 MB of L3 cache, which we get. But the chart also says the two-socket Clearwater Forest node is rated at 59 teraflops of oomph. Whether that is at FP64 precision we can't tell until we know the clock speeds, and even then, the cores don't have 512-bit AVX-512 vector units but rather a pair of simpler 128-bit AVX2 units.
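The feeds and speeds multiply out as you would expect. Here is a minimal Python check of the two-socket totals using the per-socket figures above, with the DDR5 channel assumed to carry the usual 8 bytes of data per transfer:

```python
# Per-socket Clearwater Forest figures, as given above.
modules = 72           # core modules per socket
cores_per_module = 4
l3_per_module_mb = 8   # MB of L3 cache per module
ddr5_channels = 12
ddr5_gt_per_sec = 8    # regular DDR5 at 8 GT/sec
bytes_per_transfer = 8 # assumption: 64-bit data path per DDR5 channel
sockets = 2

cores = modules * cores_per_module * sockets
l3_mb = modules * l3_per_module_mb * sockets
peak_mem_bw = ddr5_channels * ddr5_gt_per_sec * bytes_per_transfer * sockets

print(cores)        # 576 cores
print(l3_mb)        # 1152 MB of L3 cache
print(peak_mem_bw)  # 1536 GB/sec peak, versus 1,300 GB/sec measured
```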
If Clearwater Forest ran at 2.56 GHz, then a server with 576 cores with those AVX2 units could do 5.9 teraflops by our math – 576 cores times 2.56 GHz times four FP64 operations per clock – but not 10X that. We are also not sure what the "5,000 GB/sec" of bandwidth in the chart above refers to. Aggregate L2 cache bandwidth into the 288 Xeon 7 E-cores in this compute engine is 57,600 GB/sec per socket, and the bandwidth from the L2 cache segments into the mesh fabric is 2,520 GB/sec. The peak theoretical memory bandwidth at 8 GT/sec across two sockets would be a mere 1,536 GB/sec. Go figure.
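For those who want to check our math, here it is in Python form. The four FP64 operations per clock per core is our assumption – a pair of 128-bit AVX2 units, each holding two 64-bit lanes – and the 2.56 GHz clock is hypothetical:

```python
# Back-of-the-envelope FP64 math for a hypothetical 2.56 GHz clock.
cores = 576          # two-socket Clearwater Forest node
clock_ghz = 2.56     # hypothetical clock speed
fp64_per_cycle = 4   # assumption: 2 x 128-bit AVX2 units x 2 FP64 lanes

teraflops = cores * clock_ghz * fp64_per_cycle / 1000
print(f"{teraflops:.1f} teraflops")  # 5.9 teraflops, not the 59 in the chart

# Aggregate L2 cache bandwidth per socket: 288 cores x 200 GB/sec each.
print(288 * 200)  # 57600 GB/sec
```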