This week, Tachyum, a firm that has been promising a processor for six years and counting without shipping it, published new target specifications and performance projections for its Prodigy universal processor. The update comes just a month after the company announced its latest round of financing and its intention to 'upgrade' the Prodigy processor, which still exists only on paper.
Tachyum has now set target specifications for its most powerful Prodigy processor, some of which seem unattainable in a realistic timeframe, and claims that a rack powered by its Prodigy Ultimate hardware will be over 21 times faster than Nvidia's upcoming NVL576 rack based on Rubin Ultra GPUs. However, the details about the Prodigy processor released this week may indicate that the device will be delayed by another four to five years, even in a best-case scenario.
Prodigious hardware
As reported a month ago, Tachyum's Prodigy processor, or rather system-in-package (SiP), is said to adopt a multi-chiplet design. Each chiplet is to be fabricated on TSMC's 2nm-class node and feature up to 256 highly custom cores with an 8-way out-of-order superscalar execution pipeline, plus matrix and vector accelerators.
Tachyum intends to introduce 12 Prodigy SKUs, with the range-topping Prodigy Ultimate carrying four chiplets and offering 768 or 1024 cores, up to 1 GB of L2 and L3 cache, 128 PCIe lanes, and a 24-channel memory subsystem supporting up to 48 TB of DDR5-17600 memory per socket and up to 3.38 TB/s peak bandwidth per socket. The Prodigy Premium SKU runs two chiplets and offers 256 – 512 cores and a 16-channel memory subsystem, while the Prodigy Entry SKU has 32 – 256 cores and an 8-channel memory subsystem.
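The quoted 3.38 TB/s figure can be sanity-checked with simple arithmetic. The sketch below assumes a 64-bit (8-byte) data path per DDR5 channel and decimal terabytes; neither assumption comes from Tachyum, and the function name is purely illustrative.

```python
def peak_bandwidth_tb_s(channels: int, mega_transfers_per_s: int,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in TB/s (1 TB = 1e12 bytes).

    bytes_per_transfer=8 assumes a 64-bit data path per channel,
    which is an assumption, not a Tachyum-published figure.
    """
    bytes_per_s = channels * mega_transfers_per_s * 1_000_000 * bytes_per_transfer
    return bytes_per_s / 1e12


# Channel counts quoted for the Ultimate, Premium, and Entry tiers at DDR5-17600.
for name, channels in [("Ultimate", 24), ("Premium", 16), ("Entry", 8)]:
    print(f"{name}: {peak_bandwidth_tb_s(channels, 17600):.2f} TB/s")
```

Under those assumptions, 24 channels of DDR5-17600 work out to roughly 3.38 TB/s, which matches the per-socket number Tachyum quotes. This is a theoretical peak, not a sustained figure.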
(Image credit: Tachyum)
According to a Tachyum document, each chiplet contains what appears to be a systolic array of 264 cores organized into four 11×6 groups (66 cores per group). Eight of those 264 cores are redundant, which leaves 256 cores, or a 256-element matrix unit, visible to software per chiplet.
This corroborates Tachyum's claim that its built-in matrix processor supports 16×16, 8×8, and 4×4 operations. Also, such a design provides one extra CPU core/MAC element per row and one extra CPU core/MAC element per column, which is consistent with systolic array design practices that tend to include spare elements for yield and repairability. However, keep in mind that CPUs tend not to use systolic array-like arrangements due to complicated data flows and increased latencies.
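The core counts above can be checked with a few lines of arithmetic. This sketch reads the eight redundant cores as a per-chiplet total, which is the only reading under which the quoted figures add up. The tile counts are simple division and say nothing about how the hardware actually partitions its matrix unit.

```python
GROUPS = 4           # core groups per chiplet
ROWS, COLS = 11, 6   # layout of each group
SPARE_CORES = 8      # redundant cores per chiplet (assumed to be a chiplet-wide total)

physical_cores = GROUPS * ROWS * COLS          # 4 * 66
visible_cores = physical_cores - SPARE_CORES   # cores exposed to software


def tiles_per_chiplet(n: int) -> int:
    """How many n-by-n tiles fit into the visible elements (plain division)."""
    return visible_cores // (n * n)


print(physical_cores, visible_cores)
for n in (16, 8, 4):
    print(f"{n}x{n}: {tiles_per_chiplet(n)} tile(s)")
```

The 256 visible elements map onto one 16×16 tile, four 8×8 tiles, or sixteen 4×4 tiles, which lines up with the three operation sizes Tachyum lists.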
From what we can tell, each chiplet is designed to be a fully functional processor with up to 256 cores, 256 MB of L2 and L3 caches, its own eight-channel DDR5 memory subsystem, and I/O that includes up to 96 PCIe 7.0 lanes with 16 controllers. Note that Tachyum seems to reuse PCIe PHYs for die-to-die and socket-to-socket interconnections, which is why the range-topping Prodigy Ultimate 'only' offers 128 PCIe 7.0 lanes.
(Image credit: Tachyum)