UALink (short for Ultra Accelerator Link) is an upcoming interconnect technology designed to enable high-speed, low-latency communication between AI accelerators (ASICs, GPUs, FPGAs, NPUs, XPUs) and other compute devices across a scale-up logical domain. Many see it as an important path forward for the future of AI data centers due to its planned performance, cost, and power efficiency advantages, not to mention that, as an open standard, it will reduce vendor lock-in.
In 2025, the UALink Consortium published revision 1.0 of the UALink specification, the point from which hardware designers can officially build the technology into the AI/HPC accelerators and switch ASICs required to assemble AI pods with up to 1,024 accelerators. But while UALink is widely supported across the industry, and the specification defining accelerator-to-accelerator communication is available now, broad adoption is still several years away.
What is UALink?
UALink is meant to let programmers treat multiple accelerators as a single processor with one large memory pool (or at least to enable parallelism with minimal effort from developers) and to greatly simplify communication between processors.
UALink was designed as a competitor to Nvidia's proprietary NVLink interconnect, and it is backed by a broad range of industry players, including AMD, Arm, AWS, Broadcom, Cadence, Google, Intel, Marvell, Meta, Microsoft, and Synopsys, to name just a few.
The UALink 200G 1.0 specification supports up to 1,024 accelerators per domain (or pod) with per-lane signaling at 212.5 GT/s, and it enables direct memory access between accelerators using simple load/store and atomic operations, so a pod behaves like a single system. UALink is built around a lightweight layered protocol stack consisting of a Protocol Layer (UPLI), Transaction Layer (TL), Data Link Layer (DL), and Physical Layer (PL).
(Image credit: UALink)
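For a rough sense of what load/store and atomic semantics across a pod mean for software, here is a minimal Python sketch of a pod-wide address space. The PodMemory class and its load, store, and atomic_add methods are hypothetical illustrations of the idea, not an API defined by the UALink specification.

```python
# Illustrative sketch only: UALink lets accelerators reach each other's memory
# with load/store and atomic operations, so software can treat the pod as one
# address space. The names below (PodMemory, load, store, atomic_add) are
# hypothetical and not part of the UALink specification.

class PodMemory:
    """Toy model of a pod-wide address space spanning many accelerators."""

    def __init__(self, num_accelerators: int, bytes_per_accelerator: int):
        # Each accelerator contributes a slice of the shared address space.
        self.mem = [bytearray(bytes_per_accelerator) for _ in range(num_accelerators)]

    def store(self, accel_id: int, offset: int, data: bytes) -> None:
        # A store targeting a remote accelerator travels over the fabric as a
        # simple write request rather than an explicit network message.
        self.mem[accel_id][offset:offset + len(data)] = data

    def load(self, accel_id: int, offset: int, length: int) -> bytes:
        return bytes(self.mem[accel_id][offset:offset + length])

    def atomic_add(self, accel_id: int, offset: int, value: int) -> int:
        # Read-modify-write on remote memory; returns the previous value.
        old = int.from_bytes(self.mem[accel_id][offset:offset + 4], "little")
        new = (old + value) & 0xFFFFFFFF
        self.mem[accel_id][offset:offset + 4] = new.to_bytes(4, "little")
        return old


pod = PodMemory(num_accelerators=1024, bytes_per_accelerator=4096)
pod.store(accel_id=37, offset=0, data=b"\x01\x02\x03\x04")
print(pod.load(accel_id=37, offset=0, length=4))
print(pod.atomic_add(accel_id=37, offset=0, value=1))
```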
At the physical layer, UALink reuses standard Ethernet PHY signaling (such as 200GBASE-KR1, 400GBASE-KR2, and 800GBASE-KR4) to simplify implementation, but introduces custom framing, forward error correction (FEC), and latency optimizations. Each serial lane runs at 212.5 GT/s, delivering an effective 200 GT/s of data per lane after FEC overhead. Links can be configured as x1, x2, or x4, for up to 800 GT/s of bandwidth per direction per link. The DL layer formats traffic into 640-byte FLITs with CRC and segment headers, while the TL layer compresses request and response messages into 4–16 byte payloads to cut latency and keep die area in check.
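As a sanity check on these figures, here is a minimal back-of-the-envelope sketch in Python. It treats the effective 200 GT/s per lane as roughly 200 Gbps of data, and the flit-packing ratio ignores CRC and segment-header bytes; both simplifications are assumptions for illustration, not figures taken from the specification.

```python
# Back-of-the-envelope math for the lane and link numbers quoted above.

SIGNALING_RATE_GTPS = 212.5   # raw per-lane signaling rate
EFFECTIVE_RATE_GTPS = 200.0   # per-lane data rate after FEC/framing overhead

fec_overhead = 1 - EFFECTIVE_RATE_GTPS / SIGNALING_RATE_GTPS
print(f"PHY/FEC overhead per lane: ~{fec_overhead:.1%}")   # ~5.9%

for lanes in (1, 2, 4):
    per_direction = lanes * EFFECTIVE_RATE_GTPS
    print(f"x{lanes} link: {per_direction:.0f} GT/s per direction")
# x4 link: 800 GT/s per direction

# How many compressed 4-16 byte TL messages fit in a 640-byte DL flit,
# ignoring CRC and segment-header bytes (not modeled in this sketch).
FLIT_BYTES = 640
for tl_bytes in (4, 16):
    print(f"Up to ~{FLIT_BYTES // tl_bytes} {tl_bytes}-byte TL messages per flit "
          f"(before CRC/segment-header overhead)")
```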
According to UALink's developers, the protocol delivers deterministic latency below 1 µs and up to 93% effective bandwidth utilization, a notably high figure for a switched fabric. UALink does not replace Ethernet, PCIe, or CXL, but is designed to coexist with those technologies within system nodes, serving solely for peer-to-peer traffic between accelerators.
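To put those two numbers together, the sketch below estimates how long a single transfer would take over an x4 link at the claimed utilization, using the sub-microsecond latency as an upper bound. The 1 MiB transfer size is an arbitrary illustrative choice, not something drawn from the specification.

```python
# Rough illustration of what ~93% effective utilization and sub-microsecond
# latency mean for a single transfer between two accelerators.

LINK_GBPS = 800          # x4 link, per direction, effective data rate
UTILIZATION = 0.93       # effective bandwidth utilization claimed for UALink
LATENCY_S = 1e-6         # deterministic latency bound (< 1 microsecond)

transfer_bytes = 1 << 20  # 1 MiB payload, purely illustrative
transfer_bits = transfer_bytes * 8

serialization_s = transfer_bits / (LINK_GBPS * 1e9 * UTILIZATION)
total_s = LATENCY_S + serialization_s
print(f"1 MiB over a x4 link: ~{total_s * 1e6:.1f} microseconds")
# ~12.3 microseconds, dominated by serialization rather than fabric latency
```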
The system architecture is centered on UALink Switches (ULS) that enable point-to-point accelerator communication within and across racks. These switches must support lossless delivery, non-blocking fabric behavior, and virtual pod isolation. Each accelerator is assigned a 10-bit routing ID (hence the limit of 1,024 accelerators per pod), and switches maintain per-port routing tables to support scale-up topologies. The standard includes fault containment, error detection, and isolation mechanisms that limit failures to a single virtual pod without impacting others in the 'large' scale-up pod.
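The addressing arithmetic is simple: 10 bits give 2^10 = 1,024 distinct IDs. The sketch below models a per-port routing table and a virtual-pod isolation check in Python; the class name, table layout, and isolation logic are hypothetical illustrations, not structures defined by the standard.

```python
# Minimal sketch of the addressing implied above: a 10-bit routing ID allows
# at most 2**10 = 1,024 accelerators per pod, and each switch port can map a
# destination ID to an egress port while confining traffic to its virtual pod.

MAX_ACCELERATORS = 1 << 10  # 10-bit routing ID -> 1,024 endpoints per pod


class UALinkSwitchPort:
    """Toy per-port routing table keyed by 10-bit destination ID."""

    def __init__(self):
        self.routes: dict[int, int] = {}       # dest ID -> egress port
        self.virtual_pod: dict[int, int] = {}  # dest ID -> virtual pod number

    def add_route(self, dest_id: int, egress_port: int, vpod: int) -> None:
        if not 0 <= dest_id < MAX_ACCELERATORS:
            raise ValueError("destination ID must fit in 10 bits")
        self.routes[dest_id] = egress_port
        self.virtual_pod[dest_id] = vpod

    def forward(self, src_vpod: int, dest_id: int) -> int:
        # Isolation check: traffic never crosses into another virtual pod.
        if self.virtual_pod.get(dest_id) != src_vpod:
            raise PermissionError("destination is outside the sender's virtual pod")
        return self.routes[dest_id]


port = UALinkSwitchPort()
port.add_route(dest_id=5, egress_port=12, vpod=0)
print(port.forward(src_vpod=0, dest_id=5))  # -> 12
```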