
Amazon launches Trainium3 AI accelerator, competing directly against Blackwell Ultra in FP8 performance — new Trn3 Gen2 UltraServer takes vertical scaling notes from Nvidia's playbook


Amazon Web Services this week introduced its next-generation Trainium3 accelerator for AI training and inference. According to AWS, the new processor is twice as fast as its predecessor and four times more efficient, which the company says makes it one of the most cost-effective options for AI training and inference. In absolute numbers, Trainium3 offers up to 2,517 MXFP8 TFLOPS, roughly half of what Nvidia's Blackwell Ultra delivers per package. However, AWS's Trn3 UltraServer packs 144 Trainium3 chips per rack and delivers 0.36 ExaFLOPS of FP8 performance, matching Nvidia's GB300 NVL72. That is a big deal, as very few companies can challenge Nvidia's rack-scale AI systems.
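For context, the rack-level figure follows directly from the per-chip number. A quick sanity check in Python, using the 5 FP8 PFLOPS per B300 package listed in the comparison table below:

```python
# Rack-level FP8 math for the figures quoted above.
trainium3_fp8_pflops = 2.517        # per Trainium3 package (2,517 MXFP8 TFLOPS)
chips_per_trn3_ultraserver = 144
print(trainium3_fp8_pflops * chips_per_trn3_ultraserver / 1000)   # ~0.36 EFLOPS

b300_fp8_pflops = 5.0               # per B300 package, from the table below
gpus_per_gb300_nvl72 = 72
print(b300_fp8_pflops * gpus_per_gb300_nvl72 / 1000)              # ~0.36 EFLOPS
```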

AWS Trainium3

The AWS Trainium3 is a dual-chiplet AI accelerator equipped with 144 GB of HBM3E memory across four stacks, providing peak memory bandwidth of up to 4.9 TB/s. Each compute chiplet, allegedly made by TSMC on a 3nm-class fabrication process, contains four NeuronCore-v4 cores (which feature an extended ISA compared to their predecessors) and connects to two HBM3E memory stacks. The two chiplets are linked by a proprietary high-bandwidth interface and share 128 independent hardware data-movement engines (a key element of the Trainium architecture), collective communication cores that coordinate traffic between chips, and four NeuronLink-v4 interfaces for scale-out connectivity.
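Summed up as a small data structure (an illustrative sketch based only on the figures above; the per-stack number is simply the package bandwidth divided across the four stacks):

```python
# Illustrative summary of the Trainium3 package as described above.
trainium3_package = {
    "chiplets": 2,
    "cores_per_chiplet": 4,          # NeuronCore-v4
    "hbm3e_stacks_per_chiplet": 2,
    "hbm_capacity_gb": 144,
    "hbm_bandwidth_tbps": 4.9,
    "dma_engines": 128,              # shared hardware data-movement engines
    "neuronlink_v4_ports": 4,        # scale-out connectivity
}

stacks = trainium3_package["chiplets"] * trainium3_package["hbm3e_stacks_per_chiplet"]
per_stack_tbps = trainium3_package["hbm_bandwidth_tbps"] / stacks
print(f"{per_stack_tbps:.2f} TB/s per HBM3E stack")   # ~1.23 TB/s
```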

AWS Trainium vs Nvidia Blackwell

| Accelerator name | Trainium2 | Trainium3 | B200 | B300 (Ultra) |
|---|---|---|---|---|
| Architecture | Trainium2 | Trainium3 | Blackwell | Blackwell Ultra |
| Process technology | ? | N3E or N3P | 4NP | 4NP |
| Physical configuration | 2 x accelerator chiplets | 2 x accelerator chiplets | 2 x reticle-sized GPUs | 2 x reticle-sized GPUs |
| Packaging | CoWoS-? | CoWoS-? | CoWoS-L | CoWoS-L |
| FP4 PFLOPS (per package) | - | 2.517 | 10 | 15 |
| FP8/FP6 PFLOPS (per package) | 1.299 | 2.517 | 5 | 5 |
| INT8 POPS (per package) | - | - | 5 | 0.33 |
| BF16 PFLOPS (per package) | 0.667 | 0.671 | 2.5 | 2.5 |
| TF32 PFLOPS (per package) | 0.667 | 0.671 | 1.15 | 1.25 |
| FP32 PFLOPS (per package) | 0.181 | 0.183 | 0.08 | 0.08 |
| FP64 / FP64 Tensor TFLOPS (per package) | - | - | 40 | 1.3 |
| Memory | 96 GB HBM3 | 144 GB HBM3E | 192 GB HBM3E | 288 GB HBM3E |
| Memory bandwidth | 2.9 TB/s | 4.9 TB/s | 8 TB/s | 8 TB/s |
| HBM stacks | 4 | 4 | 8 | 8 |
| Inter-chip interconnect | NeuronLink-v3, 1.28 TB/s | NeuronLink-v4, 2.56 TB/s | NVLink 5.0, 200 GT/s, 1.8 TB/s bidirectional | NVLink 5.0, 200 GT/s, 1.8 TB/s bidirectional |
| SerDes speed (unidirectional) | ? | ? | 224 Gb/s | 224 Gb/s |
| TDP | ? | ? | 1,200 W | 1,400 W |
| Accompanying CPU | Intel Xeon | AWS Graviton and Intel Xeon | 72-core Grace | 72-core Grace |
| Launch year | 2024 | 2025 | 2024 | 2025 |

A NeuronCore-v4 integrates four execution blocks (a tensor engine, a vector engine, a scalar engine, and a GPSIMD block) alongside 32 MB of local SRAM that is explicitly managed by the compiler rather than operating as a hardware-controlled cache. From a software development standpoint, the core is built around a software-defined dataflow model: DMA engines stage data into SRAM, the execution units process it, and results are written back, with near-memory accumulation allowing the DMA engines to perform read-add-write operations in a single transaction. The SRAM is not coherent across cores and is used for tiling, staging, and accumulation rather than general caching.
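To make the dataflow model concrete, here is a minimal NumPy sketch of the staging-and-accumulation pattern described above; the tile size and function names are illustrative, and this is not Neuron SDK code:

```python
import numpy as np

TILE = 128

def staged_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=np.float32)              # stand-in for HBM-resident output
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE), dtype=np.float32)  # stand-in for an SRAM tile
            for p in range(0, k, TILE):
                a_tile = a[i:i+TILE, p:p+TILE]            # "DMA" stage-in from HBM
                b_tile = b[p:p+TILE, j:j+TILE]
                acc += a_tile @ b_tile                    # compute on staged tiles
            out[i:i+TILE, j:j+TILE] += acc                # read-add-write on write-back
    return out

a = np.random.randn(256, 256).astype(np.float32)
b = np.random.randn(256, 256).astype(np.float32)
assert np.allclose(staged_matmul(a, b), a @ b, atol=1e-2)
```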

(Image credit: AWS)

The Tensor Engine is a systolic-style matrix processor for GEMM, convolution, transpose, and dot-product operations and supports MXFP4, MXFP8, FP16, BF16, TF32, and FP32 inputs with BF16 or FP32 outputs. Per core, it delivers 315 TFLOPS in MXFP8/MXFP4, 79 TFLOPS in BF16/FP16/TF32, and 20 TFLOPS in FP32, and it implements structured sparsity acceleration using M:N patterns (such as 4:16, 4:12, 4:8, 2:8, 2:4, 1:4, and 1:2), allowing the same 315 TFLOPS peak on supported sparse workloads.
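Those per-core numbers are consistent with the package-level figure in the table: eight NeuronCore-v4 cores per package (four per chiplet, two chiplets) at 315 MXFP8 TFLOPS each lands right around the quoted 2,517 TFLOPS.

```python
# Per-core to per-package tensor throughput, using the figures above.
cores_per_chiplet = 4
chiplets_per_package = 2
mxfp8_tflops_per_core = 315
print(mxfp8_tflops_per_core * cores_per_chiplet * chiplets_per_package)  # 2520 TFLOPS, vs. the quoted 2,517
```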

The Vector Engine, which handles vector transforms, provides about 1.2 TFLOPS of FP32 throughput, hardware conversion into MXFP formats, and a fast exponent unit with four times the throughput of the scalar exponent path, aimed at attention workloads. The unit supports a range of data types, including FP8, FP16, BF16, TF32, FP32, INT8, INT16, and INT32.
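The emphasis on exponent throughput matters because attention spends much of its vector-engine time in softmax, where every score passes through exp(). A minimal NumPy illustration of that inner step (shapes are arbitrary):

```python
import numpy as np

# Softmax over attention scores; the exp() on every element is the step
# a faster exponent unit accelerates.
def softmax(scores: np.ndarray) -> np.ndarray:
    shifted = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    exps = np.exp(shifted)                                   # exponent-heavy step
    return exps / exps.sum(axis=-1, keepdims=True)

scores = np.random.randn(8, 128, 128).astype(np.float32)     # (heads, queries, keys)
weights = softmax(scores)
```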
