Surprisingly fast AI-generated kernels we didn't mean to publish yet
Published on: 2025-06-10 21:03:12
TL;DR
We have some very fast AI-generated kernels written in pure CUDA-C, without using libraries or DSLs such as CUTLASS and Triton. They perform close to, and in some cases even beat, the standard expert-optimized production kernels shipped in PyTorch. Some of our highlighted results:
Matmul (FP32): 101.3% performance of FP32 torch.matmul; problem size: 4096x4096 square matrices
Conv2D: 179.9% performance of FP32 torch.nn.Conv2D; problem size: (100, 3, 224, 224) input tensor, conv(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2)
Softmax: 111.8% performance of FP32 torch.softmax; problem size: (4096, 65536) input tensor
LayerNorm: 484.4% performance of FP32 torch.nn.LayerNorm
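As a rough illustration of how a number like "101.3% performance of torch.matmul" can be measured, here is a minimal timing sketch in PyTorch. The details below (CUDA-event timing, the median over repeated runs, and the custom_matmul placeholder standing in for a compiled CUDA-C kernel) are illustrative assumptions, not the exact harness behind these results.

```python
import torch

def bench_ms(fn, *args, warmup=10, iters=100):
    """Median runtime of fn(*args) in milliseconds, timed with CUDA events."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return sorted(times)[len(times) // 2]

# FP32 4096x4096 square matrices, matching the matmul problem size above.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float32)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float32)

# Placeholder: a real run would load the generated CUDA-C kernel, e.g. via
# torch.utils.cpp_extension.load, and call it here instead.
custom_matmul = torch.matmul

ref_ms = bench_ms(torch.matmul, a, b)
custom_ms = bench_ms(custom_matmul, a, b)
# >100% means the custom kernel is faster than the PyTorch baseline.
print(f"custom kernel: {100 * ref_ms / custom_ms:.1f}% of torch.matmul")
```

Under this convention, a figure like "179.9% of torch.nn.Conv2D" means the generated kernel finishes the same problem in a bit over half the baseline's time.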