Surprisingly fast AI-generated kernels we didn't mean to publish yet

Published on: 2025-06-10 21:03:12

TL;DR We have some very fast AI-generated kernels written in pure CUDA-C, without using libraries or DSLs such as CUTLASS and Triton. They perform close to, and in some cases even beat, the standard expert-optimized production kernels shipped in PyTorch. Some of our highlighted results:

- Matmul (FP32): 101.3% performance of FP32 torch.matmul; problem size: 4096x4096 square matrices
- Conv2D: 179.9% performance of FP32 torch.nn.Conv2d; problem size: (100, 3, 224, 224) input tensor, conv(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2)
- Softmax: 111.8% performance of FP32 torch.softmax; problem size: (4096, 65536) input tensor
- LayerNorm: 484.4% performance of FP32 torch.nn.LayerNorm
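To make the comparison concrete, here is a minimal sketch of how one might time the FP32 PyTorch baselines at the problem sizes above. The `bench` helper, warmup count, and iteration count are our illustrative assumptions, not the harness actually used to produce these numbers.

```python
import torch

def bench(fn, *args, warmup=10, iters=100):
    """Time a CUDA op with CUDA events; returns mean milliseconds per call.
    (Hypothetical helper for illustration, not the post's actual harness.)"""
    for _ in range(warmup):          # warm up caches / autotuning
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# FP32 reference problems matching the highlighted results above.
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
print("matmul  ms:", bench(torch.matmul, a, b))

x = torch.randn(100, 3, 224, 224, device="cuda")
conv = torch.nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2).cuda()
print("conv2d  ms:", bench(conv, x))

s = torch.randn(4096, 65536, device="cuda")
print("softmax ms:", bench(torch.softmax, s, -1))
```

A custom kernel's percentage figure then follows by dividing the reference op's mean time by the candidate kernel's mean time on the same inputs.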