
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication Through RL


🥳 Introduction

CUDA-L2 is a system that combines large language models (LLMs) with reinforcement learning (RL) to automatically optimize half-precision general matrix multiply (HGEMM) CUDA kernels. CUDA-L2 consistently outperforms the major matmul baselines, from the widely used torch.matmul to NVIDIA's state-of-the-art closed-source libraries (cuBLAS, cuBLASLt-heuristic, cuBLASLt-AutoTuning). Paper

Figure: Speedup of CUDA-L2 over torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning across 1000 (M, N, K) configurations on A100.

Figure: Speedup comparison results across 1000 (M, N, K) configurations on A100.
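Per-configuration speedups like those in the figures above are typically aggregated with a geometric mean, since they are ratios. A minimal sketch of that bookkeeping (the timing values and helper names are illustrative, not taken from the CUDA-L2 codebase):

```python
import math

def speedup(baseline_ms, candidate_ms):
    """Per-configuration speedup: values > 1.0 mean the candidate kernel is faster."""
    return baseline_ms / candidate_ms

def geomean_speedup(speedups):
    """Geometric mean is the standard aggregate for ratios across many configs."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical per-config timings in milliseconds for three (M, N, K) shapes.
baseline_ms = [1.20, 0.80, 2.50]   # e.g. a torch.matmul baseline
candidate_ms = [1.00, 0.75, 2.00]  # e.g. an RL-optimized HGEMM kernel

per_config = [speedup(b, c) for b, c in zip(baseline_ms, candidate_ms)]
print(round(geomean_speedup(per_config), 2))
```

In a real harness, each timing would come from repeated kernel launches on the GPU (with warmup and synchronization) over all 1000 (M, N, K) configurations before aggregating.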

🎉 What's New

[Dec 2, 2025] Released A100 optimized HGEMM kernels across 1,000 configurations.

🗒️ To-Do List
