Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data

Why This Matters

This article shows that the measured performance of GPU matrix multiplications can vary significantly with the contents of the input matrices and how they are initialized: predictable, uniform data can run noticeably faster than random data. This matters for anyone benchmarking or optimizing AI workloads, since results are only comparable across hardware and software setups when the inputs are initialized the same way.

Key Takeaways

It’s 2022. I check out this cool new project, CUTLASS, with very fast matmuls. I take a large matmul, 8192 x 8192 x 8192, and benchmark it in PyTorch, which calls CuBLAS.

    python mm_bench.py
    > CuBLAS: 258 Teraflops

Not bad, 83% FLOP utilization. Now let’s check out CUTLASS’s performance using their profiler.
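For readers who want to check the arithmetic behind a figure like "258 Teraflops" or "83% utilization", here is a minimal sketch. It uses the standard count of 2·M·N·K floating-point operations for an (M x K) by (K x N) matmul. The article does not name the GPU, so the peak value `ASSUMED_PEAK_TFLOPS = 312` is an assumption, chosen because it is the dense FP16 tensor-core peak of an A100-class GPU and it reproduces the quoted 83%. The helper names are hypothetical, not taken from the author's script.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per inner-product term."""
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOPS given the measured wall time of one matmul."""
    return matmul_flops(m, n, k) / seconds / 1e12


# Assumption: the hardware peak is not stated in the article. 312 TFLOPS is
# the dense FP16 tensor-core peak of an A100-class GPU, used for illustration.
ASSUMED_PEAK_TFLOPS = 312.0

flops = matmul_flops(8192, 8192, 8192)
# Invert the throughput formula: how long does one matmul take at 258 TFLOPS?
seconds_per_matmul = flops / 258e12

print(f"FLOPs per matmul: {flops:.3e}")
print(f"time per matmul at 258 TFLOPS: {seconds_per_matmul * 1e3:.2f} ms")
print(f"utilization vs assumed peak: {258 / ASSUMED_PEAK_TFLOPS:.0%}")
```

Under that assumed peak, one 8192 x 8192 x 8192 matmul takes about 4.26 ms at 258 TFLOPS, and the utilization comes out to 83%.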

    ./cutlass_profiler --operation=Gemm --m=8192 --n=8192 --k=8192
    > CUTLASS: 288 Teraflops

!!! 10% higher perf? That’s incredible. CuBLAS is highly optimized for large compute-bound matmuls, and somehow CUTLASS + autotuning is outperforming it by 10%? We gotta start using these matmuls yesterday.

The next step is to bind the CUTLASS kernels into Python and compare against CuBLAS using my previous script.

    python cutlass_mm_bench.py
    > CuBLAS: 258 Teraflops
    > CUTLASS: 257 Teraflops

Somehow, in the light of Python, all of CUTLASS’s performance gains disappear. This in and of itself is not shocking: it’s notoriously difficult to ensure consistent benchmarking across setups.

I tediously ablate the two benchmark scripts until I finally find that CUTLASS’s profiler, by default, initializes its inputs in a fairly strange way: it fills them only with integers. Confused about whether this matters, I try:

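    import torch

    N = 8192  # matrix size used throughout the article
    # benchmark() is the author's timing helper; its definition is not shown here
    zero_inputs = torch.zeros(N, N)
    randn_inputs = torch.randn(N, N)
    benchmark(zero_inputs)   # 295 Teraflops
    benchmark(randn_inputs)  # 257 Teraflops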
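The comparison above can be reproduced with a small self-contained harness. What follows is a sketch under stated assumptions, not the author's actual script: `benchmark` and `make_inputs` are hypothetical helpers, the float16 dtype is assumed (the article does not state the precision), and the integer range of -3 to 3 is a stand-in for whatever range the CUTLASS profiler really uses. The harness falls back to a small CPU run when no GPU is available, in which case the absolute numbers are not meaningful.

```python
import time

import torch


def benchmark(a: torch.Tensor, b: torch.Tensor, iters: int = 10, warmup: int = 3) -> float:
    """Return achieved TFLOPS for a @ b, averaged over `iters` timed runs."""
    m, k = a.shape
    n = b.shape[1]
    for _ in range(warmup):  # warm-up runs are excluded from the timing
        a @ b
    if a.is_cuda:
        torch.cuda.synchronize()  # kernels launch asynchronously; wait for them
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    if a.is_cuda:
        torch.cuda.synchronize()
    seconds = (time.perf_counter() - start) / iters
    return 2 * m * n * k / seconds / 1e12


def make_inputs(kind: str, n: int, device: str, dtype: torch.dtype) -> torch.Tensor:
    """Build an n x n input matrix with the requested kind of contents."""
    if kind == "zeros":
        return torch.zeros(n, n, device=device, dtype=dtype)
    if kind == "ints":
        # Assumption: small integers in [-3, 3]; the profiler's real range may differ.
        return torch.randint(-3, 4, (n, n), device=device).to(dtype)
    if kind == "randn":
        return torch.randn(n, n, device=device, dtype=dtype)
    raise ValueError(f"unknown input kind: {kind}")


if __name__ == "__main__":
    on_gpu = torch.cuda.is_available()
    device = "cuda" if on_gpu else "cpu"
    dtype = torch.float16 if on_gpu else torch.float32  # assumed precision
    n = 8192 if on_gpu else 256  # keep the CPU fallback quick
    for kind in ("zeros", "ints", "randn"):
        x = make_inputs(kind, n, device, dtype)
        print(f"{kind}: {benchmark(x, x):.1f} TFLOPS")
```

Keeping the timing loop identical and varying only the input contents is the point of the design: any gap between the three lines then comes from the data, not from the benchmark setup.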
