Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data (2024)

Why This Matters

This article shows that the performance of matrix multiplications on GPUs can vary significantly with the contents of the input data, and that even highly optimized libraries like CUTLASS and CuBLAS are affected. Developers and researchers should therefore account for input data properties when benchmarking and deploying GPU-accelerated computations, which matters for building consistent and reliable AI and scientific computing workloads.

Key Takeaways

It’s 2022. I check out this cool new project, CUTLASS, with very fast matmuls. I take a large matmul, 8192 x 8192 x 8192, and benchmark it in PyTorch, which calls CuBLAS.

python mm_bench.py
> CuBLAS: 258 Teraflops
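The script mm_bench.py itself is not shown. As a rough sketch of what such a benchmark might look like, assuming PyTorch with a CUDA device and bf16 inputs (the dtype, the iteration counts, and the helper names matmul_tflops and bench_matmul are all assumptions for illustration, not taken from the original script):

```python
try:
    import torch
except ImportError:  # lets the arithmetic helper run without PyTorch installed
    torch = None


def matmul_tflops(m, n, k, seconds):
    """Teraflops for one (m x k) @ (k x n) matmul, which costs 2*m*n*k FLOPs."""
    return 2 * m * n * k / seconds / 1e12


def bench_matmul(a, b, iters=50, warmup=10):
    """Average seconds per a @ b on the GPU, timed with CUDA events."""
    for _ in range(warmup):  # warm up so one-time setup cost is not timed
        a @ b
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()  # wait for the GPU before reading the timers
    return start.elapsed_time(end) / 1e3 / iters  # elapsed_time returns ms


if torch is not None and torch.cuda.is_available():
    N = 8192
    # bf16 is an assumption; the text does not state the dtype.
    a = torch.randn(N, N, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(N, N, device="cuda", dtype=torch.bfloat16)
    print(f"CuBLAS: {matmul_tflops(N, N, N, bench_matmul(a, b)):.0f} Teraflops")
```

CUDA events are used rather than wall-clock timing because kernel launches are asynchronous, so a host-side timer without synchronization would mostly measure launch overhead.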

Not bad, 83% flop utilization. Now let’s check out CUTLASS’s performance using their profiler.

./cutlass_profiler --operation=Gemm --m=8192 --n=8192 --k=8192
> CUTLASS: 288 Teraflops

!!! 10% higher perf? That’s incredible. CuBLAS is highly optimized for large compute-bound matmuls, and somehow CUTLASS + autotuning is outperforming it by 10%? We gotta start using these matmuls yesterday.
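For reference, the figures quoted so far can be checked with a few lines of arithmetic. The text does not name the GPU or its peak throughput, so the peak below is only what the two stated numbers (258 Teraflops at 83% utilization) imply, not a quoted spec. The variable names are illustrative.

```python
cublas_tflops = 258     # PyTorch / CuBLAS result quoted above
cutlass_tflops = 288    # CUTLASS profiler result quoted above
utilization = 0.83      # utilization quoted for the CuBLAS run

# Peak throughput implied by "258 Teraflops is 83% of peak".
implied_peak_tflops = cublas_tflops / utilization

# An (m x k) @ (k x n) matmul performs 2*m*n*k floating-point operations.
flops_per_matmul = 2 * 8192**3
cublas_ms = flops_per_matmul / (cublas_tflops * 1e12) * 1e3

# Relative gain of the CUTLASS profiler number over CuBLAS.
gain_pct = (cutlass_tflops / cublas_tflops - 1) * 100

print(f"implied peak: {implied_peak_tflops:.0f} Teraflops")  # 311
print(f"time per CuBLAS matmul: {cublas_ms:.2f} ms")         # 4.26
print(f"CUTLASS gain: {gain_pct:.1f}%")                      # 11.6
```

The gain works out to roughly 12%, which the text rounds to 10%.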

The next step is to bind the CUTLASS kernels into Python and compare against CuBLAS using my previous script.

python cutlass_mm_bench.py
> CuBLAS: 258 Teraflops
> CUTLASS: 257 Teraflops

Somehow, in the light of Python, all of CUTLASS’s performance gains disappear. This in and of itself is not shocking - it’s notoriously difficult to ensure consistent benchmarking across setups.

I tediously ablate the two benchmark scripts, until finally, I find that CUTLASS’s profiler, by default, actually initializes the values in a fairly strange way - it only initializes the inputs with integers. Confused about whether this matters, I try:

zero_inputs = torch.zeros(N, N)
randn_inputs = torch.randn(N, N)
benchmark(zero_inputs)   # 295 Teraflops
benchmark(randn_inputs)  # 257 Teraflops
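The benchmark helper used in that snippet is not defined in the excerpt. One way to reproduce the comparison, including an integer-valued case like the one the CUTLASS profiler uses by default, might look like the sketch below. It assumes PyTorch with a CUDA device and bf16 inputs. The make_inputs helper, the integer range [-3, 3], and the iteration counts are illustrative assumptions, and the profiler's actual range is not given in the text.

```python
try:
    import torch
except ImportError:  # keeps the file importable on machines without PyTorch
    torch = None


def make_inputs(kind, n, device="cuda", dtype=None):
    """Return an n x n matrix whose values follow the requested pattern."""
    if kind not in ("zeros", "integers", "randn"):
        raise ValueError(f"unknown input kind: {kind!r}")
    dtype = dtype or torch.bfloat16  # dtype is an assumption, not from the text
    if kind == "zeros":
        return torch.zeros(n, n, device=device, dtype=dtype)
    if kind == "integers":
        # Integer-valued entries; the range [-3, 3] is an illustrative assumption.
        return torch.randint(-3, 4, (n, n), device=device).to(dtype)
    return torch.randn(n, n, device=device, dtype=dtype)


def benchmark(x, iters=50, warmup=10):
    """Teraflops achieved by x @ x, timed with CUDA events."""
    for _ in range(warmup):
        x @ x
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        x @ x
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time returns ms
    n = x.shape[0]
    return 2 * n**3 / seconds / 1e12


if torch is not None and torch.cuda.is_available():
    N = 8192
    for kind in ("zeros", "integers", "randn"):
        print(f"{kind:>8}: {benchmark(make_inputs(kind, N)):.0f} Teraflops")
```

Keeping the shape, dtype, and timing loop identical across the three cases means any difference in the reported Teraflops comes from the values in the matrices alone.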
