
Inference cost at scale with napkin math

Why This Matters

This article highlights how simple calculations can help estimate the costs of deploying large-scale AI inference, enabling companies to optimize GPU usage and manage expenses effectively. Understanding these 'napkin math' principles is crucial for scaling AI services efficiently and making informed infrastructure decisions in the tech industry.


If you serve AI models as a part of your product stack, you've likely wondered what kind of scale your GPU cluster tops out at.

With some rudimentary knowledge of your hardware and model architecture, we can work out the dollar cost per user on the back of a napkin.


Resources on a single GPU

On any GPU's spec sheet you can find these two metrics:

Peak throughput: Number of floating-point operations per second. Usually in TFLOP/s (1 TFLOP/s = \(10^{12}\) ops/sec).

Memory bandwidth: Amount of data that can be moved per second from global memory (VRAM) to registers (SRAM). Usually in TB/sec.

We'll assume FP-8 quantization to compute throughput, though it's easy to adjust the math for FP-16 as well.
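
As a rough sketch of how these two spec-sheet numbers can be put on the same footing, the snippet below converts them to base units and computes how many operations the GPU can perform per byte it reads. The hardware figures are illustrative placeholders, not taken from any particular spec sheet, and the function names are hypothetical. The FP-16 adjustment assumes only that each value occupies 2 bytes instead of 1; the peak throughput for FP-16 should be read off the spec sheet separately.

```python
# Napkin-math helpers for the two spec-sheet metrics.
# All hardware numbers below are illustrative placeholders (assumptions),
# not values from a real GPU spec sheet.

TERA = 1e12  # 1 TFLOP/s = 1e12 ops/sec, 1 TB/sec = 1e12 bytes/sec

# Bytes occupied by one value at each precision.
BYTES_PER_VALUE = {"fp8": 1, "fp16": 2}


def ops_per_byte(peak_tflops: float, bandwidth_tb_per_s: float) -> float:
    """Operations the GPU can perform in the time it takes to read one byte."""
    return (peak_tflops * TERA) / (bandwidth_tb_per_s * TERA)


def seconds_to_read(n_values: float, dtype: str, bandwidth_tb_per_s: float) -> float:
    """Time to move n_values numbers of the given precision out of VRAM once."""
    n_bytes = n_values * BYTES_PER_VALUE[dtype]
    return n_bytes / (bandwidth_tb_per_s * TERA)


if __name__ == "__main__":
    peak_tflops_fp8 = 2000.0   # placeholder peak FP-8 throughput, TFLOP/s
    bandwidth = 3.0            # placeholder memory bandwidth, TB/sec

    print(f"ops per byte: {ops_per_byte(peak_tflops_fp8, bandwidth):.0f}")
    # Reading 70e9 values once, at FP-8 versus FP-16:
    for dtype in ("fp8", "fp16"):
        t = seconds_to_read(70e9, dtype, bandwidth)
        print(f"{dtype}: {t * 1e3:.1f} ms to read 70e9 values")
```

Switching from FP-8 to FP-16 doubles the bytes per value, so the same read takes twice as long at a fixed bandwidth.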

Cost of a Matrix Multiplication
