Inference cost at scale with napkin math

If you serve AI models as a part of your product stack, you've likely wondered what kind of scale your GPU cluster tops out at.

With some rudimentary knowledge about your hardware and model architecture, we can work out the dollar cost-per-user on the back of a napkin.

Resources on a single GPU

On any GPU's spec sheet you can find these metrics:

Peak throughput: Number of floating-point operations per second. Usually quoted in TFLOP/s (1 TFLOP/s = \(10^{12}\) ops/sec).

Memory bandwidth: Amount of data that can be moved per second from global memory (VRAM) to registers (SRAM). Usually quoted in TB/sec.

We'll assume FP-8 quantization to compute throughput, though it's easy to adjust the math for FP-16 as well.
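To make the two spec-sheet metrics concrete, here is a minimal Python sketch of the kind of napkin arithmetic they allow. The spec values below are placeholders chosen for illustration, not figures for any particular GPU, and the helper names are hypothetical. Substitute the numbers from your own spec sheet.

```python
# Placeholder spec-sheet values (assumptions for illustration only).
PEAK_TFLOPS = 2000.0        # peak FP8 throughput, in TFLOP/s (assumed)
MEM_BANDWIDTH_TBPS = 3.0    # memory bandwidth, in TB/sec (assumed)
BYTES_PER_VALUE = 1         # FP8 uses 1 byte per value; use 2 for FP16

TERA = 1e12


def seconds_to_compute(flops, peak_tflops=PEAK_TFLOPS):
    """Lower bound on time to execute `flops` operations at peak throughput."""
    return flops / (peak_tflops * TERA)


def seconds_to_move(num_values, bandwidth_tbps=MEM_BANDWIDTH_TBPS,
                    bytes_per_value=BYTES_PER_VALUE):
    """Lower bound on time to move `num_values` values out of global memory."""
    return (num_values * bytes_per_value) / (bandwidth_tbps * TERA)


def ops_per_byte(peak_tflops=PEAK_TFLOPS, bandwidth_tbps=MEM_BANDWIDTH_TBPS):
    """Operations the GPU can execute in the time it takes to move one byte."""
    return peak_tflops / bandwidth_tbps


if __name__ == "__main__":
    print(f"ops per byte moved: {ops_per_byte():.1f}")
    print(f"time for 1e15 ops: {seconds_to_compute(1e15):.4f} s")
    print(f"time to move 1e12 FP8 values: {seconds_to_move(1e12):.4f} s")
```

Both time estimates are best-case bounds: real kernels rarely reach peak throughput or peak bandwidth. Switching to FP-16 means setting `bytes_per_value` to 2 and using the FP-16 throughput figure from the spec sheet.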

Cost of a Matrix Multiplication
