LLM Inference Handbook

Introduction

LLM Inference in Production is your technical glossary, guidebook, and reference, all in one. It covers everything you need to know about LLM inference, from core concepts and performance metrics (e.g., Time to First Token and Tokens per Second) to optimization techniques (e.g., continuous batching and prefix caching) and operational best practices.
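Both metrics can be measured directly from a streaming response. The sketch below is a minimal illustration, not taken from the handbook: `token_stream` is a hypothetical stand-in for whatever iterator your client exposes as tokens arrive, and the timing logic would apply to any such stream.

```python
import time
from typing import Iterable


def measure_streaming_metrics(token_stream: Iterable[str]) -> dict:
    """Compute Time to First Token (TTFT) and Tokens per Second (TPS)
    from any iterable that yields generated tokens as they arrive.

    `token_stream` is a hypothetical placeholder for a streaming
    response (e.g. an SSE or gRPC token iterator from your endpoint).
    """
    start = time.perf_counter()
    first_token_at = None
    count = 0

    for _ in token_stream:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # latency until the first token appears
        count += 1

    end = time.perf_counter()
    ttft = (first_token_at - start) if first_token_at else None
    # TPS is usually reported over the decode phase, i.e. after the first token.
    decode_time = (end - first_token_at) if first_token_at else 0.0
    tps = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return {"ttft_s": ttft, "tokens_per_s": tps, "total_tokens": count}
```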

Practical guidance for deploying, scaling, and operating LLMs in production.

Focus on what truly matters, not edge cases or technical noise.

Boost performance with optimization techniques tailored to your use case.

Continuously updated with the latest best practices and field-tested insights.

We wrote this handbook to solve a common problem facing developers: LLM inference knowledge is often fragmented; it’s buried in academic papers, scattered across vendor blogs, hidden in GitHub issues, or tossed around in Discord threads. Worse, much of it assumes you already understand half the stack.

There aren’t many resources that bring it all together — like how inference differs from training, why goodput matters more than raw throughput for meeting SLOs, or how prefill-decode disaggregation works in practice.
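To make the goodput distinction concrete, here is a minimal sketch (an illustration under assumed numbers, not the handbook's definition): raw throughput counts every generated token, while goodput counts only tokens from requests that met an SLO, here a hypothetical TTFT target.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    # Hypothetical per-request measurements from your serving layer.
    ttft_s: float      # time to first token
    tokens: int        # tokens generated for this request


def throughput_and_goodput(records: list[RequestRecord],
                           window_s: float,
                           ttft_slo_s: float = 0.5) -> tuple[float, float]:
    """Raw throughput counts all tokens; goodput counts only tokens
    from requests whose TTFT met the (assumed) SLO."""
    total_tokens = sum(r.tokens for r in records)
    good_tokens = sum(r.tokens for r in records if r.ttft_s <= ttft_slo_s)
    return total_tokens / window_s, good_tokens / window_s


# Example: three requests over a 10-second window, one of which missed the SLO.
records = [
    RequestRecord(ttft_s=0.2, tokens=400),
    RequestRecord(ttft_s=0.3, tokens=300),
    RequestRecord(ttft_s=1.1, tokens=500),  # SLO miss
]
print(throughput_and_goodput(records, window_s=10.0))  # (120.0, 70.0) tokens/s
```

A server can post high raw throughput while most of it comes from requests users would consider failures, which is why goodput is the better signal for SLO-driven capacity planning.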

So we started pulling it all together.
