llm-d, Kubernetes native distributed inference
Published on: 2025-07-01 14:37:47
llm-d is a Kubernetes-native, high-performance distributed LLM inference framework: a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.
With llm-d, users can operationalize gen AI deployments with a modular, high-performance, end-to-end serving solution. It applies the latest distributed inference optimizations, such as KV-cache aware routing and disaggregated serving, and is co-designed and integrated with the Kubernetes operational tooling in Inference Gateway (IGW).
Kubernetes typically scales out application workloads with uniform replicas and round-robin load balancing.
This simple approach is very effective for most request patterns, which share the following characteristics:
Requests are short-lived and generally uniform in resource utilization
Requests have generally uniform latency service level objectives (SLOs)
Each replica can process each request equally well
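To make the uniform-replica, round-robin pattern above concrete, here is a minimal Python sketch. It is an illustration only, not Kubernetes or llm-d code: the replica names, request IDs, and the `round_robin` helper are hypothetical. It shows why the pattern works when every request costs roughly the same and any replica can serve it: a fixed rotation spreads load evenly without inspecting the requests at all.

```python
from collections import Counter
from itertools import cycle


def round_robin(replicas, requests):
    """Assign each request to the next replica in a fixed rotation."""
    rotation = cycle(replicas)
    return [(request, next(rotation)) for request in requests]


# Hypothetical replica names and request IDs, for illustration only.
replicas = ["replica-a", "replica-b", "replica-c"]
requests = [f"req-{i}" for i in range(9)]

assignments = round_robin(replicas, requests)

# Count how many requests landed on each replica.
load = Counter(replica for _, replica in assignments)
print(dict(load))  # {'replica-a': 3, 'replica-b': 3, 'replica-c': 3}
```

The balancer never looks at request size, latency target, or replica state. That is acceptable only while the three characteristics listed above hold.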
Sp
... Read full article.