
llm-d, Kubernetes-native distributed inference

Published on: 2025-07-01 14:37:47

llm-d is a Kubernetes-native, high-performance distributed LLM inference framework: a well-lit path for anyone to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

With llm-d, users can operationalize gen AI deployments with a modular, high-performance, end-to-end serving solution that leverages the latest distributed inference optimizations, such as KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in the Inference Gateway (IGW).

Kubernetes typically scales out application workloads with uniform replicas and round-robin load balancing (a minimal manifest for this pattern is sketched after the list below). This simple pattern is very effective for most request patterns, which have the following characteristics:

- Requests are short-lived and generally uniform in resource utilization.
- Requests have generally uniform latency service level objectives (SLOs).
- Each replica can process each request equally well.
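To make the conventional scale-out pattern concrete, here is a minimal sketch of what it looks like as Kubernetes manifests: N interchangeable replicas behind a Service that spreads traffic evenly. The names, image, and ports are hypothetical placeholders, not part of llm-d.

```yaml
# Uniform replicas + even load spreading: the default Kubernetes pattern.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3                 # identical, interchangeable replicas
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: server
        image: example.com/model-server:latest   # hypothetical image
        ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: model-server
spec:
  selector:
    app: model-server
  ports:
  - port: 80
    targetPort: 8000
  # kube-proxy spreads connections across ready endpoints,
  # approximating round-robin: no request and no replica is "special".
```

This works precisely because of the characteristics listed above; when requests differ wildly in cost and replicas differ in what they can serve cheaply (as in LLM inference), blind spreading leaves performance on the table.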
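The "KV-cache aware routing" mentioned above can be illustrated with a short sketch. This is not llm-d's actual scorer, just the core idea under simple assumptions: prefer the replica whose KV cache already holds the longest prefix of the incoming prompt (so prefill work can be reused), breaking ties toward the least-loaded replica. The `Replica` bookkeeping and token-tuple representation are hypothetical.

```python
"""Illustrative sketch of KV-cache aware routing (assumptions noted above)."""

from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    in_flight: int = 0
    # Hypothetical bookkeeping: prompt prefixes (token tuples) cached per replica.
    cached_prefixes: set = field(default_factory=set)


def longest_cached_prefix(replica: Replica, tokens: tuple) -> int:
    """Length of the longest prompt prefix already in this replica's KV cache."""
    for n in range(len(tokens), 0, -1):
        if tokens[:n] in replica.cached_prefixes:
            return n
    return 0


def route(replicas: list, tokens: tuple) -> Replica:
    """Score replicas by cache overlap first, then by (lower) current load."""
    return max(
        replicas,
        key=lambda r: (longest_cached_prefix(r, tokens), -r.in_flight),
    )


if __name__ == "__main__":
    a = Replica("pod-a", in_flight=2, cached_prefixes={(1, 2, 3)})
    b = Replica("pod-b", in_flight=0)
    prompt = (1, 2, 3, 4, 5)
    # pod-a wins despite higher load: it can reuse 3 already-prefilled tokens.
    print(route([a, b], prompt).name)
```

Contrast this with the round-robin Service above: the router now uses per-replica state (cache contents, load) to make a non-uniform choice, which is exactly the kind of scheduling signal IGW-style inference routing exposes to Kubernetes.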