Nvidia Dynamo: A Datacenter Scale Distributed Inference Serving Framework

Published on: 2025-06-11 00:44:14

NVIDIA Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Dynamo is inference-engine agnostic (it supports TRT-LLM, vLLM, SGLang, and others) and captures LLM-specific capabilities such as:

- Disaggregated prefill & decode inference – maximizes GPU throughput and facilitates trading off throughput against latency.
- Dynamic GPU scheduling – optimizes performance based on fluctuating demand.
- LLM-aware request routing – eliminates unnecessary KV cache re-computation.
- Accelerated data transfer – reduces inference response time using NIXL.
- KV cache offloading – leverages multiple m…
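To make the LLM-aware routing idea concrete: a router that knows which token prefixes each worker already holds in its KV cache can send a request to the worker with the longest cached prefix, so prefill skips recomputing those entries. The sketch below is a hypothetical illustration of that principle, not the actual Dynamo API; the function names, the flat list-of-sequences cache index, and the exact-prefix matching are all simplifying assumptions (a real router would use something like a radix-tree index and also weigh worker load).

```python
# Hypothetical sketch of KV-cache-aware routing (not the actual Dynamo API):
# route each request to the worker whose cached prefix overlaps most with the
# incoming prompt, so the prefill phase can reuse those KV entries.

def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt_tokens: list[int], worker_caches: dict[str, list[list[int]]]) -> str:
    """Pick the worker with the longest cached prefix match for this prompt.

    worker_caches maps a worker id to the token sequences whose KV entries
    that worker currently holds (a simplification of a real cache index).
    """
    best_worker, best_overlap = None, -1
    for worker, cached_seqs in worker_caches.items():
        overlap = max(
            (shared_prefix_len(prompt_tokens, seq) for seq in cached_seqs),
            default=0,
        )
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker

# Example: worker "b" already holds KV entries for the prompt's 3-token prefix,
# so the router sends the request there instead of recomputing from scratch.
caches = {
    "a": [[9, 9, 9]],
    "b": [[1, 2, 3, 4], [5, 6]],
}
print(route([1, 2, 3, 7, 8], caches))  # -> b
```

The same overlap signal is what lets the scheduler trade routing for recomputation: when no worker has a useful prefix cached, falling back to the least-loaded worker is the sensible default.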