Deploy large AI language models across a multi-GPU cluster and serve them in production at datacenter scale
Scale the input-processing and output-generation phases independently on separate machine pools
Route new requests to machines that already hold relevant cached data to cut redundant GPU computation
Automatically adjust running model replica counts to meet response-time targets at the lowest hardware cost
Requires a multi-GPU Kubernetes cluster and an existing inference engine such as vLLM or SGLang already deployed.
Dynamo is an open-source framework from NVIDIA for running AI language models at datacenter scale. Its job is not to run a single model on a single computer, but to coordinate many machines and GPUs so they work together as one serving system. If you are already using an inference engine like vLLM, SGLang, or TensorRT-LLM, Dynamo sits above it as the layer that handles routing, scaling, and coordination across the cluster.

The main problem Dynamo addresses is that serving a large AI model in production involves two distinct phases: a prefill phase, where the system processes the user's input and builds a cache of intermediate values called the KV cache, and a decode phase, where it generates the output one token at a time. The two phases have different hardware requirements. Dynamo lets you run them on separate pools of machines and scale each pool independently, which the README describes as disaggregated serving.

On top of that, Dynamo includes a router that tracks which machines already hold relevant cached data from previous requests and sends new requests there when possible, avoiding redundant computation. It also manages where that cache is stored: in GPU memory first, then CPU memory, then disk, then remote storage. This tiering extends how much context the system can hold without buying more GPUs. A planner component watches traffic patterns and adjusts the number of running replicas to meet response-time targets at the lowest cost. A separate tool called AIConfigurator can simulate thousands of possible deployment configurations in seconds to find a good setup before any GPUs are committed.

Dynamo is built in Rust for the performance-critical parts and Python for the parts developers interact with. It works with Kubernetes and supports deployment recipes for specific models. The project is released under the Apache 2.0 license.
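To make the cache-aware routing idea described above more concrete, here is a minimal Python sketch. It is not Dynamo's API or its actual algorithm. The names `Worker`, `block_hashes`, `pick_worker`, the block size, and the cost formula are all hypothetical, chosen only to illustrate the general technique: score each machine by how much of the request's prefix it would have to recompute, plus a penalty for how busy it is.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # hypothetical: number of tokens per cached KV block


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block of tokens, chaining in the previous hash so
    that a block's hash identifies the entire prefix up to that block."""
    hashes = []
    prev = 0
    full_len = len(tokens) - len(tokens) % block_size  # drop the partial tail
    for start in range(0, full_len, block_size):
        prev = hash((prev, tuple(tokens[start:start + block_size])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:
    """Hypothetical view of one serving machine as seen by the router."""
    name: str
    active_requests: int = 0
    cached_blocks: set = field(default_factory=set)


def cached_prefix_blocks(worker, hashes):
    """Count how many leading blocks of the request the worker already holds."""
    count = 0
    for h in hashes:
        if h not in worker.cached_blocks:
            break
        count += 1
    return count


def pick_worker(workers, tokens, load_weight=1.0):
    """Pick the worker with the lowest estimated cost.

    Cost (an assumption of this sketch) = blocks that must be recomputed
    + load_weight * requests already running on that worker.
    """
    hashes = block_hashes(tokens)

    def cost(worker):
        missing = len(hashes) - cached_prefix_blocks(worker, hashes)
        return missing + load_weight * worker.active_requests

    return min(workers, key=cost)


# Usage: worker "a" has already served the first 8 tokens of this prompt.
a, b = Worker("a"), Worker("b")
a.cached_blocks.update(block_hashes(list(range(1, 9))))
request = list(range(1, 13))
print(pick_worker([a, b], request).name)  # prints "a"
```

The load term matters: if worker `a` were busy enough, the sketch would send the request to the idle worker `b` even though `b` has to recompute the whole prefix. A real router has to make the same trade-off between cache reuse and load balance, though how Dynamo weighs the two should be checked against the repo.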
Verify against the repo before relying on details.