ai-dynamo/dynamo

★ 6,788 | Rust | Audience · ops devops | Complexity · 5/5 | License | Setup · hard

Mindmap

mindmap
  root((dynamo))
    What it does
      Cluster coordination
      Disaggregated serving
      KV cache tiering
    Tech stack
      Rust core
      Python API
      Kubernetes
    Key features
      Smart router
      Auto planner
      AIConfigurator
    Use cases
      Production AI serving
      Cost optimization
      Multi-GPU scaling

Things people build with this

USE CASE 1

Deploy large AI language models across a multi-GPU cluster and serve them in production at datacenter scale

USE CASE 2

Scale the input-processing and output-generation phases independently on separate machine pools

USE CASE 3

Route new requests to machines that already hold relevant cached data to cut redundant GPU computation

USE CASE 4

Automatically adjust running model replica counts to meet response-time targets at the lowest hardware cost
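The replica-count idea in the last use case can be illustrated with a small sizing calculation. This is a minimal sketch, not Dynamo's planner: the function name `replicas_needed`, its parameters, and the traffic numbers are hypothetical, and it assumes you have already measured how many requests per second one replica can handle while still meeting the response-time target.

```python
import math


def replicas_needed(request_rate: float,
                    per_replica_capacity: float,
                    headroom: float = 0.2,
                    min_replicas: int = 1) -> int:
    """Smallest replica count that covers the load plus a safety margin.

    per_replica_capacity is ASSUMED to be the measured requests/second one
    replica sustains while still meeting the latency target.
    """
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    target_rate = request_rate * (1.0 + headroom)
    return max(min_replicas, math.ceil(target_rate / per_replica_capacity))


# Hypothetical numbers: prefill and decode pools are sized independently
# because each phase has its own per-replica capacity.
prefill_replicas = replicas_needed(request_rate=120.0, per_replica_capacity=40.0)
decode_replicas = replicas_needed(request_rate=120.0, per_replica_capacity=25.0)
print(prefill_replicas, decode_replicas)
```

Sizing the two pools separately is the point of the example: the same incoming traffic can call for a different number of prefill and decode replicas.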

Tech stack

Rust, Python, Kubernetes, vLLM, SGLang, TensorRT-LLM

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires a multi-GPU Kubernetes cluster and an existing inference engine such as vLLM or SGLang already deployed.

Use freely for any purpose, including commercial products, as long as you include the Apache 2.0 copyright and license text.

In plain English

Dynamo is an open-source framework from NVIDIA for running AI language models at datacenter scale. Its job is not to run a single model on a single computer, but to coordinate many machines and GPUs so they work together as one serving system. If you already use an inference engine such as vLLM, SGLang, or TensorRT-LLM, Dynamo sits above it as the layer that handles routing, scaling, and coordination across the cluster.

The main problem Dynamo addresses is that serving a large AI model in production involves two distinct phases: a prefill phase, where the system processes the user's input and builds a cache of intermediate values called the KV cache, and a decode phase, where it generates the output one token at a time. The two phases have different hardware requirements. Dynamo lets you run them on separate pools of machines and scale each pool independently, which the README describes as disaggregated serving.

On top of that, Dynamo includes a router that tracks which machines already hold relevant cached data from previous requests and sends new requests there when possible, avoiding redundant computation. It also manages where that cache is stored: in GPU memory first, then CPU memory, then disk, then remote storage. This extends how much context the system can hold without buying more GPUs.

A planner component watches traffic patterns and adjusts the number of running replicas to meet response-time targets at the lowest cost. A separate tool called AIConfigurator can simulate thousands of possible deployment configurations in seconds to find a good setup before any GPUs are committed.

Dynamo is built in Rust for the performance-critical parts and Python for the parts developers interact with. It works with Kubernetes and supports deployment recipes for specific models. The project is released under the Apache 2.0 license.
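To make the cache-aware routing idea concrete, here is a minimal Python sketch of one way such a router could score machines. It is an illustration under stated assumptions, not Dynamo's router or its API: `Worker`, `block_keys`, `pick_worker`, the block size, and the scoring formula are all hypothetical. The sketch assumes the cache is tracked in fixed-size blocks of tokens and that a block is reusable only if every block before it in the prompt is also cached.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cache block; illustrative value only


def block_keys(tokens, block_size=BLOCK_SIZE):
    """One key per full block; each key covers the whole prefix up to that block."""
    return [tuple(tokens[:end])
            for end in range(block_size, len(tokens) + 1, block_size)]


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_blocks: set = field(default_factory=set)


def cached_prefix_blocks(worker, keys):
    """Count leading blocks of the request that the worker already holds."""
    count = 0
    for key in keys:
        if key not in worker.cached_blocks:
            break
        count += 1
    return count


def pick_worker(workers, tokens, load_weight=1.0):
    """Prefer cache reuse, penalize busy workers (hypothetical scoring rule)."""
    keys = block_keys(tokens)

    def score(worker):
        return cached_prefix_blocks(worker, keys) - load_weight * worker.active_requests

    best = max(workers, key=score)
    best.cached_blocks.update(keys)  # the chosen worker will now hold these blocks
    best.active_requests += 1
    return best


workers = [Worker("a"), Worker("b")]
system_prompt = list(range(8))  # stand-in token ids shared by every request

first = pick_worker(workers, system_prompt + [100, 101, 102, 103])
second = pick_worker(workers, system_prompt + [200, 201, 202, 203])

workers[0].active_requests = 10  # simulate worker "a" becoming overloaded
third = pick_worker(workers, system_prompt + [200, 201, 202, 203])
print(first.name, second.name, third.name)
```

The second request lands on the same worker as the first because the two share a cached system prompt. The third goes elsewhere once that worker is overloaded, which shows why a cache-aware router has to weigh reuse against load rather than follow the cache alone.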

Copy-paste prompts

Prompt 1

I have a cluster of 8 GPU machines and want to use Dynamo to serve a 70B language model. Show me a Kubernetes deployment config with disaggregated prefill and decode routing.

Prompt 2

Set up Dynamo with vLLM as the inference backend on a 4-node GPU cluster and configure the KV cache to spill from GPU memory to CPU memory to NVMe disk.

Prompt 3

Use Dynamo's AIConfigurator to simulate deployment options for a 13B model on 4 A100 GPUs with a p95 latency target of 500ms and show the recommended config.

Prompt 4

Configure Dynamo's planner to autoscale decode workers based on queue depth so p95 response time stays under 2 seconds during traffic spikes.

Prompt 5

Show me how to integrate Dynamo's smart router with an existing SGLang serving setup so KV-cache-aware routing is enabled.

Verify against the repo before relying on details.