Commit ff14cf9

docs: dynamo serve guide (ai-dynamo#270)
Co-authored-by: Dmitry Tokarev <[email protected]>
1 parent 77b32fb commit ff14cf9

docs/guides/dynamo_serve.md (1 file changed, +254 lines)

<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Using `dynamo serve` to deploy inference graphs locally

This guide explains how to create, configure, and deploy inference graphs for large language models using the `dynamo serve` command.

## Table of Contents

- [What are inference graphs?](#what-are-inference-graphs)
- [Creating an inference graph](#creating-an-inference-graph)
- [Deploying the inference graph](#deploying-the-inference-graph)
- [Guided Example](#guided-example)

## What are inference graphs?

Inference graphs are compositions of service components that work together to handle LLM inference. A typical graph might include:

- Frontend: OpenAI-compatible HTTP server that handles incoming requests
- Processor: Processes requests before passing them to workers
- Router: Routes requests to the appropriate workers based on the specified strategy
- Workers: Handle the actual LLM inference (prefill and decode phases)

## Creating an inference graph

Once you've written your various Dynamo services (docs on how to write these can be found [here](../../deploy/dynamo/sdk/docs/sdk/README.md)), you can create an inference graph by composing these services together using the following two mechanisms:

### 1. Dependencies with `depends()`

```python
from components.worker import VllmWorker

# depends() is provided by the Dynamo SDK (see the SDK docs linked above)
class Processor:
    worker = depends(VllmWorker)

    # Now you can call worker methods directly
    async def process(self, request):
        result = await self.worker.generate(request)
```

Benefits of `depends()`:

- Automatically ensures dependent services are deployed
- Creates type-safe client connections between services
- Allows calling dependent service methods directly

### 2. Dynamic composition with `.link()`

```python
# From examples/llm/graphs/agg.py
from components.frontend import Frontend
from components.processor import Processor
from components.worker import VllmWorker

Frontend.link(Processor).link(VllmWorker)
```

This creates a graph where:

- Frontend depends on Processor
- Processor depends on VllmWorker

The `.link()` method is useful for:

- Dynamically building graphs at runtime
- Selectively activating specific dependencies
- Creating different graph configurations from the same components (see the sketch below)

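For example, the same components can be rearranged into a different topology. The snippet below is a hypothetical sketch, not one of the shipped example graphs: the `PrefillWorker` import path is assumed, and the graphs under `examples/llm/graphs/` may differ.

```python
# Hypothetical disaggregated variant built from the same components.
# PrefillWorker is assumed to live under components/, as referenced by
# components/worker.py later in this guide.
from components.frontend import Frontend
from components.processor import Processor
from components.prefill_worker import PrefillWorker
from components.worker import VllmWorker

# Activate the prefill worker dependency in addition to the aggregated path
Frontend.link(Processor).link(VllmWorker).link(PrefillWorker)
```

Because `VllmWorker` already declares `prefill_worker = depends(PrefillWorker)` (see the component definitions later in this guide), linking `PrefillWorker` into the graph simply activates that existing dependency.
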
## Deploying the inference graph

Once you've defined your inference graph and its configuration, you can deploy it locally using the `dynamo serve` command, pointing it at the first service in your graph (the general form is `dynamo serve <graph module>:<entry service> -f <config file>`). We recommend running with the `--dry-run` flag first so you can see exactly what arguments will be passed into your final graph, and then serving it for real.

Let's walk through an example.

## Guided Example

The files referenced here can be found [here](../../examples/llm/components/). You will need at least 1 GPU to run this example, and the commands should be run from the `examples/llm` directory.

### 1. Define your components

In this example we'll be deploying an aggregated serving graph. Our components include:

1. Frontend - OpenAI-compatible HTTP server that handles incoming requests
2. Processor - Runs processing steps and routes the request to a worker
3. VllmWorker - Handles the prefill and decode phases of the request

```python
# components/frontend.py
class Frontend:
    worker = depends(VllmWorker)
    worker_routerless = depends(VllmWorkerRouterLess)
    processor = depends(Processor)

    ...
```

```python
# components/processor.py
class Processor(ProcessMixIn):
    worker = depends(VllmWorker)
    router = depends(Router)

    ...
```

```python
# components/worker.py
class VllmWorker:
    prefill_worker = depends(PrefillWorker)

    ...
```

Note that our prebuilt components declare the maximal set of dependencies needed to run the component. This allows you to plug different components into the same graph to create different architectures. When you write your own components, you can be as flexible as you'd like, as sketched below.

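For instance, a custom component only needs to declare the dependencies it actually uses. The sketch below is hypothetical (the `MyMiddleware` name and its method are made up for illustration) and reuses only the `depends()` pattern shown earlier:

```python
# Hypothetical minimal component: it declares a single dependency rather than
# the maximal set used by the prebuilt components above.
from components.worker import VllmWorker

class MyMiddleware:
    worker = depends(VllmWorker)

    async def generate(self, request):
        # Forward the request to the worker unchanged
        return await self.worker.generate(request)
```
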
### 2. Define your graph

```python
# graphs/agg.py
from components.frontend import Frontend
from components.processor import Processor
from components.worker import VllmWorker

Frontend.link(Processor).link(VllmWorker)
```

### 3. Define your configuration

We've provided a set of basic configurations for this example [here](../../examples/llm/configs/agg.yaml). All of these can be changed, and they can also be overridden by passing CLI flags to `dynamo serve`. A sketch of the file's structure is shown below.

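As a rough guide, the structure of such a config file mirrors the service configuration printed by `--dry-run` in the next step: one top-level key per service, with that service's options nested underneath. The YAML below is a sketch inferred from that dry-run output, not a verbatim copy of `configs/agg.yaml`; check the file itself for the authoritative values.

```yaml
# Sketch of an aggregated-serving config, inferred from the dry-run output
# shown below; consult examples/llm/configs/agg.yaml for the real file.
Frontend:
  model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  endpoint: dynamo.Processor.chat/completions
  port: 8000

Processor:
  model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  block-size: 64
  max-model-len: 16384
  router: round-robin

VllmWorker:
  model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  enforce-eager: true
  block-size: 64
  max-model-len: 16384
  max-num-batched-tokens: 16384
  enable-prefix-caching: true
  router: random
  tensor-parallel-size: 1
  ServiceArgs:
    workers: 1
```
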
### 4. Serve your graph

As a prerequisite, ensure you have NATS and etcd running by starting the Docker Compose setup in the deploy directory. You can find it [here](../../deploy/docker-compose.yml).

```bash
docker compose up -d
```

Note that we point `dynamo serve` at the first node in our graph. In this case, it's the `Frontend` service.

```bash
# check out the configuration that will be used when we serve
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --dry-run
```

This will print out something like:

```bash
Service Configuration:
{
  "Frontend": {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "endpoint": "dynamo.Processor.chat/completions",
    "port": 8000
  },
  "Processor": {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "block-size": 64,
    "max-model-len": 16384,
    "router": "round-robin"
  },
  "VllmWorker": {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "enforce-eager": true,
    "block-size": 64,
    "max-model-len": 16384,
    "max-num-batched-tokens": 16384,
    "enable-prefix-caching": true,
    "router": "random",
    "tensor-parallel-size": 1,
    "ServiceArgs": {
      "workers": 1
    }
  }
}

Environment Variable that would be set:
DYNAMO_SERVICE_CONFIG={"Frontend": {"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "endpoint": "dynamo.Processor.chat/completions", "port": 8000}, "Processor": {"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "block-size": 64, "max-model-len": 16384, "router": "round-robin"}, "VllmWorker": {"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "enforce-eager": true, "block-size": 64, "max-model-len": 16384, "max-num-batched-tokens": 16384, "enable-prefix-caching": true, "router": "random", "tensor-parallel-size": 1, "ServiceArgs": {"workers": 1}}}
```

You can override any of these configuration options by passing CLI flags to `dynamo serve`. For example, to change the routing strategy, you can run:

```bash
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --Processor.router=random --dry-run
```

This will print out something like:

```bash
#...
"Processor": {
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "block-size": 64,
  "max-model-len": 16384,
  "router": "random"
},
#...
```

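You can also pass several overrides in one invocation. The command below is a hypothetical combination that follows the same `--<Service>.<option>=<value>` pattern, using options that appear in the dry-run output above:

```bash
# Hypothetical combined override: change the Processor's routing strategy and
# shrink the VllmWorker's context window in a single dry run
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml \
  --Processor.router=random \
  --VllmWorker.max-model-len=8192 \
  --dry-run
```
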
Once you're ready, simply remove the `--dry-run` flag to serve your graph!

```bash
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
```

Once everything is running, you can test your graph by making a request to the frontend from a different window.

```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "messages": [
    {
      "role": "user",
      "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
    }
  ],
  "stream": false,
  "max_tokens": 30
}'
```

## Close your deployment

If you have any lingering processes after pressing `ctrl-c`, you can kill them by running:

```bash
function kill_tree() {
    local parent=$1
    local children=$(ps -o pid= --ppid $parent)
    for child in $children; do
        kill_tree $child
    done
    echo "Killing process $parent"
    kill -9 $parent
}

kill_tree $(pgrep circusd)
```
