1p1d
====

One Prefiller, One Decoder (1p1d) Example
------------------------------------------

This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node with a 1 prefiller + 1 decoder setup. This configuration separates the compute-intensive prefill operations from the decode operations, allowing for better resource utilization and performance optimization.

Architecture Overview
~~~~~~~~~~~~~~~~~~~~~

The 1p1d setup consists of three main components:

1. **Prefiller Server** - Handles the prefill phase of inference (initial prompt processing)
2. **Decoder Server** - Handles the decode phase of inference (token generation)
3. **Proxy Server** - Coordinates requests between the prefiller and decoder

.. code-block::

              ┌─────────────┐
              │   Client    │
              └─────┬───────┘
                    │
            ┌───────▼───────┐
            │ Proxy Server  │
            │   Port 9000   │
            └───┬───────┬───┘
                │       │
       ┌────────▼──┐  ┌─▼────────┐
       │ Prefiller │  │ Decoder  │
       │ Port 8100 │  │Port 8200 │
       │   GPU 0   │  │  GPU 1   │
       └───────────┘  └──────────┘
             │              ▲
             │              │
             └──────────────┘
              NIXL Transfer

Prerequisites
~~~~~~~~~~~~~

- **LMCache**: Install with ``pip install lmcache``
- **NIXL**: Install from the `NIXL GitHub repository <https://github.com/ai-dynamo/nixl>`_
- **Hardware**: At least 2 GPUs
- **Model Access**: Valid Hugging Face token (HF_TOKEN) for Llama 3.1 8B Instruct

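Before launching anything, it can help to confirm these prerequisites from the shell you will run the example in. A minimal sanity check (the exact commands are only a suggestion):

.. code-block:: bash

   # Are at least two GPUs visible?
   nvidia-smi --list-gpus

   # Is LMCache importable from this Python environment?
   python -c "import lmcache" && echo "LMCache OK"

   # Is the Hugging Face token exported? (fails loudly if not)
   echo "${HF_TOKEN:?HF_TOKEN is not set}"
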
Quick Start
48+
~~~~~~~~~~~
49+
50+
1. **Set your Hugging Face token**:
51+
52+
.. code-block:: bash
53+
54+
export HF_TOKEN=hf_your_token_here
55+
56+
2. **Navigate to the example directory**:
57+
58+
.. code-block:: bash
59+
60+
cd examples/disagg_prefill/1p1d
61+
62+
3. **Run the example**:
63+
64+
.. code-block:: bash
65+
66+
bash disagg_example_nixl.sh
67+
68+
The script will automatically:
69+
70+
- Launch a prefiller instance on port 8100 (GPU 0)
71+
- Launch a decoder instance on port 8200 (GPU 1)
72+
- Launch a proxy server on port 9000
73+
- Wait for all servers to be ready
74+
75+
Press ``Ctrl+C`` to stop all servers.
76+
77+
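If you want to check readiness yourself rather than waiting for the script's message, each vLLM server exposes a health endpoint you can poll. A small sketch, using the ports configured in this example:

.. code-block:: bash

   # Wait until both vLLM servers answer their health checks
   for port in 8100 8200; do
       until curl -sf -o /dev/null "http://localhost:${port}/health"; do
           sleep 1
       done
       echo "Server on port ${port} is ready"
   done
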
Configuration
~~~~~~~~~~~~~

Prefiller Configuration
^^^^^^^^^^^^^^^^^^^^^^^

The prefiller is configured via ``configs/lmcache-prefiller-config.yaml``:

.. code-block:: yaml

   local_cpu: False
   max_local_cpu_size: 0
   max_local_disk_size: 0
   remote_serde: NULL

   enable_nixl: True
   nixl_role: "sender"
   nixl_peer_host: "localhost"
   nixl_peer_port: 55555
   nixl_buffer_size: 1073741824 # 1GB
   nixl_buffer_device: "cuda"
   nixl_enable_gc: True

Key settings:

- ``nixl_role: "sender"`` - Configures this instance to send KV cache data
- ``nixl_buffer_size: 1073741824`` - 1 GB buffer size for NIXL transfers
- ``nixl_buffer_device: "cuda"`` - Uses GPU memory for buffering

Decoder Configuration
^^^^^^^^^^^^^^^^^^^^^

The decoder is configured via ``configs/lmcache-decoder-config.yaml``:

.. code-block:: yaml

   local_cpu: False
   max_local_cpu_size: 0
   max_local_disk_size: 0
   remote_serde: NULL

   enable_nixl: True
   nixl_role: "receiver"
   nixl_peer_host: "localhost"
   nixl_peer_port: 55555
   nixl_buffer_size: 1073741824 # 1GB
   nixl_buffer_device: "cuda"
   nixl_enable_gc: True

Key settings:

- ``nixl_role: "receiver"`` - Configures this instance to receive KV cache data
- Same buffer configuration as the prefiller for compatibility

Components Deep Dive
~~~~~~~~~~~~~~~~~~~~

Proxy Server (disagg_proxy_server.py)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The proxy server coordinates the disaggregated prefill workflow:

1. **Request Handling**: Receives client requests on port 9000
2. **Prefill Coordination**: Sends requests to the prefiller with ``max_tokens=1``
3. **Response Streaming**: Streams the full response from the decoder
4. **Performance Monitoring**: Tracks Time-To-First-Token (TTFT) statistics

Supported endpoints:

- ``/v1/completions``
- ``/v1/chat/completions``

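Conceptually, the proxy performs the equivalent of the following two requests for each client call. This is only an illustrative sketch of the flow; the actual ``disagg_proxy_server.py`` issues these requests programmatically and streams the decoder's response back to the client:

.. code-block:: bash

   # 1) Prefill-only request to the prefiller: max_tokens=1 forces the prompt
   #    to be processed, and the resulting KV cache is handed to the decoder
   #    over NIXL.
   curl -s http://localhost:8100/v1/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
          "prompt": "The future of AI is",
          "max_tokens": 1}'

   # 2) Full request to the decoder, which reuses the transferred KV cache
   #    and generates the output tokens.
   curl -s http://localhost:8200/v1/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
          "prompt": "The future of AI is",
          "max_tokens": 50}'
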
vLLM Server Launcher (disagg_vllm_launcher.sh)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This script launches individual vLLM servers with appropriate configurations:

**Prefiller Launch Command**:

.. code-block:: bash

   UCX_TLS=cuda_ipc,cuda_copy,tcp \
   LMCACHE_CONFIG_FILE=configs/lmcache-prefiller-config.yaml \
   VLLM_ENABLE_V1_MULTIPROCESSING=1 \
   VLLM_WORKER_MULTIPROC_METHOD=spawn \
   CUDA_VISIBLE_DEVICES=0 \
   vllm serve meta-llama/Llama-3.1-8B-Instruct \
       --port 8100 \
       --disable-log-requests \
       --enforce-eager \
       --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer",...}'

**Decoder Launch Command**:

.. code-block:: bash

   UCX_TLS=cuda_ipc,cuda_copy,tcp \
   LMCACHE_CONFIG_FILE=configs/lmcache-decoder-config.yaml \
   VLLM_ENABLE_V1_MULTIPROCESSING=1 \
   VLLM_WORKER_MULTIPROC_METHOD=spawn \
   CUDA_VISIBLE_DEVICES=1 \
   vllm serve meta-llama/Llama-3.1-8B-Instruct \
       --port 8200 \
       --disable-log-requests \
       --enforce-eager \
       --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer",...}'

Testing and Benchmarking
~~~~~~~~~~~~~~~~~~~~~~~~

Basic Test
^^^^^^^^^^

Once all servers are running, you can test with a simple curl command:

.. code-block:: bash

   curl -X POST http://localhost:9000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "The future of AI is",
       "max_tokens": 50,
       "temperature": 0.7
     }'

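The chat endpoint is served the same way. An equivalent request against ``/v1/chat/completions`` (the prompt content here is just an example):

.. code-block:: bash

   curl -X POST http://localhost:9000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "What is disaggregated prefill?"}],
       "max_tokens": 50,
       "temperature": 0.7
     }'
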
Performance Benchmarking
^^^^^^^^^^^^^^^^^^^^^^^^

For comprehensive performance testing, use vLLM's benchmark tool (``benchmark_serving.py`` from the ``benchmarks/`` directory of the vLLM repository):

.. code-block:: bash

   python benchmark_serving.py --port 9000 --seed $(date +%s) \
       --model meta-llama/Llama-3.1-8B-Instruct \
       --dataset-name random --random-input-len 7500 --random-output-len 200 \
       --num-prompts 30 --burstiness 100 --request-rate 1 --ignore-eos

This benchmark:

- Sends requests to port 9000 (proxy server)
- Uses random prompts with 7500 input tokens
- Generates 200 output tokens per request
- Tests with 30 total prompts at 1 request/second

Log Files and Monitoring
~~~~~~~~~~~~~~~~~~~~~~~~

The example generates three log files for monitoring:

- ``prefiller.log`` - Prefiller server logs and errors
- ``decoder.log`` - Decoder server logs and errors
- ``proxy.log`` - Proxy server logs and TTFT statistics

The proxy server automatically calculates and displays TTFT statistics every 5 seconds:

.. code-block::

   ===============================
   Num requests: 10
   Prefill node TTFT stats:
    - Average (ms): 45.2
    - Median (ms): 43.1
    - 99th Percentile (ms): 52.8
   ===============================

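To watch these statistics and any errors while the servers are running, you can simply follow the log files; for example:

.. code-block:: bash

   # Follow the proxy log for the periodic TTFT summaries
   tail -f proxy.log

   # Or watch all three logs at once
   tail -f prefiller.log decoder.log proxy.log
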
Troubleshooting
~~~~~~~~~~~~~~~

Common Issues
^^^^^^^^^^^^^

1. **GPU Memory**: Ensure each GPU has sufficient memory for the model
2. **NIXL Installation**: Verify NIXL is properly installed and accessible
3. **Port Conflicts**: Check that ports 8100, 8200, and 9000 are available
4. **HF Token**: Ensure your Hugging Face token has access to Llama models

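For port conflicts in particular, a quick pre-launch check (assuming ``lsof`` is available; port 55555 is the NIXL peer port from the configs above):

.. code-block:: bash

   for port in 8100 8200 9000 55555; do
       lsof -i :"${port}" && echo "Port ${port} is in use" || echo "Port ${port} is free"
   done
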
Error Recovery
^^^^^^^^^^^^^^

If any server fails to start:

1. Check the corresponding log file for error details
2. Verify GPU availability with ``nvidia-smi``
3. Ensure all dependencies are installed
4. Stop everything with ``Ctrl+C`` and re-run the script