
Commit b506fce

[Examples][P/D] Examples for Xp1d using LMCache (LMCache#759)

* [Add] new examples for xp1d
* [fix] pre-commit issues
* remove the logs
* remove old script
* [Add] new readme file

Signed-off-by: ApostaC <[email protected]>
1 parent be959bf commit b506fce

13 files changed (+746, -27 lines)
Lines changed: 55 additions & 0 deletions
## Example of Disaggregated Prefill in vLLM v1

This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node.

### Prerequisites

- Install [LMCache](https://github.com/LMCache/LMCache). You can simply run `pip install lmcache`.
- Install [NIXL](https://github.com/ai-dynamo/nixl).
- At least 2 GPUs
- Valid Hugging Face token (HF_TOKEN) for Llama 3.1 8B Instruct.

### Usage

Run

```bash
bash disagg_example_nixl.sh
```

to start disaggregated prefill and benchmark the performance.

The script will:

1. Launch 1 decoder instance listening on port 8200
2. Launch 1 prefill instance listening on port 8100
3. Launch a proxy server listening on port 9000

Press `Ctrl+C` to stop the servers.
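
Once the servers are up, you can send a quick test request through the proxy before running a full benchmark. A minimal sketch, assuming the proxy forwards vLLM's OpenAI-compatible `/v1/completions` route on port 9000:

```bash
# Assumption: the proxy on port 9000 forwards OpenAI-compatible
# completion requests to the prefiller/decoder pair.
curl -s http://localhost:9000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Explain disaggregated prefill in one sentence.",
    "max_tokens": 50
  }'
```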

#### Example benchmark command

If you have vLLM [benchmark_serving.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py), you can run the following command to benchmark the serving performance of the disaggregated prefill setup:

```bash
python benchmark_serving.py --port 9000 --seed $(date +%s) \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random --random-input-len 7500 --random-output-len 200 \
    --num-prompts 30 --burstiness 100 --request-rate 1 --ignore-eos
```
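
Note the shape of this workload: if I read the flags right, `--request-rate 1` with `--burstiness 100` produces roughly one request per second with near-uniform spacing (higher burstiness values smooth out arrivals), and the long 7500-token random prompts with short 200-token outputs make the run prefill-heavy, which is exactly the regime disaggregated prefill targets.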

### Components

#### Server Scripts
- `disagg_vllm_launcher.sh` - Launches individual vLLM servers for prefill/decode, and also launches the proxy server.
- `disagg_proxy_server.py` - FastAPI proxy server that coordinates between the prefiller and the decoder
- `disagg_example_nixl.sh` - Main script to run the example

#### Configuration
- `configs/lmcache-prefiller-config.yaml` - Configuration for the prefiller server
- `configs/lmcache-decoder-config.yaml` - Configuration for the decoder server

#### Log Files
The main script generates several log files:
- `prefiller.log` - Logs from the prefill server
- `decoder.log` - Logs from the decode server
- `proxy.log` - Logs from the proxy server
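
Putting the pieces together, the example directory looks roughly like this (layout inferred from the component list above):

```plaintext
1p1d/
├── disagg_example_nixl.sh        # main entry point
├── disagg_vllm_launcher.sh       # launches the vLLM servers and the proxy
├── disagg_proxy_server.py        # FastAPI proxy
└── configs/
    ├── lmcache-prefiller-config.yaml
    └── lmcache-decoder-config.yaml
```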

examples/disagg_prefill/disagg_example_nixl.sh renamed to examples/disagg_prefill/1p1d/disagg_example_nixl.sh

Lines changed: 34 additions & 3 deletions
````diff
@@ -48,9 +48,31 @@ ensure_python_library_installed() {
 
 cleanup() {
     echo "Stopping everything…"
-    trap - INT TERM        # prevent re-entrancy
-    kill -- -$$            # negative PID == “this whole process-group”
-    wait                   # reap children so we don't leave zombies
+    trap - INT TERM USR1   # prevent re-entrancy
+
+    # Kill all tracked PIDs
+    for pid in "${PIDS[@]}"; do
+        if kill -0 "$pid" 2>/dev/null; then
+            echo "Killing process $pid"
+            kill "$pid" 2>/dev/null
+        fi
+    done
+
+    # Wait a moment for graceful shutdown
+    sleep 2
+
+    # Force kill any remaining processes
+    for pid in "${PIDS[@]}"; do
+        if kill -0 "$pid" 2>/dev/null; then
+            echo "Force killing process $pid"
+            kill -9 "$pid" 2>/dev/null
+        fi
+    done
+
+    # Kill the entire process group as backup
+    kill -- -$$ 2>/dev/null
+
+    echo "All processes stopped."
     exit 0
 }
 
@@ -118,8 +140,17 @@ main() {
     wait_for_server 8200
     wait_for_server 9000
 
+    echo "================================================"
     echo "All servers are up. You can send request now..."
+    echo "Press Ctrl-C to terminate all instances."
+
+    # Keep the script running until interrupted
+    echo "Script is running. Waiting for termination signal..."
+    echo "================================================"
 
+    while true; do
+        sleep 1
+    done
 }
 
 main
````
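
For context, the `PIDS` array that `cleanup()` iterates over is assumed to be populated by the launcher as each background server starts. A minimal sketch of that pattern (the launch commands are placeholders, not the script's real invocations):

```bash
#!/bin/bash
PIDS=()   # collects the PID of every background server we start

# Placeholder launch commands -- the real script starts the vLLM
# prefiller, decoder, and proxy here.
long_running_server_a &
PIDS+=($!)   # $! expands to the PID of the last background job

long_running_server_b &
PIDS+=($!)

trap cleanup INT TERM USR1   # route signals to the cleanup function above
```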

examples/disagg_prefill/README.md

Lines changed: 89 additions & 24 deletions
````diff
@@ -1,36 +1,101 @@
-## Example of Disaggregated Prefill in vLLM v1
+# Disaggregated Prefill Examples for LMCache with vLLM v1
 
-This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node.
+This directory contains examples demonstrating how to run LMCache with disaggregated prefill using NIXL. Disaggregated prefill allows you to separate the prefill (prompt processing) and decode (token generation) phases of LLM inference across different GPU instances, enabling better resource utilization and scalability.
 
-### Prerequisites
+## Overview
 
-- Install [LMCache](https://github.com/LMCache/LMCache). You can simply run `pip install lmcache`.
-- Install [NIXL](https://github.com/ai-dynamo/nixl).
-- At least 2 GPUs
-- Valid Hugging Face token (HF_TOKEN) for Llama 3.1 8B Instruct.
+Disaggregated prefill architecture separates the compute-intensive prefill phase from the memory-intensive decode phase:
 
-### Usage
+- **Prefill servers**: Handle prompt processing and KV cache generation
+- **Decode server**: Handles token generation using cached KV states
+- **Proxy server**: Coordinates requests between prefill and decode servers
+
+This architecture provides several benefits:
+- Better GPU utilization by matching workload characteristics to hardware
+- Improved scalability by independently scaling prefill and decode capacity
+- Reduced latency through parallel processing
+- Cost optimization by using different instance types for different phases
+
+## Available Examples
+
+### 1p1d - Single Prefill, Single Decode
+Directory: [`1p1d/`](./1p1d/)
+
+A basic setup with:
+- 1 prefill server (port 8100)
+- 1 decode server (port 8200)
+- 1 proxy server (port 9000)
+
+**Requirements**: At least 2 GPUs
+
+This is the simplest configuration to get started with disaggregated prefill.
+
+### xp1d - Multiple Prefill, Single Decode
+Directory: [`xp1d/`](./xp1d/)
+
+A scaled setup with:
+- 2 prefill servers (ports 8100, 8101)
+- 1 decode server (port 8200)
+- 1 proxy server with round-robin load balancing (port 9000)
+
+**Requirements**: At least 3 GPUs
+
+This configuration demonstrates how to scale prefill capacity while maintaining a single decode instance.
+
+## Prerequisites
+
+Before running any example, ensure you have:
+
+- [LMCache](https://github.com/LMCache/LMCache) installed: `pip install lmcache`
+- [NIXL](https://github.com/ai-dynamo/nixl) installed
+- Valid Hugging Face token (HF_TOKEN) for Llama 3.1 8B Instruct
+- Sufficient GPU resources (see individual example requirements)
+
+## Quick Start
+
+1. Choose the appropriate example based on your GPU resources:
+   - For 2 GPUs: Use [`1p1d/`](./1p1d/)
+   - For 3+ GPUs: Use [`xp1d/`](./xp1d/)
+
+2. Navigate to the chosen directory:
+   ```bash
+   cd 1p1d/  # or cd xp1d/
+   ```
+
+3. Follow the specific README instructions in that directory
+
+## Benchmarking
+
+Both examples can be benchmarked using vLLM's `benchmark_serving.py`:
 
-Run
 ```bash
-bash disagg_example_nixl.sh
+python benchmark_serving.py --port 9000 --seed $(date +%s) \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --dataset-name random --random-input-len 7500 --random-output-len 200 \
+    --num-prompts 30 --burstiness 100 --request-rate 1 --ignore-eos
 ```
 
-to start disaggregated prefill and benchmark the performance.
+## Architecture Components
+
+Each example includes:
+
+- **Main script**: `disagg_example_*.sh` - Main entry point to run the example
+- **Launcher script**: `disagg_vllm_launcher.sh` - Launches vLLM servers and proxy
+- **Proxy server**: `disagg_proxy_server.py` - FastAPI server coordinating requests
+- **Configuration files**: YAML configs for prefill and decode servers
+- **Log files**: Generated during execution for debugging
+
+## Troubleshooting
 
-### Components
+- **GPU Memory Issues**: Ensure you have sufficient VRAM for the model on each GPU
+- **Port Conflicts**: Check that ports 8100, 8101, 8200, and 9000 are available
+- **HF Token**: Verify your Hugging Face token has access to Llama 3.1 models
+- **Dependencies**: Ensure both LMCache and NIXL are properly installed
 
-#### Server Scripts
-- `disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` - Launches individual vLLM servers for prefill/decode, and also launches the proxy server.
-- `disagg_prefill_lmcache_v1/disagg_proxy_server.py` - FastAPI proxy server that coordinates between prefiller and decoder
-- `disagg_prefill_lmcache_v1/disagg_example_nixl.sh` - Main script to run the example
+For detailed troubleshooting, check the log files generated in each example directory.
 
-#### Configuration
-- `disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml` - Configuration for prefiller server
-- `disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml` - Configuration for decoder server
+## Further Reading
 
-#### Log Files
-The main script generates several log files:
-- `prefiller.log` - Logs from the prefill server
-- `decoder.log` - Logs from the decode server
-- `proxy.log` - Logs from the proxy server
+- [LMCache Documentation](https://github.com/LMCache/LMCache)
+- [NIXL Documentation](https://github.com/ai-dynamo/nixl)
+- [vLLM Documentation](https://docs.vllm.ai/)
````
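
The xp1d proxy mentioned above distributes prefill requests round-robin across the prefill instances. The real implementation lives in `disagg_proxy_server.py` (Python/FastAPI); as an illustration of the scheduling idea in this repo's scripting style, here is a bash sketch (the function and variable names are hypothetical):

```bash
# Illustrative only: a round-robin index over the two prefiller ports.
PREFILL_PORTS=(8100 8101)
next=0

route_prefill() {
    local port="${PREFILL_PORTS[$next]}"
    # Advance and wrap the index so requests alternate between servers.
    next=$(( (next + 1) % ${#PREFILL_PORTS[@]} ))
    echo "prefill -> localhost:${port}"
}

route_prefill   # prefill -> localhost:8100
route_prefill   # prefill -> localhost:8101
route_prefill   # prefill -> localhost:8100
```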
Lines changed: 81 additions & 0 deletions
## Example of Disaggregated Prefill in vLLM v1

This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node.

### Prerequisites

- Install [LMCache](https://github.com/LMCache/LMCache). You can simply run `pip install lmcache`.
- Install [NIXL](https://github.com/ai-dynamo/nixl).
- At least 3 GPUs
- Valid Hugging Face token (HF_TOKEN) for Llama 3.1 8B Instruct.

### Usage

Run

```bash
bash disagg_example_xp1d.sh
```

to start disaggregated prefill and benchmark the performance.

The script will:

1. Launch 1 decoder instance listening on port 8200
2. Launch 2 prefill instances listening on ports 8100 and 8101, respectively
3. Launch a proxy server that uses round-robin to distribute requests between the prefill instances, listening on port 9000

Press `Ctrl+C` to stop the servers.

#### Example benchmark command

If you have vLLM [benchmark_serving.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py), you can run the following command to benchmark the serving performance of the disaggregated prefill setup:

```bash
python benchmark_serving.py --port 9000 --seed $(date +%s) \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random --random-input-len 7500 --random-output-len 200 \
    --num-prompts 30 --burstiness 100 --request-rate 1 --ignore-eos
```

Expected output from the benchmark script:

```plaintext
============ Serving Benchmark Result ============
Successful requests:                     30
Benchmark duration (s):                  31.34
Total input tokens:                      224970
Total generated tokens:                  6000
Request throughput (req/s):              0.96
Output token throughput (tok/s):         191.44
Total Token throughput (tok/s):          7369.36
---------------Time to First Token----------------
Mean TTFT (ms):                          313.41
Median TTFT (ms):                        272.83
P99 TTFT (ms):                           837.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.84
Median TPOT (ms):                        8.72
P99 TPOT (ms):                           11.35
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.84
Median ITL (ms):                         8.61
P99 ITL (ms):                            11.43
==================================================
```
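
As a sanity check, these totals are consistent with the benchmark flags: 30 prompts with a random input length around 7500 tokens give roughly 30 × 7500 ≈ 225,000 input tokens (224,970 here), and `--ignore-eos` forces exactly 30 × 200 = 6,000 generated tokens.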

### Components

#### Server Scripts
- `disagg_vllm_launcher.sh` - Launches individual vLLM servers for prefill/decode, and also launches the proxy server.
- `disagg_proxy_server.py` - FastAPI proxy server that coordinates between the prefillers and the decoder
- `disagg_example_xp1d.sh` - Main script to run the example

#### Configuration
- `configs/lmcache-prefiller-config.yaml` - Configuration for the prefiller servers
- `configs/lmcache-decoder-config.yaml` - Configuration for the decoder server

#### Log Files
The main script generates several log files:
- `prefiller1.log` and `prefiller2.log` - Logs from the prefill servers
- `decoder.log` - Logs from the decode server
- `proxy.log` - Logs from the proxy server
Lines changed: 12 additions & 0 deletions
```yaml
local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL

enable_nixl: True
nixl_role: "receiver"
nixl_peer_host: "localhost"
nixl_peer_port: 55555
nixl_buffer_size: 1073741824 # 1GB
nixl_buffer_device: "cuda"
nixl_enable_gc: True
```
Lines changed: 12 additions & 0 deletions
```yaml
local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL

enable_nixl: True
nixl_role: "sender"
nixl_peer_host: "localhost"
nixl_peer_port: 55555
nixl_buffer_size: 1073741824 # 1GB
nixl_buffer_device: "cuda"
nixl_enable_gc: True
```
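
These two configs form a NIXL pair over peer port 55555: presumably the prefiller takes the `sender` role and the decoder the `receiver` role, since KV cache flows from prefill to decode. A hedged sketch of how a launcher might attach each config to its vLLM server via LMCache's `LMCACHE_CONFIG_FILE` environment variable (the serve flags are illustrative; the actual `disagg_vllm_launcher.sh` passes additional KV-transfer options):

```bash
# Sketch only -- see disagg_vllm_launcher.sh for the real invocations.

# Prefiller (NIXL sender) on GPU 0, serving on port 8100.
LMCACHE_CONFIG_FILE=configs/lmcache-prefiller-config.yaml \
CUDA_VISIBLE_DEVICES=0 \
    vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8100 &

# Decoder (NIXL receiver) on GPU 1, serving on port 8200.
LMCACHE_CONFIG_FILE=configs/lmcache-decoder-config.yaml \
CUDA_VISIBLE_DEVICES=1 \
    vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8200 &
```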
