# Disaggregated Prefill Examples for LMCache with vLLM v1

This directory contains examples demonstrating how to run LMCache with disaggregated prefill using NIXL. Disaggregated prefill allows you to separate the prefill (prompt processing) and decode (token generation) phases of LLM inference across different GPU instances, enabling better resource utilization and scalability.

## Overview

The disaggregated prefill architecture separates the compute-intensive prefill phase from the memory-intensive decode phase:

- **Prefill servers**: Handle prompt processing and KV cache generation
- **Decode server**: Handles token generation using cached KV states
- **Proxy server**: Coordinates requests between prefill and decode servers

This architecture provides several benefits:
- Better GPU utilization by matching workload characteristics to hardware
- Improved scalability by independently scaling prefill and decode capacity
- Reduced latency through parallel processing
- Cost optimization by using different instance types for different phases

## Available Examples

### 1p1d - Single Prefill, Single Decode
Directory: [`1p1d/`](./1p1d/)

A basic setup with:
- 1 prefill server (port 8100)
- 1 decode server (port 8200)
- 1 proxy server (port 9000)

**Requirements**: At least 2 GPUs

This is the simplest configuration to get started with disaggregated prefill.
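
Once both servers and the proxy are up, a quick liveness check can confirm that everything is listening. A minimal sketch, assuming each vLLM server exposes its standard `/health` route (the proxy's route may differ):

```bash
# Quick liveness check for the prefill (8100), decode (8200), and proxy (9000) servers.
# /health is vLLM's standard liveness route; the proxy's route may differ.
for port in 8100 8200 9000; do
  if curl -sf "http://localhost:${port}/health" > /dev/null; then
    echo "port ${port}: up"
  else
    echo "port ${port}: not responding"
  fi
done
```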

### xp1d - Multiple Prefill, Single Decode
Directory: [`xp1d/`](./xp1d/)

A scaled setup with:
- 2 prefill servers (ports 8100, 8101)
- 1 decode server (port 8200)
- 1 proxy server with round-robin load balancing (port 9000)

**Requirements**: At least 3 GPUs

This configuration demonstrates how to scale prefill capacity while maintaining a single decode instance.
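
The proxy's round-robin strategy alternates which prefill server receives each new request. The following is a conceptual sketch of that dispatch order only, not the proxy's actual code (the real selection logic lives in `disagg_proxy_server.py`):

```bash
# Conceptual illustration of round-robin dispatch across the two prefill ports.
PREFILL_PORTS=(8100 8101)
i=0
for request in req1 req2 req3 req4; do
  # Cycle through the prefill ports: req1 -> 8100, req2 -> 8101, req3 -> 8100, ...
  port=${PREFILL_PORTS[$(( i % ${#PREFILL_PORTS[@]} ))]}
  echo "${request} -> prefill server on port ${port}"
  i=$((i + 1))
done
```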

## Prerequisites

Before running any example, ensure you have:

- [LMCache](https://github.com/LMCache/LMCache) installed: `pip install lmcache`
- [NIXL](https://github.com/ai-dynamo/nixl) installed
- A valid Hugging Face token (`HF_TOKEN`) with access to Llama 3.1 8B Instruct
- Sufficient GPU resources (see individual example requirements)
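
A minimal environment setup might look like the following sketch; the NIXL build steps live in its repository, and the token value is a placeholder:

```bash
# Install LMCache from PyPI.
pip install lmcache

# NIXL is built from source; follow the instructions in its repository:
# https://github.com/ai-dynamo/nixl

# Export your Hugging Face token (placeholder value shown).
export HF_TOKEN=<your_hf_token>

# Confirm that enough GPUs are visible.
nvidia-smi --list-gpus
```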

## Quick Start

1. Choose the appropriate example based on your GPU resources:
   - For 2 GPUs: Use [`1p1d/`](./1p1d/)
   - For 3+ GPUs: Use [`xp1d/`](./xp1d/)

2. Navigate to the chosen directory:
   ```bash
   cd 1p1d/  # or cd xp1d/
   ```

3. Follow the specific README instructions in that directory.
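
4. Once the servers are running, verify the end-to-end path by sending a request through the proxy. A minimal sketch, assuming the proxy forwards vLLM's OpenAI-compatible `/v1/completions` route on port 9000:
   ```bash
   # Send a single completion request through the proxy (hypothetical prompt).
   curl -s http://localhost:9000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "Explain disaggregated prefill in one sentence.",
       "max_tokens": 50
     }'
   ```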

## Benchmarking

Both examples can be benchmarked using vLLM's `benchmark_serving.py`:

```bash
python benchmark_serving.py --port 9000 --seed $(date +%s) \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random --random-input-len 7500 --random-output-len 200 \
    --num-prompts 30 --burstiness 100 --request-rate 1 --ignore-eos
```
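
Note that `benchmark_serving.py` ships with the vLLM source tree rather than the PyPI package, so you may need to fetch it first; the path below reflects the vLLM repository layout at the time of writing:

```bash
# The benchmark script lives under benchmarks/ in the vLLM repository.
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
python benchmark_serving.py --help
```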

## Architecture Components

Each example includes:

- **Main script**: `disagg_example_*.sh` - Main entry point to run the example
- **Launcher script**: `disagg_vllm_launcher.sh` - Launches vLLM servers and proxy
- **Proxy server**: `disagg_proxy_server.py` - FastAPI server coordinating requests
- **Configuration files**: YAML configs for prefill and decode servers
- **Log files**: Generated during execution for debugging (see the tailing example below)
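
To watch all components at once while an example runs, you can tail the logs together; the file names below come from the single prefill/decode setup (`prefiller.log`, `decoder.log`, `proxy.log`) and may differ for xp1d:

```bash
# Follow the prefill, decode, and proxy logs side by side.
tail -f prefiller.log decoder.log proxy.log
```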

## Troubleshooting

- **GPU Memory Issues**: Ensure you have sufficient VRAM for the model on each GPU
- **Port Conflicts**: Check that ports 8100, 8101, 8200, and 9000 are available (see the check below)
- **HF Token**: Verify your Hugging Face token has access to Llama 3.1 models
- **Dependencies**: Ensure both LMCache and NIXL are properly installed
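
A quick way to confirm those ports are free before launching, using standard Linux tooling:

```bash
# Report any process already listening on the ports the examples use.
for port in 8100 8101 8200 9000; do
  if ss -ltn "( sport = :${port} )" | grep -q LISTEN; then
    echo "port ${port}: in use"
  else
    echo "port ${port}: free"
  fi
done
```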

For detailed troubleshooting, check the log files generated in each example directory.

## Further Reading

- [LMCache Documentation](https://github.com/LMCache/LMCache)
- [NIXL Documentation](https://github.com/ai-dynamo/nixl)
- [vLLM Documentation](https://docs.vllm.ai/)