1p1d
====

One Prefiller, One Decoder (1p1d) Example
------------------------------------------

This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node, with one prefiller and one decoder. This configuration separates the compute-intensive prefill phase from the decode phase, allowing for better resource utilization and performance.

Architecture Overview
~~~~~~~~~~~~~~~~~~~~~

The 1p1d setup consists of three main components:

1. **Prefiller Server** - Handles the prefill phase of inference (initial prompt processing)
2. **Decoder Server** - Handles the decode phase of inference (token generation)
3. **Proxy Server** - Coordinates requests between the prefiller and decoder

.. code-block:: text

               ┌─────────────┐
               │   Client    │
               └──────┬──────┘
                      │
              ┌───────▼───────┐
              │  Proxy Server │
              │   Port 9000   │
              └───┬───────┬───┘
                  │       │
         ┌────────▼──┐  ┌─▼────────┐
         │ Prefiller │  │ Decoder  │
         │ Port 8100 │  │ Port 8200│
         │   GPU 0   │  │  GPU 1   │
         └─────┬─────┘  └────▲─────┘
               │             │
               └─────────────┘
                NIXL Transfer

Prerequisites
~~~~~~~~~~~~~

- **LMCache**: Install with ``pip install lmcache``
- **NIXL**: Install from the `NIXL GitHub repository <https://github.com/ai-dynamo/nixl>`_
- **Hardware**: At least 2 GPUs
- **Model Access**: A valid Hugging Face token (``HF_TOKEN``) with access to Llama 3.1 8B Instruct

Quick Start
~~~~~~~~~~~

1. **Set your Hugging Face token**:

   .. code-block:: bash

      export HF_TOKEN=hf_your_token_here

2. **Navigate to the example directory**:

   .. code-block:: bash

      cd examples/disagg_prefill/1p1d

3. **Run the example**:

   .. code-block:: bash

      bash disagg_example_nixl.sh

The script will automatically:

- Launch a prefiller instance on port 8100 (GPU 0)
- Launch a decoder instance on port 8200 (GPU 1)
- Launch a proxy server on port 9000
- Wait for all servers to be ready

Press ``Ctrl+C`` to stop all servers.
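
If you want to script the same readiness check yourself (for example, in CI), polling each vLLM server's OpenAI-compatible ``/v1/models`` endpoint is enough. The sketch below is illustrative and not taken from the example scripts; the helper name and timeout are our own choices:

.. code-block:: python

   # Minimal readiness probe: poll each server until it answers HTTP 200.
   # The URLs match the ports used in this example; everything else is an
   # illustrative assumption, not code from the example scripts.
   import time
   import urllib.request

   def wait_until_ready(url: str, timeout: float = 600.0) -> None:
       """Poll `url` until it returns HTTP 200 or `timeout` seconds pass."""
       deadline = time.time() + timeout
       while time.time() < deadline:
           try:
               with urllib.request.urlopen(url, timeout=5) as resp:
                   if resp.status == 200:
                       return
           except OSError:
               pass  # server not up yet; retry
           time.sleep(2)
       raise TimeoutError(f"{url} did not become ready within {timeout}s")

   for url in ("http://localhost:8100/v1/models",   # prefiller
               "http://localhost:8200/v1/models"):  # decoder
       wait_until_ready(url)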

Configuration
~~~~~~~~~~~~~

Prefiller Configuration
^^^^^^^^^^^^^^^^^^^^^^^

The prefiller is configured via ``configs/lmcache-prefiller-config.yaml``:

.. code-block:: yaml

   local_cpu: False
   max_local_cpu_size: 0
   max_local_disk_size: 0
   remote_serde: NULL

   enable_nixl: True
   nixl_role: "sender"
   nixl_peer_host: "localhost"
   nixl_peer_port: 55555
   nixl_buffer_size: 1073741824 # 1GB
   nixl_buffer_device: "cuda"
   nixl_enable_gc: True

Key settings:

- ``nixl_role: "sender"`` - Configures this instance to send KV cache data
- ``nixl_buffer_size: 1073741824`` - 1 GiB buffer for NIXL transfers
- ``nixl_buffer_device: "cuda"`` - Uses GPU memory for buffering

Decoder Configuration
^^^^^^^^^^^^^^^^^^^^^

The decoder is configured via ``configs/lmcache-decoder-config.yaml``:

.. code-block:: yaml

   local_cpu: False
   max_local_cpu_size: 0
   max_local_disk_size: 0
   remote_serde: NULL

   enable_nixl: True
   nixl_role: "receiver"
   nixl_peer_host: "localhost"
   nixl_peer_port: 55555
   nixl_buffer_size: 1073741824 # 1GB
   nixl_buffer_device: "cuda"
   nixl_enable_gc: True

Key settings:

- ``nixl_role: "receiver"`` - Configures this instance to receive KV cache data
- Same buffer configuration as the prefiller for compatibility

Components Deep Dive
~~~~~~~~~~~~~~~~~~~~

Proxy Server (disagg_proxy_server.py)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The proxy server coordinates the disaggregated prefill workflow (a simplified sketch follows below):

1. **Request Handling**: Receives client requests on port 9000
2. **Prefill Coordination**: Sends each request to the prefiller with ``max_tokens=1``
3. **Response Streaming**: Streams the full response from the decoder
4. **Performance Monitoring**: Tracks Time-To-First-Token (TTFT) statistics

Supported endpoints:

- ``/v1/completions``
- ``/v1/chat/completions``
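
The real implementation lives in ``disagg_proxy_server.py``. To make the flow concrete, here is a deliberately simplified sketch of the same idea, assuming FastAPI and httpx rather than whatever the actual script uses; the ports match the example, everything else is illustrative:

.. code-block:: python

   # Simplified sketch of the proxy's flow, NOT the actual
   # disagg_proxy_server.py: forward the request to the prefiller with
   # max_tokens=1 so it builds the KV cache (which NIXL then transfers),
   # then stream the real response from the decoder.
   import httpx
   from fastapi import FastAPI, Request
   from fastapi.responses import StreamingResponse

   PREFILLER = "http://localhost:8100"
   DECODER = "http://localhost:8200"

   app = FastAPI()  # run with: uvicorn proxy_sketch:app --port 9000

   @app.post("/v1/completions")
   async def completions(request: Request):
       body = await request.json()

       # 1. Prefill: a one-token request populates the KV cache.
       async with httpx.AsyncClient(timeout=None) as client:
           await client.post(f"{PREFILLER}/v1/completions",
                             json=dict(body, max_tokens=1))

       # 2. Decode: stream the full generation, reusing the cache.
       async def stream_decoder():
           async with httpx.AsyncClient(timeout=None) as client:
               async with client.stream("POST", f"{DECODER}/v1/completions",
                                        json=body) as resp:
                   async for chunk in resp.aiter_bytes():
                       yield chunk

       return StreamingResponse(stream_decoder())

The key trick is the ``max_tokens=1`` prefill request: it forces the prefiller to process the whole prompt, populating the KV cache that NIXL transfers, while generating almost nothing itself.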

vLLM Server Launcher (disagg_vllm_launcher.sh)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This script launches individual vLLM servers with appropriate configurations:

**Prefiller Launch Command**:

.. code-block:: bash

   UCX_TLS=cuda_ipc,cuda_copy,tcp \
   LMCACHE_CONFIG_FILE=configs/lmcache-prefiller-config.yaml \
   VLLM_ENABLE_V1_MULTIPROCESSING=1 \
   VLLM_WORKER_MULTIPROC_METHOD=spawn \
   CUDA_VISIBLE_DEVICES=0 \
   vllm serve meta-llama/Llama-3.1-8B-Instruct \
       --port 8100 \
       --disable-log-requests \
       --enforce-eager \
       --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer",...}'

**Decoder Launch Command**:

.. code-block:: bash

   UCX_TLS=cuda_ipc,cuda_copy,tcp \
   LMCACHE_CONFIG_FILE=configs/lmcache-decoder-config.yaml \
   VLLM_ENABLE_V1_MULTIPROCESSING=1 \
   VLLM_WORKER_MULTIPROC_METHOD=spawn \
   CUDA_VISIBLE_DEVICES=1 \
   vllm serve meta-llama/Llama-3.1-8B-Instruct \
       --port 8200 \
       --disable-log-requests \
       --enforce-eager \
       --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer",...}'

Testing and Benchmarking
~~~~~~~~~~~~~~~~~~~~~~~~

Basic Test
^^^^^^^^^^

Once all servers are running, you can test with a simple curl command:

.. code-block:: bash

   curl -X POST http://localhost:9000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "The future of AI is",
       "max_tokens": 50,
       "temperature": 0.7
     }'
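
The same request can be sent from Python through any OpenAI-compatible client, since the proxy exposes the standard completions API (this assumes the ``openai`` package is installed; vLLM ignores the API key, but the client requires one):

.. code-block:: python

   # Send the test completion through the proxy on port 9000.
   from openai import OpenAI

   client = OpenAI(base_url="http://localhost:9000/v1", api_key="dummy")

   resp = client.completions.create(
       model="meta-llama/Llama-3.1-8B-Instruct",
       prompt="The future of AI is",
       max_tokens=50,
       temperature=0.7,
   )
   print(resp.choices[0].text)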

Performance Benchmarking
^^^^^^^^^^^^^^^^^^^^^^^^

For comprehensive performance testing, use vLLM's benchmark tool:

.. code-block:: bash

   python benchmark_serving.py --port 9000 --seed $(date +%s) \
       --model meta-llama/Llama-3.1-8B-Instruct \
       --dataset-name random --random-input-len 7500 --random-output-len 200 \
       --num-prompts 30 --burstiness 100 --request-rate 1 --ignore-eos

This benchmark:

- Sends requests to port 9000 (the proxy server)
- Uses random prompts of 7500 input tokens
- Generates 200 output tokens per request
- Tests with 30 total prompts at 1 request/second

Log Files and Monitoring
~~~~~~~~~~~~~~~~~~~~~~~~

The example generates three log files for monitoring:

- ``prefiller.log`` - Prefiller server logs and errors
- ``decoder.log`` - Decoder server logs and errors
- ``proxy.log`` - Proxy server logs and TTFT statistics

The proxy server automatically calculates and displays TTFT statistics every 5 seconds:

.. code-block:: text

   ===============================
   Num requests: 10
   Prefill node TTFT stats:
    - Average (ms): 45.2
    - Median (ms): 43.1
    - 99th Percentile (ms): 52.8
   ===============================
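
These summaries are straightforward to reproduce from per-request TTFT samples, measured from request send to first returned token. A sketch of the arithmetic (the proxy's own bookkeeping may differ; the sample values here are made up):

.. code-block:: python

   # Compute the same summary statistics from a list of TTFT samples.
   import statistics

   ttfts_ms = [45.1, 42.9, 52.8, 43.3, 44.7]  # hypothetical samples

   print(f"Num requests: {len(ttfts_ms)}")
   print(f" - Average (ms): {statistics.mean(ttfts_ms):.1f}")
   print(f" - Median (ms): {statistics.median(ttfts_ms):.1f}")
   # statistics.quantiles with n=100 yields 99 cut points; index 98 is p99.
   p99 = statistics.quantiles(ttfts_ms, n=100)[98]
   print(f" - 99th Percentile (ms): {p99:.1f}")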

Troubleshooting
~~~~~~~~~~~~~~~

Common Issues
^^^^^^^^^^^^^

1. **GPU Memory**: Ensure each GPU has sufficient memory for the model
2. **NIXL Installation**: Verify NIXL is properly installed and accessible
3. **Port Conflicts**: Check that ports 8100, 8200, and 9000 are available
4. **HF Token**: Ensure your Hugging Face token has access to the Llama models

Error Recovery
^^^^^^^^^^^^^^

If any server fails to start:

1. Check the corresponding log file for error details
2. Verify GPU availability with ``nvidia-smi``
3. Ensure all dependencies are installed
4. Stop all servers with ``Ctrl+C`` and re-run the script