diff --git a/Qwen/Qwen2.5-VL.md b/Qwen/Qwen2.5-VL.md
index 924c8f0..fb303b1 100644
--- a/Qwen/Qwen2.5-VL.md
+++ b/Qwen/Qwen2.5-VL.md
@@ -9,20 +9,53 @@ Since BF16 is the commonly used precision type for Qwen2.5-VL training, using BF
## GPU Deployment
-### Installing vLLM
+### Setting Up vLLM
+
+#### CUDA
```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```
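+
+To confirm the environment is usable, a quick sanity check (a minimal sketch, not part of the official install steps):
+```bash
+# vLLM should import and PyTorch should see the CUDA devices
+python -c "import vllm, torch; print(vllm.__version__, torch.cuda.is_available())"
+```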
+
+#### ROCm
+```bash
+# Use the latest official AMD release image:
+# DOCKER_IMAGE=rocm/vllm
+# Or use the nightly image for access to the latest vLLM features:
+DOCKER_IMAGE=rocm/vllm-dev:nightly
+
+docker pull $DOCKER_IMAGE
+
+# <host-model-dir> below is a placeholder for the local directory holding your model / Hugging Face cache
+docker run -it --rm \
+--network=host \
+--group-add=video \
+--ipc=host \
+--cap-add=SYS_PTRACE \
+--security-opt seccomp=unconfined \
+--device /dev/kfd \
+--device /dev/dri \
+-v <host-model-dir>:/app/models \
+-e HF_HOME="/app/models" \
+$DOCKER_IMAGE \
+bash
+```
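+
+Once inside the container, a quick check that the GPUs are visible and the bundled vLLM imports (a minimal sketch):
+```bash
+# List the visible AMD GPUs
+rocm-smi
+# Confirm the preinstalled vLLM build
+python -c "import vllm; print(vllm.__version__)"
+```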
+
-### Running Qwen2.5-VL with BF16 on 4xA100
+
+### Usage
There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel (TP) or (2) Data-parallel (DP). Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios, and data-parallel works better for cases where there is a lot of data with heavy loads.
To launch the online inference server for Qwen2.5-VL-72B:
+
+#### Running Qwen2.5-VL with BF16 on 4xA100
+
```bash
# Start server with BF16 model on 4 GPUs using TP=4
export CUDA_VISIBLE_DEVICES=0,1,2,3
@@ -31,10 +64,51 @@ vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--port 8000 \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data \
- --limit-mm-per-prompt '{"image":2,"video":0}' \
+ --limit-mm-per-prompt '{"image":2,"video":0}'
+```
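+
+Once the server is up, requests go through vLLM's OpenAI-compatible API. A minimal example request (the image URL is a placeholder):
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
+    "messages": [{
+      "role": "user",
+      "content": [
+        {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
+        {"type": "text", "text": "Describe this image."}
+      ]
+    }],
+    "max_tokens": 128
+  }'
+```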
+
+#### MI300X
+```bash
+export CUDA_VISIBLE_DEVICES=0,1
+VLLM_ROCM_USE_AITER=1 \
+vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
+ --host 0.0.0.0 \
+ --port 8000 \
+ --tensor-parallel-size 2 \
+ --mm-encoder-tp-mode data \
+ --limit-mm-per-prompt '{"image":2,"video":0}'
```
+To improve throughput on a single node, launch four of the following TP=2 instances, each pinned to its own GPU pair and port, and connect them to a routing gateway (see the launch sketch after the example below).
+
+```bash
+export CUDA_VISIBLE_DEVICES=0,1
+VLLM_ROCM_USE_AITER=1 \
+vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
+ --host 0.0.0.0 \
+ --port 8000 \
+ --tensor-parallel-size 2 \
+ --mm-encoder-tp-mode data \
+ --limit-mm-per-prompt '{"image":2,"video":0}'
+```
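+
+A minimal sketch for launching all four instances (the GPU pairing and ports 8000-8003 are assumptions; any HTTP load balancer such as nginx can act as the gateway in front of them):
+```bash
+# Launch four TP=2 instances, one per GPU pair, on ports 8000-8003
+for i in 0 1 2 3; do
+  CUDA_VISIBLE_DEVICES=$((2*i)),$((2*i+1)) \
+  VLLM_ROCM_USE_AITER=1 \
+  vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
+    --host 0.0.0.0 \
+    --port $((8000 + i)) \
+    --tensor-parallel-size 2 \
+    --mm-encoder-tp-mode data \
+    --limit-mm-per-prompt '{"image":2,"video":0}' &
+done
+wait
+```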
+
+To lower latency on a single node, launch a single server with TP=8:
+```bash
+export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+VLLM_ROCM_USE_AITER=1 \
+vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
+ --host 0.0.0.0 \
+ --port 8000 \
+ --tensor-parallel-size 8 \
+ --mm-encoder-tp-mode data \
+ --limit-mm-per-prompt '{"image":2,"video":0}'
+```
+
#### Tips
- You can set `--max-model-len` to preserve memory. By default the model's context length is 128K, but `--max-model-len=65536` is usually good for most scenarios.