From b5697de3d7c324bc3033c2e8d0545c8a36ff5e90 Mon Sep 17 00:00:00 2001 From: zhuyuhua-v Date: Thu, 27 Nov 2025 15:43:04 +0800 Subject: [PATCH 1/5] Add ROCm llama recipes Signed-off-by: zhuyuhua-v --- Llama/Llama3.1-405B-ROCm.md | 335 +++++++++++++++++++++++++++++++++++ Llama/Llama3.3-70B-ROCm.md | 337 ++++++++++++++++++++++++++++++++++++ 2 files changed, 672 insertions(+) create mode 100644 Llama/Llama3.1-405B-ROCm.md create mode 100644 Llama/Llama3.3-70B-ROCm.md diff --git a/Llama/Llama3.1-405B-ROCm.md b/Llama/Llama3.1-405B-ROCm.md new file mode 100644 index 0000000..d5eee85 --- /dev/null +++ b/Llama/Llama3.1-405B-ROCm.md @@ -0,0 +1,335 @@ +# Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355 + +## Introduction + +This quick start recipe provides step-by-step instructions for running the Llama 3.3-70B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output. + +The recipe is intended for developers and practitioners seeking high-throughput or low-latency inference using ROCm's accelerated stack—building a docker image with vLLM for model serving, FlashInfer for optimized ROCm kernels, and ModelOpt to enable FP8 and MXFP4 quantized execution. + +## Access & Licensing + +### License + +To use Llama 3.3-70B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license. + +### Weights + +You only need to download one version of the model weights, depending on the precision in use: + +- FP8 model for MI355: [amd/Llama-3.1-405B-Instruct-FP8-KV](https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV) +- FP4 model for MI355: [amd/Llama-3.1-405B-Instruct-MXFP4-Preview](https://huggingface.co/amd/Llama-3.1-405B-Instruct-MXFP4-Preview) + +No Hugging Face authentication token is required to download these weights. + +Note on Quantization Choice: +For MI355, FP8 offers the best performance for most workloads. MXFP4 provides additional memory savings and throughput gains, but may require tuning to maintain accuracy on certain tasks. + +## Prerequisites + +- OS: Linux +- GPU: MI355 +- [ROCm docker setup](https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html#set-up-using-docker) + +## Deployment Steps + +### Pull Docker Image + +Pull the rocm/vllm-dev:nightly docker image. + +`pull_image` +``` +docker pull rocm/vllm-dev:nightly # to get the latest image +``` + +### Run Docker Container + +Run the docker container using the docker image `rocm/vllm-dev:nightly`. + +`run_container` +``` +docker run -it --rm \ +--network=host \ +--group-add=video \ +--ipc=host \ +--cap-add=SYS_PTRACE \ +--security-opt seccomp=unconfined \ +--device /dev/kfd \ +--device /dev/dri \ +-v :/app/models \ +-e HF_HOME="/app/models" \ +-e HF_TOKEN="$HF_TOKEN" \ +rocm/vllm-dev:nightly +``` + +Note: You can mount additional directories and paths using the `-v :` flag if needed, such as mounting the downloaded weight paths. + +The `-e HF_TOKEN="$HF_TOKEN" -e HF_HOME="$HF_HOME"` flags are added so that the models are downloaded using your HuggingFace token and the downloaded models can be cached in $HF_HOME. 
Refer to [HuggingFace documentation](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) for more information about these environment variables and refer to [HuggingFace Quickstart guide](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication) about steps to generate your HuggingFace access token. + +### Prepare the AITER and vLLM packages + +We suggest to install the latest vLLM and AITER to leverage all the optimizations available on ROCm plarforms. + +`install AITER` + +``` +# uninstall aiter & vllm +pip uninstall -y aiter vllm +# clone aiter repo +git clone https://github.com/ROCm/aiter.git +cd aiter +git checkout dev/perf +git submodule sync && git submodule update --init --recursive +python3 setup.py install +``` + +`install vLLM` +``` +# clone vllm +cd .. && git clone https://github.com/vllm-project/vllm.git +cd vllm +PYTORCH_ROCM_ARCH="gfx950" python3 setup.py develop +``` + +(Optional) Recommended to install when loading weight from local storage. + - Install from wheel (only have python 3.10 wheel for now and is compatible with the docker image `rocm/ali-private:ubuntu22.04_rocm7.0.1.42_vllm_5b842c2_aiter_6b586ae_torch2.8.0_20250917`) + + ```shell + python3 -m pip install https://github.com/EmbeddedLLM/fastsafetensors-rocm/releases/download/v0.1.15-rocm7-preview/fastsafetensors-0.1.15-cp310-cp310-linux_x86_64.whl + ``` + + - If you are using other python version follow the following steps (this works for ROCm 6.4.3 and ROCm 7.0): + + ```shell + git clone https://github.com/EmbeddedLLM/fastsafetensors-rocm.git + cd fastsafetensors-rocm + python3 setup.py develop + ``` + + Add `--load-format fastsafetensors` to the `vllm serve` command to enable this feature. + A 5-mins readup about fastsafetensors can be found here + +### Launch the vLLM Server + +Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruct-FP4/FP8 model. + +`launch_server_llama_fp8.sh` +``` +export SAFETENSORS_FAST_GPU=1 +export VLLM_ROCM_USE_AITER=1 +export VLLM_USE_V1=1 +export VLLM_USE_TRITON_FLASH_ATTN=0 +export NCCL_DEBUG=WARN +export VLLM_RPC_TIMEOUT=1800000 + +vllm serve amd/Llama-3.1-405B-Instruct-FP8-KV/ \ + --tensor-parallel-size 1 \ + --max-num-batched-tokens 32768 \ + --port 8000 \ + --trust-remote-code \ + --no-enable-prefix-caching \ + --disable-log-requests \ + --gpu_memory_utilization 0.9 \ + --async-scheduling \ + --load-format safetensors \ + --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops": ["-rms_norm", "-quant_fp8", "-silu_and_mul"]}' \ + --kv-cache-dtype fp8 \ +``` + +`launch_server_llama_fp4.sh` +``` +export SAFETENSORS_FAST_GPU=1 +export VLLM_ROCM_USE_AITER=1 +export VLLM_USE_V1=1 +export VLLM_USE_TRITON_FLASH_ATTN=0 +export NCCL_DEBUG=WARN +export VLLM_RPC_TIMEOUT=1800000 +export VLLM_ROCM_USE_AITER_FP4_ASM_GEMM=1 + +vllm serve amd/Llama-3.1-405B-Instruct-MXFP4-Preview \ + --tensor-parallel-size 1 \ + --max-num-batched-tokens 32768 \ + --port 8000 \ + --trust-remote-code \ + --no-enable-prefix-caching \ + --disable-log-requests \ + --gpu_memory_utilization 0.9 \ + --async-scheduling \ + --load-format safetensors \ + --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}' \ + --kv-cache-dtype fp8 \ +``` + +After the server is set up, the client can now send prompt requests to the server and receive results. 
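+
+For a quick smoke test, you can send an OpenAI-compatible completion request to the port the server is listening on. The snippet below is a minimal sketch: the prompt and sampling parameters are arbitrary placeholders, and the `model` field should match the model name or path passed to `vllm serve` (the FP8 model path is shown here).
+
+`send_request.sh`
+```
+curl -s http://localhost:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "amd/Llama-3.1-405B-Instruct-FP8-KV/",
+        "prompt": "What is the capital of France?",
+        "max_tokens": 64,
+        "temperature": 0
+      }'
+```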
+
+### Configs and Parameters
+
+You can specify the IP address and the port that you would like to run the server with using these flags:
+
+- `host`: IP address of the server. By default, it uses 127.0.0.1.
+- `port`: The port the server listens on. By default, it uses port 8000.
+
+Below are the config flags that we do not recommend changing or tuning:
+
+- `kv-cache-dtype`: KV cache data type. We recommend setting it to `fp8` for best performance.
+- `compilation-config`: Configuration for the vLLM compilation stage. For amd/Llama-3.1-405B-Instruct-FP8-KV, we recommend setting it to `'{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops": ["-rms_norm", "-quant_fp8", "-silu_and_mul"]}'` to enable all the necessary fusions for the best performance on MI355. For amd/Llama-3.1-405B-Instruct-MXFP4-Preview, we recommend setting it to `'{"cudagraph_mode":"FULL_AND_PIECEWISE"}'`.
+- `async-scheduling`: Enables asynchronous scheduling to reduce host overhead between decoding steps. We recommend always adding this flag for best performance.
+- `no-enable-prefix-caching`: Disables prefix caching. We recommend always adding this flag when running with a synthetic dataset for consistent performance measurement.
+
+Below are a few tunable parameters you can modify based on your serving requirements:
+
+- `tensor-parallel-size`: Tensor parallelism size. Increasing this increases the number of GPUs used for inference.
+  - Set this to `1` for the best throughput per GPU, or to `2`, `4`, or `8` for better per-user latencies.
+- `max-num-batched-tokens`: Maximum number of tokens per batch.
+  - We recommend setting this to `32768`. Increasing this value may yield slight performance improvements for workloads with long input sequence lengths.
+
+
+## Validation & Expected Behavior
+
+### Basic Test
+
+After the vLLM server is set up and shows `Application startup complete`, you can send requests to the server (see the example request above).
+
+### Verify Accuracy
+
+While the server is running, you can run accuracy tests using the lm_eval tool. In the script below, set the `model` variable to the same model name or path used when launching the server.
+ +`run_accuracy.sh` +``` +# Install lm_eval that is compatible with the latest vLLM +pip3 install lm-eval[api] + +# Run lm_eval +lm_eval \ + --model local-completions \ + --tasks gsm8k \ + --model_args model="$model",base_url=http://127.0.0.1:8000/v1/completions \ + --batch_size 100 \ +``` + +Here is an example accuracy result with the amd/Llama-3.1-405B-Instruct-FP8-KV model on one MI355 GPU: + +``` +local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100 +|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| +|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| +|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9052|± |0.0081| +| | |strict-match | 5|exact_match|↑ |0.8575|± |0.0096| +``` + +Here is an example accuracy result with the amd/Llama-3.1-405B-Instruct-MXFP4-Preview model on one MI355 GPU: + +``` +local-completions (model=/data/pretrained-models/amd/Llama-3.1-405B-Instruct-MXFP4-Preview/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100 +|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| +|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| +|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8954|± |0.0084| +| | |strict-match | 5|exact_match|↑ |0.8317|± |0.0103| +``` + +### Benchmarking Performance + +To benchmark the performance, you can use the `vllm bench serve` command. + +`run_performance.sh` +``` +input_tokens=8192 +output_tokens=1024 +max_concurrency=64 +num_prompts=128 + +# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/" +# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-MXFP4-Preview/" + +vllm bench serve \ + --host localhost \ + --port 8000 \ + --model ${model} \ + --dataset-name random \ + --random-input-len ${input_tokens} \ + --random-output-len ${output_tokens} \ + --max-concurrency ${max_concurrency} \ + --num-prompts ${num_prompts} \ + --percentile-metrics ttft,tpot,itl,e2el \ + --ignore-eos \ + --seed 123 \ +``` + +Explanations for the flags: + +- `--dataset-name`: Which dataset to use for benchmarking. We use a `random` dataset here. +- `--random-input-len`: Specifies the average input sequence length. +- `--random-output-len`: Specifies the average output sequence length. +- `--ignore-eos`: Disables early returning when eos (end-of-sentence) token is generated. This ensures that the output sequence lengths match our expected range. +- `--max-concurrency`: Maximum number of in-flight requests. We recommend matching this with the `--max-num-seqs` flag used to launch the server. +- `--num-prompts`: Total number of prompts used for performance benchmarking. We recommend setting it to at least five times of the `--max-concurrency` to measure the steady state performance. 
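+
+To characterize several operating points against the same running server, the benchmark can be wrapped in a simple loop. The sketch below reuses the flags from `run_performance.sh` above; it assumes the `model` variable is set to the served model name or path, and sizes `--num-prompts` at five times the concurrency as recommended above.
+
+`run_sweep.sh`
+```
+input_tokens=8192
+output_tokens=1024
+
+for max_concurrency in 16 32 64 128; do
+  # At least five times the concurrency, to measure steady-state performance
+  num_prompts=$((max_concurrency * 5))
+  vllm bench serve \
+    --host localhost \
+    --port 8000 \
+    --model ${model} \
+    --dataset-name random \
+    --random-input-len ${input_tokens} \
+    --random-output-len ${output_tokens} \
+    --max-concurrency ${max_concurrency} \
+    --num-prompts ${num_prompts} \
+    --percentile-metrics ttft,tpot,itl,e2el \
+    --ignore-eos \
+    --seed 123
+done
+```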
+ +### Interpreting Performance Benchmarking Output + +Sample output by the `vllm bench serve` command: + +`amd/Llama-3.1-405B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355 +``` +============ Serving Benchmark Result ============ +Successful requests: xxxxxx +Benchmark duration (s): xxx.xx +Total input tokens: xxxxxx +Total generated tokens: xxxxxx +Request throughput (req/s): xxx.xx +Output token throughput (tok/s): xxx.xx +Total Token throughput (tok/s): xxx.xx +---------------Time to First Token---------------- +Mean TTFT (ms): xxx.xx +Median TTFT (ms): xxx.xx +P99 TTFT (ms): xxx.xx +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): xxx.xx +Median TPOT (ms): xxx.xx +P99 TPOT (ms): xxx.xx +---------------Inter-token Latency---------------- +Mean ITL (ms): xxx.xx +Median ITL (ms): xxx.xx +P99 ITL (ms): xxx.xx +----------------End-to-end Latency---------------- +Mean E2EL (ms): xxx.xx +Median E2EL (ms): xxx.xx +P99 E2EL (ms): xxx.xx +================================================== +``` + +`amd/Llama-3.1-405B-Instruct-MXFP4-Preview` TP1 8k/1k conc=64 performance on MI355 +``` +============ Serving Benchmark Result ============ +Successful requests: xxxxxx +Benchmark duration (s): xxx.xx +Total input tokens: xxxxxx +Total generated tokens: xxxxxx +Request throughput (req/s): xxx.xx +Output token throughput (tok/s): xxx.xx +Total Token throughput (tok/s): xxx.xx +---------------Time to First Token---------------- +Mean TTFT (ms): xxx.xx +Median TTFT (ms): xxx.xx +P99 TTFT (ms): xxx.xx +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): xxx.xx +Median TPOT (ms): xxx.xx +P99 TPOT (ms): xxx.xx +---------------Inter-token Latency---------------- +Mean ITL (ms): xxx.xx +Median ITL (ms): xxx.xx +P99 ITL (ms): xxx.xx +----------------End-to-end Latency---------------- +Mean E2EL (ms): xxx.xx +Median E2EL (ms): xxx.xx +P99 E2EL (ms): xxx.xx +================================================== +``` + +Explanations for key metrics: + +- `Median Time to First Token (TTFT)`: The typical time elapsed from when a request is sent until the first output token is generated. +- `Median Time Per Output Token (TPOT)`: The typical time required to generate each token after the first one. +- `Median Inter-Token Latency (ITL)`: The typical time delay between the completion of one token and the completion of the next. +- `Median End-to-End Latency (E2EL)`: The typical total time from when a request is submitted until the final token of the response is received. +- `Output token throughput`: The rate at which the system generates the output (generated) tokens. +- `Total Token Throughput`: The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens. diff --git a/Llama/Llama3.3-70B-ROCm.md b/Llama/Llama3.3-70B-ROCm.md new file mode 100644 index 0000000..5112ed4 --- /dev/null +++ b/Llama/Llama3.3-70B-ROCm.md @@ -0,0 +1,337 @@ +# Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355 + +## Introduction + +This quick start recipe provides step-by-step instructions for running the Llama 3.3-70B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output. 
+ +The recipe is intended for developers and practitioners seeking high-throughput or low-latency inference using ROCm's accelerated stack—building a docker image with vLLM for model serving, FlashInfer for optimized ROCm kernels, and ModelOpt to enable FP8 and MXFP4 quantized execution. + +## Access & Licensing + +### License + +To use Llama 3.3-70B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license. + +### Weights + +You only need to download one version of the model weights, depending on the precision in use: + +- FP8 model for MI355: [amd/Llama-3.3-70B-Instruct-FP8-KV](https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV) +- FP4 model for MI355: [amd/Llama-3.3-70B-Instruct-MXFP4-Preview](https://huggingface.co/amd/Llama-3.3-70B-Instruct-MXFP4-Preview) + +No Hugging Face authentication token is required to download these weights. + +Note on Quantization Choice: +For MI355, FP8 offers the best performance for most workloads. MXFP4 provides additional memory savings and throughput gains, but may require tuning to maintain accuracy on certain tasks. + +## Prerequisites + +- OS: Linux +- GPU: MI355 +- [ROCm docker setup](https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html#set-up-using-docker) + +## Deployment Steps + +### Pull Docker Image + +Pull the rocm/vllm-dev:nightly docker image. + +`pull_image` +``` +docker pull rocm/vllm-dev:nightly # to get the latest image +``` + +### Run Docker Container + +Run the docker container using the docker image `rocm/vllm-dev:nightly`. + +`run_container` +``` +docker run -it --rm \ +--network=host \ +--group-add=video \ +--ipc=host \ +--cap-add=SYS_PTRACE \ +--security-opt seccomp=unconfined \ +--device /dev/kfd \ +--device /dev/dri \ +-v :/app/models \ +-e HF_HOME="/app/models" \ +-e HF_TOKEN="$HF_TOKEN" \ +rocm/vllm-dev:nightly +``` + +Note: You can mount additional directories and paths using the `-v :` flag if needed, such as mounting the downloaded weight paths. + +The `-e HF_TOKEN="$HF_TOKEN" -e HF_HOME="$HF_HOME"` flags are added so that the models are downloaded using your HuggingFace token and the downloaded models can be cached in $HF_HOME. Refer to [HuggingFace documentation](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhome) for more information about these environment variables and refer to [HuggingFace Quickstart guide](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication) about steps to generate your HuggingFace access token. + +### Prepare the AITER and vLLM packages + +We suggest to install the latest vLLM and AITER to leverage all the optimizations available on ROCm plarforms. + +`install AITER` + +``` +# uninstall aiter & vllm +pip uninstall -y aiter vllm +# clone aiter repo +git clone https://github.com/ROCm/aiter.git +cd aiter +git checkout dev/perf +git submodule sync && git submodule update --init --recursive +python3 setup.py install +``` + +`install vLLM` +``` +# clone vllm +cd .. && git clone https://github.com/vllm-project/vllm.git +cd vllm +PYTORCH_ROCM_ARCH="gfx950" python3 setup.py develop +``` + +(Optional) Recommended to install when loading weight from local storage. 
+ - Install from wheel (only have python 3.10 wheel for now and is compatible with the docker image `rocm/ali-private:ubuntu22.04_rocm7.0.1.42_vllm_5b842c2_aiter_6b586ae_torch2.8.0_20250917`) + + ```shell + python3 -m pip install https://github.com/EmbeddedLLM/fastsafetensors-rocm/releases/download/v0.1.15-rocm7-preview/fastsafetensors-0.1.15-cp310-cp310-linux_x86_64.whl + ``` + + - If you are using other python version follow the following steps (this works for ROCm 6.4.3 and ROCm 7.0): + + ```shell + git clone https://github.com/EmbeddedLLM/fastsafetensors-rocm.git + cd fastsafetensors-rocm + python3 setup.py develop + ``` + + Add `--load-format fastsafetensors` to the `vllm serve` command to enable this feature. + A 5-mins readup about fastsafetensors can be found here + +### Launch the vLLM Server + +Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruct-FP4/FP8 model. + +`launch_server_llama_fp8.sh` +``` +export SAFETENSORS_FAST_GPU=1 +export VLLM_ROCM_USE_AITER=1 +export VLLM_USE_V1=1 +export VLLM_USE_TRITON_FLASH_ATTN=0 +export NCCL_DEBUG=WARN +export VLLM_RPC_TIMEOUT=1800000 + +vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV/ \ + --tensor-parallel-size 1 \ + --max-num-batched-tokens 32768 \ + --port 8000 \ + --trust-remote-code \ + --no-enable-prefix-caching \ + --disable-log-requests \ + --gpu_memory_utilization 0.9 \ + --async-scheduling \ + --load-format safetensors \ + --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops": ["-rms_norm", "-quant_fp8", "-silu_and_mul"]}' \ + --kv-cache-dtype fp8 \ +``` + +`launch_server_llama_fp4.sh` +``` +export SAFETENSORS_FAST_GPU=1 +export VLLM_ROCM_USE_AITER=1 +export VLLM_USE_V1=1 +export VLLM_USE_TRITON_FLASH_ATTN=0 +export NCCL_DEBUG=WARN +export VLLM_RPC_TIMEOUT=1800000 +export VLLM_ROCM_USE_AITER_FP4_ASM_GEMM=1 + +vllm serve amd/Llama-3.3-70B-Instruct-MXFP4-Preview \ + --tensor-parallel-size 1 \ + --max-num-batched-tokens 32768 \ + --port 8000 \ + --trust-remote-code \ + --no-enable-prefix-caching \ + --disable-log-requests \ + --gpu_memory_utilization 0.9 \ + --async-scheduling \ + --load-format safetensors \ + --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}' \ + --kv-cache-dtype fp8 \ +``` + +After the server is set up, the client can now send prompt requests to the server and receive results. + +### Configs and Parameters + +You can specify the IP address and the port that you would like to run the server with using these flags: + +- `host`: IP address of the server. By default, it uses 127.0.0.1. +- `port`: The port to listen to by the server. By default, it uses port 8000. + +Below are the config flags that we do not recommend changing or tuning with: + +- `kv-cache-dtype`: Kv-cache data type. We recommend setting it to `fp8` for best performance. +- `compilation-config`: Configuration for vLLM compilation stage. For amd/Llama-3.3-70B-Instruct-FP8-KV, we recommend setting it to `'{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops": ["-rms_norm", "-quant_fp8", "-silu_and_mul"]}'` to enable all the necessary fusions for the best performance on MI355. For amd/Llama-3.3-70B-Instruct-MXFP4-Preview, we recommend setting it to `'{"cudagraph_mode":"FULL_AND_PIECEWISE"}'` to enable all the necessary fusions for the best performance on MI355. +- `async-scheduling`: Enable asynchronous scheduling to reduce the host overheads between decoding steps. We recommend always adding this flag for best performance. +- `no-enable-prefix-caching` Disable prefix caching. 
We recommend always adding this flag if running with synthetic dataset for consistent performance measurement. + +Below are a few tunable parameters you can modify based on your serving requirements: + +- `tensor-parallel-size`: Tensor parallelism size. Increasing this will increase the number of GPUs that are used for inference. + - Set this to `1` to achieve the best throughput per GPU, and set this to `2`, `4`, or `8` to achieve better per-user latencies. +- `max-num-batched-tokens`: Maximum number of tokens per batch. + - We recommend setting this to `32768`. Increasing this value may have slight performance improvements if the sequences have long input sequence lengths. + + +## Validation & Expected Behavior + +### Basic Test + +After the vLLM server is set up and shows `Application startup complete`, you can send requests to the server + +### Verify Accuracy + +When the server is still running, we can run accuracy tests using lm_eval tool. + +`run_accuracy.sh` +``` +# Install lm_eval that is compatible with the latest vLLM +pip3 install lm-eval[api] + +# Run lm_eval +lm_eval \ + --model local-completions \ + --tasks gsm8k \ + --model_args model="$model",base_url=http://127.0.0.1:8000/v1/completions \ + --batch_size 100 \ +``` + +Here is an example accuracy result with the amd/Llama-3.1-70B-Instruct-FP8-KV/ model on one MI355 GPU: + +``` +local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100 +|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| +|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| +|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9052|± |0.0081| +| | |strict-match | 5|exact_match|↑ |0.8575|± |0.0096| +``` + +Here is an example accuracy result with the amd/Llama-3.3-70B-Instruct-MXFP4-Preview model on one MI355 GPU: + +``` +local-completions (model=/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100 +|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| +|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| +|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8954|± |0.0084| +| | |strict-match | 5|exact_match|↑ |0.8317|± |0.0103| +``` + +### Benchmarking Performance + +To benchmark the performance, you can use the `vllm bench serve` command. + +`run_performance.sh` +``` +input_tokens=8192 +output_tokens=1024 +max_concurrency=64 +num_prompts=128 + +# model="/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/" +# model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/" +# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/" +# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-MXFP4-Preview/" + +vllm bench serve \ + --host localhost \ + --port 8000 \ + --model ${model} \ + --dataset-name random \ + --random-input-len ${input_tokens} \ + --random-output-len ${output_tokens} \ + --max-concurrency ${max_concurrency} \ + --num-prompts ${num_prompts} \ + --percentile-metrics ttft,tpot,itl,e2el \ + --ignore-eos \ + --seed 123 \ +``` + +Explanations for the flags: + +- `--dataset-name`: Which dataset to use for benchmarking. We use a `random` dataset here. +- `--random-input-len`: Specifies the average input sequence length. +- `--random-output-len`: Specifies the average output sequence length. 
+- `--ignore-eos`: Disables early returning when eos (end-of-sentence) token is generated. This ensures that the output sequence lengths match our expected range. +- `--max-concurrency`: Maximum number of in-flight requests. We recommend matching this with the `--max-num-seqs` flag used to launch the server. +- `--num-prompts`: Total number of prompts used for performance benchmarking. We recommend setting it to at least five times of the `--max-concurrency` to measure the steady state performance. + +### Interpreting Performance Benchmarking Output + +Sample output by the `vllm bench serve` command: + +`amd/Llama-3.1-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355 +``` +============ Serving Benchmark Result ============ +Successful requests: xxxxxx +Benchmark duration (s): xxx.xx +Total input tokens: xxxxxx +Total generated tokens: xxxxxx +Request throughput (req/s): xxx.xx +Output token throughput (tok/s): xxx.xx +Total Token throughput (tok/s): xxx.xx +---------------Time to First Token---------------- +Mean TTFT (ms): xxx.xx +Median TTFT (ms): xxx.xx +P99 TTFT (ms): xxx.xx +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): xxx.xx +Median TPOT (ms): xxx.xx +P99 TPOT (ms): xxx.xx +---------------Inter-token Latency---------------- +Mean ITL (ms): xxx.xx +Median ITL (ms): xxx.xx +P99 ITL (ms): xxx.xx +----------------End-to-end Latency---------------- +Mean E2EL (ms): xxx.xx +Median E2EL (ms): xxx.xx +P99 E2EL (ms): xxx.xx +================================================== +``` + +`amd/Llama-3.3-70B-Instruct-MXFP4-Preview` TP1 8k/1k conc=64 performance on MI355 +``` +============ Serving Benchmark Result ============ +Successful requests: xxxxxx +Benchmark duration (s): xxx.xx +Total input tokens: xxxxxx +Total generated tokens: xxxxxx +Request throughput (req/s): xxx.xx +Output token throughput (tok/s): xxx.xx +Total Token throughput (tok/s): xxx.xx +---------------Time to First Token---------------- +Mean TTFT (ms): xxx.xx +Median TTFT (ms): xxx.xx +P99 TTFT (ms): xxx.xx +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): xxx.xx +Median TPOT (ms): xxx.xx +P99 TPOT (ms): xxx.xx +---------------Inter-token Latency---------------- +Mean ITL (ms): xxx.xx +Median ITL (ms): xxx.xx +P99 ITL (ms): xxx.xx +----------------End-to-end Latency---------------- +Mean E2EL (ms): xxx.xx +Median E2EL (ms): xxx.xx +P99 E2EL (ms): xxx.xx +================================================== +``` + +Explanations for key metrics: + +- `Median Time to First Token (TTFT)`: The typical time elapsed from when a request is sent until the first output token is generated. +- `Median Time Per Output Token (TPOT)`: The typical time required to generate each token after the first one. +- `Median Inter-Token Latency (ITL)`: The typical time delay between the completion of one token and the completion of the next. +- `Median End-to-End Latency (E2EL)`: The typical total time from when a request is submitted until the final token of the response is received. +- `Output token throughput`: The rate at which the system generates the output (generated) tokens. +- `Total Token Throughput`: The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens. 
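+
+As a quick sanity check on any run, the two throughput lines can be re-derived from the totals in the report: output token throughput is total generated tokens divided by benchmark duration, and total token throughput additionally counts the input tokens. The sketch below uses illustrative placeholder values, not measured results:
+
+`check_throughput.sh`
+```
+# Replace these placeholders with the values printed by vllm bench serve
+total_input_tokens=1048448
+total_generated_tokens=131072
+benchmark_duration_s=150
+
+awk -v gen="${total_generated_tokens}" -v inp="${total_input_tokens}" -v dur="${benchmark_duration_s}" 'BEGIN {
+  printf "Output token throughput (tok/s): %.2f\n", gen / dur
+  printf "Total token throughput (tok/s):  %.2f\n", (inp + gen) / dur
+}'
+```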
From 584c0c647bd4ac8d635112553130d845be10e68c Mon Sep 17 00:00:00 2001 From: zhuyuhua-v Date: Thu, 27 Nov 2025 16:38:21 +0800 Subject: [PATCH 2/5] update llama3.3-70b perf data Signed-off-by: zhuyuhua-v --- Llama/Llama3.3-70B-ROCm.md | 98 ++++++++++++++++++++------------------ 1 file changed, 51 insertions(+), 47 deletions(-) diff --git a/Llama/Llama3.3-70B-ROCm.md b/Llama/Llama3.3-70B-ROCm.md index 5112ed4..69292c0 100644 --- a/Llama/Llama3.3-70B-ROCm.md +++ b/Llama/Llama3.3-70B-ROCm.md @@ -117,11 +117,10 @@ Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruc export SAFETENSORS_FAST_GPU=1 export VLLM_ROCM_USE_AITER=1 export VLLM_USE_V1=1 -export VLLM_USE_TRITON_FLASH_ATTN=0 export NCCL_DEBUG=WARN export VLLM_RPC_TIMEOUT=1800000 -vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV/ \ +vllm serve amd/Llama-3.3-70B-Instruct-FP8-KV/ \ --tensor-parallel-size 1 \ --max-num-batched-tokens 32768 \ --port 8000 \ @@ -140,7 +139,6 @@ vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV/ \ export SAFETENSORS_FAST_GPU=1 export VLLM_ROCM_USE_AITER=1 export VLLM_USE_V1=1 -export VLLM_USE_TRITON_FLASH_ATTN=0 export NCCL_DEBUG=WARN export VLLM_RPC_TIMEOUT=1800000 export VLLM_ROCM_USE_AITER_FP4_ASM_GEMM=1 @@ -206,10 +204,10 @@ lm_eval \ --batch_size 100 \ ``` -Here is an example accuracy result with the amd/Llama-3.1-70B-Instruct-FP8-KV/ model on one MI355 GPU: +Here is an example accuracy result with the amd/Llama-3.3-70B-Instruct-FP8-KV/ model on one MI355 GPU: ``` -local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100 +local-completions (model=/data/pretrained-models/amd/Llama-3.3-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100 |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9052|± |0.0081| @@ -237,10 +235,8 @@ output_tokens=1024 max_concurrency=64 num_prompts=128 -# model="/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/" +# model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-FP8-KV/" # model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/" -# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/" -# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-MXFP4-Preview/" vllm bench serve \ --host localhost \ @@ -269,61 +265,69 @@ Explanations for the flags: Sample output by the `vllm bench serve` command: -`amd/Llama-3.1-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355 +`amd/Llama-3.3-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355 ``` ============ Serving Benchmark Result ============ -Successful requests: xxxxxx -Benchmark duration (s): xxx.xx -Total input tokens: xxxxxx -Total generated tokens: xxxxxx -Request throughput (req/s): xxx.xx -Output token throughput (tok/s): xxx.xx -Total Token throughput (tok/s): xxx.xx +Successful requests: 128 +Failed requests: 0 +Maximum request concurrency: 64 +Benchmark duration (s): 149.97 +Total input tokens: 1048448 +Total generated tokens: 131072 +Request throughput (req/s): 0.85 +Output token throughput (tok/s): 873.98 +Peak output token throughput (tok/s): 1600.00 +Peak concurrent requests: 70.00 +Total Token throughput (tok/s): 7864.95 ---------------Time to First Token---------------- -Mean TTFT (ms): xxx.xx 
-Median TTFT (ms): xxx.xx -P99 TTFT (ms): xxx.xx +Mean TTFT (ms): 11639.66 +Median TTFT (ms): 7543.86 +P99 TTFT (ms): 31797.74 -----Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): xxx.xx -Median TPOT (ms): xxx.xx -P99 TPOT (ms): xxx.xx +Mean TPOT (ms): 61.81 +Median TPOT (ms): 66.91 +P99 TPOT (ms): 72.17 ---------------Inter-token Latency---------------- -Mean ITL (ms): xxx.xx -Median ITL (ms): xxx.xx -P99 ITL (ms): xxx.xx +Mean ITL (ms): 61.81 +Median ITL (ms): 42.31 +P99 ITL (ms): 1068.18 ----------------End-to-end Latency---------------- -Mean E2EL (ms): xxx.xx -Median E2EL (ms): xxx.xx -P99 E2EL (ms): xxx.xx +Mean E2EL (ms): 74875.13 +Median E2EL (ms): 74736.83 +P99 E2EL (ms): 101320.40 ================================================== ``` `amd/Llama-3.3-70B-Instruct-MXFP4-Preview` TP1 8k/1k conc=64 performance on MI355 ``` ============ Serving Benchmark Result ============ -Successful requests: xxxxxx -Benchmark duration (s): xxx.xx -Total input tokens: xxxxxx -Total generated tokens: xxxxxx -Request throughput (req/s): xxx.xx -Output token throughput (tok/s): xxx.xx -Total Token throughput (tok/s): xxx.xx +Successful requests: 128 +Failed requests: 0 +Maximum request concurrency: 64 +Benchmark duration (s): 131.16 +Total input tokens: 1048448 +Total generated tokens: 131072 +Request throughput (req/s): 0.98 +Output token throughput (tok/s): 999.30 +Peak output token throughput (tok/s): 1728.00 +Peak concurrent requests: 76.00 +Total Token throughput (tok/s): 8992.71 ---------------Time to First Token---------------- -Mean TTFT (ms): xxx.xx -Median TTFT (ms): xxx.xx -P99 TTFT (ms): xxx.xx +Mean TTFT (ms): 10128.30 +Median TTFT (ms): 6526.57 +P99 TTFT (ms): 26033.25 -----Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): xxx.xx -Median TPOT (ms): xxx.xx -P99 TPOT (ms): xxx.xx +Mean TPOT (ms): 54.10 +Median TPOT (ms): 57.28 +P99 TPOT (ms): 62.95 ---------------Inter-token Latency---------------- -Mean ITL (ms): xxx.xx -Median ITL (ms): xxx.xx -P99 ITL (ms): xxx.xx +Mean ITL (ms): 54.10 +Median ITL (ms): 39.05 +P99 ITL (ms): 411.51 ----------------End-to-end Latency---------------- -Mean E2EL (ms): xxx.xx -Median E2EL (ms): xxx.xx -P99 E2EL (ms): xxx.xx +Mean E2EL (ms): 65475.58 +Median E2EL (ms): 65305.68 +P99 E2EL (ms): 84989.10 ================================================== ``` From c8708f312650022d9307eebaf81e008c7d1e9fbf Mon Sep 17 00:00:00 2001 From: zhuyuhua-v Date: Thu, 27 Nov 2025 16:42:04 +0800 Subject: [PATCH 3/5] update 405b recipe Signed-off-by: zhuyuhua-v --- Llama/Llama3.1-405B-ROCm.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/Llama/Llama3.1-405B-ROCm.md b/Llama/Llama3.1-405B-ROCm.md index d5eee85..4f99dc3 100644 --- a/Llama/Llama3.1-405B-ROCm.md +++ b/Llama/Llama3.1-405B-ROCm.md @@ -1,8 +1,8 @@ -# Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355 +# Quick Start Recipe for Llama 3.1 405B on vLLM - AMD MI355 ## Introduction -This quick start recipe provides step-by-step instructions for running the Llama 3.3-70B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output. +This quick start recipe provides step-by-step instructions for running the Llama 3.1-405B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). 
It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output. The recipe is intended for developers and practitioners seeking high-throughput or low-latency inference using ROCm's accelerated stack—building a docker image with vLLM for model serving, FlashInfer for optimized ROCm kernels, and ModelOpt to enable FP8 and MXFP4 quantized execution. @@ -10,7 +10,7 @@ The recipe is intended for developers and practitioners seeking high-throughput ### License -To use Llama 3.3-70B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license. +To use Llama 3.1-405B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license. ### Weights @@ -110,7 +110,7 @@ PYTORCH_ROCM_ARCH="gfx950" python3 setup.py develop ### Launch the vLLM Server -Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruct-FP4/FP8 model. +Below is an example command to launch the vLLM server with Llama-3.1-405B-Instruct-FP4/FP8 model. `launch_server_llama_fp8.sh` ``` @@ -209,7 +209,7 @@ lm_eval \ Here is an example accuracy result with the amd/Llama-3.1-405B-Instruct-FP8-KV model on one MI355 GPU: ``` -local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100 +local-completions (model=/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100 |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9052|± |0.0081| From 4c19cf7d747354cb72cee8fcb7e69a463470c911 Mon Sep 17 00:00:00 2001 From: zhuyuhua-v Date: Thu, 27 Nov 2025 17:06:35 +0800 Subject: [PATCH 4/5] update 405b perf data Signed-off-by: zhuyuhua-v --- Llama/Llama3.1-405B-ROCm.md | 93 ++++++++++++++++++++----------------- 1 file changed, 50 insertions(+), 43 deletions(-) diff --git a/Llama/Llama3.1-405B-ROCm.md b/Llama/Llama3.1-405B-ROCm.md index 4f99dc3..33f9794 100644 --- a/Llama/Llama3.1-405B-ROCm.md +++ b/Llama/Llama3.1-405B-ROCm.md @@ -212,8 +212,8 @@ Here is an example accuracy result with the amd/Llama-3.1-405B-Instruct-FP8-KV m local-completions (model=/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100 |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| -|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9052|± |0.0081| -| | |strict-match | 5|exact_match|↑ |0.8575|± |0.0096| +|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9333|± |0.0069| +| | |strict-match | 5|exact_match|↑ |0.9295|± |0.0071| ``` Here is an example accuracy result with the amd/Llama-3.1-405B-Instruct-MXFP4-Preview model on one MI355 GPU: @@ -222,8 
+222,8 @@ Here is an example accuracy result with the amd/Llama-3.1-405B-Instruct-MXFP4-Pr local-completions (model=/data/pretrained-models/amd/Llama-3.1-405B-Instruct-MXFP4-Preview/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100 |Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr| |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:| -|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8954|± |0.0084| -| | |strict-match | 5|exact_match|↑ |0.8317|± |0.0103| +|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9371|± |0.0067| +| | |strict-match | 5|exact_match|↑ |0.9333|± |0.0069| ``` ### Benchmarking Performance @@ -270,58 +270,65 @@ Sample output by the `vllm bench serve` command: `amd/Llama-3.1-405B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355 ``` ============ Serving Benchmark Result ============ -Successful requests: xxxxxx -Benchmark duration (s): xxx.xx -Total input tokens: xxxxxx -Total generated tokens: xxxxxx -Request throughput (req/s): xxx.xx -Output token throughput (tok/s): xxx.xx -Total Token throughput (tok/s): xxx.xx +Successful requests: 128 +Failed requests: 0 +Maximum request concurrency: 64 +Benchmark duration (s): 158.82 +Total input tokens: 1048448 +Total generated tokens: 131072 +Request throughput (req/s): 0.81 +Output token throughput (tok/s): 825.30 +Peak output token throughput (tok/s): 2018.00 +Peak concurrent requests: 80.00 +Total Token throughput (tok/s): 7426.92 ---------------Time to First Token---------------- -Mean TTFT (ms): xxx.xx -Median TTFT (ms): xxx.xx -P99 TTFT (ms): xxx.xx +Mean TTFT (ms): 18196.10 +Median TTFT (ms): 12043.65 +P99 TTFT (ms): 46479.77 -----Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): xxx.xx -Median TPOT (ms): xxx.xx -P99 TPOT (ms): xxx.xx +Mean TPOT (ms): 59.72 +Median TPOT (ms): 65.29 +P99 TPOT (ms): 75.74 ---------------Inter-token Latency---------------- -Mean ITL (ms): xxx.xx -Median ITL (ms): xxx.xx -P99 ITL (ms): xxx.xx +Mean ITL (ms): 59.72 +Median ITL (ms): 32.36 +P99 ITL (ms): 762.26 ----------------End-to-end Latency---------------- -Mean E2EL (ms): xxx.xx -Median E2EL (ms): xxx.xx -P99 E2EL (ms): xxx.xx +Mean E2EL (ms): 79291.69 +Median E2EL (ms): 79202.23 +P99 E2EL (ms): 114016.62 ================================================== -``` `amd/Llama-3.1-405B-Instruct-MXFP4-Preview` TP1 8k/1k conc=64 performance on MI355 ``` ============ Serving Benchmark Result ============ -Successful requests: xxxxxx -Benchmark duration (s): xxx.xx -Total input tokens: xxxxxx -Total generated tokens: xxxxxx -Request throughput (req/s): xxx.xx -Output token throughput (tok/s): xxx.xx -Total Token throughput (tok/s): xxx.xx +Successful requests: 128 +Failed requests: 0 +Maximum request concurrency: 64 +Benchmark duration (s): 147.60 +Total input tokens: 1048448 +Total generated tokens: 131072 +Request throughput (req/s): 0.87 +Output token throughput (tok/s): 888.04 +Peak output token throughput (tok/s): 2112.00 +Peak concurrent requests: 72.00 +Total Token throughput (tok/s): 7991.51 ---------------Time to First Token---------------- -Mean TTFT (ms): xxx.xx -Median TTFT (ms): xxx.xx -P99 TTFT (ms): xxx.xx +Mean TTFT (ms): 16456.00 +Median TTFT (ms): 10925.81 +P99 TTFT (ms): 42157.67 -----Time per Output Token (excl. 
1st token)------ -Mean TPOT (ms): xxx.xx -Median TPOT (ms): xxx.xx -P99 TPOT (ms): xxx.xx +Mean TPOT (ms): 55.94 +Median TPOT (ms): 61.01 +P99 TPOT (ms): 70.48 ---------------Inter-token Latency---------------- -Mean ITL (ms): xxx.xx -Median ITL (ms): xxx.xx -P99 ITL (ms): xxx.xx +Mean ITL (ms): 55.94 +Median ITL (ms): 31.13 +P99 ITL (ms): 672.20 ----------------End-to-end Latency---------------- -Mean E2EL (ms): xxx.xx -Median E2EL (ms): xxx.xx -P99 E2EL (ms): xxx.xx +Mean E2EL (ms): 73682.70 +Median E2EL (ms): 73568.78 +P99 E2EL (ms): 105183.02 ================================================== ``` From be7e1c5db0cdd126fbde248b04058d1ffc33d309 Mon Sep 17 00:00:00 2001 From: zhuyuhua-v Date: Thu, 27 Nov 2025 17:08:53 +0800 Subject: [PATCH 5/5] update safetensors installation guide Signed-off-by: zhuyuhua-v --- Llama/Llama3.1-405B-ROCm.md | 2 +- Llama/Llama3.3-70B-ROCm.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Llama/Llama3.1-405B-ROCm.md b/Llama/Llama3.1-405B-ROCm.md index 33f9794..99df763 100644 --- a/Llama/Llama3.1-405B-ROCm.md +++ b/Llama/Llama3.1-405B-ROCm.md @@ -91,7 +91,7 @@ PYTORCH_ROCM_ARCH="gfx950" python3 setup.py develop ``` (Optional) Recommended to install when loading weight from local storage. - - Install from wheel (only have python 3.10 wheel for now and is compatible with the docker image `rocm/ali-private:ubuntu22.04_rocm7.0.1.42_vllm_5b842c2_aiter_6b586ae_torch2.8.0_20250917`) + - Install from wheel (only have python 3.10 wheel for now) ```shell python3 -m pip install https://github.com/EmbeddedLLM/fastsafetensors-rocm/releases/download/v0.1.15-rocm7-preview/fastsafetensors-0.1.15-cp310-cp310-linux_x86_64.whl diff --git a/Llama/Llama3.3-70B-ROCm.md b/Llama/Llama3.3-70B-ROCm.md index 69292c0..52d6c4b 100644 --- a/Llama/Llama3.3-70B-ROCm.md +++ b/Llama/Llama3.3-70B-ROCm.md @@ -91,7 +91,7 @@ PYTORCH_ROCM_ARCH="gfx950" python3 setup.py develop ``` (Optional) Recommended to install when loading weight from local storage. - - Install from wheel (only have python 3.10 wheel for now and is compatible with the docker image `rocm/ali-private:ubuntu22.04_rocm7.0.1.42_vllm_5b842c2_aiter_6b586ae_torch2.8.0_20250917`) + - Install from wheel (only have python 3.10 wheel for now) ```shell python3 -m pip install https://github.com/EmbeddedLLM/fastsafetensors-rocm/releases/download/v0.1.15-rocm7-preview/fastsafetensors-0.1.15-cp310-cp310-linux_x86_64.whl