
Conversation

@zhuyuhua-v

This PR adds recipes for Llama-70B and Llama-405B running on ROCm platforms.
The recipes are subject to change, as some optimizations are still on the way.

Signed-off-by: zhuyuhua-v <[email protected]>
@gemini-code-assist
Contributor

Summary of Changes

Hello @zhuyuhua-v, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces two new quick-start recipes designed to facilitate the deployment and evaluation of Llama 3.1-405B and Llama 3.3-70B large language models on AMD MI355 GPUs. These comprehensive guides walk users through the entire process, from environment setup and model weight acquisition to launching vLLM servers with FP8 and MXFP4 quantization, and finally, validating model accuracy and benchmarking performance on ROCm platforms.

Highlights

  • New Llama Recipes for ROCm: Added quick start recipes for Llama 3.1-405B and Llama 3.3-70B models, specifically optimized for AMD MI355 GPUs on ROCm platforms.
  • Quantization Support: The recipes include detailed instructions for running models with both FP8 and MXFP4 quantization to balance performance and memory usage.
  • Comprehensive Deployment Guide: Each recipe provides end-to-end guidance covering Docker setup, AITER and vLLM installation, server launch commands, and validation steps for accuracy and performance benchmarking.
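The end-to-end flow these recipes cover can be condensed into a short shell sketch. The Docker image tag, model path, tensor-parallel size, and port below are placeholders chosen for illustration, not the recipes' actual values:

```shell
# 1. Start a ROCm-enabled vLLM container with GPU access
#    (image tag is an assumption; the recipes name the exact image)
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
  rocm/vllm-dev:latest

# 2. Inside the container, launch the OpenAI-compatible server
vllm serve amd/Llama-3.3-70B-Instruct-FP8-KV \
  --tensor-parallel-size 1 --port 6789

# 3. Validate inference output against the running server
curl http://127.0.0.1:6789/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "amd/Llama-3.3-70B-Instruct-FP8-KV", "prompt": "Hello", "max_tokens": 16}'
```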
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request adds two new recipe files for running Llama models on ROCm platforms. The recipes are for Llama-3.1-405B and Llama-3.3-70B. My review found several inconsistencies in both files, likely due to copy-pasting. Many references to model names and versions are incorrect, which could confuse users. I've left specific comments to correct these inconsistencies. Please review them carefully to ensure the recipes are accurate and easy to follow.

@@ -0,0 +1,335 @@
# Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355
Contributor


medium

The title of the recipe is "Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355", which is inconsistent with the filename Llama3.1-405B-ROCm.md. It seems this file should be about the 405B model.

Suggested change
# Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355
# Quick Start Recipe for Llama 3.1 405B on vLLM - AMD MI355


## Introduction

This quick start recipe provides step-by-step instructions for running the Llama 3.3-70B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output.
Contributor


medium

The introduction mentions running the "Llama 3.3-70B Instruct model", which contradicts the filename and the model weights specified later in the document (which are for Llama 3.1 405B).

Suggested change
This quick start recipe provides step-by-step instructions for running the Llama 3.3-70B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output.
This quick start recipe provides step-by-step instructions for running the Llama 3.1-405B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output.


### License

To use Llama 3.3-70B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.
Contributor


medium

The license section refers to "Llama 3.3-70B", but this document is for "Llama 3.1-405B".

Suggested change
To use Llama 3.3-70B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.
To use Llama 3.1-405B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.


### Launch the vLLM Server

Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruct-FP4/FP8 model.
Contributor


medium

The text says the example command is for "Llama-3.3-70B-Instruct-FP4/FP8 model", but the commands that follow are for "Llama-3.1-405B".

Suggested change
Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruct-FP4/FP8 model.
Below is an example command to launch the vLLM server with Llama-3.1-405B-Instruct-FP4/FP8 model.

Here is an example accuracy result with the amd/Llama-3.1-405B-Instruct-FP8-KV model on one MI355 GPU:

```
local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
Contributor


medium

The example accuracy result log shows a model path for Llama-3.1-70B-Instruct-FP8-KV, but the section header above it says it's for amd/Llama-3.1-405B-Instruct-FP8-KV. The log output is inconsistent with the context.

Suggested change
local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
local-completions (model=/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100

export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000

vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV/ \
Contributor


medium

The launch command for the FP8 model uses amd/Llama-3.1-70B-Instruct-FP8-KV/, but this recipe is for Llama 3.3. The model name should be amd/Llama-3.3-70B-Instruct-FP8-KV/.

Suggested change
vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV/ \
vllm serve amd/Llama-3.3-70B-Instruct-FP8-KV/ \

--batch_size 100 \
```

Here is an example accuracy result with the amd/Llama-3.1-70B-Instruct-FP8-KV/ model on one MI355 GPU:
Contributor


medium

The accuracy result section for the FP8 model refers to amd/Llama-3.1-70B-Instruct-FP8-KV/, which is inconsistent with the Llama 3.3 version this recipe is for.

Suggested change
Here is an example accuracy result with the amd/Llama-3.1-70B-Instruct-FP8-KV/ model on one MI355 GPU:
Here is an example accuracy result with the amd/Llama-3.3-70B-Instruct-FP8-KV/ model on one MI355 GPU:

Here is an example accuracy result with the amd/Llama-3.1-70B-Instruct-FP8-KV/ model on one MI355 GPU:

```
local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
Contributor


medium

The example accuracy log for the FP8 model shows a path for Llama-3.1-70B-Instruct-FP8-KV. This should be updated to Llama-3.3-70B-Instruct-FP8-KV to be consistent with the recipe.

Suggested change
local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
local-completions (model=/data/pretrained-models/amd/Llama-3.3-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
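Accuracy logs in this `local-completions` format are typically produced by lm-evaluation-harness pointed at a running vLLM server. A minimal sketch, assuming the server from the recipe is already listening on 127.0.0.1:6789 and `lm_eval` is installed (the task name and batch size here are illustrative, not the recipe's exact invocation):

```shell
# Run an lm-evaluation-harness task against the OpenAI-compatible
# completions endpoint exposed by the vLLM server.
lm_eval \
  --model local-completions \
  --model_args "model=/data/pretrained-models/amd/Llama-3.3-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions" \
  --tasks gsm8k \
  --batch_size 100
```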

Comment on lines 240 to 243
# model="/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/"
# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-MXFP4-Preview/"
Contributor


medium

The commented-out model paths in the performance benchmark script include irrelevant models (405B) and an incorrect version for the 70B model. This can be confusing. It would be clearer to only include the models relevant to this recipe.

Suggested change
# model="/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/"
# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-MXFP4-Preview/"
# model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/"


Sample output by the `vllm bench serve` command:

`amd/Llama-3.1-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355
Contributor


medium

The sample output header for the FP8 model performance refers to amd/Llama-3.1-70B-Instruct-FP8-KV. This should be updated to Llama-3.3 to be consistent.

Suggested change
`amd/Llama-3.1-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355
`amd/Llama-3.3-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355
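A performance result in the "TP1 8k/1k conc=64" configuration would typically come from a `vllm bench serve` invocation along these lines, run against the already-launched server. This is a sketch: the dataset choice, prompt count, and port are assumptions, not the recipe's exact command:

```shell
# Benchmark the serving endpoint with ~8k-token inputs, ~1k-token outputs,
# and 64 concurrent requests, matching the configuration referenced above.
vllm bench serve \
  --model amd/Llama-3.3-70B-Instruct-FP8-KV \
  --port 6789 \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --max-concurrency 64 \
  --num-prompts 64
```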

@zhuyuhua-v zhuyuhua-v changed the title [ROCm] add Llama-70b & Llama-405b recipe [ROCm] add Llama-70b & Llama-405b recipes Nov 27, 2025