
Conversation

@zhuyuhua-v

This PR adds recipes for Llama-70B and Llama-405B running on ROCm platforms.
The recipes are subject to change, as some optimizations are still on the way.

Signed-off-by: zhuyuhua-v <[email protected]>
@gemini-code-assist
Contributor

Summary of Changes

Hello @zhuyuhua-v, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces two new quick-start recipes designed to facilitate the deployment and evaluation of Llama 3.1-405B and Llama 3.3-70B large language models on AMD MI355 GPUs. These comprehensive guides walk users through the entire process, from environment setup and model weight acquisition to launching vLLM servers with FP8 and MXFP4 quantization, and finally, validating model accuracy and benchmarking performance on ROCm platforms.

Highlights

  • New Llama Recipes for ROCm: Added quick start recipes for Llama 3.1-405B and Llama 3.3-70B models, specifically optimized for AMD MI355 GPUs on ROCm platforms.
  • Quantization Support: The recipes include detailed instructions for running models with both FP8 and MXFP4 quantization to balance performance and memory usage.
  • Comprehensive Deployment Guide: Each recipe provides end-to-end guidance covering Docker setup, AITER and vLLM installation, server launch commands, and validation steps for accuracy and performance benchmarking.
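The end-to-end flow these recipes cover can be condensed into a short shell sketch. The Docker image tag, model path, tensor-parallel size, and port below are placeholders chosen for illustration, not the recipes' actual values:

```shell
# 1. Start a ROCm-enabled vLLM container with GPU access
#    (image tag is an assumption; the recipes name the exact image)
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
  rocm/vllm-dev:latest

# 2. Inside the container, launch the OpenAI-compatible server
vllm serve amd/Llama-3.3-70B-Instruct-FP8-KV \
  --tensor-parallel-size 1 --port 6789

# 3. Validate inference output against the running server
curl http://127.0.0.1:6789/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "amd/Llama-3.3-70B-Instruct-FP8-KV", "prompt": "Hello", "max_tokens": 16}'
```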
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request adds two new recipe files for running Llama models on ROCm platforms. The recipes are for Llama-3.1-405B and Llama-3.3-70B. My review found several inconsistencies in both files, likely due to copy-pasting. Many references to model names and versions are incorrect, which could confuse users. I've left specific comments to correct these inconsistencies. Please review them carefully to ensure the recipes are accurate and easy to follow.

@@ -0,0 +1,335 @@
# Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355
Contributor


medium

The title of the recipe is "Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355", which is inconsistent with the filename Llama3.1-405B-ROCm.md. It seems this file should be about the 405B model.

Suggested change
# Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355
# Quick Start Recipe for Llama 3.1 405B on vLLM - AMD MI355


## Introduction

This quick start recipe provides step-by-step instructions for running the Llama 3.3-70B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output.
Contributor


medium

The introduction mentions running the "Llama 3.3-70B Instruct model", which contradicts the filename and the model weights specified later in the document (which are for Llama 3.1 405B).

Suggested change
This quick start recipe provides step-by-step instructions for running the Llama 3.3-70B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output.
This quick start recipe provides step-by-step instructions for running the Llama 3.1-405B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output.


### License

To use Llama 3.3-70B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.
Contributor


medium

The license section refers to "Llama 3.3-70B", but this document is for "Llama 3.1-405B".

Suggested change
To use Llama 3.3-70B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.
To use Llama 3.1-405B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.


### Launch the vLLM Server

Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruct-FP4/FP8 model.
Contributor


medium

The text says the example command is for "Llama-3.3-70B-Instruct-FP4/FP8 model", but the commands that follow are for "Llama-3.1-405B".

Suggested change
Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruct-FP4/FP8 model.
Below is an example command to launch the vLLM server with Llama-3.1-405B-Instruct-FP4/FP8 model.

Here is an example accuracy result with the amd/Llama-3.1-405B-Instruct-FP8-KV model on one MI355 GPU:

```
local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
Contributor


medium

The example accuracy result log shows a model path for Llama-3.1-70B-Instruct-FP8-KV, but the section header above it says it's for amd/Llama-3.1-405B-Instruct-FP8-KV. The log output is inconsistent with the context.

Suggested change
local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
local-completions (model=/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100

export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000

vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV/ \
Contributor


medium

The launch command for the FP8 model uses amd/Llama-3.1-70B-Instruct-FP8-KV/, but this recipe is for Llama 3.3. The model name should be amd/Llama-3.3-70B-Instruct-FP8-KV/.

Suggested change
vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV/ \
vllm serve amd/Llama-3.3-70B-Instruct-FP8-KV/ \

--batch_size 100 \
```

Here is an example accuracy result with the amd/Llama-3.1-70B-Instruct-FP8-KV/ model on one MI355 GPU:
Contributor


medium

The accuracy result section for the FP8 model refers to amd/Llama-3.1-70B-Instruct-FP8-KV/, which is inconsistent with the Llama 3.3 version this recipe is for.

Suggested change
Here is an example accuracy result with the amd/Llama-3.1-70B-Instruct-FP8-KV/ model on one MI355 GPU:
Here is an example accuracy result with the amd/Llama-3.3-70B-Instruct-FP8-KV/ model on one MI355 GPU:

Here is an example accuracy result with the amd/Llama-3.1-70B-Instruct-FP8-KV/ model on one MI355 GPU:

```
local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
Contributor


medium

The example accuracy log for the FP8 model shows a path for Llama-3.1-70B-Instruct-FP8-KV. This should be updated to Llama-3.3-70B-Instruct-FP8-KV to be consistent with the recipe.

Suggested change
local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
local-completions (model=/data/pretrained-models/amd/Llama-3.3-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
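Accuracy logs in this `local-completions` format are typically produced by lm-evaluation-harness pointed at a running vLLM server. A minimal sketch, assuming the server from the recipe is already listening on 127.0.0.1:6789 and `lm_eval` is installed (the task name and batch size here are illustrative, not the recipe's exact invocation):

```shell
# Run an lm-evaluation-harness task against the OpenAI-compatible
# completions endpoint exposed by the vLLM server.
lm_eval \
  --model local-completions \
  --model_args "model=/data/pretrained-models/amd/Llama-3.3-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions" \
  --tasks gsm8k \
  --batch_size 100
```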

Comment on lines 240 to 243
# model="/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/"
# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-MXFP4-Preview/"
Contributor


medium

The commented-out model paths in the performance benchmark script include irrelevant models (405B) and an incorrect version for the 70B model. This can be confusing. It would be clearer to only include the models relevant to this recipe.

Suggested change
# model="/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/"
# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-MXFP4-Preview/"
# model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/"


Sample output by the `vllm bench serve` command:

`amd/Llama-3.1-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355
Contributor


medium

The sample output header for the FP8 model performance refers to amd/Llama-3.1-70B-Instruct-FP8-KV. This should be updated to Llama-3.3 to be consistent.

Suggested change
`amd/Llama-3.1-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355
`amd/Llama-3.3-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355
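A performance result in the "TP1 8k/1k conc=64" configuration would typically come from a `vllm bench serve` invocation along these lines, run against the already-launched server. This is a sketch: the dataset choice, prompt count, and port are assumptions, not the recipe's exact command:

```shell
# Benchmark the serving endpoint with ~8k-token inputs, ~1k-token outputs,
# and 64 concurrent requests, matching the configuration referenced above.
vllm bench serve \
  --model amd/Llama-3.3-70B-Instruct-FP8-KV \
  --port 6789 \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --max-concurrency 64 \
  --num-prompts 64
```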

@zhuyuhua-v zhuyuhua-v changed the title [ROCm] add Llama-70b & Llama-405b recipe [ROCm] add Llama-70b & Llama-405b recipes Nov 27, 2025