[ROCm] add Llama-70b & Llama-405b recipes #127
base: main
Conversation
Signed-off-by: zhuyuhua-v <[email protected]>
Summary of Changes

Hello @zhuyuhua-v, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces two new quick-start recipes for deploying and evaluating the Llama 3.1-405B and Llama 3.3-70B large language models on AMD MI355 GPUs. These guides walk users through the entire process: environment setup, model weight acquisition, launching vLLM servers with FP8 and MXFP4 quantization, and finally validating model accuracy and benchmarking performance on ROCm platforms.
Code Review
This pull request adds two new recipe files for running Llama models on ROCm platforms. The recipes are for Llama-3.1-405B and Llama-3.3-70B. My review found several inconsistencies in both files, likely due to copy-pasting. Many references to model names and versions are incorrect, which could confuse users. I've left specific comments to correct these inconsistencies. Please review them carefully to ensure the recipes are accurate and easy to follow.
Llama/Llama3.1-405B-ROCm.md
Outdated
@@ -0,0 +1,335 @@
# Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355
The title of the recipe is "Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355", which is inconsistent with the filename Llama3.1-405B-ROCm.md. It seems this file should be about the 405B model.
Suggested change:
- # Quick Start Recipe for Llama 3.3 70B on vLLM - AMD MI355
+ # Quick Start Recipe for Llama 3.1 405B on vLLM - AMD MI355
Llama/Llama3.1-405B-ROCm.md
Outdated
## Introduction

This quick start recipe provides step-by-step instructions for running the Llama 3.3-70B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output.
The introduction mentions running the "Llama 3.3-70B Instruct model", which contradicts the filename and the model weights specified later in the document (which are for Llama 3.1 405B).
Suggested change:
- This quick start recipe provides step-by-step instructions for running the Llama 3.3-70B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output.
+ This quick start recipe provides step-by-step instructions for running the Llama 3.1-405B Instruct model using vLLM with FP8 and FP4 quantization, optimized for AMD GPUs(MI355). It covers the complete setup required; from accessing model weights and preparing the software environment to configuring vLLM parameters, launching the server, and validating inference output.
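The "validating inference output" step mentioned in that introduction can be smoke-tested against the server's OpenAI-compatible endpoint once it is up. A hedged sketch follows; the port and model path are assumptions taken from the recipe's other examples, not confirmed values for every configuration:

```shell
# Minimal smoke test against a running vLLM server's OpenAI-compatible API.
# PORT and MODEL are illustrative assumptions based on the recipe's examples.
PORT=6789
MODEL="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/"
BODY="{\"model\": \"${MODEL}\", \"prompt\": \"Hello\", \"max_tokens\": 16}"
# Echo the command rather than running it, since it needs a live server:
echo curl -s "http://127.0.0.1:${PORT}/v1/completions" \
  -H "Content-Type: application/json" -d "$BODY"
```

Remove the leading `echo` to actually issue the request against a running server.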
Llama/Llama3.1-405B-ROCm.md
Outdated
### License

To use Llama 3.3-70B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.
The license section refers to "Llama 3.3-70B", but this document is for "Llama 3.1-405B".
Suggested change:
- To use Llama 3.3-70B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.
+ To use Llama 3.1-405B, you must first agree to Meta’s Llama 3 Community License (https://ai.meta.com/resources/models-and-libraries/llama-downloads/). AMD’s quantized versions (FP8 and FP4) are built on top of the base model and are available for research and commercial use under the same license.
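Once the license is accepted, the gated weights can be fetched from Hugging Face. A hedged download sketch; the repo ID and destination path are assumptions based on the paths used elsewhere in the recipe:

```shell
# Assumes a prior `huggingface-cli login` with a token that has accepted the
# Llama license; repo ID and destination directory are illustrative.
MODEL_REPO="amd/Llama-3.1-405B-Instruct-FP8-KV"
DEST="/data/pretrained-models/${MODEL_REPO}"
# Echoed rather than executed; remove `echo` to actually download.
echo huggingface-cli download "$MODEL_REPO" --local-dir "$DEST"
```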
Llama/Llama3.1-405B-ROCm.md
Outdated
### Launch the vLLM Server

Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruct-FP4/FP8 model.
The text says the example command is for "Llama-3.3-70B-Instruct-FP4/FP8 model", but the commands that follow are for "Llama-3.1-405B".
Suggested change:
- Below is an example command to launch the vLLM server with Llama-3.3-70B-Instruct-FP4/FP8 model.
+ Below is an example command to launch the vLLM server with Llama-3.1-405B-Instruct-FP4/FP8 model.
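For reference, a hedged sketch of what such a 405B launch command might look like; the tensor-parallel size, port, and flags are illustrative assumptions, not the recipe's tuned values:

```shell
# Environment settings mirror the recipe's examples; flag values below are
# assumptions (405B in FP8 generally needs a full 8-GPU node).
export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000
MODEL="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/"
TP=8          # assumed tensor-parallel degree for one 8-GPU MI355 node
PORT=6789     # matches the base_url used in the recipe's accuracy logs
CMD="vllm serve $MODEL --tensor-parallel-size $TP --port $PORT"
echo "$CMD"   # run on the server node once weights are in place
```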
Llama/Llama3.1-405B-ROCm.md
Outdated
Here is an example accuracy result with the amd/Llama-3.1-405B-Instruct-FP8-KV model on one MI355 GPU:

local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
The example accuracy result log shows a model path for Llama-3.1-70B-Instruct-FP8-KV, but the section header above it says it's for amd/Llama-3.1-405B-Instruct-FP8-KV. The log output is inconsistent with the context.
Suggested change:
- local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
+ local-completions (model=/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
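That log line comes from lm-evaluation-harness's `local-completions` backend pointed at the running server. A hedged reconstruction of the invocation; the task name is an assumption, since the recipe's exact task list is not quoted here:

```shell
# Reconstructed lm_eval call; model path and base_url match the quoted log,
# while the gsm8k task is an illustrative assumption.
MODEL_PATH="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/"
BASE_URL="http://127.0.0.1:6789/v1/completions"
ARGS="model=${MODEL_PATH},base_url=${BASE_URL}"
# Echoed rather than executed, since it needs a live server:
echo lm_eval --model local-completions --model_args "$ARGS" \
  --tasks gsm8k --batch_size 100
```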
Llama/Llama3.3-70B-ROCm.md
Outdated
export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000

vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV/ \
Llama/Llama3.3-70B-ROCm.md
Outdated
--batch_size 100 \

Here is an example accuracy result with the amd/Llama-3.1-70B-Instruct-FP8-KV/ model on one MI355 GPU:
The accuracy result section for the FP8 model refers to amd/Llama-3.1-70B-Instruct-FP8-KV/, which is inconsistent with the Llama 3.3 version this recipe is for.
Suggested change:
- Here is an example accuracy result with the amd/Llama-3.1-70B-Instruct-FP8-KV/ model on one MI355 GPU:
+ Here is an example accuracy result with the amd/Llama-3.3-70B-Instruct-FP8-KV/ model on one MI355 GPU:
Llama/Llama3.3-70B-ROCm.md
Outdated
Here is an example accuracy result with the amd/Llama-3.1-70B-Instruct-FP8-KV/ model on one MI355 GPU:

local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
The example accuracy log for the FP8 model shows a path for Llama-3.1-70B-Instruct-FP8-KV. This should be updated to Llama-3.3-70B-Instruct-FP8-KV to be consistent with the recipe.
Suggested change:
- local-completions (model=/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
+ local-completions (model=/data/pretrained-models/amd/Llama-3.3-70B-Instruct-FP8-KV/,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
Llama/Llama3.3-70B-ROCm.md
Outdated
# model="/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/"
# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/"
# model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-MXFP4-Preview/"
The commented-out model paths in the performance benchmark script include irrelevant models (405B) and an incorrect version for the 70B model. This can be confusing. It would be clearer to only include the models relevant to this recipe.
Suggested change:
- # model="/data/pretrained-models/amd/Llama-3.1-70B-Instruct-FP8-KV/"
- # model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/"
- # model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-FP8-KV/"
- # model="/data/pretrained-models/amd/Llama-3.1-405B-Instruct-MXFP4-Preview/"
+ # model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-FP8-KV/"
+ # model="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-MXFP4-Preview/"
Llama/Llama3.3-70B-ROCm.md
Outdated
Sample output by the `vllm bench serve` command:

`amd/Llama-3.1-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355
The sample output header for the FP8 model performance refers to amd/Llama-3.1-70B-Instruct-FP8-KV. This should be updated to Llama-3.3 to be consistent.
Suggested change:
- `amd/Llama-3.1-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355
+ `amd/Llama-3.3-70B-Instruct-FP8-KV` TP1 8k/1k conc=64 performance on MI355
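The "TP1 8k/1k conc=64" header maps onto a `vllm bench serve` run against the running server. A hedged sketch of such an invocation; the dataset choice and flag values are assumptions, not the recipe's exact settings:

```shell
# 8k input / 1k output / concurrency 64, per the sample-output header; the
# random dataset and the specific flag values are illustrative assumptions.
MODEL="/data/pretrained-models/amd/Llama-3.3-70B-Instruct-FP8-KV/"
ISL=8192; OSL=1024; CONC=64
# Echoed rather than executed, since it needs a live server:
echo vllm bench serve --model "$MODEL" \
  --dataset-name random --random-input-len "$ISL" --random-output-len "$OSL" \
  --max-concurrency "$CONC" --port 6789
```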
This PR adds recipes for Llama-70B and Llama-405B running on ROCm platforms.
The recipes are subject to change, as some optimizations are still in progress.