Commit 460542b

Merge branch 'master' into fix/preprocessing-documentation
2 parents f24e859 + 2035f3f commit 460542b

11 files changed: +164, -36 lines


docs/benchmarks/image_classification/resnet50.md

Lines changed: 4 additions & 0 deletions
@@ -17,6 +17,8 @@ hide:

{{ mlperf_inference_implementation_readme (4, "resnet50", "nvidia") }}

+<!-->
+
=== "Intel"
## Intel MLPerf Implementation

@@ -31,3 +33,5 @@ hide:
## MLPerf Modular Implementation in C++

{{ mlperf_inference_implementation_readme (4, "resnet50", "cpp") }}
+
+-->

docs/benchmarks/language/bert.md

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,7 @@ hide:

{{ mlperf_inference_implementation_readme (4, "bert-99.9", "nvidia") }}

+<!--
=== "Intel"
## Intel MLPerf Implementation

@@ -32,3 +33,4 @@ hide:
{{ mlperf_inference_implementation_readme (4, "bert-99", "qualcomm") }}

{{ mlperf_inference_implementation_readme (4, "bert-99.9", "qualcomm") }}
+-->s

docs/benchmarks/language/gpt-j.md

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,7 @@ hide:

{{ mlperf_inference_implementation_readme (4, "gptj-99.9", "nvidia") }}

+<!--
=== "Intel"
## Intel MLPerf Implementation

@@ -35,3 +36,4 @@ hide:

{{ mlperf_inference_implementation_readme (4, "gptj-99", "qualcomm") }}

+-->

docs/benchmarks/language/llama2-70b.md

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,7 @@ hide:

{{ mlperf_inference_implementation_readme (4, "llama2-70b-99.9", "nvidia") }}

+<!--
=== "Neural Magic"
## Neural Magic MLPerf Implementation

@@ -32,3 +33,4 @@ hide:
{{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "amd") }}

{{ mlperf_inference_implementation_readme (4, "llama2-70b-99.9", "amd") }}
+-->
Lines changed: 105 additions & 0 deletions (new file)

---
hide:
  - toc
---

# Text Summarization with Llama2-70b for Student Cluster Competition 2025

## Introduction
This guide is designed for the [Student Cluster Competition 2025](https://sc25.supercomputing.org/students/student-cluster-competition/) to walk participants through running and optimizing the [MLPerf Inference Benchmark](https://arxiv.org/abs/1911.02549) using [Llama2 70b](https://github.com/mlcommons/inference/tree/master/language/llama2-70b) across various software and hardware configurations. The goal is to maximize system throughput (measured in tokens per second) without compromising accuracy. Since the model performs poorly on CPUs, it is essential to run it on GPUs.

For a valid MLPerf Inference submission in this competition, you must run both a performance test and an accuracy test—**no compliance runs are required**. We use the **Offline** scenario, where throughput is the key metric (higher is better). For Llama 2-70B with the OpenOrca dataset (24,576 samples), the **performance run** must process an integer multiple of the full dataset (24,576 × *N* samples; for example, N=2 means 49,152 samples), while the **accuracy run** must process **exactly** the full dataset (24,576 samples). These requirements are handled automatically by the MLPerf inference implementations. Setup for NVIDIA GPUs typically takes 2–3 hours and can be done offline. The final output is a tarball (`mlperf_submission.tar.gz`) containing MLPerf-compatible results, which can be submitted to the organizers via a CLI command.

## Scoring
In the SCC, your first objective will be to get a valid MLPerf benchmark run. Traditionally, running the reference MLPerf inference implementation (in Python) is easier than running the Nvidia MLPerf inference implementation. However, since SCC25 uses the Llama2-70b model, running the reference implementation needs around 600GB of VRAM and has been tested only on 8xH100 Nvidia GPUs. If you have less VRAM, a vendor implementation such as Nvidia's or AMD's is the best option.

MLCommons provides [automation](https://github.com/mlcommons/mlperf-automations/) for running the MLPerf inference benchmarks which you can make use of. The automation currently supports the reference implementation as well as the Nvidia implementation, and it is useful for getting a quick valid result because it produces the required final output. You can also follow the manual steps in the [reference](https://github.com/mlcommons/inference/tree/master/language/llama2-70b), [Nvidia](https://github.com/mlcommons/inference_results_v5.0/tree/main/closed/NVIDIA) or [AMD](https://github.com/mlcommons/inference_results_v5.0/tree/main/closed/AMD) implementation readmes.

Once the initial run is successful, you'll have the opportunity to optimize the benchmark further by maximizing system utilization, applying quantization techniques, adjusting ML frameworks, experimenting with batch sizes, and more, all of which can earn you additional points.

Since vendor implementations of the MLPerf inference benchmark vary, teams will compete within their respective hardware categories (e.g., Nvidia GPUs, AMD GPUs). Points will be awarded based on the throughput achieved on your system.

Additionally, significant bonus points will be awarded if your team enhances an existing implementation, enables multi-node execution, or adds/extends scripts in the [mlperf-automations repository](https://github.com/mlcommons/mlperf-automations/tree/dev/script) to support new devices, frameworks, implementations, etc. All improvements must be made publicly available under the Apache 2.0 license and submitted as pull requests by November 10, 2025; only code that is *merge ready* will be considered for evaluation. As a guideline, below are some examples that can earn you bonus points.
* Adding multi-node execution support for the Nvidia, AMD or reference implementations
* Supporting automation for the AMD implementation
* Supporting fp8/fp4 quantization for the reference implementation
* Automating the [network reference implementation](https://github.com/mlcommons/inference/blob/master/language/llama2-70b/SUT_API.py), which uses OpenAI-compatible endpoints (see the sketch after this list)
* The MLPerf automation supports Docker runs of the Nvidia implementation; supporting Apptainer would be a valuable contribution
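For orientation only: "OpenAI-compatible endpoints" means the network reference implementation forwards inference requests to an HTTP server exposing the OpenAI API surface (for example a vLLM server). The snippet below is a hypothetical sketch of such a request; the host, port, model name and prompt are placeholder assumptions, not values prescribed by the benchmark.

```bash
# Hypothetical example of querying an OpenAI-compatible completions endpoint.
# Host, port, model name and prompt are placeholders for illustration only.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-2-70b-chat-hf",
        "prompt": "Summarize the following article: ...",
        "max_tokens": 128
      }'
```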
PS: For any query regarding contributions, feel free to raise an issue in the [Inference](https://github.com/mlcommons/inference) or [MLPerf automations](https://github.com/mlcommons/mlperf-automations) repositories.

!!! info
    Both MLPerf and MLC automation are evolving projects.
    If you encounter issues related to SCC, please submit them [here](https://github.com/mlcommons/inference/issues) with the **scc-25** label,
    including the command used, error logs, and any additional useful information needed to debug the issue.
## Artifacts to submit to the SCC committee
You will need to submit the following files:

* `mlperf_submission.run` - the MLC commands used to run the MLPerf inference benchmark, saved to this file (an illustrative sketch follows this list).
* `mlperf_submission.md` - a description of your platform and some highlights of the MLPerf benchmark execution.
* The `<Team Name>` under which the results are pushed to the GitHub repository.
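As a rough illustration of what `mlperf_submission.run` could contain, the sketch below shows the general shape of an MLC run command. It is an assumed example only; the exact commands for your system are generated by the tabs in the "Run Commands" section below and will differ by implementation, framework, device and scenario.

```bash
# Hypothetical contents of mlperf_submission.run.
# The real commands come from the "Run Commands" section and depend on your
# chosen implementation, framework, device and scenario.
mlcr run-mlperf,inference,_full,_r5.1-dev \
   --model=llama2-70b-99 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet
```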
## SCC interview
You are encouraged to highlight and explain the MLPerf inference throughput obtained on your system
and to describe any improvements and extensions to this benchmark (such as adding a new hardware backend
or supporting multi-node execution) that are useful for the community and [MLCommons](https://mlcommons.org).

## Run Commands
=== "MLCommons-Python"
57+
## MLPerf Reference Implementation in Python
58+
59+
{{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "reference", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }}
60+
61+
=== "Nvidia"
62+
## Nvidia MLPerf Implementation
63+
64+
{{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "nvidia", fixed_scenarios=["Offline"], categories=["Datacenter"], setup_tips=False, implementation_tips=False, skip_test_query_count=True) }}
65+
66+
## Submission Commands
### Generate actual submission tree

```bash
mlcr generate,inference,submission,_wg-inference \
   --clean \
   --run-checker \
   --tar=yes \
   --env.MLC_TAR_OUTFILE=submission.tar.gz \
   --division=open \
   --category=datacenter \
   --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \
   --quiet \
   --submitter=<Team Name>
```
* Use `--hw_name="My system name"` to give a meaningful system name.
* At the end, a **.tar** file will be generated inside the current working directory (see the quick check below).
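Before submitting, it can help to sanity-check that the tarball exists and contains the expected submission tree. A minimal check, assuming the output file name set via `--env.MLC_TAR_OUTFILE` above:

```bash
# List the first entries of the generated tarball to confirm the submission
# tree is present (file name as set via --env.MLC_TAR_OUTFILE).
tar -tzf submission.tar.gz | head -n 20
```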
### Submit Results
> **Note:**
> Further instructions on the final submission will be published as the deadline approaches.

<!-- Fork the `mlperf-inference-results-scc25` branch of the repository URL at [mlperf-automations](https://github.com/mlcommons/mlperf-automations).

Run the following command after **replacing `--repo_url` with your GitHub fork URL**.

```bash
mlcr push,github,mlperf,inference,submission \
   --repo_url=https://github.com/<myfork>/mlperf-automations \
   --repo_branch=mlperf-inference-results-scc25 \
   --commit_message="Results on system <HW Name>" \
   --quiet
```

Once uploaded, open a Pull Request to the origin repository. A GitHub Action will run there, and once it
finishes you can see your submitted results at [https://docs.mlcommons.org/mlperf-automations](https://docs.mlcommons.org/mlperf-automations). -->

docs/benchmarks/medical_imaging/3d-unet.md

Lines changed: 2 additions & 0 deletions
@@ -22,10 +22,12 @@ hide:

{{ mlperf_inference_implementation_readme (4, "3d-unet-99.9", "nvidia") }}

+<!--
=== "Intel"
## Intel MLPerf Implementation

{{ mlperf_inference_implementation_readme (4, "3d-unet-99", "intel") }}


{{ mlperf_inference_implementation_readme (4, "3d-unet-99.9", "intel") }}
+-->

docs/benchmarks/object_detection/retinanet.md

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,7 @@ hide:

{{ mlperf_inference_implementation_readme (4, "retinanet", "nvidia") }}

+<!--
=== "Intel"
## Intel MLPerf Implementation

@@ -29,3 +30,4 @@ hide:
## MLPerf Modular Implementation in C++

{{ mlperf_inference_implementation_readme (4, "retinanet", "cpp") }}
+-->

docs/benchmarks/recommendation/dlrm-v2.md

Lines changed: 2 additions & 0 deletions
@@ -19,9 +19,11 @@ hide:

{{ mlperf_inference_implementation_readme (4, "dlrm-v2-99.9", "nvidia") }}

+<!--
=== "Intel"
## Intel MLPerf Implementation

{{ mlperf_inference_implementation_readme (4, "dlrm-v2-99", "intel") }}

{{ mlperf_inference_implementation_readme (4, "dlrm-v2-99.9", "intel") }}
+-->

docs/benchmarks/text_to_image/sdxl.md

Lines changed: 2 additions & 1 deletion
@@ -16,7 +16,8 @@ hide:

{{ mlperf_inference_implementation_readme (4, "sdxl", "nvidia") }}

+<!--
=== "Intel"
## Intel MLPerf Implementation
{{ mlperf_inference_implementation_readme (4, "sdxl", "intel") }}
-
+-->

main.py

Lines changed: 32 additions & 26 deletions
@@ -28,6 +28,7 @@ def mlperf_inference_implementation_readme(
    content = ""

    execution_envs = ["Docker", "Native"]
+    run_modes = ["performance-only", "accuracy-only"]
    code_version = "r5.0-dev"
    implementation_run_options = []

@@ -67,7 +68,7 @@

    elif implementation == "nvidia":
        if model in ["retinanet", "resnet50",
-                     "3d-unet-99", "3d-unet-99.9"]:
+                     "3d-unet-99", "3d-unet-99.9", "llama2-70b-99", "llama2-70b-99.9"]:
            code_version = "r5.1-dev"
        if model in ["mixtral-8x7b"]:
            return pre_space + " WIP"
@@ -186,6 +187,7 @@
    cur_space2 = cur_space1 + " "
    cur_space3 = cur_space2 + " "
    cur_space4 = cur_space3 + " "
+    cur_space5 = cur_space4 + " "

    content += f"{cur_space1}=== \"{device}\"\n"
    content += f"{cur_space2}##### {device} device\n\n"
@@ -305,6 +307,8 @@

    if implementation.lower() == "nvidia":
        content += f"{cur_space3}* `--gpu_name=<Name of the GPU>` : The GPUs with supported configs in MLC are `orin`, `rtx_4090`, `rtx_a6000`, `rtx_6000_ada`, `l4`, `t4`and `a100`. For other GPUs, default configuration as per the GPU memory will be used.\n"
+        if "llama2-70b" in model.lower():
+            content += f"{cur_space3}* Add `--adr.llama2-model.tags=_pre-quantized` to use the Nvidia quantized models available in the MLC Storage. These models were quantized with three different configurations of tensor parallelism and pipeline parallelism: TP1–PP2, TP2–PP1, and TP1–PP1. The appropriate model will be automatically selected based on the values provided for `--tp_size` and `--pp_size` in the run command. By default, a tp_size of 2 and a pp_size of 1 will be used.\n"

    if device.lower() not in ["cuda"]:
        content += f"{cur_space3}* `--docker_os=ubuntu`: ubuntu and rhel are supported. \n"
@@ -373,25 +377,27 @@

    for scenario in scenarios:
        content += f"{cur_space3}=== \"{scenario}\"\n{cur_space4}###### {scenario}\n\n"
-        run_cmd = mlperf_inference_run_command(
-            spaces + 21,
-            model,
-            implementation,
-            framework.lower(),
-            category.lower(),
-            scenario,
-            device.lower(),
-            final_run_mode,
-            test_query_count,
-            False,
-            skip_test_query_count,
-            scenarios,
-            code_version,
-            extra_variation_tags,
-            extra_input_string,
-        )
-        content += run_cmd
-        # content += run_suffix
+        for run_mode in run_modes:
+            content += f"{cur_space4}=== \"{run_mode}\"\n{cur_space5}###### {run_mode}\n\n"
+            run_cmd = mlperf_inference_run_command(
+                spaces + 25,
+                model,
+                implementation,
+                framework.lower(),
+                category.lower(),
+                scenario,
+                device.lower(),
+                final_run_mode,
+                test_query_count,
+                False,
+                skip_test_query_count,
+                scenarios,
+                code_version,
+                extra_variation_tags + f",_{run_mode}",
+                extra_input_string,
+            )
+            content += run_cmd
+            # content += run_suffix

    if len(scenarios) > 1:
        content += f"{cur_space3}=== \"All Scenarios\"\n{cur_space4}###### All Scenarios\n\n"
@@ -481,7 +487,7 @@ def get_min_system_requirements(spaces, model, implementation, device):
    ds = {
        "dlrm": "500GB",
        "pointpainting": "500GB",
-        "llama2-70b": "600GB",
+        "llama2-70b": "900GB",
        "llama3_1-405b": "2.3TB",
        "mixtral": "100GB",
        "retinanet": "200GB",
@@ -498,7 +504,12 @@
            disk_space = ds[key]
            break

+    if "llama2" in model.lower():
+        disk_space = f" 900GB for manual execution of {'reference' if implementation.lower() == 'reference' else 'vendor'} implementation and 1.5TB for automated run through MLC-Scripts"
+
+    if implementation.lower() == "reference" or "llama2" in model.lower():
        min_sys_req_content += f"{spaces}* **Disk Space**: {disk_space}\n\n"
+
    # System memory
    if "dlrm" in model:
        system_memory = "512GB"
@@ -583,9 +594,6 @@ def get_docker_info(spaces, model, implementation,
    if implementation.lower() == "nvidia":
        info += f"{pre_space} - Default batch size is assigned based on [GPU memory](https://github.com/mlcommons/cm4mlops/blob/dd0c35856969c68945524d5c80414c615f5fe42c/script/app-mlperf-inference-nvidia/_cm.yaml#L1129) or the [specified GPU](https://github.com/mlcommons/cm4mlops/blob/dd0c35856969c68945524d5c80414c615f5fe42c/script/app-mlperf-inference-nvidia/_cm.yaml#L1370). Please click more option for *docker launch* or *run command* to see how to specify the GPU name.\n\n"
        info += f"{pre_space} - When run with `--all_models=yes`, all the benchmark models of NVIDIA implementation can be executed within the same container.\n\n"
-        if "llama2" in model.lower():
-            info += f"{pre_space} - The dataset for NVIDIA's implementation of Llama2 is not publicly available. The user must fill [this](https://docs.google.com/forms/d/e/1FAIpQLSc_8VIvRmXM3I8KQaYnKf7gy27Z63BBoI_I1u02f4lw6rBp3g/viewform?pli=1&fbzx=-8842630989397184967) form and be verified as a MLCommons member to access the dataset.\n\n"
-            info += f"{pre_space} - `PATH_TO_PICKE_FILE` should be replaced with path to the downloaded pickle file.\n\n"
    else:
        if model == "sdxl":
            info += f"\n{pre_space}!!! tip\n\n"
@@ -731,7 +739,6 @@
    if "llama2-70b" in model.lower():
        if implementation == "nvidia":
            docker_cmd_suffix += f" \\\n{pre_space} --tp_size=2"
-            docker_cmd_suffix += f" \\\n{pre_space} --nvidia_llama2_dataset_file_path=<PATH_TO_PICKLE_FILE>"
        elif implementation == "neuralmagic":
            docker_cmd_suffix += (
                f" \\\n{pre_space} --api_server=http://localhost:8000"
@@ -779,7 +786,6 @@
    if "llama2-70b" in model.lower():
        if implementation == "nvidia":
            cmd_suffix += f" \\\n{pre_space} --tp_size=<TP_SIZE>"
-            cmd_suffix += f" \\\n{pre_space} --nvidia_llama2_dataset_file_path=<PATH_TO_PICKE_FILE>"
        elif implementation == "neuralmagic":
            cmd_suffix += f" \\\n{pre_space} --api_server=http://localhost:8000"
            cmd_suffix += f" \\\n{pre_space} --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8"
