
Commit c8e2546

Merge branch 'dev' into python_upgrades

2 parents: 1ce3e62 + cabcc59

210 files changed (+529, -552 lines)


.github/workflows/linting.yml (+2, -2)

@@ -17,7 +17,7 @@ jobs:
         pip install pylint==2.16.1
     - name: Run pylint
       run: |
-        pylint algorithmic_efficiency
+        pylint algoperf
         pylint reference_algorithms
         pylint prize_qualification_baselines
         pylint submission_runner.py
@@ -34,7 +34,7 @@ jobs:
     - name: Install isort
       run: |
         python -m pip install --upgrade pip
-        pip install isort
+        pip install isort==5.12.0
     - name: Run isort
       run: |
        isort . --check --diff

.github/workflows/regression_tests_variants.yml (+1, -1)

@@ -72,7 +72,7 @@ jobs:
       run: |
         docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }}
         docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }} -d criteo1tb -f pytorch -s reference_algorithms/paper_baselines/adamw/pytorch/submission.py -w criteo1tb_resnet -t reference_algorithms/paper_baselines/adamw/tuning_search_space.json -e tests/regression_tests/adamw -m 10 -c False -o True -r false
-  criteo_resnet_pytorch:
+  criteo_embed_init_pytorch:
     runs-on: self-hosted
     needs: build_and_push_pytorch_docker_image
     steps:

.gitignore (+3, -3)

@@ -12,8 +12,8 @@ makefile
 *.swp
 */data/
 *events.out.tfevents*
-algorithmic_efficiency/workloads/librispeech_conformer/data_dir
-algorithmic_efficiency/workloads/librispeech_conformer/work_dir
+algoperf/workloads/librispeech_conformer/data_dir
+algoperf/workloads/librispeech_conformer/work_dir
 *.flac
 *.npy
 *.csv
@@ -25,4 +25,4 @@ scoring/plots/
 !scoring/test_data/experiment_dir/study_0/mnist_jax/trial_0/eval_measurements.csv
 !scoring/test_data/experiment_dir/study_0/mnist_jax/trial_1/eval_measurements.csv

-algorithmic_efficiency/_version.py
+algoperf/_version.py

CHANGELOG.md (+13, -7)

@@ -4,34 +4,39 @@

 - Finalized variant workload targets.
 - Fix in random_utils helper function.
-- For conformer PyTorch Dropout layers set `inplace=True`.
+- For conformer PyTorch Dropout layers set `inplace=True`.
 - Clear CUDA cache at beginning of each trial for PyTorch.

 ## algoperf-benchmark-0.1.4 (2024-03-26)

 Upgrade CUDA version to CUDA 12.1:
+
 - Upgrade CUDA version in Dockerfiles that will be used for scoring.
 - Update Jax and PyTorch package version tags to use local CUDA installation.

-Add flag for completely disabling checkpointing.
+Add flag for completely disabling checkpointing.
+
 - Note that we will run with checkpointing off at scoring time.

-Update Deepspeech and Conformer variant target setting configurations.
-- Note that variant targets are not final.
+Update Deepspeech and Conformer variant target setting configurations.
+
+- Note that variant targets are not final.

 Fixed bug in scoring code to take best trial in a study for external-tuning ruleset.

-Added instructions for submission.
+Added instructions for submission.

-Changed default number of workers for PyTorch data loaders to 0. Running with >0 may lead to incorrect eval results see https://github.com/mlcommons/algorithmic-efficiency/issues/732.
+Changed default number of workers for PyTorch data loaders to 0. Running with >0 may lead to incorrect eval results; see <https://github.com/mlcommons/algorithmic-efficiency/issues/732>.

 ## algoperf-benchmark-0.1.2 (2024-03-04)
+
 Workload variant additions and fixes:
+
 - Add Deepspeech workload variant
 - Fix bugs in Imagenet ResNet, WMT and Criteo1tb variants

 Add prize qualification logs for external tuning ruleset.
-Note: FastMRI trials with dropout are not yet added due to https://github.com/mlcommons/algorithmic-efficiency/issues/664.
+Note: FastMRI trials with dropout are not yet added due to <https://github.com/mlcommons/algorithmic-efficiency/issues/664>.

 Add missing functionality to Docker startup script for self_tuning ruleset.
 Add self_tuning ruleset option to script that runs all workloads for scoring.
@@ -41,6 +46,7 @@ Datasetup fixes.
 Fix tests that check training differences in PyTorch and JAX on GPU.

 ## algoperf-benchmark-0.1.1 (2024-01-19)
+
 Bug fixes to FastMRI metric calculation and targets.

 Added workload variants and targets for ogbg, fastmri, librispeech_conformer, imagenet_resnet, imagenet_vit, criteo1tb to be used as held-out workloads.
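As background for the data-loader entry in the changelog above (`num_workers` defaulting to 0), here is a minimal PyTorch illustration. It is not part of this commit, and the dataset and batch size are placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for a real workload's eval split.
dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))

# num_workers=0 keeps data loading in the main process, matching the new
# default mentioned above; values > 0 spawn worker processes, which were
# reported to cause incorrect eval results (issue #732).
loader = DataLoader(dataset, batch_size=8, num_workers=0, shuffle=False)

for features, labels in loader:
  pass  # eval / training step would go here
```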

CONTRIBUTING.md (+4, -4)

@@ -205,7 +205,7 @@ docker run -t -d \
 -v $HOME/data/:/data/ \
 -v $HOME/experiment_runs/:/experiment_runs \
 -v $HOME/experiment_runs/logs:/logs \
--v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
+-v $HOME/algorithmic-efficiency:/algoperf \
 --gpus all \
 --ipc=host \
 <image_path> \
@@ -229,7 +229,7 @@ To run the below commands, use the versions installed via `pip install -e '.[dev
 To automatically fix formatting errors, run the following (*WARNING:* this will edit your code, so it is suggested to make a git commit first!):

 ```bash
-yapf -i -r -vv -p algorithmic_efficiency datasets prize_qualification_baselines reference_algorithms tests *.py
+yapf -i -r -vv -p algoperf datasets prize_qualification_baselines reference_algorithms tests *.py
 ```

 To sort all import orderings, run the following:
@@ -247,7 +247,7 @@ isort . --check --diff
 To print out all offending pylint issues, run the following:

 ```bash
-pylint algorithmic_efficiency
+pylint algoperf
 pylint datasets
 pylint prize_qualification_baselines
 pylint reference_algorithms
@@ -288,4 +288,4 @@ You can check what version `setuptools_scm` is creating by running `python -m se

 To create a new version, create a new release (and tag) in the GitHub UI.
 The package version is automatically updated to the new version.
-Once the package is installed, the version can be accessed as the package attribute `algorithmic_efficiency.__version__`, i.e. via `python -c "import algorithmic_efficiency; print(algorithmic_efficiency.__version__)"`.
+Once the package is installed, the version can be accessed as the package attribute `algoperf.__version__`, i.e. via `python -c "import algoperf; print(algoperf.__version__)"`.

DOCUMENTATION.md (+10, -11)

@@ -91,7 +91,7 @@ With the exception of `_build_input_queue`, submitters can call any of these fun
 def step_hint(self): -> int
 ```

-- The `step_hint` function gives the number of global steps the baseline algorithm was allowed to use to reach the targets for a workload. Note that the baseline algorithms may have reached the target in fewer steps than this, but these were the max number of steps the baseline algorithms used for their learning rate schedules. Submitters can use this to help specify learning rate (or other) schedules.
+- The `step_hint` function gives the number of global steps the baseline algorithm can perform within the `max_runtime` to reach the targets for a workload. The `step_hint` is therefore dependent on the `max_runtime` and the workload. Note that the baseline algorithms may have reached the target in fewer steps than this, but these were the max number of steps the baseline algorithms used for their learning rate schedules. Submitters can use this to help specify learning rate (or other) schedules.

 ###### Data augmentation and preprocessing

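For context on how submitters typically consume this value, here is a minimal sketch of sizing a learning-rate schedule from the step hint. It is not part of this commit; the helper name, warmup fraction, and peak learning rate are illustrative assumptions, not benchmark API.

```python
import math


def cosine_schedule_from_step_hint(step_hint: int,
                                    peak_lr: float = 1e-3,
                                    warmup_fraction: float = 0.05):
  """Linear warmup followed by cosine decay, sized by the workload's step hint.

  `step_hint` plays the role described above: the number of steps the baseline
  was budgeted for. The warmup fraction and peak LR are arbitrary choices here.
  """
  warmup_steps = max(1, int(warmup_fraction * step_hint))

  def lr_at_step(step: int) -> float:
    if step < warmup_steps:
      return peak_lr * step / warmup_steps  # linear warmup
    progress = min(1.0, (step - warmup_steps) / max(1, step_hint - warmup_steps))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay

  return lr_at_step


# Example: a hypothetical workload whose step hint is 10,000 steps.
schedule = cosine_schedule_from_step_hint(step_hint=10_000)
print(schedule(0), schedule(500), schedule(10_000))
```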
@@ -222,7 +222,6 @@ def update_params(
 - Cannot replace the model parameters with pre-trained ones.
 - Batch norm should work here because the `model_fn` will return updated batch norm moving averages when it is told to with `update_batch_norm`.

-
 ###### Prepare for evaluation function

 ```python
@@ -278,7 +277,7 @@ def data_selection(

 In general, with noisy, non-deterministic training, evaluation frequency can affect training time measurements as more "bites of the apple" potentially allows the training code to exploit instability. We also want to discourage submissions from complicated and unrealistic logic that attempts to guess when training is close to complete and increases the evaluation rate, while not producing a well-sampled training curve at the start of training. Simply allowing submissions complete freedom over evaluation frequency encourages competitors to work to minimize the number of evaluations, which distracts from the primary goal of finding better training algorithms.

-Submissions are eligible for an untimed eval every `eval_period` seconds. Before proceeding to evaluation, the submission can prepare the model through a call to `prepare_for_eval`, effectively modifying the model parameters and state as well as the the optimizer state. Any additional evaluations performed by the submission code count against the runtime for scoring.
+Submissions are eligible for an untimed eval every `eval_period` seconds. Before proceeding to evaluation, the submission can prepare the model through a call to `prepare_for_eval`, effectively modifying the model parameters and state as well as the optimizer state. Any additional evaluations performed by the submission code count against the runtime for scoring.
 The harness that runs the submission code will attempt to eval every `eval_period` seconds by checking between each submission step (call of `update_params`) whether it has been at least `eval_period` seconds since the last eval; if so, the submission is given the possibility to prepare for evaluation (through a timed call to `prepare_for_eval`). If the accumulated runtime does not exceed the maximum allowed runtime after the preparation step, the clock is paused, and the submission is evaluated. This means that if calls to `update_params` typically take a lot more than `eval_period` seconds, such submissions will not receive as many untimed evals as a submission that had an `update_params` function that took less time. However, for appropriate settings of `eval_period`, we expect this to be quite rare. Submissions are always free to restructure their `update_params` code to split work into two subsequent steps to regain the potential benefits of these untimed model evaluations. For each workload, the `eval_period` will be set such that the total evaluation time is roughly between 10% and 20% of the total training time for the target-setting runs.

 #### Valid submissions
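The context lines above describe the harness timing rules in prose; the following is a rough, simplified sketch of that control flow. It is not part of this commit, and the `submission`/`workload` interfaces are hypothetical stand-ins for the real harness.

```python
import time


def run_submission(submission, workload, eval_period: float, max_runtime: float):
  """Simplified sketch of the timed training loop described above.

  `update_params` and `prepare_for_eval` are timed; the evaluation itself is
  untimed, provided the accumulated runtime still fits within `max_runtime`.
  """
  accumulated_runtime = 0.0
  last_eval_at = 0.0

  while accumulated_runtime < max_runtime and not workload.target_reached():
    start = time.monotonic()
    submission.update_params(workload)  # timed submission step
    accumulated_runtime += time.monotonic() - start

    # Between steps, check whether at least `eval_period` has elapsed since
    # the last eval; if so, the submission may prepare for evaluation.
    if accumulated_runtime - last_eval_at >= eval_period:
      start = time.monotonic()
      submission.prepare_for_eval(workload)  # timed preparation call
      accumulated_runtime += time.monotonic() - start

      if accumulated_runtime <= max_runtime:
        workload.evaluate()  # clock paused: this eval is untimed
      last_eval_at = accumulated_runtime
```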
@@ -419,7 +418,7 @@ In each trial, the tuning trial with the fastest training time to achieve the *v

 Submissions to this ruleset are not allowed to have user-defined hyperparameters. This ruleset allows both submissions that use the same hyperparameters for all workloads, including the randomized ones (e.g. Adam with default parameters), as well as submissions that perform inner-loop tuning during their training run (e.g. SGD with line searches).

-Submissions will run on one instance of the [benchmarking hardware](#benchmarking-hardware). As always, submissions are allowed to perform inner-loop tuning (e.g. for their learning rate) but the tuning efforts will be part of their score. A submission will run *S=5* times and its score will be the median time to reach the target evaluation metric value on the validation set. To account for the lack of external tuning, submissions have a longer time budget to reach the target performance. Compared to the [external tuning ruleset](#external-tuning-ruleset), the `max_runtime` is tripled. Runs that do not reach the target performance of the evaluation metric within this allotted time budget have an infinite time.
+Submissions will run on one instance of the [benchmarking hardware](#benchmarking-hardware). As always, submissions are allowed to perform inner-loop tuning (e.g. for their learning rate) but the tuning efforts will be part of their score. A submission will run *S=5* times and its score will be the median time to reach the target evaluation metric value on the validation set. To account for the lack of external tuning, submissions have a longer time budget to reach the target performance. Compared to the [external tuning ruleset](#external-tuning-ruleset), the `max_runtime` is $1.5$ times longer. Runs that do not reach the target performance of the evaluation metric within this allotted time budget have an infinite time.

 ### Workloads

@@ -440,11 +439,11 @@ The currently eight fixed workloads are:
 | | **Task** | **Dataset** | **Model** | **Loss** | **Metric** | Validation<br>**Target** | Test<br>**Target** | Maximum<br>**Runtime** <br>(in secs) |
 |------------|-------------------------------|-------------|-------------------------|----------|------------|--------------------------|----------------------|------------------------|
 | **1** | Clickthrough rate prediction | Criteo 1TB | DLRMsmall | CE | CE | 0.123735 | 0.126041 | 7,703 |
-| **2** | MRI reconstruction | fastMRI | U-Net | L1 | SSIM | 0.723653 | 0.740633 | 8,859 |
-| **3<br>4** | Image classification | ImageNet | ResNet-50<br>ViT | CE | ER | 0.22569<br>0.22691 | 0.3440<br>0.3481 | 63,008 <br> 77,520 |
-| **5<br>6** | Speech recognition | LibriSpeech | Conformer<br>DeepSpeech | CTC | WER | 0.085884<br>0.119936 | 0.052981<br>0.074143 | 61,068<br>55,506 |
-| **7** | Molecular property prediction | OGBG | GNN | CE | mAP | 0.28098 | 0.268729 | 18,477 |
-| **8** | Translation | WMT | Transformer | CE | BLEU | 30.8491 | 30.7219 | 48,151 |
+| **2** | MRI reconstruction | fastMRI | U-Net | L1 | SSIM | 0.723653 | 0.740633 | 4,430 |
+| **3<br>4** | Image classification | ImageNet | ResNet-50<br>ViT | CE | ER | 0.22569<br>0.22691 | 0.3440<br>0.3481 | 66,159 <br> 69,768 |
+| **5<br>6** | Speech recognition | LibriSpeech | Conformer<br>DeepSpeech | CTC | WER | 0.085884<br>0.119936 | 0.052981<br>0.074143 | 58,015<br>44,405 |
+| **7** | Molecular property prediction | OGBG | GNN | CE | mAP | 0.28098 | 0.268729 | 12,011 |
+| **8** | Translation | WMT | Transformer | CE | BLEU | 30.8491 | 30.7219 | 43,336 |

 Default Dropout Values for Different Workloads:

@@ -504,7 +503,7 @@ For self-reported results, it is acceptable to perform the tuning trials on hard
 Target performances on the validation and test sets will be defined for each [workload](#workloads) separately. For the [fixed workloads](#fixed-workloads), we take the best performance achievable by one of four standard algorithms (AdamW, NadamW, Nesterov Momentum, and Heavy Ball Momentum). These target-setting algorithms will follow the general process of the external tuning ruleset, with a significantly larger tuning budget of $200$ trials to guarantee competitive performance. Once the best algorithm and its hyperparameters are determined, training is repeated $20$ times. The median of the best achieved validation errors across seeds is used as the *validation* target. Out of the $10$ repeated runs that achieved this validation target, we took the worst achieved test error across seeds as our *test* target. Taking the median validation performance after rerunning the best hyperparameter point prevents our procedure from selecting a lucky outlier.
 To save computational resources, we only tuned two training algorithms instead of four, for the [randomized workloads](#randomized-workloads). For each workload variant, we used NadamW and the other best-performing training algorithm on the corresponding fixed workload the randomized workload is based on.

-Both [tuning rulesets](#tuning) will use the same target performances. The runtime of the target-setting algorithms on each workload will be chosen to match published results and is constrained by the overall time budget of roughly a single week for all fixed workloads. The `max_runtime` for submissions on each workload is $\frac{1}{3}$ longer than the runtime of the target-setting algorithms (this `max_runtime` will be three times as much for the self-tuning ruleset, see the [Self-tuning ruleset](#self-tuning-ruleset) section).
+Both [tuning rulesets](#tuning) will use the same target performances. The runtime of the target-setting algorithms on each workload will be chosen to match published results and is constrained by the overall time budget of roughly a single week for all fixed workloads. The initial `max_runtime` for submissions on each workload was $\frac{1}{3}$ longer than the runtime of the target-setting algorithms (this `max_runtime` will be $1.5$ times as much for the self-tuning ruleset, see the [Self-tuning ruleset](#self-tuning-ruleset) section). After the initial round of submissions, we adapted the `max_runtime` based on the performance of the submissions (see [this issue](https://github.com/mlcommons/algorithmic-efficiency/issues/836)).

 #### Benchmark score using performance profiles

@@ -642,4 +641,4 @@ That said, while submitting Adam with some novel heuristic to set various hyperp
 The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads use the same TensorFlow input pipelines. Due to differences in how JAX and PyTorch distribute computations across devices, the PyTorch workloads have an additional overhead for these workloads.

 Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details.
-While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example.
+While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algoperf/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example.
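To make the broadcast strategy in the last hunk concrete, here is a minimal sketch, not part of this commit: only rank 0 reads a batch from its input pipeline, every other rank allocates a receive buffer, and `torch.distributed.broadcast` copies the data across processes. The random-tensor source, batch shape, and function name are placeholders; the real logic lives in the linked WMT workload file.

```python
import torch
import torch.distributed as dist


def broadcast_batch_from_rank0(rank: int, batch_shape=(128, 256)) -> torch.Tensor:
  """Sketch of the rank-0 input pipeline + broadcast pattern.

  Assumes the NCCL process group is already initialized and that
  torch.cuda.set_device(rank) has been called, as is typical under DDP.
  """
  if rank == 0:
    # Placeholder for a batch produced by the (single) TF input pipeline.
    batch = torch.randn(batch_shape, device='cuda')
  else:
    # Non-zero ranks only allocate a buffer of the agreed-upon shape.
    batch = torch.empty(batch_shape, device='cuda')
  dist.broadcast(batch, src=0)  # copy rank 0's tensor to all other ranks
  return batch
```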
