**CHANGELOG.md** (+13 −7)
````diff
@@ -4,34 +4,39 @@
 
 - Finalized variant workload targets.
 - Fix in random_utils helper function.
-- For conformer PyTorch Dropout layers set `inplace=True`.
+- For conformer PyTorch Dropout layers set `inplace=True`.
 - Clear CUDA cache at beginning of each trial for PyTorch.
 
 ## algoperf-benchmark-0.1.4 (2024-03-26)
 
 Upgrade CUDA version to CUDA 12.1:
+
 - Upgrade CUDA version in Dockerfiles that will be used for scoring.
 - Update Jax and PyTorch package version tags to use local CUDA installation.
 
-Add flag for completely disabling checkpointing.
+Add flag for completely disabling checkpointing.
+
 - Note that we will run with checkpointing off at scoring time.
 
-Update Deepspeech and Conformer variant target setting configurations.
-- Note that variant targets are not final.
+Update Deepspeech and Conformer variant target setting configurations.
+
+- Note that variant targets are not final.
 
 Fixed bug in scoring code to take best trial in a study for external-tuning ruleset.
 
-Added instructions for submission.
+Added instructions for submission.
 
-Changed default number of workers for PyTorch data loaders to 0. Running with >0 may lead to incorrect eval results, see https://github.com/mlcommons/algorithmic-efficiency/issues/732.
+Changed default number of workers for PyTorch data loaders to 0. Running with >0 may lead to incorrect eval results, see <https://github.com/mlcommons/algorithmic-efficiency/issues/732>.
 
 ## algoperf-benchmark-0.1.2 (2024-03-04)
+
 Workload variant additions and fixes:
+
 - Add Deepspeech workload variant
 - Fix bugs in Imagenet ResNet, WMT and Criteo1tb variants
 
 Add prize qualification logs for external tuning ruleset.
-Note: FastMRI trials with dropout are not yet added due to https://github.com/mlcommons/algorithmic-efficiency/issues/664.
+Note: FastMRI trials with dropout are not yet added due to <https://github.com/mlcommons/algorithmic-efficiency/issues/664>.
 
 Add missing functionality to Docker startup script for self_tuning ruleset.
 Add self_tuning ruleset option to script that runs all workloads for scoring.
@@ -41,6 +46,7 @@ Datasetup fixes.
 Fix tests that check training differences in PyTorch and JAX on GPU.
 
 ## algoperf-benchmark-0.1.1 (2024-01-19)
+
 Bug fixes to FastMRI metric calculation and targets.
 
 Added workload variants and targets for ogbg, fastmri, librispeech_conformer, imagenet_resnet, imagenet_vit, criteo1tb to be used as held-out workloads.
````
````diff
@@ … @@ To print out all offending pylint issues, run the following:
 
 ```bash
-pylint algorithmic_efficiency
+pylint algoperf
 pylint datasets
 pylint prize_qualification_baselines
 pylint reference_algorithms
@@ -288,4 +288,4 @@ You can check what version `setuptools_scm` is creating by running `python -m se
 
 To create a new version, create a new release (and tag) in the GitHub UI.
 The package version is automatically updated to the new version.
-Once the package is installed, the version can be accessed as the package attribute `algorithmic_efficiency.__version__`, i.e. via `python -c "import algorithmic_efficiency; print(algorithmic_efficiency.__version__)"`.
+Once the package is installed, the version can be accessed as the package attribute `algoperf.__version__`, i.e. via `python -c "import algoperf; print(algoperf.__version__)"`.
````
**DOCUMENTATION.md** (+10 −11)
````diff
@@ -91,7 +91,7 @@ With the exception of `_build_input_queue`, submitters can call any of these fun
 def step_hint(self): -> int
 ```
 
-- The `step_hint` function gives the number of global steps the baseline algorithm was allowed to use to reach the targets for a workload. Note that the baseline algorithms may have reached the target in fewer steps than this, but these were the max number of steps the baseline algorithms used for their learning rate schedules. Submitters can use this to help specify learning rate (or other) schedules.
+- The `step_hint` function gives the number of global steps the baseline algorithm can perform with the `max_runtime` to reach the targets for a workload. The `step_hint` is therefore dependent on the `max_runtime` and the workload. Note that the baseline algorithms may have reached the target in fewer steps than this, but these were the max number of steps the baseline algorithms used for their learning rate schedules. Submitters can use this to help specify learning rate (or other) schedules.
 
 ###### Data augmentation and preprocessing
 
````
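The `step_hint` can be folded directly into a submission's schedule. Below is a minimal sketch (not part of the spec) of a warmup-plus-cosine learning-rate schedule stretched over the hint; the function name and the `base_lr`/`warmup_frac` parameters are illustrative choices, not spec API.

```python
import math

def lr_schedule(step: int, step_hint: int, base_lr: float = 1e-3,
                warmup_frac: float = 0.05) -> float:
    """Illustrative warmup + cosine decay stretched over the workload's step hint."""
    warmup_steps = max(1, int(warmup_frac * step_hint))
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / max(1, step_hint - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Inside a submission, something like: lr = lr_schedule(global_step, workload.step_hint)
```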
````diff
@@ -222,7 +222,6 @@ def update_params(
 - Cannot replace the model parameters with pre-trained ones.
 - Batch norm should work here because the `model_fn` will return updated batch norm moving averages when it is told to with `update_batch_norm`.
 
-
 ###### Prepare for evaluation function
 
 ```python
````
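For context, the batch-norm statistics are refreshed by asking the workload's forward pass to update them. The sketch below is illustrative only: it assumes the spec's `model_fn` argument ordering of `(params, batch, model_state, mode, rng, update_batch_norm)`, a training-mode value such as `spec.ForwardPassMode.TRAIN` passed in as `train_mode`, and a made-up helper name.

```python
def forward_with_bn_update(workload, params, batch, model_state, train_mode, rng):
    # Run a training-mode forward pass and ask the workload to update the
    # batch-norm moving averages (update_batch_norm=True). The returned
    # new_model_state carries the refreshed statistics and should be passed
    # back to the harness from `update_params`.
    logits, new_model_state = workload.model_fn(
        params, batch, model_state, train_mode, rng, update_batch_norm=True)
    return logits, new_model_state
```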
````diff
@@ -278,7 +277,7 @@ def data_selection(
 
 In general, with noisy, non-deterministic training, evaluation frequency can affect training time measurements as more "bites of the apple" potentially allows the training code to exploit instability. We also want to discourage submissions from complicated and unrealistic logic that attempts to guess when training is close to complete and increases the evaluation rate, while not producing a well-sampled training curve at the start of training. Simply allowing submissions complete freedom over evaluation frequency encourages competitors to work to minimize the number of evaluations, which distracts from the primary goal of finding better training algorithms.
 
-Submissions are eligible for an untimed eval every `eval_period` seconds. Before proceeding to evaluation, the submission can prepare the model through a call to `prepare_for_eval`, effectively modifying the model parameters and state as well as the optimizer state. Any additional evaluations performed by the submission code count against the runtime for scoring.
+Submissions are eligible for an untimed eval every `eval_period` seconds. Before proceeding to evaluation, the submission can prepare the model through a call to `prepare_for_eval`, effectively modifying the model parameters and state as well as the optimizer state. Any additional evaluations performed by the submission code count against the runtime for scoring.
 The harness that runs the submission code will attempt to eval every `eval_period` seconds by checking between each submission step (call of `update_params`) whether it has been at least `eval_period` seconds since the last eval; if so, the submission is given the possibility to prepare for evaluation (through a timed call to `prepare_for_eval`). If the accumulated runtime does not exceed the maximum allowed runtime after the preparation step, the clock is paused, and the submission is evaluated. This means that if calls to `update_params` typically take a lot more than `eval_period` seconds, such submissions will not receive as many untimed evals as a submission that had an `update_params` function that took less time. However, for appropriate settings of `eval_period`, we expect this to be quite rare. Submissions are always free to restructure their `update_params` code to split work into two subsequent steps to regain the potential benefits of these untimed model evaluations. For each workload, the `eval_period` will be set such that the total evaluation time is roughly between 10% and 20% of the total training time for the target-setting runs.
 
 #### Valid submissions
````
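The timing logic described in this hunk can be summarized in a few lines. The following is a simplified sketch of the harness behavior, not the actual runner code; the callable arguments are illustrative and target checks and bookkeeping are omitted.

```python
import time
from typing import Callable

def timing_harness(update_params: Callable[[], None],
                   prepare_for_eval: Callable[[], None],
                   evaluate: Callable[[], None],
                   eval_period: float,
                   max_runtime: float) -> None:
    """Sketch: untimed evals are granted at most every `eval_period` seconds."""
    accumulated_runtime = 0.0
    last_eval = float("-inf")  # grant an eval opportunity after the first step
    while accumulated_runtime < max_runtime:
        start = time.monotonic()
        update_params()                                   # timed submission step
        accumulated_runtime += time.monotonic() - start
        if time.monotonic() - last_eval >= eval_period:
            start = time.monotonic()
            prepare_for_eval()                            # also timed
            accumulated_runtime += time.monotonic() - start
            if accumulated_runtime <= max_runtime:
                evaluate()                                # clock paused: eval is untimed
                last_eval = time.monotonic()
```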
````diff
@@ -419,7 +418,7 @@ In each trial, the tuning trial with the fastest training time to achieve the *v
 
 Submissions to this ruleset are not allowed to have user-defined hyperparameters. This ruleset allows both submissions that use the same hyperparameters for all workloads, including the randomized ones (e.g. Adam with default parameters), as well as submissions that perform inner-loop tuning during their training run (e.g. SGD with line searches).
 
-Submissions will run on one instance of the [benchmarking hardware](#benchmarking-hardware). As always, submissions are allowed to perform inner-loop tuning (e.g. for their learning rate) but the tuning efforts will be part of their score. A submission will run *S=5* times and its score will be the median time to reach the target evaluation metric value on the validation set. To account for the lack of external tuning, submissions have a longer time budget to reach the target performance. Compared to the [external tuning ruleset](#external-tuning-ruleset), the `max_runtime` is tripled. Runs that do not reach the target performance of the evaluation metric within this allotted time budget have an infinite time.
+Submissions will run on one instance of the [benchmarking hardware](#benchmarking-hardware). As always, submissions are allowed to perform inner-loop tuning (e.g. for their learning rate) but the tuning efforts will be part of their score. A submission will run *S=5* times and its score will be the median time to reach the target evaluation metric value on the validation set. To account for the lack of external tuning, submissions have a longer time budget to reach the target performance. Compared to the [external tuning ruleset](#external-tuning-ruleset), the `max_runtime` is $1.5$ times longer. Runs that do not reach the target performance of the evaluation metric within this allotted time budget have an infinite time.
 
 ### Workloads
 
````
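As a toy illustration of the scoring rule in this ruleset (synthetic numbers, not benchmark results): the per-workload score is the median over the *S=5* runs, with runs that never reach the target counted as infinite.

```python
import numpy as np

# Synthetic times (in seconds) to reach the validation target for S=5 runs;
# a run that never reaches the target within max_runtime scores infinity.
times_to_target = np.array([58_500.0, 60_100.0, 61_000.0, 63_200.0, np.inf])
per_workload_score = np.median(times_to_target)  # -> 61000.0
```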
````diff
@@ -440,11 +439,11 @@ The currently eight fixed workloads are:
@@ -504,7 +503,7 @@ For self-reported results, it is acceptable to perform the tuning trials on hard
 Target performances on the validation and test sets will be defined for each [workload](#workloads) separately. For the [fixed workloads](#fixed-workloads), we take the best performance achievable by one of four standard algorithms (AdamW, NadamW, Nesterov Momentum, and Heavy Ball Momentum). These target-setting algorithms will follow the general process of the external tuning ruleset, with a significantly larger tuning budget of $200$ trials to guarantee competitive performance. Once the best algorithm and its hyperparameters are determined, training is repeated $20$ times. The median of the best achieved validation errors across seeds is used as the *validation* target. Out of the $10$ repeated runs that achieved this validation target, we took the worst achieved test error across seeds as our *test* target. Taking the median validation performance after rerunning the best hyperparameter point prevents our procedure from selecting a lucky outlier.
 To save computational resources, we only tuned two training algorithms instead of four for the [randomized workloads](#randomized-workloads). For each workload variant, we used NadamW and the other best-performing training algorithm on the fixed workload that the randomized workload is based on.
 
-Both [tuning rulesets](#tuning) will use the same target performances. The runtime of the target-setting algorithms on each workload will be chosen to match published results and is constrained by the overall time budget of roughly a single week for all fixed workloads. The `max_runtime` for submissions on each workload is $\frac{1}{3}$ longer than the runtime of the target-setting algorithms (this `max_runtime` will be three times as much for the self-tuning ruleset, see the [Self-tuning ruleset](#self-tuning-ruleset) section).
+Both [tuning rulesets](#tuning) will use the same target performances. The runtime of the target-setting algorithms on each workload will be chosen to match published results and is constrained by the overall time budget of roughly a single week for all fixed workloads. The initial `max_runtime` for submissions on each workload was $\frac{1}{3}$ longer than the runtime of the target-setting algorithms (this `max_runtime` will be $1.5$ times as much for the self-tuning ruleset, see the [Self-tuning ruleset](#self-tuning-ruleset) section). After the initial round of submissions, we have adapted the `max_runtime` based on the performance of the submissions (see [this issue](https://github.com/mlcommons/algorithmic-efficiency/issues/836)).
 
 #### Benchmark score using performance profiles
 
````
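The target-setting rule in this hunk can be expressed compactly. The sketch below uses synthetic per-seed errors purely to illustrate the median/worst-case selection described above; the numbers are not real benchmark results.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the best validation/test errors of the 20 repeated
# runs of the winning target-setting algorithm.
val_errors = rng.normal(loc=0.22, scale=0.01, size=20)
test_errors = val_errors + rng.normal(loc=0.01, scale=0.005, size=20)

validation_target = np.median(val_errors)   # median across seeds
reached = val_errors <= validation_target   # roughly half the runs reach the target
test_target = test_errors[reached].max()    # worst test error among those runs
```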
````diff
@@ -642,4 +641,4 @@ That said, while submitting Adam with some novel heuristic to set various hyperp
 The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads use the same TensorFlow input pipelines. Due to differences in how JAX and PyTorch distribute computations across devices, the PyTorch workloads have an additional overhead for these workloads.
 
 Since we use PyTorch's [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details.
-While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example.
+While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces an additional communication overhead for each batch. See the [implementation for the WMT workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algoperf/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example.
````
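For illustration, a minimal version of the rank-0-pipeline-plus-broadcast pattern might look as follows. This is not the WMT workload code: it assumes an already-initialized `torch.distributed` process group, a float32 batch, and a fixed, known per-device batch shape so that non-source ranks can pre-allocate receive buffers.

```python
import torch
import torch.distributed as dist

def get_broadcast_batch(input_iter, device: torch.device, rank: int,
                        batch_shape=(32, 128)) -> torch.Tensor:
    """Rank 0 pulls the next batch from the (TensorFlow-backed) input iterator
    and broadcasts it; all other ranks receive into a pre-allocated buffer."""
    if rank == 0:
        batch = torch.as_tensor(next(input_iter), dtype=torch.float32, device=device)
    else:
        # Assumed fixed shape/dtype so the receive buffer matches the sender.
        batch = torch.empty(batch_shape, dtype=torch.float32, device=device)
    dist.broadcast(batch, src=0)  # collective: rank 0 sends, all others receive
    return batch
```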