From 95776ed5608c92236236da8dc71ef878602cf184 Mon Sep 17 00:00:00 2001
From: "Ahmed Khaled (rka97)"
Date: Thu, 5 Feb 2026 23:20:48 +0000
Subject: [PATCH 1/3] Change README for v1.0
---
README.md | 301 ++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 212 insertions(+), 89 deletions(-)
diff --git a/README.md b/README.md
index 71595e11b..63e2931f4 100644
--- a/README.md
+++ b/README.md
@@ -24,59 +24,105 @@
----
-
-The MLCommons™ **AlgoPerf: Training Algorithms benchmark** is designed to find **training algorithms that can train neural networks faster** by rigorously measuring how quickly they reach a specific performance target across a diverse set of deep learning workloads.
-
-When training neural nets, practitioners face many critical yet often opaque decisions: What optimizer to choose? How should its learning rate be tuned? What learning rate schedule should be used? These choices can make or break training, yet the community has lacked a clear, standardized way to identify the state of the art.
-Unlike benchmarks focused on hardware or model architecture, AlgoPerf isolates the **training algorithm** itself, which includes the optimizer, regularization, data selection, and hyperparameters like the learning rate schedule. By standardizing the benchmark process, AlgoPerf offers a meaningful apples-to-apples comparison of training algorithms and follows the following **key principles**:
-
-- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](/docs/DOCUMENTATION.md#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](/docs/DOCUMENTATION.md#benchmarking-hardware) (4x A100 (40GB) GPUs). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
-- ⏱️ **Time-To-Result:** Submissions are evaluated based on the total wall-clock time required to reach the target, rewarding practical and efficient algorithms.
-- 🧠 **Diverse Workloads:** The benchmark includes [**8 diverse deep learning workloads**](/docs/DOCUMENTATION.md#workloads) across domains like image classification, speech recognition, and machine translation. A submission's score is computed by aggregating its performance, using [**performance profiles**](/docs/DOCUMENTATION.md#benchmark-score-using-performance-profiles), across all workloads to ensure general-purpose algorithms.
-- 📦 **Fully-Specified Algorithms:** Submissions must be complete procedures and thus hyperparameter tuning is treated as part of the algorithm. Submissions can either provide a search space for automated tuning ([**External tuning ruleset**](/docs/DOCUMENTATION.md#external-tuning-ruleset)) or be hyperparameter-free ([**Self-tuning ruleset**](/docs/DOCUMENTATION.md#self-tuning-ruleset)) with any tuning done automatically and "on the clock". This measures an algorithm's _total_ practical cost and provides practitioners with a complete method, eliminating the guesswork of how to apply it.
+--------------------------------------------------------------------------------
+
+The MLCommons™ **AlgoPerf: Training Algorithms benchmark** is designed to find
+**training algorithms that can train neural networks faster** by rigorously
+measuring how quickly they reach a specific performance target across a diverse
+set of deep learning workloads.
+
+When training neural nets, practitioners face many critical yet often opaque
+decisions: What optimizer to choose? How should its learning rate be tuned? What
+learning rate schedule should be used? These choices can make or break training,
+yet the community has lacked a clear, standardized way to identify the state of
+the art. Unlike benchmarks focused on hardware or model architecture, AlgoPerf
+isolates the **training algorithm** itself, which includes the optimizer,
+regularization, data selection, and hyperparameters like the learning rate
+schedule. By standardizing the benchmark process, AlgoPerf offers a meaningful
+apples-to-apples comparison of training algorithms and is built on the
+following **key principles**:
+
+- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must
+ train a set of [**fixed models**](/docs/DOCUMENTATION.md#workloads) to a
+ pre-defined validation performance target as fast as possible. All
+ submissions use the same model architecture and are run on the same
+ [**standardized hardware**](/docs/DOCUMENTATION.md#benchmarking-hardware)
+ (4x A100 (40GB) GPUs). This isolates the training algorithm's performance
+ and allows a fair apples-to-apples comparison.
+- ⏱️ **Time-To-Result:** Submissions are evaluated based on the total
+ wall-clock time required to reach the target, rewarding practical and
+ efficient algorithms.
+- 🧠 **Diverse Workloads:** The benchmark includes
+ [**9 diverse deep learning workloads**](/docs/DOCUMENTATION.md#workloads)
+ across domains like image classification, speech recognition, and machine
+  translation. A submission's score is computed by aggregating its
+  performance across all workloads using
+  [**performance profiles**](/docs/DOCUMENTATION.md#benchmark-score-using-performance-profiles),
+  which rewards general-purpose algorithms.
+- 📦 **Fully-Specified Algorithms:** Submissions must be complete procedures
+ and thus hyperparameter tuning is treated as part of the algorithm.
+ Submissions can either provide a search space for automated tuning
+ ([**External tuning ruleset**](/docs/DOCUMENTATION.md#external-tuning-ruleset))
+ or be hyperparameter-free
+ ([**Self-tuning ruleset**](/docs/DOCUMENTATION.md#self-tuning-ruleset)) with
+ any tuning done automatically and "on the clock". This measures an
+ algorithm's *total* practical cost and provides practitioners with a
+ complete method, eliminating the guesswork of how to apply it.
> [!IMPORTANT]
>
-> **We have moved to a rolling leaderboard!**
-> We invite you to submit your algorithm for evaluation, see our [**How to Submit**](#how-to-submit) section and the [**submission repository**](https://github.com/mlcommons/submissions_algorithms). The working group will review your submission and, if selected, run it on our hardware and add your results to the official [**AlgoPerf Leaderboard**](https://github.com/mlcommons/submissions_algorithms). **Note: we are currently focusing our efforts on the self-tuning leaderboard to strengthen its competitiveness.**
-
----
+> **We have moved to a rolling leaderboard!** We invite you to submit your
+> algorithm for evaluation, see our [**How to Submit**](#how-to-submit) section
+> and the
+> [**submission repository**](https://github.com/mlcommons/submissions_algorithms).
+> The working group will review your submission and, if selected, run it on our
+> hardware and add your results to the official
+> [**AlgoPerf Leaderboard**](https://github.com/mlcommons/submissions_algorithms).
+> **Note: we are currently focusing our efforts on the self-tuning leaderboard
+> to strengthen its competitiveness.**
+
+--------------------------------------------------------------------------------
## Table of Contents
-- [Getting Started](#getting-started)
- - [Installation](#installation)
- - [Run a Workload](#run-a-workload)
- - [Develop Your Algorithm](#develop-your-algorithm)
-- [How to Submit](#how-to-submit)
-- [Rules, Documentation \& FAQ](#rules-documentation--faq)
-- [Contributing \& Resources](#contributing--resources)
-- [Releases \& Roadmap](#releases--roadmap)
-- [Training Algorithm Collection](#training-algorithm-collection)
-- [Citing Our Work](#citing-our-work)
-- [License](#license)
+- [Getting Started](#getting-started)
+ - [Installation](#installation)
+ - [Run a Workload](#run-a-workload)
+ - [Develop Your Algorithm](#develop-your-algorithm)
+- [How to Submit](#how-to-submit)
+- [Rules, Documentation \& FAQ](#rules-documentation--faq)
+- [Contributing \& Resources](#contributing--resources)
+- [Releases \& Roadmap](#releases--roadmap)
+- [Training Algorithm Collection](#training-algorithm-collection)
+- [Citing Our Work](#citing-our-work)
+- [License](#license)
## Getting Started
-Follow these steps to run a baseline algorithm and start developing your own submission.
-A more detailed guide can be found in the [**Getting Started**](/docs/GETTING_STARTED.md) document.
-If you run into any issues, please feel free to contact us.
-Either [**file an issue**](https://github.com/mlcommons/algorithmic-efficiency/issues), ask a question on [**our Discord**](https://discord.gg/5FPXK7SMt6) or [**join our weekly meetings**](https://mlcommons.org/en/groups/research-algorithms/).
+Follow these steps to run a baseline algorithm and start developing your own
+submission. A more detailed guide can be found in the
+[**Getting Started**](/docs/GETTING_STARTED.md) document. If you run into any
+issues, please feel free to contact us. Either
+[**file an issue**](https://github.com/mlcommons/algorithmic-efficiency/issues),
+ask a question on [**our Discord**](https://discord.gg/5FPXK7SMt6) or
+[**join our weekly meetings**](https://mlcommons.org/en/groups/research-algorithms/).
### Installation
-We recommend using the provided [**Docker container**](/docs/GETTING_STARTED.md#docker) to ensure a reproducible environment similar to our scoring environment.
-Alternatively, you can install the package and its dependencies in a Python virtual environment.
-Both options are described in more detail in the [**Getting Started**](/docs/GETTING_STARTED.md) document.
+We recommend using the provided
+[**Docker container**](/docs/GETTING_STARTED.md#docker) to ensure a reproducible
+environment similar to our scoring environment. Alternatively, you can install
+the package and its dependencies in a Python virtual environment. Both options
+are described in more detail in the
+[**Getting Started**](/docs/GETTING_STARTED.md) document.
-_TL;DR Install JAX version for GPU (with workload dependencies):_
+*TL;DR Install JAX version for GPU (with workload dependencies):*
```bash
pip3 install -e '.[pytorch_cpu,jax_gpu,full]' --extra-index-url https://download.pytorch.org/whl/cpu
```
-_TL;DR Install PyTorch version for GPU (with workload dependencies):_
+*TL;DR Install PyTorch version for GPU (with workload dependencies):*
```bash
pip3 install -e '.[jax_cpu,pytorch_gpu,full]'
@@ -84,10 +130,11 @@ pip3 install -e '.[jax_cpu,pytorch_gpu,full]'
### Run a Workload
-Use the `submission_runner.py` to run an experiment, i.e., train a workload using a specific training algorithm.
-Here's how to run the AdamW baseline on the `mnist` workload.
+Use the `submission_runner.py` to run an experiment, i.e., train a workload
+using a specific training algorithm. Here's how to run the AdamW baseline on the
+`mnist` workload.
-_TL;DR: Running a JAX workload:_
+*TL;DR: Running a JAX workload:*
```bash
python3 submission_runner.py \
@@ -99,7 +146,7 @@ python3 submission_runner.py \
--tuning_search_space=algorithms/archived_paper_baselines/adamw/tuning_search_space.json
```
-_TL;DR: Running a PyTorch workload:_
+*TL;DR: Running a PyTorch workload:*
```bash
python3 submission_runner.py \
@@ -113,77 +160,148 @@ python3 submission_runner.py \
### Develop Your Algorithm
-Now you're ready to create your own `submission.py`! For detailed instructions, FAQs, and technical details, please refer to our documentation:
+Now you're ready to create your own `submission.py`! For detailed instructions,
+FAQs, and technical details, please refer to our documentation:
-- [**Getting Started Guide**](/docs/GETTING_STARTED.md): A detailed walkthrough for developing your algorithm.
-- [**Benchmark Documentation**](/docs/DOCUMENTATION.md): The complete technical reference including the "benchmark rules" such as allowed and disallowed submissions, FAQs, and technical details such as the API.
+- [**Getting Started Guide**](/docs/GETTING_STARTED.md): A detailed
+ walkthrough for developing your algorithm.
+- [**Benchmark Documentation**](/docs/DOCUMENTATION.md): The complete
+ technical reference including the "benchmark rules" such as allowed and
+ disallowed submissions, FAQs, and technical details such as the API.
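+
+As a quick, purely illustrative starting point (not an official recipe), you
+can copy the provided [**submission template**](./algorithms/template) and
+point `submission_runner.py` at your copy, mirroring the flags of the example
+above. The directory name `my_submission` below is a placeholder; adjust the
+file names to match what the template actually contains.
+
+```bash
+# Sketch: start from the template and run it on a small workload.
+cp -r algorithms/template algorithms/my_submission
+
+python3 submission_runner.py \
+    --framework=jax \
+    --workload=mnist \
+    --experiment_dir=$HOME/experiments \
+    --experiment_name=my_first_run \
+    --submission_path=algorithms/my_submission/submission.py \
+    --tuning_search_space=algorithms/my_submission/tuning_search_space.json
+```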
## How to Submit
-Ready to see how your algorithm stacks up? Submit it to the official AlgoPerf leaderboard!
-
-1. **Develop Your Algorithm:** Create your training algorithm following the API and "rules" described in our [**documentation**](/docs/DOCUMENTATION.md).
-2. **Create a Pull Request:** Fork the [**submissions repository**](https://github.com/mlcommons/submissions_algorithms) and create a pull request with your algorithm.
-3. **Review and Evaluation:** The MLCommons Algorithms Working Group will review your PR. Based on its potential and our available resources, it may be selected for a **free, official evaluation** on our hardware.
-4. **See Your Results:** If selected, we will run your algorithm and add the results to the [**public leaderboard**](https://github.com/mlcommons/submissions_algorithms).
+Ready to see how your algorithm stacks up? Submit it to the official AlgoPerf
+leaderboard!
+
+1. **Develop Your Algorithm:** Create your training algorithm following the API
+ and "rules" described in our [**documentation**](/docs/DOCUMENTATION.md).
+2. **Create a Pull Request:** Fork the
+   [**submissions repository**](https://github.com/mlcommons/submissions_algorithms)
+   and create a pull request with your algorithm (see the sketch after this
+   list).
+3. **Review and Evaluation:** The MLCommons Algorithms Working Group will
+ review your PR. Based on its potential and our available resources, it may
+ be selected for a **free, official evaluation** on our hardware.
+4. **See Your Results:** If selected, we will run your algorithm and add the
+ results to the
+ [**public leaderboard**](https://github.com/mlcommons/submissions_algorithms).
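+
+For step 2, the standard GitHub fork-and-PR flow applies. The sketch below is
+only illustrative; `YOUR_USERNAME`, the branch name, and the file layout are
+placeholders, so follow the instructions in the submissions repository for the
+expected structure.
+
+```bash
+# Sketch of the fork-and-PR flow for the submissions repository.
+git clone https://github.com/YOUR_USERNAME/submissions_algorithms.git
+cd submissions_algorithms
+git checkout -b my-algorithm
+# ... add your submission files ...
+git add . && git commit -m "Add my training algorithm"
+git push origin my-algorithm
+# Then open a pull request against mlcommons/submissions_algorithms on GitHub.
+```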
## Rules, Documentation & FAQ
-We provide a technical documentation of the benchmark and answer frequently asked questions regarding the benchmarking protocol in a dedicated [**Documentation**](/docs/DOCUMENTATION.md) page. This includes which types of submissions are allowed, a description of the benchmark API, and the entire benchmarking protocol. Please ensure that your submission is compliant with these rules before submitting. Suggestions, clarifications, and questions can be raised via pull requests, by creating an issue, or by reaching out to the [**working group**](mailto:algorithms@mlcommons.org).
-
-For a detailed description and motivation of the initial benchmark design, please refer to our [**Benchmark Paper**](/docs/DOCUMENTATION.md#benchmark-paper).
-For the results of the first AlgoPerf competition, please refer to our [**Competition Results Paper**](/docs/DOCUMENTATION.md#competition-results-paper).
-See our [**AlgoPerf Leaderboard**](https://github.com/mlcommons/submissions_algorithms) for the latest results of the benchmark and the option to submit your algorithm.
+We provide technical documentation of the benchmark and answer frequently
+asked questions regarding the benchmarking protocol in a dedicated
+[**Documentation**](/docs/DOCUMENTATION.md) page. This includes which types of
+submissions are allowed, a description of the benchmark API, and the entire
+benchmarking protocol. Please ensure that your submission is compliant with
+these rules before submitting. Suggestions, clarifications, and questions can be
+raised via pull requests, by creating an issue, or by reaching out to the
+[**working group**](mailto:algorithms@mlcommons.org).
+
+For a detailed description and motivation of the initial benchmark design,
+please refer to our
+[**Benchmark Paper**](/docs/DOCUMENTATION.md#benchmark-paper). For the results
+of the first AlgoPerf competition, please refer to our
+[**Competition Results Paper**](/docs/DOCUMENTATION.md#competition-results-paper).
+See our
+[**AlgoPerf Leaderboard**](https://github.com/mlcommons/submissions_algorithms)
+for the latest results of the benchmark and the option to submit your algorithm.
## Contributing & Resources
-AlgoPerf is an open, community-driven project organized by the [MLCommons Algorithms Working Group](https://mlcommons.org/en/groups/research-algorithms/). Whether you want to submit an algorithm, report a bug, or help shape the future of the benchmark, we welcome your contributions.
-
-- 🏆 **Submit Your Algorithm:** Ready to compete? Create a pull request in the [**Submissions Repository**](https://github.com/mlcommons/submissions_algorithms).
-- 🐞 **Report a Bug:** Found an issue with the codebase? Please [**file an issue**](https://github.com/mlcommons/algorithmic-efficiency/issues) so we can take a look. This also includes any rules changes or clarifications you would like to see.
-- 🛠️ **Contribute to the Codebase:** We actively welcome pull requests! If you're interested in implementing new workloads, adding baselines, or fixing bugs please reach out to us. Our [**Contributing Guide**](/docs/CONTRIBUTING.md) offers further contributing guidelines and additional setup and workflow instructions.
-- 👥 **Influence the Benchmark:** To contribute to the benchmark's design and direction, please join the [**weekly working group meetings**](https://mlcommons.org/en/groups/research-algorithms/).
-- 💬 **Ask a Question:** Have a question or want to discuss ideas? Join the conversation on our [**Discord Server**](https://discord.gg/5FPXK7SMt6) or [**join our weekly meetings**](https://mlcommons.org/en/groups/research-algorithms/).
+AlgoPerf is an open, community-driven project organized by the
+[MLCommons Algorithms Working Group](https://mlcommons.org/en/groups/research-algorithms/).
+Whether you want to submit an algorithm, report a bug, or help shape the future
+of the benchmark, we welcome your contributions.
+
+- 🏆 **Submit Your Algorithm:** Ready to compete? Create a pull request in the
+ [**Submissions Repository**](https://github.com/mlcommons/submissions_algorithms).
+- 🐞 **Report a Bug:** Found an issue with the codebase? Please
+ [**file an issue**](https://github.com/mlcommons/algorithmic-efficiency/issues)
+ so we can take a look. This also includes any rules changes or
+ clarifications you would like to see.
+- 🛠️ **Contribute to the Codebase:** We actively welcome pull requests! If
+ you're interested in implementing new workloads, adding baselines, or fixing
+  bugs, please reach out to us. Our
+ [**Contributing Guide**](/docs/CONTRIBUTING.md) offers further contributing
+ guidelines and additional setup and workflow instructions.
+- 👥 **Influence the Benchmark:** To contribute to the benchmark's design and
+ direction, please join the
+ [**weekly working group meetings**](https://mlcommons.org/en/groups/research-algorithms/).
+- 💬 **Ask a Question:** Have a question or want to discuss ideas? Join the
+ conversation on our [**Discord Server**](https://discord.gg/5FPXK7SMt6) or
+ [**join our weekly meetings**](https://mlcommons.org/en/groups/research-algorithms/).
## Releases & Roadmap
-The AlgoPerf benchmark is an actively evolving project designed to keep pace with the rapidly changing field of machine learning. To ensure clarity and reproducibility, we have adopted a unified versioning system: codebase, rules, and leaderboard all share the same `Major.Minor` version. `Patch` versions may differ for minor updates.
-All results produced under the same `Major.Minor` version are comparable, making it easy to cite "`AlgoPerf v0.X`" and know exactly which set of rules, code, and submissions are being referenced.
-
-Here is an overview of our key releases and the future roadmap. For a detailed list of changes in each release, see our [**Changelog**](docs/CHANGELOG.md).
-
-- `v0.5` - Inaugural Competition
-  The benchmark as it was run for the first AlgoPerf competition in 2024. The key findings and analysis from this competition are detailed in our [**ICLR 2025 Results Paper**](https://openreview.net/forum?id=CtM5xjRSfm). It serves as a historical reference.
- - **Leaderboard:** Archived at [**AlgoPerf v0.5 Leaderboard**](https://github.com/mlcommons/submissions_algorithms/tree/main/previous_leaderboards/algoperf_v05).
- - **Rules:** The rules are archived at the [**AlgoPerf v0.5 Documentation**](https://github.com/mlcommons/algorithmic-efficiency/blob/v0.5.0/DOCUMENTATION.md).
-- `v0.6` - **Current Version**
-  The active and recommended version of the benchmark. It is an improved and streamlined version that fixes important bugs and modifying the benchmarking protocol based on the lessons learned from the competition. **This is the recommended version for all new submissions.**
-
- - **Key Changes:** (see the [Changelog](/docs/CHANGELOG.md) for details, including links to discussions on rule changes.)
- - A rolling leaderboard now allows for continuous submissions and updates.
- - Reduced computational cost via removing held-out workloads, 3 repetition studies (down from 5), and adjusted runtime budgets.
- - Includes important bug fixes (e.g., batch norm) and API improvements (e.g., `prepare_for_eval` function).
- - Migrating from `pmap` to `jit` in JAX for better performance and scalability.
- - **Leaderboard:** The active (but currently limited) leaderboard can be found at [**AlgoPerf v0.6 Leaderboard**](https://github.com/mlcommons/submissions_algorithms).
- - **Rules:** For the current set of rules see [**AlgoPerf v0.6 Documentation**](/docs/DOCUMENTATION.md).
-
-> 🏗️ `v1.0` (Future) - Planned Long-Term Support Release
->   This will be the next major release of the benchmark and a "long-term support" version, with the following **anticipated features:**
+The AlgoPerf benchmark is an actively evolving project designed to keep pace
+with the rapidly changing field of machine learning. To ensure clarity and
+reproducibility, we have adopted a unified versioning system: codebase, rules,
+and leaderboard all share the same `Major.Minor` version. `Patch` versions may
+differ for minor updates. All results produced under the same `Major.Minor`
+version are comparable, making it easy to cite "`AlgoPerf vX.Y`" and know
+exactly which set of rules, code, and submissions are being referenced.
+
+Here is an overview of our key releases and the future roadmap. For a detailed
+list of changes in each release, see our [**Changelog**](docs/CHANGELOG.md).
+
+- `v0.5` - Inaugural Competition
+  The benchmark as it was run for the first AlgoPerf competition in 2024. The
+  key findings and analysis from this competition are detailed in our
+  [**ICLR 2025 Results Paper**](https://openreview.net/forum?id=CtM5xjRSfm).
+  It serves as a historical reference.
+ - **Leaderboard:** Archived at
+ [**AlgoPerf v0.5 Leaderboard**](https://github.com/mlcommons/submissions_algorithms/tree/main/previous_leaderboards/algoperf_v05).
+ - **Rules:** The rules are archived at the
+ [**AlgoPerf v0.5 Documentation**](https://github.com/mlcommons/algorithmic-efficiency/blob/v0.5.0/DOCUMENTATION.md).
+- `v0.6` - This was an improved and streamlined version that fixed important
+ bugs and modified the benchmark protocol based on the lessons learned from
+ the competition.
+ - **Key Changes:**
+    - Reduced computational cost by removing held-out workloads, cutting the
+      number of repetition studies from 5 to 3, and adjusting runtime budgets.
+    - Included important bug fixes (e.g., batch norm) and API improvements
+      (e.g., the `prepare_for_eval` function).
+    - Migrated from `pmap` to `jit` in JAX for better performance and
+      scalability.
+
+> 🏗️ `v1.0` - **Long-Term Support Release**
+> This is the active version of the benchmark and the recommended starting
+> point for all new submissions.
>
-> - Adding a new language model (LM) workload.
-> - Stronger baselines, especially for the self-tuning leaderboard.
+> - **Key Changes:** (see the [Changelog](/docs/CHANGELOG.md) for details)
+>   - Introduced a new language model (LM) workload trained on the
+>     FineWeb-Edu dataset.
+> - Switched the benchmark hardware to 4xA100 GPUs.
+> - Made improvements to the submission API.
+> - **Planned:** We are currently planning to release baselines for the
+>   self-tuning leaderboard trained with this version. Stay tuned for updates
+>   on the
+>   [**AlgoPerf leaderboard**](https://github.com/mlcommons/submissions_algorithms).
+> - **Rules:** For the current set of rules see
+> [**AlgoPerf v1.0 Documentation**](/docs/DOCUMENTATION.md).
## Training Algorithm Collection
-This repository also provides a collection of implemented training algorithms with different purposes. These include [**submission templates**](./algorithms/template), [**development examples**](./algorithms/development_algorithms), [**target-setting algorithms**](./algorithms/target_setting_algorithms), [**historical baselines**](./algorithms/archived_paper_baselines), and [**current baselines**](./algorithms/baselines). For a detailed overview of these algorithms and their organization, please refer to the [`algorithms/README.md`](./algorithms/README.md) file. You can also find all benchmark submissions and their results on the official [**Leaderboard**](https://github.com/mlcommons/submissions_algorithms).
-These algorithms provide a starting point for developing your own training algorithm and are a great resource for understanding the AlgoPerf benchmark and its API.
+This repository also provides a collection of implemented training algorithms
+with different purposes. These include
+[**submission templates**](./algorithms/template),
+[**development examples**](./algorithms/development_algorithms),
+[**target-setting algorithms**](./algorithms/target_setting_algorithms),
+[**historical baselines**](./algorithms/archived_paper_baselines), and
+[**current baselines**](./algorithms/baselines). For a detailed overview of
+these algorithms and their organization, please refer to the
+[`algorithms/README.md`](./algorithms/README.md) file. You can also find all
+benchmark submissions and their results on the official
+[**Leaderboard**](https://github.com/mlcommons/submissions_algorithms). These
+algorithms provide a starting point for developing your own training algorithm
+and are a great resource for understanding the AlgoPerf benchmark and its API.
## Citing Our Work
-If you use the AlgoPerf benchmark, its codebase, or results in your research, please cite our papers.
+If you use the AlgoPerf benchmark, its codebase, or results in your research,
+please cite our papers.
**Benchmark Paper:**
-In this paper, we motivate, describe, and justify the _AlgoPerf: Training Algorithms_ benchmark.
+In this paper, we motivate, describe, and justify the *AlgoPerf: Training
+Algorithms* benchmark.
-> [Dahl, Schneider, Nado, et al.
-> **Benchmarking Neural Network Training Algorithms**
-> _arXiv 2306.07179_](http://arxiv.org/abs/2306.07179)
+> [Dahl, Schneider, Nado, et al. **Benchmarking Neural Network Training
+> Algorithms** *arXiv 2306.07179*](http://arxiv.org/abs/2306.07179)
```bibtex
@Misc{Dahl2023AlgoPerf,
@@ -199,7 +317,9 @@ In this paper, we motivate, describe, and justify the _AlgoPerf: Training Algori
In this paper, we analyze the results of the first AlgoPerf competition.
-> [Kasimbeg, Schneider, Eschenhagen, et al.
-> **Accelerating neural network training: An analysis of the AlgoPerf competition**
-> _ICLR 2025_](https://openreview.net/forum?id=CtM5xjRSfm)
+> [Kasimbeg, Schneider, Eschenhagen, et al. **Accelerating neural network
+> training: An analysis of the AlgoPerf competition**
+> *ICLR 2025*](https://openreview.net/forum?id=CtM5xjRSfm)
```bibtex
@inproceedings{Kasimbeg2025AlgoPerfResults,
@@ -213,9 +333,12 @@ url = {https://openreview.net/forum?id=CtM5xjRSfm}
## License
-The _AlgoPerf_ codebase is licensed under the [**Apache License 2.0**](/LICENSE.md). All AlgoPerf benchmark submissions must likewise be open-source under the same [**Apache License 2.0**](https://www.apache.org/licenses/LICENSE-2.0).
+The *AlgoPerf* codebase is licensed under the
+[**Apache License 2.0**](/LICENSE.md). All AlgoPerf benchmark submissions must
+likewise be open-source under the same
+[**Apache License 2.0**](https://www.apache.org/licenses/LICENSE-2.0).
----
+--------------------------------------------------------------------------------
MLCommons™ Algorithms Working Group • Join us!
From e45fdd31ff2cdad4f57a6172a9099cbc3bda0d63 Mon Sep 17 00:00:00 2001
From: Priya Kasimbeg
Date: Thu, 5 Feb 2026 23:54:52 +0000
Subject: [PATCH 2/3] add changelog
---
docs/CHANGELOG.md | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
index 419553cd6..066f3512d 100644
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -12,6 +12,15 @@ AlgoPerf uses a unified versioning scheme: codebase, rules, and leaderboard all
- _Codebase_: API improvements, bug fixes, or small non-breaking changes in the benchmark code could increment its `Patch` version as reflected in the `algoperf` package version.
- _Documentation/Rules_: Clarifications, typo fixes, or minor updates to the rules/documentation could increment its `Patch` version as shown in the documentation file.
+## [1.0.0] - 2026-02-05
+
+
+### Added
+- [Code, Rules] Migrated from 8xV100 to 4xA100 (40GB) and calibrated the workload runtime budgets for this new hardware ([PR](https://github.com/mlcommons/algorithmic-efficiency/pull/892)).
+- [Code, Rules] Added the `finewebedu_lm` workload, a decoder-only language model trained on the FineWeb-Edu 10B dataset ([PR](https://github.com/mlcommons/algorithmic-efficiency/pull/902)).
+- [Code] Support for changing dropout with the `model_fn` ([PR](https://github.com/mlcommons/algorithmic-efficiency/pull/884)).
+
+
## [0.6.0] - 2025-06-24
Improved and streamlined version of the benchmark which includes important bug fixes, API improvements and benchmark protocol changes following the lessons learned from the first competition.
From 1cb13c263381ba9bc5668a17c37fbe8cfcefd10c Mon Sep 17 00:00:00 2001
From: "Ahmed Khaled (rka97)"
Date: Fri, 6 Feb 2026 00:03:12 +0000
Subject: [PATCH 3/3] update more docs
---
docs/CONTRIBUTING.md | 306 +++++++----
docs/DOCUMENTATION.md | 1085 +++++++++++++++++++++++++++++----------
docs/GETTING_STARTED.md | 423 +++++++++------
3 files changed, 1258 insertions(+), 556 deletions(-)
diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md
index f635b977f..ae11154a9 100644
--- a/docs/CONTRIBUTING.md
+++ b/docs/CONTRIBUTING.md
@@ -2,61 +2,87 @@
## Table of Contents
-- [Contributing to MLCommons](#contributing-to-mlcommons)
-- [Setup for Contributing](#setup-for-contributing)
- - [Setting up a Linux VM on GCP](#setting-up-a-linux-vm-on-gcp)
- - [Installing GPU Drivers](#installing-gpu-drivers)
- - [Authentication for Google Cloud Container Registry](#authentication-for-google-cloud-container-registry)
-- [Installation](#installation)
-- [Docker Workflows](#docker-workflows)
- - [Pre-built Images on Google Cloud Container Registry](#pre-built-images-on-google-cloud-container-registry)
- - [Trigger Rebuild and Push of Maintained Images](#trigger-rebuild-and-push-of-maintained-images)
- - [Trigger Build and Push of Images on Other Branch](#trigger-build-and-push-of-images-on-other-branch)
- - [GCP Data and Experiment Integration](#gcp-data-and-experiment-integration)
- - [Downloading Data from GCP](#downloading-data-from-gcp)
- - [Saving Experiments to GCP](#saving-experiments-to-gcp)
- - [Getting Information from a Container](#getting-information-from-a-container)
- - [Mounting Local Repository](#mounting-local-repository)
-- [Submitting PRs](#submitting-prs)
-- [Testing](#testing)
- - [Style Testing](#style-testing)
- - [Unit and Integration Tests](#unit-and-integration-tests)
- - [Regression Tests](#regression-tests)
- - [Versioning](#versioning)
- - [Release Process](#release-process)
+- [Contributing to MLCommons](#contributing-to-mlcommons)
+- [Setup for Contributing](#setup-for-contributing)
+ - [Setting up a Linux VM on GCP](#setting-up-a-linux-vm-on-gcp)
+ - [Installing GPU Drivers](#installing-gpu-drivers)
+ - [Authentication for Google Cloud Container Registry](#authentication-for-google-cloud-container-registry)
+- [Installation](#installation)
+- [Docker Workflows](#docker-workflows)
+ - [Pre-built Images on Google Cloud Container Registry](#pre-built-images-on-google-cloud-container-registry)
+ - [Trigger Rebuild and Push of Maintained Images](#trigger-rebuild-and-push-of-maintained-images)
+ - [Trigger Build and Push of Images on Other Branch](#trigger-build-and-push-of-images-on-other-branch)
+ - [GCP Data and Experiment Integration](#gcp-data-and-experiment-integration)
+ - [Downloading Data from GCP](#downloading-data-from-gcp)
+ - [Saving Experiments to GCP](#saving-experiments-to-gcp)
+ - [Getting Information from a Container](#getting-information-from-a-container)
+ - [Mounting Local Repository](#mounting-local-repository)
+- [Submitting PRs](#submitting-prs)
+- [Testing](#testing)
+ - [Style Testing](#style-testing)
+ - [Unit and Integration Tests](#unit-and-integration-tests)
+ - [Regression Tests](#regression-tests)
+ - [Versioning](#versioning)
+ - [Release Process](#release-process)
## Contributing to MLCommons
-We invite everyone to look through our technical documentation and codebase and submit issues and pull requests, e.g. for changes, clarifications, or any bugs you might encounter. If you are interested in contributing to the work of the working group and influence the benchmark's design decisions, please [join the weekly meetings](https://mlcommons.org/en/groups/research-algorithms/) and consider becoming a member of the working group.
-
-The best way to contribute to the MLCommons is to get involved with one of our many project communities. You can find more information about getting involved with MLCommons on our [getting started page](https://mlcommons.org/en/get-involved/#getting-started).
-
-Generally we encourage people to become a MLCommons member if they wish to contribute to MLCommons projects, but outside pull requests are very welcome too.
-
-To get started contributing code, you or your organization needs to sign the MLCommons CLA found at the [MLC policies page](https://mlcommons.org/en/policies/). Once you or your organization has signed the corporate CLA, please fill out this [CLA sign up form](https://forms.gle/Ew1KkBVpyeJDuRw67) form to get your specific GitHub handle authorized so that you can start contributing code under the proper license.
-
-MLCommons project work is tracked with issue trackers and pull requests. Modify the project in your own fork and issue a pull request once you want other developers to take a look at what you have done and discuss the proposed changes. Ensure that cla-bot and other checks pass for your Pull requests.
+We invite everyone to look through our technical documentation and codebase and
+submit issues and pull requests, e.g. for changes, clarifications, or any bugs
+you might encounter. If you are interested in contributing to the work of the
+working group and influencing the benchmark's design decisions, please
+[join the weekly meetings](https://mlcommons.org/en/groups/research-algorithms/)
+and consider becoming a member of the working group.
+
+The best way to contribute to MLCommons is to get involved with one of our
+many project communities. You can find more information about getting involved
+with MLCommons on our
+[getting started page](https://mlcommons.org/en/get-involved/#getting-started).
+
+Generally, we encourage people to become an MLCommons member if they wish to
+contribute to MLCommons projects, but outside pull requests are very welcome
+too.
+
+To get started contributing code, you or your organization needs to sign the
+MLCommons CLA found at the
+[MLC policies page](https://mlcommons.org/en/policies/). Once you or your
+organization has signed the corporate CLA, please fill out the
+[CLA sign-up form](https://forms.gle/Ew1KkBVpyeJDuRw67) to get your
+specific GitHub handle authorized so that you can start contributing code under
+the proper license.
+
+MLCommons project work is tracked with issue trackers and pull requests. Modify
+the project in your own fork and issue a pull request once you want other
+developers to take a look at what you have done and discuss the proposed
+changes. Ensure that the cla-bot and other checks pass for your pull requests.
## Setup for Contributing
### Setting up a Linux VM on GCP
-If you want to run containers on GCP VMs or store and retrieve Docker images from the Google Cloud Container Registry, please read ahead.
-If you'd like to use a Linux VM, you will have to install the correct GPU drivers and the NVIDIA Docker toolkit.
-We recommmend to use the Deep Learning on Linux image. Further instructions are based on that.
+If you want to run containers on GCP VMs or store and retrieve Docker images
+from the Google Cloud Container Registry, please read ahead. If you'd like to
+use a Linux VM, you will have to install the correct GPU drivers and the NVIDIA
+Docker toolkit. We recommend using the Deep Learning on Linux image; the
+following instructions assume this image.
### Installing GPU Drivers
-You can use the `docker/scripts/cloud-startup.sh` as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit.
+You can use `docker/scripts/cloud-startup.sh` as a startup script for the VM.
+It automates the installation of the NVIDIA GPU drivers and the NVIDIA Docker
+toolkit.
### Authentication for Google Cloud Container Registry
-To access the Google Cloud Container Registry, you will have to authenticate to the repository whenever you use Docker.
-Use the gcloud credential helper as documented in the [Google Cloud documentation](https://cloud.google.com/artifact-registry/docs/docker/pushing-and-pulling#cred-helper).
+To access the Google Cloud Container Registry, you will have to authenticate to
+the repository whenever you use Docker. Use the gcloud credential helper as
+documented in the
+[Google Cloud documentation](https://cloud.google.com/artifact-registry/docs/docker/pushing-and-pulling#cred-helper).
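+
+For example (an illustrative sketch; the registry hostnames are the ones used
+elsewhere in this guide, adjust them if yours differ):
+
+```bash
+# Configure Docker to use gcloud credentials for the Artifact Registry hosts.
+gcloud auth configure-docker us-central1-docker.pkg.dev,europe-west4-docker.pkg.dev
+```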
## Installation
-If you have not installed the package and dependencies yet see [Installation](/README.md#installation).
+If you have not installed the package and its dependencies yet, see
+[Installation](/README.md#installation).
To use the development tools such as `pytest` or `pylint` use the `dev` option:
@@ -65,21 +91,24 @@ pip3 install -e '.[dev]'
pre-commit install
```
-To get an installation with the requirements for all workloads and development, use the argument `[full_dev]`.
+To get an installation with the requirements for all workloads and development,
+use the argument `[full_dev]`.
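+
+For example, mirroring the `dev` command above:
+
+```bash
+# Editable install with all workload and development dependencies.
+pip3 install -e '.[full_dev]'
+```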
## Docker Workflows
-We recommend developing in our Docker image to ensure a consistent environment between developing, testing and scoring submissions.
+We recommend developing in our Docker image to ensure a consistent environment
+between developing, testing and scoring submissions.
To get started see also:
-- [Installation with Docker](/GETTING_STARTED.md#docker)
-- [Running a submission inside a Docker Container](/GETTING_STARTED.md#run-your-submission-in-a-docker-container)
+- [Installation with Docker](/GETTING_STARTED.md#docker)
+- [Running a submission inside a Docker Container](/GETTING_STARTED.md#run-your-submission-in-a-docker-container)
### Pre-built Images on Google Cloud Container Registry
-If you want to maintain or use images stored on our Google Cloud Container Registry read this section.
-You will have to use an authentication helper to set up permissions to access the repository:
+If you want to maintain or use images stored on our Google Cloud Container
+Registry, read this section. You will have to use an authentication helper to
+set up permissions to access the repository:
```bash
ARTIFACT_REGISTRY_URL=us-central1-docker.pkg.dev
@@ -95,19 +124,20 @@ docker pull europe-west4-docker.pkg.dev/mlcommons-algoperf/algoperf-docker-repo/
The naming convention for `image_name` is `algoperf_<framework>_<branch>`.
Currently maintained images on the repository are:
-- `algoperf_jax_main`
-- `algoperf_pytorch_main`
-- `algoperf_both_main`
-- `algoperf_jax_dev`
-- `algoperf_pytorch_dev`
-- `algoperf_both_dev`
+- `algoperf_jax_main`
+- `algoperf_pytorch_main`
+- `algoperf_both_main`
+- `algoperf_jax_dev`
+- `algoperf_pytorch_dev`
+- `algoperf_both_dev`
To reference the pulled image you will have to use the full `image_path`, e.g.
`europe-west4-docker.pkg.dev/mlcommons-algoperf/algoperf-docker-repo/algoperf_jax_main`.
### Trigger Rebuild and Push of Maintained Images
-To build and push all images (`pytorch`, `jax`, `both`) on maintained branches (`dev`, `main`).
+To build and push all images (`pytorch`, `jax`, `both`) on the maintained
+branches (`dev`, `main`), run:
```bash
bash docker/build_docker_images.sh -b
@@ -117,17 +147,20 @@ bash docker/build_docker_images.sh -b
You can also use the above script to build images from a different branch.
-1. Push the branch to `mlcommons/algorithmic-efficiency` repository.
-2. Run
+1. Push the branch to `mlcommons/algorithmic-efficiency` repository.
+2. Run
- ```bash
- bash docker/build_docker_images.sh -b
- ```
+ ```bash
+ bash docker/build_docker_images.sh -b
+ ```
### GCP Data and Experiment Integration
-The Docker entrypoint script can transfer data to and from our GCP buckets on our internal GCP project. If you are an approved contributor you can get access to these resources to automatically download the datasets and upload experiment results.
-You can use these features by setting the `--internal_contributor` flag to 'true' for the Docker entrypoint script.
+The Docker entrypoint script can transfer data to and from our GCP buckets on
+our internal GCP project. If you are an approved contributor you can get access
+to these resources to automatically download the datasets and upload experiment
+results. You can use these features by setting the `--internal_contributor` flag
+to 'true' for the Docker entrypoint script.
### Downloading Data from GCP
@@ -147,13 +180,15 @@ docker run -t -d \
--internal_contributor true
```
-If `keep_container_alive` is `true` the main process on the container will persist after finishing the data download.
-This run command is useful if you are developing or debugging.
+If `keep_container_alive` is `true` the main process on the container will
+persist after finishing the data download. This run command is useful if you are
+developing or debugging.
### Saving Experiments to GCP
-If you set the internal collaborator mode to true
-experiments will also be automatically uploaded to our GCP bucket under `gs://mlcommons-runs/ /bin/bash
### Mounting Local Repository
-Rebuilding the docker image can become tedious if
-you are making frequent changes to the code.
-To have changes in your local copy of the algorithmic-efficiency repo be reflected inside the container you can mount the local repository with the `-v` flag.
+Rebuilding the docker image can become tedious if you are making frequent
+changes to the code. To have changes in your local copy of the
+algorithmic-efficiency repo reflected inside the container, you can mount the
+local repository with the `-v` flag.
```bash
docker run -t -d \
@@ -215,16 +251,19 @@ docker run -t -d \
## Submitting PRs
-New PRs will be merged on the dev branch by default, given that they pass the presubmits.
+New PRs will be merged into the `dev` branch by default, provided they pass the
+presubmit checks.
## Testing
-We run tests with GitHub Actions, configured in the [.github/workflows](.github/workflows/) folder.
+We run tests with GitHub Actions, configured in the
+[.github/workflows](.github/workflows/) folder.
### Style Testing
-We run formatting and linting tests via ruff on PRs. You can view and fix offending errors with these instructions.
-To run the below commands, use the versions installed via `pip install -e '.[dev]'`.
+We run formatting and linting checks via `ruff` on PRs. You can view and fix
+offending errors with the commands below, using the `ruff` version installed
+via `pip install -e '.[dev]'`.
To check whether your code is **formatted** correctly, run the following:
@@ -232,8 +271,9 @@ To check whether your code is **formatted** correctly, run the following:
ruff format --check
```
-To automatically fix formatting errors you can run `ruff format`, without the `--check` flag.
-(**WARNING**: this will edit your code, so it is suggested to make a git commit first!)
+To automatically fix formatting errors you can run `ruff format`, without the
+`--check` flag. (**WARNING**: this will edit your code, so it is suggested to
+make a git commit first!)
To check whether your code is **linted** correctly, run the following:
@@ -241,63 +281,101 @@ To check whether your code is **linted** correctly, run the following:
ruff check
```
-To automatically fix linting errors you can run `ruff check --fix`, with the additional `--fix` flag.
-(**WARNING**: this will edit your code, so it is suggested to make a git commit first!)
+To automatically fix linting errors, run `ruff check` with the additional
+`--fix` flag, i.e. `ruff check --fix`. (**WARNING**: this will edit your code,
+so it is suggested to make a git commit first!)
### Unit and Integration Tests
-We run unit tests and integration tests as part of the of github actions as well.
-You can also use `python tests/reference_algorithm_tests.py` to run a single model update and two model evals for each workload using the reference algorithm in `algorithms/target_setting_algorithms/`.
+We run unit and integration tests as part of the GitHub Actions workflows as
+well. You can also use `python tests/reference_algorithm_tests.py` to run a
+single model update and two model evals for each workload using the reference
+algorithm in `algorithms/target_setting_algorithms/`.
### Regression Tests
-We also have regression tests available in [.github/workflows/regression_tests.yml](.github/workflows/regression_tests.yml) that can be run semi-automatically.
-The regression tests are shorter end-to-end submissions run in a containerized environment across all 8 workloads, in both the JAX and PyTorch frameworks.
-The regression tests run on self-hosted runners and are triggered for pull requests that target the main branch. Typically these PRs will be from the `dev` branch
-so the tests will run containers based on images build from the `dev` branch.
-To run a regression test:
+We also have regression tests available in
+[.github/workflows/regression_tests.yml](.github/workflows/regression_tests.yml)
+that can be run semi-automatically. The regression tests are shorter end-to-end
+submissions run in a containerized environment across all 8 workloads, in both
+the JAX and PyTorch frameworks. The regression tests run on self-hosted runners
+and are triggered for pull requests that target the `main` branch. Typically
+these PRs will be from the `dev` branch, so the tests will run containers based
+on images built from the `dev` branch. To run a regression test:
-1. Build and upload latest Docker images from dev branch.
+1. Build and upload latest Docker images from dev branch.
- ```bash
- bash ~/algorithmic-efficiency/docker/build_docker_images.sh -b dev
- ```
+ ```bash
+ bash ~/algorithmic-efficiency/docker/build_docker_images.sh -b dev
+ ```
-2. Turn on the self-hosted runner.
-3. Run the self-hosted runner application for the runner to accept jobs.
-4. Open a pull request into mian to trigger the workflow.
+2. Turn on the self-hosted runner.
+3. Run the self-hosted runner application for the runner to accept jobs.
+4. Open a pull request into `main` to trigger the workflow.
-### Versioning
-AlgoPerf uses a unified versioning scheme: codebase, rules, and leaderboard all share the same `Major.Minor` version. `Patch` versions may differ for minor updates to each component. All results produced under the same `Major.Minor` version should be comparable. See the [README](../README.md#releases--roadmap) and [Changelog](CHANGELOG.md) for details.
-The package version is automatically determined by the `setuptools_scm` package based on the last git tag.
-It follows the structure `major.minor.patch` + `devN` where `N` is the number of commits since the last tag.
-It automatically increments the patch version (i.e. it guesses the next version) if there are commits after the last tag.
-Additionally, if there are uncommitted changes, the version will include a suffix separated by a `+` character and includes the last commit hash plus the date on dirt workdir (see [setuptools_scm's documentation](https://setuptools-scm.readthedocs.io/en/latest/extending/#setuptools_scmlocal_scheme) with the default version and local scheme).
-You can check what version `setuptools_scm` is creating by running `python -m setuptools_scm`.
+### Versioning
+
+AlgoPerf uses a unified versioning scheme: codebase, rules, and leaderboard all
+share the same `Major.Minor` version. `Patch` versions may differ for minor
+updates to each component. All results produced under the same `Major.Minor`
+version should be comparable. See the [README](../README.md#releases--roadmap)
+and [Changelog](CHANGELOG.md) for details.
+
+The package version is automatically determined by the `setuptools_scm` package
+based on the last git tag. It follows the structure `major.minor.patch` + `devN`
+where `N` is the number of commits since the last tag. It automatically
+increments the patch version (i.e. it guesses the next version) if there are
+commits after the last tag. Additionally, if there are uncommitted changes, the
+version will include a suffix, separated by a `+` character, containing the last
+commit hash plus the date of the dirty workdir (see
+[setuptools_scm's documentation](https://setuptools-scm.readthedocs.io/en/latest/extending/#setuptools_scmlocal_scheme)
+with the default version and local scheme). You can check what version
+`setuptools_scm` is creating by running `python -m setuptools_scm`.
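+
+For example, both checks can be run from a checkout (assuming `setuptools_scm`
+is available, e.g. via the `dev` extras, and the package is installed):
+
+```bash
+# Version setuptools_scm derives from the latest git tag and subsequent commits.
+python -m setuptools_scm
+
+# Version recorded in the installed algoperf package.
+python -c "import algoperf; print(algoperf.__version__)"
+```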
#### Release Process
The suggested workflow:
-- **Development:**
-
- - All changes will be on the `dev` (or `dev-0.X` or similar) branch. Only merge to `main` once we release.
- - For internal milestones, we could use pre-release labels like `-alpha.N`, `-beta.N` or `-rc.N`.
- - Iterative changes here, do not increment the version, since on this branch we are working _towards_ the next release.
- - All changes should be documented in the `CHANGELOG.md` for the upcoming version release. This includes changes in the code and the rules.
- - Do **not** manually edit version numbers in the codebase or `pyproject.toml`.
-
-- **Changes:** All changes that affect the results of the benchmark should result in increases to either the `Major` or `Minor` version. We could reserve increases to the `Major` version for larger changes like adding new workloads. Changes that do not affect the results of the benchmark should result in increases to the `Patch` version and could include the following:
-
- - _Codebase:_ Implement bug fixes, improvements, or new features. The git tag version automatically updates the `algoperf.__version__` of the package.
- - _Documentation/Rules:_ Updates like clarifications, typo fixes, or new content. Update the version in `docs/DOCUMENTATION.md` with the new version.
- - _Leaderboard:_ For example, adding a new submission, correcting typos, or adding details could result in updating the version as documented in the `submissions_algorithms` repo.
-
-- **Release new version:**
- - Check that `CHANGELOG.md` is up-to-date and complete.
- - Update the version in `docs/DOCUMENTATION.md` with the new version.
- - Update the release plan in the [README](../README.md#releases--roadmap) with the new version.
- - Merge `dev` or `dev-0.X` into `main`.
- - Tag release with new version in the GitHub UI. The package version is automatically updated to the new version. Once the package is installed, the version can be accessed as the package attribute `algoperf.__version__`, i.e. via `python -c "import algoperf; print(algoperf.__version__)"`.
+- **Development:**
+
+ - All changes will be on the `dev` (or `dev-0.X` or similar) branch. Only
+ merge to `main` once we release.
+ - For internal milestones, we could use pre-release labels like
+ `-alpha.N`, `-beta.N` or `-rc.N`.
+  - Iterative changes here do not increment the version, since on this
+    branch we are working *towards* the next release.
+ - All changes should be documented in the `CHANGELOG.md` for the upcoming
+ version release. This includes changes in the code and the rules.
+ - Do **not** manually edit version numbers in the codebase or
+ `pyproject.toml`.
+
+- **Changes:** All changes that affect the results of the benchmark should
+ result in increases to either the `Major` or `Minor` version. We could
+ reserve increases to the `Major` version for larger changes like adding new
+ workloads. Changes that do not affect the results of the benchmark should
+ result in increases to the `Patch` version and could include the following:
+
+ - *Codebase:* Implement bug fixes, improvements, or new features. The git
+ tag version automatically updates the `algoperf.__version__` of the
+ package.
+ - *Documentation/Rules:* Updates like clarifications, typo fixes, or new
+ content. Update the version in `docs/DOCUMENTATION.md` with the new
+ version.
+ - *Leaderboard:* For example, adding a new submission, correcting typos,
+ or adding details could result in updating the version as documented in
+ the `submissions_algorithms` repo.
+
+- **Release new version:**
+
+ - Check that `CHANGELOG.md` is up-to-date and complete.
+ - Update the version in `docs/DOCUMENTATION.md` with the new version.
+ - Update the release plan in the [README](../README.md#releases--roadmap)
+ with the new version.
+ - Merge `dev` or `dev-0.X` into `main`.
+  - Tag the release with the new version in the GitHub UI. The package version
+    is automatically updated to the new version. Once the package is installed,
+    the version can be accessed as the package attribute
+    `algoperf.__version__`, i.e. via
+    `python -c "import algoperf; print(algoperf.__version__)"`.
diff --git a/docs/DOCUMENTATION.md b/docs/DOCUMENTATION.md
index 49e738408..15ffc7e65 100644
--- a/docs/DOCUMENTATION.md
+++ b/docs/DOCUMENTATION.md
@@ -1,93 +1,178 @@
# MLCommons™ AlgoPerf: Documentation, Benchmarking Process & FAQs
-**Version:** 0.6.0 _(Last updated August 27, 2025)_
+**Version:** 1.0.0 *(Last updated February 5, 2026)*
> [!IMPORTANT]
>
-> **TL;DR:** The MLCommons™ **AlgoPerf: Training Algorithms benchmark is designed to find training algorithms that can train neural networks faster** by rigorously measuring how quickly they reach a specific performance target across a diverse set of deep learning workloads.
-> This document provides the technical documentation, benchmarking process, and FAQs for the AlgoPerf benchmark.
+> **TL;DR:** The MLCommons™ **AlgoPerf: Training Algorithms benchmark is
+> designed to find training algorithms that can train neural networks faster**
+> by rigorously measuring how quickly they reach a specific performance target
+> across a diverse set of deep learning workloads. This document provides the
+> technical documentation, benchmarking process, and FAQs for the AlgoPerf
+> benchmark.
## Table of Contents
-- [Introduction](#introduction)
- - [Motivation](#motivation)
- - [Overview](#overview)
-- [Benchmarking Process](#benchmarking-process)
- - [Submissions](#submissions)
- - [Submission API](#submission-api)
- - [Valid Submissions](#valid-submissions)
- - [Runtime Environment and Evaluation](#runtime-environment-and-evaluation)
- - [Tuning Rulesets](#tuning-rulesets)
- - [External Tuning Ruleset](#external-tuning-ruleset)
- - [Self-Tuning Ruleset](#self-tuning-ruleset)
- - [Workloads](#workloads)
- - [Recommended Qualification Set](#recommended-qualification-set)
- - [Scoring](#scoring)
- - [AlgoPerf Benchmark Score via Integrated Performance Profiles](#algoperf-benchmark-score-via-integrated-performance-profiles)
- - [Benchmarking Hardware](#benchmarking-hardware)
- - [Defining Target Performance and `max_runtime`](#defining-target-performance-and-max_runtime)
-- [Versioning Policy](#versioning-policy)
- - [Version Freeze](#version-freeze)
-- [License and Legal Requirements](#license-and-legal-requirements)
-- [FAQs](#faqs)
- - [Setup \& Platform](#setup--platform)
- - [Submitting](#submitting)
- - [Scoring \& Hardware](#scoring--hardware)
-- [Disclaimers](#disclaimers)
- - [Shared Data Pipelines between `JAX` and `PyTorch`](#shared-data-pipelines-between-jax-and-pytorch)
+- [Introduction](#introduction)
+ - [Motivation](#motivation)
+ - [Overview](#overview)
+- [Benchmarking Process](#benchmarking-process)
+ - [Submissions](#submissions)
+ - [Submission API](#submission-api)
+ - [Valid Submissions](#valid-submissions)
+ - [Runtime Environment and Evaluation](#runtime-environment-and-evaluation)
+ - [Tuning Rulesets](#tuning-rulesets)
+ - [External Tuning Ruleset](#external-tuning-ruleset)
+ - [Self-Tuning Ruleset](#self-tuning-ruleset)
+ - [Workloads](#workloads)
+ - [Recommended Qualification Set](#recommended-qualification-set)
+ - [Scoring](#scoring)
+ - [AlgoPerf Benchmark Score via Integrated Performance Profiles](#algoperf-benchmark-score-via-integrated-performance-profiles)
+ - [Benchmarking Hardware](#benchmarking-hardware)
+ - [Defining Target Performance and `max_runtime`](#defining-target-performance-and-max_runtime)
+- [Versioning Policy](#versioning-policy)
+ - [Version Freeze](#version-freeze)
+- [License and Legal Requirements](#license-and-legal-requirements)
+- [FAQs](#faqs)
+ - [Setup \& Platform](#setup--platform)
+ - [Submitting](#submitting)
+ - [Scoring \& Hardware](#scoring--hardware)
+- [Disclaimers](#disclaimers)
+ - [Shared Data Pipelines between `JAX` and `PyTorch`](#shared-data-pipelines-between-jax-and-pytorch)
## Introduction
### Motivation
-Neural networks are powerful models, but they need to be trained to be useful. Training cutting-edge machine learning (ML) models exceeds the compute budgets of many researchers and is a growing cost in industry.
-Additionally, when training neural nets, practitioners face many critical yet often opaque decisions: What optimizer to choose? How should its learning rate be tuned? What learning rate schedule should be used? These choices can make or break training, yet the community has lacked a clear, standardized way to identify the state of the art.
-
-To reduce the compute and potentially environmental cost of machine learning models, as well as provide guidance for practitioners, we need more scientifically sound methods for evaluating training speedups due to new algorithms.
-
-Unlike benchmarks focused on hardware or model architecture, AlgoPerf isolates the training algorithm itself, which includes the optimizer, regularization, data selection, and hyperparameters like the learning rate schedule. By standardizing the benchmarking process, AlgoPerf offers a meaningful apples-to-apples comparison of training algorithms.
-
-This document focuses on the **Training Algorithm Track** of the _AlgoPerf benchmark_.
+Neural networks are powerful models, but they need to be trained to be useful.
+Training cutting-edge machine learning (ML) models exceeds the compute budgets
+of many researchers and is a growing cost in industry. Additionally, when
+training neural nets, practitioners face many critical yet often opaque
+decisions: What optimizer to choose? How should its learning rate be tuned? What
+learning rate schedule should be used? These choices can make or break training,
+yet the community has lacked a clear, standardized way to identify the state of
+the art.
+
+To reduce the compute and potentially environmental cost of machine learning
+models, as well as provide guidance for practitioners, we need more
+scientifically sound methods for evaluating training speedups due to new
+algorithms.
+
+Unlike benchmarks focused on hardware or model architecture, AlgoPerf isolates
+the training algorithm itself, which includes the optimizer, regularization,
+data selection, and hyperparameters like the learning rate schedule. By
+standardizing the benchmarking process, AlgoPerf offers a meaningful
+apples-to-apples comparison of training algorithms.
+
+This document focuses on the **Training Algorithm Track** of the *AlgoPerf
+benchmark*.
### Overview
-The **AlgoPerf: Training Algorithms benchmark** challenges participants to submit training algorithms that accelerate the training of neural networks. The goal is to reach a pre-defined performance target in the shortest possible time ("time-to-result") across a diverse set of workloads. The benchmark is designed to identify general-purpose training algorithms, such as new optimizers, data selection methods, regularization techniques, etc., that provide practical speedups for the broader ML community.
+The **AlgoPerf: Training Algorithms benchmark** challenges participants to
+submit training algorithms that accelerate the training of neural networks. The
+goal is to reach a pre-defined performance target in the shortest possible time
+("time-to-result") across a diverse set of workloads. The benchmark is designed
+to identify general-purpose training algorithms, such as new optimizers, data
+selection methods, regularization techniques, etc., that provide practical
+speedups for the broader ML community.
The benchmarking process follows these **key principles**:
-- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must train a set of [**fixed models**](#workloads) to a pre-defined validation performance target as fast as possible. All submissions use the same model architecture and are run on the same [**standardized hardware**](#benchmarking-hardware) (currently `4x NVIDIA A100 GPUs`). This isolates the training algorithm's performance and allows a fair apples-to-apples comparison.
-- ⏱️ **Time-To-Result:** Submissions are evaluated based on the total wall-clock time required to reach the target, rewarding practical and efficient algorithms.
-- 🧠 **Diverse Workloads:** The benchmark includes [**8 diverse deep learning workloads**](#workloads) across domains like image classification, speech recognition, and machine translation. A submission's score is computed by aggregating its performance across all workloads, using [**performance profiles**](#algoperf-benchmark-score-via-integrated-performance-profiles), to ensure general-purpose algorithms.
-- 📦 **Fully-Specified Algorithms:** Submissions must be [**complete procedures**](#submission-api) and thus hyperparameter tuning is treated as part of the algorithm. Depending on the [**ruleset**](#tuning-rulesets), submissions may use parallel tuning resources. This ensures that the benchmark measures the _total_ practical cost of a training algorithm and provides practitioners with a complete method, eliminating the guesswork of how to apply it.
-
-To participate, you [**submit a training algorithm**](/README.md#how-to-submit) by implementing a specific set of functions within our API, i.e. the [**submission functions**](#submission-api). All other components, including the model architecture, loss function, and evaluation logic, are fixed. This ensures that any performance gains are directly attributable to your algorithmic innovations.
+- 🎯 **Fixed Target, Model & Hardware:** Submitted training algorithms must
+ train a set of [**fixed models**](#workloads) to a pre-defined validation
+ performance target as fast as possible. All submissions use the same model
+ architecture and are run on the same
+ [**standardized hardware**](#benchmarking-hardware) (currently `4x NVIDIA
+ A100 GPUs`). This isolates the training algorithm's performance and allows a
+ fair apples-to-apples comparison.
+- ⏱️ **Time-To-Result:** Submissions are evaluated based on the total
+ wall-clock time required to reach the target, rewarding practical and
+ efficient algorithms.
+- 🧠 **Diverse Workloads:** The benchmark includes
+ [**9 diverse deep learning workloads**](#workloads) across domains like
+ image classification, speech recognition, and machine translation. A
+ submission's score is computed by aggregating its performance across all
+ workloads, using
+ [**performance profiles**](#algoperf-benchmark-score-via-integrated-performance-profiles),
+ to ensure general-purpose algorithms.
+- 📦 **Fully-Specified Algorithms:** Submissions must be
+ [**complete procedures**](#submission-api) and thus hyperparameter tuning is
+ treated as part of the algorithm. Depending on the
+ [**ruleset**](#tuning-rulesets), submissions may use parallel tuning
+ resources. This ensures that the benchmark measures the *total* practical
+ cost of a training algorithm and provides practitioners with a complete
+ method, eliminating the guesswork of how to apply it.
+
+To participate, you [**submit a training algorithm**](/README.md#how-to-submit)
+by implementing a specific set of functions within our API, i.e. the
+[**submission functions**](#submission-api). All other components, including the
+model architecture, loss function, and evaluation logic, are fixed. This ensures
+that any performance gains are directly attributable to your algorithmic
+innovations.
Submissions can be entered under two distinct rulesets:
-1. **External Tuning Ruleset:** This ruleset permits a limited, automated, parallel hyperparameter search for each workload, where the search space is defined by the submitter but must be the same for all workloads. A submission's workload score uses only the fastest tuning trial to reach the target.
-2. **Self-Tuning Ruleset:** This ruleset is for hyperparameter-free or fully autonomous algorithms. All workload adaptations or hyperparameter tuning must be performed by the algorithm "on the clock" during a single training run.
+1. **External Tuning Ruleset:** This ruleset permits a limited, automated,
+ parallel hyperparameter search for each workload, where the search space is
+ defined by the submitter but must be the same for all workloads. A
+ submission's workload score uses only the fastest tuning trial to reach the
+ target.
+2. **Self-Tuning Ruleset:** This ruleset is for hyperparameter-free or fully
+ autonomous algorithms. All workload adaptations or hyperparameter tuning
+ must be performed by the algorithm "on the clock" during a single training
+ run.
-A core tenet of the benchmark is to foster the development of broadly applicable methods. Submissions must be able to generalize and are prohibited from using logic or pre-computed solutions specific to any single workload.
+A core tenet of the benchmark is to foster the development of broadly applicable
+methods. Submissions must be able to generalize and are prohibited from using
+logic or pre-computed solutions specific to any single workload.
## Benchmarking Process
-The following sections provide the complete technical specifications of the benchmarking process, starting with what constitutes a [**Submission**](#submissions), followed by the two rulesets handling [**Hyperparameter Tuning**](#tuning-rulesets). The [**Workloads**](#workloads) section outlines the deep learning workloads (i.e. models, datasets, loss functions, etc.) used in the benchmark. Finally, the [**Scoring**](#scoring) section describes the process of computing a submission's final, scalar AlgoPerf score (as well as alternative scoring metrics).
+The following sections provide the complete technical specifications of the
+benchmarking process, starting with what constitutes a
+[**Submission**](#submissions), followed by the two rulesets handling
+[**Hyperparameter Tuning**](#tuning-rulesets). The [**Workloads**](#workloads)
+section outlines the deep learning workloads (i.e. models, datasets, loss
+functions, etc.) used in the benchmark. Finally, the [**Scoring**](#scoring)
+section describes the process of computing a submission's final, scalar AlgoPerf
+score (as well as alternative scoring metrics).
### Submissions
-A submission to the _AlgoPerf_ benchmark consists of a `submission.py` file that implements a set of Python functions that define your custom training algorithm. This code will be called by the benchmark harness that manages the overall training and evaluation loop.
-The core idea is that a submission replaces specific parts of a standard training pipeline with its own logic to train the _AlgoPerf_ workloads to the target performance as quickly as possible, while adhering to the benchmark's rules.
+A submission to the *AlgoPerf* benchmark consists of a `submission.py` file that
+implements a set of Python functions that define your custom training algorithm.
+This code will be called by the benchmark harness that manages the overall
+training and evaluation loop. The core idea is that a submission replaces
+specific parts of a standard training pipeline with its own logic to train the
+*AlgoPerf* workloads to the target performance as quickly as possible, while
+adhering to the benchmark's rules.
-This section details the functions you must implement (the [**Submission API**](#submission-api)), the most important functions and data provided by the benchmark environment ([**fixed functions**](#fixed-functions)), the [**rules to create a valid submission**](#valid-submissions), as well as the [**runtime environment and evaluation procedure**](#runtime-environment-and-evaluation).
+This section details the functions you must implement (the
+[**Submission API**](#submission-api)), the most important functions and data
+provided by the benchmark environment ([**fixed functions**](#fixed-functions)),
+the [**rules to create a valid submission**](#valid-submissions), as well as the
+[**runtime environment and evaluation procedure**](#runtime-environment-and-evaluation).
#### Submission API
-The submission functions are the [`get_batch_size`](#get_batch_size), [`init_optimizer_state`](#init_optimizer_state), [`update_params`](#update_params), [`prepare_for_eval`](#prepare_for_eval), and [`data_selection`](#data_selection) functions. These functions are the only ones that submitters are allowed to modify.
-All other functions are [**fixed functions**](#fixed-functions) and contain among other things the `step_hint`, `_build_input_queue`, `init_model_fn`, `model_fn`, or `loss_fn` functions.
-Although a submission might access these fixed functions, e.g., to re-initialize the model after a failed training effort, it is not allowed to modify them.
-The trained model will be evaluated in a separate step that does not call any of the submitted code.
-
-> 💡 In principle, submissions are allowed to use the available hardware systems in any data- or model-parallel manner they desire, within the constraints of the submission function APIs. However, in practice, model-parallelism may not be possible with the API. Submitters are allowed to access any framework-specific device information necessary to exploit the hardware.
+The submission functions are the [`get_batch_size`](#get_batch_size),
+[`init_optimizer_state`](#init_optimizer_state),
+[`update_params`](#update_params), [`prepare_for_eval`](#prepare_for_eval), and
+[`data_selection`](#data_selection) functions. These functions are the only ones
+that submitters are allowed to modify. All other functions are
+[**fixed functions**](#fixed-functions); these include, among others, the
+`step_hint`, `_build_input_queue`, `init_model_fn`, `model_fn`, and `loss_fn`
+functions. Although a submission might access these fixed functions, e.g., to
+re-initialize the model after a failed training effort, it is not allowed to
+modify them. The trained model will be evaluated in a separate step that does
+not call any of the submitted code.
+
+> 💡 In principle, submissions are allowed to use the available hardware systems
+> in any data- or model-parallel manner they desire, within the constraints of
+> the submission function APIs. However, in practice, model-parallelism may not
+> be possible with the API. Submitters are allowed to access any
+> framework-specific device information necessary to exploit the hardware.
##### `get_batch_size`
@@ -97,13 +182,22 @@ def get_batch_size(workload_name: str) -> int
**Purpose:** To specify the training batch size for a given workload.
-- This function allows submitters to define a different batch size for each workload to ensure that the training does not run out of memory.
-- For example, in advance, submitters can determine, for each workload, the largest batch size that fits into memory of the [benchmarking hardware](#benchmarking-hardware).
-- Called once per workload before training begins.
+- This function allows submitters to define a different batch size for each
+ workload to ensure that the training does not run out of memory.
+- For example, in advance, submitters can determine, for each workload, the
+ largest batch size that fits into memory of the
+ [benchmarking hardware](#benchmarking-hardware).
+- Called once per workload before training begins.
> [!NOTE]
>
-> This does not change the _evaluation batch size_ (i.e., the batch size used during the evaluation phase). By design, submitters are not allowed to modify the evaluation batch size, which is set by the benchmarking codebase. However, you can file an issue if you believe that the evaluation batch size of a particular workload is set inappropriately. The working group will review this request and consider adjusting the evaluation batch size in the benchmarking codebase, thus affecting all submitters equally.
+> This does not change the *evaluation batch size* (i.e., the batch size used
+> during the evaluation phase). By design, submitters are not allowed to modify
+> the evaluation batch size, which is set by the benchmarking codebase. However,
+> you can file an issue if you believe that the evaluation batch size of a
+> particular workload is set inappropriately. The working group will review this
+> request and consider adjusting the evaluation batch size in the benchmarking
+> codebase, thus affecting all submitters equally.
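+
+For illustration only, a minimal `get_batch_size` could look up batch sizes
+that were determined offline on the benchmarking hardware. The workload names
+and values below are placeholders, not recommendations:
+
+```python
+# Non-normative sketch: per-workload batch sizes determined offline.
+_BATCH_SIZES = {
+    'imagenet_resnet': 1024,
+    'librispeech_conformer': 256,
+    'wmt': 128,
+}
+_DEFAULT_BATCH_SIZE = 128
+
+
+def get_batch_size(workload_name: str) -> int:
+  """Returns the training batch size to use for the given workload."""
+  return _BATCH_SIZES.get(workload_name, _DEFAULT_BATCH_SIZE)
+```
+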
##### `init_optimizer_state`
@@ -117,10 +211,16 @@ def init_optimizer_state(
) -> initial_optimizer_state
```
-**Purpose:** To initialize the optimizer state, i.e., momentum buffers or defining learning rate schedules.
+**Purpose:** To initialize the optimizer state, i.e., momentum buffers or
+defining learning rate schedules.
-- It does not involve the [initialization for the model parameters](#fixed-functions), which in this benchmark is considered a fixed function.
-- The optimizer state is a dictionary (`Dict[str, Any]`). For a PyTorch submission, any value in this dictionary which is a class instance with internal state has to have a `state_dict()` method implemented to be stored correctly at the training checkpoints.
+- It does not involve the
+ [initialization for the model parameters](#fixed-functions), which in this
+ benchmark is considered a fixed function.
+- The optimizer state is a dictionary (`Dict[str, Any]`). For a PyTorch
+ submission, any value in this dictionary which is a class instance with
+ internal state has to have a `state_dict()` method implemented to be stored
+ correctly at the training checkpoints.
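+
+As a non-normative PyTorch sketch, the optimizer state could simply bundle a
+standard optimizer in a dictionary. The argument names mirror the
+`init_optimizer_state` signature above and the hyperparameter fields
+(`learning_rate`, `weight_decay`) are assumptions that a submission would
+define in its own search space:
+
+```python
+import torch
+
+
+def init_optimizer_state(workload, model_params, model_state, hyperparameters,
+                         rng):
+  del workload, model_state, rng  # Unused in this sketch.
+  # For PyTorch submissions, model_params is the torch.nn.Module, so
+  # .parameters() is available. The optimizer implements state_dict() and can
+  # therefore be stored correctly at the training checkpoints.
+  optimizer = torch.optim.AdamW(
+      model_params.parameters(),
+      lr=hyperparameters.learning_rate,
+      weight_decay=hyperparameters.weight_decay)
+  return {'optimizer': optimizer}
+```
+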
##### `update_params`
@@ -141,19 +241,40 @@ def update_params(
) -> (updated_optimizer_state, updated_params, updated_model_state)
```
-**Purpose:** To perform a single training step, i.e., update the model parameters and optimizer state.
-
-- Inside this function, you will typically call the workload's `loss_fn` and `model_fn` to perform a forward and backward pass to get gradients.
- - Uses the `model_fn` of the `workload` in order to decouple the loss from the model so that model outputs (forward passes) can be reused (by storing them in the optimizer state).
-- The fixed `init_model_fn` can optionally be called during training, for example, to reinitialize the model after a failed training effort.
-- **A call to this function will be considered a step**. The time between a call to this function and the next call to this function will be considered the per-step time.
-- A submission can access the elapsed training time and get further information about the evaluation through `train_state`. It may also access the target evaluation metric via the `workload` variable.
-- `current_param_container` is the same kind of nested structure as used by `model_fn` which constitutes a nested collection of `float32` arrays, each endowed with information about what kind of parameter that array represents stored in a parallel structure of `current_params_types`.
- - Parameter kind is one of a known list of types, e.g. `{"weights", "biases", "embeddings", "conv_weight", "batch_norm_scale", "batch_norm_bias", ...}`.
-- `model_state` holds auxiliary state necessary for some models, such as the current batch norm statistics.
-- The loss function will be one of a small set of known possibilities and the update function is allowed to branch on the `loss_type` enum/name.
-- The `loss_fn` produces a loss per example and a summed loss (both only for one device), which both can be used.
-- Cannot modify the given hyperparameters in a workload-conditional way (please see the [Valid Submissions](#valid-submissions) section). This rule is intended to prohibit circumventing the tuning rules by looking up a pre-tuned optimal set of hyperparameters for each workload. It is not intended to prohibit line searches and other similar techniques.
+**Purpose:** To perform a single training step, i.e., update the model
+parameters and optimizer state.
+
+- Inside this function, you will typically call the workload's `loss_fn` and
+ `model_fn` to perform a forward and backward pass to get gradients.
+  - Using the workload's `model_fn` decouples the loss from the model, so
+    that model outputs (forward passes) can be reused, e.g. by storing them
+    in the optimizer state.
+- The fixed `init_model_fn` can optionally be called during training, for
+ example, to reinitialize the model after a failed training effort.
+- **A call to this function will be considered a step**. The time between a
+ call to this function and the next call to this function will be considered
+ the per-step time.
+- A submission can access the elapsed training time and get further
+ information about the evaluation through `train_state`. It may also access
+ the target evaluation metric via the `workload` variable.
+- `current_param_container` is the same kind of nested structure as used by
+  `model_fn`: a nested collection of `float32` arrays. Information about what
+  kind of parameter each array represents is stored in a parallel structure,
+  `current_params_types`.
+ - Parameter kind is one of a known list of types, e.g. `{"weights",
+ "biases", "embeddings", "conv_weight", "batch_norm_scale",
+ "batch_norm_bias", ...}`.
+- `model_state` holds auxiliary state necessary for some models, such as the
+ current batch norm statistics.
+- The loss function will be one of a small set of known possibilities and the
+ update function is allowed to branch on the `loss_type` enum/name.
+- The `loss_fn` produces a per-example loss and a summed loss (both computed
+  only for one device); either can be used.
+- Cannot modify the given hyperparameters in a workload-conditional way
+ (please see the [Valid Submissions](#valid-submissions) section). This rule
+ is intended to prohibit circumventing the tuning rules by looking up a
+ pre-tuned optimal set of hyperparameters for each workload. It is not
+ intended to prohibit line searches and other similar techniques.
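+
+To make the flow concrete, here is a deliberately simple, non-normative
+PyTorch sketch of `update_params`. The parameter list mirrors the signature
+above, and the keyword arguments passed to the fixed `model_fn` and `loss_fn`
+are assumptions that should be checked against the full signatures in the
+[Fixed Functions](#fixed-functions) section:
+
+```python
+from algoperf import spec  # Assumed import path for the benchmark API.
+
+
+def update_params(workload, current_param_container, current_params_types,
+                  model_state, hyperparameters, batch, loss_type,
+                  optimizer_state, eval_results, global_step, rng,
+                  train_state=None):
+  del current_params_types, hyperparameters, loss_type  # Unused here.
+  del eval_results, global_step, train_state
+  optimizer = optimizer_state['optimizer']
+  optimizer.zero_grad()
+  # Forward pass through the fixed model_fn: returns the logits (before the
+  # output activation) and any updated auxiliary state (e.g. batch norm).
+  logits, new_model_state = workload.model_fn(
+      params=current_param_container,
+      augmented_and_preprocessed_input_batch=batch,
+      model_state=model_state,
+      mode=spec.ForwardPassMode.TRAIN,
+      rng=rng,
+      update_batch_norm=True)
+  # The fixed loss_fn returns a dict with 'summed', 'n_valid_examples', and
+  # 'per_example' entries (per device, not synced across devices).
+  loss_dict = workload.loss_fn(batch['targets'], logits)
+  loss = loss_dict['summed'] / loss_dict['n_valid_examples']
+  loss.backward()
+  optimizer.step()
+  return optimizer_state, current_param_container, new_model_state
+```
+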
##### `prepare_for_eval`
@@ -172,15 +293,24 @@ def prepare_for_eval(
) -> (updated_optimizer_state, updated_params, updated_model_state)
```
-**Purpose:** To prepare the model for evaluation, e.g., for swapping model parameters.
-
-- Arguments are the same as `update_params`, with the only exception of `batch`.
-- This function is called when a submission is deemed eligible for an evaluation (see [Evaluation during training](#evaluation-during-training) section).
- - The call to `prepare_for_eval` is timed and its runtime accumulates to the overall submission time.
- - The returned model parameters are evaluated on the validation and test sets, provided that the accumulated submission time does not exceed the maximum runtime after this function call (else the evaluation is skipped and the run is terminated).
-- This API supports Polyak averaging and similar methods that implement moving averages of model parameters.
-- Allowed to update model state and model parameters.
-- Allowed to update state for the optimizer.
+**Purpose:** To prepare the model for evaluation, e.g., for swapping model
+parameters.
+
+- Arguments are the same as `update_params`, with the only exception of
+ `batch`.
+- This function is called when a submission is deemed eligible for an
+ evaluation (see [Evaluation during training](#evaluation-during-training)
+ section).
+ - The call to `prepare_for_eval` is timed and its runtime accumulates to
+ the overall submission time.
+ - The returned model parameters are evaluated on the validation and test
+ sets, provided that the accumulated submission time does not exceed the
+ maximum runtime after this function call (else the evaluation is skipped
+ and the run is terminated).
+- This API supports Polyak averaging and similar methods that implement moving
+ averages of model parameters.
+- Allowed to update model state and model parameters.
+- Allowed to update state for the optimizer.
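+
+As a non-normative sketch of the Polyak-averaging use case, `prepare_for_eval`
+could swap an exponential moving average (EMA) of the weights into the
+parameter container before evaluation. The `ema_params` entry is an assumed
+convention that the submission itself would maintain in `optimizer_state`
+during `update_params`; the parameter list mirrors `update_params` without
+`batch`:
+
+```python
+import torch
+
+
+def prepare_for_eval(workload, current_param_container, current_params_types,
+                     model_state, hyperparameters, loss_type, optimizer_state,
+                     eval_results, global_step, rng):
+  del workload, current_params_types, hyperparameters, loss_type  # Unused.
+  del eval_results, global_step, rng
+  ema_params = optimizer_state.get('ema_params')
+  if ema_params is not None:
+    # Copy the averaged weights into the parameters that will be evaluated.
+    with torch.no_grad():
+      for param, ema_param in zip(current_param_container.parameters(),
+                                  ema_params):
+        param.copy_(ema_param)
+  return optimizer_state, current_param_container, model_state
+```
+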
##### `data_selection`
@@ -199,16 +329,27 @@ def data_selection(
**Purpose:** To select a subset of the training data for the next training step.
-- `input_queue` can yield up to the number of elements in the training dataset
-- Want to allow for submitters to construct their own data batches from the dataset
-- Submissions are allowed to arbitrarily modify the input examples, as long as the modifications are sufficiently generic to be applicable to any workload
-- This is only called on the training inputs. **No submitted code will be called during evaluation.**
-- Examples of data selection methods include _data echoing_, _curriculum learning_, _bootstrapping_, or _biased sampling_ (based on loss values, so need to store the forward pass in the `optimizer_state`, potentially forward pass of a cheaper proxy model).
+- `input_queue` can yield up to the number of elements in the training
+  dataset.
+- The goal is to allow submitters to construct their own data batches from
+  the dataset.
+- Submissions are allowed to arbitrarily modify the input examples, as long as
+  the modifications are sufficiently generic to be applicable to any workload.
+- This is only called on the training inputs. **No submitted code will be
+  called during evaluation.**
+- Examples of data selection methods include *data echoing*, *curriculum
+  learning*, *bootstrapping*, or *biased sampling* (based on loss values,
+  which requires storing the forward pass, potentially of a cheaper proxy
+  model, in the `optimizer_state`).
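+
+The simplest valid `data_selection` performs no selection at all and simply
+draws the next batch from the fixed input queue; data echoing, filtering, or
+re-weighting would build on this pattern. A non-normative sketch, with
+argument names mirroring the signature above:
+
+```python
+def data_selection(workload, input_queue, optimizer_state,
+                   current_param_container, model_state, hyperparameters,
+                   global_step, rng):
+  # No selection logic: just forward the next batch from the input queue.
+  del workload, optimizer_state, current_param_container, model_state
+  del hyperparameters, global_step, rng
+  return next(input_queue)
+```
+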
##### Fixed Functions
-Any function that is not part of the [**Submission API**](#submission-api) (and thus a submission function) is considered a fixed function, which submitters are **not** allowed to modify.
-Below, we describe some of the fixed functions to provide a better understanding of the _AlgoPerf_ benchmark API. With the exception of `_build_input_queue`, submitters can call any of these functions (along with any public function in the provided `workload` instance) at any time in their submission functions.
+Any function that is not part of the [**Submission API**](#submission-api) (and
+thus a submission function) is considered a fixed function, which submitters are
+**not** allowed to modify. Below, we describe some of the fixed functions to
+provide a better understanding of the *AlgoPerf* benchmark API. With the
+exception of `_build_input_queue`, submitters can call any of these functions
+(along with any public function in the provided `workload` instance) at any time
+in their submission functions.
###### Step hint
@@ -217,8 +358,16 @@ Below, we describe some of the fixed functions to provide a better understanding
def step_hint(self) -> int
```
-- The `step_hint` function gives the number of global steps the baseline algorithm can perform within the `max_runtime` on a workload. The `step_hint` is therefore dependent on the workload and its (current) `max_runtime`. Note that the baseline algorithms may have reached the target in fewer steps than this, but these were the max number of steps the baseline algorithms used for their learning rate schedules. Submitters can use this to help specify learning rate (or other) schedules.
-- The `step_hint` is only a hint. There is no need to use it at all. However, it is often more convenient, e.g. to define the learning rate schedule in terms of the number of steps (instead of the runtime).
+- The `step_hint` function gives the number of global steps the baseline
+ algorithm can perform within the `max_runtime` on a workload. The
+ `step_hint` is therefore dependent on the workload and its (current)
+  `max_runtime`. Note that the baseline algorithms may have reached the target
+  in fewer steps than this; it is simply the maximum number of steps the
+  baseline algorithms used for their learning rate schedules. Submitters can
+  use this to help specify learning rate (or other) schedules.
+- The `step_hint` is only a hint. There is no need to use it at all. However,
+ it is often more convenient, e.g. to define the learning rate schedule in
+ terms of the number of steps (instead of the runtime).
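+
+For example, a submission might convert the workload's `step_hint` into a
+step-based warmup plus cosine-decay learning rate schedule. The 5% warmup
+fraction below is an arbitrary illustrative choice, not a recommendation:
+
+```python
+import math
+
+
+def lr_at_step(step: int, base_lr: float, total_steps: int) -> float:
+  """Warmup + cosine decay schedule defined in terms of steps."""
+  warmup_steps = max(1, int(0.05 * total_steps))
+  if step < warmup_steps:
+    return base_lr * step / warmup_steps
+  progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
+  return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))
+
+
+# Inside a submission, total_steps would come from the workload's step_hint.
+```
+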
###### Data augmentation and preprocessing
@@ -232,7 +381,12 @@ def _build_input_queue(
) -> Iterator[Dict[str, Tensor]]
```
-- The `_build_input_queue` function will be called to produce the iterator over batches that the submitted data selection function consumes. It is responsible for all data reading, shuffling, repeating, preprocessing, and batching. Note that for JAX this should return an iterator over tensors of shape `(num_devices, per_device_batch_size, ...)`, and for PyTorch this should return tensors of shape `(per_device_batch_size, ...)` (assuming PyTorch's [DDP](https://pytorch.org/docs/stable/notes/ddp.html) is used).
+- The `_build_input_queue` function will be called to produce the iterator
+ over batches that the submitted data selection function consumes. It is
+ responsible for all data reading, shuffling, repeating, preprocessing, and
+ batching. This should return tensors of shape `(per_device_batch_size, ...)`
+ (assuming PyTorch's [DDP](https://pytorch.org/docs/stable/notes/ddp.html) or
+  JAX's `jax.jit` with sharding is used).
###### Model initialization
@@ -240,7 +394,9 @@ def _build_input_queue(
def init_model_fn(self, rng: RandomState) -> ModelInitState
```
-- This function initializes the parameters of the model. While it can be called by the submission (e.g. to restart the model after a failed training effort) it cannot be changed.
+- This function initializes the parameters of the model. While it can be
+ called by the submission (e.g. to restart the model after a failed training
+ effort) it cannot be changed.
###### Forward pass
@@ -257,11 +413,27 @@ def model_fn(
) -> (logits_output_batch, new_model_state): Tuple[Tensor, ModelAuxiliaryState]
```
-- `params` is whatever the structure is that contains the (`float32`) model parameters. The naming is overloaded due to having to handle the more object-oriented `PyTorch` style and the functional `JAX` style of development. In the `Flax` library (written in `JAX`), this is typically a nested dictionary of `JAX`/`numpy` arrays, but in `PyTorch` this is the `torch.nn.Model`.
-- It is possible that `model_parameters` will be endowed with additional information about the kind of each parameter, e.g. "weights" or "bias" or "batch norm", although `model_fn` does not really need that information we might use the same nested structure elsewhere.
-- `logits_output_batch` is before the output activation.
-- `new_model_state` is for batch norm or similar side effects and will only be updated if `update_batch_norm` is set.
-- `dropout_rate` is used in the model forward pass for models that support it. These can be tuned or will default to documented model-specific values (see the [workloads table](#workloads) for the list of defaults). Note that adding additional dropout would be considered changing the model, which is not allowed, but the tuning of dropout in existing dropout layers can be considered a regularizer, so we allow it. There should be at most two dropout rates in a model (if there are more than two we will reuse the same values). The `dropout_rate` can be changed during the training process.
+- `params` is whatever structure contains the (`float32`) model
+ parameters. The naming is overloaded due to having to handle the more
+ object-oriented `PyTorch` style and the functional `JAX` style of
+ development. In the `Flax` library (written in `JAX`), this is typically a
+ nested dictionary of `JAX`/`numpy` arrays, but in `PyTorch` this is the
+  `torch.nn.Module`.
+- It is possible that `model_parameters` will be endowed with additional
+ information about the kind of each parameter, e.g. "weights" or "bias" or
+ "batch norm", although `model_fn` does not really need that information we
+ might use the same nested structure elsewhere.
+- `logits_output_batch` is before the output activation.
+- `new_model_state` is for batch norm or similar side effects and will only be
+ updated if `update_batch_norm` is set.
+- `dropout_rate` is used in the model forward pass for models that support it.
+  It can be tuned or will default to documented model-specific values (see
+ the [workloads table](#workloads) for the list of defaults). Note that
+ adding additional dropout would be considered changing the model, which is
+ not allowed, but the tuning of dropout in existing dropout layers can be
+ considered a regularizer, so we allow it. There should be at most two
+ dropout rates in a model (if there are more than two we will reuse the same
+ values). The `dropout_rate` can be changed during the training process.
###### Loss function
@@ -276,47 +448,87 @@ def loss_fn(
) -> Dict[str, Tensor] # differentiable
```
-- The loss function does **not** include regularization. Instead, regularization can be added by the submissions in the `update_params` function.
-- The loss function returns a dict `{'summed': scalar summed loss, 'n_valid_examples': scalar number of valid examples in batch, 'per_example': 1-d array of per-example losses}`.
- - Note that the returned quantities are not synced across devices; this can be done by the user in the `update_params` function.
-- Each workload uses one of the following loss functions: {_mean squared error_, _cross-entropy_, _CTC_, or _L1 reconstruction error_}.
- - Your submission must work with all these loss types. We provide the loss type via a workload property in order to let training algorithms depend on the loss function.
+- The loss function does **not** include regularization. Instead,
+ regularization can be added by the submissions in the `update_params`
+ function.
+- The loss function returns a dict `{'summed': scalar summed loss,
+ 'n_valid_examples': scalar number of valid examples in batch, 'per_example':
+ 1-d array of per-example losses}`.
+ - Note that the returned quantities are not synced across devices; this
+ can be done by the user in the `update_params` function.
+- Each workload uses one of the following loss functions: {*mean squared
+  error*, *cross-entropy*, *CTC*, or *L1 reconstruction error*}.
+ - Your submission must work with all these loss types. We provide the loss
+ type via a workload property in order to let training algorithms depend
+ on the loss function.
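+
+Because the returned quantities are per-device, a submission that wants a
+globally averaged scalar loss has to sync them itself. A non-normative
+PyTorch/DDP sketch (assuming the harness has initialized `torch.distributed`):
+
+```python
+import torch
+import torch.distributed as dist
+
+
+def global_mean_loss(loss_dict):
+  """Averages the summed loss over all valid examples across devices."""
+  summed = loss_dict['summed'].detach().clone()
+  n_valid = torch.as_tensor(loss_dict['n_valid_examples'],
+                            dtype=summed.dtype,
+                            device=summed.device).clone()
+  if dist.is_available() and dist.is_initialized():
+    # loss_fn outputs are not synced across devices, so sum them here.
+    dist.all_reduce(summed)
+    dist.all_reduce(n_valid)
+  return summed / n_valid
+```
+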
#### Valid Submissions
-The intention of this benchmark is to identify training algorithm submissions that will be broadly applicable and effective in practical scenarios without customization to the specific [workload](#workloads) (model, dataset, and loss function). Generally useful training algorithms can train models faster and thus require less compute resources, decreasing the cost of machine learning. We want to discourage all submissions that sidestep the purpose of this benchmark. We welcome creative ideas and novel research. Therefore, the API aims to allow a wide variety of submissions. However, in some cases, routines that would be allowed in principle might not be practically feasible to express in the provided framework.
-
-A valid submission must implement **general-purpose training logic** that is expected to work on unseen workloads **without** workload-specific modifications or precomputed lookups.
-In order to help clarify which submissions are [allowed](#allowed-submissions) and [disallowed](#disallowed-submissions), we described a few examples below. Two essential questions can help provide a general guideline for whether a submission is allowed or not:
-
-1. What **information** is being used by the submission?
-2. What **action** is the submission code taking based on this information?
-
-In general, both parts are needed to decide if a particular piece of code is within the spirit of the rules. For example, it is fine to use the shape information of the model parameters to switch between a low-memory and a high-memory approximation, but it isn't allowed to use this shape as a "fingerprint" to uniquely identify a workload and then use pre-computed hyperparameters for this specific workload. As a rule of thumb, submissions are allowed if it is reasonable to assume that the method will work comparably well on unseen workloads automatically without requiring human engineering labor.
+The intention of this benchmark is to identify training algorithm submissions
+that will be broadly applicable and effective in practical scenarios without
+customization to the specific [workload](#workloads) (model, dataset, and loss
+function). Generally useful training algorithms can train models faster and thus
+require fewer compute resources, decreasing the cost of machine learning. We want
+to discourage all submissions that sidestep the purpose of this benchmark. We
+welcome creative ideas and novel research. Therefore, the API aims to allow a
+wide variety of submissions. However, in some cases, routines that would be
+allowed in principle might not be practically feasible to express in the
+provided framework.
+
+A valid submission must implement **general-purpose training logic** that is
+expected to work on unseen workloads **without** workload-specific modifications
+or precomputed lookups. In order to help clarify which submissions are
+[allowed](#allowed-submissions) and [disallowed](#disallowed-submissions), we
+describe a few examples below. Two essential questions can help provide a
+general guideline for whether a submission is allowed or not:
+
+1. What **information** is being used by the submission?
+2. What **action** is the submission code taking based on this information?
+
+In general, both parts are needed to decide if a particular piece of code is
+within the spirit of the rules. For example, it is fine to use the shape
+information of the model parameters to switch between a low-memory and a
+high-memory approximation, but it isn't allowed to use this shape as a
+"fingerprint" to uniquely identify a workload and then use pre-computed
+hyperparameters for this specific workload. As a rule of thumb, submissions are
+allowed if it is reasonable to assume that the method will work comparably well
+on unseen workloads automatically without requiring human engineering labor.
##### Allowed submissions
-Submissions are allowed to use the provided model parameter information, e.g. the shapes and types of the layers, if the resulting action works on generic workloads.
+Submissions are allowed to use the provided model parameter information, e.g.
+the shapes and types of the layers, if the resulting action works on generic
+workloads.
Examples:
-- Using shape information of the parameters to switch between low-memory and high-memory routines is allowed.
-- Using shape information of the parameters to conditionally construct variables to avoid running out of memory, e.g. by approximating larger matrices, is allowed.
-- Using the ordering of the parameters to train deeper layers differently, e.g. training them sequentially, is allowed.
-- Submissions are allowed to use the layer type to change the update rules, e.g. use a different update rule for all batch normalization layers, or use different sub-routines for each layer type, e.g. compute variances for convolutional layers but not for batch normalization layers.
+- Using shape information of the parameters to switch between low-memory and
+ high-memory routines is allowed.
+- Using shape information of the parameters to conditionally construct
+ variables to avoid running out of memory, e.g. by approximating larger
+ matrices, is allowed.
+- Using the ordering of the parameters to train deeper layers differently,
+ e.g. training them sequentially, is allowed.
+- Submissions are allowed to use the layer type to change the update rules,
+ e.g. use a different update rule for all batch normalization layers, or use
+  different sub-routines for each layer type, e.g. compute variances for
+  convolutional layers but not for batch normalization layers (see the
+  sketch below).
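+
+As a minimal sketch of the last example, the update rule can be chosen from
+the *type* of each parameter, which generalizes across workloads, rather than
+from the identity of the workload itself. The type names below are
+illustrative assumptions:
+
+```python
+def update_rule_for(param_type: str) -> str:
+  """Maps a parameter type to an update rule in a workload-agnostic way."""
+  if 'batch_norm' in param_type or param_type == 'biases':
+    return 'sgd_momentum'  # e.g. plain momentum for scales, offsets, biases.
+  return 'adamw'  # e.g. adaptive updates for everything else.
+```
+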
-Automatic methods for determining or dynamically setting hyperparameters are allowed if they function on generic workloads.
+Automatic methods for determining or dynamically setting hyperparameters are
+allowed if they function on generic workloads.
Examples:
-- Submissions are allowed to use automatic procedures for setting hyperparameters, e.g. automated learning rate range tests.
-- Inner-loop tuning methods for setting hyperparameters, e.g. line searches, are allowed.
-- Changing the batch size dynamically during training.
+- Submissions are allowed to use automatic procedures for setting
+ hyperparameters, e.g. automated learning rate range tests.
+- Inner-loop tuning methods for setting hyperparameters, e.g. line searches,
+  are allowed (see the sketch below).
+- Changing the batch size dynamically during training.
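+
+As a sketch of the line-search example above, an allowed inner-loop technique
+could backtrack on the learning rate using only generic loss evaluations. The
+`loss_at` callable is a hypothetical helper that the submission itself would
+implement:
+
+```python
+def backtracking_lr(initial_lr, loss_at, max_halvings=5):
+  """Halves the learning rate until a candidate step decreases the loss."""
+  baseline = loss_at(0.0)  # Loss without taking a step.
+  lr = initial_lr
+  for _ in range(max_halvings):
+    if loss_at(lr) < baseline:
+      break
+    lr *= 0.5
+  return lr
+```
+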
@@ -326,59 +538,118 @@ Submissions can also be based on learned training algorithms.
Examples:
-- Submissions are allowed to learn the update rule of the training method.
-- In the [self-tuning ruleset](#self-tuning-ruleset), submissions could try out a learned list of hyperparameters.
+- Submissions are allowed to learn the update rule of the training method.
+- In the [self-tuning ruleset](#self-tuning-ruleset), submissions could try
+ out a learned list of hyperparameters.
-Submissions can use additional software dependencies provided they have the intention of supporting new algorithmic and mathematical ideas. The procedure for adding dependencies is described in more detail in the [Software dependencies](#software-dependencies) section.
+Submissions can use additional software dependencies provided they have the
+intention of supporting new algorithmic and mathematical ideas. The procedure
+for adding dependencies is described in more detail in the
+[Software dependencies](#software-dependencies) section.
Examples:
-- [`BackPACK`](https://docs.backpack.pt/en/master/index.html) is a `pip` package that hooks into `PyTorch` to extract additional information from the backward pass. An allowed use of `BackPACK` would be to compute batch statistics (e.g. within-batch gradient variances, etc.) to calibrate or auto-tune training algorithms.
+- [`BackPACK`](https://docs.backpack.pt/en/master/index.html) is a `pip`
+ package that hooks into `PyTorch` to extract additional information from the
+ backward pass. An allowed use of `BackPACK` would be to compute batch
+ statistics (e.g. within-batch gradient variances, etc.) to calibrate or
+ auto-tune training algorithms.
##### Disallowed submissions
-Submissions must rely on new algorithmic or mathematical ideas and concepts, and must not use software engineering approaches in order to increase primitive operations in PyTorch, JAX, their dependencies, the operating systems, or the hardware. Submissions may use public APIs in JAX and PyTorch from within the submission function APIs, but may not use APIs to optimize the internals of primitive operations and/or standard dependencies to benefit any submission.
+Submissions must rely on new algorithmic or mathematical ideas and concepts, and
+must not use software engineering approaches in order to speed up primitive
+operations in PyTorch, JAX, their dependencies, the operating system, or the
+hardware. Submissions may use public APIs in JAX and PyTorch from within the
+submission function APIs, but may not use APIs to optimize the internals of
+primitive operations and/or standard dependencies to benefit any submission.
-Submissions are not allowed to circumvent the tuning rules by looking up the result of an offline computation that was performed ahead of time.
+Submissions are not allowed to circumvent the tuning rules by looking up the
+result of an offline computation that was performed ahead of time.
Examples:
-- Submissions are not allowed to look up (pre-trained) model parameters.
-- Computing the optimal hyperparameters for every workload offline and having the submission look up those pre-computed values is not allowed. In contrast, finding and hard-coding a single good setting of the hyperparameters that works well across all the workloads simultaneously would be allowed.
-- Submissions are not allowed to adjust the hyperparameter search spaces for the external tuning ruleset, such that it differs between the workloads.
+- Submissions are not allowed to look up (pre-trained) model parameters.
+- Computing the optimal hyperparameters for every workload offline and having
+ the submission look up those pre-computed values is not allowed. In
+ contrast, finding and hard-coding a single good setting of the
+ hyperparameters that works well across all the workloads simultaneously
+ would be allowed.
+- Submissions are not allowed to adjust the hyperparameter search spaces for
+ the external tuning ruleset, such that it differs between the workloads.
-Submissions may not identify (directly or indirectly) the specific benchmark workload to select special-cased logic or hyperparameters; learned detectors that end up selecting workload-specific behavior are equally disallowed. This would result in highly specific behavior that isn't generally useful. In general, all else being equal, if some submission was written that was extremely effective on a small set of the workloads (and far worse on the rest) and another submission with the opposite performance pattern, we would prefer both submissions to be submitted and tested on **all** workloads.
+Submissions may not identify (directly or indirectly) the specific benchmark
+workload to select special-cased logic or hyperparameters; learned detectors
+that end up selecting workload-specific behavior are equally disallowed. This
+would result in highly specific behavior that isn't generally useful. In
+general, all else being equal, if one submission were extremely effective on a
+small set of the workloads (and far worse on the rest) and another submission
+had the opposite performance pattern, we would prefer both to be submitted and
+tested on **all** workloads.
Examples:
-- A hard-coded switching of the update rule based on the workload is not allowed, e.g. using Adam for RNNs and SGD with momentum on CNNs. Although submissions can specialize for certain layer types in generic ways, they should not uniquely identify a model or dataset. In other words, if there are two workloads A and B that both have convolutional layers and fully connected layers the submission shouldn't detect whether it is dealing with A or B specifically and choose Adam for one and SGD with momentum for the other. However, if the updates for all parameters of convolutional layers always used SGD with momentum and the updates for all other layers always used Adam and a workload with both types of layers had mixed updates, that would be fine.
- It is also allowed to make the update rule part of the (external) hyperparameter tuning or determine the optimal update rule during the run, i.e. while "on-the-clock".
-- Submissions are not allowed to look up learning rate schedules that are only utilized for specific subsets of the workloads. It is allowed to use one general learning rate schedule, to dynamically adapt the learning rate based on general information such as curvature, or to select the learning rate schedule as part of the (external) hyperparameter tuning.
+- A hard-coded switching of the update rule based on the workload is not
+ allowed, e.g. using Adam for RNNs and SGD with momentum on CNNs. Although
+ submissions can specialize for certain layer types in generic ways, they
+ should not uniquely identify a model or dataset. In other words, if there
+ are two workloads A and B that both have convolutional layers and fully
+  connected layers, the submission shouldn't detect whether it is dealing with
+ A or B specifically and choose Adam for one and SGD with momentum for the
+ other. However, if the updates for all parameters of convolutional layers
+ always used SGD with momentum and the updates for all other layers always
+ used Adam and a workload with both types of layers had mixed updates, that
+ would be fine. It is also allowed to make the update rule part of the
+ (external) hyperparameter tuning or determine the optimal update rule during
+ the run, i.e. while "on-the-clock".
+- Submissions are not allowed to look up learning rate schedules that are only
+ utilized for specific subsets of the workloads. It is allowed to use one
+ general learning rate schedule, to dynamically adapt the learning rate based
+ on general information such as curvature, or to select the learning rate
+ schedule as part of the (external) hyperparameter tuning.
-Valid submissions must rely on new algorithmic or mathematical ideas and should not use software engineering approaches to speed up primitive operations in `PyTorch`, `JAX`, their dependencies, the operating system, or the hardware. We recognize that the way a method is implemented will impact its performance in the benchmark. It is generally acceptable to make clever, judicious, and efficient use of public APIs in `JAX` and/or `PyTorch` from within the submission function APIs. It is not acceptable to use these APIs to optimize the internals of primitive operations and standard dependencies in ways that could generally benefit any submission.
+Valid submissions must rely on new algorithmic or mathematical ideas and should
+not use software engineering approaches to speed up primitive operations in
+`PyTorch`, `JAX`, their dependencies, the operating system, or the hardware. We
+recognize that the way a method is implemented will impact its performance in
+the benchmark. It is generally acceptable to make clever, judicious, and
+efficient use of public APIs in `JAX` and/or `PyTorch` from within the
+submission function APIs. It is not acceptable to use these APIs to optimize the
+internals of primitive operations and standard dependencies in ways that could
+generally benefit any submission.
Examples:
-- Submissions **are allowed** to use `CUDA` streams to schedule operations, e.g., transferring data between CPU and GPU, or among GPUs, while performing other computations.
-- Submissions **are not allowed** to use `CUDA` streams or asynchronous operations (e.g., spawning additional threads) to perform additional computations that run during the [untimed evaluations](#evaluation-during-training).
-- Submissions **are not allowed** to use faster GPU kernels than other submitters by writing their own, using `TVM`, or using a different version of `cuDNN`/`cuBLAS`.
-- Submissions **are not allowed** to skip or reduce system or framework overhead, such as modifying `JAX` to skip internal steps like pytree flattening/unflattening.
-- Submissions **are not allowed** to introduce new compiler optimizations, such as modifying `XLA` to perform more or less kernel fusion.
+- Submissions **are allowed** to use `CUDA` streams to schedule operations,
+ e.g., transferring data between CPU and GPU, or among GPUs, while performing
+ other computations.
+- Submissions **are not allowed** to use `CUDA` streams or asynchronous
+ operations (e.g., spawning additional threads) to perform additional
+ computations that run during the
+ [untimed evaluations](#evaluation-during-training).
+- Submissions **are not allowed** to use faster GPU kernels than other
+ submitters by writing their own, using `TVM`, or using a different version
+ of `cuDNN`/`cuBLAS`.
+- Submissions **are not allowed** to skip or reduce system or framework
+ overhead, such as modifying `JAX` to skip internal steps like pytree
+ flattening/unflattening.
+- Submissions **are not allowed** to introduce new compiler optimizations,
+ such as modifying `XLA` to perform more or less kernel fusion.
@@ -386,234 +657,459 @@ Valid submissions must rely on new algorithmic or mathematical ideas and should
##### Evaluation during training
-In general, with noisy, non-deterministic training, evaluation frequency can affect training time measurements as more "bites of the apple" potentially allows the training code to exploit instability. We also want to discourage submissions from complicated and unrealistic logic that attempts to guess when training is close to complete and increases the evaluation rate, while not producing a well-sampled training curve at the start of training. Simply allowing submissions complete freedom over evaluation frequency encourages competitors to work to minimize the number of evaluations, which distracts from the primary goal of finding better training algorithms.
-
-Submissions are eligible for an untimed evaluation every `eval_period` seconds. Before proceeding to evaluation, the submission can prepare the model through a call to `prepare_for_eval`, effectively modifying the model parameters/state as well as the optimizer state. Any (optional) additional evaluations performed by the submission code count against the runtime for scoring.
-The harness that runs the submission code will attempt to evaluate every `eval_period` seconds by checking between each submission step (call of `update_params`) whether it has been at least `eval_period` seconds since that last evaluation. If so, the submission is given the possibility to prepare for evaluation (through a timed call to `prepare_for_eval`). If the accumulated runtime does not exceed the maximum allowed runtime after the preparation step, the clock is paused, and the submission is evaluated. This means that if calls to `update_params` typically take a lot more than `eval_period` seconds, such submissions will not receive as many untimed evaluations as a submission that had an `update_params` function that took less time. However, for appropriate settings of `eval_period`, we expect this to be quite rare. Submissions are always free to restructure their `update_params` code to split work into two subsequent steps to regain the potential benefits of these untimed model evaluations. For each workload, the `eval_period` will be set such that the total evaluation time is roughly between 10% and 20% of the total training time for the target-setting runs.
+In general, with noisy, non-deterministic training, evaluation frequency can
+affect training time measurements as more "bites of the apple" potentially
+allows the training code to exploit instability. We also want to discourage
+submissions from complicated and unrealistic logic that attempts to guess when
+training is close to complete and increases the evaluation rate, while not
+producing a well-sampled training curve at the start of training. Simply
+allowing submissions complete freedom over evaluation frequency encourages
+competitors to work to minimize the number of evaluations, which distracts from
+the primary goal of finding better training algorithms.
+
+Submissions are eligible for an untimed evaluation every `eval_period` seconds.
+Before proceeding to evaluation, the submission can prepare the model through a
+call to `prepare_for_eval`, effectively modifying the model parameters/state as
+well as the optimizer state. Any (optional) additional evaluations performed by
+the submission code count against the runtime for scoring.
+
+The harness that runs the submission code will attempt to evaluate every
+`eval_period` seconds by checking between submission steps (calls of
+`update_params`) whether at least `eval_period` seconds have passed since the
+last evaluation. If so, the submission is given the opportunity to prepare for
+evaluation (through a timed call to `prepare_for_eval`). If the accumulated
+runtime does not exceed the maximum allowed runtime after this preparation step,
+the clock is paused and the submission is evaluated. This means that if calls to
+`update_params` typically take much longer than `eval_period` seconds, such
+submissions will receive fewer untimed evaluations than a submission whose
+`update_params` calls take less time. However, for appropriate settings of
+`eval_period`, we expect this to be quite rare. Submissions are always free to
+restructure their `update_params` code to split work into two subsequent steps
+to regain the potential benefits of these untimed model evaluations. For each
+workload, the `eval_period` will be set such that the total evaluation time is
+roughly between 10% and 20% of the total training time for the target-setting
+runs.
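+
+To make the timing rules above concrete, here is a minimal sketch of how such a
+harness loop could be structured. This is an illustration only, not the actual
+benchmark harness; the constants and placeholder functions (`EVAL_PERIOD`,
+`MAX_RUNTIME`, `update_params`, `prepare_for_eval`, `evaluate_model`) are
+hypothetical.
+
+```python
+import time
+
+EVAL_PERIOD = 60.0    # seconds between untimed evaluations (hypothetical value)
+MAX_RUNTIME = 3600.0  # maximum allowed timed runtime in seconds (hypothetical)
+
+
+def update_params():
+    """Placeholder for one timed submission step."""
+    time.sleep(0.1)
+
+
+def prepare_for_eval():
+    """Placeholder for the timed pre-evaluation hook."""
+
+
+def evaluate_model():
+    """Placeholder for the untimed model evaluation."""
+
+
+def harness_loop(num_steps=1_000):
+    accumulated_runtime = 0.0
+    last_eval_time = time.time()
+    for _ in range(num_steps):  # stand-in for "until target or budget reached"
+        start = time.time()
+        update_params()  # timed submission step
+        accumulated_runtime += time.time() - start
+
+        if time.time() - last_eval_time >= EVAL_PERIOD:
+            start = time.time()
+            prepare_for_eval()  # timed preparation step
+            accumulated_runtime += time.time() - start
+            if accumulated_runtime <= MAX_RUNTIME:
+                evaluate_model()  # untimed: the scoring clock is paused here
+                last_eval_time = time.time()
+
+        if accumulated_runtime > MAX_RUNTIME:
+            break
+```
+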
##### Software Dependencies
-If your submission will have any software dependencies, you must create a `requirements.txt` file in the `/submission` directory. This file must clearly list all software dependencies your submission requires in order to be a valid submission. The file must be "pip-readable" (the dependencies listed can be installed via the `pip install -r requirements.txt` command). You may not modify the package versions of the software dependencies used by the benchmarking codebase, including using a different version of libraries such as PyTorch or JAX from those specified in the benchmark.
-
-We require submissions to use specific versions of `PyTorch`/`JAX` as well as additional dependencies in order to facilitate fair comparisons. Submitters must build on top of these pinned software packages. Additional dependencies can be added as long as they include a comment describing what was added and why. Submitters are free to add dependencies that support new algorithmic and mathematical ideas but they should not circumvent the intention of the benchmark to measure training speedups due to new training methods. For example, software engineering techniques that lead to faster implementations of existing software, e.g. using newer versions of `PyTorch` or `JAX`, are not allowed and these are described in more detail in the [Disallowed submissions](#disallowed-submissions) section.
+If your submission will have any software dependencies, you must create a
+`requirements.txt` file in the `/submission` directory. This file must clearly
+list all software dependencies your submission requires in order to be a valid
+submission. The file must be "pip-readable" (the dependencies listed can be
+installed via the `pip install -r requirements.txt` command). You may not modify
+the package versions of the software dependencies used by the benchmarking
+codebase, including using a different version of libraries such as PyTorch or
+JAX from those specified in the benchmark.
+
+We require submissions to use specific versions of `PyTorch`/`JAX` as well as
+additional dependencies in order to facilitate fair comparisons. Submitters must
+build on top of these pinned software packages. Additional dependencies can be
+added as long as they include a comment describing what was added and why.
+Submitters are free to add dependencies that support new algorithmic and
+mathematical ideas, but they should not circumvent the intention of the
+benchmark, which is to measure training speedups due to new training methods.
+For example, software engineering techniques that merely lead to faster
+implementations of existing software, e.g. using newer versions of `PyTorch` or
+`JAX`, are not allowed; these restrictions are described in more detail in the
+[Disallowed submissions](#disallowed-submissions) section.
##### Environment Variables
-The benchmark codebase sets environment variables, and submitters are not permitted to modify (or add) environment variables for the software dependencies. However, if you believe a setting is sub-optimal, open an issue with justification; the working group may adjust it. This ensures that all submissions are equally affected by the environment variables and maintains the competition's primary focus on algorithmic improvements.
+The benchmark codebase sets environment variables, and submitters are not
+permitted to modify (or add) environment variables for the software
+dependencies. However, if you believe a setting is sub-optimal, open an issue
+with justification; the working group may adjust it. This ensures that all
+submissions are equally affected by the environment variables and maintains the
+competition's primary focus on algorithmic improvements.
### Tuning Rulesets
-Tuning will be substantially different for the [**external**](#external-tuning-ruleset) and the [**self-tuning ruleset**](#self-tuning-ruleset) and the individual specifications for each will be described in the following.
+Tuning differs substantially between the
+[**external**](#external-tuning-ruleset) and the
+[**self-tuning ruleset**](#self-tuning-ruleset); the individual specifications
+for each are described in the following.
#### External Tuning Ruleset
-For every workload, **$5$ tuning _trials_** are run, and this tuning process is **repeated in $3$ independent _studies_** to capture variance, resulting in $15$ runs overall.
-Submitters have to provide a _workload-agnostic search space_, via a `tuning_search_space.json` file.
-During scoring, we draw $15$ hyperparameter configurations from this search space using [(quasi)random search](https://arxiv.org/abs/1706.03200) and randomly assign them to the $3$ studies with each $5$ trials.
-Instead of independent samples from a search space, submitters can also provide a fixed list of $5$ hyperparameter points, which will be sampled without replacement for each study.
-
-Within each study, we select the fastest trial that reaches the validation target. The median of the three per-study best times is the submission's official _per-workload score_. These $8$ _per-workload runtimes_ are used in the scoring procedure (see the [**Scoring submissions**](#scoring) section). Trials that do not reach the target within `max_runtime` receive $\infty$, (which participates in the median).
-Submissions may also perform on-the-clock self-tuning during timed training.
+For every workload, **$5$ tuning *trials*** are run, and this tuning process is
+**repeated in $3$ independent *studies*** to capture variance, resulting in $15$
+runs overall. Submitters have to provide a *workload-agnostic search space*, via
+a `tuning_search_space.json` file. During scoring, we draw $15$ hyperparameter
+configurations from this search space using
+[(quasi)random search](https://arxiv.org/abs/1706.03200) and randomly assign
+them to the $3$ studies with $5$ trials each. Instead of independent samples
+from a search space, submitters can also provide a fixed list of $5$
+hyperparameter points, which will be sampled without replacement for each study.
+
+Within each study, we select the fastest trial that reaches the validation
+target. The median of the three per-study best times is the submission's
+official *per-workload score*. These $9$ *per-workload runtimes* are used in the
+scoring procedure (see the [**Scoring submissions**](#scoring) section). Trials
+that do not reach the target within `max_runtime` receive a runtime of $\infty$
+(which participates in the median). Submissions may also perform on-the-clock
+self-tuning during timed training.
> [!IMPORTANT]
>
-> - **Trial**: One training run, with a fixed hyperparameter configuration until the target or `max_runtime` was reached. The first time the validation target is reached in a trial is denoted $t_{i,j}$ (a miss scores $\infty$).
-> - **Study**: A set of $5$ trials, each run with distinct hyperparameter points. The studies are independent and capture variance. The study's score is the **fastest** (minimum) time among its trials.
-> - **Per-Workload Runtime**: The per-workload runtime of a submission is given by the median across the per-study scores, i.e., $t_{s,w} = median_{j=1..3} \left( \min_{i=1..5} (t_{i,j}) \right)$, with $t_{i,j}$ the score of trial $i$ in study $j$, i.e.
-
+> - **Trial**: One training run, with a fixed hyperparameter configuration
+>   until the target or `max_runtime` is reached. The first time the
+> validation target is reached in a trial is denoted $t_{i,j}$ (a miss
+> scores $\infty$).
+> - **Study**: A set of $5$ trials, each run with distinct hyperparameter
+> points. The studies are independent and capture variance. The study's
+> score is the **fastest** (minimum) time among its trials.
+> - **Per-Workload Runtime**: The per-workload runtime of a submission is
+>   given by the median across the per-study scores, i.e.,
+>   $t_{s,w} = \text{median}_{j=1..3} \left( \min_{i=1..5} (t_{i,j}) \right)$,
+>   with $t_{i,j}$ the score of trial $i$ in study $j$.
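+
+As a worked illustration of the definitions above, the following sketch computes
+a per-workload runtime from a hypothetical $3 \times 5$ grid of trial times
+(`math.inf` marks trials that missed the target; all numbers are made up):
+
+```python
+import math
+from statistics import median
+
+# Hypothetical trial runtimes (seconds) for one workload: 3 studies x 5 trials.
+trial_times = [
+    [4200.0, math.inf, 3900.0, 5100.0, 4700.0],    # study 1
+    [math.inf, 4450.0, 4100.0, math.inf, 6000.0],  # study 2
+    [3800.0, 4300.0, math.inf, 4900.0, 5200.0],    # study 3
+]
+
+# Per-study score: fastest trial that reached the validation target.
+per_study_best = [min(study) for study in trial_times]  # [3900.0, 4100.0, 3800.0]
+
+# Per-workload runtime: median across the three studies.
+per_workload_runtime = median(per_study_best)  # 3900.0
+```
+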
#### Self-Tuning Ruleset
-Submissions under this ruleset are not allowed to expose user-defined hyperparameters.
-Instead, submissions can either apply one "default" hyperparameter configuration for all workloads (e.g., Adam with default settings), or perform inner-loop tuning during their training run (e.g., SGD with line searches).
-All workload adaptations, e.g. inner-loop tuning, will be part of the submission's score.
-
-For each workload, a submission will run for **$3$ independent studies**, and the _per-workload score_ is the median time to reach the validation target, i.e., $t_{s,w} = median_{j=1..3} \left(t_{j}\right)$.
-To account for the lack of external tuning, submissions have a longer time budget to reach the target performance.
-Compared to the [**external tuning ruleset**](#external-tuning-ruleset), the `max_runtime` is $1.5\times$ longer (i.e. multiply the `max_runtimes` from the [**workload overview table**](#workloads) by $1.5$).
-As in the [**external tuning ruleset**](#external-tuning-ruleset), any run that fails to achieve the target within this window is assigned an infinite runtime.
+Submissions under this ruleset are not allowed to expose user-defined
+hyperparameters. Instead, submissions can either apply one "default"
+hyperparameter configuration for all workloads (e.g., Adam with default
+settings), or perform inner-loop tuning during their training run (e.g., SGD
+with line searches). All workload adaptations, e.g. inner-loop tuning, will be
+part of the submission's score.
+
+For each workload, a submission is run for **$3$ independent studies**, and the
+*per-workload score* is the median time to reach the validation target, i.e.,
+$t_{s,w} = \text{median}_{j=1..3} \left(t_{j}\right)$. To account for the lack
+of external tuning, submissions have a longer time budget to reach the target
+performance. Compared to the
+[**external tuning ruleset**](#external-tuning-ruleset), the `max_runtime` is
+$1.5\times$ as long (i.e., multiply the `max_runtime` values from the
+[**workload overview table**](#workloads) by $1.5$). As in the
+[**external tuning ruleset**](#external-tuning-ruleset), any run that fails to
+achieve the target within this window is assigned an infinite runtime.
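+
+For comparison, a minimal sketch of the self-tuning per-workload score, again
+with hypothetical study times (note the $1.5\times$ runtime budget):
+
+```python
+from statistics import median
+
+# Hypothetical external-tuning max_runtime for a workload (seconds).
+external_max_runtime = 8915
+self_tuning_max_runtime = 1.5 * external_max_runtime  # 13372.5 seconds
+
+# Hypothetical times-to-target (seconds) of the 3 independent studies.
+study_times = [7200.0, 8100.0, 7600.0]
+per_workload_runtime = median(study_times)  # 7600.0
+```
+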
### Workloads
-For the purposes of the _AlgoPerf: Training Algorithms_ benchmark, we consider a workload the combination of a `dataset`, `model`, `loss_fn`, along with a `target` that is defined over some evaluation `metric`. E.g., `ResNet-50` on `ImageNet` using the `cross-entropy` loss until a target `error` of `22.6%` on the validation set has been reached, would constitute a workload.
-
-The _AlgoPerf: Training Algorithms_ benchmark contains a diverse set of $8$ workloads spanning tasks such as image classification, machine translation, speech recognition, or other typical machine learning tasks. For a single task and dataset there might be multiple models and therefore multiple workloads. As a rough guideline, the entire set of workloads was designed to have a combined runtime of very roughly $100$ hours on the [**benchmarking hardware**](#benchmarking-hardware).
-
-The eight _AlgoPerf Workloads_ are:
-
-| | **Task** | **Dataset** | **Model** | **Loss** | **Metric** | Validation
**Target** | Test
**Target** | Max
**Runtime**
_(in seconds)_ | Default
**Dropout**
Value |
-| ----- | ----------------------------- | ----------- | ----------- | -------- | ---------- | ------------------------ | ------------------ | ------------------------------------- | ------------------------------------------- |
-| **1** | Clickthrough rate prediction | Criteo 1TB | DLRMsmall | CE | CE (↓) | 0.123735 | 0.126041 | 7,703 | 0 |
-| **2** | MRI reconstruction | fastMRI | U-Net | L1 | SSIM (↑) | 0.723653 | 0.740633 | 4,430 | 0 |
-| **3** | Image classification | ImageNet | ResNet-50 | CE | ER (↓) | 0.22569 | 0.3440 | 66,159 | None |
-| **4** | | | ViT | CE | ER (↓) | 0.22691 | 0.3481 | 69,768 | 0 |
-| **5** | Speech recognition | LibriSpeech | Conformer | CTC | WER (↓) | 0.085884 | 0.052981 | 58,015 | 0.1 (`input`, `attn_res`, `ff_res`); else 0 |
-| **6** | | | DeepSpeech | CTC | WER (↓) | 0.119936 | 0.074143 | 44,405 | 0.1 (`input`, `ff`); `JAX CudnnLSTM`: 0 |
-| **7** | Molecular property prediction | OGBG | GNN | CE | mAP (↑) | 0.28098 | 0.268729 | 12,011 | 0.1 |
-| **8** | Translation | WMT | Transformer | CE | BLEU (↑) | 30.8491 | 30.7219 | 43,336 | 0.1 (`main`, `attn`) |
-
-> [!NOTE]
-> Notes on the default dropout column:
+For the purposes of the *AlgoPerf: Training Algorithms* benchmark, we consider a
+workload to be the combination of a `dataset`, `model`, and `loss_fn`, along
+with a `target` that is defined over some evaluation `metric`. E.g., training
+`ResNet-50` on `ImageNet` using the `cross-entropy` loss until a target `error`
+of `22.6%` on the validation set has been reached would constitute a workload.
+
+The *AlgoPerf: Training Algorithms* benchmark contains a diverse set of $9$
+workloads spanning tasks such as image classification, machine translation, and
+speech recognition, among other typical machine learning tasks. For a single
+task and dataset there might be multiple models and therefore multiple
+workloads. As a rough guideline, the entire set of workloads was designed to
+have a combined runtime of very roughly $100$ hours on the
+[**benchmarking hardware**](#benchmarking-hardware).
+
+The nine *AlgoPerf Workloads* are:
+
+ | **Task** | **Dataset** | **Model** | **Loss** | **Metric** | Validation<br>**Target** | Test<br>**Target** | Max<br>**Runtime**<br>*(in seconds)* | Default<br>**Dropout**<br>Value
+----- | ----------------------------- | ----------- | ----------- | -------- | ---------- | ------------------------ | ------------------ | ------------------------------------- | -------------------------------
+**1** | Clickthrough rate prediction | Criteo 1TB | DLRMsmall | CE | CE (↓) | 0.123735 | 0.126041 | 8,915 | 0
+**2** | MRI reconstruction | fastMRI | U-Net | L1 | SSIM (↑) | 0.723653 | 0.740633 | 2,745 | 0
+**3** | Image classification | ImageNet | ResNet-50 | CE | ER (↓) | 0.22569 | 0.3440 | 49,918 | None
+**4** | | | ViT | CE | ER (↓) | 0.22691 | 0.3481 | 64,292 | 0
+**5** | Speech recognition | LibriSpeech | Conformer | CTC | WER (↓) | 0.085884 | 0.052981 | 43,680 | 0.1 (`input`, `attn_res`, `ff_res`); else 0
+**6** | | | DeepSpeech | CTC | WER (↓) | 0.119936 | 0.074143 | 36,949 | 0.1 (`input`, `ff`); `JAX CudnnLSTM`: 0
+**7** | Molecular property prediction | OGBG | GNN | CE | mAP (↑) | 0.28098 | 0.268729 | 11,303 | 0.1
+**8** | Translation | WMT | Encoder-Decoder Transformer | CE | BLEU (↑) | 30.8491 | 30.7219 | 16,114 | 0.1 (`main`, `attn`)
+**9** | Language modelling | Fineweb-edu | Decoder-Only Transformer | CE | Perplexity (↓) | 22.2995 | None | 31,967 | None
+
+> [!NOTE]
+> Notes on the default dropout column:
>
-> - `None` indicates that the model does not use dropout.
-> - `0` or `0.1` indicates that the model uses dropout with a default value of 0.0 or 0.1, respectively.
-> - `0.1 (main, attn)` indicates that the model uses dropout with a default value of 0.1 for the main `dropout_rate` and the `attention_dropout_rate`.
-> - `0.1 (input, attn_res, ff_res) else 0` indicates that the model uses dropout with a default value of 0.1 for `input_dropout_rate`, `attention_residual_dropout_rate`, and `feed_forward_residual_dropout_rate` and use a default value of 0 for all other dropout rates.
-> - `0.1 (input, ff) else 0; JAX CudnnLSTM: 0` indicates that the model uses dropout with a default value of 0.1 for `input_dropout_rate` and `feed_forward_dropout_rate`. For JAX models, the `dropout_rate` is set to 0.0 for the `CudnnLSTM` class.
+> - `None` indicates that the model does not use dropout.
+> - `0` or `0.1` indicates that the model uses dropout with a default value of
+> 0.0 or 0.1, respectively.
+> - `0.1 (main, attn)` indicates that the model uses dropout with a default
+> value of 0.1 for the main `dropout_rate` and the `attention_dropout_rate`.
+> - `0.1 (input, attn_res, ff_res) else 0` indicates that the model uses
+> dropout with a default value of 0.1 for `input_dropout_rate`,
+> `attention_residual_dropout_rate`, and
+>   `feed_forward_residual_dropout_rate`, and a default value of 0 for all
+> other dropout rates.
+> - `0.1 (input, ff) else 0; JAX CudnnLSTM: 0` indicates that the model uses
+> dropout with a default value of 0.1 for `input_dropout_rate` and
+> `feed_forward_dropout_rate`. For JAX models, the `dropout_rate` is set to
+> 0.0 for the `CudnnLSTM` class.
>
> Dropout defaults are used if not overridden by the submission.
#### Recommended Qualification Set
-Because the full _AlgoPerf: Training Algorithms_ benchmark is computationally quite expensive, we also provide a recommendation for a cheaper variant, the _qualification set_.
-This _qualification set_ excludes both _ImageNet_ workloads, both _LibriSpeech_ workloads, and the _fastMRI_ workload, leaving **_Criteo 1TB_, _OGBG_, and _WMT_**.
-Together, they run in roughly $24$ hours on the [**benchmarking hardware**](#benchmarking-hardware).
-To further reduce computational costs, the [**external tuning ruleset**](#external-tuning-ruleset) uses **only one study** (instead of the proposed $3$) on the qualification set. The [**self-tuning ruleset**](#self-tuning-ruleset) will keep the $3$ studies because it is less costly.
+Because the full *AlgoPerf: Training Algorithms* benchmark is computationally
+quite expensive, we also provide a recommendation for a cheaper variant, the
+*qualification set*. It excludes both *ImageNet* workloads, both *LibriSpeech*
+workloads, the *LM* workload, and the *fastMRI* workload, leaving
+***Criteo 1TB*, *OGBG*, and *WMT***. Together, they run in less than $24$ hours
+on the [**benchmarking hardware**](#benchmarking-hardware). To further reduce
+computational costs, the [**external tuning ruleset**](#external-tuning-ruleset)
+uses **only one study** (instead of the proposed $3$) on the qualification set.
+The [**self-tuning ruleset**](#self-tuning-ruleset) will keep the $3$ studies
+because it is less costly.
> [!NOTE]
>
-> The "qualification set" was originally designed as a cheaper benchmark that allowed resource-constrained teams to prove themselves and "qualify" for sponsored compute for the full benchmark. Self-reporting is now optional, but the subset still serves as a cheaper performance estimate, so we're keeping it as a recommendation, including the (historical) name.
+> The "qualification set" was originally designed as a cheaper benchmark that
+> allowed resource-constrained teams to prove themselves and "qualify" for
+> sponsored compute for the full benchmark. Self-reporting is now optional, but
+> the subset still serves as a cheaper performance estimate, so we're keeping it
+> as a recommendation, including the (historical) name.
### Scoring
-Submissions are scored based on the training time needed to reach the target performance on each workload's validation set.
-The target metric may match the loss function or use another workload-specific metric, such as error rate or BLEU score.
-See the [**workload overview table**](#workloads) for the targets and metrics of each workload and the [**Defining target performance**](#defining-target-performance-and-max_runtime) section for how they were determined.
-The overall ranking is then determined by the scalar _AlgoPerf Benchmark Score_, which summarizes the _per-workload_ runtimes across all [**workloads**](#workloads), using integrated [**performance profiles**](#algoperf-benchmark-score-via-integrated-performance-profiles), as explained below.
+Submissions are scored based on the training time needed to reach the target
+performance on each workload's validation set. The target metric may match the
+loss function or use another workload-specific metric, such as error rate or
+BLEU score. See the [**workload overview table**](#workloads) for the targets
+and metrics of each workload and the
+[**Defining target performance**](#defining-target-performance-and-max_runtime)
+section for how they were determined. The overall ranking is then determined by
+the scalar *AlgoPerf Benchmark Score*, which summarizes the *per-workload*
+runtimes across all [**workloads**](#workloads), using integrated
+[**performance profiles**](#algoperf-benchmark-score-via-integrated-performance-profiles),
+as explained below.
> [!NOTE]
>
-> The training time of a submission includes the compilation times for computation graphs and ops that could happen just-in-time during training; all our benchmarks should be fast enough to compile so as not to dramatically impact overall performance.
+> The training time of a submission includes the compilation times for
+> computation graphs and ops that could happen just-in-time during training; all
+> our benchmarks should be fast enough to compile so as not to dramatically
+> impact overall performance.
> [!NOTE]
>
-> The training time until the _test set target_ was reached is not used in the scoring procedure but might be used for additional analysis of the competition results.
+> The training time until the *test set target* was reached is not used in the
+> scoring procedure but might be used for additional analysis of the competition
+> results.
#### AlgoPerf Benchmark Score via Integrated Performance Profiles
-We will aggregate the _per-workload training times_ of a submission across all workloads using [Performance Profiles](http://www.argmin.net/2018/03/26/performance-profiles/) (originally from [Dolan and Moré](https://arxiv.org/abs/cs/0102001)). Below we surface several relevant definitions from their work for easier readability, before explaining how we integrate the performance profiles to reach a scalar benchmark score that will be used for ranking submissions.
-
-_Notation:_ We have a set $\mathcal{S} = \{s_1, s_2, \dots, s_k\}$ of in total $k$ submissions that we evaluate on a set of $n$ workloads: $\mathcal{W} = \{w_1, w_2, \dots, w_n\}$. For each submission $s$ and each workload $w$ we have a _per-workload runtime_ $t_{s,w} \in [0,\infty)$. This is the median time it took the submission to reach the validation target performance on this particular workload.
+We will aggregate the *per-workload training times* of a submission across all
+workloads using
+[Performance Profiles](http://www.argmin.net/2018/03/26/performance-profiles/)
+(originally from [Dolan and Moré](https://arxiv.org/abs/cs/0102001)). Below we
+surface several relevant definitions from their work for easier readability,
+before explaining how we integrate the performance profiles to reach a scalar
+benchmark score that will be used for ranking submissions.
+
+*Notation:* We have a set $\mathcal{S} = \{s_1, s_2, \dots, s_k\}$ of $k$
+submissions in total, which we evaluate on a set of $n$ workloads:
+$\mathcal{W} = \{w_1, w_2, \dots, w_n\}$. For each submission $s$ and each
+workload $w$ we have
+a *per-workload runtime* $t_{s,w} \in [0,\infty)$. This is the median time it
+took the submission to reach the validation target performance on this
+particular workload.
##### Computing performance ratios
-For all workloads and submissions, we first compute their performance ratio $r$, which is defined for a particular submission $\bar{s}$ and a particular workload $\bar{w}$ to be:
+For all workloads and submissions, we first compute their performance ratio $r$,
+which is defined for a particular submission $\bar{s}$ and a particular workload
+$\bar{w}$ to be:
$$r_{\bar{s},\bar{w}} = \frac{t_{\bar{s},\bar{w}}}{\min_{s \in \mathcal{S}} t_{s,\bar{w}}} \in [1,\infty)$$
-This performance ratio $r_{s,w}$ expresses the "time spent by submission $s$ on workload $w$" relative to the "time spent by the best submission on this workload". E.g. If a submission takes twice as long on a particular workload compared to the best submission on this workload it will have a performance ratio of $2$. Lower performance ratios are therefore better, with an optimal ratio of $1$ if the given submission is the fastest on this workload.
+This performance ratio $r_{s,w}$ expresses the "time spent by submission $s$ on
+workload $w$" relative to the "time spent by the best submission on this
+workload". E.g. If a submission takes twice as long on a particular workload
+compared to the best submission on this workload it will have a performance
+ratio of $2$. Lower performance ratios are therefore better, with an optimal
+ratio of $1$ if the given submission is the fastest on this workload.
##### Building performance profiles
-Next, we compute how often a submission is within a factor $\tau \in [1,\infty)$ of the optimal submission. For this, we determine the following function for every submission $\bar{s}$:
+Next, we compute how often a submission is within a factor $\tau \in [1,\infty)$
+of the optimal submission. For this, we determine the following function for
+every submission $\bar{s}$:
$$
\rho_{\bar{s}}(\tau) = \frac{1}{n} \cdot \left| \\{ w \text{ such that } r_{\bar{s},w}\leq \tau \\}\right| = \frac{1}{n} \cdot \left[\text{number of workloads where}\, r_{\bar{s},w}\leq \tau\right]
$$
-In other words, we compute the fraction of workloads where a submission $\bar{s}$ is less than $\tau$ away from the optimal submission. The function $\rho_{\bar{s}}(\tau)$ is monotonically increasing with $\tau$ and bounded between $0$ and $1$.
+In other words, we compute the fraction of workloads on which a submission
+$\bar{s}$ is within a factor of $\tau$ of the optimal submission. The function
+$\rho_{\bar{s}}(\tau)$ is monotonically increasing with $\tau$ and bounded
+between $0$ and $1$.
-An example of a performance profiles plot is shown below, where we plot $\rho_{\bar{s}}(\tau)$ for six submissions:
+An example of a performance profiles plot is shown below, where we plot
+$\rho_{\bar{s}}(\tau)$ for six submissions:

##### Integrating performance profiles for the benchmark score
-To get the scalar _AlgoPerf Benchmark Score_ that is usable for ranking submissions, we will integrate the performance profiles $\rho_{\bar{s}}(\tau)$ of all submissions to get their _AlgoPerf Benchmark Score_ $B_{\bar{s}}$, with
+To get the scalar *AlgoPerf Benchmark Score* that is usable for ranking
+submissions, we will integrate the performance profiles $\rho_{\bar{s}}(\tau)$
+of all submissions to get their *AlgoPerf Benchmark Score* $B_{\bar{s}}$, with
$$B_{\bar{s}} = \frac{1}{r_{\text{max}}-1} \int_{1}^{r_{\text{max}}} \rho_{\bar{s}}(\tau) \,d\tau \in [0, 1].$$
-The upper integration limit will be set to $r_{\text{max}} = 4$ which also serves as the upper limit of the performance profile plot.
-This means that any submission that requires more than four times the runtime of the fastest submission will not get any credit on this workload compared to a training algorithm that is unable to successfully train within the maximum allowed runtime budget.
-The integral is normalized by the total integration area, such that all _AlgoPerf Benchmark scores_ are between $0$ and $1$, with higher scores being better. A perfect score of $1$ is achieved if a submission is the fastest on all workloads.
+The upper integration limit will be set to $r_{\text{max}} = 4$, which also
+serves as the upper limit of the performance profile plot. This means that any
+submission that requires more than four times the runtime of the fastest
+submission on a workload will not get any more credit on that workload than a
+training algorithm that is unable to train successfully within the maximum
+allowed runtime budget. The integral is normalized by the total integration
+area, such that all *AlgoPerf Benchmark Scores* are between $0$ and $1$, with
+higher scores being better. A perfect score of $1$ is achieved if a submission
+is the fastest on all workloads.
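+
+To make the full scoring pipeline concrete, the following sketch computes
+performance ratios, performance profiles $\rho_{\bar{s}}(\tau)$, and the
+integrated benchmark score $B_{\bar{s}}$ for toy per-workload runtimes. All
+numbers are hypothetical and the official scoring code may differ in its
+details.
+
+```python
+import numpy as np
+
+# Hypothetical per-workload runtimes (seconds); rows = submissions, columns =
+# workloads. np.inf marks a submission that missed the target on a workload.
+runtimes = np.array([
+    [1000.0, 2000.0, 1500.0],  # submission A
+    [1200.0, 1800.0, np.inf],  # submission B
+])
+
+# Performance ratios: r[s, w] = t[s, w] / min over submissions of t[., w].
+ratios = runtimes / runtimes.min(axis=0, keepdims=True)
+
+
+def profile(submission_ratios, tau):
+    """Fraction of workloads with performance ratio <= tau."""
+    return np.mean(submission_ratios <= tau)
+
+
+def benchmark_score(submission_ratios, r_max=4.0, num_points=10_000):
+    """Numerically integrate the performance profile over [1, r_max]."""
+    taus = np.linspace(1.0, r_max, num_points)
+    rho = np.array([profile(submission_ratios, tau) for tau in taus])
+    return np.trapz(rho, taus) / (r_max - 1.0)
+
+
+for name, submission_ratios in zip(["A", "B"], ratios):
+    print(name, benchmark_score(submission_ratios))  # A ~0.99, B ~0.64
+```
+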
##### Alternative scores
-Performance profiles and the _AlgoPerf Benchmark Score_ derived from them, take a bit of effort to explain.
-However, we believe that they are fairer and well-supported by research in machine learning and the optimization community. To have some simpler to interpret numbers, e.g. for press releases, we will also release a series of alternative scores.
+Performance profiles and the *AlgoPerf Benchmark Score* derived from them take
+a bit of effort to explain. However, we believe that they are fairer and
+well-supported by research in the machine learning and optimization
+communities. To provide some simpler-to-interpret numbers, e.g. for press
+releases, we will also release a series of alternative scores.
-For a given workload $\bar{w}$, we define the "speedup of a submission $\bar{s}$ over submission $\text{ref}$" as $\frac{t_{\text{ref}, \bar{w}}}{t_{\bar{s}, \bar{w}}}$. For example, if a submission was $2\times$ faster than the reference submission, this would be equal to $2$. In addition to the raw $t_{s,w}$ values, we will release the geometric mean of the speedups across all workloads, i.e. $\left(\prod_{w \in \mathcal{W}} \frac{t_{\text{ref}, w}}{t_{\bar{s}, w}}\right)^{\frac{1}{n}}$.
+For a given workload $\bar{w}$, we define the "speedup of a submission $\bar{s}$
+over submission $\text{ref}$" as $\frac{t_{\text{ref}, \bar{w}}}{t_{\bar{s}, \bar{w}}}$.
+For example, if a submission was $2\times$ faster than the reference submission,
+this would be equal to $2$. In addition to the raw $t_{s,w}$ values, we will
+release the geometric mean of the speedups across all workloads, i.e.
+$\left(\prod_{w \in \mathcal{W}} \frac{t_{\text{ref}, w}}{t_{\bar{s}, w}}\right)^{\frac{1}{n}}$.
#### Benchmarking Hardware
-All officially scored runs will be performed on the same benchmarking hardware to allow for a fair comparison of wall-clock training times.
-This benchmarking hardware is chosen to be easily accessible via common cloud computing providers and will likely change with each iteration of the benchmark.
-The specs of the benchmarking hardware for this iteration of the benchmark are:
-
-- 4× NVIDIA A100 (40 GB) GPUs
-- 240 GB in RAM
-- 2 TB in storage (for datasets).
-
-> [!TIP]
-> Submitters are no longer required to self-report results to enter the AlgoPerf benchmark.
-> Instead, they can open a PR and the working group will score the most promising submissions, see our [**How to Submit**](/README.md#how-to-submit) section for more details.
-> If you'd like to self-report results, e.g., for paper experiments or to provide evidence of your submission's performance, it is possible to use a different hardware. However, we strongly recommend to use the same hardware for all algorithms, at least for the scored runs. It is possible to _perform tuning trials on different hardware_, as long as the hardware is consistent for all tuning trials.
-> However, in order to compare to the published results, you will have to repeat at least those fastest trials on the benchmarking hardware.
-> This allows for a fair comparison to the reported results of other submitters while allowing some flexibility in the hardware.
+All officially scored runs will be performed on the same benchmarking hardware
+to allow for a fair comparison of wall-clock training times. This benchmarking
+hardware is chosen to be easily accessible via common cloud computing providers
+and will likely change with each iteration of the benchmark. The specs of the
+benchmarking hardware for this iteration of the benchmark are:
+
+- 4× NVIDIA A100 (40 GB) GPUs
+- 240 GB of RAM
+- 2 TB of storage (for datasets).
+
+> [!TIP]
+>
+> Submitters are no longer required to self-report results to enter the
+> AlgoPerf benchmark. Instead, they can open a PR and the working group will
+> score the most promising submissions; see our
+> [**How to Submit**](/README.md#how-to-submit) section for more details. If
+> you'd like to self-report results, e.g., for paper experiments or to provide
+> evidence of your submission's performance, it is possible to use different
+> hardware. However, we strongly recommend using the same hardware for all
+> algorithms, at least for the scored runs. It is possible to *perform tuning
+> trials on different hardware*, as long as the hardware is consistent for all
+> tuning trials. However, in order to compare to the published results, you
+> will have to repeat at least those fastest trials on the benchmarking
+> hardware. This allows for a fair comparison to the reported results of other
+> submitters while allowing some flexibility in the hardware.
#### Defining Target Performance and `max_runtime`
-This section briefly explains the process to define the target performance for each [**workload**](#workloads), which will be used by both [**tuning rulesets**](#tuning-rulesets) equally. For more details, see [**our benchmark paper**](https://arxiv.org/abs/2306.07179).
-
-For each workload, we take the best performance achievable by one of four standard algorithms (`AdamW`, `NadamW`, `Nesterov Momentum`, and `Heavy Ball Momentum`). These target-setting algorithms will follow the general process of the external tuning ruleset, with a significantly larger tuning budget of $200$ trials to guarantee competitive performance. Once the best algorithm and its hyperparameters are determined, training is repeated $20$ times with this configuration. The median of the best achieved validation errors across seeds is used as the _validation_ target. Out of the $10$ repeated runs that achieved this validation target, we took the worst achieved test error across seeds as our _test_ target. Taking the median validation performance after rerunning the best hyperparameter point prevents our procedure from selecting a lucky outlier.
-
-> [!NOTE]
-> The runtime of the target-setting algorithms was chosen to roughly match published results without extending the overall benchmark budget too much.
-> The initial `max_runtime` (used in version 0.5 of the benchmark) available to submissions on each workload was $\frac{1}{3}$ longer than the runtime of the target-setting algorithms to allow submissions a bit more time to reach the target on some workloads, if they can make up for it on others. After the initial competition, we have adapted the `max_runtimes` based on the performance of the submissions (see [this issue](https://github.com/mlcommons/algorithmic-efficiency/issues/836)).
+This section briefly explains the process to define the target performance for
+each [**workload**](#workloads), which will be used by both
+[**tuning rulesets**](#tuning-rulesets) equally. For more details, see
+[**our benchmark paper**](https://arxiv.org/abs/2306.07179).
+
+For each workload, we take the best performance achievable by one of four
+standard algorithms (`AdamW`, `NadamW`, `Nesterov Momentum`, and `Heavy Ball
+Momentum`). These target-setting algorithms will follow the general process of
+the external tuning ruleset, with a significantly larger tuning budget of $200$
+trials to guarantee competitive performance. Once the best algorithm and its
+hyperparameters are determined, training is repeated $20$ times with this
+configuration. The median of the best achieved validation errors across seeds is
+used as the *validation* target. Out of the $10$ repeated runs that achieved
+this validation target, we took the worst achieved test error across seeds as
+our *test* target. Taking the median validation performance after rerunning the
+best hyperparameter point prevents our procedure from selecting a lucky outlier.
+
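+A toy illustration of this target-setting procedure, with hypothetical
+validation and test errors for $20$ seeds (lower is better for this metric):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Hypothetical best validation/test errors of 20 repeated runs (seeds) of the
+# best target-setting configuration.
+val_errors = 0.226 + 0.003 * rng.standard_normal(20)
+test_errors = val_errors + 0.12  # stand-in for the measured test errors
+
+# Validation target: median of the best validation errors across seeds.
+validation_target = np.median(val_errors)
+
+# Test target: worst test error among the runs that reached the validation
+# target (roughly half of the runs, by definition of the median).
+reached = val_errors <= validation_target
+test_target = test_errors[reached].max()
+```
+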
+> [!NOTE]
+>
+> The runtime of the target-setting algorithms was chosen to roughly match
+> published results without extending the overall benchmark budget too
+> much. The initial `max_runtime` (used in version 0.5 of the benchmark)
+> available to submissions on each workload was $\frac{1}{3}$ longer than the
+> runtime of the target-setting algorithms to allow submissions a bit more time
+> to reach the target on some workloads, if they can make up for it on others.
+> After the initial competition, we have adapted the `max_runtimes` based on the
+> performance of the submissions (see
+> [this issue](https://github.com/mlcommons/algorithmic-efficiency/issues/836)).
## Versioning Policy
-_AlgoPerf_ uses a unified versioning scheme: codebase, rules, and leaderboard all share the same `Major.Minor` version. `Patch` versions may differ for minor, non-breaking updates to each component. All results produced under the same `Major.Minor` version are comparable, making it easy to cite "`AlgoPerf v0.X`" and know exactly which set of rules, code, and submissions are being referenced.
+*AlgoPerf* uses a unified versioning scheme: codebase, rules, and leaderboard
+all share the same `Major.Minor` version. `Patch` versions may differ for minor,
+non-breaking updates to each component. All results produced under the same
+`Major.Minor` version are comparable, making it easy to cite "`AlgoPerf v0.X`"
+and know exactly which set of rules, code, and submissions are being referenced.
-- **Codebase:** The version is automatically set from the latest GitHub tag and accessible via `algoperf.__version__`.
-- **Rules/Documentation:** This document reflects the unified version shown above.
-- **Leaderboard:** The leaderboard in the [**submissions repository**](https://github.com/mlcommons/submissions_algorithms) displays which version of the benchmark was used for scoring.
+- **Codebase:** The version is automatically set from the latest GitHub tag
+ and accessible via `algoperf.__version__`.
+- **Rules/Documentation:** This document reflects the unified version shown
+ above.
+- **Leaderboard:** The leaderboard in the
+ [**submissions repository**](https://github.com/mlcommons/submissions_algorithms)
+ displays which version of the benchmark was used for scoring.
-For detailed information about releases and version history, see our [**README**](../README.md#releases--roadmap) and our [**Changelog**](CHANGELOG.md).
+For detailed information about releases and version history, see our
+[**README**](../README.md#releases--roadmap) and our
+[**Changelog**](CHANGELOG.md).
### Version Freeze
-To ensure that all submitters can develop their submissions based on the same code that will be utilized for scoring, we freeze the package versions of the codebase dependencies in between benchmark versions. By doing so, we level the playing field for everyone involved, ensuring fairness and consistency in the assessment of submissions. We will try to minimize changes to the benchmark codebase as best as possible.
+To ensure that all submitters can develop their submissions based on the same
+code that will be utilized for scoring, we freeze the package versions of the
+codebase dependencies in between benchmark versions. By doing so, we level the
+playing field for everyone involved, ensuring fairness and consistency in the
+assessment of submissions. We will try to minimize changes to the benchmark
+codebase as much as possible.
## License and Legal Requirements
-All submissions must be licensed under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
-Furthermore, all submitters must sign the following agreements:
+All submissions must be licensed under the
+[Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0). Furthermore,
+all submitters must sign the following agreements:
-- A signed [Contributor License Agreement (CLA) "Corporate CLA"](https://mlcommons.org/en/policies/) of MLCommons.
-- _Either_ a membership in MLCommons _or_ a signed [non-member test agreement](https://mlcommons.org/en/policies/).
-- A signed trademark license agreement, either the member or the non-member version, as appropriate. These license agreements are available upon request to [support@mlcommons.org](mailto:support@mlcommons.org).
+- A signed
+ [Contributor License Agreement (CLA) "Corporate CLA"](https://mlcommons.org/en/policies/)
+ of MLCommons.
+- *Either* a membership in MLCommons *or* a signed
+ [non-member test agreement](https://mlcommons.org/en/policies/).
+- A signed trademark license agreement, either the member or the non-member
+ version, as appropriate. These license agreements are available upon request
+ to [support@mlcommons.org](mailto:support@mlcommons.org).
## FAQs
-> If your question isn't answered here, please [**contact us**](mailto:algorithms-chairs@mlcommons.org). These FAQs serve to supplement and clarify the rules and documentation described above.
+> If your question isn't answered here, please
+> [**contact us**](mailto:algorithms-chairs@mlcommons.org). These FAQs serve to
+> supplement and clarify the rules and documentation described above.
### Setup & Platform
My machine only has one GPU. How can I use this repo?
-> You can run this repo on a machine with an arbitrary number of GPUs. However, the default batch sizes of our algorithms collection (e.g. `algorithms/`) are tuned for a machine with 4× NVIDIA A100 (40 GB) GPUs. You may run into OOMs if you run these algorithms with fewer than 8 GPUs. If you run into these issues because you are using a machine with less total GPU memory, please reduce the batch sizes for the submission. Note that your final submission must 'fit' on the [**benchmarking hardware**](#benchmarking-hardware), so if you are using fewer GPUs with higher per-GPU memory, please monitor your memory usage to make sure it will fit on 4× NVIDIA A100 GPUs with 40 GB of VRAM per card.
+> You can run this repo on a machine with an arbitrary number of GPUs. However,
+> the default batch sizes of our algorithms collection (e.g. `algorithms/`) are
+> tuned for a machine with 4× NVIDIA A100 (40 GB) GPUs. You may run into OOMs if
+> you run these algorithms with fewer than 4 GPUs. If you run into these issues
+> because you are using a machine with less total GPU memory, please reduce the
+> batch sizes for the submission. Note that your final submission must 'fit' on
+> the [**benchmarking hardware**](#benchmarking-hardware), so if you are using
+> fewer GPUs with higher per-GPU memory, please monitor your memory usage to
+> make sure it will fit on 4× NVIDIA A100 GPUs with 40 GB of VRAM per card.
How do I run this on my SLURM cluster?
-> You may run into issues with `sudo` and `docker` on a SLURM cluster. To run the workloads in a SLURM cluster you can use Apptainer (_formerly Singularity_), see this [**section**](/docs/GETTING_STARTED.md#using-singularityapptainer-instead-of-docker).
+> You may run into issues with `sudo` and `docker` on a SLURM cluster. To run
+> the workloads in a SLURM cluster you can use Apptainer (*formerly
+> Singularity*), see this
+> [**section**](/docs/GETTING_STARTED.md#using-singularityapptainer-instead-of-docker).
How can I run this on my AWS/GCP/Azure cloud project?
-> Depending on your virtual machine, you may have to install the correct GPU drivers and the NVIDIA Docker toolkit. For example, in GCP you will have to do the following.
+> Depending on your virtual machine, you may have to install the correct GPU
+> drivers and the NVIDIA Docker toolkit. For example, in GCP you will have to do
+> the following.
>
-> 1. If you don't have a VM instance yet, we recommend creating a
-> new Compute Instance with the "Deep Learning on Linux" Image in Boot disk options.
-> 2. To install the NVIDIA Docker toolkit, you can use [`docker/scripts/cloud-startup.sh`](/docker/scripts/cloud-startup.sh) as a startup script for the VM. This will automate the installation of the NVIDIA GPU Drivers and NVIDIA Docker toolkit.
+> 1. If you don't have a VM instance yet, we recommend creating a new Compute
+> Instance with the "Deep Learning on Linux" Image in Boot disk options.
+> 2. To install the NVIDIA Docker toolkit, you can use
+> [`docker/scripts/cloud-startup.sh`](/docker/scripts/cloud-startup.sh) as a
+> startup script for the VM. This will automate the installation of the
+> NVIDIA GPU Drivers and NVIDIA Docker toolkit.
@@ -622,14 +1118,18 @@ Furthermore, all submitters must sign the following agreements:
How do I submit my algorithm to the benchmark?
-> Please see our [**How to Submit**](/README.md#how-to-submit) section. You can submit your algorithm to the benchmark by opening a PR on the [**submission repository**](https://github.com/mlcommons/submissions_algorithms).
+> Please see our [**How to Submit**](/README.md#how-to-submit) section. You can
+> submit your algorithm to the benchmark by opening a PR on the
+> [**submission repository**](https://github.com/mlcommons/submissions_algorithms).
Can I submit multiple times?
-> Our benchmark allows multiple submissions as long as they are substantially different. We discourage submitters from creating bulk submissions as this is not in the spirit of the benchmark.
+> Our benchmark allows multiple submissions as long as they are substantially
+> different. We discourage submitters from creating bulk submissions as this is
+> not in the spirit of the benchmark.
@@ -643,9 +1143,12 @@ Furthermore, all submitters must sign the following agreements:
Can I install custom dependencies?
-> You may use custom dependencies as long as they do not conflict with any of the pinned packages in [`pyproject.toml`](/pyproject.toml).
+> You may use custom dependencies as long as they do not conflict with any of
+> the pinned packages in [`pyproject.toml`](/pyproject.toml).
>
-> To include your custom dependencies in your submission, please include them in a `requirements.txt` file. Please refer to the [**Software dependencies**](#software-dependencies) section of our rules.
+> To include your custom dependencies in your submission, please include them in
+> a `requirements.txt` file. Please refer to the
+> [**Software dependencies**](#software-dependencies) section of our rules.
@@ -654,14 +1157,25 @@ Furthermore, all submitters must sign the following agreements:
How can I know if my code can be run on benchmarking hardware?
-> The benchmarking hardware specifications are documented in the [**Benchmarking Hardware Section**](#benchmarking-hardware). We recommend monitoring your submission's memory usage so that it does not exceed the available memory on the benchmarking hardware. We also recommend doing a dry run using a cloud instance.
+> The benchmarking hardware specifications are documented in the
+> [**Benchmarking Hardware Section**](#benchmarking-hardware). We recommend
+> monitoring your submission's memory usage so that it does not exceed the
+> available memory on the benchmarking hardware. We also recommend doing a dry
+> run using a cloud instance.
This benchmark seems computationally expensive. Do I have to run it myself?
-> Submitters are no longer required to self-report results to get on the _AlgoPerf_ leaderboard. Instead, they can open a PR in the [**submission repository**](https://github.com/mlcommons/submissions_algorithms) and the working group will score the most promising submissions, see our [**How to Submit**](/README.md#how-to-submit) section for more details. You can use self-reported results to provide evidence of performance on the benchmark. Even if you fully self-report, we will still verify the scores by rerunning the submission on our setup.
+> Submitters are no longer required to self-report results to get on the
+> *AlgoPerf* leaderboard. Instead, they can open a PR in the
+> [**submission repository**](https://github.com/mlcommons/submissions_algorithms)
+> and the working group will score the most promising submissions; see our
+> [**How to Submit**](/README.md#how-to-submit) section for more details. You
+> can use self-reported results to provide evidence of performance on the
+> benchmark. Even if you fully self-report, we will still verify the scores by
+> rerunning the submission on our setup.
@@ -670,9 +1184,18 @@ Furthermore, all submitters must sign the following agreements:
> Yes, you may, as long as it isn't an exact copy of an existing submission.
>
-> For example, you may submit the Adam optimizer with your particularly effective hyperparameter search space and hyperparameter configuration, as different choices for hyperparameter values and/or search spaces constitute different training algorithms and are potential sources of innovation.
+> For example, you may submit the Adam optimizer with your particularly
+> effective hyperparameter search space and hyperparameter configuration, as
+> different choices for hyperparameter values and/or search spaces constitute
+> different training algorithms and are potential sources of innovation.
>
-> That said, while submitting Adam with some novel heuristic to set various hyperparameters, some especially effective hyperparameter search space, or your single best hyperparameter configuration is fine, avoid making multiple submissions that only differ by their hyperparameter configuration without a convincing justification they are substantially different (see the [**"Can I submit multiple times to the benchmark competition?"**](#submitting) question, above).
+> That said, while submitting Adam with some novel heuristic to set various
+> hyperparameters, some especially effective hyperparameter search space, or
+> your single best hyperparameter configuration is fine, avoid making multiple
+> submissions that only differ by their hyperparameter configuration without a
+> convincing justification they are substantially different (see the
+> [**"Can I submit multiple times to the benchmark competition?"**](#submitting)
+> question, above).
@@ -680,7 +1203,23 @@ Furthermore, all submitters must sign the following agreements:
### Shared Data Pipelines between `JAX` and `PyTorch`
-The `JAX` and `PyTorch` versions of the `Criteo`, `fastMRI`, `LibriSpeech`, `OGBG`, and `WMT` workloads use the same `TensorFlow` input pipelines. Due to differences in how `JAX` and `PyTorch` distribute computations across devices, the `PyTorch` workloads have an additional overhead for these workloads.
-
-Since we use `PyTorch`'s [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel) implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a `TensorFlow` input pipeline in each Python process can lead to errors, since too many threads are created in each process. See [this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85) for more details.
-While this issue might not affect all setups, we currently implement a different strategy: we only run the `TensorFlow` input pipeline in one Python process (with `rank == 0`), and [broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast) the batches to all other devices. This introduces additional communication overhead for each batch. See the [implementation for the `WMT` workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algoperf/workloads/wmt/wmt_pytorch/workload.py#L215-L288) as an example.
+The `JAX` and `PyTorch` versions of the `Criteo`, `fastMRI`, `LibriSpeech`,
+`OGBG`, and `WMT` workloads use the same `TensorFlow` input pipelines. Due to
+differences in how `JAX` and `PyTorch` distribute computations across devices,
+the `PyTorch` versions of these workloads incur additional overhead.
+
+Since we use `PyTorch`'s
+[`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel)
+implementation, there is one Python process for each device. Depending on the
+hardware and the settings of the cluster, running a `TensorFlow` input pipeline
+in each Python process can lead to errors, since too many threads are created in
+each process. See
+[this PR thread](https://github.com/mlcommons/algorithmic-efficiency/pull/85)
+for more details. While this issue might not affect all setups, we currently
+implement a different strategy: we only run the `TensorFlow` input pipeline in
+one Python process (with `rank == 0`), and
+[broadcast](https://pytorch.org/docs/stable/distributed.html#torch.distributed.broadcast)
+the batches to all other devices. This introduces additional communication
+overhead for each batch. See the
+[implementation for the `WMT` workload](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algoperf/workloads/wmt/wmt_pytorch/workload.py#L215-L288)
+as an example.
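+
+The following is a simplified sketch of this broadcast pattern (illustrative
+only; it assumes the process group is already initialized and that every rank
+knows the batch shape and dtype in advance, and it omits the padding, sharding,
+and dtype handling of the actual workload implementations):
+
+```python
+import torch
+import torch.distributed as dist
+
+
+def get_batch(input_pipeline, batch_shape, device):
+    """Run the input pipeline only on rank 0 and broadcast the batch."""
+    if dist.get_rank() == 0:
+        batch = next(input_pipeline)  # NumPy batch from the TensorFlow pipeline
+        tensor = torch.as_tensor(batch, device=device)
+    else:
+        tensor = torch.empty(batch_shape, device=device)  # receive buffer
+    dist.broadcast(tensor, src=0)  # rank 0 -> all other ranks
+    return tensor
+```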
diff --git a/docs/GETTING_STARTED.md b/docs/GETTING_STARTED.md
index 0cc286099..e8ea1734e 100644
--- a/docs/GETTING_STARTED.md
+++ b/docs/GETTING_STARTED.md
@@ -2,84 +2,91 @@
## Table of Contents
-- [Set Up and Installation](#set-up-and-installation)
- - [Python Virtual Environment](#python-virtual-environment)
- - [Docker](#docker)
- - [Building Docker Image](#building-docker-image)
- - [Running Docker Container (Interactive)](#running-docker-container-interactive)
- - [Using Singularity/Apptainer instead of Docker](#using-singularityapptainer-instead-of-docker)
-- [Download the Data](#download-the-data)
-- [Develop your Submission](#develop-your-submission)
- - [Set Up Your Directory Structure (Optional)](#set-up-your-directory-structure-optional)
- - [Coding your Submission](#coding-your-submission)
-- [Run your Submission](#run-your-submission)
- - [Pytorch DDP](#pytorch-ddp)
- - [Run your Submission in a Docker Container](#run-your-submission-in-a-docker-container)
- - [Docker Tips](#docker-tips)
-- [Score your Submission](#score-your-submission)
- - [Running workloads](#running-workloads)
-- [Submit your Submission](#submit-your-submission)
+- [Set Up and Installation](#set-up-and-installation)
+ - [Python Virtual Environment](#python-virtual-environment)
+ - [Docker](#docker)
+ - [Building Docker Image](#building-docker-image)
+ - [Running Docker Container (Interactive)](#running-docker-container-interactive)
+ - [Using Singularity/Apptainer instead of Docker](#using-singularityapptainer-instead-of-docker)
+- [Download the Data](#download-the-data)
+- [Develop your Submission](#develop-your-submission)
+ - [Set Up Your Directory Structure (Optional)](#set-up-your-directory-structure-optional)
+ - [Coding your Submission](#coding-your-submission)
+- [Run your Submission](#run-your-submission)
+ - [Pytorch DDP](#pytorch-ddp)
+ - [Run your Submission in a Docker Container](#run-your-submission-in-a-docker-container)
+ - [Docker Tips](#docker-tips)
+- [Score your Submission](#score-your-submission)
+ - [Running workloads](#running-workloads)
+- [Submit your Submission](#submit-your-submission)
## Set Up and Installation
-To get started you will have to make a few decisions and install the repository along with its dependencies. Specifically:
-
-1. Decide if you would like to develop your submission in either PyTorch or JAX.
-2. Set up your workstation or VM. We recommend to use a setup similar to the [benchmarking hardware](/DOCUMENTATION.md#benchmarking-hardware).
- The specs on the benchmarking machines are:
- - 8xV100 16GB GPUs
- - 240 GB in RAM
- - 2 TB in storage (for datasets).
-3. Install the `algoperf` package and dependencies either in a [Python virtual environment](#python-virtual-environment) or use a [Docker](#docker) (recommended) or [Singularity/Apptainer container](#using-singularityapptainer-instead-of-docker).
+To get started you will have to make a few decisions and install the repository
+along with its dependencies. Specifically:
+
+1. Decide whether you would like to develop your submission in PyTorch or
+   JAX.
+2. Set up your workstation or VM. We recommend using a setup similar to the
+   [benchmarking hardware](/DOCUMENTATION.md#benchmarking-hardware). The specs
+ on the benchmarking machines are:
+   - 4× NVIDIA A100 (40 GB) GPUs
+   - 240 GB of RAM
+   - 2 TB of storage (for datasets).
+3. Install the `algoperf` package and dependencies either in a
+ [Python virtual environment](#python-virtual-environment) or use a
+ [Docker](#docker) (recommended) or
+ [Singularity/Apptainer container](#using-singularityapptainer-instead-of-docker).
### Python Virtual Environment
> **Prerequisites:**
>
-> - Python minimum requirement >= 3.11
-> - CUDA 12.1
-> - NVIDIA Driver version 535.104.05
+> - Python minimum requirement >= 3.11
+> - CUDA 12.1
+> - NVIDIA Driver version 535.104.05
To set up a virtual environment and install this repository:
-1. Create new environment, e.g. via `conda` or `virtualenv`
+1. Create a new environment, e.g. via `conda` or `virtualenv`
- ```bash
- sudo apt-get install python3-venv
- python3 -m venv env
- source env/bin/activate
- ```
+ ```bash
+ sudo apt-get install python3-venv
+ python3 -m venv env
+ source env/bin/activate
+ ```
-2. Clone this repository
+2. Clone this repository
- ```bash
- git clone https://github.com/mlcommons/algorithmic-efficiency.git
- cd algorithmic-efficiency
- ```
+ ```bash
+ git clone https://github.com/mlcommons/algorithmic-efficiency.git
+ cd algorithmic-efficiency
+ ```
-3. Run the following pip3 install commands based on your chosen framework to install `algoperf` and its dependencies.
+3. Run the following pip3 install commands based on your chosen framework to
+ install `algoperf` and its dependencies.
- For **JAX**:
+ For **JAX**:
- ```bash
- pip3 install -e '.[pytorch_cpu]'
- pip3 install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html'
- pip3 install -e '.[full]'
- ```
+ ```bash
+ pip3 install -e '.[pytorch_cpu]'
+ pip3 install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html'
+ pip3 install -e '.[full]'
+ ```
- For **PyTorch**
+   For **PyTorch**:
- Note: the below command assumes you have CUDA 12.1 installed locally.
- This is the default in the provided Docker image.
- We recommend you match this CUDA version but if you decide to run
- with a different local CUDA version, please find the appropriate wheel
- url to pass to the `pip install` command for `pytorch`.
+   Note: the command below assumes you have CUDA 12.1 installed locally. This
+   is the default in the provided Docker image. We recommend you match this
+   CUDA version, but if you decide to run with a different local CUDA version,
+   please find the appropriate wheel URL to pass to the `pip install` command
+   for `pytorch`.
- ```bash
- pip3 install -e '.[jax_cpu]'
- pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/cu121'
- pip3 install -e '.[full]'
- ```
+ ```bash
+ pip3 install -e '.[jax_cpu]'
+ pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/cu121'
+ pip3 install -e '.[full]'
+ ```
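+
+   As an optional sanity check after installation, you can verify that your
+   chosen framework can see the GPUs (run only the line matching your
+   installation):
+
+   ```bash
+   # JAX: should list the available CUDA devices
+   python3 -c "import jax; print(jax.devices())"
+   # PyTorch: should print True and the number of GPUs
+   python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
+   ```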
@@ -101,85 +108,105 @@ pip3 install -e '.[full]'
### Docker
-We recommend using a Docker container to ensure a similar environment to our scoring and testing environments. Alternatively, a Singularity/Apptainer container can also be used (see instructions below).
+We recommend using a Docker container to ensure a similar environment to our
+scoring and testing environments. Alternatively, a Singularity/Apptainer
+container can also be used (see instructions below).
> **Prerequisites:**
>
-> - NVIDIA Driver version 535.104.05
-> - NVIDIA Container Toolkit so that the containers can locate the NVIDIA drivers and GPUs. See instructions in the [NVIDIA Docker documentation](https://github.com/NVIDIA/nvidia-docker).
+> - NVIDIA Driver version >= 535.104.05
+> - NVIDIA Container Toolkit so that the containers can locate the NVIDIA
+> drivers and GPUs. See instructions in the
+> [NVIDIA Docker documentation](https://github.com/NVIDIA/nvidia-docker).
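+
+A simple way to verify that Docker can access the GPUs through the NVIDIA
+Container Toolkit is to run `nvidia-smi` inside a throwaway container (the
+`ubuntu` base image here is just an example):
+
+```bash
+# Should print the same GPU table as running nvidia-smi on the host
+docker run --rm --gpus all ubuntu nvidia-smi
+```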
#### Building Docker Image
-1. Clone this repository
+1. Clone this repository
- ```bash
- cd ~ && git clone https://github.com/mlcommons/algorithmic-efficiency.git
- ```
+ ```bash
+ cd ~ && git clone https://github.com/mlcommons/algorithmic-efficiency.git
+ ```
-2. Build Docker image
+2. Build Docker image
- ```bash
- cd algorithmic-efficiency/docker
- docker build -t . --build-arg framework=
- ```
+ ```bash
+ cd algorithmic-efficiency/docker
+   docker build -t <docker_image_name> . --build-arg framework=<framework>
+ ```
- The `framework` flag can be either `pytorch`, `jax` or `both`. Specifying the framework will install the framework specific dependencies.
- The `docker_image_name` is arbitrary.
+   The `framework` flag can be either `pytorch`, `jax`, or `both`. Specifying
+   the framework will install the framework-specific dependencies. The
+   `docker_image_name` is arbitrary.
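+
+   For example, to build an image with only the JAX dependencies (the image
+   name `algoperf_jax` is arbitrary and chosen here purely for illustration):
+
+   ```bash
+   docker build -t algoperf_jax . --build-arg framework=jax
+   ```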
#### Running Docker Container (Interactive)
-To use the Docker container as an interactive virtual environment, you can run a container mounted to your local data and code directories and execute the `bash` program. This may be useful if you are in the process of developing a submission.
-
-1. Run detached Docker container. The `container_id` will be printed if the container is run successfully.
-
- ```bash
- docker run -t -d \
- -v $HOME/data/:/data/ \
- -v $HOME/experiment_runs/:/experiment_runs \
- -v $HOME/experiment_runs/logs:/logs \
- -v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
- --gpus all \
- --ipc=host \
- \
- --keep_container_alive true
- ```
-
- > Note: You may have to use double quotes around `algorithmic-efficiency` [path] in the mounting `-v` flag. If the above command fails try replacing the following line:
- >
- > ```bash
- > -v $HOME/algorithmic-efficiency:/algorithmic-efficiency2 \
- > ```
- >
- > with
- >
- > ```bash
- > -v $HOME"/algorithmic-efficiency:/algorithmic-efficiency" \
- > ```
-
-2. Open a bash terminal
-
- ```bash
- docker exec -it /bin/bash
- ```
+To use the Docker container as an interactive virtual environment, you can run a
+container mounted to your local data and code directories and execute the `bash`
+program. This may be useful if you are in the process of developing a
+submission.
+
+1. Run a detached Docker container. The `container_id` will be printed if the
+   container is run successfully.
+
+ ```bash
+ docker run -t -d \
+ -v $HOME/data/:/data/ \
+ -v $HOME/experiment_runs/:/experiment_runs \
+ -v $HOME/experiment_runs/logs:/logs \
+ -v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
+ --gpus all \
+ --ipc=host \
+   <docker_image_name> \
+ --keep_container_alive true
+ ```
+
+   > Note: You may have to use double quotes around the `algorithmic-efficiency`
+   > path in the mounting `-v` flag. If the above command fails, try replacing
+   > the following line:
+ >
+ > ```bash
+   > -v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
+ > ```
+ >
+ > with
+ >
+ > ```bash
+ > -v $HOME"/algorithmic-efficiency:/algorithmic-efficiency" \
+ > ```
+
+2. Open a bash terminal
+
+ ```bash
+   docker exec -it <container_id> /bin/bash
+ ```
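+
+When you are done with the interactive session, you can stop and remove the
+detached container again (using the same `container_id` as above):
+
+```bash
+docker stop <container_id>
+docker rm <container_id>
+```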
### Using Singularity/Apptainer instead of Docker
-Since many compute clusters don't allow the usage of Docker due to securtiy concerns and instead encourage the use of [Singularity/Apptainer](https://github.com/apptainer/apptainer) (formerly Singularity, now called Apptainer), we also provide an Apptainer recipe (located at `docker/Singularity.def`) that can be used to build an image by running
+Since many compute clusters don't allow the use of Docker due to security
+concerns and instead encourage the use of
+[Singularity/Apptainer](https://github.com/apptainer/apptainer) (formerly
+Singularity, now called Apptainer), we also provide an Apptainer recipe (located
+at `docker/Singularity.def`) that can be used to build an image by running
```bash
singularity build --fakeroot <singularity_image_name>.sif Singularity.def
```
-Note that this can take several minutes. Then, to start a shell session with GPU support (by using the `--nv` flag), we can run
+Note that this can take several minutes. Then, to start a shell session with GPU
+support (by using the `--nv` flag), we can run
```bash
singularity shell --bind $HOME/data:/data,$HOME/experiment_runs:/experiment_runs \
  --nv <singularity_image_name>.sif
```
-Note the `--bind` flag which, similarly to Docker, allows to bind specific paths on the host system and the container, as explained in the [Singularity User Guide](https://docs.sylabs.io/guides/3.7/user-guide/bind_paths_and_mounts.html).
+Note the `--bind` flag which, similarly to Docker, allows you to bind specific
+paths on the host system into the container, as explained in the
+[Singularity User Guide](https://docs.sylabs.io/guides/3.7/user-guide/bind_paths_and_mounts.html).
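+
+If you prefer to run a single command non-interactively instead of opening a
+shell, Apptainer's `exec` subcommand accepts the same `--bind` and `--nv` flags
+(the image name below is whichever name you chose when building the `.sif`):
+
+```bash
+singularity exec --bind $HOME/data:/data,$HOME/experiment_runs:/experiment_runs \
+  --nv <singularity_image_name>.sif nvidia-smi
+```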
-Also note that we generated `Singularity.def` automatically from the `Dockerfile` using [spython](https://github.com/singularityhub/singularity-cli), as follows:
+Also note that we generated `Singularity.def` automatically from the
+`Dockerfile` using [spython](https://github.com/singularityhub/singularity-cli),
+as follows:
```bash
pip3 install spython
@@ -187,60 +214,84 @@ cd algorithmic-efficiency/docker
python scripts/singularity_converter.py -i Dockerfile -o Singularity.def
```
-Users that wish to customize their images are invited to check and modify the `Singularity.def` recipe and the `singularity_converter.py` script.
+Users who wish to customize their images are invited to check and modify the
+`Singularity.def` recipe and the `singularity_converter.py` script.
## Download the Data
-The workloads in this benchmark use 6 different datasets across 8 workloads. You may choose to download some or all of the datasets as you are developing your submission, but your submission will be scored across all 8 workloads. For instructions on obtaining and setting up the datasets see [datasets/README](/datasets/README.md#dataset-setup).
+The workloads in this benchmark use 7 different datasets across 9 workloads. You
+may choose to download some or all of the datasets as you are developing your
+submission, but your submission will be scored across all 9 workloads. For
+instructions on obtaining and setting up the datasets, see
+[datasets/README](/datasets/README.md#dataset-setup).
## Develop your Submission
-To develop a submission you will write a Python module containing your training algorithm. Your training algorithm must implement a set of predefined API methods for the initialization and update steps.
+To develop a submission you will write a Python module containing your training
+algorithm. Your training algorithm must implement a set of predefined API
+methods for the initialization and update steps.
### Set Up Your Directory Structure (Optional)
-Make a submissions subdirectory to store your submission modules e.g. `algorithmic-effiency/submissions/my_submissions`.
+Make a submissions subdirectory to store your submission modules, e.g.
+`algorithmic-efficiency/submissions/my_submissions`.
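+
+For example, from the root of the cloned repository:
+
+```bash
+mkdir -p submissions/my_submissions
+```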
### Coding your Submission
-You can find examples of submission modules under `algorithmic-efficiency/algorithms`. \
-A submission for the external ruleset will consist of a submission module and a tuning search space definition.
-
-1. Copy the template submission module `algorithms/template/submission.py` into your submissions directory e.g. in `algorithmic-efficiency/my_submissions`.
-2. Implement at least the methods in the template submission module. Feel free to use helper functions and/or modules as you see fit. Make sure you adhere to to the competition rules. Check out the guidelines for [allowed submissions](/DOCUMENTATION.md#allowed-submissions), [disallowed submissions](/DOCUMENTATION.md#allowed-submissions) and pay special attention to the [software dependencies rule](/DOCUMENTATION.md#software-dependencies).
-3. Add a tuning configuration e.g. `tuning_search_space.json` file to your submission directory. For the tuning search space you can either:
-
- 1. Define the set of feasible points by defining a value for "feasible_points" for the hyperparameters:
-
- ```JSON
- {
- "learning_rate": {
- "feasible_points": 0.999
- },
- }
- ```
-
- For a complete example see [tuning_search_space.json](/algorithms/target_setting_algorithms/imagenet_resnet/tuning_search_space.json).
-
- 2. Define a range of values for quasirandom sampling by specifing a `min`, `max` and `scaling` keys for the hyperparameter:
-
- ```JSON
- {
- "weight_decay": {
- "min": 5e-3,
- "max": 1.0,
- "scaling": "log",
- }
- }
- ```
-
- For a complete example see [tuning_search_space.json](/algorithms/archived_paper_baselines/nadamw/tuning_search_space.json).
+You can find examples of submission modules under
+`algorithmic-efficiency/algorithms`. \
+A submission for the external tuning ruleset will consist of a submission module
+and a tuning search space definition.
+
+1. Copy the template submission module `algorithms/template/submission.py` into
+   your submissions directory, e.g. into
+   `algorithmic-efficiency/submissions/my_submissions`.
+2. Implement at least the methods in the template submission module. Feel free
+   to use helper functions and/or modules as you see fit. Make sure you adhere
+   to the competition rules. Check out the guidelines for
+   [allowed submissions](/DOCUMENTATION.md#allowed-submissions),
+   [disallowed submissions](/DOCUMENTATION.md#disallowed-submissions) and pay
+   special attention to the
+   [software dependencies rule](/DOCUMENTATION.md#software-dependencies).
+3. Add a tuning configuration, e.g. a `tuning_search_space.json` file, to your
+   submission directory (see the JSON validation tip after this list). For the
+   tuning search space you can either:
+
+   1. Define the set of feasible points by specifying a list of values for
+      `feasible_points` for each hyperparameter:
+
+      ```json
+      {
+        "learning_rate": {
+          "feasible_points": [0.999]
+        }
+      }
+      ```
+
+ For a complete example see
+ [tuning_search_space.json](/algorithms/target_setting_algorithms/imagenet_resnet/tuning_search_space.json).
+
+   2. Define a range of values for quasirandom sampling by specifying `min`,
+      `max`, and `scaling` keys for the hyperparameter:
+
+      ```json
+      {
+        "weight_decay": {
+          "min": 5e-3,
+          "max": 1.0,
+          "scaling": "log"
+        }
+      }
+      ```
+
+ For a complete example see
+ [tuning_search_space.json](/algorithms/archived_paper_baselines/nadamw/tuning_search_space.json).
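+
+Whichever format you use, the tuning search space file must be valid JSON
+(watch out for trailing commas). A quick way to check it, assuming the example
+path from above, is Python's built-in JSON tool:
+
+```bash
+python3 -m json.tool submissions/my_submissions/tuning_search_space.json
+```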
## Run your Submission
-From your virtual environment or interactively running Docker container run your submission with `submission_runner.py`:
+From your virtual environment or interactively running Docker container, run
+your submission with `submission_runner.py`:
-**JAX**: to score your submission on a workload, from the algorithmic-efficency directory run:
+**JAX**: to score your submission on a workload, from the
+`algorithmic-efficiency` directory run:
```bash
python3 submission_runner.py \
@@ -252,7 +303,8 @@ python3 submission_runner.py \
--tuning_search_space=
```
-**PyTorch**: to score your submission on a workload, from the algorithmic-efficency directory run:
+**PyTorch**: to score your submission on a workload, from the
+`algorithmic-efficiency` directory run:
```bash
python3 submission_runner.py \
@@ -266,10 +318,14 @@ python3 submission_runner.py \
### PyTorch DDP
-We recommend using PyTorch's [Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) when using multiple GPUs on a single node. You can initialize ddp with torchrun. For example, on single host with 8 GPUs simply replace `python3` in the above command by:
+We recommend using PyTorch's
+[Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
+when using multiple GPUs on a single node. You can initialize DDP with
+`torchrun`. For example, on a single host with 4 GPUs, simply replace `python3`
+in the above command with:
```bash
-torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=N_GPUS
+torchrun --redirects 1:0,2:0,3:0 --standalone --nnodes=1 --nproc_per_node=N_GPUS
```
where `N_GPUS` is the number of available GPUs on the node.
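+
+If you prefer not to hard-code the GPU count, one option (assuming `nvidia-smi`
+is available on the host) is to derive `N_GPUS` from the GPUs the driver
+reports:
+
+```bash
+# Count the visible GPUs and use the result for --nproc_per_node
+N_GPUS=$(nvidia-smi --list-gpus | wc -l)
+```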
@@ -277,7 +333,7 @@ where `N_GPUS` is the number of available GPUs on the node.
So the complete command is:
```bash
-torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 \
+torchrun --redirects 1:0,2:0,3:0 \
--standalone \
--nnodes=1 \
--nproc_per_node=N_GPUS \
@@ -294,14 +350,28 @@ torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 \
The container entrypoint script provides the following flags:
-- `--dataset` dataset: can be 'imagenet', 'fastmri', 'librispeech', 'criteo1tb', 'wmt', or 'ogbg'. Setting this flag will download data if `~/data/` does not exist on the host machine. Required for running a submission.
-- `--framework` framework: can be either 'pytorch' or 'jax'. If you just want to download data, this flag is required for `-d imagenet` since we have two versions of data for imagenet. This flag is also required for running a submission.
-- `--submission_path` submission_path: path to submission file on container filesystem. If this flag is set, the container will run a submission, so it is required for running a submission.
-- `--tuning_search_space` tuning_search_space: path to file containing tuning search space on container filesystem. Required for running a submission.
-- `--experiment_name` experiment_name: name of experiment. Required for running a submission.
-- `--workload` workload: can be 'imagenet_resnet', 'imagenet_jax', 'librispeech_deepspeech', 'librispeech_conformer', 'ogbg', 'wmt', 'fastmri' or 'criteo1tb'. Required for running a submission.
-- `--max_global_steps` max_global_steps: maximum number of steps to run the workload for. Optional.
-- `--keep_container_alive` : can be true or false. If`true` the container will not be killed automatically. This is useful for developing or debugging.
+- `--dataset`: can be 'imagenet', 'fastmri', 'librispeech', 'criteo1tb',
+  'wmt', 'finewebedu', or 'ogbg'. Setting this flag will download data if
+  `~/data/` does not exist on the host machine. Required for running a
+  submission.
+- `--framework`: can be either 'pytorch' or 'jax'. If you just want to
+  download data, this flag is required for `-d imagenet` since we have two
+  versions of data for imagenet. This flag is also required for running a
+  submission.
+- `--submission_path`: path to the submission file on the container
+  filesystem. If this flag is set, the container will run a submission, so it
+  is required for running a submission.
+- `--tuning_search_space`: path to the file containing the tuning search
+  space on the container filesystem. Required for running a submission.
+- `--experiment_name`: name of the experiment. Required for running a
+  submission.
+- `--workload`: can be 'imagenet_resnet', 'imagenet_jax',
+  'librispeech_deepspeech', 'librispeech_conformer', 'ogbg', 'wmt', 'fastmri',
+  'finewebedu_lm', or 'criteo1tb'. Required for running a submission.
+- `--max_global_steps`: maximum number of steps to run the workload for.
+  Optional.
+- `--keep_container_alive`: can be true or false. If `true`, the container
+  will not be killed automatically. This is useful for developing or debugging.
To run the Docker container that will run the submission runner, run:
@@ -346,15 +416,18 @@ docker exec -it /bin/bash
## Score your Submission
-To score your submission we will score over all workloads, studies, and trials as described in the rules.
-In other words, the total number of runs expected for official scoring is:
+Your submission will be scored over all workloads, studies, and trials as
+described in the rules. In other words, the total number of runs expected for
+official scoring is:
-- for external tuning ruleset: **120** = 8 (workloads) x 3 (studies) x 5 (trials)
-- for self-tuning ruleset: **24** = 8 (workloads) x 3 (studies)
+- for external tuning ruleset: **135** = 9 (workloads) x 3 (studies) x 5
+ (trials)
+- for self-tuning ruleset: **27** = 9 (workloads) x 3 (studies)
### Running workloads
-To run a number of studies and trials over all workload using Docker containers for each run:
+To run a number of studies and trials over all workloads using Docker
+containers for each run:
```bash
python scoring/run_workloads.py \
@@ -368,19 +441,31 @@ python scoring/run_workloads.py \
--seed
```
-Note that to run the above script you will need at least the `jax_cpu` and `pytorch_cpu` installations of the `algorithmic-efficiency` package.
+Note that to run the above script you will need at least the `jax_cpu` and
+`pytorch_cpu` installations of the `algorithmic-efficiency` package.
-During submission development, it might be useful to do faster, approximate scoring (e.g. without `3` different studies or when some trials are missing) so the scoring scripts allow some flexibility.
-To simulate official scoring, pass the `--strict=True` flag in `score_submission.py`. To get the raw scores and performance profiles of group of submissions or single submission:
+During submission development, it might be useful to do faster, approximate
+scoring (e.g. without 3 different studies or when some trials are missing), so
+the scoring scripts allow some flexibility. To simulate official scoring, pass
+the `--strict=True` flag to `score_submissions.py`. To get the raw scores and
+performance profiles of a group of submissions or a single submission, run:
```bash
python score_submissions.py --submission_directory <submission_directory> --output_dir <output_dir> --compute_performance_profiles
```
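+
+For example, with illustrative paths (adapt them to where your experiment logs
+actually live and where you want the results written):
+
+```bash
+python score_submissions.py \
+  --submission_directory $HOME/experiment_runs/my_submission \
+  --output_dir scoring_results \
+  --compute_performance_profiles
+```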
-We provide the scores and performance profiles for the [paper baseline algorithms](/algorithms/archived_paper_baselines/) in the "Baseline Results" section in [Benchmarking Neural Network Training Algorithms](https://arxiv.org/abs/2306.07179).
+We provide the scores and performance profiles for the
+[paper baseline algorithms](/algorithms/archived_paper_baselines/) in the
+"Baseline Results" section of
+[Benchmarking Neural Network Training Algorithms](https://arxiv.org/abs/2306.07179).
## Submit your Submission
-To submit your submission, please create a PR on the [submission repository](https://github.com/mlcommons/submissions_algorithms). You can find more details in the submission repositories [How to Submit](https://github.com/mlcommons/submissions_algorithms?tab=readme-ov-file#how-to-submit) section. The working group will review your PR and select the most promising submissions for scoring.
+To submit your submission, please create a PR on the
+[submission repository](https://github.com/mlcommons/submissions_algorithms).
+You can find more details in the submission repository's
+[How to Submit](https://github.com/mlcommons/submissions_algorithms?tab=readme-ov-file#how-to-submit)
+section. The working group will review your PR and select the most promising
+submissions for scoring.
**Good Luck!**