diff --git a/.github/workflows/scoring_tests.yml b/.github/workflows/scoring_tests.yml
Lines changed: 52 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 4 additions & 0 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 4 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 60 additions & 3 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 53 additions & 0 deletions
diff --git a/scoring/__init__.py b/scoring/__init__.py
diff --git a/scoring/algoperf_v05/__init__.py b/scoring/algoperf_v05/__init__.py
diff --git a/scoring/algoperf_v05/generate_held_out_workloads.py b/scoring/algoperf_v05/generate_held_out_workloads.py
Lines changed: 73 additions & 0 deletions
diff --git a/scoring/algoperf_v05/held_out_workloads_algoperf_v05.json b/scoring/algoperf_v05/held_out_workloads_algoperf_v05.json
Lines changed: 1 addition & 0 deletions
diff --git a/scoring/algoperf_v05/held_out_workloads_example.json b/scoring/algoperf_v05/held_out_workloads_example.json
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1,52 @@
+name: Scoring Tests
+
+# Runs the scoring code's linting and unit tests. The scoring code lives in
+# `scoring/` and computes the AlgoPerf leaderboard from submission logs.
+on:
+  push:
+    paths:
+      - 'scoring/**'
+      - 'pyproject.toml'
+      - '.github/workflows/scoring_tests.yml'
+  pull_request:
+    paths:
+      - 'scoring/**'
+      - 'pyproject.toml'
+      - '.github/workflows/scoring_tests.yml'
+
+jobs:
+  ruff:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python 3.11
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+      - name: Install ruff
+        run: |
+          python -m pip install --upgrade pip
+          pip install ruff==0.12.0
+      - name: Lint scoring/
+        run: ruff check scoring/
+      - name: Format check scoring/
+        run: ruff format --check scoring/
+
+  pytest:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python 3.11
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+      - name: Install scoring package
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e .[dev]
+      - name: Run scoring unit tests
+        # Runs the whole scoring/ suite: the log-parsing unit tests
+        # (test_scoring_utils.py) and the end-to-end scoring smoke test
+        # (test_score_submissions.py), which reproduces the published v0.5
+        # leaderboard and guards the score-aggregation path.
+        run: pytest scoring/
@@ -1,3 +1,7 @@
 .DS_Store
 __pycache__/
 *.pyc
+
+# Scoring output artifacts (see README "Scoring")
+scoring_results*/
+*.egg-info/
@@ -7,3 +7,7 @@ Generally we encourage people to become MLCommons members if they wish to contri
 Regardless of whether you are a member, your organization (or you as an individual contributor) needs to sign the MLCommons Contributor License Agreement (CLA). Please submit your GitHub username to the [MLCommons Subscription form](https://mlcommons.org/community/subscribe/) to start that process.
 
 MLCommons project work is tracked with issue trackers and pull requests. Modify the project in your own fork and issue a pull request once you want other developers to take a look at what you have done and discuss the proposed changes. Ensure that cla-bot and other checks pass for your pull requests.
+
+## Scoring code
+
+The leaderboard scoring code lives in [`scoring/`](./scoring/). See the [Scoring section of the README](./README.md#scoring) for how to install it and regenerate the leaderboard.
@@ -9,13 +9,13 @@ This repository hosts the official rolling leaderboard for the [**AlgoPerf: Trai
 The benchmark measures neural network training speedups due to algorithmic improvements in training algorithms.
 The leaderboard tracks the aggregate performance of different algorithms on a variety of [workloads](https://github.com/mlcommons/algorithmic-efficiency/blob/main/DOCUMENTATION.md#workloads) and under two different [tuning rulesets](https://github.com/mlcommons/algorithmic-efficiency/blob/main/DOCUMENTATION.md#tuning).
 
-> [!NOTE]  
+> [!NOTE]
 > **If you want to submit to the AlgoPerf benchmark, please open a PR with your submission. The AlgoPerf working group will review your submission and potentially evaluate your submission on all workloads. For more details, see the [How to Submit](#how-to-submit) section.**
 
 ## Live Leaderboards
 
-> **Leaderboard Version:** 0.6  
-> **Last Updated:** 2025-03-24 15:07 UTC  
+> **Leaderboard Version:** 0.6
+> **Last Updated:** 2025-03-24 15:07 UTC
 > **Using Benchmark Version:** [latest](https://github.com/mlcommons/algorithmic-efficiency)
 
 > [!TIP]
@@ -58,6 +58,63 @@ To submit your algorithm for evaluation on the AlgoPerf leaderboard, please foll
 2. **Create a Pull Request:** Fork this repository, create a new branch and add your submission code to a new folder within either `submissions/external_tuning/` or `submissions/self_tuning`. Open a pull request (PR) to the `evaluation` branch of this repository. Make sure to fill out the PR template asking for information such as submission name, authors, affiliations, etc.
 3. **PR Review and Evaluation:** The AlgoPerf working group will review your PR. Based on our available resources and the perceived potential of the method, it will be selected for a free evaluation and merged into the `evaluation` branch. The working group will run your submission on all workloads and push the results, as well as the updated leaderboard, to the `main` branch.
 
+## Scoring
+
+The code that computes this leaderboard lives in [`scoring/`](./scoring/). Given a
+directory of submission logs (such as those under [`previous_leaderboards/`](./previous_leaderboards/)),
+it computes the performance profiles, time-to-target, AlgoPerf benchmark scores, and
+speedups used in the tables above. This code was moved here from the
+[`scoring/` directory of the algorithmic-efficiency repository](https://github.com/mlcommons/algorithmic-efficiency)
+so that the repository that hosts the leaderboard also owns the code that produces it.
+
+### Installation
+
+The scoring code is self-contained. To run it, set up a fresh
+Python (>=3.11) environment, e.g. via `conda` or `virtualenv`:
+
+```bash
+python3 -m venv env && source env/bin/activate
+pip3 install -e .          # installs the scoring tooling (numpy, pandas, scipy, ...)
+```
+
+> [!NOTE]
+> The `scoring/workload_targets*.json` files are generated from the benchmark
+> definitions in [algorithmic-efficiency](https://github.com/mlcommons/algorithmic-efficiency)
+> (`scoring/generate_workload_targets.py`) and committed here. Each file is
+> frozen for one benchmark version, carrying that version's base/held-out
+> workload sets and per-workload targets. Regenerate and re-copy when a
+> benchmark version changes the workloads or targets.
+
+### Regenerating the leaderboard
+
+The current targets are the default. To score an older leaderboard, pass
+`--workload_targets` for that version's file (e.g. `scoring/workload_targets_v05.json`).
+
+```bash
+# External tuning ruleset
+python -m scoring.score_submissions \
+  --submission_directory previous_leaderboards/algoperf_v06/logs/external_tuning \
+  --compute_performance_profiles \
+  --output_dir scoring_results_external_tuning
+
+# Self-tuning ruleset (add --self_tuning_ruleset)
+python -m scoring.score_submissions \
+  --submission_directory previous_leaderboards/algoperf_v06/logs/self_tuning \
+  --compute_performance_profiles \
+  --self_tuning_ruleset \
+  --output_dir scoring_results_self_tuning
+
+# Reproduce the v0.5 leaderboard (8 base + 6 held-out workloads)
+python -m scoring.score_submissions \
+  --workload_targets scoring/workload_targets_v05.json \
+  --submission_directory previous_leaderboards/algoperf_v05/logs/external_tuning \
+  --compute_performance_profiles \
+  --output_dir scoring_results_v05_external
+```
+
+See the [scoring methodology](https://github.com/mlcommons/algorithmic-efficiency/blob/main/docs/DOCUMENTATION.md#scoring)
+in the benchmark documentation for details on how scores are computed.
+
 ## Citation
 
 If you use the _AlgoPerf benchmark_ in your research, please consider citing our paper.
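
The three invocations in the README's "Regenerating the leaderboard" section differ only in a few flags. As a minimal sketch, the hypothetical helper `build_score_command` below (not part of this PR) assembles the argument list from exactly the flags shown in that section. It only builds the command and does not run the scoring code.

```python
import sys


def build_score_command(
  submission_directory,
  output_dir,
  self_tuning=False,
  workload_targets=None,
):
  """Builds the argv list for one scoring run (hypothetical helper)."""
  cmd = [sys.executable, '-m', 'scoring.score_submissions']
  # Older leaderboards need that version's frozen targets file.
  if workload_targets is not None:
    cmd += ['--workload_targets', workload_targets]
  cmd += [
    '--submission_directory',
    submission_directory,
    '--compute_performance_profiles',
  ]
  # The self-tuning ruleset is selected by an extra boolean flag.
  if self_tuning:
    cmd.append('--self_tuning_ruleset')
  cmd += ['--output_dir', output_dir]
  return cmd


v05_external = build_score_command(
  'previous_leaderboards/algoperf_v05/logs/external_tuning',
  'scoring_results_v05_external',
  workload_targets='scoring/workload_targets_v05.json',
)
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from the repository root, assuming the package has been installed as described in the Installation section.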
 
@@ -0,0 +1,53 @@
+###############################################################################
+#              MLCommons AlgoPerf: Leaderboard Scoring Tooling                 #
+###############################################################################
+# This package contains the scoring code used to compute the AlgoPerf
+# leaderboard (performance profiles, time-to-target, benchmark scores and
+# speedups) from submission logs stored in this repository.
+#
+# It was moved here from the `scoring/` directory of
+# https://github.com/mlcommons/algorithmic-efficiency so that the repository
+# that owns the leaderboard also owns the code that produces it.
+
+[project]
+name = "algoperf-submissions"
+version = "0.6.0"
+description = "Scoring tooling for the MLCommons AlgoPerf: Training Algorithms leaderboard"
+authors = [
+  { name = "MLCommons Algorithms Working Group", email = "algorithms@mlcommons.org" },
+]
+license = { text = "Apache 2.0" }
+readme = "README.md"
+requires-python = ">=3.11"
+
+
+dependencies = [
+  "absl-py==2.1.0",
+  "numpy==2.1.3",
+  "pandas==2.2.3",
+  "matplotlib==3.9.2",
+  "scipy==1.14.1",
+  "tabulate==0.9.0",
+]
+
+[project.optional-dependencies]
+dev = ["pytest==8.3.3", "ruff==0.12.0"]
+
+[build-system]
+requires = ["setuptools>=45"]
+build-backend = "setuptools.build_meta"
+
+[tool.setuptools.packages.find]
+# Only package the scoring code; ignore submissions/ and previous_leaderboards/.
+include = ["scoring*"]
+
+###############################################################################
+#                     Linting & Formatting Configurations                     #
+###############################################################################
+[tool.ruff]
+line-length = 80
+indent-width = 2
+target-version = "py311"
+
+[tool.ruff.format]
+quote-style = "single"
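
The `ruff` version is pinned in two places in this PR: the `dev` extra in `pyproject.toml` and the `pip install` line in `scoring_tests.yml`. A minimal sketch of how the two pins could be compared is shown below. The helper `parse_pins` and the two constant lists are illustrative assumptions; the values are copied by hand from the hunks rather than read from the files.

```python
def parse_pins(requirements):
  """Maps package name -> pinned version for 'name==version' strings."""
  pins = {}
  for requirement in requirements:
    name, sep, version = requirement.partition('==')
    if not sep:
      raise ValueError(f'{requirement!r} is not an exact pin')
    pins[name.strip().lower()] = version.strip()
  return pins


# Copied from [project.optional-dependencies] in pyproject.toml.
DEV_EXTRAS = ['pytest==8.3.3', 'ruff==0.12.0']
# Copied from the "Install ruff" step in scoring_tests.yml.
WORKFLOW_INSTALL = ['ruff==0.12.0']

dev_pins = parse_pins(DEV_EXTRAS)
workflow_pins = parse_pins(WORKFLOW_INSTALL)

# Collect every workflow pin that disagrees with the dev extra.
mismatched = {
  name: (version, dev_pins.get(name))
  for name, version in workflow_pins.items()
  if dev_pins.get(name) != version
}
print(mismatched)  # An empty dict while the two pins agree.
```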
@@ -0,0 +1,73 @@
+import json
+import os
+import struct
+
+import numpy as np
+from absl import app, flags, logging
+
+flags.DEFINE_integer(
+  'held_out_workloads_seed',
+  None,
+  'Random seed for scoring. AlgoPerf v0.5 seed: 3438810845',
+)
+flags.DEFINE_string(
+  'output_filename',
+  'held_out_workloads.json',
+  'Path to file to record sampled held_out workloads.',
+)
+FLAGS = flags.FLAGS
+
+HELD_OUT_WORKLOADS = {
+  'librispeech': [
+    'librispeech_conformer_attention_temperature',
+    'librispeech_conformer_layernorm',
+    # 'librispeech_conformer_gelu', # Removed due to bug in target setting procedure
+    'librispeech_deepspeech_no_resnet',
+    'librispeech_deepspeech_norm_and_spec_aug',
+    'librispeech_deepspeech_tanh',
+  ],
+  'imagenet': [
+    'imagenet_resnet_silu',
+    'imagenet_resnet_gelu',
+    'imagenet_resnet_large_bn_init',
+    'imagenet_vit_glu',
+    'imagenet_vit_post_ln',
+    'imagenet_vit_map',
+  ],
+  'ogbg': ['ogbg_gelu', 'ogbg_silu', 'ogbg_model_size'],
+  'wmt': ['wmt_post_ln', 'wmt_attention_temp', 'wmt_glu_tanh'],
+  'fastmri': ['fastmri_model_size', 'fastmri_tanh', 'fastmri_layernorm'],
+  'criteo1tb': [
+    'criteo1tb_layernorm',
+    'criteo1tb_embed_init',
+    'criteo1tb_resnet',
+  ],
+}
+
+
+def save_held_out_workloads(held_out_workloads, filename):
+  with open(filename, 'w') as f:
+    json.dump(held_out_workloads, f)
+
+
+def main(_):
+  rng_seed = FLAGS.held_out_workloads_seed
+  output_filename = FLAGS.output_filename
+
+  # Draw a random seed only when none was passed; 0 is a valid seed.
+  if rng_seed is None:
+    rng_seed = struct.unpack('I', os.urandom(4))[0]
+
+  logging.info('Using RNG seed %d', rng_seed)
+  rng = np.random.default_rng(rng_seed)
+
+  sampled_held_out_workloads = []
+  for _, v in HELD_OUT_WORKLOADS.items():
+    sampled_index = rng.integers(len(v))
+    sampled_held_out_workloads.append(v[sampled_index])
+
+  logging.info('Sampled held-out workloads: %s', sampled_held_out_workloads)
+  save_held_out_workloads(sampled_held_out_workloads, output_filename)
+
+
+if __name__ == '__main__':
+  app.run(main)
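
The script above draws one held-out variant per dataset from a seeded generator, so a published seed makes the draw reproducible. The sketch below illustrates only that property. It deliberately swaps in the standard library's `random.Random` for NumPy's `default_rng`, so its draws will not match the script's output for the same seed; `sample_held_out_workloads` and the shortened `CANDIDATES` table are hypothetical names used for illustration.

```python
import random


def sample_held_out_workloads(candidates, seed):
  """Draws one workload per dataset, in dict order, from a seeded RNG."""
  rng = random.Random(seed)
  return [rng.choice(variants) for variants in candidates.values()]


# A shortened candidate table, for illustration only.
CANDIDATES = {
  'ogbg': ['ogbg_gelu', 'ogbg_silu', 'ogbg_model_size'],
  'wmt': ['wmt_post_ln', 'wmt_attention_temp', 'wmt_glu_tanh'],
}

# The same seed yields the same draw on every run.
first = sample_held_out_workloads(CANDIDATES, seed=1234)
second = sample_held_out_workloads(CANDIDATES, seed=1234)
```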
@@ -0,0 +1 @@
+["librispeech_conformer_layernorm", "imagenet_resnet_large_bn_init", "ogbg_model_size", "wmt_glu_tanh", "fastmri_tanh", "criteo1tb_embed_init"]
@@ -0,0 +1 @@
+["librispeech_conformer_gelu", "imagenet_resnet_silu", "ogbg_gelu", "wmt_post_ln", "fastmri_model_size", "criteo1tb_layernorm"]
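
Each held-out file is a flat JSON list with one workload name per dataset. As a hedged sketch, the hypothetical checker `datasets_covered` below attributes each entry to a dataset by its name prefix and fails when a dataset is missing or appears twice. The prefix convention is inferred from the keys of `HELD_OUT_WORKLOADS` in the script and is an assumption, not something the scoring code is known to enforce.

```python
import json

DATASETS = ('librispeech', 'imagenet', 'ogbg', 'wmt', 'fastmri', 'criteo1tb')


def datasets_covered(held_out_json):
  """Returns dataset -> workload, requiring exactly one workload each."""
  workloads = json.loads(held_out_json)
  covered = {}
  for workload in workloads:
    matches = [d for d in DATASETS if workload.startswith(d + '_')]
    if len(matches) != 1:
      raise ValueError(f'Cannot attribute {workload!r} to a dataset')
    dataset = matches[0]
    if dataset in covered:
      raise ValueError(f'More than one workload for {dataset!r}')
    covered[dataset] = workload
  missing = set(DATASETS) - set(covered)
  if missing:
    raise ValueError(f'No workload for: {sorted(missing)}')
  return covered


# Contents of held_out_workloads_algoperf_v05.json, copied from the hunk.
V05 = (
  '["librispeech_conformer_layernorm", "imagenet_resnet_large_bn_init", '
  '"ogbg_model_size", "wmt_glu_tanh", "fastmri_tanh", "criteo1tb_embed_init"]'
)
covered = datasets_covered(V05)
```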