52 changes: 52 additions & 0 deletions .github/workflows/scoring_tests.yml
@@ -0,0 +1,52 @@
name: Scoring Tests

# Runs the scoring code's linting and unit tests. The scoring code lives in
# `scoring/` and computes the AlgoPerf leaderboard from submission logs.
on:
  push:
    paths:
      - 'scoring/**'
      - 'pyproject.toml'
      - '.github/workflows/scoring_tests.yml'
  pull_request:
    paths:
      - 'scoring/**'
      - 'pyproject.toml'
      - '.github/workflows/scoring_tests.yml'

jobs:
  ruff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install ruff
        run: |
          python -m pip install --upgrade pip
          pip install ruff==0.12.0
      - name: Lint scoring/
        run: ruff check scoring/
      - name: Format check scoring/
        run: ruff format --check scoring/

  pytest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install scoring package
        run: |
          python -m pip install --upgrade pip
          pip install -e .[dev]
      - name: Run scoring unit tests
        # Runs the whole scoring/ suite: the log-parsing unit tests
        # (test_scoring_utils.py) and the end-to-end scoring smoke test
        # (test_score_submissions.py), which reproduces the published v0.5
        # leaderboard and guards the score-aggregation path.
        run: pytest scoring/
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,3 +1,7 @@
.DS_Store
__pycache__/
*.pyc

# Scoring output artifacts (see README "Scoring")
scoring_results*/
*.egg-info/
4 changes: 4 additions & 0 deletions CONTRIBUTING.md
@@ -7,3 +7,7 @@ Generally we encourage people to become MLCommons members if they wish to contri
Regardless of whether you are a member, your organization (or you as an individual contributor) needs to sign the MLCommons Contributor License Agreement (CLA). Please submit your GitHub username to the [MLCommons Subscription form](https://mlcommons.org/community/subscribe/) to start that process.

MLCommons project work is tracked with issue trackers and pull requests. Modify the project in your own fork and issue a pull request once you want other developers to take a look at what you have done and discuss the proposed changes. Ensure that cla-bot and other checks pass for your pull requests.

## Scoring code

The leaderboard scoring code lives in [`scoring/`](./scoring/). See the [Scoring section of the README](./README.md#scoring) for how to install it and regenerate the leaderboard.
63 changes: 60 additions & 3 deletions README.md
@@ -9,13 +9,13 @@ This repository hosts the official rolling leaderboard for the [**AlgoPerf: Trai
The benchmark measures neural network training speedups due to algorithmic improvements in training algorithms.
The leaderboard tracks the aggregate performance of different algorithms on a variety of [workloads](https://github.com/mlcommons/algorithmic-efficiency/blob/main/DOCUMENTATION.md#workloads) and under two different [tuning rulesets](https://github.com/mlcommons/algorithmic-efficiency/blob/main/DOCUMENTATION.md#tuning).

> [!NOTE]
> **If you want to submit to the AlgoPerf benchmark, please open a PR with your submission. The AlgoPerf working group will review your submission and potentially evaluate your submission on all workloads. For more details, see the [How to Submit](#how-to-submit) section.**

## Live Leaderboards

> **Leaderboard Version:** 0.6
> **Last Updated:** 2025-03-24 15:07 UTC
> **Using Benchmark Version:** [latest](https://github.com/mlcommons/algorithmic-efficiency)

> [!TIP]
@@ -58,6 +58,63 @@ To submit your algorithm for evaluation on the AlgoPerf leaderboard, please foll
2. **Create a Pull Request:** Fork this repository, create a new branch and add your submission code to a new folder within either `submissions/external_tuning/` or `submissions/self_tuning`. Open a pull request (PR) to the `evaluation` branch of this repository. Make sure to fill out the PR template asking for information such as submission name, authors, affiliations, etc.
3. **PR Review and Evaluation:** The AlgoPerf working group will review your PR. Based on our available resources and the perceived potential of the method, it will be selected for a free evaluation and merged into the `evaluation` branch. The working group will run your submission on all workloads and push the results, as well as the updated leaderboard, to the `main` branch.

## Scoring

The code that computes this leaderboard lives in [`scoring/`](./scoring/). Given a
directory of submission logs (such as those under [`previous_leaderboards/`](./previous_leaderboards/)),
it computes the performance profiles, time-to-target, AlgoPerf benchmark scores, and
speedups used in the tables above. This code was moved here from the
[`scoring/` directory of the algorithmic-efficiency repository](https://github.com/mlcommons/algorithmic-efficiency)
so that the repository that hosts the leaderboard also owns the code that produces it.
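
As a rough illustration of the quantities named above, the sketch below computes performance profiles and a normalized-area score from a small, made-up table of time-to-target values. It is not the code in `scoring/`: the function names, the example numbers, and the upper ratio bound of 4 are assumptions for illustration, and it ignores details such as held-out workloads and per-study aggregation.

```python
import numpy as np


def performance_profiles(time_to_target, taus):
  """Fraction of workloads each submission solves within tau x the best time.

  time_to_target: (n_submissions, n_workloads); np.inf marks a missed target.
  Returns an array of shape (n_submissions, n_taus).
  """
  t = np.asarray(time_to_target, dtype=float)
  best = t.min(axis=0)  # fastest submission on each workload
  with np.errstate(invalid='ignore'):
    ratios = t / best  # inf / inf gives nan, which compares False below
  return (ratios[:, :, None] <= taus[None, None, :]).mean(axis=1)


def normalized_area(profiles, taus):
  """Trapezoidal area under each profile, divided by the tau range."""
  widths = np.diff(taus)
  heights = (profiles[:, 1:] + profiles[:, :-1]) / 2
  return (heights * widths).sum(axis=1) / (taus[-1] - taus[0])


# Made-up example: two submissions, three workloads (arbitrary time units).
times = np.array(
  [
    [100.0, 200.0, np.inf],  # submission A misses the third target
    [150.0, 100.0, 300.0],  # submission B reaches every target
  ]
)
taus = np.linspace(1.0, 4.0, 301)  # upper bound of 4 is an assumption
profiles = performance_profiles(times, taus)
scores = normalized_area(profiles, taus)
```

A submission that is fastest on every workload has a profile of 1 everywhere and therefore a score of 1; missing a target can only lower the profile.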

### Installation

The scoring code is self-contained. To run it, set up a fresh
Python (>=3.11) environment, e.g. via `conda` or `virtualenv`:

```bash
python3 -m venv env && source env/bin/activate
pip3 install -e . # installs the scoring tooling (numpy, pandas, scipy, ...)
```

> [!NOTE]
> The `scoring/workload_targets*.json` files are generated from the benchmark
> definitions in [algorithmic-efficiency](https://github.com/mlcommons/algorithmic-efficiency)
> (`scoring/generate_workload_targets.py`) and committed here. Each file is
> frozen for one benchmark version, carrying that version's base/held-out
> workload sets and per-workload targets. Regenerate and re-copy when a
> benchmark version changes the workloads or targets.

### Regenerating the leaderboard

The current targets are the default. To score an older leaderboard, pass that
version's file via `--workload_targets` (e.g. `scoring/workload_targets_v05.json`).

```bash
# External tuning ruleset
python -m scoring.score_submissions \
--submission_directory previous_leaderboards/algoperf_v06/logs/external_tuning \
--compute_performance_profiles \
--output_dir scoring_results_external_tuning

# Self-tuning ruleset (add --self_tuning_ruleset)
python -m scoring.score_submissions \
--submission_directory previous_leaderboards/algoperf_v06/logs/self_tuning \
--compute_performance_profiles \
--self_tuning_ruleset \
--output_dir scoring_results_self_tuning

# Reproduce the v0.5 leaderboard (8 base + 6 held-out workloads)
python -m scoring.score_submissions \
--workload_targets scoring/workload_targets_v05.json \
--submission_directory previous_leaderboards/algoperf_v05/logs/external_tuning \
--compute_performance_profiles \
--output_dir scoring_results_v05_external
```

See the [scoring methodology](https://github.com/mlcommons/algorithmic-efficiency/blob/main/docs/DOCUMENTATION.md#scoring)
in the benchmark documentation for details on how scores are computed.

## Citation

If you use the _AlgoPerf benchmark_ in your research, please consider citing our paper.
53 changes: 53 additions & 0 deletions pyproject.toml
@@ -0,0 +1,53 @@
###############################################################################
# MLCommons AlgoPerf: Leaderboard Scoring Tooling #
###############################################################################
# This package contains the scoring code used to compute the AlgoPerf
# leaderboard (performance profiles, time-to-target, benchmark scores and
# speedups) from submission logs stored in this repository.
#
# It was moved here from the `scoring/` directory of
# https://github.com/mlcommons/algorithmic-efficiency so that the repository
# that owns the leaderboard also owns the code that produces it.

[project]
name = "algoperf-submissions"
version = "0.6.0"
description = "Scoring tooling for the MLCommons AlgoPerf: Training Algorithms leaderboard"
authors = [
{ name = "MLCommons Algorithms Working Group", email = "algorithms@mlcommons.org" },
]
license = { text = "Apache 2.0" }
readme = "README.md"
requires-python = ">=3.11"


dependencies = [
"absl-py==2.1.0",
"numpy==2.1.3",
"pandas==2.2.3",
"matplotlib==3.9.2",
"scipy==1.14.1",
"tabulate==0.9.0",
]

[project.optional-dependencies]
dev = ["pytest==8.3.3", "ruff==0.12.0"]

[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[tool.setuptools.packages.find]
# Only package the scoring code; ignore submissions/ and previous_leaderboards/.
include = ["scoring*"]

###############################################################################
# Linting & Formatting Configurations #
###############################################################################
[tool.ruff]
line-length = 80
indent-width = 2
target-version = "py311"

[tool.ruff.format]
quote-style = "single"
Empty file added scoring/__init__.py
Empty file.
Empty file.
73 changes: 73 additions & 0 deletions scoring/algoperf_v05/generate_held_out_workloads.py
@@ -0,0 +1,73 @@
import json
import os
import struct

import numpy as np
from absl import app, flags, logging

flags.DEFINE_integer(
  'held_out_workloads_seed',
  None,
  'Random seed for scoring. AlgoPerf v0.5 seed: 3438810845',
)
flags.DEFINE_string(
  'output_filename',
  'held_out_workloads.json',
  'Path to file to record sampled held_out workloads.',
)
FLAGS = flags.FLAGS

HELD_OUT_WORKLOADS = {
  'librispeech': [
    'librispeech_conformer_attention_temperature',
    'librispeech_conformer_layernorm',
    # 'librispeech_conformer_gelu', # Removed due to bug in target setting procedure
    'librispeech_deepspeech_no_resnet',
    'librispeech_deepspeech_norm_and_spec_aug',
    'librispeech_deepspeech_tanh',
  ],
  'imagenet': [
    'imagenet_resnet_silu',
    'imagenet_resnet_gelu',
    'imagenet_resnet_large_bn_init',
    'imagenet_vit_glu',
    'imagenet_vit_post_ln',
    'imagenet_vit_map',
  ],
  'ogbg': ['ogbg_gelu', 'ogbg_silu', 'ogbg_model_size'],
  'wmt': ['wmt_post_ln', 'wmt_attention_temp', 'wmt_glu_tanh'],
  'fastmri': ['fastmri_model_size', 'fastmri_tanh', 'fastmri_layernorm'],
  'criteo1tb': [
    'criteo1tb_layernorm',
    'criteo1tb_embed_init',
    'criteo1tb_resnet',
  ],
}


def save_held_out_workloads(held_out_workloads, filename):
  with open(filename, 'w') as f:
    json.dump(held_out_workloads, f)


def main(_):
  rng_seed = FLAGS.held_out_workloads_seed
  output_filename = FLAGS.output_filename

  if rng_seed is None:  # `not rng_seed` would also discard an explicit seed of 0
    rng_seed = struct.unpack('I', os.urandom(4))[0]

  logging.info('Using RNG seed %d', rng_seed)
  rng = np.random.default_rng(rng_seed)

  sampled_held_out_workloads = []
  for _, v in HELD_OUT_WORKLOADS.items():
    sampled_index = rng.integers(len(v))
    sampled_held_out_workloads.append(v[sampled_index])

  logging.info(f'Sampled held-out workloads: {sampled_held_out_workloads}')
  save_held_out_workloads(sampled_held_out_workloads, output_filename)


if __name__ == '__main__':
  app.run(main)
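
The sampling step in this script can be exercised in isolation. The sketch below is a minimal, self-contained version that draws one variant per base workload from a seeded NumPy generator and checks that the draw is reproducible. The two-entry table is a shortened subset of `HELD_OUT_WORKLOADS`, and the helper name `sample_one_per_group` and the seed `1234` are illustrative assumptions, not part of the script.

```python
import numpy as np

# Shortened subset of the variant table, for illustration only.
VARIANTS = {
  'ogbg': ['ogbg_gelu', 'ogbg_silu', 'ogbg_model_size'],
  'wmt': ['wmt_post_ln', 'wmt_attention_temp', 'wmt_glu_tanh'],
}


def sample_one_per_group(variants, seed):
  """Draws one variant per base workload, in dict insertion order."""
  rng = np.random.default_rng(seed)
  return [group[rng.integers(len(group))] for group in variants.values()]


# Arbitrary seed chosen for this example; it is not the v0.5 seed.
first = sample_one_per_group(VARIANTS, seed=1234)
second = sample_one_per_group(VARIANTS, seed=1234)
```

Because the generator is consumed once per group in dictionary order, adding, removing, or reordering entries in the table can change which variants a given seed selects.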
@@ -0,0 +1 @@
["librispeech_conformer_layernorm", "imagenet_resnet_large_bn_init", "ogbg_model_size", "wmt_glu_tanh", "fastmri_tanh", "criteo1tb_embed_init"]
1 change: 1 addition & 0 deletions scoring/algoperf_v05/held_out_workloads_example.json
@@ -0,0 +1 @@
["librispeech_conformer_gelu", "imagenet_resnet_silu", "ogbg_gelu", "wmt_post_ln", "fastmri_model_size", "criteo1tb_layernorm"]