Commit 6bd3bf1

Oss readiness (#51)
* chore: OSS readiness audit — license, schema, validators, refactor, docs

  Phase 1 — Legal & attribution
  - Align license: pyproject.toml + README badge → Apache-2.0 (matches LICENSE).
  - Add NOTICE summarising bundled third-party data and upstream terms.
  - Add License & attribution sections to datasets/README.md and each
    datasets/sharegpt_*_v1/README.md (CC BY 4.0, upstream link).
  - Add schema/accuracy_subset.README.md documenting the MMLU subset (MIT).

  Phase 2 — Contributor experience & validation
  - Fix doc drift in DEVELOPMENT.md, README.md, runners/README.md,
    suites/README.md, runners/template/runner.py (rename
    SUPPORTED_QUANTIZATIONS → SUPPORTED_QUANTIZATION_BACKENDS in *editable*
    files only; existing runner.py hashes untouched).
  - Add schema/suite.schema.json + runners/validate_suites.py and wire both
    into validate_pr.yml / generate_leaderboard.yml.
  - Add .github/ISSUE_TEMPLATE/new_suite.md for community suite proposals.
  - CONTRIBUTING.md: add local leaderboard preview instructions.
  - .gitignore: ignore node_modules/, .cursor/, .aider*, .envrc, .direnv/.

  Phase 3 — Code quality & CI
  - runners/benchmark_runner.py:
    * Remove dead code (stub format_prompt, dead spec-decoding branch,
      redundant acc_result init, duplicated _build_result_json block).
    * Extract helpers (_prepare_load_context, _score_accuracy_questions,
      _write_accuracy_artifacts) shared between accuracy scenarios.
    * Replace inference dispatch if/elif ladder with _SCENARIO_REGISTRY
      (ScenarioSpec dataclass: inference_kind, use_async, merge_key…).
    * _MERGE_SCENARIO_KEYS now derived from the registry.
    Net −111 lines.
  - leaderboard: split SUITE_META into
    leaderboard/site/assets/data/suite-meta.js, data.js re-exports it
    (data.js 1010 → 800 lines).
  - validate_pr.yml: add python-tests job (serve + openclaw_skill pytest).
  - pyproject.toml: setuptools.packages.find now lists
    loadgen/runners/serve/openclaw_skill explicitly and excludes tests*.

  README hero & citation
  - Embed docs/assets/framework-overview.png under nav links and
    docs/assets/chip-cloud.png in a new "Currently on the leaderboard"
    section.
  - Expand BibTeX author list in the Citation section.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* docs(readme): update citation title

  Co-authored-by: Cursor <cursoragent@cursor.com>

* ci: fix python-tests collection — lazy uvicorn import + add numpy

  - serve/server.py: import uvicorn lazily inside start_server() so that
    importing the module (e.g. from tests, or to expose the ASGI app) does
    not require uvicorn to be installed.
  - validate_pr.yml: add numpy to the python-tests install list — pulled in
    transitively by loadgen, needed once serve.server imports
    runners.benchmark_runner during test collection.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(serve/tests): add TokenStreamingMockRunner; MockRunner now raises NotImplementedError

  Pre-existing breakage in serve/tests/test_server.py — never caught because
  python-tests was not wired into CI until this branch.
  - test_server.py imports TokenStreamingMockRunner from mock_runner, but
    the class did not exist (4 ImportError collection errors).
  - test_fallback_when_no_token_stream expects MockRunner to *not* implement
    true token streaming so the server's single-chunk fallback path runs.
    MockRunner used to yield word-by-word, so the test asserted
    len(content_chunks) == 1 but got more (1 AssertionError).

  Fix to match the RunnerProtocol contract (runners/protocol.py:67) — true
  token streaming is optional, runners signal "not supported" by raising
  NotImplementedError:
  - MockRunner.inference_fn_token_stream now raises NotImplementedError
    (with a trailing unreachable yield so the function shape stays an async
    generator, matching the protocol).
  - Add TokenStreamingMockRunner(MockRunner) that overrides the method to
    yield word-by-word with a small async delay — used by the four tests
    that exercise the multi-chunk SSE path.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(serve/tests): TokenStreamingMockRunner — emit leading separator, not trailing

  test_token_stream_reassembles_correctly concatenates every content delta
  and expects exact equality with the response_text. Yielding "word + ' '"
  tacks an extra trailing space onto the reassembled string, so the
  assertion failed:

      got:      'Hello from token stream. '
      expected: 'Hello from token stream.'

  Switch to a leading-space separator (space before every word *after* the
  first). Concatenation now round-trips exactly, and the shape matches how
  real BPE / SentencePiece tokenizers stream pieces (the first token has no
  preceding space; subsequent ones do).

  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
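The last two fixes in the commit message describe a small pattern that is easier to see in code. The sketch below is not the code from `serve/tests/mock_runner.py`; it is a minimal illustration under stated assumptions. The class and method names come from the commit message, but the method signature (a single `prompt` argument), the `response_text` attribute, and the `collect_chunks` helper are hypothetical.

```python
import asyncio
from typing import AsyncIterator, List


class MockRunner:
    """Mock without true token streaming (simplified, assumed signature)."""

    response_text = "Hello from token stream."

    async def inference_fn_token_stream(self, prompt: str) -> AsyncIterator[str]:
        # Signal "not supported" so the caller can take its fallback path.
        raise NotImplementedError("token streaming not supported")
        yield ""  # unreachable, but keeps this function an async generator


class TokenStreamingMockRunner(MockRunner):
    """Streams word by word, with the space in front of every word after the first."""

    async def inference_fn_token_stream(self, prompt: str) -> AsyncIterator[str]:
        for index, word in enumerate(self.response_text.split(" ")):
            await asyncio.sleep(0)  # stand-in for the small async delay
            yield word if index == 0 else " " + word


async def collect_chunks(runner: MockRunner, prompt: str) -> List[str]:
    """Hypothetical caller: stream when supported, otherwise emit one chunk."""
    try:
        return [piece async for piece in runner.inference_fn_token_stream(prompt)]
    except NotImplementedError:
        return [runner.response_text]  # single-chunk fallback


fallback_chunks = asyncio.run(collect_chunks(MockRunner(), "hi"))
streamed_chunks = asyncio.run(collect_chunks(TokenStreamingMockRunner(), "hi"))
print(fallback_chunks)
print(streamed_chunks)
```

Because the separator leads rather than trails, joining `streamed_chunks` reproduces `response_text` exactly, which is the property the reassembly test checks.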
1 parent ad1a270 commit 6bd3bf1

27 files changed

Lines changed: 1451 additions & 618 deletions

.github/ISSUE_TEMPLATE/new_suite.md

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+---
+name: Propose a new suite
+about: Propose a new benchmark suite (new model, scenario mix, or scaling axis)
+title: "[Suite] <short description, e.g. 'Suite H — Llama-3.1-405B'>"
+labels: suite-proposal
+assignees: ''
+---
+
+<!--
+This template starts the discussion for a new AccelMark suite. The final
+contract goes into suites/<suite_id>/suite.json (see
+schema/suite.schema.json) — please fill in as many of the fields below as
+you can. Anything you leave blank we'll work out in the thread before
+merging.
+
+Full walk-through: DEVELOPMENT.md "Adding a new suite"
+https://github.com/JuhaoLiang1997/AccelMark/blob/main/DEVELOPMENT.md
+-->
+
+## Why this suite?
+
+<!-- One sentence: the question this suite answers that no existing suite
+(A–G) covers. Example: "How fast is this chip on 405B-parameter
+dense models?" -->
+
+## Suite contract (draft)
+
+| Field | Proposed value |
+|---|---|
+| **Suite ID** | `suite_<X>` |
+| **Model** | `<huggingface/repo-id>` |
+| **Model revision** | `<commit sha or tag>` |
+| **Chip count** | `1` / `auto` / specific number |
+| **Precision** | `BF16` / `FP16` / list of allowed precisions |
+| **Dataset** | existing (`sharegpt_standard_v1`, `sharegpt_edge_v1`, `sharegpt_longctx_v1`) or new |
+| **Max model length** | tokens |
+| **Output tokens (max)** | tokens |
+| **Concurrency levels** | e.g. `[8, 32, 128]` |
+| **Default scenarios** | subset of `accuracy / offline / online / interactive / sustained` |
+| **Extra scenarios** | optional: `sustained / speculative / burst / …` |
+| **Primary metric** | `offline_throughput`, `max_valid_qps`, … |
+| **Expected run time on A100** | minutes |
+
+## Accuracy baseline
+
+<!-- Required before the suite can land on the main leaderboard. -->
+
+- [ ] I will provide an A100 (or equivalent reference) BF16 baseline score
+to add to `schema/accuracy_baselines.json`.
+- [ ] If a new dataset is required, I will submit it under
+`datasets/<name>_v1/` with a `README.md` that documents the source
+and upstream license (see [`datasets/README.md`](../../datasets/README.md)).
+
+## Custom orchestration?
+
+<!-- Most suites only need `suite.json`. Mark these only if you genuinely
+need a `suite.py` plugin (multiple subprocesses, custom merge logic,
+similar to Suite C/E). -->
+
+- [ ] Standard scenario dispatch is enough — no `suite.py` needed.
+- [ ] A `suite.py` plugin is required. Reason:
+
+## Reference result plan
+
+<!-- New suites do not appear on the main leaderboard until at least one
+verified reference result is submitted. -->
+
+- Reference hardware: <e.g. NVIDIA A100-SXM4-80GB ×1>
+- Runner: `<runner_id>`
+- Who will run it: <@your-handle / vendor / community member>
+
+## Open questions
+
+<!-- Anything you'd like community / maintainer feedback on before opening
+the PR. -->

.github/workflows/generate_leaderboard.yml

Lines changed: 5 additions & 1 deletion
@@ -11,8 +11,9 @@ on:
     paths:
       - 'results/**'
       - 'leaderboard/**'
+      - 'suites/**'
+      - 'schema/**'
       - 'tools/generate_platforms_matrix.py'
-      - 'schema/platforms.json'
       - 'runners/*/meta.json'
 
   # Allow manual trigger from Actions tab (useful for first deploy or to
@@ -37,6 +38,9 @@ jobs:
       - name: Validate all runner meta.json files and hashes
         run: python runners/validate_runners.py
 
+      - name: Validate all suite definitions
+        run: python runners/validate_suites.py
+
   generate:
     name: Generate and deploy leaderboard
     runs-on: ubuntu-latest

.github/workflows/validate_pr.yml

Lines changed: 69 additions & 2 deletions
@@ -8,7 +8,8 @@ on:
     paths:
       - 'results/**'
       - 'runners/**'
-      - 'schema/platforms.json'
+      - 'suites/**'
+      - 'schema/**'
       - 'tools/generate_platforms_matrix.py'
       - 'README.md'
       - 'leaderboard/site/**'
@@ -89,6 +90,29 @@ jobs:
           python tools/generate_platforms_matrix.py --check
           echo "::endgroup::"
 
+  validate-suites:
+    name: Validate suite definitions
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: pip
+
+      - name: Install dependencies
+        run: pip install jsonschema
+
+      # Always validate every suite (and re-validate on schema changes too).
+      # This catches drift introduced by shared changes — e.g. a
+      # suite.schema.json edit that breaks an unrelated existing suite.
+      - name: Validate all suite folders (drift check)
+        run: |
+          echo "::group::Validating every suite folder in the repo"
+          python runners/validate_suites.py
+          echo "::endgroup::"
+
   validate:
     name: Validate result submissions
     runs-on: ubuntu-latest
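The `validate-suites` job above re-checks every suite folder on each run, so a shared schema edit cannot silently break an unrelated suite. The source of `runners/validate_suites.py` is not part of this view, so the following is only a sketch of the drift-check idea, not the actual script. To stay dependency-free it replaces the `jsonschema` validation with a hypothetical required-keys check; `REQUIRED_KEYS`, `validate_suite`, and `validate_all_suites` are illustrative names, and `suite_id` is an assumed field.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical stand-in for schema/suite.schema.json: required top-level keys only.
REQUIRED_KEYS = ("suite_id", "dataset")


def validate_suite(suite: dict) -> List[str]:
    """Return a list of problems for one parsed suite.json (empty means valid)."""
    return [f"missing required key: {key}" for key in REQUIRED_KEYS if key not in suite]


def validate_all_suites(suites_dir: Path) -> Dict[str, List[str]]:
    """Check every <suites_dir>/<id>/suite.json and map failing folders to errors."""
    failures: Dict[str, List[str]] = {}
    for suite_file in sorted(suites_dir.glob("*/suite.json")):
        try:
            suite = json.loads(suite_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            failures[suite_file.parent.name] = [f"invalid JSON: {exc.msg}"]
            continue
        errors = validate_suite(suite)
        if errors:
            failures[suite_file.parent.name] = errors
    return failures


# Demo on a throwaway tree: one valid suite, one missing its dataset.
with tempfile.TemporaryDirectory() as tmp:
    suites_dir = Path(tmp) / "suites"
    demo = {
        "suite_A": {"suite_id": "suite_A", "dataset": "sharegpt_standard_v1"},
        "suite_X": {"suite_id": "suite_X"},
    }
    for name, body in demo.items():
        (suites_dir / name).mkdir(parents=True)
        (suites_dir / name / "suite.json").write_text(json.dumps(body), encoding="utf-8")
    failures = validate_all_suites(suites_dir)

print(failures)
```

In a CI step, a script like this would typically exit non-zero when `failures` is non-empty so the job fails.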
@@ -225,4 +249,47 @@ jobs:
       # extra files to leaderboard/site/test/ to widen coverage; the
       # glob below picks them up automatically.
       - name: Run leaderboard frontend tests
-        run: node --test leaderboard/site/test/*.test.mjs
+        run: node --test leaderboard/site/test/*.test.mjs
+
+  python-tests:
+    name: Python unit tests (serve + skill)
+    runs-on: ubuntu-latest
+    # Lightweight checks for the FastAPI serve layer and the OpenClaw skill
+    # entry point. No GPU, no real model — everything is mocked. Tests are
+    # opt-in per package so missing deps in one folder don't take the rest
+    # of the suite down with them.
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: pip
+
+      - name: Install test dependencies
+        # numpy is pulled in transitively by loadgen (imported when serve.server
+        # touches runners.benchmark_runner). Keep this list lean — these are the
+        # only packages required to *collect and run* the unit tests; no torch,
+        # no vendor SDKs, no real runner.
+        run: |
+          pip install --quiet pytest pydantic fastapi httpx pyyaml jsonschema numpy
+
+      - name: Run serve unit tests
+        run: |
+          if [ -d serve/tests ]; then
+            echo "::group::pytest serve/tests"
+            python -m pytest serve/tests -q --no-header --color=no
+            echo "::endgroup::"
+          else
+            echo "serve/tests/ not present — skipping."
+          fi
+
+      - name: Run OpenClaw skill unit tests
+        run: |
+          if [ -d openclaw_skill/tests ]; then
+            echo "::group::pytest openclaw_skill/tests"
+            python -m pytest openclaw_skill/tests -q --no-header --color=no
+            echo "::endgroup::"
+          else
+            echo "openclaw_skill/tests/ not present — skipping."
+          fi

.gitignore

Lines changed: 9 additions & 0 deletions
@@ -12,11 +12,20 @@ env/
 # ── Editor / IDE ────────────────────────────────────────────────────────────
 .idea/
 .vscode/
+.cursor/
 *.swp
 *.swo
 *~
 *.tmp
 .DS_Store
+.aider*
+.envrc
+.direnv/
+
+# ── Node / frontend tooling ─────────────────────────────────────────────────
+node_modules/
+.eslintcache
+npm-debug.log*
 
 # ── Test / lint caches ──────────────────────────────────────────────────────
 .pytest_cache/

CONTRIBUTING.md

Lines changed: 15 additions & 0 deletions
@@ -320,6 +320,21 @@ CI then re-runs the schema validator and the runner-folder integrity check.
 When both pass and a contributor reviews the diff, the PR is merged and your
 result shows up on the leaderboard on the next site build.
 
+### Optional: preview the leaderboard locally
+
+The static site is generated from `results/` by `leaderboard/generate.py`.
+After dropping your result into `results/community/<run_name>/`, you can
+preview the final UI before opening the PR:
+
+```bash
+python leaderboard/generate.py                   # writes leaderboard/site/leaderboard.js + api/
+python -m http.server -d leaderboard/site 8000   # serve the static site
+# open http://localhost:8000
+```
+
+Both `leaderboard.js` and `leaderboard/site/api/` are gitignored — the GitHub
+Actions workflow regenerates them on every merge to `main`.
+
 ### Alternative: open a submission issue (no git required)
 
 If you'd rather not use git, paste your `result.json` into a

DEVELOPMENT.md

Lines changed: 48 additions & 12 deletions
@@ -32,13 +32,14 @@ AccelMark/
 │   ├── loadgen.py  ← Shared timing and measurement engine
 │   └── types.py  ← InferenceResult, SampleRecord
 ├── suites/
-│   ├── suite_A/suite.json + requests.jsonl
-│   ├── suite_B/suite.json + requests.jsonl
-│   ├── suite_C/suite.json + suite.py + requests.jsonl
-│   ├── suite_D/suite.json + requests.jsonl
-│   ├── suite_E/suite.json + suite.py + requests.jsonl
-│   ├── suite_F/suite.json + requests.jsonl
-│   └── suite_G/suite.json + requests.jsonl
+│   ├── suite_A/suite.json
+│   ├── suite_B/suite.json
+│   ├── suite_C/suite.json + suite.py  ← suite.py is optional; only C and E ship one
+│   ├── suite_D/suite.json
+│   ├── suite_E/suite.json + suite.py
+│   ├── suite_F/suite.json
+│   └── suite_G/suite.json
+│   (request data lives in datasets/, referenced by "dataset" in suite.json)
 ├── datasets/
 │   ├── sharegpt_standard_v1/requests.jsonl  ← 500 prompts, ~280/310 tok
 │   ├── sharegpt_longctx_v1/requests.jsonl  ← 200 prompts, ~28K input tok (Suite D)
@@ -554,12 +555,15 @@ descriptions and distributions.
 If you need a custom distribution:
 
 1. Create `datasets/{your_dataset}_v1/requests.jsonl`
-2. Create `datasets/{your_dataset}_v1/README.md`
+2. Create `datasets/{your_dataset}_v1/README.md` (must document source +
+   upstream license — see `datasets/README.md`)
 3. Set `"dataset": "{your_dataset}_v1"` in your suite.json
 
-If your suite needs a custom dataset only used by that suite, you can
-also place `requests.jsonl` directly in `suites/suite_X/` — the
-benchmark runner checks there as a fallback.
+The `dataset` field is **required**. `BenchmarkRunner._resolve_requests_path`
+loads `datasets/<name>/requests.jsonl` and raises `FileNotFoundError` if it
+cannot find the file. Earlier versions allowed putting `requests.jsonl`
+directly under `suites/suite_X/`; that fallback has been removed in favor
+of the immutable, versioned `datasets/` layout.
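The lookup rule in the added paragraph can be sketched as a standalone function. This is not the body of `BenchmarkRunner._resolve_requests_path`, which is not shown in this diff; `resolve_requests_path` and its `repo_root` parameter are hypothetical, and raising `ValueError` for a missing `dataset` field is an assumption. Only the `datasets/<name>/requests.jsonl` location and the `FileNotFoundError` come from the text.

```python
import tempfile
from pathlib import Path


def resolve_requests_path(repo_root: Path, suite: dict) -> Path:
    """Return datasets/<name>/requests.jsonl for the suite's "dataset" field."""
    dataset = suite.get("dataset")
    if not dataset:
        # Assumption: the real runner may report a missing field differently.
        raise ValueError('suite.json must set the "dataset" field')
    path = repo_root / "datasets" / dataset / "requests.jsonl"
    if not path.is_file():
        # No fallback to suites/<suite_id>/requests.jsonl any more.
        raise FileNotFoundError(f"dataset file not found: {path}")
    return path


# Demo on a throwaway tree containing a single dataset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    dataset_dir = root / "datasets" / "sharegpt_standard_v1"
    dataset_dir.mkdir(parents=True)
    (dataset_dir / "requests.jsonl").write_text("{}\n", encoding="utf-8")

    found = resolve_requests_path(root, {"dataset": "sharegpt_standard_v1"})
    found_relative = found.relative_to(root).as_posix()
    try:
        resolve_requests_path(root, {"dataset": "missing_v1"})
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True

print(found_relative, missing_raises)
```

Failing loudly instead of probing a second location keeps each suite tied to exactly one versioned dataset folder.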
 
 Dataset format (one JSON object per line):
 ```json
@@ -622,6 +626,38 @@ not shown on the main leaderboard.
 
 ---
 
+## Adding a new scenario type
+
+If you need a scenario name that none of `accuracy / offline / online /
+interactive / sustained / speculative / burst` covers, you can register
+one without forking the dispatch logic:
+
+1. Open `runners/benchmark_runner.py` and add a row to
+   `_SCENARIO_REGISTRY` near the top of the file:
+
+   ```python
+   "your_scenario": ScenarioSpec(
+       name="your_scenario",
+       inference_kind="streaming",  # or "offline"
+       needs_streaming=True,  # require SUPPORTS_STREAMING?
+       use_async=True,  # passed to load_model()
+       merge_key="your_scenario",  # None = no-merge (e.g. accuracy)
+   ),
+   ```
+
+2. If the scenario needs special LoadGen behaviour (e.g. like `sustained`),
+   add a branch under "Run benchmark" inside `_run_single_scenario`.
+
+3. List the new scenario name in your suite's
+   `scenarios.{default,extra}` array — the merge order is derived from
+   the registry automatically.
+
+Without a registry entry the base class falls back to a streaming
+inference path with `merge_key = <scenario>`. Register an entry whenever
+you want the scenario to be treated differently (offline, no merge, etc.).
+
+---
+
 ## Suite plugin system
 
 Suites with custom orchestration logic (multiple subprocesses, special
@@ -1098,6 +1134,6 @@ python runners/validate_submission.py --dir /tmp/accelmark_test/
 ## Questions and Support
 
 - **Bug in LoadGen or schema:** Open a GitHub Issue
-- **New suite proposal:** Open a GitHub Issue with the "Request new suite" template
+- **New suite proposal:** Open a GitHub Issue with the [**Propose a new suite**](https://github.com/JuhaoLiang1997/AccelMark/issues/new?template=new_suite.md) template
 - **New platform support:** Open a PR with a working platform script and at least one verified result
 - **Leaderboard question:** Check `leaderboard/generate.py` — it's well-commented

NOTICE

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+AccelMark
+Copyright 2024-2026 Juhao Liang and The AccelMark Contributors
+
+This product includes software developed as part of the AccelMark project
+(https://github.com/JuhaoLiang1997/AccelMark).
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+================================================================================
+Third-party bundled data
+================================================================================
+
+The AccelMark source tree includes a small amount of third-party data so that
+benchmark runs are fully reproducible without network access. Each bundled
+dataset retains its upstream license; the Apache 2.0 license above covers only
+the AccelMark code, schemas, and configuration around it.
+
+--------------------------------------------------------------------------------
+1. datasets/sharegpt_standard_v1/requests.jsonl (500 prompts)
+datasets/sharegpt_edge_v1/requests.jsonl (500 prompts)
+datasets/sharegpt_longctx_v1/requests.jsonl (200 prompts)
+--------------------------------------------------------------------------------
+
+Derived from the ShareGPT GPT-4 conversational dataset curated by:
+
+shibing624/sharegpt_gpt4
+https://huggingface.co/datasets/shibing624/sharegpt_gpt4
+License: CC BY 4.0
+(https://creativecommons.org/licenses/by/4.0/)
+
+The upstream corpus was assembled from publicly shared ChatGPT/GPT-4
+conversations. AccelMark's variants are filtered subsets used as fixed
+benchmark inputs; no derivation is intended as the authoritative copy.
+
+Attribution: shibing624/sharegpt_gpt4 contributors, distributed under CC BY 4.0.
+
+See datasets/<name>/README.md for the per-subset filtering criteria and
+token statistics.
+
+--------------------------------------------------------------------------------
+2. schema/accuracy_subset.jsonl (100 multiple-choice items)
+--------------------------------------------------------------------------------
+
+A 100-question subset of MMLU (Massive Multitask Language Understanding):
+
+Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D.,
+& Steinhardt, J. (2021). "Measuring Massive Multitask Language
+Understanding." International Conference on Learning Representations.
+https://arxiv.org/abs/2009.03300
+https://github.com/hendrycks/test
+
+License: MIT
+(https://opensource.org/licenses/MIT)
+
+AccelMark uses this subset purely as an accuracy gate (model-quality
+sanity check) — it is NOT a measurement of MMLU performance. The subset
+is immutable; see CONTRIBUTING.md "A few rules".
+
+================================================================================
+Third-party software dependencies
+================================================================================
+
+AccelMark's Python runtime dependencies (jsonschema, numpy, pyyaml, …) and
+the framework backends invoked by each runner (vLLM, SGLang, mlx-lm,
+vllm-ascend, vllm-rocm, vllm-tpu, vllm-musa, …) retain their own licenses.
+See each runner's requirements.txt for pinned versions; see the upstream
+projects for the corresponding license terms.
