Commit a2c496a

[OMNIML-4788] specdec_bench: configuration.json provenance + upload_to_s3 (#1531)
> [!WARNING]
> **Breaking on-disk schema change (specdec_bench v1.0.0).** This PR renames the acceptance-rate metric fields across `AcceptanceRate` / `MTBench` / `SpecBench` writers:
>
> | Old (pre-1.0.0) | New (1.0.0) |
> |---|---|
> | `Request_AR` | `Request_AL` |
> | `Category_AR` | `Category_AL` |
> | `Average_AR` | `Average_AL` |
> | — | `Joint_Acceptance_Rate` (new) |
>
> The renamed values were always **acceptance length** (mean tokens generated per inference step), not a rate, and the visualizer reads `*_AL`. Pre-1.0.0 runs in S3 have `*_AR` and no `Joint_AR`; they must be re-run or post-processed before comparing. The visualizer aggregates runs by `specdec_bench` major version so accidental cross-methodology comparison is blocked.

### What does this PR do?

Type of change: new feature

Adds reproducibility provenance to `specdec_bench/configuration.json` and ports `upload_to_s3.py` from `iputterman/specdec_bench@main` (personal-namespace fork) into upstream. This is the first PR in a multi-stage migration off Izzy's fork now that he's left the team. Tracked in [OMNIML-4788](https://jirasw.nvidia.com/browse/OMNIML-4788).

**Provenance fields added to configuration.json** (alongside existing argv / engine_version / gpu / python_version):

- `specdec_bench_version` — methodology semver declared in `specdec_bench/__init__.py`. Bump minor on additive metrics, major on changed metric *definitions*. The visualizer (Phase 4 of the migration) will aggregate runs by major version so plots don't accidentally compare across methodology changes.
- `specdec_bench_sha`, `modelopt_sha`, `modelopt_version`, `nmm_sandbox_sha`, `container_image` — code/runtime provenance. Each prefers an env var set by the harness (`SPECDEC_BENCH_SHA`, `MODELOPT_SHA`, `MODELOPT_VERSION`, `NMM_SANDBOX_SHA`, `CONTAINER_IMAGE`) and falls back to `git rev-parse` / `modelopt.__version__` when running standalone. The env-var preference is necessary because the runtime container has no `.git/` (the launcher packager tarballs source without git metadata) and may not have `modelopt` installed.
- `checkpoint.{path, size_bytes, index_sha256, index_source}` — cheap reproducibility fingerprint that hashes `model.safetensors.index.json` (or `config.json` fallback). Changes whenever any tensor changes.
- `serving_config` — engine-level config dict captured after init via a new `Model.get_serving_config()` method. VLLM dumps `AsyncEngineArgs` + the live `vllm_config.to_dict()`; SGLANG dumps the `engine_kwargs` passed to `sgl.Engine`; TRTLLM left at the base default `{}` for a later iteration.
- `timestamp` — UTC ISO 8601.

**Other changes**

- `upload_to_s3.py` + `specdec_bench/s3_utils.py` ported from iputterman/specdec_bench@main. Recognizes run dirs by sentinel files, refuses to overwrite existing S3 prefixes.
- `_redact_config` allowlists `tokenizer`, `tokenizer_path`, `tokenizer_mode`, `tokenizer_revision` so the model path stops being redacted (latent bug from substring-matching `token` ⊂ `tokenizer`).
- `requirements_speed.txt`: `boto3`, `botocore` added (used by `s3_utils`).

**Out of scope** (deferred to Phase 1b / Phase 2):

- `--sweep_config` driver that emits per-run-dir nesting `<sweep>/<NNN_dataset_c<conc>>/`
- `--s3_upload` flag baked into `run.py` itself
- Launcher auto-injection of the provenance env vars (currently the example YAML sets them statically)
- `container_digest` (enroot integration) and full GPU/driver inventory
- TRTLLM `get_serving_config()`

### Usage

```bash
# Run a smoke benchmark (Qwen3.5-4B + vLLM + MTP draft=3) — example YAML included
uv run launch.py --yaml examples/Qwen/Qwen3.5-4B/specdec_bench_mtp.yaml --yes

# After it lands, upload the run directory to S3:
S3_KEY_ID=team-specdec-workgroup \
S3_SECRET=... \
python upload_to_s3.py /path/to/sweep_dir s3://team-specdec-workgroup/results
```

### Testing

Cluster-tested end-to-end on cw-dfw (Slurm job 11978794, NeMo Run experiment `cicd_1779403623`, ~19 min wall):

- Qwen3.5-4B + vLLM + MTP draft=3 + SPEED-Bench-Internal/qualitative (80 requests)
- `configuration.json` (22 KB) populated all eight new provenance fields
- `Request_AR` mean 3.327 (vs 3.330 on the pre-Phase-1a run — within noise; methodology unchanged)
- `upload_to_s3.py` (real upload, not dry-run) landed [s3://team-specdec-workgroup/results/qwen35_4_mtp_smoke_2026-05-21/specdec_bench_mtp/](https://app.s8k.io/buckets/team-specdec-workgroup/?prefix=results%2Fqwen35_4_mtp_smoke_2026-05-21%2F) where the visualizer at http://10.131.132.205:8080 can pick it up.

### Before your PR is "Ready for review"

- Is this change backward compatible?: ✅
  - `configuration.json` only gains fields. `upload_to_s3.py` / `s3_utils.py` are new files. `Model.get_serving_config()` default = `{}` so existing subclasses without an override behave as before.
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅
  - `boto3` / `botocore` are Apache 2.0 (permissive); `upload_to_s3.py` + `s3_utils.py` are ported from a private NVIDIA repo with explicit copyright headers retained.
- Did you write any new necessary tests?: ❌
  - Validated by cluster smoke (see Testing). Will add unit-tests for `dump_env` provenance fields and `upload_to_s3._discover_runs` in a follow-up.
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
  - Internal-facing tooling.
- Did you get Claude approval on this PR?: ❌ (triggering after open)

### Additional Information

Tracked in JIRA [OMNIML-4788](https://jirasw.nvidia.com/browse/OMNIML-4788). The full multi-phase plan is on that ticket's SPEC block — this PR is Phase 1a.

Cherry-picked alongside the harness change are two example YAMLs (`examples/Qwen/Qwen3.5-4B/specdec_bench.yaml` for the NONE autoregressive baseline, `..._mtp.yaml` for the MTP run) that gave us cluster-test evidence. Can be split out if preferred.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added an S3 upload CLI for benchmark results with dry-run and skip-existing options
  * Automatic capture of run configuration, provenance and redacted environment into saved config
  * Models now export serving configuration for reproducible runs
  * New launcher entrypoint and example job configs for Qwen SPEED-Bench runs
* **Documentation**
  * README section describing S3 upload usage and supported local layouts
* **Bug Fixes / Changes**
  * Acceptance-rate metric keys renamed in output (AR -> AL)
* **Tests / CI**
  * New tests for redaction and S3 utilities; CI now runs specdec_bench examples

<!-- review_stack_entry_start -->
[![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/Model-Optimizer/pull/1531?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)
<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: chenhany <chenhany@nvidia.com>
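
The env-var-first, git-fallback lookup described for the `*_sha` provenance fields could look roughly like the sketch below. The helper name `resolve_sha` and its signature are hypothetical and are not taken from `specdec_bench`; only the environment variable names and the use of `git rev-parse` come from the commit message.

```python
import os
import subprocess


def resolve_sha(env_var, repo_dir="."):
    """Return a commit SHA from env_var if set, else from git, else None.

    Hypothetical helper: the real dump_env logic is not shown in this diff.
    """
    value = os.environ.get(env_var)
    if value:
        # The harness injected the value, so no .git/ directory is needed.
        return value
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        # No git binary, no repository at repo_dir, or the call timed out.
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    print(resolve_sha("SPECDEC_BENCH_SHA"))
```

Returning `None` instead of raising keeps a standalone run inside a container without git metadata from failing just because provenance is incomplete.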
1 parent 999c999 commit a2c496a

20 files changed

Lines changed: 1099 additions & 31 deletions

.github/workflows/example_tests.yml

Lines changed: 1 addition & 1 deletion

    @@ -36,7 +36,7 @@ jobs:
         strategy:
           fail-fast: false
           matrix:
    -        example: [llm_distill, llm_qat, llm_sparsity, diffusers_sparsity]
    +        example: [llm_distill, llm_qat, llm_sparsity, diffusers_sparsity, specdec_bench]
             include:
               - example: speculative_decoding
                 docker_image: "26.01"

examples/specdec_bench/README.md

Lines changed: 28 additions & 1 deletion

    @@ -64,7 +64,7 @@ python3 run.py \

     ### Running [SPEED-Bench](https://huggingface.co/datasets/nvidia/SPEED-Bench) on Llama 3.3 70B + Eagle 3

    -1. Install the requirements file using `pip install -r requirements_speed.txt`
    +1. Install the requirements file using `pip install -r requirements.txt`

     2. Prepare the data using the provided script:

    @@ -145,6 +145,33 @@ python3 run.py \
         --runtime_params runtime_args_long_context.yaml
     ```

    +## Uploading results to S3
    +
    +Each `run.py` invocation writes a result directory containing `configuration.json`,
    +`timing.json`, `acceptance_rate.json`, and (when applicable) `mtbench.json` / `specbench.json`.
    +`upload_to_s3.py` is a single-file, drop-in tool that uploads one run — or an entire sweep —
    +to any S3-compatible bucket:
    +
    +```bash
    +python upload_to_s3.py /path/to/run_or_sweep_dir s3://your-bucket/some/prefix \
    +    --endpoint https://your-s3-endpoint \
    +    --key-id YOUR_KEY_ID \
    +    --secret YOUR_SECRET
    +```
    +
    +`--endpoint`, `--key-id`, and `--secret` default to the `S3_ENDPOINT`, `S3_KEY_ID`, and
    +`S3_SECRET` environment variables. Omit `--endpoint` (or set `S3_ENDPOINT=""`) to use AWS S3's
    +default endpoint. Use `--dry-run` to preview the upload plan, and `--skip-existing` to skip
    +runs already present at the destination instead of failing.
    +
    +The tool handles two directory layouts and mirrors them into S3:
    +
    +- **Flat**: `LOCAL_DIR/run_name/{configuration,timing,...}.json`
    +- **Sweep**: `LOCAL_DIR/sweep_name/run_name/{configuration,timing,...}.json`
    +
    +`LOCAL_DIR`'s basename is preserved in the destination prefix, so re-uploads from the same
    +source land in the same place.
    +
     ## Notes

     The goal of this benchmark is to provide an easy way to configure, run, and compare speculative implementations across frameworks in an apples-to-apples method.
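
The flat and sweep layouts in the README addition above can be told apart by where a sentinel file sits. The sketch below is one way to enumerate run directories under those two layouts. It is not the code in `upload_to_s3.py`: the function name `discover_runs` and the choice of `configuration.json` as the only sentinel are assumptions made for illustration.

```python
from pathlib import Path

# Assumed sentinel: the README says every run directory holds this file.
SENTINEL = "configuration.json"


def discover_runs(local_dir):
    """List run directories as POSIX paths that keep local_dir's basename."""
    root = Path(local_dir).resolve()
    runs = []
    # Depth 1 matches the flat layout, depth 2 matches the sweep layout.
    for pattern in ("*/" + SENTINEL, "*/*/" + SENTINEL):
        for sentinel in sorted(root.glob(pattern)):
            # Paths are taken relative to the parent so the basename survives,
            # mirroring the "basename is preserved" behaviour in the README.
            runs.append(sentinel.parent.relative_to(root.parent).as_posix())
    return runs
```

Each returned path could then be appended to the destination prefix to form the S3 key prefix for that run.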

examples/specdec_bench/requirements_speed.txt renamed to examples/specdec_bench/requirements.txt

Lines changed: 2 additions & 0 deletions

    @@ -1,3 +1,5 @@
    +boto3>=1.34.0
    +botocore>=1.34.0
     datasets>=3.1.0
     rich>=14.2.0
     seaborn>=0.13.2

examples/specdec_bench/run.py

Lines changed: 5 additions & 0 deletions

    @@ -20,6 +20,7 @@
     from specdec_bench import datasets, metrics, models, runners
     from specdec_bench.utils import (
         decode_chat,
    +    dump_env,
         encode_chat,
         get_tokenizer,
         postprocess_base,
    @@ -174,6 +175,10 @@ def run_simple(args):
         if args.save_dir is not None:
             for metric in metrics_list:
                 metric.update_directory(args.save_dir)
    +        # Stamp configuration.json BEFORE the run loop so the file lands even
    +        # when the run crashes mid-way. Engine init is already done, so the
    +        # live serving_config from the model is available.
    +        dump_env(args, args.save_dir, overrides={"serving_config": model.get_serving_config()})

         runner = runners.SimpleRunner(model, metrics=metrics_list)
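
The hunk above calls `dump_env` before the run loop so that `configuration.json` exists even if the benchmark crashes. The body of `dump_env` is not part of this diff, so the following is only a stand-in that shows the pattern: `write_configuration` is a hypothetical name, and the `args` / `timestamp` keys are assumptions apart from the UTC ISO 8601 format named in the commit message.

```python
import json
import os
from datetime import datetime, timezone


def write_configuration(args, save_dir, overrides=None):
    """Write configuration.json up front and return its path.

    Hypothetical stand-in for dump_env, whose body is not in this diff.
    """
    config = {
        "args": dict(args),
        # UTC ISO 8601, as the commit message specifies for `timestamp`.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Overrides win, so the live serving_config replaces any placeholder.
    config.update(overrides or {})
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, "configuration.json")
    with open(path, "w") as f:
        # default=str stops the dump from failing on non-JSON values.
        json.dump(config, f, indent=2, default=str)
    return path
```

Writing the file before any request is issued means a partially completed run directory still carries the provenance needed to identify what was being measured.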

examples/specdec_bench/specdec_bench/__init__.py

Lines changed: 5 additions & 0 deletions

    @@ -13,3 +13,8 @@
     # See the License for the specific language governing permissions and
     # limitations under the License.

    +# Re-export modelopt's version so configuration.json's `specdec_bench_version`
    +# tracks the parent package without a separate semver source of truth.
    +# Breaking schema/methodology changes are recorded in commit messages and
    +# fingerprinted by `specdec_bench_sha` in configuration.json.
    +from modelopt import __version__

examples/specdec_bench/specdec_bench/datasets/speed.py

Lines changed: 1 addition & 1 deletion

    @@ -147,7 +147,7 @@ def __init__(
         ):
             if not_installed:
                 raise ImportError(
    -                "Additional packages are required to use SPEED-Bench. Please run `pip install -r requirements_speed.txt`"
    +                "Additional packages are required to use SPEED-Bench. Please run `pip install -r requirements.txt`"
                 )
             self.data: list[Request] = []
             self.num_samples = num_samples

examples/specdec_bench/specdec_bench/metrics/acceptance_rate.py

Lines changed: 14 additions & 6 deletions

    @@ -55,23 +55,31 @@ def _process_lengths(self, lengths):
                 self.out["Conditional_Acceptance_Rate"][k] = running_len / sum_lengths / prev_ratio
                 prev_ratio = running_len / sum_lengths
                 running_len -= v
    +        # Joint acceptance rate at step k = product of conditional acceptance
    +        # rates at steps 1..k = probability that ≥k tokens are accepted in
    +        # a row. The visualizer renders this as a separate panel.
    +        self.out["Joint_Acceptance_Rate"] = {}
    +        running_joint = 1.0
    +        for k, cond_ar in self.out["Conditional_Acceptance_Rate"].items():
    +            running_joint *= cond_ar
    +            self.out["Joint_Acceptance_Rate"][k] = running_joint

         def process_final(self, text_outputs):
             all_ar = []
             lengths = {}
    -        self.out["Request_AR"] = {}
    +        self.out["Request_AL"] = {}
             self.prompt_ar = dict(sorted(self.prompt_ar.items(), key=lambda x: x[0]))
             for request_id, turns in self.prompt_ar.items():
    -            self.out["Request_AR"][request_id] = {}
    +            self.out["Request_AL"][request_id] = {}
                 for turn_id, turn in turns.items():
                     ar = sum(turn) / len(turn)
    -                self.out["Request_AR"][request_id][turn_id] = ar
    +                self.out["Request_AL"][request_id][turn_id] = ar
                     all_ar.append(ar)
                     self._get_lengths(turn, lengths)
    -                print(request_id, turn_id, self.out["Request_AR"][request_id][turn_id])
    +                print(request_id, turn_id, self.out["Request_AL"][request_id][turn_id])
             average_ar = sum(all_ar) / len(all_ar)
    -        print("Average AR:", average_ar)
    -        self.out["Average_AR"] = average_ar
    +        print("Average AL:", average_ar)
    +        self.out["Average_AL"] = average_ar
             self._process_lengths(lengths)
             self.write()
             self._format_write_output(text_outputs)
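
As a small worked illustration of the two quantities in the hunk above, the sketch below computes the joint acceptance rate as a cumulative product of conditional rates, and the acceptance length as the mean number of tokens per step. The function names and the sample numbers are made up for this example; only the formulas mirror the diff.

```python
def joint_acceptance_rate(conditional):
    """Cumulative product of conditional acceptance rates, keyed by step.

    Relies on the dict iterating in ascending step order, as the diff does.
    """
    joint = {}
    running = 1.0
    for step, rate in conditional.items():
        running *= rate
        joint[step] = running
    return joint


def acceptance_length(tokens_per_step):
    """Mean number of tokens generated per inference step."""
    return sum(tokens_per_step) / len(tokens_per_step)


if __name__ == "__main__":
    # Illustrative values only, not measured results.
    print(joint_acceptance_rate({1: 0.8, 2: 0.5, 3: 0.5}))
    print(acceptance_length([3, 4, 2, 3]))
```

With conditional rates of 0.8, 0.5 and 0.5, the joint rates fall to roughly 0.8, 0.4 and 0.2, which is why the joint curve can never rise from one step to the next.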

examples/specdec_bench/specdec_bench/metrics/mtbench.py

Lines changed: 9 additions & 9 deletions

    @@ -34,29 +34,29 @@ class MTBench(AcceptanceRate):
         def process_final(self, text_outputs):
             i = 0
             lengths = {}
    -        self.out["Request_AR"] = {}
    +        self.out["Request_AL"] = {}
             self.prompt_ar = dict(sorted(self.prompt_ar.items(), key=lambda x: x[0]))
             for request_id, turns in self.prompt_ar.items():
                 turn_1 = turns[0]
                 turn_2 = turns[1]
                 q_id = request_id
                 mtbench_topic = MTBENCH_TOPICS[q_id // 10]
    -            self.out["Request_AR"][request_id] = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
    +            self.out["Request_AL"][request_id] = sum(turn_1 + turn_2) / len(turn_1 + turn_2)
                 self._get_lengths(turn_1, lengths)
                 self._get_lengths(turn_2, lengths)
                 print(mtbench_topic, sum(turn_1 + turn_2) / len(turn_1 + turn_2))
             per_category = [[] for _ in range(len(MTBENCH_TOPICS))]
    -        for q_id, ar in self.out["Request_AR"].items():
    +        for q_id, ar in self.out["Request_AL"].items():
                 per_category[q_id // 10].append(ar)
    -        self.out["Category_AR"] = {}
    +        self.out["Category_AL"] = {}
             for i, category in enumerate(per_category):
                 if len(category) > 0:
                     category_ar = sum(category) / len(category)
    -                self.out["Category_AR"][MTBENCH_TOPICS[i]] = category_ar
    -                print(f"{MTBENCH_TOPICS[i]} Average AR: {category_ar}")
    -        average_ar = sum(self.out["Request_AR"].values()) / len(self.out["Request_AR"])
    -        print("Average AR:", average_ar)
    -        self.out["Average_AR"] = average_ar
    +                self.out["Category_AL"][MTBENCH_TOPICS[i]] = category_ar
    +                print(f"{MTBENCH_TOPICS[i]} Average AL: {category_ar}")
    +        average_ar = sum(self.out["Request_AL"].values()) / len(self.out["Request_AL"])
    +        print("Average AL:", average_ar)
    +        self.out["Average_AL"] = average_ar
             self._process_lengths(lengths)
             self.write()
             self._format_write_output(text_outputs)

examples/specdec_bench/specdec_bench/metrics/specbench.py

Lines changed: 13 additions & 13 deletions

    @@ -44,26 +44,26 @@ def __init__(self, requests):

         def process_final(self, text_outputs):
             lengths = {}
    -        self.out["Request_AR"] = {}
    +        self.out["Request_AL"] = {}
             for request_id, request in enumerate(self.requests):
                 turns = self.prompt_ar[request_id].values()
                 assert len(turns) == len(request.turns), (
                     f"Number of turns {len(turns)} does not match number of turns in request {len(request.turns)}"
                 )
    -            self.out["Request_AR"][request.question_id] = mean(list(chain(*turns)))
    +            self.out["Request_AL"][request.question_id] = mean(list(chain(*turns)))
                 for turn in turns:
                     self._get_lengths(turn, lengths)
    -            print(request.category, self.out["Request_AR"][request.question_id])
    +            print(request.category, self.out["Request_AL"][request.question_id])
             per_category = defaultdict(list)
             for request in self.requests:
    -            per_category[request.category].append(self.out["Request_AR"][request.question_id])
    -        self.out["Category_AR"] = {}
    +            per_category[request.category].append(self.out["Request_AL"][request.question_id])
    +        self.out["Category_AL"] = {}
             for category_name, category_ar in per_category.items():
                 if len(category_ar) > 0:
                     category_ar = mean(category_ar)
    -                self.out["Category_AR"][category_name] = category_ar
    -        average_ar = mean(self.out["Request_AR"].values())
    -        self.out["Average_AR"] = average_ar
    +                self.out["Category_AL"][category_name] = category_ar
    +        average_ar = mean(self.out["Request_AL"].values())
    +        self.out["Average_AL"] = average_ar
             self._process_lengths(lengths)
             self.write()
             self._format_write_output(text_outputs)
    @@ -93,15 +93,15 @@ def _pretty_print_results(self):
                 header_style="bold magenta",
             )
             table.add_column("Category", style="cyan", no_wrap=True)
    -        table.add_column("Average AR", justify="right", style="green")
    +        table.add_column("Average AL", justify="right", style="green")

             # Add category rows
    -        for category_name, category_ar in sorted(self.out["Category_AR"].items()):
    +        for category_name, category_ar in sorted(self.out["Category_AL"].items()):
                 table.add_row(category_name, f"{category_ar:.4f}")

             # Add separator and summary row
             table.add_section()
    -        table.add_row("[bold]Overall Average[/bold]", f"[bold]{self.out['Average_AR']:.4f}[/bold]")
    +        table.add_row("[bold]Overall Average[/bold]", f"[bold]{self.out['Average_AL']:.4f}[/bold]")

             console.print(table)

    @@ -124,8 +124,8 @@ def _create_visualizations(

             df_clean = pd.DataFrame.from_dict(
                 {
    -                "question_id": list(self.out["Request_AR"].keys()),
    -                "acceptance_rate": list(self.out["Request_AR"].values()),
    +                "question_id": list(self.out["Request_AL"].keys()),
    +                "acceptance_rate": list(self.out["Request_AL"].values()),
                     "category": [request.category for request in self.requests],
                     "response_length": [
                         mean([len(c["content"]) for c in messages if c["role"] == "assistant"])
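
The commit message notes that pre-1.0.0 result files carry `*_AR` keys and must be re-run or post-processed before comparison. A minimal post-processing sketch, assuming the metric file is a flat JSON object whose top-level keys are the ones renamed in these hunks, might look like this. The function name `migrate_metrics` is hypothetical, and the sketch does not backfill `Joint_Acceptance_Rate` or touch any version field.

```python
# Old-to-new key mapping taken from the rename table in the commit message.
RENAMES = {
    "Request_AR": "Request_AL",
    "Category_AR": "Category_AL",
    "Average_AR": "Average_AL",
}


def migrate_metrics(old):
    """Return a copy of a pre-1.0.0 metrics dict with *_AR keys renamed.

    Values are left untouched: per the commit message they were always
    acceptance lengths, so only the key names change.
    """
    return {RENAMES.get(key, key): value for key, value in old.items()}
```

Whether renamed files should then be treated as comparable with 1.0.0 runs is a policy question for the visualizer, which the commit message says groups runs by major version.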

examples/specdec_bench/specdec_bench/models/base.py

Lines changed: 9 additions & 0 deletions

    @@ -27,5 +27,14 @@ async def run(self, prompt_ids, sampling_params, request_id, turn_id):
             """
             raise NotImplementedError

    +    def get_serving_config(self):
    +        """Return a JSON-serializable dict describing the engine's effective config.
    +
    +        Captured into configuration.json's `serving_config` for reproducibility.
    +        Subclasses override to surface engine-specific defaults (max_model_len,
    +        kv_cache_dtype, etc.) that don't appear in the CLI args. Default: empty.
    +        """
    +        return {}
    +
         def stop(self):
             pass
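
To show how a subclass might override the new hook, here is a toy sketch. `ToyEngineModel` and its `engine_kwargs` attribute are invented for illustration and do not correspond to the VLLM, SGLANG, or TRTLLM wrappers, whose overrides are not shown in this part of the diff. The `Model` class is a minimal stand-in for the base class above.

```python
import json
from pathlib import Path


class Model:
    """Minimal stand-in for the base class in models/base.py."""

    def get_serving_config(self):
        return {}


class ToyEngineModel(Model):
    """Hypothetical engine wrapper that remembers the kwargs it was built with."""

    def __init__(self, **engine_kwargs):
        self.engine_kwargs = engine_kwargs

    def get_serving_config(self):
        # Round-trip through JSON so the result is guaranteed serializable;
        # default=str turns non-JSON values (for example Path) into strings.
        return json.loads(json.dumps(self.engine_kwargs, default=str))


config = ToyEngineModel(max_model_len=4096, checkpoint=Path("ckpt")).get_serving_config()
```

Because the base method returns `{}`, callers such as `run.py` can invoke `get_serving_config()` on any model without checking whether the subclass implements it.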
