Conversation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces a formal warmup phase to the benchmark execution process, replacing the previous environment variable-based implementation. Key changes include the addition of a WarmupConfig schema, a SaltedDataset wrapper that injects random salts into prompts to prevent KV-cache reuse, and a drain_after configuration to manage phase transitions. The EchoServer and testing fixtures were also updated to support custom request handling for new integration tests. Review feedback identifies missing imports for os and logging in dataset.py and notes that the SaltedDataset constructor signature is incompatible with its base class, which could break inherited class methods.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
arekay-nv
left a comment
Review Council — Multi-AI Code Review
Reviewed by Codex + Claude | Depth: thorough
Found 14 issues across 6 files: 0 critical, 3 high, 4 medium, 7 low.
See inline comments for details. A grouped summary will follow as a separate top-level comment.
```python
    guaranteeing all overlap events have been recorded before we assert.
    """


_DELAY = 0.15  # seconds; long enough for perf to start while warmup is pending
```
[Review Council — Claude] low · testing
`_DELAY = 0.15` is fragile for the overlap-observation tests:

```python
_DELAY = 0.15  # seconds; long enough for perf to start while warmup is pending
```

On a heavily loaded CI runner, the entire warmup batch (5 requests) can plausibly round-trip within 150 ms before the perf phase issues, producing a false "no overlap" and flaking `test_drain_false_overlap_observed` / `test_drain_false_max_concurrent_exceeds_single_phase_size`.

Fix: use a controllable handler that explicitly blocks until the test signals release (e.g. via `asyncio.Event`) instead of a fixed sleep. If sticking with a sleep, increase it to ≥ 1 s.
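A minimal sketch of the Event-gated approach, assuming the test can install a custom handler (the handler name and shape here are illustrative, not the project's actual fixture API):

```python
import asyncio

async def main():
    release = asyncio.Event()
    seen = []

    # Hypothetical stand-in for a custom EchoServer handler: each warmup
    # request records itself, then blocks until the test releases it.
    async def blocking_handler(request_id: str) -> str:
        seen.append(request_id)
        await release.wait()  # deterministic: no fixed sleep needed
        return f"echo:{request_id}"

    # Two "warmup" requests start and park on the event, so the test can
    # observe them in flight before letting them finish.
    tasks = [asyncio.create_task(blocking_handler(f"w{i}")) for i in range(2)]
    await asyncio.sleep(0)  # yield once so both tasks reach release.wait()
    assert len(seen) == 2   # overlap observed, with no timing guesswork
    release.set()
    return await asyncio.gather(*tasks)

print(asyncio.run(main()))
```

Because completion is gated on an explicit signal rather than wall-clock time, the test cannot flake on a slow runner.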
```python
_SALT_RE = re.compile(r"^\[([0-9a-f]{16})\] (.+)$")


_MINIMAL_CLIENT = HTTPClientConfig(
    num_workers=1, warmup_connections=0, max_connections=10
```
[Review Council — Claude] low · testing
All warmup integration tests run with `num_workers=1`:

```python
_MINIMAL_CLIENT = HTTPClientConfig(
    num_workers=1, warmup_connections=0, max_connections=10
)
```

None cover the multi-worker warmup path, where ZMQ ordering between worker processes matters and where salted prompts must be distinct across workers. Suggest adding at least one warmup test with `num_workers >= 2` to exercise the multi-process distribution path the production code uses.
Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough | HEAD:

Found 14 issues across 6 files. Inline comments posted in this review.

Cross-reviewer agreement (boosted confidence): execute.py:365 (warmup RNG seeding) and dataset.py:467 (multimodal salt) were flagged by both Codex and Claude.

🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
Theme summary

The warmup feature lands the structural plumbing well, but several correctness/reproducibility properties the YAML knobs claim to provide are not actually enforced end-to-end:
🤖 Posted by
- Add `random_seed` field to WarmupConfig for deterministic warmup scheduling
- Use `dataclasses.replace` + warmup seed for warmup RuntimeSettings
- Fix SaltedDataset.data to be a property (avoids stale snapshot after inner reload)
- Fix multimodal salting to find first text part at any index (handles image-first prompts)
- Log warning when input_tokens present without prompt (salting not possible)
- Fix ruff-format CI failure in test_async_session.py
- Move inline imports to top of test_benchmark.py (AGENTS.md compliance)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```python
            *prompt[1:],
        ]
        return {**data, "prompt": salted_parts}
    return data  # unsupported prompt type — skip salting
```
[Review Council — Claude] medium · bug
When the prompt is a multimodal list where ALL parts are non-text (e.g., all image_url), the loop finds no text part and falls through to return data with no warning:

```python
if isinstance(prompt, list) and prompt:
    for i, part in enumerate(prompt):
        if isinstance(part, dict) and part.get("type") == "text":
            ...
            return {**data, "prompt": salted_parts}
return data  # unsupported prompt type — skip salting (silent!)
```

Unlike the input_tokens case (which emits a `logger.warning`), this fallback is completely silent. Users who enable `salt=True` for image-only multimodal datasets will not see any indication that KV-cache bypass is not functioning. Add a `logger.warning` consistent with the input_tokens path.
```python
        sd = SaltedDataset(inner)
        assert sd.load_sample(0) == {}


    def test_non_dict_sample_is_returned_as_is(self):
```
[Review Council — Claude] low · testing
`sd.data = inner.data` is redundant now that SaltedDataset.data is a property that delegates to `self._inner.data`. The assignment calls the property setter, which sets `self._inner.data = inner.data` — a no-op since they reference the same list. This won't cause a test failure, but it misleads readers into thinking `SaltedDataset.data` is a plain attribute:

```python
inner.data = ["raw string sample"]  # override with non-dict
sd = SaltedDataset(inner)
sd.data = inner.data  # redundant — sd.data property already returns inner.data
```

Remove the `sd.data = inner.data` line.
Review Council — Multi-AI Code Review

Reviewed by Claude (Codex CLI unavailable) | Depth: thorough | HEAD: `1143d64`

Found 7 new issues across 5 files. Inline comments posted for 4; 3 are out-of-diff or post-fix.

Cross-reviewer note: Issues already addressed in the previous round (unseeded RNGs,

🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
🤖 Posted by
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
arekay-nv
left a comment
Review Council — Multi-AI Code Review (Round 2)
Reviewed by Codex + Claude | Depth: thorough | HEAD: c9e80b0 (was 56258be in round 1)
Re-reviewing after the author's response commits. Most round-1 findings are addressed; this round focuses on new code (global timeout, warmup_random_seed, register=False) and incomplete fixes.
Found 4 new issues across 3 files: 0 critical, 1 high, 1 medium, 2 low. See inline comments — a tiered summary will follow.
```python
        throughout the current phase."""
    Bounded by the global experiment timeout: if the caller schedules a
    loop.call_later that calls stop(), stop() sets _drain_event, unblocking
    this wait without leaving it hung indefinitely."""
```
[Review Council — Claude] medium · error-handling
The new docstring on `_drain_inflight` claims a safety guarantee that only holds in some configurations:

```python
async def _drain_inflight(self, phase_issuer: PhaseIssuer) -> None:
    """Wait for all in-flight responses from this phase to complete.

    Bounded by the global experiment timeout: if the caller schedules a
    loop.call_later that calls stop(), stop() sets _drain_event, unblocking
    this wait without leaving it hung indefinitely."""
    ...
    self._drain_event.clear()
    await self._drain_event.wait()
```

The `loop.call_later` is only scheduled in execute.py:561 if `ctx.rt_settings.max_duration_ms is not None`. `runtime_settings.from_config` maps `max_duration_ms == 0` to `None`, and the schema default is 0 — so the default offline/online CLI run has no global timer, and `_drain_inflight` retains its old unbounded `await self._drain_event.wait()`. A single dropped or hanging request will hang the benchmark forever.
The docstring suggests this case is solved; it isn't.
Suggested fixes:
- Make the docstring conditional ("if a global timeout is configured…") and add a note that the drain is otherwise unbounded, OR
- Add an intrinsic drain timeout (a fixed cap like 30–60 s, or a multiple of observed completion latency) inside `_drain_inflight` itself, so the safety property is a property of the function, not a contract on the caller.
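The intrinsic-timeout option can be sketched with `asyncio.wait_for`; the function signature and the 60 s cap are assumptions, not the project's actual code:

```python
import asyncio

_DRAIN_TIMEOUT_S = 60.0  # assumed cap; the review suggests 30-60 s

async def drain_inflight(drain_event: asyncio.Event,
                         timeout_s: float = _DRAIN_TIMEOUT_S) -> bool:
    """Intrinsically bounded drain: the timeout lives inside the function,
    so safety no longer depends on the caller scheduling a call_later timer."""
    try:
        await asyncio.wait_for(drain_event.wait(), timeout=timeout_s)
        return True   # all in-flight responses completed
    except asyncio.TimeoutError:
        return False  # a hung request can no longer block forever

async def demo() -> tuple:
    done = asyncio.Event()
    # Simulate the last in-flight response landing shortly after drain starts.
    asyncio.get_running_loop().call_later(0.01, done.set)
    ok = await drain_inflight(done)
    # And a phase where nothing ever sets the event: the cap still returns.
    hung = await drain_inflight(asyncio.Event(), timeout_s=0.05)
    return ok, hung

print(asyncio.run(demo()))  # → (True, False)
```

The boolean return lets the caller log a drain timeout distinctly from a clean drain.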
```python
        for _ in range(n_draw)
    ]


    assert order_with == order_without, (
```
[Review Council — Claude] low · testing
`test_performance_sample_order_identical_with_and_without_warmup` does not actually exercise the property it claims to test:

```python
def make_rt():
    return RuntimeSettings(
        ...
        rng_sched=random.Random(99),
        rng_sample_index=random.Random(99),
        ...
    )

ctx_with = self._make_ctx(config_with, make_rt(), simple_dataset)
ctx_without = self._make_ctx(config_without, make_rt(), simple_dataset)
```

`make_rt()` is called twice and produces two separate RuntimeSettings instances with freshly seeded `Random(99)` RNGs. `_build_phases` in the warmup path uses `dataclass_replace`, which substitutes new `Random(seed)` instances on the warmup phase only — it never touches the perf phase's RNGs. So this test only verifies that `_build_phases` doesn't drain the perf RNG, which it never could (it only stores references). It does not verify the regression it claims to guard against: that executing the warmup phase leaves the perf-phase RNG untouched.
A tighter test would either:
- Build phases from a single shared RuntimeSettings, then advance the warmup phase's `rng_sample_index` (e.g., draw N samples from `create_sample_order(warmup_rt)`), and assert that the perf-phase order matches a control with no warmup, OR
- Drop this test as redundant with `test_warmup_phase_uses_independent_rngs` (which already asserts `warmup_rt.rng_sched is not perf_rt.rng_sched`).

The current implementation gives a false sense of coverage for the actual race the docstring describes.
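The property the first option pins down can be illustrated with plain `random.Random` standing in for the project's RuntimeSettings RNGs (`sample_order` is a hypothetical stand-in for `create_sample_order`):

```python
import random

def sample_order(rng: random.Random, n: int) -> list:
    # Stand-in for drawing a sample order from a phase's RNG.
    return [rng.randrange(1000) for _ in range(n)]

perf_rng = random.Random(99)
warmup_rng = random.Random(12345)  # independently seeded warmup RNG

# Control: the perf order when no warmup runs at all.
control = sample_order(random.Random(99), 8)

# Execute the "warmup phase" first; an independent RNG leaves perf untouched.
sample_order(warmup_rng, 5)
assert sample_order(perf_rng, 8) == control

# Counterexample: if both phases shared one RNG, warmup draws would shift
# the perf order — exactly the regression the test should catch.
shared = random.Random(99)
sample_order(shared, 5)
assert sample_order(shared, 8) != control
```

The key difference from the current test: the warmup RNG is actually advanced before the perf order is drawn, so the assertion would fail under RNG sharing.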
Review Council — Multi-AI Code Review (Round 2)

Reviewed by Codex + Claude | Depth: thorough | HEAD:

Re-review after the author's response commits. 4 new issues across 3 files. Inline comments in this review.

Cross-reviewer agreement (boosted confidence): execute.py:570 — global timeout semantic change — was the only finding both Codex and Claude flagged in round 2.

🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
Round 1 follow-up status
Theme summary (round 2)

The biggest new concern is the global wall-clock timer added to bound the warmup drain. The mechanism works, but it silently changes what `max_duration_ms` means. The drain-safety claim in `_drain_inflight`'s docstring only holds when that timer is actually configured.

🤖 Posted by
nv-alicheng
left a comment
Review Council — Multi-AI Code Review (run #2)
Reviewed by: Codex + Claude | Depth: thorough
Found 12 new issues (deduped against the 31+ existing inline comments). Severity: 1 high, 2 medium, 9 low.
See the inline comments for details. A summary table is posted as a separate top-level comment after this review.
```python
from inference_endpoint.openai.openai_types_gen import CreateChatCompletionRequest
from inference_endpoint.utils.logging import setup_logging


RequestHandler = Callable[[web.Request], web.Response | Awaitable[web.Response]]
```
[Review Council — Claude] low · api-contract
`RequestHandler = Callable[[web.Request], web.Response | Awaitable[web.Response]]` doesn't allow `None`, but `_dispatch` (line 110) explicitly relies on `None` to fall through to the default echo handler. A user-provided handler that forgets to return therefore silently passes through to default behavior instead of failing loudly — and the type alias misleads users about what's supported. Either widen the alias to `... | None`, or assert non-None inside `_dispatch` so the error surfaces.
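A sketch of the fail-loudly option, with simplified stand-ins for the aiohttp request/response types (the `dispatch` shape here is illustrative, not the EchoServer's actual `_dispatch`):

```python
from typing import Callable, Optional

# Hypothetical simplified stand-ins for web.Request / web.Response.
Request = dict
Response = str
RequestHandler = Callable[[Request], Optional[Response]]

def dispatch(handler: Optional[RequestHandler], request: Request) -> Response:
    """Assert-non-None variant: a handler that forgets to return now
    raises instead of silently falling back to the default echo."""
    if handler is not None:
        result = handler(request)
        if result is None:
            raise RuntimeError("custom handler returned None")
        return result
    return "default-echo"  # no handler installed: default echo behavior

assert dispatch(None, {}) == "default-echo"
assert dispatch(lambda req: "custom", {}) == "custom"
try:
    dispatch(lambda req: None, {})  # handler forgot to return a Response
except RuntimeError as exc:
    print(exc)  # → custom handler returned None
```

Widening the alias instead would keep the silent fall-through but at least make it visible in the type signature; the assert makes the contract violation unmissable.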
```python
warmup_rt = phases[0].runtime_settings
perf_rt = phases[1].runtime_settings
assert warmup_rt.rng_sched is not perf_rt.rng_sched
assert warmup_rt.rng_sample_index is not perf_rt.rng_sample_index
```
[Review Council — Claude] low · testing
`test_warmup_uses_independent_rng_instances` only checks `is not` identity. This passes even if both warmup RNGs were seeded with the same seed as perf — i.e., distinct objects but identical sequences, which would still cause warmup to consume the same state pattern as perf. Add an assertion comparing the first few drawn values, e.g. `assert [warmup_rt.rng_sched.random() for _ in range(5)] != [perf_rt.rng_sched.random() for _ in range(5)]`, to actually catch the regression the test claims to guard against.
```python
def test_all_salts_unique_across_full_dataset(self, loaded_inner):
    sd = SaltedDataset(loaded_inner)
    prompts = [sd.load_sample(i)["prompt"] for i in range(sd.num_samples())]
    assert len(set(prompts)) == len(prompts)
```
[Review Council — Claude] low · testing
`SaltedDataset.load_sample` has a non-trivial branch at dataset.py:471 that warns when `input_tokens` is present without `prompt`, and a salt-the-prompt branch when both are present (which still leaves `input_tokens` unsalted — see existing comment on dataset.py:467). Neither path is exercised here. Add tests for (a) `{"input_tokens": [...], "prompt": "..."}` (salting still happens but `input_tokens` unchanged — a silent miss for token-based adapters), and (b) `{"input_tokens": [...]}` only (warning fires, data returned unchanged).
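The two missing cases can be sketched against a hypothetical reduction of the branch logic (`salt_tokens_sample` stands in for the real `load_sample`; the real logic lives in dataset.py):

```python
import logging

logger = logging.getLogger("salting-test-sketch")

def salt_tokens_sample(data: dict, salt: str) -> dict:
    """Reduced stand-in for the two load_sample branches under test."""
    if "prompt" in data:
        # Only the prompt is salted; input_tokens passes through unchanged.
        return {**data, "prompt": f"[{salt}] {data['prompt']}"}
    if "input_tokens" in data:
        logger.warning("input_tokens present without prompt: cannot salt")
    return data

# (a) both present: prompt salted, input_tokens silently untouched
out = salt_tokens_sample({"input_tokens": [1, 2], "prompt": "hi"}, "ab")
assert out == {"input_tokens": [1, 2], "prompt": "[ab] hi"}

# (b) input_tokens only: warning fires, data returned unchanged
only = {"input_tokens": [1, 2]}
assert salt_tokens_sample(only, "ab") == only
```

In the real suite, case (b) would additionally capture the warning (e.g. with pytest's `caplog`) to assert it actually fires.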
```python
runtime: RuntimeConfig = Field(default_factory=RuntimeConfig)
load_pattern: LoadPattern = Field(default_factory=LoadPattern)
client: HTTPClientConfig = Field(default_factory=HTTPClientConfig)
warmup: WarmupConfig = Field(default_factory=WarmupConfig)
```
[Review Council — Claude] low · api-contract
Settings.warmup: WarmupConfig has no cyclopts.Parameter annotation, so cyclopts auto-generates only dotted CLI flags like --settings.warmup.enabled=true, --settings.warmup.n-requests=.... Compare with runtime and load_pattern which expose ergonomic shorthands via field-level cyclopts.Parameter(alias=...). As-is, enabling warmup from the CLI is unwieldy. Add cyclopts.Parameter(alias="--warmup-...") aliases on the WarmupConfig fields, or document that warmup is YAML-only.
Review Council — Multi-AI Code Review (run #2)

Reviewed by: Codex + Claude | Depth: thorough

Found 12 new issues across 6 files (deduped against 31 existing inline comments).

🔴 Must Fix (high)

Issues that produce incorrect benchmark results in normal usage.
🟡 Should Fix (medium)

Real issues that trigger under specific conditions, or design flaws that compound.
🔵 Consider (low)

Valid improvements; could be follow-ups.
What we deduped
Notes on Codex coverage this run

Codex returned only 2 findings on this run (both correctness-grade). The first reproduced an existing high-severity finding (input_tokens salting); the second (the
```python
        return cls(df, transforms=transforms, repeats=num_repeats)



class SaltedDataset(Dataset, register=False):
```
@arekay-nv Can you clarify the purpose and usage of this? Conceptually a salted dataset is just a normal dataset, but on `load_sample` it prepends a randomly generated fixed-length salt, right? I'm wondering whether adding a `salt: bool = False` argument to the base class `.load_sample` function is warranted. A lot of the body of this class just remaps to the underlying dataset.
I like that approach. Updated. Thanks!
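The agreed design can be sketched as a flag on the base class's `load_sample`; the class shape below is illustrative, not the project's actual Dataset:

```python
import secrets

class Dataset:
    """Sketch of salting folded into the base class instead of a wrapper."""

    def __init__(self, data: list):
        self.data = data

    def load_sample(self, index: int, salt: bool = False):
        sample = self.data[index]
        if salt and isinstance(sample, dict) and isinstance(sample.get("prompt"), str):
            # Prepend a fixed-length random salt to defeat KV-cache reuse.
            return {**sample, "prompt": f"[{secrets.token_hex(8)}] {sample['prompt']}"}
        return sample

ds = Dataset([{"prompt": "hello"}])
assert ds.load_sample(0) == {"prompt": "hello"}  # default: no salting
salted = ds.load_sample(0, salt=True)["prompt"]
assert salted.endswith("] hello") and salted != "hello"
```

This keeps one code path for all datasets and drops the constructor-signature mismatch the earlier review flagged on the SaltedDataset wrapper.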
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Plugs in warmup specification with salt via config.
- `datasets` as default local folder for cache - causing pre-commit to complain.

Type of change
Related issues
Testing
Checklist