Conversation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces a formal warmup phase to the benchmark execution process, replacing the previous environment variable-based implementation. Key changes include the addition of a WarmupConfig schema, a SaltedDataset wrapper that injects random salts into prompts to prevent KV-cache reuse, and a drain_after configuration to manage phase transitions. The EchoServer and testing fixtures were also updated to support custom request handling for new integration tests. Review feedback identifies missing imports for os and logging in dataset.py and notes that the SaltedDataset constructor signature is incompatible with its base class, which could break inherited class methods.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
arekay-nv
left a comment
Review Council — Multi-AI Code Review
Reviewed by Codex + Claude | Depth: thorough
Found 14 issues across 6 files: 0 critical, 3 high, 4 medium, 7 low.
See inline comments for details. A grouped summary will follow as a separate top-level comment.
```python
    guaranteeing all overlap events have been recorded before we assert.
    """


_DELAY = 0.15  # seconds; long enough for perf to start while warmup is pending
```
[Review Council — Claude] low · testing
`_DELAY = 0.15` is fragile for the overlap-observation tests:

```python
_DELAY = 0.15  # seconds; long enough for perf to start while warmup is pending
```

On a heavily loaded CI runner, the entire warmup batch (5 requests) can plausibly round-trip within 150 ms before the perf phase issues, producing a false "no overlap" and flaking `test_drain_false_overlap_observed` / `test_drain_false_max_concurrent_exceeds_single_phase_size`.

Fix: use a controllable handler that explicitly blocks until the test signals release (e.g. via `asyncio.Event`) instead of a fixed sleep. If sticking with a sleep, increase it to ≥ 1 s.
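A minimal sketch of the Event-gated approach, assuming the test can install a custom handler (the handler name and shape here are illustrative, not the project's actual fixture API):

```python
import asyncio

async def main():
    release = asyncio.Event()
    seen = []

    # Hypothetical stand-in for a custom EchoServer handler: each warmup
    # request records itself, then blocks until the test releases it.
    async def blocking_handler(request_id: str) -> str:
        seen.append(request_id)
        await release.wait()  # deterministic: no fixed sleep needed
        return f"echo:{request_id}"

    # Two "warmup" requests start and park on the event, so the test can
    # observe them in flight before letting them finish.
    tasks = [asyncio.create_task(blocking_handler(f"w{i}")) for i in range(2)]
    await asyncio.sleep(0)  # yield once so both tasks reach release.wait()
    assert len(seen) == 2   # overlap observed, with no timing guesswork
    release.set()
    return await asyncio.gather(*tasks)

print(asyncio.run(main()))
```

Because completion is gated on an explicit signal rather than wall-clock time, the test cannot flake on a slow runner.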
```python
_SALT_RE = re.compile(r"^\[([0-9a-f]{16})\] (.+)$")


_MINIMAL_CLIENT = HTTPClientConfig(
    num_workers=1, warmup_connections=0, max_connections=10
```
[Review Council — Claude] low · testing
All warmup integration tests run with `num_workers=1`:

```python
_MINIMAL_CLIENT = HTTPClientConfig(
    num_workers=1, warmup_connections=0, max_connections=10
)
```

None cover the multi-worker warmup path, where ZMQ ordering between worker processes matters and where salted prompts must be distinct across workers. Suggest adding at least one warmup test with `num_workers >= 2` to exercise the multi-process distribution path the production code uses.
Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough | HEAD:

Found 14 issues across 6 files. Inline comments posted in this review.

Cross-reviewer agreement (boosted confidence): execute.py:365 (warmup RNG seeding) and dataset.py:467 (multimodal salt) were flagged by both Codex and Claude.

🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
Theme summary

The warmup feature lands the structural plumbing well, but several correctness/reproducibility properties the YAML knobs claim to provide are not actually enforced end-to-end:
🤖 Posted by
- Add `random_seed` field to WarmupConfig for deterministic warmup scheduling
- Use `dataclasses.replace` + warmup seed for warmup RuntimeSettings
- Fix SaltedDataset.data to be a property (avoids stale snapshot after inner reload)
- Fix multimodal salting to find first text part at any index (handles image-first prompts)
- Log warning when input_tokens present without prompt (salting not possible)
- Fix ruff-format CI failure in test_async_session.py
- Move inline imports to top of test_benchmark.py (AGENTS.md compliance)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```python
            *prompt[1:],
        ]
        return {**data, "prompt": salted_parts}
    return data  # unsupported prompt type — skip salting
```
[Review Council — Claude] medium · bug
When the prompt is a multimodal list where ALL parts are non-text (e.g., all image_url), the loop finds no text part and falls through to return data with no warning:

```python
if isinstance(prompt, list) and prompt:
    for i, part in enumerate(prompt):
        if isinstance(part, dict) and part.get("type") == "text":
            ...
            return {**data, "prompt": salted_parts}
return data  # unsupported prompt type — skip salting (silent!)
```

Unlike the input_tokens case (which emits a `logger.warning`), this fallback is completely silent. Users who enable `salt=True` for image-only multimodal datasets will not see any indication that KV-cache bypass is not functioning. Add a `logger.warning` consistent with the input_tokens path.
```python
        sd = SaltedDataset(inner)
        assert sd.load_sample(0) == {}


    def test_non_dict_sample_is_returned_as_is(self):
```
[Review Council — Claude] low · testing
`sd.data = inner.data` is redundant now that SaltedDataset.data is a property that delegates to `self._inner.data`. The assignment calls the property setter, which sets `self._inner.data = inner.data` — a no-op since they reference the same list. This won't cause a test failure, but it misleads readers into thinking `SaltedDataset.data` is a plain attribute:

```python
inner.data = ["raw string sample"]  # override with non-dict
sd = SaltedDataset(inner)
sd.data = inner.data  # redundant — sd.data property already returns inner.data
```

Remove the `sd.data = inner.data` line.
Review Council — Multi-AI Code Review

Reviewed by Claude (Codex CLI unavailable) | Depth: thorough | HEAD: `1143d64`

Found 7 new issues across 5 files. Inline comments posted for 4; 3 are out-of-diff or post-fix.

Cross-reviewer note: Issues already addressed in the previous round (unseeded RNGs,

🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
🤖 Posted by
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
arekay-nv
left a comment
Review Council — Multi-AI Code Review (Round 2)
Reviewed by Codex + Claude | Depth: thorough | HEAD: c9e80b0 (was 56258be in round 1)
Re-reviewing after the author's response commits. Most round-1 findings are addressed; this round focuses on new code (global timeout, warmup_random_seed, register=False) and incomplete fixes.
Found 4 new issues across 3 files: 0 critical, 1 high, 1 medium, 2 low. See inline comments — a tiered summary will follow.
```python
        throughout the current phase."""
    Bounded by the global experiment timeout: if the caller schedules a
    loop.call_later that calls stop(), stop() sets _drain_event, unblocking
    this wait without leaving it hung indefinitely."""
```
[Review Council — Claude] medium · error-handling
The new docstring on `_drain_inflight` claims a safety guarantee that only holds in some configurations:

```python
async def _drain_inflight(self, phase_issuer: PhaseIssuer) -> None:
    """Wait for all in-flight responses from this phase to complete.

    Bounded by the global experiment timeout: if the caller schedules a
    loop.call_later that calls stop(), stop() sets _drain_event, unblocking
    this wait without leaving it hung indefinitely."""
    ...
    self._drain_event.clear()
    await self._drain_event.wait()
```

The `loop.call_later` is only scheduled in execute.py:561 if `ctx.rt_settings.max_duration_ms is not None`. `runtime_settings.from_config` maps `max_duration_ms == 0` to `None`, and the schema default is 0 — so the default offline/online CLI run has no global timer, and `_drain_inflight` retains its old unbounded `await self._drain_event.wait()`. A single dropped or hanging request will hang the benchmark forever.
The docstring suggests this case is solved; it isn't.
Suggested fixes:
- Make the docstring conditional ("if a global timeout is configured…") and add a note that the drain is otherwise unbounded, OR
- Add an intrinsic drain timeout (a fixed cap like 30–60 s, or a multiple of observed completion latency) inside `_drain_inflight` itself, so the safety property is a property of the function, not a contract on the caller.
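The intrinsic-timeout option can be sketched with `asyncio.wait_for`; the function signature and the 60 s cap are assumptions, not the project's actual code:

```python
import asyncio

_DRAIN_TIMEOUT_S = 60.0  # assumed cap; the review suggests 30-60 s

async def drain_inflight(drain_event: asyncio.Event,
                         timeout_s: float = _DRAIN_TIMEOUT_S) -> bool:
    """Intrinsically bounded drain: the timeout lives inside the function,
    so safety no longer depends on the caller scheduling a call_later timer."""
    try:
        await asyncio.wait_for(drain_event.wait(), timeout=timeout_s)
        return True   # all in-flight responses completed
    except asyncio.TimeoutError:
        return False  # a hung request can no longer block forever

async def demo() -> tuple:
    done = asyncio.Event()
    # Simulate the last in-flight response landing shortly after drain starts.
    asyncio.get_running_loop().call_later(0.01, done.set)
    ok = await drain_inflight(done)
    # And a phase where nothing ever sets the event: the cap still returns.
    hung = await drain_inflight(asyncio.Event(), timeout_s=0.05)
    return ok, hung

print(asyncio.run(demo()))  # → (True, False)
```

The boolean return lets the caller log a drain timeout distinctly from a clean drain.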
```python
        for _ in range(n_draw)
    ]


    assert order_with == order_without, (
```
[Review Council — Claude] low · testing
`test_performance_sample_order_identical_with_and_without_warmup` does not actually exercise the property it claims to test:

```python
def make_rt():
    return RuntimeSettings(
        ...
        rng_sched=random.Random(99),
        rng_sample_index=random.Random(99),
        ...
    )

ctx_with = self._make_ctx(config_with, make_rt(), simple_dataset)
ctx_without = self._make_ctx(config_without, make_rt(), simple_dataset)
```

`make_rt()` is called twice and produces two separate RuntimeSettings instances with freshly seeded `Random(99)` RNGs. `_build_phases` in the warmup path uses `dataclass_replace`, which substitutes new `Random(seed)` instances on the warmup phase only — it never touches the perf phase's RNGs. So this test only verifies that `_build_phases` doesn't drain the perf RNG, which it never could (it only stores references). It does not verify the regression it claims to guard against: that executing the warmup phase leaves the perf-phase RNG untouched.
A tighter test would either:
- Build phases from a single shared RuntimeSettings, then advance the warmup phase's `rng_sample_index` (e.g., draw N samples from `create_sample_order(warmup_rt)`), and assert that the perf-phase order matches a control with no warmup, OR
- Drop this test as redundant with `test_warmup_phase_uses_independent_rngs` (which already asserts `warmup_rt.rng_sched is not perf_rt.rng_sched`).

The current implementation gives a false sense of coverage for the actual race the docstring describes.
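The property the first option pins down can be illustrated with plain `random.Random` standing in for the project's RuntimeSettings RNGs (`sample_order` is a hypothetical stand-in for `create_sample_order`):

```python
import random

def sample_order(rng: random.Random, n: int) -> list:
    # Stand-in for drawing a sample order from a phase's RNG.
    return [rng.randrange(1000) for _ in range(n)]

perf_rng = random.Random(99)
warmup_rng = random.Random(12345)  # independently seeded warmup RNG

# Control: the perf order when no warmup runs at all.
control = sample_order(random.Random(99), 8)

# Execute the "warmup phase" first; an independent RNG leaves perf untouched.
sample_order(warmup_rng, 5)
assert sample_order(perf_rng, 8) == control

# Counterexample: if both phases shared one RNG, warmup draws would shift
# the perf order — exactly the regression the test should catch.
shared = random.Random(99)
sample_order(shared, 5)
assert sample_order(shared, 8) != control
```

The key difference from the current test: the warmup RNG is actually advanced before the perf order is drawn, so the assertion would fail under RNG sharing.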
Review Council — Multi-AI Code Review (Round 2)

Reviewed by Codex + Claude | Depth: thorough | HEAD:

Re-review after the author's response commits. 4 new issues across 3 files. Inline comments in this review.

Cross-reviewer agreement (boosted confidence): execute.py:570 — global timeout semantic change — was the only finding both Codex and Claude flagged in round 2.

🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
Round 1 follow-up status
Theme summary (round 2)

The biggest new concern is the global wall-clock timer added to bound the warmup drain. The mechanism works, but it silently changes what `max_duration_ms` means. The drain-safety claim in `_drain_inflight`'s docstring only holds when that timer is actually configured.

🤖 Posted by
nv-alicheng
left a comment
Review Council — Multi-AI Code Review (run #2)
Reviewed by: Codex + Claude | Depth: thorough
Found 12 new issues (deduped against the 31+ existing inline comments). Severity: 1 high, 2 medium, 9 low.
See the inline comments for details. A summary table is posted as a separate top-level comment after this review.
```python
from inference_endpoint.openai.openai_types_gen import CreateChatCompletionRequest
from inference_endpoint.utils.logging import setup_logging


RequestHandler = Callable[[web.Request], web.Response | Awaitable[web.Response]]
```
[Review Council — Claude] low · api-contract
`RequestHandler = Callable[[web.Request], web.Response | Awaitable[web.Response]]` doesn't allow `None`, but `_dispatch` (line 110) explicitly relies on `None` to fall through to the default echo handler. A user-provided handler that forgets to return therefore silently passes through to default behavior instead of failing loudly — and the type alias misleads users about what's supported. Either widen the alias to `... | None`, or assert non-None inside `_dispatch` so the error surfaces.
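A sketch of the fail-loudly option, with simplified stand-ins for the aiohttp request/response types (the `dispatch` shape here is illustrative, not the EchoServer's actual `_dispatch`):

```python
from typing import Callable, Optional

# Hypothetical simplified stand-ins for web.Request / web.Response.
Request = dict
Response = str
RequestHandler = Callable[[Request], Optional[Response]]

def dispatch(handler: Optional[RequestHandler], request: Request) -> Response:
    """Assert-non-None variant: a handler that forgets to return now
    raises instead of silently falling back to the default echo."""
    if handler is not None:
        result = handler(request)
        if result is None:
            raise RuntimeError("custom handler returned None")
        return result
    return "default-echo"  # no handler installed: default echo behavior

assert dispatch(None, {}) == "default-echo"
assert dispatch(lambda req: "custom", {}) == "custom"
try:
    dispatch(lambda req: None, {})  # handler forgot to return a Response
except RuntimeError as exc:
    print(exc)  # → custom handler returned None
```

Widening the alias instead would keep the silent fall-through but at least make it visible in the type signature; the assert makes the contract violation unmissable.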
```python
warmup_rt = phases[0].runtime_settings
perf_rt = phases[1].runtime_settings
assert warmup_rt.rng_sched is not perf_rt.rng_sched
assert warmup_rt.rng_sample_index is not perf_rt.rng_sample_index
```
[Review Council — Claude] low · testing
`test_warmup_uses_independent_rng_instances` only checks `is not` identity. This passes even if both warmup RNGs were seeded with the same seed as perf — i.e., distinct objects but identical sequences, which would still cause warmup to consume the same state pattern as perf. Add an assertion comparing the first few drawn values, e.g. `assert [warmup_rt.rng_sched.random() for _ in range(5)] != [perf_rt.rng_sched.random() for _ in range(5)]`, to actually catch the regression the test claims to guard against.
```python
def test_all_salts_unique_across_full_dataset(self, loaded_inner):
    sd = SaltedDataset(loaded_inner)
    prompts = [sd.load_sample(i)["prompt"] for i in range(sd.num_samples())]
    assert len(set(prompts)) == len(prompts)
```
[Review Council — Claude] low · testing
`SaltedDataset.load_sample` has a non-trivial branch at dataset.py:471 that warns when `input_tokens` is present without `prompt`, and a salt-the-prompt branch when both are present (which still leaves `input_tokens` unsalted — see existing comment on dataset.py:467). Neither path is exercised here. Add tests for (a) `{"input_tokens": [...], "prompt": "..."}` (salting still happens but `input_tokens` unchanged — a silent miss for token-based adapters), and (b) `{"input_tokens": [...]}` only (warning fires, data returned unchanged).
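The two missing cases can be sketched against a hypothetical reduction of the branch logic (`salt_tokens_sample` stands in for the real `load_sample`; the real logic lives in dataset.py):

```python
import logging

logger = logging.getLogger("salting-test-sketch")

def salt_tokens_sample(data: dict, salt: str) -> dict:
    """Reduced stand-in for the two load_sample branches under test."""
    if "prompt" in data:
        # Only the prompt is salted; input_tokens passes through unchanged.
        return {**data, "prompt": f"[{salt}] {data['prompt']}"}
    if "input_tokens" in data:
        logger.warning("input_tokens present without prompt: cannot salt")
    return data

# (a) both present: prompt salted, input_tokens silently untouched
out = salt_tokens_sample({"input_tokens": [1, 2], "prompt": "hi"}, "ab")
assert out == {"input_tokens": [1, 2], "prompt": "[ab] hi"}

# (b) input_tokens only: warning fires, data returned unchanged
only = {"input_tokens": [1, 2]}
assert salt_tokens_sample(only, "ab") == only
```

In the real suite, case (b) would additionally capture the warning (e.g. with pytest's `caplog`) to assert it actually fires.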
```python
runtime: RuntimeConfig = Field(default_factory=RuntimeConfig)
load_pattern: LoadPattern = Field(default_factory=LoadPattern)
client: HTTPClientConfig = Field(default_factory=HTTPClientConfig)
warmup: WarmupConfig = Field(default_factory=WarmupConfig)
```
[Review Council — Claude] low · api-contract
Settings.warmup: WarmupConfig has no cyclopts.Parameter annotation, so cyclopts auto-generates only dotted CLI flags like --settings.warmup.enabled=true, --settings.warmup.n-requests=.... Compare with runtime and load_pattern which expose ergonomic shorthands via field-level cyclopts.Parameter(alias=...). As-is, enabling warmup from the CLI is unwieldy. Add cyclopts.Parameter(alias="--warmup-...") aliases on the WarmupConfig fields, or document that warmup is YAML-only.
Review Council — Multi-AI Code Review (run #2)

Reviewed by: Codex + Claude | Depth: thorough

Found 12 new issues across 6 files (deduped against 31 existing inline comments).

🔴 Must Fix (high)

Issues that produce incorrect benchmark results in normal usage.
🟡 Should Fix (medium)

Real issues that trigger under specific conditions, or design flaws that compound.
🔵 Consider (low)

Valid improvements; could be follow-ups.
What we deduped
Notes on Codex coverage this run

Codex returned only 2 findings on this run (both correctness-grade). The first reproduced an existing high-severity finding (input_tokens salting); the second (the
```python
        return cls(df, transforms=transforms, repeats=num_repeats)



class SaltedDataset(Dataset, register=False):
```
@arekay-nv Can you clarify the purpose and usage of this? Conceptually a salted dataset is just a normal dataset, but on `load_sample` it prepends a randomly generated fixed-length salt, right? I'm wondering whether adding a `salt: bool = False` argument to the base class `.load_sample` function is warranted. A lot of the body of this class just remaps to the underlying dataset.
I like that approach. Updated. Thanks!
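The agreed design can be sketched as a flag on the base class's `load_sample`; the class shape below is illustrative, not the project's actual Dataset:

```python
import secrets

class Dataset:
    """Sketch of salting folded into the base class instead of a wrapper."""

    def __init__(self, data: list):
        self.data = data

    def load_sample(self, index: int, salt: bool = False):
        sample = self.data[index]
        if salt and isinstance(sample, dict) and isinstance(sample.get("prompt"), str):
            # Prepend a fixed-length random salt to defeat KV-cache reuse.
            return {**sample, "prompt": f"[{secrets.token_hex(8)}] {sample['prompt']}"}
        return sample

ds = Dataset([{"prompt": "hello"}])
assert ds.load_sample(0) == {"prompt": "hello"}  # default: no salting
salted = ds.load_sample(0, salt=True)["prompt"]
assert salted.endswith("] hello") and salted != "hello"
```

This keeps one code path for all datasets and drops the constructor-signature mismatch the earlier review flagged on the SaltedDataset wrapper.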
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Plugs in warmup specification with salt via config.
- `datasets` as default local folder for cache - causing pre-commit to complain.

Type of change
Related issues
Testing
Checklist