
feat: Plug in warmup phase#305

Open
arekay-nv wants to merge 7 commits into main from arekay/plugin_warmup_phase

Conversation

@arekay-nv (Collaborator)

Plugs in warmup specification with salt via config.

  • Fixes `datasets` as the default local folder for the cache, which was causing pre-commit to complain.

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
@github-actions Bot commented May 5, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested a review from nvzhihanj May 5, 2026 01:42

@gemini-code-assist Bot left a comment


Code Review

This pull request introduces a formal warmup phase to the benchmark execution process, replacing the previous environment variable-based implementation. Key changes include the addition of a WarmupConfig schema, a SaltedDataset wrapper that injects random salts into prompts to prevent KV-cache reuse, and a drain_after configuration to manage phase transitions. The EchoServer and testing fixtures were also updated to support custom request handling for new integration tests. Review feedback identifies missing imports for os and logging in dataset.py and notes that the SaltedDataset constructor signature is incompatible with its base class, which could break inherited class methods.

Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataset.py Fixed
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Comment thread src/inference_endpoint/dataset_manager/dataset.py Fixed
@arekay-nv (Collaborator, Author) left a comment


Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough

Found 14 issues across 6 files: 0 critical, 3 high, 4 medium, 7 low.

See inline comments for details. A grouped summary will follow as a separate top-level comment.

Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
Comment thread tests/unit/commands/test_benchmark.py Outdated
guaranteeing all overlap events have been recorded before we assert.
"""

_DELAY = 0.15 # seconds; long enough for perf to start while warmup is pending
@arekay-nv (Collaborator, Author):

[Review Council — Claude] low · testing

_DELAY = 0.15 is fragile for the overlap-observation tests:

_DELAY = 0.15  # seconds; long enough for perf to start while warmup is pending

On a heavily-loaded CI runner, the entire warmup batch (5 requests) can plausibly round-trip within 150 ms before the perf phase issues, producing a false "no overlap" and flaking test_drain_false_overlap_observed / test_drain_false_max_concurrent_exceeds_single_phase_size.

Fix: use a controllable handler that explicitly blocks until the test signals release (e.g. via asyncio.Event) instead of a fixed sleep. If sticking with a sleep, increase to ≥1s.
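The suggested Event-gated handler can be sketched as follows (a minimal, self-contained illustration; the handler signature and names are assumptions, not the repository's actual `EchoServer` API):

```python
import asyncio

async def blocking_handler(request_id: str, release: asyncio.Event, log: list[str]) -> None:
    """Stand-in for a custom test handler: holds the request open until the
    test explicitly releases it, so overlap observation needs no timing guesses."""
    log.append(f"start:{request_id}")
    await release.wait()          # block until the test signals release
    log.append(f"end:{request_id}")

async def demo() -> list[str]:
    release, log = asyncio.Event(), []
    # Start a "warmup" request that cannot complete yet.
    warmup = asyncio.create_task(blocking_handler("warmup-0", release, log))
    await asyncio.sleep(0)        # let the warmup request reach its wait()
    log.append("perf-issued")     # the perf phase now provably overlaps warmup
    release.set()                 # deterministic release — no sleep races
    await warmup
    return log

print(asyncio.run(demo()))  # → ['start:warmup-0', 'perf-issued', 'end:warmup-0']
```

Because completion is gated on the event rather than a 150 ms sleep, the overlap assertion cannot flake on a slow CI runner.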

_SALT_RE = re.compile(r"^\[([0-9a-f]{16})\] (.+)$")

_MINIMAL_CLIENT = HTTPClientConfig(
num_workers=1, warmup_connections=0, max_connections=10
@arekay-nv (Collaborator, Author):

[Review Council — Claude] low · testing

All warmup integration tests run with num_workers=1:

_MINIMAL_CLIENT = HTTPClientConfig(
    num_workers=1, warmup_connections=0, max_connections=10
)

None cover the multi-worker warmup path, where ZMQ ordering between worker processes matters and where salted prompts must be distinct across workers. Suggest adding at least one warmup test with num_workers >= 2 to exercise the multi-process distribution path the production code uses.

@arekay-nv (Collaborator, Author)

Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough | HEAD: 56258be

Found 14 issues across 6 files. Inline comments posted in this review.

Cross-reviewer agreement (boosted confidence): execute.py:365 (warmup RNG seeding) and dataset.py:467 (multimodal salt) were flagged by both Codex and Claude.

🔴 Must Fix (high)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | src/inference_endpoint/dataset_manager/dataset.py | 414 | api-contract | Claude | Default `datasets_dir` silently changed `datasets/` → `dataset_cache/` (orthogonal to warmup; risks losing existing caches) |
| 2 | src/inference_endpoint/dataset_manager/dataset.py | 459 | bug | Claude | `SaltedDataset.load_sample()` mutates `prompt` only, leaving `input_tokens` (used by the SGLang adapter) unchanged → KV-cache reuse not actually prevented for tokenized datasets |
| 3 | src/inference_endpoint/commands/benchmark/execute.py | 365 | bug | Both | Warmup `rng_sched`/`rng_sample_index` use bare `random.Random()` (unseeded) → warmup sample order is nondeterministic across runs, breaking reproducibility |

🟡 Should Fix (medium)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | src/inference_endpoint/dataset_manager/dataset.py | 467 | bug | Both | Multimodal salting only fires when `prompt[0]` is text; image-first prompts (a common shape) silently skip salting |
| 2 | src/inference_endpoint/dataset_manager/dataset.py | 444 | bug | Claude | `SaltedDataset.data` snapshots `inner.data` at construction; a later `inner.load(force=True)` desyncs the wrapper from the inner dataset |
| 3 | src/inference_endpoint/commands/benchmark/execute.py | 357 | design | Claude | `warmup_rt` built ad hoc instead of via `dataclasses.replace`; load pattern hardcoded to `MAX_THROUGHPUT` (no way to warm up at the perf-phase QPS/concurrency) |
| 4 | src/inference_endpoint/load_generator/session.py | 273 | api-contract | Claude | `ENABLE_WARMUP=1` env-var escape hatch removed without a deprecation warning; existing scripts now silently no-op |

🔵 Consider (low)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | src/inference_endpoint/load_generator/session.py | 68 | api-contract | Claude | `PhaseConfig.drain_after=True` default flips warmup drain semantics for direct callers (was effectively False) |
| 2 | src/inference_endpoint/load_generator/session.py | 423 | data-integrity | Claude | `drain_after=False` filter only covers `_on_sample_complete`; warmup COMPLETE events still publish to ZMQ → event-log pollution |
| 3 | src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py | 100 | error-handling | Claude | Tokenizer load failure silently downgraded to `nullcontext()`; a user-requested `--tokenizer` produces empty ISL/OSL/TPOT with only a warning |
| 4 | src/inference_endpoint/dataset_manager/dataset.py | 464 | design | Claude | `os.urandom(8)` salt ignores the seeded RNG → salted prompts are nondeterministic even when the user sets `dataloader_random_seed` |
| 5 | tests/unit/commands/test_benchmark.py | 471 | design | Claude | `TestBuildPhases` uses lazy/inline imports throughout (~15 sites) — explicit AGENTS.md violation |
| 6 | tests/integration/commands/test_warmup.py | 292 | testing | Claude | `_DELAY = 0.15` is fragile for the overlap-observation tests — risk of CI flakes; replace with explicit `asyncio.Event` synchronization |
| 7 | tests/integration/commands/test_warmup.py | 65 | testing | Claude | All warmup integration tests use `num_workers=1`; the multi-worker warmup path (ZMQ ordering, salt distinctness) is uncovered |

Theme summary

The warmup feature lands the structural plumbing well, but several correctness/reproducibility properties the YAML knobs claim to provide are not actually enforced end-to-end:

  • Reproducibility — warmup RNGs are unseeded (execute.py:365) and the salt uses os.urandom (dataset.py:464). Two runs of the same config produce different warmup workloads, indirectly affecting perf-phase reproducibility when warmup sizes the cache.
  • Salt completeness — SaltedDataset only handles plain-text prompts. Tokenized datasets (SGLang adapter via input_tokens) and image-first multimodal prompts silently skip salting, defeating warmup.salt=true for those paths.
  • API/back-compat surface — the ENABLE_WARMUP env var was removed without a deprecation warning; the new PhaseConfig.drain_after=True default flips warmup drain semantics for direct callers; datasets_dir default rename couples an orthogonal cache-directory change to this PR.
  • Test coverage — single-worker only; the multi-worker ZMQ ordering and salt-distinctness across worker processes is the riskiest part of the architecture and is currently unexercised. The _DELAY = 0.15 overlap tests are also flake-prone.
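The reproducibility point about `os.urandom` can be addressed by deriving the salt from the already-seeded RNG. A minimal sketch (the helper name and seeding scheme are illustrative assumptions, not the PR's code):

```python
import random

def make_salt(rng: random.Random) -> str:
    """Derive a 16-hex-char salt from a seeded RNG instead of os.urandom,
    so salted prompts are reproducible under a fixed dataloader_random_seed.
    (Illustrative sketch; not the repository's actual implementation.)"""
    return f"{rng.getrandbits(64):016x}"

# Same seed → same salt sequence → reproducible warmup workload across runs.
rng_a, rng_b = random.Random(1234), random.Random(1234)
assert [make_salt(rng_a) for _ in range(3)] == [make_salt(rng_b) for _ in range(3)]

# Distinct draws within one run still differ, preserving the cache-busting intent.
rng = random.Random(1234)
salts = [make_salt(rng) for _ in range(100)]
assert len(set(salts)) == len(salts)
```

This keeps `salt=True` effective against KV-cache prefix reuse while making two runs of the same config produce identical warmup workloads.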

🤖 Posted by /review-council skill.

arekay-nv and others added 2 commits May 4, 2026 20:08
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
- Add `random_seed` field to WarmupConfig for deterministic warmup scheduling
- Use `dataclasses.replace` + warmup seed for warmup RuntimeSettings
- Fix SaltedDataset.data to be a property (avoids stale snapshot after inner reload)
- Fix multimodal salting to find first text part at any index (handles image-first prompts)
- Log warning when input_tokens present without prompt (salting not possible)
- Fix ruff-format CI failure in test_async_session.py
- Move inline imports to top of test_benchmark.py (AGENTS.md compliance)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
*prompt[1:],
]
return {**data, "prompt": salted_parts}
return data # unsupported prompt type — skip salting
@arekay-nv (Collaborator, Author):

[Review Council — Claude] medium · bug

When the prompt is a multimodal list where ALL parts are non-text (e.g., all image_url), the loop finds no text part and falls through to return data with no warning:

if isinstance(prompt, list) and prompt:
    for i, part in enumerate(prompt):
        if isinstance(part, dict) and part.get("type") == "text":
            ...
            return {**data, "prompt": salted_parts}
return data  # unsupported prompt type — skip salting (silent!)

Unlike the input_tokens case (which emits a logger.warning), this fallback is completely silent. Users who enable salt=True for image-only multimodal datasets will not see any indication that KV-cache bypass is not functioning. Add a logger.warning consistent with the input_tokens path.
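The suggested fix might look like the following sketch (function name and exact structure are assumptions; only the warning branch is the point):

```python
import logging

logger = logging.getLogger(__name__)

def salt_prompt(data: dict, salt: str) -> dict:
    """Illustrative sketch (not the PR's exact code): salt the first text part
    of a multimodal prompt, and warn instead of silently falling through when
    no text part exists."""
    prompt = data.get("prompt")
    if isinstance(prompt, list) and prompt:
        for i, part in enumerate(prompt):
            if isinstance(part, dict) and part.get("type") == "text":
                salted = {**part, "text": f"[{salt}] {part['text']}"}
                return {**data, "prompt": [*prompt[:i], salted, *prompt[i + 1:]]}
        # No text part at all (e.g. image-only prompt): make the miss visible,
        # mirroring the input_tokens-without-prompt warning path.
        logger.warning("salt=True but prompt has no text part; skipping salting")
    return data

image_only = {"prompt": [{"type": "image_url", "image_url": "img.png"}]}
assert salt_prompt(image_only, "deadbeefdeadbeef") == image_only  # unchanged, but warned
```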

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
sd = SaltedDataset(inner)
assert sd.load_sample(0) == {}

def test_non_dict_sample_is_returned_as_is(self):
@arekay-nv (Collaborator, Author):

[Review Council — Claude] low · testing

sd.data = inner.data is redundant now that SaltedDataset.data is a property that delegates to self._inner.data. The assignment calls the property setter which sets self._inner.data = inner.data — a no-op since they reference the same list. This won't cause a test failure, but it misleads readers into thinking SaltedDataset.data is a plain attribute:

inner.data = ["raw string sample"]  # override with non-dict
sd = SaltedDataset(inner)
sd.data = inner.data  # redundant — sd.data property already returns inner.data

Remove the sd.data = inner.data line.

@arekay-nv (Collaborator, Author) left a comment

test


@arekay-nv (Collaborator, Author)

Review Council — Multi-AI Code Review

Reviewed by Claude (Codex CLI unavailable) | Depth: thorough | HEAD: `1143d64`

Found 7 new issues across 5 files. Inline comments posted for 4; 3 are out-of-diff or post-fix.

Cross-reviewer note: Issues already addressed in the previous round (unseeded RNGs, SaltedDataset.data snapshot, multimodal salting at index 0, inline imports, ruff-format) are not re-reported.

🔴 Must Fix (high)

| # | File | Line | Category | Summary |
|---|------|------|----------|---------|
| 1 | commands/benchmark/execute.py | 365 | design | Inline ↑ — after the RNG-seed fix, `rng_sched` and `rng_sample_index` are both seeded with the same `random_seed` value — identical pseudo-random sequences for scheduler and sample selection |

🟡 Should Fix (medium)

| # | File | Line | Category | Summary |
|---|------|------|----------|---------|
| 2 | dataset_manager/dataset.py | 434 | design | Inline ↑ — `SaltedDataset` is registered in `Dataset.PREDEFINED` under `"SaltedDataset"` via `__init_subclass__`; instantiating via the registry raises `TypeError` (incompatible constructor). Fix: `class SaltedDataset(Dataset, dataset_id=None)` |
| 3 | dataset_manager/dataset.py | 478 | bug | Inline ↑ — silent no-salt fallback for image-only multimodal prompts (all parts are `image_url`, no text part exists) — no `logger.warning` emitted, `salt=True` silently no-ops |
| 4 | load_generator/session.py | 366 | bug | `_drain_inflight` awaits `self._drain_event.wait()` with no timeout. If a warmup request is dropped (network error, server restart), the drain hangs indefinitely and the benchmark never starts; only SIGINT unblocks it. |

🔵 Consider (low)

| # | File | Line | Category | Summary |
|---|------|------|----------|---------|
| 5 | tests/unit/dataset_manager/test_salted_dataset.py | 143 | testing | Inline ↑ — `sd.data = inner.data` is redundant now that `data` is a property delegating to `_inner.data` |
| 6 | config/schema.py | 413 | api-contract | `random_seed: int = Field(0, ...)` default is implicitly deterministic; `Optional[int] = None` with "use a fresh seed when None" semantics would better match the `salt=True` cache-busting intent |
| 7 | config/schema.py | 395 | api-contract | `WarmupConfig` has no `@cyclopts.Parameter` annotations — warmup is YAML-only with no CLI flags. Undocumented limitation; users testing via the CLI will find warmup silently inactive. |
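The Optional-seed semantics suggested in item 6 could look like this sketch (a plain dataclass stands in for the real pydantic `WarmupConfig`; field names other than `random_seed` are assumptions):

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class WarmupConfigSketch:
    """Hypothetical stand-in for WarmupConfig: random_seed=None means
    "draw a fresh seed each run" (matching salt=True's cache-busting intent),
    while an int pins the run deterministically."""
    enabled: bool = False
    random_seed: Optional[int] = None

def effective_seed(cfg: WarmupConfigSketch) -> int:
    """Resolve the seed at execute time; a real implementation would log the
    drawn value so a nondeterministic run can still be reproduced afterwards."""
    if cfg.random_seed is not None:
        return cfg.random_seed
    return int.from_bytes(os.urandom(4), "big")  # fresh seed per run

assert effective_seed(WarmupConfigSketch(random_seed=42)) == 42
assert 0 <= effective_seed(WarmupConfigSketch()) < 2**32
```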

🤖 Posted by /review-council skill.

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Comment thread src/inference_endpoint/dataset_manager/dataset.py Fixed
@arekay-nv (Collaborator, Author) left a comment

Review Council — Multi-AI Code Review (Round 2)

Reviewed by Codex + Claude | Depth: thorough | HEAD: c9e80b0 (was 56258be in round 1)

Re-reviewing after the author's response commits. Most round-1 findings are addressed; this round focuses on new code (global timeout, warmup_random_seed, register=False) and incomplete fixes.

Found 4 new issues across 3 files: 0 critical, 1 high, 1 medium, 2 low. See inline comments — a tiered summary will follow.

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
throughout the current phase."""
Bounded by the global experiment timeout: if the caller schedules a
loop.call_later that calls stop(), stop() sets _drain_event, unblocking
this wait without leaving it hung indefinitely."""
@arekay-nv (Collaborator, Author):

[Review Council — Claude] medium · error-handling

The new docstring on _drain_inflight claims a safety guarantee that only holds in some configurations:

async def _drain_inflight(self, phase_issuer: PhaseIssuer) -> None:
    """Wait for all in-flight responses from this phase to complete.

    Bounded by the global experiment timeout: if the caller schedules a
    loop.call_later that calls stop(), stop() sets _drain_event, unblocking
    this wait without leaving it hung indefinitely."""
    ...
    self._drain_event.clear()
    await self._drain_event.wait()

The loop.call_later is only scheduled in execute.py:561 if ctx.rt_settings.max_duration_ms is not None. runtime_settings.from_config maps max_duration_ms == 0 to None, and the schema default is 0 — so the default offline/online CLI run has no global timer, and _drain_inflight retains its old unbounded await self._drain_event.wait(). A single dropped or hanging request will hang the benchmark forever.

The docstring suggests this case is solved; it isn't.

Suggested fixes:

  • Make the docstring conditional ("if a global timeout is configured…") and add a note that the drain is otherwise unbounded, OR
  • Add an intrinsic drain timeout (a fixed cap like 30–60s, or a multiple of observed completion latency) inside _drain_inflight itself so the safety property is a property of the function, not a contract on the caller.
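The second option (an intrinsic drain timeout) can be sketched as below; the function name, return convention, and the concrete timeout are illustrative assumptions, not the PR's code:

```python
import asyncio

async def drain_inflight_sketch(drain_event: asyncio.Event,
                                timeout_s: float = 30.0) -> bool:
    """Bound the drain inside the function itself, so the safety property is
    intrinsic rather than a contract on the caller scheduling a global timer.
    Returns True if the phase drained cleanly, False if we gave up on a hung
    request (a real implementation would log which requests were abandoned)."""
    try:
        await asyncio.wait_for(drain_event.wait(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False  # proceed rather than hang the benchmark forever

async def demo() -> tuple[bool, bool]:
    ev = asyncio.Event()
    hung = await drain_inflight_sketch(ev, timeout_s=0.05)   # nobody sets it
    ev.set()
    clean = await drain_inflight_sketch(ev, timeout_s=0.05)  # already drained
    return hung, clean

print(asyncio.run(demo()))  # → (False, True)
```

With this shape, the docstring's "bounded" claim holds for every configuration, including the default `max_duration_ms=0` case with no global timer.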

Comment thread src/inference_endpoint/commands/benchmark/execute.py
for _ in range(n_draw)
]

assert order_with == order_without, (
@arekay-nv (Collaborator, Author):

[Review Council — Claude] low · testing

test_performance_sample_order_identical_with_and_without_warmup does not actually exercise the property it claims to test:

def make_rt():
    return RuntimeSettings(
        ...
        rng_sched=random.Random(99),
        rng_sample_index=random.Random(99),
        ...
    )

ctx_with = self._make_ctx(config_with, make_rt(), simple_dataset)
ctx_without = self._make_ctx(config_without, make_rt(), simple_dataset)

make_rt() is called twice and produces two separate RuntimeSettings instances with freshly-seeded Random(99) RNGs. _build_phases in the warmup path uses dataclass_replace which substitutes new Random(seed) instances on the warmup phase only — it never touches the perf phase's RNGs. So this test only verifies that _build_phases doesn't drain the perf RNG, which it never could (it only stores references). It does not verify the regression it claims to guard against: that executing the warmup phase leaves the perf-phase RNG untouched.

A tighter test would either:

  1. Build phases from a single shared RuntimeSettings, then advance the warmup phase's rng_sample_index (e.g., draw N samples from create_sample_order(warmup_rt)), and assert that the perf-phase order matches a control with no warmup, OR
  2. Drop this test as redundant with test_warmup_phase_uses_independent_rngs (which already asserts warmup_rt.rng_sched is not perf_rt.rng_sched). The current implementation gives a false sense of coverage for the actual race the docstring describes.

@arekay-nv (Collaborator, Author)

Review Council — Multi-AI Code Review (Round 2)

Reviewed by Codex + Claude | Depth: thorough | HEAD: c9e80b0 (was 56258be in round 1)

Re-review after the author's response commits. 4 new issues across 3 files. Inline comments in this review.

Cross-reviewer agreement (boosted confidence): execute.py:570 — global timeout semantic change — was the only finding both Codex and Claude flagged in round 2.

🔴 Must Fix (high)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | src/inference_endpoint/commands/benchmark/execute.py | 570 | api-contract | Both | Global timer makes warmup eat into the per-phase `max_duration_ms` budget; `_make_stop_check` still uses per-phase semantics → two clocks on the same configured value; accuracy phases configured for unbounded duration can be truncated |

🟡 Should Fix (medium)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | src/inference_endpoint/load_generator/session.py | 361 | error-handling | Claude | `_drain_inflight` docstring claims a safety guarantee that only holds when `max_duration_ms` is not None; the default offline/online run still has an unbounded drain on a hung request |

🔵 Consider (low)

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | src/inference_endpoint/commands/benchmark/execute.py | 568 | concurrency | Claude | `_on_global_timeout` has no `_done` guard; can fire during the cleanup gap between `session.run()` returning and `global_timeout_handle.cancel()` running |
| 2 | tests/unit/commands/test_benchmark.py | 775 | testing | Claude | `test_performance_sample_order_identical_with_and_without_warmup` builds two separate RuntimeSettings with fresh seeded RNGs — it doesn't actually exercise warmup perturbing the perf RNG; gives a false sense of coverage |

Round 1 follow-up status

  • ✅ Fixed — execute.py:365 — warmup RNGs unseeded — now random.Random(warmup_random_seed)
  • ✅ Fixed — dataset.py:467 — multimodal salt only first text — now scans for first text part at any index
  • ✅ Fixed — dataset.py:444 — SaltedDataset.data snapshot — now @property delegating to _inner.data
  • ✅ Fixed — test_benchmark.py:471 — lazy imports — imports moved to top
  • ✅ Fixed — dataset.py:434 — registry pollution — now class SaltedDataset(Dataset, register=False)
  • 🟡 Partial — dataset.py:459 — input_tokens not salted — warning added for input_tokens-without-prompt; common case (both present) still silently drops salt
  • 🟡 Partial — execute.py:357 — ad-hoc warmup_rt — now uses dataclass_replace ✅ but LoadPattern(MAX_THROUGHPUT) is still hardcoded
  • ⏸️ Unaddressed — dataset.py:414 — `datasets/` → `dataset_cache/` default — still a silent breaking default change
  • ⏸️ Unaddressed — session.py:273 — ENABLE_WARMUP env-var removed — no deprecation warning
  • ⏸️ Unaddressed — session.py:68 — drain_after=True default flips warmup — direct callers get opposite semantics
  • ⏸️ Unaddressed — session.py:423 — late warmup COMPLETE events still publish to ZMQ
  • ⏸️ Unaddressed — metrics_aggregator/__main__.py:100 — tokenizer fail silently downgrades
  • ⏸️ Unaddressed — dataset.py:464 — os.urandom salt ignores seeded RNG
  • ⏸️ Unaddressed — test_warmup.py:292 — _DELAY=0.15 flake risk
  • ⏸️ Unaddressed — test_warmup.py:65 — no multi-worker tests

Theme summary (round 2)

The biggest new concern is the global wall-clock timer added to bound the warmup drain. The mechanism works, but it silently changes what max_duration_ms means for any user who enables warmup (warmup time now eats into the perf-phase budget, and accuracy phases configured for None get truncated by the global cap). Either scope the timer to the perf phase, expose a separate warmup.max_duration_ms, or rename the field — but don't double-mean the same configured value.

The drain-safety claim in _drain_inflight is also overstated: with the schema default (max_duration_ms=0 → None), no global timer is scheduled and the drain is unbounded — a single hung request can still hang the benchmark.

🤖 Posted by /review-council skill.

@arekay-nv arekay-nv marked this pull request as ready for review May 7, 2026 16:38
@arekay-nv arekay-nv requested review from a team, nv-alicheng and viraatc May 7, 2026 16:38
@nv-alicheng (Collaborator) left a comment


Review Council — Multi-AI Code Review (run #2)

Reviewed by: Codex + Claude | Depth: thorough

Found 12 new issues (deduped against the 31+ existing inline comments). Severity: 1 high, 2 medium, 9 low.

See the inline comments for details. A summary table is posted as a separate top-level comment after this review.

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py Outdated
from inference_endpoint.openai.openai_types_gen import CreateChatCompletionRequest
from inference_endpoint.utils.logging import setup_logging

RequestHandler = Callable[[web.Request], web.Response | Awaitable[web.Response]]
@nv-alicheng (Collaborator):

[Review Council — Claude] low · api-contract

RequestHandler = Callable[[web.Request], web.Response | Awaitable[web.Response]] doesn't allow None, but _dispatch (line 110) explicitly relies on None to fall through to the default echo handler. A user-provided handler that forgets to return therefore silently passes through default behavior instead of failing loudly — and the type alias misleads them about what's supported. Either widen the alias to ... | None, or assert non-None inside _dispatch so the error surfaces.
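A sketch of the assert-non-None option, with simplified stand-in types (the real code uses aiohttp's `web.Request`/`web.Response`; the names here are illustrative):

```python
from typing import Callable, Optional

# Hypothetical stand-ins for aiohttp's web.Request / web.Response.
Request = dict
Response = str

RequestHandler = Callable[[Request], Response]

def dispatch(handler: Optional[RequestHandler], request: Request) -> Response:
    """Sketch of _dispatch: fail loudly when a custom handler forgets to
    return, instead of silently falling through to the default echo."""
    if handler is not None:
        result = handler(request)
        assert result is not None, "custom handler returned None (missing return?)"
        return result
    return f"echo:{request['body']}"  # default echo behavior

assert dispatch(None, {"body": "hi"}) == "echo:hi"        # no handler → echo
assert dispatch(lambda r: "custom", {"body": "hi"}) == "custom"
```

The alternative (widening the alias to `... | None`) is equally valid; the point is that the type alias and the runtime fall-through behavior must agree.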

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
warmup_rt = phases[0].runtime_settings
perf_rt = phases[1].runtime_settings
assert warmup_rt.rng_sched is not perf_rt.rng_sched
assert warmup_rt.rng_sample_index is not perf_rt.rng_sample_index
@nv-alicheng (Collaborator):

[Review Council — Claude] low · testing

test_warmup_uses_independent_rng_instances only checks is not identity. This passes even if both warmup RNGs were seeded with the same seed as perf — i.e. distinct objects but identical sequences, which would still cause warmup to consume the same state pattern as perf. Add an assertion comparing the first few drawn values, e.g. assert [warmup_rt.rng_sched.random() for _ in range(5)] != [perf_rt.rng_sched.random() for _ in range(5)], to actually catch the regression the test claims to guard.
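The weakness of the identity-only check, and the suggested sequence assertion, can be shown in miniature:

```python
import random

# Identity alone is too weak: distinct objects can be seeded identically
# and emit the same sequence.
warmup_rng = random.Random(42)
perf_rng = random.Random(42)
assert warmup_rng is not perf_rng  # passes — yet they are twins
assert [warmup_rng.random() for _ in range(5)] == [perf_rng.random() for _ in range(5)]

# The stronger assertion catches this: with properly independent seeds,
# the first few draws must differ.
warmup_rng = random.Random(42)
perf_rng = random.Random(43)
assert [warmup_rng.random() for _ in range(5)] != [perf_rng.random() for _ in range(5)]
```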

def test_all_salts_unique_across_full_dataset(self, loaded_inner):
sd = SaltedDataset(loaded_inner)
prompts = [sd.load_sample(i)["prompt"] for i in range(sd.num_samples())]
assert len(set(prompts)) == len(prompts)
@nv-alicheng (Collaborator):
The reason will be displayed to describe this comment to others. Learn more.

[Review Council — Claude] low · testing

SaltedDataset.load_sample has a non-trivial branch at dataset.py:471 that warns when input_tokens is present without prompt, and a salt-the-prompt branch when both are present (which still leaves input_tokens unsalted — see existing comment on dataset.py:467). Neither path is exercised here. Add tests for (a) {"input_tokens": [...], "prompt": "..."} (salting still happens but input_tokens unchanged — silent miss for token-based adapters), and (b) {"input_tokens": [...]} only (warning fires, data returned unchanged).
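The two missing test cases can be sketched against a toy stand-in for `SaltedDataset.load_sample` (the stand-in's names and exact behavior are assumptions mirroring the comment, not the PR's code):

```python
import logging

logger = logging.getLogger("salted_dataset_sketch")

def load_sample_sketch(data: dict, salt: str) -> dict:
    """Toy stand-in covering the two input_tokens branches the comment
    says are untested."""
    if "prompt" in data:
        # Salting touches only the prompt; input_tokens (if present) stay
        # unsalted — the silent miss for token-based adapters.
        return {**data, "prompt": f"[{salt}] {data['prompt']}"}
    if "input_tokens" in data:
        logger.warning("input_tokens present without prompt; cannot salt")
    return data

# (a) both present: prompt salted, token IDs untouched
both = load_sample_sketch({"prompt": "hi", "input_tokens": [1, 2]}, "ab" * 8)
assert both["prompt"].startswith("[") and both["input_tokens"] == [1, 2]

# (b) tokens only: returned unchanged (warning fires)
tokens_only = {"input_tokens": [1, 2]}
assert load_sample_sketch(tokens_only, "ab" * 8) == tokens_only
```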

Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated
runtime: RuntimeConfig = Field(default_factory=RuntimeConfig)
load_pattern: LoadPattern = Field(default_factory=LoadPattern)
client: HTTPClientConfig = Field(default_factory=HTTPClientConfig)
warmup: WarmupConfig = Field(default_factory=WarmupConfig)
@nv-alicheng (Collaborator):

[Review Council — Claude] low · api-contract

Settings.warmup: WarmupConfig has no cyclopts.Parameter annotation, so cyclopts auto-generates only dotted CLI flags like --settings.warmup.enabled=true, --settings.warmup.n-requests=.... Compare with runtime and load_pattern which expose ergonomic shorthands via field-level cyclopts.Parameter(alias=...). As-is, enabling warmup from the CLI is unwieldy. Add cyclopts.Parameter(alias="--warmup-...") aliases on the WarmupConfig fields, or document that warmup is YAML-only.

Comment thread src/inference_endpoint/commands/benchmark/execute.py
@nv-alicheng (Collaborator)

Review Council — Multi-AI Code Review (run #2)

Reviewed by: Codex + Claude | Depth: thorough

Found 12 new issues across 6 files (deduped against 31 existing inline comments).

🔴 Must Fix (high)

Issues that produce incorrect benchmark results in normal usage.

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | src/inference_endpoint/commands/benchmark/execute.py | 557 | bug | Codex | Global `max_duration_ms` timer covers warmup + perf + accuracy combined; slow warmup eats into the perf budget. Silent semantic shift from the previous "perf-only" meaning. |

🟡 Should Fix (medium)

Real issues that trigger under specific conditions or design flaws that compound.

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 2 | src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py | 98 | bug | Claude | `TokenizePool` partial-init failure leaks the executor's worker threads on retry. |
| 3 | src/inference_endpoint/commands/benchmark/execute.py | 367 | design | Claude | Warmup hardcoded to `MAX_THROUGHPUT` regardless of perf load pattern — surprising for online (Poisson/concurrency) runs, since warmup exercises a different server code path than the steady-state pattern it is meant to warm up for. |

🔵 Consider (low)

Valid improvements; could be follow-ups.

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 4 | src/inference_endpoint/testing/echo_server.py | 36 | api-contract | Claude | `RequestHandler` type alias doesn't allow None, but `_dispatch` relies on None for fall-through. Either widen the type or assert non-None. |
| 5 | src/inference_endpoint/dataset_manager/dataset.py | 465 | design | Claude | `SaltedDataset.load()` is a no-op — an unloaded inner dataset causes an opaque `AssertionError` later. Either delegate or raise in `__init__`. |
| 6 | src/inference_endpoint/dataset_manager/dataset.py | 480 | data-integrity | Claude | Salt prefix `"[{salt}] "` adds ~5 tokens of overhead — distorts the ISL distribution between warmup and perf, especially for short prompts. |
| 7 | src/inference_endpoint/config/schema.py | 428 | api-contract | Claude | `WarmupConfig` lacks `cyclopts.Parameter` aliases — only verbose `--settings.warmup.*` flags, unlike sibling `runtime`/`load_pattern`. |
| 8 | src/inference_endpoint/commands/benchmark/execute.py | 360 | bug | Claude | `min_duration_ms=0` + `n_requests=None` falls back to "issue every dataset sample at MAX_THROUGHPUT" — undocumented expansion of warmup's intent. |
| 9 | tests/unit/commands/test_benchmark.py | 411 | testing | Claude | `test_defaults` doesn't assert the `warmup_random_seed=42` default; same gap in `test_all_flags_enabled` and `test_yaml_roundtrip`. |
| 10 | tests/unit/commands/test_benchmark.py | 721 | testing | Claude | `test_warmup_uses_independent_rng_instances` only checks `is not` identity; passes even if both RNGs were seeded identically. |
| 11 | tests/unit/dataset_manager/test_salted_dataset.py | 109 | testing | Claude | `test_salt_unique_across_different_indices` compares already-different prompts; passes even if salting is broken. |
| 12 | tests/unit/dataset_manager/test_salted_dataset.py | 218 | testing | Claude | No tests exercise the `input_tokens`-present branches of `SaltedDataset.load_sample` (the ones that produce silent token-ID misses for SGLang). |

What we deduped

  • Codex P2 (SaltedDataset doesn't salt input_tokens) — already covered by existing comment on dataset.py:467 ([Review Council — Claude] high · bug). Dropped.
  • Claude test_warmup.py:433 (max_concurrent > 5 flakiness) — already covered by existing comment on test_warmup.py:292 ([Review Council — Claude] low · testing), which names that exact test. Dropped.

Notes on Codex coverage this run

Codex returned only 2 findings on this run (both correctness-grade). The first reproduced an existing high-severity finding (input_tokens salting); the second (the max_duration_ms timing bug) is genuinely new and is the most impactful issue in this review. The remaining coverage came from Claude.

return cls(df, transforms=transforms, repeats=num_repeats)


class SaltedDataset(Dataset, register=False):
@nv-alicheng (Collaborator):

@arekay-nv Can you clarify the purpose and usage of this? Conceptually a salted dataset is just a normal dataset, but on load_sample, it prepends a randomly generated fixed length salt right? I'm wondering if adding a salt: bool = False argument to the base class .load_sample function is warranted. A lot of the body of this class is just remaps to the underlying dataset.

@arekay-nv (Collaborator, Author):

I like that approach. Updated. Thanks!
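The agreed refactor — salting as a flag on the base class's `load_sample` rather than a wrapper class — might look like this sketch (a minimal stand-in for the base `Dataset`; names and the salt format are illustrative assumptions):

```python
import secrets
from typing import Any

class DatasetSketch:
    """Minimal stand-in for the base Dataset class, illustrating the refactor:
    salting becomes a load_sample flag instead of a SaltedDataset wrapper."""

    def __init__(self, samples: list[dict[str, Any]]):
        self.data = samples

    def load_sample(self, index: int, *, salt: bool = False) -> dict[str, Any]:
        sample = self.data[index]
        if salt and isinstance(sample.get("prompt"), str):
            # A fixed-length random hex prefix defeats KV-cache prefix reuse.
            prefix = secrets.token_hex(8)  # 16 hex chars
            return {**sample, "prompt": f"[{prefix}] {sample['prompt']}"}
        return sample

ds = DatasetSketch([{"prompt": "hello"}])
assert ds.load_sample(0) == {"prompt": "hello"}          # default: untouched
assert ds.load_sample(0, salt=True)["prompt"].endswith("hello")
```

This removes the delegation boilerplate the wrapper class needed (`data` property, `load`, registry exclusion) since the salting logic lives next to the data it modifies.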

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Comment thread src/inference_endpoint/dataset_manager/dataset.py Fixed
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>