Skip to content

[None][fix] fix bench-moe benchmark bugs#15397

Open
guqiqi wants to merge 5 commits into
NVIDIA:mainfrom
guqiqi:dev-bench-moe
Open

[None][fix] fix bench-moe benchmark bugs#15397
guqiqi wants to merge 5 commits into
NVIDIA:mainfrom
guqiqi:dev-bench-moe

Conversation

@guqiqi

@guqiqi guqiqi commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Description

  • Add timeout result write back for long-term hang
  • Add error result write back in case of kernel error
  • Adjust MoE workload computation for non-DP attention path
  • Align MoE token assignment logic for non-DP attention path
  • Prune TEP/TTP forced-comm candidates at generation time
  • Add DENSEGEMM+EP validation
  • Remove spurious .cpu() in MXFP4 quantize_utils

Summary by CodeRabbit

  • Bug Fixes

    • Fixed token distribution calculation in non-distributed mode to correctly handle single-rank cases.
  • Tests

    • Enhanced candidate execution watchdog with improved timeout handling, logging, and callback support for better failure tracking in benchmarks.
    • Improved consistency of distributed parallel mode configuration propagation across routing and token distribution logic.

@guqiqi guqiqi requested a review from a team as a code owner June 16, 2026 05:45
@guqiqi guqiqi requested a review from mikeiovine June 16, 2026 05:45
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Two orthogonal changes: (1) an enable_dp flag is threaded through the MoE benchmark's routing-builder (_build_per_rank_num_tokens, _per_rank_tokens, _build_routing_plan) and consumed by case_runner, cli, results, and the production ConfigurableMoE.calculate_num_chunks; (2) CandidateWatchdog gains an on_timeout callback and a rank-0 grace delay, and the MPI worker loop pre-writes per-candidate failure placeholders that are overwritten on timeout or replaced on success.

Changes

DP-aware token sharding

Layer / File(s) Summary
Routing builder: enable_dp parameter and token distribution logic
tests/microbenchmarks/bench_moe/routing/builders.py
_build_per_rank_num_tokens gains enable_dp: non-DP replicates the full batch to every rank; DP distributes tokens across ranks. Validation of explicit per_rank_num_tokens sums against num_tokens vs num_tokens * world_size accordingly. _per_rank_tokens and _build_routing_plan signatures updated to pass the flag through.
ConfigurableMoE: assert single-element list in non-DP path
tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
calculate_num_chunks non-DP branch now asserts all_rank_num_tokens has exactly one element and reads num_rows from that element, replacing sum(all_rank_num_tokens).
Benchmark tooling: thread enable_dp through case_runner, cli, and results
tests/microbenchmarks/bench_moe/case_runner.py, tests/microbenchmarks/bench_moe/cli.py, tests/microbenchmarks/bench_moe/results.py
_resolve_layout_and_plan passes enable_dp into _build_routing_plan and _per_rank_tokens; TEP/TTP path sets all_rank_num_tokens = None. _build_worker_header and _make_skipped_run_result derive enable_dp from _resolve_mapping_layout before calling _per_rank_tokens.

Watchdog crash-resume semantics

Layer / File(s) Summary
Terminal constants and CandidateWatchdog on_timeout extension
tests/microbenchmarks/bench_moe/worker.py
WATCHDOG_TIMEOUT_PREFIX and INCOMPLETE_PREFIX constants added. CandidateWatchdog.__init__ gains on_timeout: Optional[Callable[[], None]] and rank0_persist_grace_s: float = 8.0; timeout path invokes the callback (exception-isolated), sleeps non-rank-0 for the grace period, then SIGKILLs.
Candidate loop: pre-write failed placeholder and replace on completion
tests/microbenchmarks/bench_moe/worker.py
Worker pre-writes a per-candidate INCOMPLETE_PREFIX terminal row and checkpoints before execution. on_timeout callback overwrites it with WATCHDOG_TIMEOUT_PREFIX and timeout statuses and re-checkpoints. On normal completion the placeholder is replaced (not appended) with the actual result row.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning The PR description is sparse and lacks required sections from the template. It provides only a bulleted list of changes without explanations, test coverage details, or PR checklist confirmation. Complete the PR description by adding the Description section with detailed explanations, Test Coverage section listing relevant tests, and confirm PR Checklist items. Use CodeRabbit's summary if helpful to expand the content.
Title check ❓ Inconclusive The title '[None][fix] fix bench-moe benchmark bugs' is vague and uses generic terms. It mentions 'bugs' without specifying what bugs are being fixed, making it difficult to understand the actual changes from the title alone. Replace with a more specific title that highlights the main fix, such as '[None][fix] Fix bench-moe timeout handling and non-DP MoE path bugs' to better convey the key changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/microbenchmarks/bench_moe/case_runner.py (1)

556-569: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Narrow the routing-plan catch to expected validation/file exceptions.

This block currently catches every exception and re-labels it as a skipped case, which can hide genuine programming errors in _build_routing_plan and make regressions harder to detect.

🔧 Suggested fix
-        except Exception as exc:
+        except (ValueError, FileNotFoundError, json.JSONDecodeError) as exc:
             reason = f"routing plan error: {type(exc).__name__}: {exc}"
             _maybe_print_rank0(f"[bench_moe] {reason}")
             return _short_circuit(result, "skipped", reason)

As per coding guidelines, “Avoid broad exception handling — catch specific exceptions, not bare except” and “limit the except to the smallest set of errors possible.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/microbenchmarks/bench_moe/case_runner.py` around lines 556 - 569, The
try-except block around the _build_routing_plan call is catching the broad
Exception class, which hides genuine programming errors and makes regressions
harder to detect. Replace the generic except Exception clause with specific
exception types that represent the expected validation or file-related errors
that _build_routing_plan can raise (such as ValueError, FileNotFoundError, or
similar domain-specific exceptions). This ensures that only anticipated errors
are caught and re-labeled as skipped cases, while unexpected errors will
propagate and be caught during testing.

Sources: Coding guidelines, Linters/SAST tools

🧹 Nitpick comments (1)
tests/microbenchmarks/bench_moe/worker.py (1)

253-269: 💤 Low value

Defensive exception handling is appropriate here, but consider a brief inline comment.

The static analysis flags broad except Exception at lines 253, 261, and 268. However, in a watchdog context where SIGKILL must fire regardless of what fails, catching all exceptions is intentional and correct—an uncaught exception would leave the process hung.

Consider adding a brief inline comment (e.g., # Catch-all: watchdog must SIGKILL regardless of bookkeeping failures) to document this design choice and silence future reviewers/linters.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/microbenchmarks/bench_moe/worker.py` around lines 253 - 269, The broad
except Exception blocks at lines 253, 261, and 268 in the watchdog timeout
handling logic lack documentation explaining why catching all exceptions is
intentional. Add brief inline comments above or next to each except Exception
clause (including the outer bare except at line 253 and the inner except blocks
around the _on_timeout callback and stderr operations) to document that this
design is intentional because the watchdog must ensure SIGKILL fires regardless
of any bookkeeping failures. Use a comment like "# Catch-all: watchdog must
SIGKILL regardless of bookkeeping failures" to clarify the design choice for
future reviewers and static analysis tools.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Around line 421-430: The condition checking whether to use the DP path in the
calculate_num_chunks method is overly restrictive. The check `if self.use_dp and
self.comm is not None:` incorrectly assumes that when self.comm is None, the
configuration is non-DP, but DP can also have self.comm as None (in fused-comm
or commless paths). Change the condition to check only `if self.use_dp:` without
the additional self.comm check, so that the padded calculation path is used
whenever DP is enabled regardless of the comm state, and the assertion path is
only used when self.use_dp is False (truly non-DP).

In `@tests/microbenchmarks/bench_moe/results.py`:
- Around line 411-413: The `_resolve_mapping_layout(config, world_size)` call in
the `_make_skipped_run_result` function is unguarded and can raise an exception,
causing the function to fail instead of producing a skipped row as intended.
Wrap the `_resolve_mapping_layout` call in a try-except block to handle any
failures gracefully, using sensible default values for `_enable_dp` if the
layout resolution fails, so that a skipped row result is always returned even
when layout resolution fails.

In `@tests/microbenchmarks/bench_moe/worker.py`:
- Around line 593-595: The file has formatting inconsistencies detected by
ruff-format. Run the ruff formatter locally on the file to automatically apply
the expected formatting changes, then commit the reformatted result. This will
ensure the code matches the project's formatting standards and resolve the CI
failure.

---

Outside diff comments:
In `@tests/microbenchmarks/bench_moe/case_runner.py`:
- Around line 556-569: The try-except block around the _build_routing_plan call
is catching the broad Exception class, which hides genuine programming errors
and makes regressions harder to detect. Replace the generic except Exception
clause with specific exception types that represent the expected validation or
file-related errors that _build_routing_plan can raise (such as ValueError,
FileNotFoundError, or similar domain-specific exceptions). This ensures that
only anticipated errors are caught and re-labeled as skipped cases, while
unexpected errors will propagate and be caught during testing.

---

Nitpick comments:
In `@tests/microbenchmarks/bench_moe/worker.py`:
- Around line 253-269: The broad except Exception blocks at lines 253, 261, and
268 in the watchdog timeout handling logic lack documentation explaining why
catching all exceptions is intentional. Add brief inline comments above or next
to each except Exception clause (including the outer bare except at line 253 and
the inner except blocks around the _on_timeout callback and stderr operations)
to document that this design is intentional because the watchdog must ensure
SIGKILL fires regardless of any bookkeeping failures. Use a comment like "#
Catch-all: watchdog must SIGKILL regardless of bookkeeping failures" to clarify
the design choice for future reviewers and static analysis tools.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f5c26f92-55cf-4363-a5f5-560de83f08c2

📥 Commits

Reviewing files that changed from the base of the PR and between 09449d4 and a3735ed.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
  • tests/microbenchmarks/bench_moe/case_runner.py
  • tests/microbenchmarks/bench_moe/cli.py
  • tests/microbenchmarks/bench_moe/results.py
  • tests/microbenchmarks/bench_moe/routing/builders.py
  • tests/microbenchmarks/bench_moe/worker.py

Comment thread tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
Comment thread tests/microbenchmarks/bench_moe/results.py
Comment thread tests/microbenchmarks/bench_moe/worker.py Outdated
@xxi-nv

xxi-nv commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@xxi-nv xxi-nv self-requested a review June 16, 2026 05:59
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54492 [ run ] triggered by Bot. Commit: a3735ed Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54492 [ run ] completed with state SUCCESS. Commit: a3735ed
/LLM/main/L0_MergeRequest_PR pipeline #43557 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@guqiqi guqiqi force-pushed the dev-bench-moe branch 2 times, most recently from 384f4a9 to de514f0 Compare June 17, 2026 09:05
@guqiqi

guqiqi commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54819 [ run ] triggered by Bot. Commit: de514f0 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54819 [ run ] completed with state SUCCESS. Commit: de514f0
/LLM/main/L0_MergeRequest_PR pipeline #43832 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@guqiqi guqiqi force-pushed the dev-bench-moe branch 2 times, most recently from a80ce94 to f39a42a Compare June 23, 2026 10:30
@guqiqi

guqiqi commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55227 [ run ] triggered by Bot. Commit: f39a42a Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55227 [ run ] completed with state SUCCESS. Commit: f39a42a
/LLM/main/L0_MergeRequest_PR pipeline #44186 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@xxi-nv

xxi-nv commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55645 [ run ] triggered by Bot. Commit: cb6dd3a Link to invocation

@guqiqi

guqiqi commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55658 [ run ] triggered by Bot. Commit: d168fe5 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55645 [ run ] completed with state ABORTED. Commit: cb6dd3a

Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55658 [ run ] completed with state SUCCESS. Commit: d168fe5
/LLM/main/L0_MergeRequest_PR pipeline #44565 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

guqiqi added 2 commits June 29, 2026 22:45
…ation for non-DP attention path, and MoE logic for non-DP attention path

Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>
Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>
guqiqi added 3 commits June 29, 2026 22:45
…tion time; add DENSEGEMM+EP validation; remove spurious .cpu() in MXFP4 quantize_utils

Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>
Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>
Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>
@guqiqi

guqiqi commented Jul 3, 2026

Copy link
Copy Markdown
Contributor Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57358 [ run ] triggered by Bot. Commit: df096b7 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57358 [ run ] completed with state SUCCESS. Commit: df096b7
/LLM/main/L0_MergeRequest_PR pipeline #46109 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@guqiqi

guqiqi commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57547 [ run ] triggered by Bot. Commit: df096b7 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57547 [ run ] completed with state SUCCESS. Commit: df096b7
/LLM/main/L0_MergeRequest_PR pipeline #46275 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@guqiqi

guqiqi commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57560 [ run ] triggered by Bot. Commit: df096b7 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57560 [ run ] completed with state SUCCESS. Commit: df096b7
/LLM/main/L0_MergeRequest_PR pipeline #46288 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants