[None][fix] fix bench-moe benchmark bugs by guqiqi · Pull Request #15397 · NVIDIA/TensorRT-LLM

guqiqi · 2026-06-16T05:45:30Z

Description

Add timeout result write back for long-term hang
Add error result write back in case of kernel error
Adjust MoE workload computation for non-DP attention path
Align MoE token assignment logic for non-DP attention path
Prune TEP/TTP forced-comm candidates at generation time
Add DENSEGEMM+EP validation
Remove spurious .cpu() in MXFP4 quantize_utils

Summary by CodeRabbit

Bug Fixes
- Fixed token distribution calculation in non-distributed mode to correctly handle single-rank cases.
Tests
- Enhanced candidate execution watchdog with improved timeout handling, logging, and callback support for better failure tracking in benchmarks.
- Improved consistency of distributed parallel mode configuration propagation across routing and token distribution logic.

coderabbitai · 2026-06-16T05:47:11Z

📝 Walkthrough

Walkthrough

Two orthogonal changes: (1) an enable_dp flag is threaded through the MoE benchmark's routing-builder (_build_per_rank_num_tokens, _per_rank_tokens, _build_routing_plan) and consumed by case_runner, cli, results, and the production ConfigurableMoE.calculate_num_chunks; (2) CandidateWatchdog gains an on_timeout callback and a rank-0 grace delay, and the MPI worker loop pre-writes per-candidate failure placeholders that are overwritten on timeout or replaced on success.

Changes

DP-aware token sharding

Layer / File(s)	Summary
Routing builder: `enable_dp` parameter and token distribution logic `tests/microbenchmarks/bench_moe/routing/builders.py`	`_build_per_rank_num_tokens` gains `enable_dp`: non-DP replicates the full batch to every rank; DP distributes tokens across ranks. Validation of explicit `per_rank_num_tokens` sums against `num_tokens` vs `num_tokens * world_size` accordingly. `_per_rank_tokens` and `_build_routing_plan` signatures updated to pass the flag through.
ConfigurableMoE: assert single-element list in non-DP path `tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`	`calculate_num_chunks` non-DP branch now asserts `all_rank_num_tokens` has exactly one element and reads `num_rows` from that element, replacing `sum(all_rank_num_tokens)`.
Benchmark tooling: thread `enable_dp` through case_runner, cli, and results `tests/microbenchmarks/bench_moe/case_runner.py`, `tests/microbenchmarks/bench_moe/cli.py`, `tests/microbenchmarks/bench_moe/results.py`	`_resolve_layout_and_plan` passes `enable_dp` into `_build_routing_plan` and `_per_rank_tokens`; TEP/TTP path sets `all_rank_num_tokens = None`. `_build_worker_header` and `_make_skipped_run_result` derive `enable_dp` from `_resolve_mapping_layout` before calling `_per_rank_tokens`.

Watchdog crash-resume semantics

Layer / File(s)	Summary
Terminal constants and `CandidateWatchdog` `on_timeout` extension `tests/microbenchmarks/bench_moe/worker.py`	`WATCHDOG_TIMEOUT_PREFIX` and `INCOMPLETE_PREFIX` constants added. `CandidateWatchdog.__init__` gains `on_timeout: Optional[Callable[[], None]]` and `rank0_persist_grace_s: float = 8.0`; timeout path invokes the callback (exception-isolated), sleeps non-rank-0 for the grace period, then SIGKILLs.
Candidate loop: pre-write failed placeholder and replace on completion `tests/microbenchmarks/bench_moe/worker.py`	Worker pre-writes a per-candidate `INCOMPLETE_PREFIX` terminal row and checkpoints before execution. `on_timeout` callback overwrites it with `WATCHDOG_TIMEOUT_PREFIX` and `timeout` statuses and re-checkpoints. On normal completion the placeholder is replaced (not appended) with the actual result row.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	⚠️ Warning	The PR description is sparse and lacks required sections from the template. It provides only a bulleted list of changes without explanations, test coverage details, or PR checklist confirmation.	Complete the PR description by adding the Description section with detailed explanations, Test Coverage section listing relevant tests, and confirm PR Checklist items. Use CodeRabbit's summary if helpful to expand the content.
Title check	❓ Inconclusive	The title '[None][fix] fix bench-moe benchmark bugs' is vague and uses generic terms. It mentions 'bugs' without specifying what bugs are being fixed, making it difficult to understand the actual changes from the title alone.	Replace with a more specific title that highlights the main fix, such as '[None][fix] Fix bench-moe timeout handling and non-DP MoE path bugs' to better convey the key changes.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/microbenchmarks/bench_moe/case_runner.py (1)
556-569: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Narrow the routing-plan catch to expected validation/file exceptions.

This block currently catches every exception and re-labels it as a skipped case, which can hide genuine programming errors in _build_routing_plan and make regressions harder to detect.
🔧 Suggested fix
-        except Exception as exc:
+        except (ValueError, FileNotFoundError, json.JSONDecodeError) as exc:
             reason = f"routing plan error: {type(exc).__name__}: {exc}"
             _maybe_print_rank0(f"[bench_moe] {reason}")
             return _short_circuit(result, "skipped", reason)
As per coding guidelines, “Avoid broad exception handling — catch specific exceptions, not bare except” and “limit the except to the smallest set of errors possible.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/microbenchmarks/bench_moe/case_runner.py` around lines 556 - 569, The
try-except block around the _build_routing_plan call is catching the broad
Exception class, which hides genuine programming errors and makes regressions
harder to detect. Replace the generic except Exception clause with specific
exception types that represent the expected validation or file-related errors
that _build_routing_plan can raise (such as ValueError, FileNotFoundError, or
similar domain-specific exceptions). This ensures that only anticipated errors
are caught and re-labeled as skipped cases, while unexpected errors will
propagate and be caught during testing.
Sources: Coding guidelines, Linters/SAST tools

🧹 Nitpick comments (1)

tests/microbenchmarks/bench_moe/worker.py (1)
253-269: 💤 Low value

Defensive exception handling is appropriate here, but consider a brief inline comment.

The static analysis flags broad except Exception at lines 253, 261, and 268. However, in a watchdog context where SIGKILL must fire regardless of what fails, catching all exceptions is intentional and correct—an uncaught exception would leave the process hung.

Consider adding a brief inline comment (e.g., # Catch-all: watchdog must SIGKILL regardless of bookkeeping failures) to document this design choice and silence future reviewers/linters.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/microbenchmarks/bench_moe/worker.py` around lines 253 - 269, The broad
except Exception blocks at lines 253, 261, and 268 in the watchdog timeout
handling logic lack documentation explaining why catching all exceptions is
intentional. Add brief inline comments above or next to each except Exception
clause (including the outer bare except at line 253 and the inner except blocks
around the _on_timeout callback and stderr operations) to document that this
design is intentional because the watchdog must ensure SIGKILL fires regardless
of any bookkeeping failures. Use a comment like "# Catch-all: watchdog must
SIGKILL regardless of bookkeeping failures" to clarify the design choice for
future reviewers and static analysis tools.
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Around line 421-430: The condition checking whether to use the DP path in the
calculate_num_chunks method is overly restrictive. The check `if self.use_dp and
self.comm is not None:` incorrectly assumes that when self.comm is None, the
configuration is non-DP, but DP can also have self.comm as None (in fused-comm
or commless paths). Change the condition to check only `if self.use_dp:` without
the additional self.comm check, so that the padded calculation path is used
whenever DP is enabled regardless of the comm state, and the assertion path is
only used when self.use_dp is False (truly non-DP).

In `@tests/microbenchmarks/bench_moe/results.py`:
- Around line 411-413: The `_resolve_mapping_layout(config, world_size)` call in
the `_make_skipped_run_result` function is unguarded and can raise an exception,
causing the function to fail instead of producing a skipped row as intended.
Wrap the `_resolve_mapping_layout` call in a try-except block to handle any
failures gracefully, using sensible default values for `_enable_dp` if the
layout resolution fails, so that a skipped row result is always returned even
when layout resolution fails.

In `@tests/microbenchmarks/bench_moe/worker.py`:
- Around line 593-595: The file has formatting inconsistencies detected by
ruff-format. Run the ruff formatter locally on the file to automatically apply
the expected formatting changes, then commit the reformatted result. This will
ensure the code matches the project's formatting standards and resolve the CI
failure.

---

Outside diff comments:
In `@tests/microbenchmarks/bench_moe/case_runner.py`:
- Around line 556-569: The try-except block around the _build_routing_plan call
is catching the broad Exception class, which hides genuine programming errors
and makes regressions harder to detect. Replace the generic except Exception
clause with specific exception types that represent the expected validation or
file-related errors that _build_routing_plan can raise (such as ValueError,
FileNotFoundError, or similar domain-specific exceptions). This ensures that
only anticipated errors are caught and re-labeled as skipped cases, while
unexpected errors will propagate and be caught during testing.

---

Nitpick comments:
In `@tests/microbenchmarks/bench_moe/worker.py`:
- Around line 253-269: The broad except Exception blocks at lines 253, 261, and
268 in the watchdog timeout handling logic lack documentation explaining why
catching all exceptions is intentional. Add brief inline comments above or next
to each except Exception clause (including the outer bare except at line 253 and
the inner except blocks around the _on_timeout callback and stderr operations)
to document that this design is intentional because the watchdog must ensure
SIGKILL fires regardless of any bookkeeping failures. Use a comment like "#
Catch-all: watchdog must SIGKILL regardless of bookkeeping failures" to clarify
the design choice for future reviewers and static analysis tools.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f5c26f92-55cf-4363-a5f5-560de83f08c2

📥 Commits

Reviewing files that changed from the base of the PR and between 09449d4 and a3735ed.

📒 Files selected for processing (6)

tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
tests/microbenchmarks/bench_moe/case_runner.py
tests/microbenchmarks/bench_moe/cli.py
tests/microbenchmarks/bench_moe/results.py
tests/microbenchmarks/bench_moe/routing/builders.py
tests/microbenchmarks/bench_moe/worker.py

xxi-nv · 2026-06-16T05:58:16Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-16T06:04:31Z

PR_Github #54492 [ run ] triggered by Bot. Commit: a3735ed Link to invocation

tensorrt-cicd · 2026-06-16T14:58:46Z

PR_Github #54492 [ run ] completed with state SUCCESS. Commit: a3735ed
/LLM/main/L0_MergeRequest_PR pipeline #43557 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

guqiqi · 2026-06-17T09:08:18Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-17T09:14:01Z

PR_Github #54819 [ run ] triggered by Bot. Commit: de514f0 Link to invocation

tensorrt-cicd · 2026-06-17T16:13:55Z

PR_Github #54819 [ run ] completed with state SUCCESS. Commit: de514f0
/LLM/main/L0_MergeRequest_PR pipeline #43832 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

guqiqi · 2026-06-23T10:33:35Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-23T10:40:02Z

PR_Github #55227 [ run ] triggered by Bot. Commit: f39a42a Link to invocation

tensorrt-cicd · 2026-06-23T18:10:28Z

PR_Github #55227 [ run ] completed with state SUCCESS. Commit: f39a42a
/LLM/main/L0_MergeRequest_PR pipeline #44186 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

xxi-nv · 2026-06-25T01:37:01Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T01:43:34Z

PR_Github #55645 [ run ] triggered by Bot. Commit: cb6dd3a Link to invocation

guqiqi · 2026-06-25T02:05:12Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T02:10:43Z

PR_Github #55658 [ run ] triggered by Bot. Commit: d168fe5 Link to invocation

tensorrt-cicd · 2026-06-25T02:14:10Z

PR_Github #55645 [ run ] completed with state ABORTED. Commit: cb6dd3a

Link to invocation

tensorrt-cicd · 2026-06-25T10:07:44Z

PR_Github #55658 [ run ] completed with state SUCCESS. Commit: d168fe5
/LLM/main/L0_MergeRequest_PR pipeline #44565 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…ation for non-DP attention path, and MoE logic for non-DP attention path Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>

Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>

…tion time; add DENSEGEMM+EP validation; remove spurious .cpu() in MXFP4 quantize_utils Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>

Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>

guqiqi · 2026-07-03T04:48:25Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-03T04:54:46Z

PR_Github #57358 [ run ] triggered by Bot. Commit: df096b7 Link to invocation

tensorrt-cicd · 2026-07-03T12:28:47Z

PR_Github #57358 [ run ] completed with state SUCCESS. Commit: df096b7
/LLM/main/L0_MergeRequest_PR pipeline #46109 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

guqiqi · 2026-07-04T08:06:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-04T08:13:01Z

PR_Github #57547 [ run ] triggered by Bot. Commit: df096b7 Link to invocation

tensorrt-cicd · 2026-07-04T12:08:29Z

PR_Github #57547 [ run ] completed with state SUCCESS. Commit: df096b7
/LLM/main/L0_MergeRequest_PR pipeline #46275 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

guqiqi · 2026-07-04T14:37:04Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-04T14:42:54Z

PR_Github #57560 [ run ] triggered by Bot. Commit: df096b7 Link to invocation

tensorrt-cicd · 2026-07-04T15:29:46Z

PR_Github #57560 [ run ] completed with state SUCCESS. Commit: df096b7
/LLM/main/L0_MergeRequest_PR pipeline #46288 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

guqiqi requested a review from a team as a code owner June 16, 2026 05:45

guqiqi requested a review from mikeiovine June 16, 2026 05:45

github-actions Bot assigned guqiqi Jun 16, 2026

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py

Comment thread tests/microbenchmarks/bench_moe/results.py

Comment thread tests/microbenchmarks/bench_moe/worker.py Outdated

xxi-nv self-requested a review June 16, 2026 05:59

guqiqi force-pushed the dev-bench-moe branch from a3735ed to 06e6568 Compare June 16, 2026 07:04

xxi-nv approved these changes Jun 17, 2026

View reviewed changes

guqiqi force-pushed the dev-bench-moe branch 2 times, most recently from 384f4a9 to de514f0 Compare June 17, 2026 09:05

guqiqi force-pushed the dev-bench-moe branch 2 times, most recently from a80ce94 to f39a42a Compare June 23, 2026 10:30

guqiqi force-pushed the dev-bench-moe branch from cb6dd3a to d168fe5 Compare June 25, 2026 02:01

guqiqi added 2 commits June 29, 2026 22:45

fix: bench-moe timeout handler, kernel error handler, workload comput…

dbd3976

…ation for non-DP attention path, and MoE logic for non-DP attention path Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>

fix: unittest failure for assertion in non-DP attention path

e7a1f15

Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>

guqiqi added 3 commits June 29, 2026 22:45

bench-moe enhancement: prune TEP/TTP forced-comm candidates at genera…

54a85be

…tion time; add DENSEGEMM+EP validation; remove spurious .cpu() in MXFP4 quantize_utils Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>

add DTP/TTP support

687e70c

Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>

fix: bench_moe 0-token rank deadlock, feasibility assertion

df096b7

Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>

guqiqi force-pushed the dev-bench-moe branch from 018fc82 to df096b7 Compare June 30, 2026 05:47

Uh oh!

Conversation

guqiqi commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xxi-nv commented Jun 16, 2026

Uh oh!

tensorrt-cicd commented Jun 16, 2026

Uh oh!

tensorrt-cicd commented Jun 16, 2026

Uh oh!

guqiqi commented Jun 17, 2026

Uh oh!

tensorrt-cicd commented Jun 17, 2026

Uh oh!

tensorrt-cicd commented Jun 17, 2026

Uh oh!

guqiqi commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

xxi-nv commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

guqiqi commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

guqiqi commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

guqiqi commented Jul 4, 2026

Uh oh!

tensorrt-cicd commented Jul 4, 2026

Uh oh!

tensorrt-cicd commented Jul 4, 2026

Uh oh!

guqiqi commented Jul 4, 2026

Uh oh!

tensorrt-cicd commented Jul 4, 2026

Uh oh!

tensorrt-cicd commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

guqiqi commented Jun 16, 2026 •

edited

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading