[None][fix] fix bench-moe benchmark bugs#15397
Conversation
📝 WalkthroughWalkthroughTwo orthogonal changes: (1) an ChangesDP-aware token sharding
Watchdog crash-resume semantics
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/microbenchmarks/bench_moe/case_runner.py (1)
556-569:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winNarrow the routing-plan catch to expected validation/file exceptions.
This block currently catches every exception and re-labels it as a skipped case, which can hide genuine programming errors in
_build_routing_planand make regressions harder to detect.🔧 Suggested fix
- except Exception as exc: + except (ValueError, FileNotFoundError, json.JSONDecodeError) as exc: reason = f"routing plan error: {type(exc).__name__}: {exc}" _maybe_print_rank0(f"[bench_moe] {reason}") return _short_circuit(result, "skipped", reason)As per coding guidelines, “Avoid broad exception handling — catch specific exceptions, not bare
except” and “limit the except to the smallest set of errors possible.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/microbenchmarks/bench_moe/case_runner.py` around lines 556 - 569, The try-except block around the _build_routing_plan call is catching the broad Exception class, which hides genuine programming errors and makes regressions harder to detect. Replace the generic except Exception clause with specific exception types that represent the expected validation or file-related errors that _build_routing_plan can raise (such as ValueError, FileNotFoundError, or similar domain-specific exceptions). This ensures that only anticipated errors are caught and re-labeled as skipped cases, while unexpected errors will propagate and be caught during testing.Sources: Coding guidelines, Linters/SAST tools
🧹 Nitpick comments (1)
tests/microbenchmarks/bench_moe/worker.py (1)
253-269: 💤 Low valueDefensive exception handling is appropriate here, but consider a brief inline comment.
The static analysis flags broad
except Exceptionat lines 253, 261, and 268. However, in a watchdog context where SIGKILL must fire regardless of what fails, catching all exceptions is intentional and correct—an uncaught exception would leave the process hung.Consider adding a brief inline comment (e.g.,
# Catch-all: watchdog must SIGKILL regardless of bookkeeping failures) to document this design choice and silence future reviewers/linters.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/microbenchmarks/bench_moe/worker.py` around lines 253 - 269, The broad except Exception blocks at lines 253, 261, and 268 in the watchdog timeout handling logic lack documentation explaining why catching all exceptions is intentional. Add brief inline comments above or next to each except Exception clause (including the outer bare except at line 253 and the inner except blocks around the _on_timeout callback and stderr operations) to document that this design is intentional because the watchdog must ensure SIGKILL fires regardless of any bookkeeping failures. Use a comment like "# Catch-all: watchdog must SIGKILL regardless of bookkeeping failures" to clarify the design choice for future reviewers and static analysis tools.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Around line 421-430: The condition checking whether to use the DP path in the
calculate_num_chunks method is overly restrictive. The check `if self.use_dp and
self.comm is not None:` incorrectly assumes that when self.comm is None, the
configuration is non-DP, but DP can also have self.comm as None (in fused-comm
or commless paths). Change the condition to check only `if self.use_dp:` without
the additional self.comm check, so that the padded calculation path is used
whenever DP is enabled regardless of the comm state, and the assertion path is
only used when self.use_dp is False (truly non-DP).
In `@tests/microbenchmarks/bench_moe/results.py`:
- Around line 411-413: The `_resolve_mapping_layout(config, world_size)` call in
the `_make_skipped_run_result` function is unguarded and can raise an exception,
causing the function to fail instead of producing a skipped row as intended.
Wrap the `_resolve_mapping_layout` call in a try-except block to handle any
failures gracefully, using sensible default values for `_enable_dp` if the
layout resolution fails, so that a skipped row result is always returned even
when layout resolution fails.
In `@tests/microbenchmarks/bench_moe/worker.py`:
- Around line 593-595: The file has formatting inconsistencies detected by
ruff-format. Run the ruff formatter locally on the file to automatically apply
the expected formatting changes, then commit the reformatted result. This will
ensure the code matches the project's formatting standards and resolve the CI
failure.
---
Outside diff comments:
In `@tests/microbenchmarks/bench_moe/case_runner.py`:
- Around line 556-569: The try-except block around the _build_routing_plan call
is catching the broad Exception class, which hides genuine programming errors
and makes regressions harder to detect. Replace the generic except Exception
clause with specific exception types that represent the expected validation or
file-related errors that _build_routing_plan can raise (such as ValueError,
FileNotFoundError, or similar domain-specific exceptions). This ensures that
only anticipated errors are caught and re-labeled as skipped cases, while
unexpected errors will propagate and be caught during testing.
---
Nitpick comments:
In `@tests/microbenchmarks/bench_moe/worker.py`:
- Around line 253-269: The broad except Exception blocks at lines 253, 261, and
268 in the watchdog timeout handling logic lack documentation explaining why
catching all exceptions is intentional. Add brief inline comments above or next
to each except Exception clause (including the outer bare except at line 253 and
the inner except blocks around the _on_timeout callback and stderr operations)
to document that this design is intentional because the watchdog must ensure
SIGKILL fires regardless of any bookkeeping failures. Use a comment like "#
Catch-all: watchdog must SIGKILL regardless of bookkeeping failures" to clarify
the design choice for future reviewers and static analysis tools.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f5c26f92-55cf-4363-a5f5-560de83f08c2
📒 Files selected for processing (6)
tensorrt_llm/_torch/modules/fused_moe/configurable_moe.pytests/microbenchmarks/bench_moe/case_runner.pytests/microbenchmarks/bench_moe/cli.pytests/microbenchmarks/bench_moe/results.pytests/microbenchmarks/bench_moe/routing/builders.pytests/microbenchmarks/bench_moe/worker.py
|
/bot run --disable-fail-fast |
|
PR_Github #54492 [ run ] triggered by Bot. Commit: |
|
PR_Github #54492 [ run ] completed with state
|
384f4a9 to
de514f0
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #54819 [ run ] triggered by Bot. Commit: |
|
PR_Github #54819 [ run ] completed with state
|
a80ce94 to
f39a42a
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #55227 [ run ] triggered by Bot. Commit: |
|
PR_Github #55227 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #55645 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast |
|
PR_Github #55658 [ run ] triggered by Bot. Commit: |
|
PR_Github #55645 [ run ] completed with state |
|
PR_Github #55658 [ run ] completed with state
|
…ation for non-DP attention path, and MoE logic for non-DP attention path Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>
Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>
…tion time; add DENSEGEMM+EP validation; remove spurious .cpu() in MXFP4 quantize_utils Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>
Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>
Signed-off-by: guqiqi <29116997+guqiqi@users.noreply.github.com>
|
/bot run --disable-fail-fast |
|
PR_Github #57358 [ run ] triggered by Bot. Commit: |
|
PR_Github #57358 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #57547 [ run ] triggered by Bot. Commit: |
|
PR_Github #57547 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #57560 [ run ] triggered by Bot. Commit: |
|
PR_Github #57560 [ run ] completed with state
|
Description
Summary by CodeRabbit
Bug Fixes
Tests