[TRTLLM-12557][feat] WideEP FT: add AlltoAll watchdog (1a.4)#5

Closed
chienchunhung wants to merge 1 commit into WideEP-FT/1a.2-nvlink-kernel-mask from WideEP-FT/1a.4-alltoall-watchdog

Conversation

@chienchunhung

Owner

Summary

  • Add an opt-in AlltoAllWatchdog host thread that polls dispatch/combine completion flags in FIFO order and reports timed-out ranks.
  • Wire optional EP health/watchdog support into MoeAlltoAll and NVLinkOneSided, including active-rank-mask forwarding from health.
  • Add focused watchdog unit coverage for completion, timeout reporting, existing failed-rank filtering, local-rank handling, FIFO clearing, and workspace-offset reads.
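
For reviewers who want a mental model of the polling behavior summarized above, here is a minimal, torch-free sketch. It is not the code in this PR: `WatchdogSketch`, `PendingOp`, `submit`, `poll_once`, and the `read_flags` callable are hypothetical names chosen for illustration, and the completion flags are modeled as a plain list of booleans rather than values read from the AlltoAll workspace. It only shows the general idea of checking pending operations in FIFO order, skipping the local rank and ranks already marked failed, and reporting the ranks that miss a deadline.

```python
import threading
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Optional, Set, Tuple


@dataclass
class PendingOp:
    """One dispatch or combine call awaiting completion (hypothetical record)."""
    name: str
    read_flags: Callable[[], List[bool]]  # one completion flag per EP rank
    deadline: float                       # time.monotonic() value


class WatchdogSketch:
    """Illustrative host-side watchdog; all names here are assumptions."""

    def __init__(self, local_rank: int, timeout_s: float,
                 on_timeout: Callable[[str, List[int]], None],
                 poll_interval_s: float = 0.001) -> None:
        self._local_rank = local_rank
        self._timeout_s = timeout_s
        self._on_timeout = on_timeout
        self._poll_interval_s = poll_interval_s
        self._queue: Deque[PendingOp] = deque()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self.failed_ranks: Set[int] = set()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def submit(self, name: str, read_flags: Callable[[], List[bool]]) -> None:
        """Enqueue an operation; its deadline starts counting now."""
        op = PendingOp(name, read_flags, time.monotonic() + self._timeout_s)
        with self._lock:
            self._queue.append(op)

    def pending(self) -> int:
        with self._lock:
            return len(self._queue)

    def _run(self) -> None:
        while not self._stop.is_set():
            self.poll_once()
            self._stop.wait(self._poll_interval_s)

    def poll_once(self, now: Optional[float] = None) -> None:
        """Check the queue head first; stop at the first op still in flight."""
        now = time.monotonic() if now is None else now
        reports: List[Tuple[str, List[int]]] = []
        with self._lock:
            while self._queue:
                op = self._queue[0]
                missing = {
                    rank for rank, done in enumerate(op.read_flags())
                    if not done
                    and rank != self._local_rank        # never blame ourselves
                    and rank not in self.failed_ranks   # already reported
                }
                if missing and now < op.deadline:
                    break  # head is still within its deadline; keep FIFO order
                self._queue.popleft()
                if missing:
                    self.failed_ranks |= missing
                    reports.append((op.name, sorted(missing)))
        # Invoke callbacks outside the lock so they may call back into submit().
        for name, ranks in reports:
            self._on_timeout(name, ranks)
```

In this sketch `poll_once` accepts an explicit `now`, so timeout behavior can be exercised deterministically without starting the thread or sleeping. Whether the real watchdog exposes a comparable hook is not stated in this description.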

Validation

  • python -m compileall -q tensorrt_llm/_torch/alltoall_watchdog.py tensorrt_llm/_torch/distributed/moe_alltoall.py tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py tests/unittest/_torch/modules/test_alltoall_watchdog.py
  • git diff --check

Not run: pytest, because the available local Python runtimes do not have torch installed.

@chienchunhung chienchunhung changed the title [codex] WideEP FT: add AlltoAll watchdog [TRTLLM-12557][feat] WideEP FT: add AlltoAll watchdog Jun 22, 2026
@chienchunhung chienchunhung changed the title [TRTLLM-12557][feat] WideEP FT: add AlltoAll watchdog [TRTLLM-12557][feat] WideEP FT: add AlltoAll watchdog (1a.4) Jun 22, 2026
@chienchunhung

Owner Author

Superseded by NVIDIA#15524.
