[None][feat] WideEP FT: add EPLB mask-only reconfigure #4

Closed
chienchunhung wants to merge 1 commit into WideEP-FT/1a.4-alltoall-watchdog from WideEP-FT/1b.1-reconfigure-masking

@chienchunhung

Summary

  • Add a C++ MoeLoadBalancer::reconfigureMaskOnly(deadRanks) path that rebuilds EPLB placement metadata while leaving weights in place.
  • Keep dead ranks excluded from later dynamic EPLB placement/replication so degraded routing does not drift back onto failed ranks.
  • Expose the API through nanobind and the Python MoeLoadBalancer wrapper, with focused C++/Python tests.
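
To illustrate the second bullet, here is a minimal Python sketch of a placement pass that never considers dead ranks. It is not the EPLB algorithm from this PR: the function name `place_experts`, its parameters, and the greedy least-loaded heuristic are all assumptions chosen for illustration. The only point it makes is that the dead-rank set has to be applied when candidates are selected, so a later rebalance cannot move experts back onto a failed rank.

```python
from typing import Dict, List, Set


def place_experts(
    expert_load: Dict[int, float],
    num_ranks: int,
    slots_per_rank: int,
    dead_ranks: Set[int],
) -> Dict[int, List[int]]:
    """Hypothetical greedy placement: heaviest expert first, onto the
    least-loaded live rank that still has a free slot."""
    # Dead ranks are filtered out up front, so they can never be chosen.
    live = [r for r in range(num_ranks) if r not in dead_ranks]
    if len(expert_load) > len(live) * slots_per_rank:
        raise ValueError("not enough live slots for all experts")

    rank_load = {r: 0.0 for r in live}
    placement: Dict[int, List[int]] = {r: [] for r in live}
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        candidates = [r for r in live if len(placement[r]) < slots_per_rank]
        # Ties are broken by rank id to keep the result deterministic.
        target = min(candidates, key=lambda r: (rank_load[r], r))
        placement[target].append(expert)
        rank_load[target] += load
    return placement


# Three ranks, rank 1 is dead: experts land only on ranks 0 and 2.
result = place_experts({0: 4.0, 1: 3.0, 2: 2.0, 3: 1.0}, 3, 2, {1})
```

In this sketch the dead rank does not appear as a key in the returned placement at all, which makes an accidental assignment to it easy to detect in tests.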

Why

1b.1 needs EPLB to make dead-rank slots unreachable after the communication-layer active-rank mask is installed. This implements the MVP mask-only slot remap path and fails closed if any expert would lose its last surviving replica.
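
The mask-only remap and its fail-closed check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the C++ `reconfigureMaskOnly` implementation: the function `reconfigure_mask_only`, the exception `LastReplicaLostError`, and the `slot_to_expert[rank][local_slot]` layout are hypothetical names introduced here. Weights stay where they are; only the expert-to-slot routing table is rebuilt, and the call raises instead of returning a partial table if any expert has no surviving replica.

```python
from typing import Dict, List, Set, Tuple


class LastReplicaLostError(RuntimeError):
    """Hypothetical error: masking would leave an expert with no live replica."""


def reconfigure_mask_only(
    slot_to_expert: List[List[int]],
    dead_ranks: Set[int],
) -> Dict[int, List[Tuple[int, int]]]:
    """Rebuild expert -> [(rank, local_slot)] routing without moving weights.

    slot_to_expert[rank][local_slot] is the expert whose weights already
    occupy that slot; this table is read, never modified.
    """
    routing: Dict[int, List[Tuple[int, int]]] = {}
    all_experts: Set[int] = set()
    for rank, slots in enumerate(slot_to_expert):
        for local_slot, expert in enumerate(slots):
            all_experts.add(expert)
            if rank in dead_ranks:
                continue  # slot becomes unreachable, weights stay in place
            routing.setdefault(expert, []).append((rank, local_slot))

    # Fail closed: refuse to produce routing that silently drops an expert.
    orphaned = sorted(all_experts - routing.keys())
    if orphaned:
        raise LastReplicaLostError(f"experts with no live replica: {orphaned}")
    return routing


# Four ranks, two slots each, every expert replicated on two ranks.
placement = [[0, 1], [2, 3], [0, 2], [1, 3]]
routing = reconfigure_mask_only(placement, dead_ranks={1})
```

With rank 1 masked, experts 2 and 3 are each left with a single replica but remain routable. Masking ranks 1 and 2 together would remove both replicas of expert 2, so the sketch raises rather than returning a table.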

Validation

  • git diff --check
  • python -m py_compile tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py tests/unittest/_torch/modules/test_moe_load_balancer.py
  • Attempted: PYTHONPATH=. pytest -q tests/unittest/_torch/modules/test_moe_load_balancer.py -k 'lifecycle_methods or reconfigure_mask_only_rejects_active_iteration' (blocked locally: torch is not installed in this environment)

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
@chienchunhung

Superseded by NVIDIA#15525.