[None][feat] WideEP FT: add EPLB mask-only reconfigure by chienchunhung · Pull Request #4 · chienchunhung/TensorRT-LLM

chienchunhung · 2026-06-22T19:44:25Z

Summary

Add a C++ MoeLoadBalancer::reconfigureMaskOnly(deadRanks) path that rebuilds EPLB placement metadata while leaving weights in place.
Keep dead ranks excluded from later dynamic EPLB placement/replication so degraded routing does not drift back onto failed ranks.
Expose the API through nanobind and the Python MoeLoadBalancer wrapper, with focused C++/Python tests.
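
The second point, keeping dead ranks out of later dynamic placement, can be pictured with a minimal pure-Python sketch. This is not the PR's C++ code: `LivePlacementPolicy`, `mark_dead`, and `pick_rank_for_replica` are hypothetical names, and the least-loaded selection rule is an assumption chosen only to show how a persistent dead-rank set keeps new replicas off failed ranks.

```python
from typing import Dict, Set


class LivePlacementPolicy:
    """Hypothetical helper: remembers dead ranks across rebalancing rounds."""

    def __init__(self, num_ranks: int) -> None:
        self.num_ranks = num_ranks
        self.dead_ranks: Set[int] = set()

    def mark_dead(self, ranks: Set[int]) -> None:
        # The set only grows, so a later rebalance cannot re-admit a failed rank.
        self.dead_ranks |= set(ranks)

    def pick_rank_for_replica(self, load_per_rank: Dict[int, float]) -> int:
        """Choose the least-loaded surviving rank for a new replica."""
        live = [r for r in range(self.num_ranks) if r not in self.dead_ranks]
        if not live:
            raise RuntimeError("no surviving ranks available for placement")
        # Ties on load are broken by the lower rank id (an arbitrary choice).
        return min(live, key=lambda r: (load_per_rank.get(r, 0.0), r))
```

In this sketch rank 1 would never be chosen after `mark_dead({1})`, even when it reports the lowest load, which is the "no drift back onto failed ranks" property the summary describes.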

Why

WideEP FT 1b.1 needs EPLB to make dead-rank slots unreachable once the communication-layer active-rank mask is installed. This PR implements the MVP mask-only slot remap path, which fails closed if any expert would lose its last surviving replica.
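
As a rough illustration of the fail-closed remap, here is a minimal pure-Python sketch. It does not reproduce `MoeLoadBalancer::reconfigureMaskOnly`: the function name `remap_mask_only`, the flat `slot_to_expert` list, and the assumption that each rank owns a contiguous block of `slots_per_rank` slots are all hypothetical and chosen for illustration.

```python
from typing import Dict, List, Set


def remap_mask_only(
    slot_to_expert: List[int],
    slots_per_rank: int,
    dead_ranks: Set[int],
) -> Dict[int, List[int]]:
    """Return expert -> surviving slot ids, without moving any weights.

    Assumed layout: slot s lives on rank s // slots_per_rank.
    Raises RuntimeError (fails closed) if an expert would have no replica left.
    """
    surviving: Dict[int, List[int]] = {e: [] for e in set(slot_to_expert)}
    for slot, expert in enumerate(slot_to_expert):
        if slot // slots_per_rank not in dead_ranks:
            surviving[expert].append(slot)

    # Refuse to produce a routing table that silently drops an expert.
    lost = sorted(e for e, slots in surviving.items() if not slots)
    if lost:
        raise RuntimeError(f"experts {lost} would lose their last replica")
    return surviving


# Three ranks with two slots each; experts 0 and 1 are replicated on ranks 0 and 2.
placement = [0, 1, 2, 3, 0, 1]
routable = remap_mask_only(placement, slots_per_rank=2, dead_ranks={0})
```

With rank 0 dead, experts 0 and 1 stay routable through their replicas on rank 2. Passing `dead_ranks={1}` instead raises, because experts 2 and 3 exist only on rank 1.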

Validation

git diff --check
python -m py_compile tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py tests/unittest/_torch/modules/test_moe_load_balancer.py
Attempted: PYTHONPATH=. pytest -q tests/unittest/_torch/modules/test_moe_load_balancer.py -k 'lifecycle_methods or reconfigure_mask_only_rejects_active_iteration' (blocked locally: torch is not installed in this environment)

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

chienchunhung · 2026-06-22T21:25:33Z

Superseded by NVIDIA#15525.

[None][feat] WideEP FT: add EPLB mask-only reconfigure

b8ca167

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>

github-actions Bot assigned chienchunhung Jun 22, 2026

chienchunhung closed this Jun 22, 2026

chienchunhung wants to merge 1 commit into WideEP-FT/1a.4-alltoall-watchdog from WideEP-FT/1b.1-reconfigure-masking
