Commit d74f4ce
[JAX] MoE test driver: fix bootstrap recv-buffer formula + drop redundant align128 config
Two small edits to the EP MoE test driver that had been sitting
uncommitted since the bring-up:
1. ``_compute_worst_case_recv_pr`` was sizing the bootstrap with the
old natural-dropless formula
``ceil((B/dp)*S*K / num_local_experts)`` and rounding to align=128.
That under-sizes the bootstrap on small configs because NCCL EP's
HT path lays out the per-rank receive buffer as
``[num_local_experts, ep_size * max_tokens_per_rank, hidden]``
(LL combine assertion at ``nccl_ep.cc:2185`` + HT IPC sizing at
``nccl_ep.cc:415``). When runtime ``recv_pr`` exceeded the
bootstrap-time ``recv_capacity_per_rank``, ``ncclEpDispatch``
aborted with ``invalid argument`` at ``ep_backend.cpp:414``.
Switch to the worst-case formula
``num_local_experts * ep_size * max_tokens_per_rank`` so the
bootstrap reserves enough capacity for every config in the
parametrize list (and matches the ``natural_spe`` computation in
moe.py).
2. Drop the ``softmax-topk-early`` and ``softmax-align128``
parametrize cases. Each is replaced by a TODO comment that
documents the scope:
- ``softmax-topk-early`` is off because the early-weighting
multiply ``intermediate * recv_w * mask`` is currently
vulnerable to ``0 * NaN -> NaN`` from padded recv slots.
Late weighting (combine-side) is unaffected and stays
covered.
- ``softmax-align128`` is now redundant: moe.py floors
``slots_per_expert`` at 128 unconditionally
(``effective_align = max(align_size, 128)``, landed in
``b1e99803``), so ``align_size=0`` and ``align_size=128``
produce identical layouts. A distinct case only matters if
the floor is loosened or a recipe demands >128 alignment.
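For illustration, the two bootstrap sizing formulas from item 1 can be compared side by side. This is only a sketch: the function names, the example config, and the choice of ``max_tokens_per_rank = (B/dp)*S`` are assumptions made for this example, not values taken from the test driver or from moe.py.

```python
import math


def natural_dropless_capacity(batch, dp, seq_len, top_k, num_local_experts, align=128):
    """Old sizing: ceil((B/dp)*S*K / num_local_experts), rounded up to `align`.

    Assumes `batch` is divisible by `dp` (hypothetical helper, not the driver's code).
    """
    routed_tokens = (batch // dp) * seq_len * top_k
    per_expert = math.ceil(routed_tokens / num_local_experts)
    return math.ceil(per_expert / align) * align


def worst_case_capacity(num_local_experts, ep_size, max_tokens_per_rank):
    """New sizing: num_local_experts * ep_size * max_tokens_per_rank."""
    return num_local_experts * ep_size * max_tokens_per_rank


if __name__ == "__main__":
    # Hypothetical small config, chosen only to show the gap between the formulas.
    batch, dp, seq_len, top_k = 8, 4, 32, 2
    num_local_experts, ep_size = 4, 4
    max_tokens_per_rank = (batch // dp) * seq_len  # assumption for this sketch

    old = natural_dropless_capacity(batch, dp, seq_len, top_k, num_local_experts)
    new = worst_case_capacity(num_local_experts, ep_size, max_tokens_per_rank)
    print(f"natural-dropless: {old}, worst-case: {new}")
```

Under these assumed numbers the old formula reserves far fewer slots than the worst-case layout needs, which is the kind of mismatch that would trip a capacity check at dispatch time.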
Module-level docstring updated to drop the stale ``align_size=0/128``
reference.
1 parent a260c4b commit d74f4ce
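The ``0 * NaN -> NaN`` hazard cited for ``softmax-topk-early`` follows from IEEE 754 arithmetic and can be reproduced without any of the MoE code. The sketch below uses plain Python lists and hypothetical helper names. The select-based variant is one common workaround shown for contrast, not necessarily the fix the project will adopt.

```python
import math


def early_weight_by_multiply(intermediate, recv_w, mask):
    """Mirrors the shape of `intermediate * recv_w * mask`: masking by multiplication."""
    return [x * w * m for x, w, m in zip(intermediate, recv_w, mask)]


def early_weight_by_select(intermediate, recv_w, mask):
    """Hypothetical alternative: choose 0.0 for padded slots instead of multiplying."""
    return [x * w if m else 0.0 for x, w, m in zip(intermediate, recv_w, mask)]


if __name__ == "__main__":
    # Slot 0 is a real token; slot 1 is a padded recv slot holding NaN garbage.
    intermediate = [1.5, math.nan]
    recv_w = [0.5, 0.0]
    mask = [1.0, 0.0]

    print(early_weight_by_multiply(intermediate, recv_w, mask))
    print(early_weight_by_select(intermediate, recv_w, mask))
```

Multiplying by a zero mask leaves the NaN in place, while selecting on the mask discards the padded value before it can propagate.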
1 file changed
Lines changed: 28 additions & 23 deletions