[None][fix] Restore benchmark-disagg immediate fail-fast in _prepare_and_schedule_batch
The cherry-pick of the #14042-era "Fix deepseekv4 stall" change replaced
main's immediate "Insufficient KV cache for gen-only benchmark mode" guard
with a time-based gen-count stall watchdog. On current main that watchdog
is both superseded and incompatible. It is superseded because main already
handles the ADP fill-completion case via _is_benchmark_disagg_fill_complete
(per-rank allgather). It is incompatible because a time-based watchdog
never fires within a single scheduling iteration, so it regressed
tests/unittest/_torch/executor/test_benchmark_disagg.py
(TestFailFastDuringBenchmarkFill, TestFillPhaseEndToEnd). Those tests
assert an immediate fail-fast when all benchmark requests are fetched (or
the fill phase is over) and the scheduler can fit no INIT request.
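
To illustrate why a time-based watchdog cannot trip on a single scheduling
iteration, here is a minimal sketch. The class GenCountStallWatchdog, its
stalled() method and the 30-second timeout are hypothetical names and
values chosen for illustration; they are not taken from the executor code.

```python
import time
from typing import Optional


class GenCountStallWatchdog:
    """Hypothetical watchdog: reports a stall only after the generation
    count has stayed flat for at least timeout_s seconds."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._last_count: Optional[int] = None
        self._last_progress: float = 0.0

    def stalled(self, gen_count: int, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if gen_count != self._last_count:
            # First observation, or the count moved: reset the baseline.
            self._last_count = gen_count
            self._last_progress = now
            return False
        # Same count as before: stalled only once the timeout has elapsed.
        return now - self._last_progress >= self.timeout_s


watchdog = GenCountStallWatchdog(timeout_s=30.0)
first_call = watchdog.stalled(gen_count=0, now=100.0)   # only sets the baseline
later_call = watchdog.stalled(gen_count=0, now=130.0)   # flat for 30 s
```

In this sketch the first call can only record a baseline, so a test that
drives exactly one scheduling iteration never observes a failure.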
Restore main's immediate guard (fail once all requests are fetched and no INIT
request fits, suppressed during warmup) and drop the now-unused
benchmark_fill_stall_timeout_s. Verified on 8xB200: test_benchmark_disagg.py,
test_py_executor.py and the tp4_pp2_dp_both transceiver cases all pass
(240 passed, 0 failed).
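
The restored condition can be sketched roughly as follows. This is an
illustration under assumptions, not the code in _prepare_and_schedule_batch:
the function names, parameter names and the use of RuntimeError are
hypothetical; only the error text and the conditions come from the
description above.

```python
def should_fail_fast(
    all_requests_fetched: bool,
    fill_phase_over: bool,
    num_init_requests_that_fit: int,
    is_warmup: bool,
) -> bool:
    """Hypothetical predicate for the immediate benchmark-disagg guard."""
    if is_warmup:
        # The guard is suppressed during warmup.
        return False
    # Nothing more will arrive once all requests are fetched or fill is over.
    no_more_requests = all_requests_fetched or fill_phase_over
    return no_more_requests and num_init_requests_that_fit == 0


def check_benchmark_fill(
    all_requests_fetched: bool,
    fill_phase_over: bool,
    num_init_requests_that_fit: int,
    is_warmup: bool,
) -> None:
    """Raise on the same iteration in which the condition is first seen."""
    if should_fail_fast(
        all_requests_fetched,
        fill_phase_over,
        num_init_requests_that_fit,
        is_warmup,
    ):
        # Exception type is an assumption; the message is from the guard.
        raise RuntimeError("Insufficient KV cache for gen-only benchmark mode")
```

Because the predicate depends only on the state of the current iteration,
it fails on the first call in which the condition holds, with no timer
involved.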
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>