fix(router): output TPS via per-req deltas, skip aborted reqs in stats
Two correctness fixes flagged in PR review:
1. Calling count_output_tokens(len(running_batch.reqs)) once per router
loop is wrong: the loop polls on schedule_time_interval, decoupled
from inference, so it overcounts when the loop runs faster than
decode, undercounts when it runs slower, and includes paused and
prefill-only reqs. Track shm_cur_output_len per request and accumulate
the delta each tick, with a tail settlement when the req is filtered
out so its last tokens are not lost in the window after the final tick.
2. on_request_completed() and router_statics.update() currently both run
for aborted requests, whose candetoken_out_len is a short partial value.
Restore the prior `if not req.is_aborted` guard so disconnects don't
bias the output-length EMA used by KV-budget estimators.
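
A minimal sketch of both fixes, not the actual router code. Only
shm_cur_output_len and is_aborted come from the description above; Req,
OutputTokenTracker, OutputLenEMA, on_finished, request_id and alpha are
hypothetical names used for illustration. It also assumes that tokens
already emitted by an aborted req still count toward throughput.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class Req:
    # Hypothetical stand-in for the router's request object.
    request_id: int
    shm_cur_output_len: int = 0
    is_aborted: bool = False


@dataclass
class OutputTokenTracker:
    # Last output length observed for each live request.
    last_len: Dict[int, int] = field(default_factory=dict)
    total_output_tokens: int = 0

    def _accumulate(self, req: Req) -> None:
        # Count only tokens produced since the previous observation.
        prev = self.last_len.get(req.request_id, 0)
        delta = req.shm_cur_output_len - prev
        if delta > 0:
            self.total_output_tokens += delta
            self.last_len[req.request_id] = req.shm_cur_output_len

    def tick(self, running_reqs: Iterable[Req]) -> None:
        # Called once per router loop; the result does not depend on
        # how often the loop runs relative to decode.
        for req in running_reqs:
            self._accumulate(req)

    def settle(self, req: Req) -> None:
        # Tail settlement: pick up tokens produced after the last tick,
        # then forget the request.
        self._accumulate(req)
        self.last_len.pop(req.request_id, None)


class OutputLenEMA:
    # Hypothetical output-length EMA; alpha is an assumed smoothing factor.
    def __init__(self, alpha: float = 0.1) -> None:
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, out_len: int) -> None:
        if self.value is None:
            self.value = float(out_len)
        else:
            self.value = (1 - self.alpha) * self.value + self.alpha * out_len


def on_finished(req: Req, tracker: OutputTokenTracker, ema: OutputLenEMA) -> None:
    tracker.settle(req)
    # Aborted reqs carry a short partial length, so keep them out of the EMA.
    if not req.is_aborted:
        ema.update(req.shm_cur_output_len)
```

With this shape, a tick that observes no new tokens adds nothing, and a
prefill-only req contributes zero until it actually decodes.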