[Bugfix] Fix double-counted max_q_seqlen in decode delta kv_seqlens#4685
Conversation
|
Thanks for the PR! seq.num_all_ids can be stale when delta is built. EngineLoop prefetches the next inputs before waiting for the current output and before _finish_forward_output() updates the SchedulerSequence(read _main_loop_get_outputs in engine_loop for more details). A better solution would be: kv_seqlens = [seq.num_all_ids + max_q_seqlen for seq in valid_seqs]
sum_kv_seqlen = sum(kv_seqlens)
max_kv_seqlen = max(kv_seqlens) |
4c84f26 to
df2a1b4
Compare
|
Thanks @grimoire — you're right, that's the detail I missed. Adopted your suggestion in df2a1b4: kv_seqlens = [seq.num_all_ids + max_q_seqlen for seq in valid_seqs]
sum_kv_seqlen = sum(kv_seqlens)
max_kv_seqlen = max(kv_seqlens)
Added a GPU-free regression test (2 valid + 1 dropped seq, (FYI the red |
create_model_inputs_delta / create_model_inputs_delta_valid_only build
kv_seqlens as [seq.num_all_ids + max_q_seqlen]. num_all_ids can be one decode
step stale here -- EngineLoop prefetches the next inputs before
_finish_forward_output() advances the sequence -- so the +max_q_seqlen recovers
this forward's kv length. But the reductions then added max_q_seqlen a SECOND
time and used batch_size = len(self.running_seqs), which counts
scheduler-dropped invalid seqs:
sum_kv_seqlen = sum(kv_seqlens) + batch_size * max_q_seqlen
max_kv_seqlen = max(kv_seqlens) + max_q_seqlen
so max_kv_seqlen / sum_kv_seqlen were over-counted (max by max_q_seqlen,
scaling with spec/MTP num_decode_tokens), over-allocating the attention grid +
kv-cache resources.
Reduce over kv_seqlens directly; the +max_q_seqlen is already applied once in
the comprehension.
Fixes InternLM#4024
Co-authored-by: Claude <noreply@anthropic.com>
df2a1b4 to
3c61d26
Compare
|
Also strengthened the regression test (3c61d26): instead of hard-coded expected values, it now derives them from the engine's own one-decode-step advance ( |
There was a problem hiding this comment.
Pull request overview
Fixes an arithmetic bug in the PyTorch engine’s decode “delta” fast-path where max_kv_seqlen / sum_kv_seqlen were over-counted (double-applying + max_q_seqlen and incorrectly including scheduler-dropped sequences). This aligns the delta builders with the engine’s “advance exactly one decode step” invariant used by ModelInputs.step() / get_model_inputs_next_decoding().
Changes:
- Correct
sum_kv_seqlen/max_kv_seqlenreductions inInputsMakerAsync.create_model_inputs_delta()andcreate_model_inputs_delta_valid_only()to reduce directly over the already-advancedkv_seqlens. - Add a regression unit test covering scheduler-dropped sequences and multiple
max_q_seqlenvalues (standard decode vs spec/MTP-like multi-token decode).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| lmdeploy/pytorch/engine/inputs_maker.py | Fixes double-counting of KV sequence lengths in decode delta builders (and avoids counting invalid/dropped seqs). |
| tests/pytorch/engine/test_inputs_maker.py | Adds a regression test verifying delta KV lengths match “base + exactly one decode advance” for valid seqs only. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary
Fixes #4024 (confirmed by @grimoire, who reviewed and pinned the exact fix). The decode "delta" fast-path over-counts
max_kv_seqlen/sum_kv_seqlen.Root cause
InputsMakerAsync.create_model_inputs_delta(andcreate_model_inputs_delta_valid_only) build:num_all_idsis one decode step stale at delta-build time —EngineLoop._prefetch_next_inputs(engine_loop.py:435) runs before_finish_forward_output(:438) advances theSchedulerSequence(per @grimoire) — so the+ max_q_seqlenin the comprehension is needed to recover this forward's kv length. The bug is the second+ max_q_seqlenin the reductions, plusbatch_size = len(self.running_seqs), which counts scheduler-dropped invalid seqs. The result over-countsmax_kv_seqlenbymax_q_seqlen(scaling with spec-decode / MTPnum_decode_tokens).Per @grimoire these values feed the attention grid + kv-cache resource allocation, so the effect is conservative over-allocation (minor waste, no accuracy/perf impact).
Fix
Reduce over
kv_seqlensdirectly (the+ max_q_seqlenis already applied once in the comprehension):Same fix in
create_model_inputs_delta_valid_only.Test
tests/pytorch/engine/test_inputs_maker.py::test_create_model_inputs_delta_valid_only_matches_one_decode_advance(GPU-free; 2 valid + 1 scheduler-dropped seq), parametrized overmax_q_seqlen ∈ {1, 4}(standard decode + spec/MTP). The expected value is derived from the engine's own one-decode-step advance —ModelInputs.step(model_inputs.py) /get_model_inputs_next_decoding(strategies/ar/model_inputs.py):max_kv += max_q_seqlen,sum_kv += num_valid * max_q_seqlen— not a hardcoded number. It asserts the delta equalsbase_kv (num_all_ids) + exactly one advance, and parametrizing proves the offset is onemax_q_seqlen, rejecting both the old+ 2*max_qand a+ 0regression:2 passed(full file: 11 passed).2 failed— e.g.assert 258 == 250 + 4(max_q=4).pre-commit(ruff-check + docformatter) clean.Dup-check
Issue open, unassigned. No open PR touches this logic (#4679/#4534/#4631 leave the delta builders untouched). cc @grimoire.
AI-assisted (Claude); every line human-reviewed.