Skip to content

[Bugfix] Fix double-counted max_q_seqlen in decode delta kv_seqlens#4685

Merged
lvhan028 merged 1 commit into
InternLM:mainfrom
waynehacking8:wayne/fix-4024-delta-kv-seqlen
Jun 18, 2026
Merged

[Bugfix] Fix double-counted max_q_seqlen in decode delta kv_seqlens#4685
lvhan028 merged 1 commit into
InternLM:mainfrom
waynehacking8:wayne/fix-4024-delta-kv-seqlen

Conversation

@waynehacking8

@waynehacking8 waynehacking8 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #4024 (confirmed by @grimoire, who reviewed and pinned the exact fix). The decode "delta" fast-path over-counts max_kv_seqlen / sum_kv_seqlen.

The issue points at ModelInputs.split() (since refactored away); the same arithmetic error now lives in the decode delta builders, so this PR fixes the current location.

Root cause

InputsMakerAsync.create_model_inputs_delta (and create_model_inputs_delta_valid_only) build:

kv_seqlens = [seq.num_all_ids + max_q_seqlen for seq in valid_seqs]
sum_kv_seqlen = sum(kv_seqlens) + batch_size * max_q_seqlen
max_kv_seqlen = max(kv_seqlens) + max_q_seqlen

num_all_ids is one decode step stale at delta-build time — EngineLoop._prefetch_next_inputs (engine_loop.py:435) runs before _finish_forward_output (:438) advances the SchedulerSequence (per @grimoire) — so the + max_q_seqlen in the comprehension is needed to recover this forward's kv length. The bug is the second + max_q_seqlen in the reductions, plus batch_size = len(self.running_seqs), which counts scheduler-dropped invalid seqs. The result over-counts max_kv_seqlen by max_q_seqlen (scaling with spec-decode / MTP num_decode_tokens).

Per @grimoire these values feed the attention grid + kv-cache resource allocation, so the effect is conservative over-allocation (minor waste, no accuracy/perf impact).

Fix

Reduce over kv_seqlens directly (the + max_q_seqlen is already applied once in the comprehension):

kv_seqlens = [seq.num_all_ids + max_q_seqlen for seq in valid_seqs]
sum_kv_seqlen = sum(kv_seqlens)
max_kv_seqlen = max(kv_seqlens)

Same fix in create_model_inputs_delta_valid_only.

Test

tests/pytorch/engine/test_inputs_maker.py::test_create_model_inputs_delta_valid_only_matches_one_decode_advance (GPU-free; 2 valid + 1 scheduler-dropped seq), parametrized over max_q_seqlen ∈ {1, 4} (standard decode + spec/MTP). The expected value is derived from the engine's own one-decode-step advanceModelInputs.step (model_inputs.py) / get_model_inputs_next_decoding (strategies/ar/model_inputs.py): max_kv += max_q_seqlen, sum_kv += num_valid * max_q_seqlen — not a hardcoded number. It asserts the delta equals base_kv (num_all_ids) + exactly one advance, and parametrizing proves the offset is one max_q_seqlen, rejecting both the old + 2*max_q and a + 0 regression:

  • WITH fix: 2 passed (full file: 11 passed).
  • WITHOUT fix (original): 2 failed — e.g. assert 258 == 250 + 4 (max_q=4).

pre-commit (ruff-check + docformatter) clean.

Note: the red lint check is a flaky shields.io badge 503/timeout in README.md / README_zh-CN.md — the pre-commit "Linting" step passed and this PR only touches .py files.

Dup-check

Issue open, unassigned. No open PR touches this logic (#4679/#4534/#4631 leave the delta builders untouched). cc @grimoire.


AI-assisted (Claude); every line human-reviewed.

@grimoire

Copy link
Copy Markdown
Collaborator

Thanks for the PR!

seq.num_all_ids can be stale when delta is built. EngineLoop prefetches the next inputs before waiting for the current output and before _finish_forward_output() updates the SchedulerSequence(read _main_loop_get_outputs in engine_loop for more details).
The model-agent later applies this delta after its internal StepInputs has advanced, so bare seq.num_all_ids can be one decode query behind the ModelInputs actually used for the next forward. That makes delta.max_kv_seqlen smaller than history_lengths + seq_length, and the smaller value is passed to attention metadata.

A better solution would be:

kv_seqlens = [seq.num_all_ids + max_q_seqlen for seq in valid_seqs]
sum_kv_seqlen = sum(kv_seqlens)
max_kv_seqlen = max(kv_seqlens)

@waynehacking8 waynehacking8 force-pushed the wayne/fix-4024-delta-kv-seqlen branch from 4c84f26 to df2a1b4 Compare June 17, 2026 03:48
@waynehacking8

Copy link
Copy Markdown
Contributor Author

Thanks @grimoire — you're right, that's the detail I missed. Adopted your suggestion in df2a1b4:

kv_seqlens = [seq.num_all_ids + max_q_seqlen for seq in valid_seqs]
sum_kv_seqlen = sum(kv_seqlens)
max_kv_seqlen = max(kv_seqlens)

num_all_ids is one decode query stale at delta-build time (EngineLoop prefetches before _finish_forward_output() advances the sequence), so the + max_q_seqlen in the comprehension is needed to recover this forward's kv length — the bug was adding it a second time in the reductions (plus batch_size, which counts scheduler-dropped invalid seqs). Applied the same fix to create_model_inputs_delta_valid_only.

Added a GPU-free regression test (2 valid + 1 dropped seq, max_q_seqlen=4): max_kv_seqlen == 254 / sum_kv_seqlen == 358; the old code gave 258 / 370.

(FYI the red lint check is the flaky shields.io badge 503 in README.md / README_zh-CN.md — the pre-commit "Linting" step passed; this PR is .py-only.)

create_model_inputs_delta / create_model_inputs_delta_valid_only build
kv_seqlens as [seq.num_all_ids + max_q_seqlen]. num_all_ids can be one decode
step stale here -- EngineLoop prefetches the next inputs before
_finish_forward_output() advances the sequence -- so the +max_q_seqlen recovers
this forward's kv length. But the reductions then added max_q_seqlen a SECOND
time and used batch_size = len(self.running_seqs), which counts
scheduler-dropped invalid seqs:

    sum_kv_seqlen = sum(kv_seqlens) + batch_size * max_q_seqlen
    max_kv_seqlen = max(kv_seqlens) + max_q_seqlen

so max_kv_seqlen / sum_kv_seqlen were over-counted (max by max_q_seqlen,
scaling with spec/MTP num_decode_tokens), over-allocating the attention grid +
kv-cache resources.

Reduce over kv_seqlens directly; the +max_q_seqlen is already applied once in
the comprehension.

Fixes InternLM#4024

Co-authored-by: Claude <noreply@anthropic.com>
@waynehacking8 waynehacking8 force-pushed the wayne/fix-4024-delta-kv-seqlen branch from df2a1b4 to 3c61d26 Compare June 17, 2026 03:59
@waynehacking8

Copy link
Copy Markdown
Contributor Author

Also strengthened the regression test (3c61d26): instead of hard-coded expected values, it now derives them from the engine's own one-decode-step advance (ModelInputs.step / get_model_inputs_next_decoding: max_kv += max_q_seqlen) and is parametrized over max_q_seqlen ∈ {1, 4} — so it asserts delta == base_kv (num_all_ids) + exactly one advance and proves the offset is one max_q_seqlen (rejecting both the old +2*max_q and a +0 regression). Verified 2 passed with the fix, 2 failed against the original.

@grimoire grimoire left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an arithmetic bug in the PyTorch engine’s decode “delta” fast-path where max_kv_seqlen / sum_kv_seqlen were over-counted (double-applying + max_q_seqlen and incorrectly including scheduler-dropped sequences). This aligns the delta builders with the engine’s “advance exactly one decode step” invariant used by ModelInputs.step() / get_model_inputs_next_decoding().

Changes:

  • Correct sum_kv_seqlen / max_kv_seqlen reductions in InputsMakerAsync.create_model_inputs_delta() and create_model_inputs_delta_valid_only() to reduce directly over the already-advanced kv_seqlens.
  • Add a regression unit test covering scheduler-dropped sequences and multiple max_q_seqlen values (standard decode vs spec/MTP-like multi-token decode).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
lmdeploy/pytorch/engine/inputs_maker.py Fixes double-counting of KV sequence lengths in decode delta builders (and avoids counting invalid/dropped seqs).
tests/pytorch/engine/test_inputs_maker.py Adds a regression test verifying delta KV lengths match “base + exactly one decode advance” for valid seqs only.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@RunningLeon RunningLeon left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lvhan028 lvhan028 merged commit 00d5dc8 into InternLM:main Jun 18, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] ModelInputs.split()中对max_kv_seqlen和sum_kv_seqlen的计算可能存在问题

5 participants