[Bugfix] Fix double-counted max_q_seqlen in decode delta kv_seqlens by waynehacking8 · Pull Request #4685 · InternLM/lmdeploy

waynehacking8 · 2026-06-16T22:14:54Z

Summary

Fixes #4024 (confirmed by @grimoire, who reviewed and pinned the exact fix). The decode "delta" fast-path over-counts max_kv_seqlen / sum_kv_seqlen.

The issue points at ModelInputs.split() (since refactored away); the same arithmetic error now lives in the decode delta builders, so this PR fixes the current location.

Root cause

InputsMakerAsync.create_model_inputs_delta (and create_model_inputs_delta_valid_only) build:

kv_seqlens = [seq.num_all_ids + max_q_seqlen for seq in valid_seqs]
sum_kv_seqlen = sum(kv_seqlens) + batch_size * max_q_seqlen
max_kv_seqlen = max(kv_seqlens) + max_q_seqlen

num_all_ids is one decode step stale at delta-build time — EngineLoop._prefetch_next_inputs (engine_loop.py:435) runs before _finish_forward_output (:438) advances the SchedulerSequence (per @grimoire) — so the + max_q_seqlen in the comprehension is needed to recover this forward's kv length. The bug is the second + max_q_seqlen in the reductions, plus batch_size = len(self.running_seqs), which counts scheduler-dropped invalid seqs. The result over-counts max_kv_seqlen by max_q_seqlen (scaling with spec-decode / MTP num_decode_tokens).

Per @grimoire these values feed the attention grid + kv-cache resource allocation, so the effect is conservative over-allocation (minor waste, no accuracy/perf impact).

Fix

Reduce over kv_seqlens directly (the + max_q_seqlen is already applied once in the comprehension):

kv_seqlens = [seq.num_all_ids + max_q_seqlen for seq in valid_seqs]
sum_kv_seqlen = sum(kv_seqlens)
max_kv_seqlen = max(kv_seqlens)

Same fix in create_model_inputs_delta_valid_only.

Test

tests/pytorch/engine/test_inputs_maker.py::test_create_model_inputs_delta_valid_only_matches_one_decode_advance (GPU-free; 2 valid + 1 scheduler-dropped seq), parametrized over max_q_seqlen ∈ {1, 4} (standard decode + spec/MTP). The expected value is derived from the engine's own one-decode-step advance — ModelInputs.step (model_inputs.py) / get_model_inputs_next_decoding (strategies/ar/model_inputs.py): max_kv += max_q_seqlen, sum_kv += num_valid * max_q_seqlen — not a hardcoded number. It asserts the delta equals base_kv (num_all_ids) + exactly one advance, and parametrizing proves the offset is one max_q_seqlen, rejecting both the old + 2*max_q and a + 0 regression:

WITH fix: 2 passed (full file: 11 passed).
WITHOUT fix (original): 2 failed — e.g. assert 258 == 250 + 4 (max_q=4).

pre-commit (ruff-check + docformatter) clean.

Note: the red lint check is a flaky shields.io badge 503/timeout in README.md / README_zh-CN.md — the pre-commit "Linting" step passed and this PR only touches .py files.

Dup-check

Issue open, unassigned. No open PR touches this logic (#4679/#4534/#4631 leave the delta builders untouched). cc @grimoire.

AI-assisted (Claude); every line human-reviewed.

grimoire · 2026-06-17T03:36:03Z

Thanks for the PR!

seq.num_all_ids can be stale when delta is built. EngineLoop prefetches the next inputs before waiting for the current output and before _finish_forward_output() updates the SchedulerSequence(read _main_loop_get_outputs in engine_loop for more details).
The model-agent later applies this delta after its internal StepInputs has advanced, so bare seq.num_all_ids can be one decode query behind the ModelInputs actually used for the next forward. That makes delta.max_kv_seqlen smaller than history_lengths + seq_length, and the smaller value is passed to attention metadata.

A better solution would be:

kv_seqlens = [seq.num_all_ids + max_q_seqlen for seq in valid_seqs]
sum_kv_seqlen = sum(kv_seqlens)
max_kv_seqlen = max(kv_seqlens)

waynehacking8 · 2026-06-17T03:48:46Z

Thanks @grimoire — you're right, that's the detail I missed. Adopted your suggestion in df2a1b4:

kv_seqlens = [seq.num_all_ids + max_q_seqlen for seq in valid_seqs]
sum_kv_seqlen = sum(kv_seqlens)
max_kv_seqlen = max(kv_seqlens)

num_all_ids is one decode query stale at delta-build time (EngineLoop prefetches before _finish_forward_output() advances the sequence), so the + max_q_seqlen in the comprehension is needed to recover this forward's kv length — the bug was adding it a second time in the reductions (plus batch_size, which counts scheduler-dropped invalid seqs). Applied the same fix to create_model_inputs_delta_valid_only.

Added a GPU-free regression test (2 valid + 1 dropped seq, max_q_seqlen=4): max_kv_seqlen == 254 / sum_kv_seqlen == 358; the old code gave 258 / 370.

(FYI the red lint check is the flaky shields.io badge 503 in README.md / README_zh-CN.md — the pre-commit "Linting" step passed; this PR is .py-only.)

create_model_inputs_delta / create_model_inputs_delta_valid_only build kv_seqlens as [seq.num_all_ids + max_q_seqlen]. num_all_ids can be one decode step stale here -- EngineLoop prefetches the next inputs before _finish_forward_output() advances the sequence -- so the +max_q_seqlen recovers this forward's kv length. But the reductions then added max_q_seqlen a SECOND time and used batch_size = len(self.running_seqs), which counts scheduler-dropped invalid seqs: sum_kv_seqlen = sum(kv_seqlens) + batch_size * max_q_seqlen max_kv_seqlen = max(kv_seqlens) + max_q_seqlen so max_kv_seqlen / sum_kv_seqlen were over-counted (max by max_q_seqlen, scaling with spec/MTP num_decode_tokens), over-allocating the attention grid + kv-cache resources. Reduce over kv_seqlens directly; the +max_q_seqlen is already applied once in the comprehension. Fixes InternLM#4024 Co-authored-by: Claude <noreply@anthropic.com>

waynehacking8 · 2026-06-17T04:00:30Z

Also strengthened the regression test (3c61d26): instead of hard-coded expected values, it now derives them from the engine's own one-decode-step advance (ModelInputs.step / get_model_inputs_next_decoding: max_kv += max_q_seqlen) and is parametrized over max_q_seqlen ∈ {1, 4} — so it asserts delta == base_kv (num_all_ids) + exactly one advance and proves the offset is one max_q_seqlen (rejecting both the old +2*max_q and a +0 regression). Verified 2 passed with the fix, 2 failed against the original.

grimoire

LGTM!

Copilot

Pull request overview

Fixes an arithmetic bug in the PyTorch engine’s decode “delta” fast-path where max_kv_seqlen / sum_kv_seqlen were over-counted (double-applying + max_q_seqlen and incorrectly including scheduler-dropped sequences). This aligns the delta builders with the engine’s “advance exactly one decode step” invariant used by ModelInputs.step() / get_model_inputs_next_decoding().

Changes:

Correct sum_kv_seqlen / max_kv_seqlen reductions in InputsMakerAsync.create_model_inputs_delta() and create_model_inputs_delta_valid_only() to reduce directly over the already-advanced kv_seqlens.
Add a regression unit test covering scheduler-dropped sequences and multiple max_q_seqlen values (standard decode vs spec/MTP-like multi-token decode).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
lmdeploy/pytorch/engine/inputs_maker.py	Fixes double-counting of KV sequence lengths in decode delta builders (and avoids counting invalid/dropped seqs).
tests/pytorch/engine/test_inputs_maker.py	Adds a regression test verifying delta KV lengths match “base + exactly one decode advance” for valid seqs only.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

RunningLeon

LGTM

waynehacking8 force-pushed the wayne/fix-4024-delta-kv-seqlen branch from 4c84f26 to df2a1b4 Compare June 17, 2026 03:48

waynehacking8 force-pushed the wayne/fix-4024-delta-kv-seqlen branch from df2a1b4 to 3c61d26 Compare June 17, 2026 03:59

grimoire approved these changes Jun 17, 2026

View reviewed changes

lvhan028 requested a review from Copilot June 17, 2026 14:35

Copilot started reviewing on behalf of lvhan028 June 17, 2026 14:35 View session

lvhan028 requested a review from RunningLeon June 17, 2026 14:36

Copilot AI reviewed Jun 17, 2026

View reviewed changes

RunningLeon approved these changes Jun 18, 2026

View reviewed changes

lvhan028 added the Bug:P1 label Jun 18, 2026

lvhan028 merged commit 00d5dc8 into InternLM:main Jun 18, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bugfix] Fix double-counted max_q_seqlen in decode delta kv_seqlens#4685

[Bugfix] Fix double-counted max_q_seqlen in decode delta kv_seqlens#4685
lvhan028 merged 1 commit into
InternLM:mainfrom
waynehacking8:wayne/fix-4024-delta-kv-seqlen

waynehacking8 commented Jun 16, 2026 •

edited

Loading

Uh oh!

grimoire commented Jun 17, 2026

Uh oh!

waynehacking8 commented Jun 17, 2026

Uh oh!

waynehacking8 commented Jun 17, 2026

Uh oh!

grimoire left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

RunningLeon left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

waynehacking8 commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Test

Dup-check

Uh oh!

grimoire commented Jun 17, 2026

Uh oh!

waynehacking8 commented Jun 17, 2026

Uh oh!

waynehacking8 commented Jun 17, 2026

Uh oh!

grimoire left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

RunningLeon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

waynehacking8 commented Jun 16, 2026 •

edited

Loading