Interleave long-context prefill chunks with decode by grimoire · Pull Request #4631 · InternLM/lmdeploy

grimoire · 2026-05-28T06:16:08Z

requirement

Refactor prefix caching for pytorch engine #4618

Interleave long-context prefill chunks with decoding. Real prefix caching will be done in a future PR.

Copilot

Pull request overview

This PR adds long-context chunk prefill interleaving with decode in the PyTorch engine loop, primarily by introducing a temporary KV allocation cap for non-final chunks and by updating the engine input policy to defer chunk forwards behind decode when appropriate. It also adjusts speculative decoding’s chunk carry behavior so interleaved decode does not disrupt pending chunk state, and adds targeted tests for these behaviors.
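The interleaving idea described above can be illustrated with a toy policy. This is only a sketch under stated assumptions: `ForwardKind`, `next_forward_kind`, and `simulate` are hypothetical names invented for illustration and are not taken from `inputs_maker.py`. The strict alternation rule is also an assumption; the actual policy in the PR may defer chunks under different conditions.

```python
from enum import Enum
from typing import List, Optional


class ForwardKind(Enum):
    """Hypothetical labels for the two kinds of forward pass."""
    CHUNK = "chunk"
    DECODE = "decode"


def next_forward_kind(last_kind: Optional[ForwardKind],
                      has_pending_chunk: bool,
                      has_running_decode: bool) -> Optional[ForwardKind]:
    """Pick the next forward kind, alternating when both kinds are available."""
    if has_pending_chunk and has_running_decode:
        # Assumed rule: after a chunk forward, let decode run once before the next chunk.
        if last_kind is ForwardKind.CHUNK:
            return ForwardKind.DECODE
        return ForwardKind.CHUNK
    if has_pending_chunk:
        return ForwardKind.CHUNK
    if has_running_decode:
        return ForwardKind.DECODE
    return None


def simulate(num_chunks: int, num_decode_steps: int) -> List[str]:
    """Run the toy policy until both work queues are drained."""
    schedule: List[str] = []
    last: Optional[ForwardKind] = None
    while num_chunks > 0 or num_decode_steps > 0:
        kind = next_forward_kind(last, num_chunks > 0, num_decode_steps > 0)
        if kind is ForwardKind.CHUNK:
            num_chunks -= 1
        else:
            num_decode_steps -= 1
        schedule.append(kind.value)
        last = kind
    return schedule


print(simulate(3, 2))
```

Tracking only the last forward kind keeps the policy stateless apart from one field, which matches the summary's note that the input maker "tracks last forward kind".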

Changes:

Add kv_token_limit to SchedulerSequence and apply it across block allocation and prefix-cache trie allocation to keep non-final long-context chunks from over-allocating KV.
Introduce scheduler support for reserving KV blocks chunk-by-chunk (reserve_long_context_chunk) and update the engine input maker to interleave decode between chunk forwards.
Add tests covering chunk/decode interleaving, chunk KV limiting (including sliding-window), and spec-agent chunk carry semantics.
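The effect of a temporary KV cap on block allocation can be sketched as follows. This is a minimal illustration, not the block manager's code: `required_blocks` and its parameters are hypothetical, and only the name `kv_token_limit` comes from the PR.

```python
from typing import Optional


def required_blocks(num_tokens: int,
                    block_size: int,
                    allocated_blocks: int,
                    kv_token_limit: Optional[int] = None) -> int:
    """Return how many new KV blocks a sequence needs.

    When kv_token_limit is set, the sequence may only own KV for that many
    tokens, so the block count is computed from the clamped token count.
    """
    if kv_token_limit is not None:
        num_tokens = min(num_tokens, kv_token_limit)
    total_blocks = -(-num_tokens // block_size)  # ceiling division
    return max(0, total_blocks - allocated_blocks)


# A 10000-token prompt with 64-token blocks.
print(required_blocks(10000, 64, 0))                        # no cap
print(required_blocks(10000, 64, 0, kv_token_limit=2048))   # first chunk only
print(required_blocks(10000, 64, 32, kv_token_limit=4096))  # growing to the second chunk
```

Without the cap, the first chunk of a long prompt would claim blocks for the whole prompt up front; with it, each non-final chunk only claims what it is about to fill.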

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.


| File | Description |
| --- | --- |
| `lmdeploy/pytorch/engine/inputs_maker.py` | Implements the interleaving policy between long-context chunk forwards and decode; tracks last forward kind. |
| `lmdeploy/pytorch/messages.py` | Adds `SchedulerSequence.kv_token_limit` to represent temporary KV ownership bounds. |
| `lmdeploy/pytorch/paging/block_manager/default_block_manager.py` | Clamps required block computation by `kv_token_limit`. |
| `lmdeploy/pytorch/paging/block_manager/window_block_manager.py` | Updates sliding-window required-block math to remain valid under `kv_token_limit`. |
| `lmdeploy/pytorch/paging/block_trie.py` | Prevents prefix-cache trie allocation beyond `kv_token_limit`. |
| `lmdeploy/pytorch/paging/scheduler.py` | Applies chunk KV limits during prefill scheduling and adds `reserve_long_context_chunk` for incremental KV growth. |
| `lmdeploy/pytorch/paging/seq_states/states.py` | Clears `kv_token_limit` when freeing sequences. |
| `lmdeploy/pytorch/spec_decode/spec_agent.py` | Keeps long-chunk carry state across interleaved decode (and DP dummy placeholders). |
| `tests/pytorch/engine/test_inputs_maker.py` | Adds interleaving-policy tests and fakes needed to exercise it. |
| `tests/pytorch/paging/test_block_manager.py` | Adds coverage for `kv_token_limit` behavior under windowed allocation. |
| `tests/pytorch/paging/test_scheduler.py` | Adds coverage for decode growth reclaiming, chunk-limited prefill scheduling, and reservation behavior. |
| `tests/pytorch/spec_decode/test_spec_agent.py` | Adds coverage for chunk carry preservation/clearing across decode and prefill cases. |


+        evictable = self.hanging + self.waiting
+        if not self.eviction_helper.evict_for_seq(seq, evictable, prealloc_size):
+            seq.kv_token_limit = old_kv_token_limit
+            return False
+
+        self.block_manager.allocate(seq, prealloc_size)
+        self.block_trie.allocate(seq)
+        return True
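The rollback pattern visible in the diff above (raise the limit, try to make room, restore the old limit on failure) can be sketched with toy classes. Everything here is hypothetical and simplified: `ToySeq`, `ToyScheduler`, and `reserve_chunk` are invented for illustration, and eviction is replaced by a plain free-block counter, so this does not reproduce `reserve_long_context_chunk` or `evict_for_seq`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySeq:
    """Hypothetical stand-in for a scheduler sequence."""
    num_tokens: int
    allocated_blocks: int = 0
    kv_token_limit: Optional[int] = None


class ToyScheduler:
    """Hypothetical scheduler with a fixed pool of free KV blocks (no eviction)."""

    def __init__(self, free_blocks: int, block_size: int) -> None:
        self.free_blocks = free_blocks
        self.block_size = block_size

    def _required(self, seq: ToySeq) -> int:
        tokens = seq.num_tokens
        if seq.kv_token_limit is not None:
            tokens = min(tokens, seq.kv_token_limit)
        return max(0, -(-tokens // self.block_size) - seq.allocated_blocks)

    def reserve_chunk(self, seq: ToySeq, new_limit: int) -> bool:
        """Raise the KV limit for the next chunk; roll back if blocks are unavailable."""
        old_limit = seq.kv_token_limit
        seq.kv_token_limit = new_limit
        need = self._required(seq)
        if need > self.free_blocks:
            # Restore the previous limit so the sequence state stays consistent.
            seq.kv_token_limit = old_limit
            return False
        self.free_blocks -= need
        seq.allocated_blocks += need
        return True


sched = ToyScheduler(free_blocks=4, block_size=64)
seq = ToySeq(num_tokens=1000)
print(sched.reserve_chunk(seq, 128), seq.kv_token_limit, sched.free_blocks)
print(sched.reserve_chunk(seq, 256), seq.kv_token_limit, sched.free_blocks)
print(sched.reserve_chunk(seq, 512), seq.kv_token_limit, sched.free_blocks)
```

Restoring the old limit on failure means a rejected reservation leaves the sequence exactly as it was, so the caller can retry later without any cleanup.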


grimoire force-pushed the refactor-chunked-prefill branch 3 times, most recently from 0f3284c to c76bae2 Compare June 8, 2026 04:11

grimoire force-pushed the refactor-chunked-prefill branch from c76bae2 to a5cb8a7 Compare June 12, 2026 12:46

grimoire added 2 commits June 16, 2026 15:57

first slice chunked prefill

d6e7baf

update dp ep mpt

24d5eea

grimoire force-pushed the refactor-chunked-prefill branch from a5cb8a7 to 24d5eea Compare June 16, 2026 08:32

waynehacking8 mentioned this pull request Jun 16, 2026

[Bugfix] Fix double-counted max_q_seqlen in decode delta kv_seqlens #4685

Open

grimoire changed the title ~~[WIP] Interleave long-context prefill chunks with decode~~ Interleave long-context prefill chunks with decode Jun 17, 2026

grimoire added 2 commits June 17, 2026 12:28

force last chunk prefill

20a54a2

fix lint

49cac47

grimoire marked this pull request as ready for review June 17, 2026 06:23

Copilot AI review requested due to automatic review settings June 17, 2026 06:23

Copilot started reviewing on behalf of grimoire June 17, 2026 06:24

Copilot AI reviewed Jun 17, 2026

grimoire wants to merge 4 commits into InternLM:main from grimoire:refactor-chunked-prefill
