feat(turbomind): memory allocator, object cache, and scheduler integration by lzhangzz · Pull Request #4717 · InternLM/lmdeploy

lzhangzz · 2026-06-29T06:04:29Z

Summary

Introduces a new memory subsystem and an object-cache–based scheduler for TurboMind, replacing the legacy BlockManager/SequenceManager machinery with a unified caching runtime and integrating recurrent (GDN) state management.

What's done

Memory subsystem

New page/slab/object allocator stack: page-granular backing storage, slab sub-allocation, and typed object allocation with caching.
Allocator statistics tracking and logging.
Extensive unit-test coverage for the allocators.
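
To make the slab layer concrete, here is a minimal Python sketch of slab sub-allocation: one page is carved into fixed-size slots, and a free list hands them out. The class name `SlabSketch`, the slot layout, and the LIFO reuse order are assumptions for illustration only; the PR's C++ allocators are not shown on this page and may be organized differently.

```python
class SlabSketch:
    """Hypothetical slab: fixed-size slots carved out of a single page."""

    def __init__(self, page_bytes: int, slot_bytes: int) -> None:
        if slot_bytes <= 0 or slot_bytes > page_bytes:
            raise ValueError("slot size must be positive and fit in one page")
        self.slot_bytes = slot_bytes
        self.capacity = page_bytes // slot_bytes
        # Free slot indices, stored so that pop() yields the lowest index first.
        # Freed slots are appended, so the most recently freed slot is reused next.
        self._free = list(range(self.capacity - 1, -1, -1))

    def alloc(self):
        """Return a byte offset into the page, or None when the slab is full."""
        if not self._free:
            return None
        return self._free.pop() * self.slot_bytes

    def free(self, offset: int) -> None:
        """Return the slot that starts at `offset` to the free list."""
        self._free.append(offset // self.slot_bytes)

    @property
    def used(self) -> int:
        """Number of slots currently handed out (a simple allocator statistic)."""
        return self.capacity - len(self._free)
```

A typed object allocator would sit on top of this by picking a slab whose slot size fits the object, and a page allocator would sit underneath by supplying the pages.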

Scheduler & cache runtime

New object-cache–based scheduler replacing the old block/sequence managers.
Prefix caching via trie lookup and checkpoint caching.
Composite cache objects managed through a cache registry.
Configurable cache boundary policies for reuse.
Stateless-only request lifecycle with updated batch-op contracts.
Allocator stats surfaced through the engine, with refreshed engine contract documentation.
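
As a rough illustration of prefix caching via trie lookup, the sketch below keys each trie edge by one full block of token ids and reports how many leading tokens of a new prompt are already covered. `ToyPrefixTrie` and its methods are hypothetical and deliberately much simpler than the PR's scheduler: there is no eviction, no checkpoint caching, and no cache registry.

```python
class ToyPrefixTrie:
    """Hypothetical block-granular prefix trie (illustration only)."""

    def __init__(self, block_size: int) -> None:
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self.block_size = block_size
        self.root = {}

    def _blocks(self, tokens):
        # Only full blocks are indexed; a trailing partial block is ignored.
        bs = self.block_size
        return [tuple(tokens[i * bs:(i + 1) * bs]) for i in range(len(tokens) // bs)]

    def insert(self, tokens) -> None:
        """Index every full block of `tokens` along one path of the trie."""
        node = self.root
        for key in self._blocks(tokens):
            node = node.setdefault(key, {})

    def match(self, tokens) -> int:
        """Return the number of leading tokens covered by cached full blocks."""
        node = self.root
        matched = 0
        for key in self._blocks(tokens):
            node = node.get(key)
            if node is None:
                break
            matched += self.block_size
        return matched
```

A request that matches `n` tokens only needs to run prefill over the remaining `len(tokens) - n` tokens, which is where the reuse comes from.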

Recurrent (GDN) state & attention

GDN frontier handling and block-grid recurrent state management.
Block-grid adaptations across the attention and KV-cache path.

Generation & sampling

Updated generation, sampling, logits processing, guided decoding, and stop criteria to match the new request and output flow.

Cleanup

Removed the legacy BlockManager, BlockTrie, SequenceManager, and their tests, now superseded by the memory subsystem and new scheduler.


Copilot

Pull request overview

This PR modernizes TurboMind’s runtime by introducing a unified object-cache/memory subsystem and integrating it into scheduling/execution, while removing legacy session/stateful paths and updating Python bindings + tests to reflect TurboMind’s stateless-only behavior.

Changes:

Integrates an object-cache budget/registry + allocator into TurboMind engine construction and updates model/engine wiring accordingly.
Removes/rewires legacy session-oriented APIs and request plumbing (Python bindings, model code paths) toward stateless-only requests.
Updates output/scheduler-side data plumbing (e.g., CE-loss/output handling, request containers) and adjusts pipeline tests for TurboMind behavior.

Reviewed changes

Copilot reviewed 83 out of 83 changed files in this pull request and generated 2 comments.


| File | Description |
| --- | --- |
| tests/test_lmdeploy/test_pipeline.py | Updates session-related streaming test expectations; TurboMind now rejects interactive multi-turn requests. |
| src/turbomind/turbomind.cc | Adds object-cache budget computation + allocator/registry setup; adjusts engine/model construction and warmup session flags. |
| src/turbomind/python/bind.cpp | Simplifies `SessionParam` (removes start/end flags) and removes the Python-exposed `ModelRequest.end()` API. |
| src/turbomind/models/qwen3_5vit/qwen3_5vit.cc | Removes legacy session-flag guard and migrates from `RequestCache` to `Sequence` request plumbing. |
| src/turbomind/models/output_processor.cc | Refactors output range bookkeeping and CE-loss segment capture to avoid executor-side `Sequence` access. |
| src/turbomind/kernels/sampling_kernels.cu | Updates sampling RNG usage to rely on per-row RNG-state indices (new pointer input). |
| src/turbomind/engine/request.h | Adds a new engine status code (`kOutOfMemory = 11`) in the request status enum. |
| lmdeploy/turbomind/turbomind.py | Maintains Python-side mapping from engine status codes to `ResponseType` for user-visible responses. |
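
The status-code mapping described for `lmdeploy/turbomind/turbomind.py` can be pictured roughly as follows. Only the value `kOutOfMemory = 11` and the name `ResponseType.OUT_OF_MEMORY` come from this PR page; the enum below is a stand-in, and the `0` success code, the fallback member, and the helper name `to_response_type` are assumptions rather than lmdeploy's actual definitions.

```python
from enum import Enum


class ResponseType(Enum):
    """Stand-in enum; the real lmdeploy ResponseType has its own members."""
    SUCCESS = "success"
    INTERNAL_ENGINE_ERROR = "internal_engine_error"  # assumed fallback member
    OUT_OF_MEMORY = "out_of_memory"


K_OK = 0              # assumption: success is status 0
K_OUT_OF_MEMORY = 11  # value quoted in the file summary above

_STATUS_TO_RESPONSE = {
    K_OK: ResponseType.SUCCESS,
    K_OUT_OF_MEMORY: ResponseType.OUT_OF_MEMORY,
}


def to_response_type(status: int) -> ResponseType:
    """Translate an engine status code; unknown codes fall back to a generic error."""
    return _STATUS_TO_RESPONSE.get(status, ResponseType.INTERNAL_ENGINE_ERROR)
```

Keeping the table in one place means a newly added engine status only needs a single new entry to become visible to users.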


Squash of 44 commits (749775f..489cfdf1) overhauling TurboMind prefix caching for native-vision VLM and making cache behavior configurable.

VLM prefix caching (native Qwen3.5 ViT):
- Add Fingerprint type and hash fold for per-image identity
- Fingerprint-aware PrefixKey fold and PrefixTrie identity/search, bounding Search by fp_pos.size() to avoid OOB
- Plumb image fingerprint from Python into Qwen3_5VitItem; compute it in to_turbomind_multimodal; fill multimodal_spans and log ViT-skip on cache hit
- Fold per-image fingerprints into prefix matching/indexing via MultiModalSpan projection on Sequence
- Enable prefix caching for native-vision TurboMind VLM in serve
- Unit tests for Fingerprint identity and PrefixTrie image_fps match

Prefix-cache interface overhaul:
- Introduce CacheMode enum, parser, and publish-gate helper
- Scheduler CacheMode modes with image-gated auto prompt boundary; drop CacheBoundaryPolicy in favor of prompt_boundary helper
- Add cache_prompt_boundary_skip knob (default 1, legacy boundary)
- Publish prompt-boundary node at prompt_len - cache_prompt_boundary_skip
- Rename config to cache_prompt/cache_generation/cache_checkpoint_interval across EngineConfig, TurbomindEngineConfig, scripts, and docs

Robustness fixes:
- Map kOutOfMemory status to ResponseType.OUT_OF_MEMORY
- Bound PrefixTrie::Search by fp_pos.size() to avoid OOB

Docs/contracts synced to cache_prompt/cache_generation modes; superseded interim docs, scripts, and tests removed.
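
Two ideas in this commit message lend themselves to a small sketch: the prompt-boundary position (`prompt_len - cache_prompt_boundary_skip`, default skip 1) and folding per-image fingerprints into a prefix key so that identical placeholder tokens with different images do not collide. Both helpers below are hypothetical. The clamp at zero, the SHA-256 chaining, and the byte encoding are assumptions made for the example, not the PR's `PrefixKey` or `Fingerprint` implementation.

```python
import hashlib

DEFAULT_BOUNDARY_SKIP = 1  # default stated in the commit message


def prompt_boundary_position(prompt_len: int, skip: int = DEFAULT_BOUNDARY_SKIP) -> int:
    """Token position of the prompt-boundary node: prompt_len - skip.

    Clamping at 0 is an assumption for prompts shorter than `skip`.
    """
    if prompt_len < 0 or skip < 0:
        raise ValueError("prompt_len and skip must be non-negative")
    return max(prompt_len - skip, 0)


def fold_prefix_key(parent_key: bytes, block_tokens, image_fps=()) -> bytes:
    """Chain a parent key with one block's tokens and any image fingerprints.

    Assumed scheme: SHA-256 over parent key, then token ids, then fingerprints.
    """
    h = hashlib.sha256()
    h.update(parent_key)
    for tok in block_tokens:
        h.update(int(tok).to_bytes(8, "little", signed=True))
    for fp in image_fps:
        h.update(b"\x00img")  # separator so fingerprints cannot alias token bytes
        h.update(fp)
    return h.digest()
```

With the default skip of 1, a 128-token prompt would publish its boundary node at position 127, and two blocks with the same token ids but different image fingerprints fold to different keys.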

Publish took a t0 but scanned from block 0 every commit, unlike its SetProducers/CheckProducers siblings which iterate [t0/bs, ceil(end/bs)). The prefix work was dead: this request only marks/clears producers on [t0/bs, ...) within the same pass, and every indexed block below t0 is already valid (Resume advances resume_len only over valid prefix; the in-flight [resume_len, t0) region was published at the prior forward's commit). The block straddling t0 sits at index t0/bs, so it is still processed and publication semantics are unchanged. This drops decode publication from ~end/bs iterations per step to the newly-covered tail, and stops unconditionally writing is_valid on blocks the pass never proved allocated.
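
The block range `[t0/bs, ceil(end/bs))` from the message above can be worked through in a few lines. The helper name `publish_block_range` is hypothetical; `t0`, `end`, and `bs` (block size) follow the commit message.

```python
def publish_block_range(t0: int, end: int, bs: int) -> range:
    """Block indices covering tokens [t0, end): from t0 // bs up to ceil(end / bs)."""
    if bs <= 0 or t0 < 0 or end < t0:
        raise ValueError("expected bs > 0 and 0 <= t0 <= end")
    first = t0 // bs        # the block straddling t0 is still included
    last = -(-end // bs)    # integer ceiling division
    return range(first, last)


# One decode step at position 1000 with 64-token blocks touches a single block,
# whereas scanning from block 0 would visit all 16 blocks below `end`.
decode_step = publish_block_range(1000, 1001, 64)
full_scan = publish_block_range(0, 1001, 64)
```

Here `decode_step` contains only block 15 while `full_scan` has 16 entries, which matches the claim that per-step publication shrinks from about `end/bs` iterations to the newly covered tail.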

…nning

When no request could be admitted, the scheduler left an empty active batch and the engine resubmitted empty batches indefinitely, so a request too large for the cache never received a terminal status (kOutOfMemory was defined but never assigned). Add FailStalledHeadOfLine: on a pass that admits nothing with no in-flight work remaining, fail the highest-priority eligible request with kOutOfMemory so it retires, releases its cache, and lower-priority requests can proceed. The inflight check distinguishes a genuine deadlock from a transient async drain. Document the guarantee as the README forward-progress contract.
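
The forward-progress rule described above can be sketched as a small decision function. Everything here is a simplified stand-in: the `Request` dataclass, the "lower number means higher priority" convention, and the function signature are assumptions, and the real FailStalledHeadOfLine operates on the scheduler's own state.

```python
from dataclasses import dataclass

K_OUT_OF_MEMORY = 11  # status value quoted elsewhere on this page


@dataclass
class Request:
    rid: int
    priority: int  # assumption: a lower value means higher priority
    status: int = 0


def fail_stalled_head_of_line(waiting, admitted_count: int, inflight_count: int):
    """Fail the highest-priority waiting request if the pass is truly stalled.

    A pass is stalled only when it admitted nothing AND nothing is in flight;
    in-flight work means cache may still be released (a transient drain).
    Returns the failed request, or None when no action is taken.
    """
    if admitted_count > 0 or inflight_count > 0 or not waiting:
        return None
    head = min(waiting, key=lambda r: r.priority)
    head.status = K_OUT_OF_MEMORY
    waiting.remove(head)  # retiring the request lets lower-priority ones proceed
    return head
```

The `inflight_count` guard is the important part: without it, a request could be failed while an asynchronous batch is about to free enough cache for it.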


lzhangzz added 2 commits June 29, 2026 06:01

feat(turbomind): memory allocator/scheduler + origin/main merge, vit …

6e509d9

…& get_ppl

fix lint

b031df7

lzhangzz changed the title ~~# feat(turbomind): memory allocator, object cache, and scheduler integration~~ feat(turbomind): memory allocator, object cache, and scheduler integration Jun 29, 2026

lvhan028 requested a review from Copilot June 29, 2026 07:41

lvhan028 added the enhancement New feature or request label Jun 29, 2026

Copilot started reviewing on behalf of lvhan028 June 29, 2026 07:41

Copilot AI reviewed Jun 29, 2026


Comment thread src/turbomind/kernels/sampling_kernels.cu

Comment thread lmdeploy/turbomind/turbomind.py

lzhangzz added 8 commits June 29, 2026 08:24

fix: use stateless API for chat CLI

795fd47

feat: support images in chat cli

749775f

chore: fix lint

616ed10

fix(turbomind): add MSVC-compatible ceil_log2 implementation

791b68f

__builtin_clz is a GCC/Clang builtin unavailable on MSVC. Provide a separate _BitScanReverse-based implementation for MSVC instead of branching inside the function.
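
For readers unfamiliar with the bit trick, the quantity being computed is the smallest `k` with `2**k >= n`. The Python sketch below is only a reference for those semantics and is not the PR's C++ code: `int.bit_length()` plays the role that `32 - __builtin_clz(x)` or `_BitScanReverse` index plus one would play for a nonzero 32-bit `x`, and the handling of `n < 1` is an assumption.

```python
def ceil_log2(n: int) -> int:
    """Smallest k such that 2**k >= n, for n >= 1.

    (n - 1).bit_length() is the position just above the highest set bit of
    n - 1, which equals ceil(log2(n)); for n == 1 it is 0.
    """
    if n < 1:
        raise ValueError("ceil_log2 is defined here only for n >= 1")
    return (n - 1).bit_length()
```

The explicit check mirrors a concern that also applies to the intrinsics: `__builtin_clz(0)` is undefined and `_BitScanReverse` reports failure on a zero mask, so the zero case needs separate handling on either compiler.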

refactor(turbomind): lazy-eval scheduler cache logs at warning level

eae778f

Demote cache admission/resume/defer/publish logs to kWarning and build messages only when logging is enabled; guard LogResume to resuming sequences.
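
The "build messages only when logging is enabled" pattern looks roughly like this in Python's standard `logging` module. The logger name and the `log_admit` helper are hypothetical; the PR applies the idea to TurboMind's C++ logger at `kWarning`, whose API is not shown on this page.

```python
import logging

logger = logging.getLogger("scheduler_sketch")  # hypothetical logger name


def log_admit(seq_id: int, describe) -> bool:
    """Log an admission event, calling `describe()` only if WARNING is enabled.

    `describe` is a zero-argument callable that builds the (possibly expensive)
    message detail. Returns True when a message was actually emitted.
    """
    if not logger.isEnabledFor(logging.WARNING):
        return False  # skip message construction entirely
    logger.warning("admit seq=%d %s", seq_id, describe())
    return True
```

Passing a callable instead of a pre-built string is what makes the evaluation lazy: when the level is filtered out, the cost is a single level check per call.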

lzhangzz wants to merge 10 commits into InternLM:main from lzhangzz:memory-1b


3 participants
