feat(turbomind): memory allocator, object cache, and scheduler integration#4717

Open
lzhangzz wants to merge 10 commits into InternLM:main from lzhangzz:memory-1b

@lzhangzz

Collaborator

Summary

Introduces a new memory subsystem and an object-cache–based scheduler for TurboMind, replacing the legacy BlockManager/SequenceManager machinery with a unified caching runtime and integrating recurrent (GDN) state management.

What's done

Memory subsystem

  • New page/slab/object allocator stack: page-granular backing storage, slab sub-allocation, and typed object allocation with caching.
  • Allocator statistics tracking and logging.
  • Extensive unit-test coverage for the allocators.
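To make the layering concrete, here is a minimal Python sketch of slab sub-allocation on top of page-granular storage. It is illustrative only: `PAGE_SIZE`, `SlabAllocator`, and the LIFO free-list policy are assumptions made for this example, not the PR's C++ implementation.

```python
PAGE_SIZE = 4096  # hypothetical page size in bytes


class SlabAllocator:
    """Carves fixed-size objects out of whole pages (illustrative only)."""

    def __init__(self, object_size: int):
        if not 0 < object_size <= PAGE_SIZE:
            raise ValueError("object_size must fit in one page")
        self.object_size = object_size
        self.per_page = PAGE_SIZE // object_size
        self.num_pages = 0
        self.free = []  # byte offsets of free slots, reused LIFO

    def allocate(self) -> int:
        if not self.free:
            # Grow by one page and split it into equal slots.
            base = self.num_pages * PAGE_SIZE
            self.num_pages += 1
            self.free.extend(
                base + i * self.object_size
                for i in reversed(range(self.per_page))
            )
        return self.free.pop()

    def release(self, offset: int) -> None:
        self.free.append(offset)


slab = SlabAllocator(object_size=1024)
a = slab.allocate()
b = slab.allocate()
slab.release(a)
c = slab.allocate()  # reuses the slot that was just released
```

In this sketch a released slot is handed out again before any new page is requested, which is the property that makes slab allocation cheap for many same-sized objects.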

Scheduler & cache runtime

  • New object-cache–based scheduler replacing the old block/sequence managers.
  • Prefix caching via trie lookup and checkpoint caching.
  • Composite cache objects managed through a cache registry.
  • Configurable cache boundary policies for reuse.
  • Stateless-only request lifecycle with updated batch-op contracts.
  • Allocator stats surfaced through the engine, with refreshed engine contract documentation.
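As a rough illustration of prefix caching via trie lookup, the sketch below indexes token sequences one full block at a time and reports how many leading tokens of a new request are already cached. The class name, the dict-based nodes, and the block size are assumptions for this example; the PR's actual trie is not shown on this page.

```python
class PrefixTrie:
    """Block-granular trie keyed by tuples of token ids (illustrative only)."""

    def __init__(self, block_size: int):
        self.block_size = block_size
        self.root = {}

    def _blocks(self, tokens):
        bs = self.block_size
        # Only full blocks are indexed; a trailing partial block is ignored.
        for i in range(0, len(tokens) - len(tokens) % bs, bs):
            yield tuple(tokens[i:i + bs])

    def insert(self, tokens) -> None:
        node = self.root
        for key in self._blocks(tokens):
            node = node.setdefault(key, {})

    def match(self, tokens) -> int:
        """Return the number of leading tokens that are already cached."""
        node, matched = self.root, 0
        for key in self._blocks(tokens):
            if key not in node:
                break
            node = node[key]
            matched += self.block_size
        return matched


trie = PrefixTrie(block_size=4)
trie.insert([1, 2, 3, 4, 5, 6, 7, 8, 9])
hit = trie.match([1, 2, 3, 4, 5, 6, 7, 0, 0, 0])  # only the first block matches
```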

Recurrent (GDN) state & attention

  • GDN frontier handling and block-grid recurrent state management.
  • Block-grid adaptations across the attention and KV-cache path.

Generation & sampling

  • Updated generation, sampling, logits processing, guided decoding, and stop criteria to match the new request and output flow.

Cleanup

  • Removed the legacy BlockManager, BlockTrie, SequenceManager, and their tests, now superseded by the memory subsystem and new scheduler.

@lzhangzz lzhangzz changed the title from "# feat(turbomind): memory allocator, object cache, and scheduler integration" to "feat(turbomind): memory allocator, object cache, and scheduler integration" Jun 29, 2026
@lvhan028 lvhan028 requested a review from Copilot June 29, 2026 07:41
@lvhan028 lvhan028 added the enhancement New feature or request label Jun 29, 2026

Copilot AI left a comment

Contributor


Pull request overview

This PR modernizes TurboMind’s runtime by introducing a unified object-cache/memory subsystem and integrating it into scheduling/execution, while removing legacy session/stateful paths and updating Python bindings + tests to reflect TurboMind’s stateless-only behavior.

Changes:

  • Integrates an object-cache budget/registry + allocator into TurboMind engine construction and updates model/engine wiring accordingly.
  • Removes/rewires legacy session-oriented APIs and request plumbing (Python bindings, model code paths) toward stateless-only requests.
  • Updates output/scheduler-side data plumbing (e.g., CE-loss/output handling, request containers) and adjusts pipeline tests for TurboMind behavior.

Reviewed changes

Copilot reviewed 83 out of 83 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/test_lmdeploy/test_pipeline.py | Updates session-related streaming test expectations; TurboMind now rejects interactive multi-turn requests. |
| src/turbomind/turbomind.cc | Adds object-cache budget computation + allocator/registry setup; adjusts engine/model construction and warmup session flags. |
| src/turbomind/python/bind.cpp | Simplifies SessionParam (removes start/end flags) and removes the Python-exposed ModelRequest.end() API. |
| src/turbomind/models/qwen3_5vit/qwen3_5vit.cc | Removes legacy session-flag guard and migrates from RequestCache* to Sequence* request plumbing. |
| src/turbomind/models/output_processor.cc | Refactors output range bookkeeping and CE-loss segment capture to avoid executor-side Sequence access. |
| src/turbomind/kernels/sampling_kernels.cu | Updates sampling RNG usage to rely on per-row RNG-state indices (new pointer input). |
| src/turbomind/engine/request.h | Adds a new engine status code (kOutOfMemory = 11) in the request status enum. |
| lmdeploy/turbomind/turbomind.py | Maintains Python-side mapping from engine status codes to ResponseType for user-visible responses. |

Comment thread src/turbomind/kernels/sampling_kernels.cu
Comment thread lmdeploy/turbomind/turbomind.py
lzhangzz added 8 commits June 29, 2026 08:24
Squash of 44 commits (749775f..489cfdf1) overhauling TurboMind prefix
caching for native-vision VLM and making cache behavior configurable.

VLM prefix caching (native Qwen3.5 ViT):
- Add Fingerprint type and hash fold for per-image identity
- Fingerprint-aware PrefixKey fold and PrefixTrie identity/search,
  bounding Search by fp_pos.size() to avoid OOB
- Plumb image fingerprint from Python into Qwen3_5VitItem; compute it
  in to_turbomind_multimodal; fill multimodal_spans and log ViT-skip
  on cache hit
- Fold per-image fingerprints into prefix matching/indexing via
  MultiModalSpan projection on Sequence
- Enable prefix caching for native-vision TurboMind VLM in serve
- Unit tests for Fingerprint identity and PrefixTrie image_fps match

Prefix-cache interface overhaul:
- Introduce CacheMode enum, parser, and publish-gate helper
- Scheduler CacheMode modes with image-gated auto prompt boundary;
  drop CacheBoundaryPolicy in favor of prompt_boundary helper
- Add cache_prompt_boundary_skip knob (default 1, legacy boundary)
- Publish prompt-boundary node at prompt_len - cache_prompt_boundary_skip
- Rename config to cache_prompt/cache_generation/cache_checkpoint_interval
  across EngineConfig, TurbomindEngineConfig, scripts, and docs

Robustness fixes:
- Map kOutOfMemory status to ResponseType.OUT_OF_MEMORY
- Bound PrefixTrie::Search by fp_pos.size() to avoid OOB
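A minimal sketch of the status mapping, assuming a dict-based lookup with a generic fallback. The `ResponseType` enum below is a local stand-in rather than lmdeploy's real one, and every code other than `kOutOfMemory = 11` is hypothetical.

```python
from enum import Enum


class ResponseType(Enum):
    # Local stand-in for illustration; not lmdeploy's actual enum.
    SUCCESS = "success"
    OUT_OF_MEMORY = "out_of_memory"
    INTERNAL_ENGINE_ERROR = "internal_engine_error"


K_OUT_OF_MEMORY = 11  # value given in the review summary for request.h

STATUS_TO_RESPONSE = {
    0: ResponseType.SUCCESS,  # assumed: 0 means OK
    K_OUT_OF_MEMORY: ResponseType.OUT_OF_MEMORY,
}


def to_response_type(status: int) -> ResponseType:
    # Unknown codes fall back to a generic error instead of raising KeyError.
    return STATUS_TO_RESPONSE.get(status, ResponseType.INTERNAL_ENGINE_ERROR)
```

Without an explicit entry, a new engine status would surface to users as a generic failure, which is why the mapping has to be extended whenever the C++ enum grows.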

Docs/contracts synced to cache_prompt/cache_generation modes; superseded
interim docs, scripts, and tests removed.

Publish took a t0 but scanned from block 0 every commit, unlike its
SetProducers/CheckProducers siblings which iterate [t0/bs, ceil(end/bs)).
The prefix work was dead: this request only marks/clears producers on
[t0/bs, ...) within the same pass, and every indexed block below t0 is
already valid (Resume advances resume_len only over valid prefix; the
in-flight [resume_len, t0) region was published at the prior forward's
commit). The block straddling t0 sits at index t0/bs, so it is still
processed and publication semantics are unchanged.

This drops decode publication from ~end/bs iterations per step to the
newly-covered tail, and stops unconditionally writing is_valid on blocks
the pass never proved allocated.
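The block-range arithmetic can be checked with a few lines of Python. The helper name `publish_range` and the block size of 64 are assumptions for this sketch; only the interval [t0/bs, ceil(end/bs)) comes from the commit message.

```python
def publish_range(t0: int, end: int, bs: int) -> range:
    """Block indices [t0 // bs, ceil(end / bs)) touched by tokens [t0, end)."""
    return range(t0 // bs, -(-end // bs))  # -(-a // b) is integer ceil(a / b)


bs = 64
# One decode step that extends a 1000-token sequence by a single token.
old = range(0, -(-1001 // bs))       # previous behavior: scan from block 0
new = publish_range(1000, 1001, bs)  # fixed behavior: scan from t0 // bs

# A prefill chunk whose start lands in the middle of a block.
straddle = publish_range(100, 130, bs)
```

For the decode step the old scan visits 16 blocks while the new one visits only block 15. In the prefill case, block 1 covers tokens [64, 128) and therefore straddles t0 = 100, and it is still included.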
…nning

When no request could be admitted, the scheduler left an empty active batch
and the engine resubmitted empty batches indefinitely, so a request too large
for the cache never received a terminal status (kOutOfMemory was defined but
never assigned).

Add FailStalledHeadOfLine: on a pass that admits nothing with no in-flight
work remaining, fail the highest-priority eligible request with kOutOfMemory so
it retires, releases its cache, and lower-priority requests can proceed. The
inflight check distinguishes a genuine deadlock from a transient async drain.
Document the guarantee as the README forward-progress contract.
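The forward-progress rule can be sketched in Python as below. This is a simplified model under stated assumptions: admission is strictly in priority order and stops at the first request that does not fit, a lower `priority` value means higher priority, and `Request`, `schedule_pass`, and `blocks_needed` are hypothetical names rather than the scheduler's real interface.

```python
from dataclasses import dataclass

K_OUT_OF_MEMORY = 11


@dataclass
class Request:
    name: str
    priority: int       # assumed: lower value means higher priority
    blocks_needed: int
    status: int = 0


def schedule_pass(pending, free_blocks, inflight):
    """Admit from the head of the queue; fail the head only on a true stall."""
    pending.sort(key=lambda r: r.priority)
    admitted = []
    while pending and pending[0].blocks_needed <= free_blocks:
        req = pending.pop(0)
        free_blocks -= req.blocks_needed
        admitted.append(req)
    failed = None
    # Nothing admitted and nothing in flight: no later pass can free memory,
    # so retire the head-of-line request instead of spinning on empty batches.
    if not admitted and inflight == 0 and pending:
        failed = pending.pop(0)
        failed.status = K_OUT_OF_MEMORY
    return admitted, failed


queue = [Request("small", 1, 4), Request("huge", 0, 100)]
first = schedule_pass(queue, free_blocks=8, inflight=0)
second = schedule_pass(queue, free_blocks=8, inflight=0)

draining = [Request("huge", 0, 100)]
waiting = schedule_pass(draining, free_blocks=8, inflight=1)
```

In this model the first pass fails `huge` with `K_OUT_OF_MEMORY`, the second pass admits `small`, and the `inflight=1` case leaves the request queued because in-flight work may still release memory.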
__builtin_clz is a GCC/Clang builtin unavailable on MSVC. Provide a
separate _BitScanReverse-based implementation for MSVC instead of
branching inside the function.

Demote cache admission/resume/defer/publish logs to kWarning and build
messages only when logging is enabled; guard LogResume to resuming sequences.
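The "build messages only when logging is enabled" pattern looks like this in Python's standard `logging` module. It is an analogy to the C++ change, not a port of it: the logger name, `describe`, and `log_resume` are made up for the example.

```python
import logging

logger = logging.getLogger("cache_sketch")
logger.addHandler(logging.NullHandler())  # keep the example silent
logger.propagate = False
logger.setLevel(logging.ERROR)

calls = 0


def describe(seq_id, blocks):
    """Stand-in for an expensive message builder; counts how often it runs."""
    global calls
    calls += 1
    return f"resume seq={seq_id} blocks={blocks}"


def log_resume(seq_id, blocks):
    # Check the level first so the message is only formatted when needed.
    if logger.isEnabledFor(logging.WARNING):
        logger.warning(describe(seq_id, blocks))


log_resume(1, 3)  # level is ERROR, so describe() is never called
logger.setLevel(logging.WARNING)
log_resume(2, 5)  # now enabled, so the message is built once
```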
3 participants