feat(turbomind): memory allocator, object cache, and scheduler integration #4717
Open
lzhangzz wants to merge 10 commits into
Pull request overview
This PR modernizes TurboMind’s runtime by introducing a unified object-cache/memory subsystem and integrating it into scheduling/execution, while removing legacy session/stateful paths and updating Python bindings + tests to reflect TurboMind’s stateless-only behavior.
Changes:
- Integrates an object-cache budget/registry + allocator into TurboMind engine construction and updates model/engine wiring accordingly.
- Removes/rewires legacy session-oriented APIs and request plumbing (Python bindings, model code paths) toward stateless-only requests.
- Updates output/scheduler-side data plumbing (e.g., CE-loss/output handling, request containers) and adjusts pipeline tests for TurboMind behavior.
Reviewed changes
Copilot reviewed 83 out of 83 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/test_pipeline.py | Updates session-related streaming test expectations; TurboMind now rejects interactive multi-turn requests. |
| src/turbomind/turbomind.cc | Adds object-cache budget computation + allocator/registry setup; adjusts engine/model construction and warmup session flags. |
| src/turbomind/python/bind.cpp | Simplifies SessionParam (removes start/end flags) and removes the Python-exposed ModelRequest.end() API. |
| src/turbomind/models/qwen3_5vit/qwen3_5vit.cc | Removes legacy session-flag guard and migrates from RequestCache* to Sequence* request plumbing. |
| src/turbomind/models/output_processor.cc | Refactors output range bookkeeping and CE-loss segment capture to avoid executor-side Sequence access. |
| src/turbomind/kernels/sampling_kernels.cu | Updates sampling RNG usage to rely on per-row RNG-state indices (new pointer input). |
| src/turbomind/engine/request.h | Adds a new engine status code (kOutOfMemory = 11) in the request status enum. |
| lmdeploy/turbomind/turbomind.py | Maintains Python-side mapping from engine status codes to ResponseType for user-visible responses. |
Squash of 44 commits (749775f..489cfdf1) overhauling TurboMind prefix caching for native-vision VLM and making cache behavior configurable.

VLM prefix caching (native Qwen3.5 ViT):
- Add Fingerprint type and hash fold for per-image identity
- Fingerprint-aware PrefixKey fold and PrefixTrie identity/search, bounding Search by fp_pos.size() to avoid OOB
- Plumb image fingerprint from Python into Qwen3_5VitItem; compute it in to_turbomind_multimodal; fill multimodal_spans and log ViT-skip on cache hit
- Fold per-image fingerprints into prefix matching/indexing via MultiModalSpan projection on Sequence
- Enable prefix caching for native-vision TurboMind VLM in serve
- Unit tests for Fingerprint identity and PrefixTrie image_fps match

Prefix-cache interface overhaul:
- Introduce CacheMode enum, parser, and publish-gate helper
- Scheduler CacheMode modes with image-gated auto prompt boundary; drop CacheBoundaryPolicy in favor of prompt_boundary helper
- Add cache_prompt_boundary_skip knob (default 1, legacy boundary)
- Publish prompt-boundary node at prompt_len - cache_prompt_boundary_skip
- Rename config to cache_prompt/cache_generation/cache_checkpoint_interval across EngineConfig, TurbomindEngineConfig, scripts, and docs

Robustness fixes:
- Map kOutOfMemory status to ResponseType.OUT_OF_MEMORY
- Bound PrefixTrie::Search by fp_pos.size() to avoid OOB

Docs/contracts synced to cache_prompt/cache_generation modes; superseded interim docs, scripts, and tests removed.
Publish took a t0 but scanned from block 0 every commit, unlike its SetProducers/CheckProducers siblings which iterate [t0/bs, ceil(end/bs)). The prefix work was dead: this request only marks/clears producers on [t0/bs, ...) within the same pass, and every indexed block below t0 is already valid (Resume advances resume_len only over valid prefix; the in-flight [resume_len, t0) region was published at the prior forward's commit). The block straddling t0 sits at index t0/bs, so it is still processed and publication semantics are unchanged. This drops decode publication from ~end/bs iterations per step to the newly-covered tail, and stops unconditionally writing is_valid on blocks the pass never proved allocated.
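For reviewers, the block range described in this commit can be sketched as follows. This is only an illustration of the `[t0/bs, ceil(end/bs))` arithmetic. `BlockRange` and `BlocksVisited` are hypothetical helper names, not functions from the TurboMind source, and the sketch assumes `bs > 0` and `t0 <= end`.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper (not a TurboMind API): half-open range of block
// indices covering tokens [t0, end) for block size bs > 0, i.e. the
// [t0/bs, ceil(end/bs)) range named in the commit message.
inline std::pair<std::size_t, std::size_t> BlockRange(std::size_t t0, std::size_t end, std::size_t bs)
{
    const std::size_t first = t0 / bs;              // block that contains (straddles) t0
    const std::size_t last  = (end + bs - 1) / bs;  // ceil(end / bs)
    return std::make_pair(first, last);
}

// Number of blocks a commit pass would visit when it starts at t0.
// Passing t0 = 0 models the old behaviour of scanning from block 0.
inline std::size_t BlocksVisited(std::size_t t0, std::size_t end, std::size_t bs)
{
    const std::pair<std::size_t, std::size_t> r = BlockRange(t0, end, bs);
    return r.second > r.first ? r.second - r.first : 0;
}
```

With `bs = 64`, a decode step that extends a sequence from token 130 to 131 visits one block under this range, whereas scanning from block 0 visits three; the block at index `t0/bs` is still included in both cases.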
…nning

When no request could be admitted, the scheduler left an empty active batch and the engine resubmitted empty batches indefinitely, so a request too large for the cache never received a terminal status (kOutOfMemory was defined but never assigned). Add FailStalledHeadOfLine: on a pass that admits nothing with no in-flight work remaining, fail the highest-priority eligible request with kOutOfMemory so it retires, releases its cache, and lower-priority requests can proceed. The inflight check distinguishes a genuine deadlock from a transient async drain. Document the guarantee as the README forward-progress contract.
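The forward-progress rule in this commit can be illustrated with a toy model. Everything below is an assumption made for illustration: the `Request` and `Scheduler` structs, the block-count admission test, and the "lower value means higher priority" convention are not taken from the TurboMind source. Only the names `FailStalledHeadOfLine` and `kOutOfMemory` come from the commit message.

```cpp
#include <vector>

enum class Status { kPending, kActive, kOutOfMemory };

// Toy request: all fields are illustrative assumptions.
struct Request {
    int    id;
    int    priority;       // assumption: lower value = higher priority
    int    blocks_needed;  // cache blocks required for admission
    Status status;
};

struct Scheduler {
    int                  free_blocks;
    int                  inflight;  // admitted work that has not retired yet
    std::vector<Request> queue;     // assumption: kept in priority order

    // One pass: admit pending requests in order, stopping at the first
    // one that does not fit (head-of-line blocking). Returns the count admitted.
    int Admit()
    {
        int admitted = 0;
        for (Request& r : queue) {
            if (r.status != Status::kPending) continue;
            if (r.blocks_needed > free_blocks) break;
            free_blocks -= r.blocks_needed;
            r.status = Status::kActive;
            ++inflight;
            ++admitted;
        }
        return admitted;
    }

    // If the pass admitted nothing and no work is in flight, nothing can
    // ever free cache, so fail the highest-priority pending request.
    bool FailStalledHeadOfLine(int admitted)
    {
        if (admitted > 0 || inflight > 0) return false;  // progress or transient drain
        Request* head = nullptr;
        for (Request& r : queue) {
            if (r.status != Status::kPending) continue;
            if (head == nullptr || r.priority < head->priority) head = &r;
        }
        if (head == nullptr) return false;
        head->status = Status::kOutOfMemory;
        return true;
    }
};
```

In this model, a request needing 10 blocks at the head of a queue with 4 free blocks is failed on the first stalled pass, after which a lower-priority request needing 3 blocks is admitted on the next pass.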
__builtin_clz is a GCC/Clang builtin unavailable on MSVC. Provide a separate _BitScanReverse-based implementation for MSVC instead of branching inside the function.
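A minimal sketch of the split this commit describes, with one definition per compiler rather than a branch inside a single function. `CountLeadingZeros32` is a hypothetical name used here for illustration; the actual function name in the PR is not shown. Both variants assume a nonzero argument, since `__builtin_clz(0)` is undefined and `_BitScanReverse` leaves its index unspecified when the mask is zero.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && !defined(__clang__)
#include <intrin.h>

// MSVC variant: _BitScanReverse writes the index of the highest set bit.
inline int CountLeadingZeros32(std::uint32_t x)
{
    unsigned long index = 0;
    _BitScanReverse(&index, static_cast<unsigned long>(x));  // requires x != 0
    return 31 - static_cast<int>(index);
}

#else

// GCC/Clang variant: the builtin returns the leading-zero count directly.
inline int CountLeadingZeros32(std::uint32_t x)
{
    return __builtin_clz(x);  // requires x != 0
}

#endif
```

For example, `CountLeadingZeros32(1u)` yields 31 and `CountLeadingZeros32(0x80000000u)` yields 0 on either path.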
Demote cache admission/resume/defer/publish logs to kWarning and build messages only when logging is enabled; guard LogResume to resuming sequences.
Summary
Introduces a new memory subsystem and an object-cache–based scheduler for TurboMind, replacing the legacy BlockManager/SequenceManager machinery with a unified caching runtime and integrating recurrent (GDN) state management.
What's done
Memory subsystem
Scheduler & cache runtime
Recurrent (GDN) state & attention
Generation & sampling
Cleanup
BlockManager, BlockTrie, SequenceManager, and their tests, now superseded by the memory subsystem and new scheduler.