feat: Turbomind linear gdn prefix caching#4465

Open
lapy wants to merge 5 commits into InternLM:main from lapy:turbomind-linear-gdn-prefix-caching

Conversation

@lapy
Contributor

@lapy lapy commented Mar 25, 2026

Qwen3.5 Hybrid Prefix Caching in TurboMind

Summary

This change adds prefix caching support for Qwen3.5 hybrid-attention models in TurboMind.

  • Full-attention layers keep using the existing KV prefix cache.
  • Gated DeltaNet linear-attention layers now store checkpointed recurrent state at configurable KV-block boundaries.
  • Prefix matches for hybrid models restore both:
    • shared KV blocks for full-attention layers
    • the closest compatible GDN checkpoint state

The implementation keeps quant_policy scoped to KV cache quantization. GDN prefix checkpoints remain in the model/state dtypes in this version.

User-Facing Changes

  • Added linear_prefix_cache_interval_blocks to TurbomindEngineConfig
  • Added --linear-prefix-cache-interval-blocks to the TurboMind CLI surface
  • Default interval is 2 KV blocks
  • Validation rejects values < 1
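The config surface above can be sketched as a minimal Python stand-in. The dataclass here is illustrative, not the real lmdeploy class; only the field name, the default of 2, and the >= 1 bound come from this PR:

```python
from dataclasses import dataclass


# Illustrative sketch of the new config field and its validation.
@dataclass
class TurbomindEngineConfig:
    enable_prefix_caching: bool = False
    # Save a GDN checkpoint every N KV blocks; default is 2 in this PR.
    linear_prefix_cache_interval_blocks: int = 2

    def __post_init__(self):
        # Validation rejects values < 1, as described above.
        if self.linear_prefix_cache_interval_blocks < 1:
            raise ValueError(
                'linear_prefix_cache_interval_blocks must be >= 1')


cfg = TurbomindEngineConfig(enable_prefix_caching=True,
                            linear_prefix_cache_interval_blocks=4)
```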

Runtime Design

Hybrid cache structure

  • Existing KV prefix cache is unchanged for full-attention layers.
  • A second cache family stores GDN prefix checkpoints.
  • Each checkpoint stores:
    • convolution state
    • recurrent state
  • Checkpoints are attached to trie nodes at the configured interval.
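The two cache families above can be pictured with a small Python sketch. The names here are hypothetical stand-ins, not the actual C++ BlockTrie types; the structure (a KV-block trie node optionally owning a conv + recurrent snapshot at interval boundaries) follows the description above:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GdnCheckpoint:
    conv_state: bytes       # convolution state snapshot
    recurrent_state: bytes  # recurrent state snapshot


@dataclass
class TrieNode:
    block_id: int                              # shared KV block id
    children: dict = field(default_factory=dict)
    gdn_ckpt: Optional[GdnCheckpoint] = None   # set only at boundaries


def attach_checkpoint(node, depth, interval, ckpt):
    # A checkpoint is attached only when the node's depth (in KV
    # blocks) lands on the configured interval boundary.
    if depth % interval == 0:
        node.gdn_ckpt = ckpt
```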

Prefix matching

On a prefix hit:

  • TurboMind matches normal KV blocks as before.
  • For hybrid models it also finds the deepest trie node with a valid GDN checkpoint.
  • Reusable prefix length is clamped to the deepest compatible linear checkpoint.
  • The matched GDN state is restored into the live per-sequence GDN buffers before decode continues.
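The clamping rule above can be sketched in Python (hypothetical names, not the actual C++ match API): walk the matched trie path root-first, remember the deepest node carrying a GDN checkpoint, and clamp the reusable prefix to that depth.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    gdn_ckpt: Optional[object] = None  # checkpoint handle, if present


def match_hybrid_prefix(path, block_size):
    """path: matched trie nodes, root-first. Returns (tokens, ckpt)."""
    deepest, ckpt = 0, None
    for depth, node in enumerate(path, start=1):
        if node.gdn_ckpt is not None:
            deepest, ckpt = depth, node.gdn_ckpt
    # Full-attention layers could reuse all len(path) blocks, but the
    # linear layers can only resume from a checkpoint, so clamp.
    return deepest * block_size, ckpt
```

With 8 matched blocks of 64 tokens and a checkpoint at depth 8, this yields a reusable length of 512, consistent with the `linear_cache_len 512` log line shown later in this PR.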

Cache maintenance

  • GDN checkpoint slots are released when trie nodes are invalidated.
  • When KV cached blocks are freed or evicted, the trie is verified immediately so the corresponding GDN checkpoints are pruned in the same path.
  • If the GDN checkpoint pool is exhausted, TurboMind skips storing deeper checkpoints instead of aborting the request.
  • Warm-up requests never allocate GDN prefix checkpoint staging.
  • Large real batches that cannot afford checkpoint staging continue to run; they simply skip storing new GDN checkpoints for that batch.
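The degrade-gracefully policy above can be sketched as follows; names are illustrative, not the actual TurboMind API. Storing a checkpoint is best-effort: warm-up requests and pool exhaustion skip the store rather than abort.

```python
class CheckpointPool:
    """Illustrative fixed-capacity pool of GDN checkpoint slots."""

    def __init__(self, capacity):
        self.free = capacity

    def try_acquire(self):
        if self.free == 0:
            return None  # exhausted: caller must skip, not fail
        self.free -= 1
        return {}        # stand-in for a real checkpoint slot


def maybe_store_checkpoint(pool, ckpt, is_warmup):
    if is_warmup:
        return False          # warm-up never allocates checkpoint staging
    slot = pool.try_acquire()
    if slot is None:
        return False          # pool exhausted: skip deeper checkpoints
    slot['state'] = ckpt
    return True
```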

Main Code Areas

  • CLI and engine config
    • lmdeploy/messages.py
    • lmdeploy/cli/utils.py
    • lmdeploy/cli/cli.py
    • lmdeploy/cli/serve.py
    • src/turbomind/turbomind.cc
    • src/turbomind/models/llama/llama_params.h
    • src/turbomind/engine/engine.cc
  • Core hybrid prefix-cache logic
    • src/turbomind/models/llama/BlockTrie.h
    • src/turbomind/models/llama/BlockTrie.cc
    • src/turbomind/models/llama/SequenceManager.h
    • src/turbomind/models/llama/SequenceManager.cc
    • src/turbomind/models/llama/GatedDeltaNetLayer.h
    • src/turbomind/models/llama/GatedDeltaNetLayer.cc
    • src/turbomind/models/llama/gated_delta_net_kernels.h
    • src/turbomind/models/llama/gated_delta_net_kernels.cu

Test Coverage Added

Python tests

  • tests/test_lmdeploy/test_turbomind/test_engine_config.py
    • default interval value
    • validation for invalid interval values
    • explicit override handling
  • tests/test_lmdeploy/test_turbomind/test_api_server.py
    • TurboMind api_server forwards hybrid prefix-cache options into TurbomindEngineConfig
    • TurboMind api_server uses the normal default CUDA max_batch_size when the user does not set one explicitly

Test commands run

python -m pytest -q tests/test_lmdeploy/test_turbomind/test_engine_config.py tests/test_lmdeploy/test_turbomind/test_api_server.py
python -m pytest -q tests/test_lmdeploy/test_turbomind/test_converter.py
cmake --build /root/lmdeploy/build --target _turbomind -j4

Observed results

  • test_engine_config.py + test_api_server.py: 5 passed
  • test_converter.py: 5 passed
  • _turbomind rebuilt successfully

Real-Model Validation

Model:

  • QuantTrio/Qwen3.5-27B-AWQ

Command used:

TM_LOG_LEVEL=INFO CUDA_VISIBLE_DEVICES=1,2 lmdeploy serve api_server \
  QuantTrio/Qwen3.5-27B-AWQ \
  --tp 2 \
  --server-port 23335 \
  --reasoning-parser qwen-qwq \
  --tool-call-parser qwen3coder \
  --quant-policy 8 \
  --enable-prefix-caching

Observed startup details:

  • Server reached full Uvicorn startup successfully.
  • TurboMind reported max cached tokens: 533248.
  • Warm-up completed successfully through 8320 tokens.

Observed hybrid prefix-cache hit on repeated request:

[TM][INFO] [SeqMgr][match] ID 2, hit blocks 8, linear_cache_len 512, cache_len 0
[TM][INFO] [SeqMgr][match] ID 2, after matching, blocks 8, cache_len 512

Request details:

  • request 1: prompt_tokens=626, completion_tokens=24
  • request 2: prompt_tokens=626, completion_tokens=24

This confirms that the second request reused both normal cached KV blocks and a compatible linear-attention checkpoint.

Notes

  • quant_policy remains KV-only in this PR.

@lapy
Contributor Author

lapy commented Mar 25, 2026

TurboMind now treats Qwen3.5 hybrid attention as two cache families:

  • standard KV prefix cache for full-attention layers
  • checkpointed Gated DeltaNet (GDN) state for linear-attention layers

On a prefix hit, TurboMind restores both:

  • shared KV blocks for the matched full-attention prefix
  • the deepest compatible cached GDN checkpoint for the matched linear-attention prefix

Key changes

  • Changed the default linear_prefix_cache_interval_blocks from 2 to 64.
  • Updated the CLI/help text to describe the tradeoff more clearly: larger values reduce GDN checkpoint memory usage but increase recompute after a prefix hit.

New findings

The important new finding is that hybrid prefix caching must budget for three separate memory buckets:

  • KV cache blocks
  • live per-sequence GDN state
  • cached GDN prefix checkpoints

Before this change, TurboMind only effectively budgeted KV blocks plus live GDN state. The cached GDN checkpoint pool was lazy and not included in the initial capacity estimate.

For Qwen3.5-27B AWQ on tp=2 with quant_policy=8, a single cached GDN checkpoint is much larger than it first appears:

  • one int8 KV block of 64 items: about 1.016 MiB
  • one GDN checkpoint snapshot slot: about 37.9 MiB per rank

So while a single live GDN state is constant-size, GDN checkpoint memory grows linearly with cached prefix length, and dense checkpointing can significantly reduce available context capacity unless the interval is chosen conservatively.

linear_prefix_cache_interval_blocks controls how many KV blocks elapse between consecutive GDN snapshots.

Example: with the default interval of 64, a GDN snapshot (37.9 MiB) is saved for every 64 int8 KV blocks (about 65 MiB), a reasonable tradeoff between reuse and memory footprint.
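A back-of-envelope check of those numbers, computing checkpoint overhead as a fraction of the combined footprint for a given interval (sizes taken from this PR's Qwen3.5-27B AWQ, tp=2, quant_policy=8 measurements):

```python
KV_BLOCK_MIB = 1.016  # one int8 KV block of 64 items
CKPT_MIB = 37.9       # one GDN checkpoint slot per rank


def ckpt_overhead(interval_blocks):
    """Checkpoint share of (KV blocks per interval + one checkpoint)."""
    kv_mib = interval_blocks * KV_BLOCK_MIB
    return CKPT_MIB / (kv_mib + CKPT_MIB)


# interval 2: checkpoints dominate, about 95% of the combined footprint
# interval 64: roughly 37.9 / (65 + 37.9), about 37% of the footprint
```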

Real-model observations

Validated on QuantTrio/Qwen3.5-27B-AWQ with TurboMind and real repeated requests.

Observed hybrid prefix hit on repeated prompt:

  • hit blocks 8
  • linear_cache_len 512
  • after matching, blocks 8, cache_len 512

Observed context-capacity impact when GDN checkpoint memory is included in the budget:

  • interval 2: max cached tokens = 27200
  • interval 64: max cached tokens = 337600
  • interval 128: max cached tokens = 413952

This showed that interval 2 is far too dense for this model on 32GB V100s, while 64 and 128 are both practical. Based on these results, the default was changed to 64.

Runtime behavior

  • Huge prompts can still run even if new GDN checkpoints cannot be stored.
  • If GDN checkpoint staging would exceed the per-batch budget, TurboMind skips storing new GDN checkpoints for that batch instead of aborting the request.
  • If the GDN checkpoint slot pool is exhausted, deeper checkpoints are skipped until cached entries are evicted.
  • Prefix caching remains opportunistic acceleration data rather than a hard requirement for forward progress.

Validation

  • QuantTrio/Qwen3.5-27B-AWQ
  • CUDA_VISIBLE_DEVICES=1,2
  • tp=2
  • quant_policy=8
  • prefix caching enabled

@lvhan028 lvhan028 requested review from Copilot and lzhangzz March 26, 2026 09:56
@lvhan028 lvhan028 added the enhancement New feature or request label Apr 2, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds hybrid prefix caching support in TurboMind for Qwen3.5-style hybrid attention models by extending the existing KV prefix cache with periodic Gated DeltaNet (linear attention) state checkpoints, and wires the new option through Python config + CLI into the C++ engine.

Changes:

  • Introduces linear_prefix_cache_interval_blocks across Python config/CLI and TurboMind engine params to control linear-attention checkpoint cadence.
  • Extends TurboMind prefix-cache matching/caching to additionally capture and restore GDN conv/recurrent states at interval boundaries.
  • Adds Python tests to validate config defaults/validation and API server forwarding behavior.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Summary per file:

  • tests/test_lmdeploy/test_turbomind/test_engine_config.py: adds tests for default/validation/override of the new interval config.
  • tests/test_lmdeploy/test_turbomind/test_api_server.py: ensures the API server forwards the new option into TurbomindEngineConfig and preserves default CUDA batch sizing behavior.
  • src/turbomind/turbomind.cc: parses linear_prefix_cache_interval_blocks and removes the previous hard block on prefix caching with linear attention.
  • src/turbomind/models/llama/SequenceManager.h: adds per-sequence pending checkpoint tensors/metadata and threads the interval into SequenceManager.
  • src/turbomind/models/llama/SequenceManager.cc: budgets cache blocks considering checkpoint overhead, integrates trie verify hooks, and restores linear states on prefix hits.
  • src/turbomind/models/llama/llama_params.h: adds the new engine parameter to EngineParam.
  • src/turbomind/models/llama/GatedDeltaNetLayer.h: adds capture staging buffers and bookkeeping for checkpoint capture during prefill.
  • src/turbomind/models/llama/GatedDeltaNetLayer.cc: computes per-request capture counts, allocates staging opportunistically, and publishes captured checkpoint slices to sequences for caching.
  • src/turbomind/models/llama/gated_delta_net_kernels.h: extends kernel launcher APIs to optionally write checkpoint captures.
  • src/turbomind/models/llama/gated_delta_net_kernels.cu: implements conv/recurrent checkpoint capture paths and adds new overloads for the launchers.
  • src/turbomind/models/llama/BlockTrie.h: extends trie nodes to optionally own a linear-state slot; returns a richer match result including linear checkpoint state.
  • src/turbomind/models/llama/BlockTrie.cc: stores/retrieves linear checkpoint state in trie nodes and releases it when nodes are invalidated.
  • src/turbomind/engine/engine.cc: passes the new interval into SequenceManager.
  • lmdeploy/messages.py: adds config field, docs, and validation for linear_prefix_cache_interval_blocks.
  • lmdeploy/cli/utils.py: adds the --linear-prefix-cache-interval-blocks CLI option.
  • lmdeploy/cli/serve.py: wires the CLI option into TurbomindEngineConfig for the API server.
  • lmdeploy/cli/cli.py: exposes the CLI option on the chat CLI path.


@jingyibo123
Contributor

This looks promising; prefix caching is the last piece toward a working lmdeploy production deployment of Qwen3.5.

lapy added 5 commits April 10, 2026 16:24
…ntation; add related tests. This change enhances memory management for hybrid models by increasing the checkpoint interval, which may reduce memory usage but requires more recompute after prefix hits.
@lapy lapy force-pushed the turbomind-linear-gdn-prefix-caching branch from ad6573b to 8f1b581 Compare April 10, 2026 15:39

Labels

enhancement New feature or request
