[feature] Add SGLang HiCacheStorage integration with local and distributed modes by staryxchen · Pull Request #137 · taco-project/FlexKV

staryxchen · 2026-04-05T16:12:13Z

Summary

Add FlexKV as a storage backend for SGLang's HiRadixCache (L3 tier: GPU -> Host -> FlexKV)
Support local mode (single-node CPU cache + optional SSD) and distributed mode (multi-Prefill KV Cache sharing via Distributed RadixTree + Redis GMS + Mooncake P2P transfer)
New put_cpu() API enabling CPU-only PUT path without GPU block registration
Thread-safe adapter: _engine_lock serialises concurrent SGLang threads (backup/prefetch/query)

Changes

Core: CPU-only PUT Path

cache_engine.py: put_cpu() — allocate CPU blocks, insert into radix tree with is_ready=False, optionally create H2DISK ops
kvtask.py: put_cpu_match(), detect FLEXKV_CPU_ONLY=1 for thread mode
kvmanager.py: put_cpu(), launch_cpu(), get_cpu_cache_tensor()

SGLang Adapter (`flexkv/integration/sglang/`)

hicache_storage_adapter.py: FlexKVHiCacheStorage implementing SGLang HiCacheStorage interface
- batch_set_v1: Host -> layout transform -> FlexKV CPU cache -> optional SSD
- batch_get_v1: FlexKV CPU cache -> layout transform -> Host pool (+ cross-node P2P in distributed mode)
- batch_exists: Radix tree query (local + distributed via match_all())
- _engine_lock: Thread safety for concurrent SGLang backup/prefetch/query threads
- Auto-detects model params from SGLang mem_pool_host
patch_sglang.py: One-click SGLang patch tool (flexkv-patch-sglang CLI)
test_hicache_storage_adapter.py: 26 unit tests

Documentation

docs/sglang_adapter/README.md: Architecture, thread safety, data flow, configuration

Test Plan

Unit tests: 26/26 pass (token extraction, init, MLA, round-trip, dedup, distributed mode, concurrency)
SGLang E2E: backup/prefetch cycle verified, GSM8K accuracy consistent
Multi-GPU: TP=4, auto-detected per-rank params
SSD persistence: H2DISK io_uring transfer verified
PD disaggregation: Prefill (TP=2) + Decode (TP=2), 3500+ requests at 100% success rate
Distributed mode: cross-node GET via Mooncake P2P transfer verified
Thread safety: _engine_lock serialises batch_set_v1/batch_get_v1/batch_exists

Add a new put_cpu() API that allows writing KV data from external CPU memory into FlexKV's CPU cache without requiring GPU block registration.
Core changes:
- cache_engine.py: add put_cpu() and _put_impl_local_cpu_only(), which allocate CPU blocks, insert into the radix tree, and optionally create H2DISK transfer ops for SSD persistence
- kvtask.py: add put_cpu_match() to KVTaskEngine; detect the FLEXKV_CPU_ONLY env var to use the thread-mode TransferManager (shared address space)
- kvmanager.py: add put_cpu(), launch_cpu(), get_cpu_cache_tensor() public methods for the CPU-only workflow
This enables SGLang HiCacheStorage integration where data lives in Host memory and no GPU slots are available.
Signed-off-by: staryxchen <staryxchen@tencent.com>

Implement FlexKVHiCacheStorage adapter for SGLang's HiRadixCache, using KVManager in thread mode (CPU-only PUT path).
Adapter features:
- batch_set_v1: write KV data from SGLang Host pool to FlexKV CPU cache via put_cpu(), with layout transform and dedup handling
- batch_get_v1: read from FlexKV CPU cache, transform layout, write back to SGLang Host pool
- batch_exists: query CPU cache engine radix tree directly
- Auto-detect model params (num_layers, num_kv_heads, head_size) from SGLang's mem_pool_host at runtime
- Support all SGLang layouts (layer_first, page_first, etc.)
Unit tests (8 cases):
- Token ID extraction (plain list, RadixKey, None)
- Adapter initialization, MLA skip, graceful degradation
- Full set -> exists -> get round-trip with data validation
- Deduplication correctness, statistics collection
Signed-off-by: staryxchen <staryxchen@tencent.com>
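
The token ID extraction cases listed above (plain list, RadixKey, None) can be illustrated with a minimal sketch. `FakeRadixKey` and `extract_token_ids` are hypothetical names, and the `token_ids` attribute is an assumption about how a key object exposes its tokens; the adapter's actual helper may differ.

```python
from typing import Any, List, Optional


class FakeRadixKey:
    """Hypothetical stand-in for a key object; only token_ids is modelled."""

    def __init__(self, token_ids: List[int]) -> None:
        self.token_ids = token_ids


def extract_token_ids(key: Any) -> Optional[List[int]]:
    """Return a plain list of token IDs, or None if the key carries none."""
    if key is None:
        return None
    # Assumption: key objects expose tokens via a token_ids attribute;
    # anything else is treated as an iterable of token IDs.
    token_ids = getattr(key, "token_ids", key)
    if token_ids is None:
        return None
    return [int(t) for t in token_ids]
```

Normalising all three input shapes in one place keeps the set, get, and exists paths free of per-call type checks.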

- docs/sglang_adapter/README.md: architecture overview, data flow, configuration guide, known limitations, verification checklist
Signed-off-by: staryxchen <staryxchen@tencent.com>

…ia P2P
Add distributed mode to the SGLang HiCacheStorage adapter, enabling multi-Prefill KV Cache sharing through FlexKV's Distributed RadixTree and Mooncake Transfer Engine P2P transfers.
Key changes:
- batch_exists: query distributed index (match_all) to discover remote blocks
- batch_get_v1: fetch remote blocks via prefetch_async + wait, then read locally
- batch_set_v1: write to local CPU cache with metadata published to Redis GMS
- Add _fetch_remote_blocks() for synchronous P2P block retrieval
- Add remote fetch statistics (get_remote_fetches/successes/failures)
- Add distributed mode unit tests (7 mode configuration test cases)
- Update README with distributed mode architecture and configuration
- Add test script and Mooncake config examples for same-host verification
Verified: dual-Prefill same-host E2E test passes with 231 blocks (3696 tokens) fetched cross-node at 4.12 GB/s via TCP P2P transfer.
Signed-off-by: staryxchen <staryxchen@tencent.com>

Remove redundant .clone() before layout transform (permute+contiguous already copies), eliminate duplicate SequenceMeta allocation in distributed GET path, consolidate _stats lock acquisitions from 5+ to 2 per batch_get_v1 call, and hoist loop-invariant num_deduped computation.
Signed-off-by: staryxchen <staryxchen@tencent.com>

- Remove custom logging setup and imports
- Import shared logger from flexkv.common.debug module
Signed-off-by: staryxchen <staryxchen@tencent.com>

- Integrate metrics recording for cache operations in SGLang integration
Signed-off-by: staryxchen <staryxchen@tencent.com>

- Auto-detect kv_lora_rank + qk_rope_head_dim from mem_pool_host
- Handle MLA 4D layout transforms (L,T,1,D) <-> (L,1,T,1,D)
- Move Initializing log after auto-detect to show correct values
- Add MLA unit tests for auto-detect, layout transform, set/get/dedup
Signed-off-by: staryxchen <staryxchen@tencent.com>

get_stats() previously returned a custom Dict, which failed SGLang's `assert isinstance(storage_metrics, StorageMetrics)` check. Replace it with the standard StorageMetrics dataclass (prefetch_pgs, backup_pgs, prefetch_bandwidth, backup_bandwidth), using the sliding-window pattern consistent with the HF3FS and MooncakeStore backends. A try/except import supports both sglang.srt.metrics.collector (older versions) and sglang.srt.observability.metrics_collector (newer versions).
Signed-off-by: staryxchen <staryxchen@tencent.com>
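
The version-tolerant import described above can be sketched generically. This is not the adapter's code: the commit uses a plain try/except, while `import_first` is a hypothetical helper built on `importlib` that expresses the same fallback and is demonstrated with standard-library modules so it runs without SGLang installed.

```python
import importlib
from typing import Any, Sequence


def import_first(module_paths: Sequence[str], attr: str) -> Any:
    """Return `attr` from the first module in `module_paths` that provides it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # Module absent in this version; try the next location.
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr!r} not found in any of: {', '.join(module_paths)}")


# Demonstrated with standard-library modules so the sketch runs anywhere.
ordered_dict_cls = import_first(["no_such_module_xyz", "collections"], "OrderedDict")
```

For the adapter, the candidate paths would be the two module locations named in the commit message, with `StorageMetrics` as the attribute.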

…k shape
- Extract mode literals ("local"/"distributed") into module-level constants MODE_LOCAL, MODE_DISTRIBUTED, _VALID_MODES to prevent typo-induced bugs
- Extract error operation labels ("get"/"set"/"exists") into _OP_GET, _OP_SET, _OP_EXISTS constants for consistent Prometheus label usage
- Cache block shape tuple as self._block_shape at init time instead of recomputing kv_dim and constructing the shape on every _get_block_shaped() call (hot path in batch_get_v1 per-block loop)
- Remove dead field _started (set but never read)
- Update tests to import and use the new constants
Signed-off-by: staryxchen <staryxchen@tencent.com>
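
A minimal sketch of the two ideas in this commit: mode constants with validation, and a block shape computed once at init. The constant names come from the commit message; `validate_mode`, `ShapeCachingAdapter`, and the three-axis shape are hypothetical and chosen only for illustration.

```python
from typing import Tuple

MODE_LOCAL = "local"
MODE_DISTRIBUTED = "distributed"
_VALID_MODES = (MODE_LOCAL, MODE_DISTRIBUTED)


def validate_mode(mode: str) -> str:
    """Reject anything that is not one of the known mode constants."""
    if mode not in _VALID_MODES:
        raise ValueError(f"invalid mode {mode!r}; expected one of {_VALID_MODES}")
    return mode


class ShapeCachingAdapter:
    """Hypothetical adapter fragment that computes the block shape once."""

    def __init__(self, mode: str, num_layers: int, tokens_per_block: int, kv_dim: int) -> None:
        self.mode = validate_mode(mode)
        # Assumed layout for illustration only: (layers, tokens, kv_dim).
        self._block_shape: Tuple[int, int, int] = (num_layers, tokens_per_block, kv_dim)

    def block_shape(self) -> Tuple[int, int, int]:
        # Hot path: return the cached tuple instead of rebuilding it per call.
        return self._block_shape
```

A misspelled mode such as "distribued" then fails at construction time rather than silently selecting the wrong code path.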

Refactor batch_get_v1/batch_set_v1 from whole-page copy (permute+flatten) to per-layer copy (direct kv_buffer slice write).
Benefits even without layerwise pipeline enabled:
- Eliminates permute(1,0,2,3,4).contiguous() temp allocation (~18MB/page)
- Zero-copy view via block_data[layer_id] instead of full tensor copy
- Direct kv_buffer write bypasses set_from_flat_data_page overhead
New methods: _write_layer_to_host(), _read_layer_from_host()
Optional layer_ready_callback for future pipeline integration.
Non-layer_first layouts fall back to original whole-page path.
Signed-off-by: staryxchen <staryxchen@tencent.com>

…ption
SGLang runs backup (batch_set_v1) and prefetch (batch_get_v1) on separate threads, both accessing the shared CPU cache tensor. Without protection, a concurrent SET can overwrite a block that GET is reading, producing corrupted KV data that causes CUDA illegal memory access on the GPU side.
Fix: clone block data at the two batch_get_v1 call sites only. The SET path continues using views so writes go directly into the CPU cache tensor as intended.
Also add bounds checking in _get_block_view, _write_layer_to_host, and _read_layer_from_host to catch invalid block/host references early.

Provide a `flexkv-patch-sglang` command that auto-locates the SGLang install and applies the FlexKV integration patch (unified diff). Users no longer need to manually git-apply a patch file.
- Add patches/sglang_flexkv.patch (3 SGLang source files)
- Add patch_sglang.py with --check / --revert / --sglang-path
- Register console_scripts entry point in setup.py
- Include .patch files in package_data
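
The command-line surface described above can be sketched with `argparse`. Only the three flag names and the program name come from the commit message; the help strings, the mutual exclusion of --check and --revert, and the `locate_sglang` helper are assumptions, not the actual patch_sglang.py.

```python
import argparse
import importlib.util
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="flexkv-patch-sglang")
    # Assumption: checking and reverting are mutually exclusive actions.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--check", action="store_true",
                       help="report whether the patch is applied")
    group.add_argument("--revert", action="store_true",
                       help="undo a previously applied patch")
    parser.add_argument("--sglang-path", type=Path, default=None,
                        help="explicit SGLang install directory")
    return parser


def locate_sglang(explicit: Optional[Path]) -> Optional[Path]:
    """Prefer an explicit path; otherwise ask the import system where sglang lives."""
    if explicit is not None:
        return explicit
    spec = importlib.util.find_spec("sglang")
    if spec is None or not spec.submodule_search_locations:
        return None  # SGLang is not installed in this environment.
    return Path(list(spec.submodule_search_locations)[0])
```

Using `typing.Optional` rather than `Path | None` keeps the sketch compatible with older Python versions.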

Update data flow, layout transform, file tables, parameter docs, and test counts to match the current adapter code.

_write_layer_to_host and _read_layer_from_host checked host_start against kv_buf.shape[2], which is the token axis for MHA (2,L,T,H,D) but the num_kv_heads axis (size 1) for MLA (L,T,1,D). This caused all MLA layerwise operations to silently bail out on the OOB guard. Fix: use shape[1] for MLA (token axis) and shape[2] for MHA.
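
The axis fix can be shown without tensors, using only shape tuples. `token_axis` and `host_range_in_bounds` are hypothetical names, and the range check is a simplified version of the guard described above.

```python
from typing import Sequence


def token_axis(is_mla: bool) -> int:
    # MHA buffers are (2, L, T, H, D); MLA buffers are (L, T, 1, D).
    return 1 if is_mla else 2


def host_range_in_bounds(shape: Sequence[int], host_start: int,
                         num_tokens: int, is_mla: bool) -> bool:
    """True if [host_start, host_start + num_tokens) fits on the token axis."""
    capacity = shape[token_axis(is_mla)]
    return 0 <= host_start and host_start + num_tokens <= capacity
```

With an MLA buffer of shape (4, 1024, 1, 576), checking against shape[2] would compare every offset to 1, which is why all MLA layerwise operations bailed out before the fix.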

…re-write race
The CPU-only PUT path (commit 6452176) inserted CPU blocks into the radix tree with is_ready=True before the caller had filled data into those blocks. This created a window where concurrent batch_get_v1 could match and read partially-written or uninitialized blocks.
Root cause: _put_impl_local_cpu_only() set is_ready=True at insert time, but put_cpu() returns cpu_block_ids for the caller to fill afterwards; the data is not yet present when the tree is updated.
Fix: insert with is_ready=False and return a data_ready_callback that the caller invokes after filling data. batch_set_v1 now calls this callback immediately after the data copy loop, before launch_cpu().
Also defer _process_empty_graph from put_cpu_match to launch_cpu to prevent premature task completion before data filling.
Signed-off-by: staryxchen <staryxchen@tencent.com>
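
The deferred-visibility idea can be sketched with a toy index. `ToyIndex` is a hypothetical stand-in for the radix tree (a flat dict keyed by block hash), and its `put_cpu` and `match` signatures are simplified assumptions rather than FlexKV's API.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class ToyIndex:
    """Hypothetical stand-in for the radix tree: block hash -> [block_id, is_ready]."""

    def __init__(self) -> None:
        self._entries: Dict[int, List] = {}
        self._next_block_id = 0

    def put_cpu(self, block_hashes: Sequence[int]) -> Tuple[List[int], Callable[[], None]]:
        hashes = list(block_hashes)
        block_ids = []
        for h in hashes:
            # Insert immediately, but keep the block invisible to readers.
            self._entries[h] = [self._next_block_id, False]
            block_ids.append(self._next_block_id)
            self._next_block_id += 1

        def data_ready_callback() -> None:
            # Called by the writer once the block contents are filled in.
            for h in hashes:
                self._entries[h][1] = True

        return block_ids, data_ready_callback

    def match(self, block_hashes: Sequence[int]) -> Tuple[int, int]:
        """Return (matched prefix length, ready prefix length)."""
        matched = ready = 0
        all_ready = True
        for h in block_hashes:
            entry = self._entries.get(h)
            if entry is None:
                break
            matched += 1
            all_ready = all_ready and entry[1]
            if all_ready:
                ready += 1
        return matched, ready
```

A reader that trusts only the ready prefix sees nothing between `put_cpu()` and the callback, which is the window the original code left open.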

Add 4 tests that verify the temporal correctness of the deferred block visibility fix:
1. test_cpu_put_blocks_not_ready_before_data_fill: Core regression test. After put_cpu(), the radix tree has blocks (num_matched_blocks > 0) but they are NOT visible to readers (num_ready_matched_blocks == 0) until data_ready_callback().
2. test_cpu_put_visibility_after_data_ready_callback: End-to-end. Manually drives put_cpu → fill data → data_ready_cb → launch_cpu, then verifies batch_get_v1 reads correct data.
3. test_batch_set_v1_makes_blocks_ready: Verifies batch_set_v1 (which calls data_ready_cb internally) leaves blocks in a fully ready state in the radix tree.
4. test_concurrent_get_during_put_sees_no_partial_data: Simulates a reader arriving while the writer is mid-fill; the reader must see nothing. After the fill completes, the reader sees correct data.
Signed-off-by: staryxchen <staryxchen@tencent.com>

- Add data_ready_callback step to PUT data flow diagram
- Update test count from 22 to 26 (4 new concurrency tests)
- Add verification checklist item for deferred block visibility
Signed-off-by: staryxchen <staryxchen@tencent.com>

…access
SGLang's HiCacheController calls batch_set_v1, batch_get_v1, and batch_exists from 3 independent threads (backup_thread, prefetch_thread, prefetch_io_aux) on the same storage backend instance without synchronization. FlexKV's radix tree, mempool, and CPU cache tensor are not thread-safe, so concurrent access can cause data corruption when batch_set_v1 triggers LRU eviction that recycles blocks being read by batch_get_v1.
Changes:
- Add threading.Lock (_engine_lock) to serialise cache-engine operations
- Remove .clone() in batch_get_v1 (no longer needed under lock)
- Update _get_block_shaped docstring to document lock requirement
The lock is a simple mutex, which is sufficient because SGLang uses only one backup_thread (writer) and the prefetch threads are effectively serialised. A future C++ layer change (std::shared_mutex in CRadixTreeIndex) can provide finer-grained concurrency.
Signed-off-by: staryxchen <staryxchen@tencent.com>
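
The serialisation strategy can be sketched with a toy store guarded by a single `threading.Lock`. `LockedToyStore` and its dict-backed storage are hypothetical; only the `_engine_lock` name and the three operation kinds come from the commit message, and the prefix-count return of `batch_exists` is an assumption.

```python
import threading
from typing import Dict, List, Optional, Sequence


class LockedToyStore:
    """Hypothetical backend whose set/get/exists all run under one mutex."""

    def __init__(self) -> None:
        self._engine_lock = threading.Lock()
        self._blocks: Dict[str, List[int]] = {}

    def batch_set(self, keys: Sequence[str], values: Sequence[List[int]]) -> None:
        with self._engine_lock:
            # The whole batch is written before any reader can observe it.
            for key, value in zip(keys, values):
                self._blocks[key] = list(value)

    def batch_get(self, keys: Sequence[str]) -> List[Optional[List[int]]]:
        with self._engine_lock:
            return [list(self._blocks[k]) if k in self._blocks else None for k in keys]

    def batch_exists(self, keys: Sequence[str]) -> int:
        # Assumption: the result is the number of leading keys that are present.
        with self._engine_lock:
            hits = 0
            for key in keys:
                if key not in self._blocks:
                    break
                hits += 1
            return hits
```

Because every operation takes the same lock, a reader never observes a batch that a writer has only half applied, at the cost of no parallelism between the three threads.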

- Add "Thread Safety" section explaining SGLang's 3-thread model and the adapter's _engine_lock serialisation strategy
- Update data flow diagrams to annotate lock holding
- Note that batch_get_v1 reads block views directly (no clone)
- Add thread safety item to verification checklist
Signed-off-by: staryxchen <staryxchen@tencent.com>

…ompat
Address three review issues from PR taco-project#137:
1. Add singleton guard to FlexKVHiCacheStorage._init_kv_manager() to prevent multiple instances from silently corrupting process-global env vars (FLEXKV_CPU_ONLY / FLEXKV_INSTANCE_NUM). Raises RuntimeError if a second instance tries to start a KVManager. The guard is released via shutdown() so the slot can be reused.
2. Release _engine_lock before remote P2P fetch in batch_get_v1(). The distributed path previously held the lock during the entire network round-trip (up to prefetch_timeout=5s), blocking all other SGLang threads. Now uses a 3-phase approach: brief lock for distributed discovery, lock-free remote fetch, then re-acquire for local read.
3. Replace Python 3.10+ type hints (str | None, list[str]) with typing.Optional[str] and typing.List[str] in patch_sglang.py to match setup.py's python_requires=">=3.6".
Signed-off-by: staryxchen <staryxchen@tencent.com>
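
The 3-phase approach from item 2 can be sketched as follows. `ThreePhaseReader` and the injected `fetch_remote` callable are hypothetical stand-ins for the adapter and its P2P fetch; the real distributed discovery and transfer are not modelled.

```python
import threading
from typing import Callable, Dict, List, Optional


class ThreePhaseReader:
    """Hypothetical reader that never holds the lock across a network call."""

    def __init__(self, fetch_remote: Callable[[List[str]], Dict[str, bytes]]) -> None:
        self._engine_lock = threading.Lock()
        self._local: Dict[str, bytes] = {}
        self._fetch_remote = fetch_remote  # stand-in for the remote P2P fetch

    def batch_get(self, keys: List[str]) -> List[Optional[bytes]]:
        # Phase 1: brief lock to discover which keys are missing locally.
        with self._engine_lock:
            missing = [k for k in keys if k not in self._local]

        # Phase 2: remote fetch with the lock released, so other threads proceed.
        fetched = self._fetch_remote(missing) if missing else {}

        # Phase 3: re-acquire the lock to install fetched blocks and read locally.
        with self._engine_lock:
            for key, value in fetched.items():
                # Another thread may have stored the block meanwhile; keep its copy.
                self._local.setdefault(key, value)
            return [self._local.get(k) for k in keys]
```

The trade-off is that state can change between phases 1 and 3, so phase 3 has to tolerate blocks that appeared in the meantime instead of assuming the phase 1 snapshot still holds.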

staryxchen added 4 commits April 4, 2026 17:33

docs: add SGLang integration documentation

6aee377

staryxchen mentioned this pull request Apr 5, 2026

[feature] Add SGLang HiCacheStorage integration with CPU-only PUT path #136

Closed

5 tasks

linhu-nv requested review from linhu-nv and zhuofan1123 April 6, 2026 05:44

staryxchen added 2 commits April 7, 2026 13:47

refactor(logging): replace custom logger with shared flexkv_logger

81e424e

feat(sglang): add metrics support for SGLang adapter

ad6c6f0

staryxchen force-pushed the feat/sglang-distributed-mode branch 4 times, most recently from be22913 to cebd369 Compare April 7, 2026 12:55

staryxchen added 2 commits April 7, 2026 21:22

staryxchen force-pushed the feat/sglang-distributed-mode branch 2 times, most recently from cd2ea62 to d7ec982 Compare April 7, 2026 14:14

staryxchen force-pushed the feat/sglang-distributed-mode branch from d7ec982 to 35aced5 Compare April 7, 2026 14:15

staryxchen added 3 commits April 8, 2026 15:56

staryxchen force-pushed the feat/sglang-distributed-mode branch from 2a55ff6 to 0045b24 Compare April 8, 2026 14:02

docs(sglang): sync README with actual implementation

0397861

staryxchen force-pushed the feat/sglang-distributed-mode branch from 6d5dc08 to 0397861 Compare April 9, 2026 06:38

staryxchen added 4 commits April 9, 2026 17:13

staryxchen added 3 commits April 10, 2026 10:49

staryxchen wants to merge 21 commits into taco-project:main from staryxchen:feat/sglang-distributed-mode
