[feature] Add SGLang HiCacheStorage integration with local and distributed modes#137
Open
staryxchen wants to merge 21 commits into
Add a new put_cpu() API that allows writing KV data from external CPU memory into FlexKV's CPU cache without requiring GPU block registration.
Core changes:
- cache_engine.py: add put_cpu() and _put_impl_local_cpu_only(), which allocate CPU blocks, insert them into the radix tree, and optionally create H2DISK transfer ops for SSD persistence
- kvtask.py: add put_cpu_match() to KVTaskEngine; detect the FLEXKV_CPU_ONLY env var to use the thread-mode TransferManager (shared address space)
- kvmanager.py: add put_cpu(), launch_cpu(), and get_cpu_cache_tensor() public methods for the CPU-only workflow
This enables SGLang HiCacheStorage integration where data lives in Host memory and no GPU slots are available.
Signed-off-by: staryxchen <staryxchen@tencent.com>
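As a rough illustration of the env-var detection mentioned for kvtask.py, the sketch below picks a transfer mode from `FLEXKV_CPU_ONLY`. The helper name `use_thread_mode`, the `"process"` fallback label, and the rule that only the literal value `"1"` enables the flag are assumptions for illustration, not the PR's actual code.

```python
import os


def use_thread_mode(environ=None):
    """Return True when the CPU-only workflow should run in thread mode.

    Hypothetical helper: the real detection lives in kvtask.py and may
    accept other values than the literal "1" assumed here.
    """
    env = os.environ if environ is None else environ
    return env.get("FLEXKV_CPU_ONLY", "0") == "1"


# Passing an explicit mapping keeps the example independent of the real
# process environment.
transfer_mode = "thread" if use_thread_mode({"FLEXKV_CPU_ONLY": "1"}) else "process"
default_mode = "thread" if use_thread_mode({}) else "process"
```

Thread mode matters here because the caller fills CPU blocks directly, which only works when the TransferManager shares the caller's address space.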
Implement FlexKVHiCacheStorage adapter for SGLang's HiRadixCache, using KVManager in thread mode (CPU-only PUT path).
Adapter features:
- batch_set_v1: write KV data from SGLang Host pool to FlexKV CPU cache via put_cpu(), with layout transform and dedup handling
- batch_get_v1: read from FlexKV CPU cache, transform layout, write back to SGLang Host pool
- batch_exists: query CPU cache engine radix tree directly
- Auto-detect model params (num_layers, num_kv_heads, head_size) from SGLang's mem_pool_host at runtime
- Support all SGLang layouts (layer_first, page_first, etc.)
Unit tests (8 cases):
- Token ID extraction (plain list, RadixKey, None)
- Adapter initialization, MLA skip, graceful degradation
- Full set -> exists -> get round-trip with data validation
- Deduplication correctness, statistics collection
Signed-off-by: staryxchen <staryxchen@tencent.com>
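The three token-ID extraction cases listed in the unit tests (plain list, RadixKey, None) could be handled by a helper along these lines. This is a minimal sketch: `extract_token_ids` is a hypothetical name, and the `RadixKey` class below is a stand-in that models only a `token_ids` attribute, not SGLang's real class.

```python
from typing import Any, List, Optional


class RadixKey:
    """Stand-in for SGLang's RadixKey; only token_ids is modelled here."""

    def __init__(self, token_ids):
        self.token_ids = token_ids


def extract_token_ids(key: Any) -> Optional[List[int]]:
    """Normalise the supported key shapes to a plain list of ints."""
    if key is None:
        return None
    # Objects carrying token_ids are unwrapped; anything else is treated
    # as an iterable of token IDs.
    token_ids = getattr(key, "token_ids", key)
    return [int(t) for t in token_ids]


from_list = extract_token_ids([1, 2, 3])
from_key = extract_token_ids(RadixKey([4, 5]))
from_none = extract_token_ids(None)
```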
- docs/sglang_adapter/README.md: architecture overview, data flow, configuration guide, known limitations, verification checklist
Signed-off-by: staryxchen <staryxchen@tencent.com>
…ia P2P
Add distributed mode to the SGLang HiCacheStorage adapter, enabling multi-Prefill KV Cache sharing through FlexKV's Distributed RadixTree and Mooncake Transfer Engine P2P transfers.
Key changes:
- batch_exists: query distributed index (match_all) to discover remote blocks
- batch_get_v1: fetch remote blocks via prefetch_async + wait, then read locally
- batch_set_v1: write to local CPU cache with metadata published to Redis GMS
- Add _fetch_remote_blocks() for synchronous P2P block retrieval
- Add remote fetch statistics (get_remote_fetches/successes/failures)
- Add distributed mode unit tests (7 mode configuration test cases)
- Update README with distributed mode architecture and configuration
- Add test script and Mooncake config examples for same-host verification
Verified: dual-Prefill same-host E2E test passes with 231 blocks (3696 tokens) fetched cross-node at 4.12 GB/s via TCP P2P transfer.
Signed-off-by: staryxchen <staryxchen@tencent.com>
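The distributed GET path described above (find missing blocks, fetch them from peers, then read locally while counting fetch outcomes) can be sketched with plain dictionaries. `ToyNode` is hypothetical, the peer index is a dict rather than the Distributed RadixTree, and the counter names assume that "get_remote_fetches/successes/failures" expands to three `get_remote_*` keys.

```python
class ToyNode:
    """Toy model of one Prefill node; not the adapter's real classes."""

    def __init__(self, remote_index):
        self.local = {}                    # block key -> payload held locally
        self.remote_index = remote_index   # block key -> payload held by peers
        self.stats = {
            "get_remote_fetches": 0,
            "get_remote_successes": 0,
            "get_remote_failures": 0,
        }

    def _fetch_remote_blocks(self, keys):
        """Stand-in for the synchronous P2P retrieval step."""
        for key in keys:
            self.stats["get_remote_fetches"] += 1
            if key in self.remote_index:
                self.local[key] = self.remote_index[key]
                self.stats["get_remote_successes"] += 1
            else:
                self.stats["get_remote_failures"] += 1

    def batch_get(self, keys):
        missing = [k for k in keys if k not in self.local]
        if missing:
            self._fetch_remote_blocks(missing)
        # After the fetch, every read is served from the local cache.
        return [self.local.get(k) for k in keys]


node = ToyNode(remote_index={"b": b"B"})
node.local["a"] = b"A"
result = node.batch_get(["a", "b", "c"])
```

Here `"a"` is a local hit, `"b"` is pulled from a peer, and `"c"` is a miss that is counted as a remote failure.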
Remove redundant .clone() before layout transform (permute+contiguous already copies), eliminate duplicate SequenceMeta allocation in distributed GET path, consolidate _stats lock acquisitions from 5+ to 2 per batch_get_v1 call, and hoist loop-invariant num_deduped computation.
Signed-off-by: staryxchen <staryxchen@tencent.com>
- Remove custom logging setup and imports
- Import shared logger from flexkv.common.debug module
Signed-off-by: staryxchen <staryxchen@tencent.com>
- Integrate metrics recording for cache operations in SGLang integration
Signed-off-by: staryxchen <staryxchen@tencent.com>
be22913 to cebd369
- Auto-detect kv_lora_rank + qk_rope_head_dim from mem_pool_host
- Handle MLA 4D layout transforms (L,T,1,D) <-> (L,1,T,1,D)
- Move Initializing log after auto-detect to show correct values
- Add MLA unit tests for auto-detect, layout transform, set/get/dedup
Signed-off-by: staryxchen <staryxchen@tencent.com>
get_stats() previously returned a custom Dict, which failed SGLang's `assert isinstance(storage_metrics, StorageMetrics)` check. Replace it with the standard StorageMetrics dataclass (prefetch_pgs, backup_pgs, prefetch_bandwidth, backup_bandwidth) using the sliding-window pattern consistent with the HF3FS and MooncakeStore backends. Uses a try/except import to support both sglang.srt.metrics.collector (older versions) and sglang.srt.observability.metrics_collector (newer).
Signed-off-by: staryxchen <staryxchen@tencent.com>
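The version-tolerant import described above might look roughly like the following. The two module paths come from the commit message; the final fallback dataclass is an assumption added only so the sketch runs without SGLang installed, and its list-typed fields are a guess at the real definition.

```python
from dataclasses import dataclass, field, fields
from typing import List

try:
    # Older SGLang versions (path taken from the commit message).
    from sglang.srt.metrics.collector import StorageMetrics
except ImportError:
    try:
        # Newer SGLang versions (path taken from the commit message).
        from sglang.srt.observability.metrics_collector import StorageMetrics
    except ImportError:
        # Assumption: stand-in used only when SGLang is not importable.
        @dataclass
        class StorageMetrics:
            prefetch_pgs: List[int] = field(default_factory=list)
            backup_pgs: List[int] = field(default_factory=list)
            prefetch_bandwidth: List[float] = field(default_factory=list)
            backup_bandwidth: List[float] = field(default_factory=list)


# Whichever definition was picked up, it should expose the four fields
# named in the commit message.
field_names = [f.name for f in fields(StorageMetrics)]
```

`ModuleNotFoundError` is a subclass of `ImportError`, so a single `except ImportError` covers both a missing package and a moved module.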
cd2ea62 to d7ec982
…k shape
- Extract mode literals ("local"/"distributed") into module-level constants
MODE_LOCAL, MODE_DISTRIBUTED, _VALID_MODES to prevent typo-induced bugs
- Extract error operation labels ("get"/"set"/"exists") into _OP_GET,
_OP_SET, _OP_EXISTS constants for consistent Prometheus label usage
- Cache block shape tuple as self._block_shape at init time instead of
recomputing kv_dim and constructing the shape on every _get_block_shaped()
call (hot path in batch_get_v1 per-block loop)
- Remove dead field _started (set but never read)
- Update tests to import and use the new constants
Signed-off-by: staryxchen <staryxchen@tencent.com>
d7ec982 to 35aced5
Refactor batch_get_v1/batch_set_v1 from whole-page copy (permute+flatten) to per-layer copy (direct kv_buffer slice write).
Benefits even without layerwise pipeline enabled:
- Eliminates permute(1,0,2,3,4).contiguous() temp allocation (~18MB/page)
- Zero-copy view via block_data[layer_id] instead of full tensor copy
- Direct kv_buffer write bypasses set_from_flat_data_page overhead
New methods: _write_layer_to_host(), _read_layer_from_host()
Optional layer_ready_callback for future pipeline integration.
Non-layer_first layouts fall back to original whole-page path.
Signed-off-by: staryxchen <staryxchen@tencent.com>
…ption
SGLang runs backup (batch_set_v1) and prefetch (batch_get_v1) on separate threads, both accessing the shared CPU cache tensor. Without protection, a concurrent SET can overwrite a block that GET is reading, producing corrupted KV data that causes CUDA illegal memory access on the GPU side.
Fix: clone block data at the two batch_get_v1 call sites only. The SET path continues using views so writes go directly into the CPU cache tensor as intended.
Also add bounds checking in _get_block_view, _write_layer_to_host, and _read_layer_from_host to catch invalid block/host references early.
Provide a `flexkv-patch-sglang` command that auto-locates the SGLang install and applies the FlexKV integration patch (unified diff). Users no longer need to manually git-apply a patch file.
- Add patches/sglang_flexkv.patch (3 SGLang source files)
- Add patch_sglang.py with --check / --revert / --sglang-path
- Register console_scripts entry point in setup.py
- Include .patch files in package_data
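The command-line surface of the patch tool could be sketched with `argparse` as below. Only the three flag names and the command name come from the commit message; the help strings, the mutual exclusion of `--check` and `--revert`, and the `locate_sglang` helper are assumptions, and the actual patch application is omitted.

```python
import argparse
import importlib.util
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="flexkv-patch-sglang")
    # Assumption: checking and reverting are treated as exclusive actions.
    action = parser.add_mutually_exclusive_group()
    action.add_argument("--check", action="store_true",
                        help="report whether the patch is applied, without changing files")
    action.add_argument("--revert", action="store_true",
                        help="undo a previously applied patch")
    parser.add_argument("--sglang-path", default=None,
                        help="SGLang package directory; auto-located when omitted")
    return parser


def locate_sglang(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return the SGLang package directory, or None if it cannot be found."""
    if explicit_path:
        return explicit_path
    spec = importlib.util.find_spec("sglang")
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


args = build_parser().parse_args(["--check", "--sglang-path", "/tmp/sglang"])
target = locate_sglang(args.sglang_path)
```

The type hints use `typing.Optional` rather than `str | None`, which keeps the sketch usable on the older Python versions this PR targets.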
2a55ff6 to 0045b24
Update data flow, layout transform, file tables, parameter docs, and test counts to match the current adapter code.
6d5dc08 to 0397861
_write_layer_to_host and _read_layer_from_host checked host_start against kv_buf.shape[2], which is the token axis for MHA (2,L,T,H,D) but the num_kv_heads axis (size 1) for MLA (L,T,1,D). This caused all MLA layerwise operations to silently bail out on the OOB guard. Fix: use shape[1] for MLA (token axis) and shape[2] for MHA.
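The axis mix-up is easy to reproduce with shape tuples alone. The sketch below assumes the two layouts named above, MHA `(2, L, T, H, D)` and MLA `(L, T, 1, D)`; the helper names and the concrete sizes are hypothetical.

```python
def token_axis(is_mla: bool) -> int:
    """Index of the token dimension in the host kv_buffer shape."""
    # MLA: (L, T, 1, D) keeps tokens on axis 1.
    # MHA: (2, L, T, H, D) keeps tokens on axis 2.
    return 1 if is_mla else 2


def host_range_in_bounds(kv_buf_shape, host_start, num_tokens, is_mla):
    """Bounds guard against the token axis that matches the layout."""
    capacity = kv_buf_shape[token_axis(is_mla)]
    return host_start >= 0 and host_start + num_tokens <= capacity


mla_shape = (32, 4096, 1, 576)       # illustrative sizes only
mha_shape = (2, 32, 4096, 8, 128)

# The old guard compared against shape[2], which is 1 for MLA, so almost
# every valid offset was rejected.
old_guard_rejects = 64 >= mla_shape[2]
new_guard_accepts = host_range_in_bounds(mla_shape, 64, 16, is_mla=True)
```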
…re-write race
The CPU-only PUT path (commit 6452176) inserted CPU blocks into the radix tree with is_ready=True before the caller had filled data into those blocks. This created a window where concurrent batch_get_v1 could match and read partially-written or uninitialized blocks.
Root cause: _put_impl_local_cpu_only() set is_ready=True at insert time, but put_cpu() returns cpu_block_ids for the caller to fill afterwards, so the data is not yet present when the tree is updated.
Fix: insert with is_ready=False and return a data_ready_callback that the caller invokes after filling data. batch_set_v1 now calls this callback immediately after the data copy loop, before launch_cpu().
Also defer _process_empty_graph from put_cpu_match to launch_cpu to prevent premature task completion before data filling.
Signed-off-by: staryxchen <staryxchen@tencent.com>
Add 4 tests that verify the temporal correctness of the deferred block visibility fix:
1. test_cpu_put_blocks_not_ready_before_data_fill
   Core regression test: after put_cpu(), the radix tree has blocks (num_matched_blocks > 0) but they are NOT visible to readers (num_ready_matched_blocks == 0) until data_ready_callback() is invoked.
2. test_cpu_put_visibility_after_data_ready_callback
   End-to-end: manually drives put_cpu → fill data → data_ready_cb → launch_cpu, then verifies batch_get_v1 reads correct data.
3. test_batch_set_v1_makes_blocks_ready
   Verifies batch_set_v1 (which calls data_ready_cb internally) leaves blocks in a fully ready state in the radix tree.
4. test_concurrent_get_during_put_sees_no_partial_data
   Simulates a reader arriving while the writer is mid-fill; the reader must see nothing. After the fill completes, the reader sees correct data.
Signed-off-by: staryxchen <staryxchen@tencent.com>
- Add data_ready_callback step to PUT data flow diagram
- Update test count from 22 to 26 (4 new concurrency tests)
- Add verification checklist item for deferred block visibility
Signed-off-by: staryxchen <staryxchen@tencent.com>
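The deferred-visibility idea can be shown with a toy index in which blocks are inserted as not ready and flipped by a callback once the caller has filled them. `ToyCpuCache` is hypothetical and uses a flat dict instead of a radix tree; only the names `put_cpu`, `data_ready_callback`, `num_matched_blocks`, and `num_ready_matched_blocks` echo the PR text.

```python
class ToyCpuCache:
    """Flat-dict stand-in for the radix tree plus CPU block pool."""

    def __init__(self):
        self._ready = {}      # block hash -> is_ready flag
        self._block_of = {}   # block hash -> CPU block id
        self.storage = {}     # CPU block id -> payload

    def put_cpu(self, block_hashes):
        block_ids = []
        for h in block_hashes:
            block_id = len(self._block_of)
            self._block_of[h] = block_id
            self._ready[h] = False          # inserted, but hidden from readers
            block_ids.append(block_id)

        def data_ready_callback():
            for h in block_hashes:
                self._ready[h] = True       # publish only after the fill

        return block_ids, data_ready_callback

    def _count_prefix(self, hashes, ready_only):
        count = 0
        for h in hashes:
            if h not in self._ready or (ready_only and not self._ready[h]):
                break
            count += 1
        return count

    def num_matched_blocks(self, hashes):
        return self._count_prefix(hashes, ready_only=False)

    def num_ready_matched_blocks(self, hashes):
        return self._count_prefix(hashes, ready_only=True)


cache = ToyCpuCache()
hashes = ["h0", "h1"]
block_ids, data_ready_cb = cache.put_cpu(hashes)
matched_before = cache.num_matched_blocks(hashes)
ready_before = cache.num_ready_matched_blocks(hashes)
for block_id in block_ids:
    cache.storage[block_id] = b"kv-data"    # caller fills the blocks
data_ready_cb()
ready_after = cache.num_ready_matched_blocks(hashes)
```

Between `put_cpu()` and the callback, a reader that only trusts ready blocks sees nothing, which is the window the original code left open.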
…access
SGLang's HiCacheController calls batch_set_v1, batch_get_v1, and batch_exists from 3 independent threads (backup_thread, prefetch_thread, prefetch_io_aux) on the same storage backend instance without synchronization. FlexKV's radix tree, mempool, and CPU cache tensor are not thread-safe, so concurrent access can cause data corruption when batch_set_v1 triggers LRU eviction that recycles blocks being read by batch_get_v1.
Changes:
- Add threading.Lock (_engine_lock) to serialise cache-engine operations
- Remove .clone() in batch_get_v1 (no longer needed under lock)
- Update _get_block_shaped docstring to document lock requirement
The lock is a simple mutex, which is sufficient because SGLang uses only one backup_thread (writer) and the prefetch threads are effectively serialised. A future C++ layer change (std::shared_mutex in CRadixTreeIndex) can provide finer-grained concurrency.
Signed-off-by: staryxchen <staryxchen@tencent.com>
- Add "Thread Safety" section explaining SGLang's 3-thread model and the adapter's _engine_lock serialisation strategy
- Update data flow diagrams to annotate lock holding
- Note that batch_get_v1 reads block views directly (no clone)
- Add thread safety item to verification checklist
Signed-off-by: staryxchen <staryxchen@tencent.com>
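A toy version of the single-mutex strategy, with three threads standing in for SGLang's backup, prefetch, and query callers. `ToyStorage` and its simplified method signatures are assumptions; the real `batch_*_v1` methods take SGLang-specific arguments, and `batch_exists` returning the number of leading keys present is a guess at the interface.

```python
import threading


class ToyStorage:
    """Dict-backed stand-in guarded by one lock, as in the commit above."""

    def __init__(self):
        self._engine_lock = threading.Lock()
        self._blocks = {}

    def batch_set_v1(self, keys, values):
        with self._engine_lock:
            for key, value in zip(keys, values):
                self._blocks[key] = value
            return [True] * len(keys)

    def batch_get_v1(self, keys):
        with self._engine_lock:
            return [self._blocks.get(key) for key in keys]

    def batch_exists(self, keys):
        with self._engine_lock:
            hits = 0
            for key in keys:
                if key not in self._blocks:
                    break
                hits += 1
            return hits


storage = ToyStorage()
keys = ["k%d" % i for i in range(100)]


def backup():
    for key in keys:
        storage.batch_set_v1([key], [key.upper()])


def prefetch():
    for key in keys:
        storage.batch_get_v1([key])


def query():
    for _ in keys:
        storage.batch_exists(keys)


threads = [threading.Thread(target=fn) for fn in (backup, prefetch, query)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_hits = storage.batch_exists(keys)
```

Every engine operation runs entirely inside the lock, so a reader can never observe a block while a writer is halfway through replacing it.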
…ompat
Address three review issues from PR taco-project#137:
1. Add singleton guard to FlexKVHiCacheStorage._init_kv_manager() to prevent multiple instances from silently corrupting process-global env vars (FLEXKV_CPU_ONLY / FLEXKV_INSTANCE_NUM). Raises RuntimeError if a second instance tries to start a KVManager. The guard is released via shutdown() so the slot can be reused.
2. Release _engine_lock before remote P2P fetch in batch_get_v1(). The distributed path previously held the lock during the entire network round-trip (up to prefetch_timeout=5s), blocking all other SGLang threads. Now uses a 3-phase approach: brief lock for distributed discovery, lock-free remote fetch, then re-acquire for local read.
3. Replace Python 3.10+ type hints (str | None, list[str]) with typing.Optional[str] and typing.List[str] in patch_sglang.py to match setup.py's python_requires=">=3.6".
Signed-off-by: staryxchen <staryxchen@tencent.com>
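The 3-phase locking from item 2 can be sketched as follows. `ToyDistributedReader` is hypothetical: the remote side is a plain dict, and the fetch simply records whether the lock was held so the lock-free phase can be observed.

```python
import threading


class ToyDistributedReader:
    """Illustrates lock, release for network I/O, then re-lock."""

    def __init__(self, remote_blocks):
        self._engine_lock = threading.Lock()
        self._local = {}
        self._remote = dict(remote_blocks)
        self.lock_held_during_fetch = None

    def _fetch_remote(self, keys):
        # Stand-in for the P2P round-trip; records the lock state so the
        # example can show that no lock is held while "on the network".
        self.lock_held_during_fetch = self._engine_lock.locked()
        return {k: self._remote[k] for k in keys if k in self._remote}

    def batch_get(self, keys):
        # Phase 1: brief lock to discover which blocks are missing locally.
        with self._engine_lock:
            missing = [k for k in keys if k not in self._local]

        # Phase 2: remote fetch without the lock, so other threads proceed.
        fetched = self._fetch_remote(missing) if missing else {}

        # Phase 3: re-acquire the lock to install the blocks and read.
        with self._engine_lock:
            self._local.update(fetched)
            return [self._local.get(k) for k in keys]


reader = ToyDistributedReader({"x": 1})
values = reader.batch_get(["x", "y"])
```

The trade-off is that local state may change between phases 1 and 3, so the final read has to tolerate blocks that appeared or were evicted in the meantime.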
Summary
- `put_cpu()` API enabling a CPU-only PUT path without GPU block registration
- `_engine_lock` serialises concurrent SGLang threads (backup/prefetch/query)

Changes

Core: CPU-only PUT Path
- `cache_engine.py`: `put_cpu()` allocates CPU blocks, inserts them into the radix tree with `is_ready=False`, and optionally creates H2DISK ops
- `kvtask.py`: `put_cpu_match()`; detects `FLEXKV_CPU_ONLY=1` for thread mode
- `kvmanager.py`: `put_cpu()`, `launch_cpu()`, `get_cpu_cache_tensor()`

SGLang Adapter (`flexkv/integration/sglang/`)
- `hicache_storage_adapter.py`: `FlexKVHiCacheStorage` implementing the SGLang `HiCacheStorage` interface
- `batch_set_v1`: Host -> layout transform -> FlexKV CPU cache -> optional SSD
- `batch_get_v1`: FlexKV CPU cache -> layout transform -> Host pool (+ cross-node P2P in distributed mode)
- `batch_exists`: Radix tree query (local + distributed via `match_all()`)
- `_engine_lock`: Thread safety for concurrent SGLang backup/prefetch/query threads
- Model parameters auto-detected from `mem_pool_host`
- `patch_sglang.py`: One-click SGLang patch tool (`flexkv-patch-sglang` CLI)
- `test_hicache_storage_adapter.py`: 26 unit tests

Documentation
- `docs/sglang_adapter/README.md`: Architecture, thread safety, data flow, configuration

Test Plan
- `_engine_lock` serialises batch_set_v1/batch_get_v1/batch_exists