[feature] Add SGLang HiCacheStorage integration with local and distributed modes #137

Open
staryxchen wants to merge 21 commits into taco-project:main from staryxchen:feat/sglang-distributed-mode


@staryxchen staryxchen commented Apr 5, 2026

Summary

  • Add FlexKV as a storage backend for SGLang's HiRadixCache (L3 tier: GPU -> Host -> FlexKV)
  • Support local mode (single-node CPU cache + optional SSD) and distributed mode (multi-Prefill KV Cache sharing via Distributed RadixTree + Redis GMS + Mooncake P2P transfer)
  • New put_cpu() API enabling CPU-only PUT path without GPU block registration
  • Thread-safe adapter: _engine_lock serialises concurrent SGLang threads (backup/prefetch/query)

Changes

Core: CPU-only PUT Path

  • cache_engine.py: put_cpu() — allocate CPU blocks, insert into radix tree with is_ready=False, optionally create H2DISK ops
  • kvtask.py: put_cpu_match(), detect FLEXKV_CPU_ONLY=1 for thread mode
  • kvmanager.py: put_cpu(), launch_cpu(), get_cpu_cache_tensor()

SGLang Adapter (flexkv/integration/sglang/)

  • hicache_storage_adapter.py: FlexKVHiCacheStorage implementing SGLang HiCacheStorage interface
    • batch_set_v1: Host -> layout transform -> FlexKV CPU cache -> optional SSD
    • batch_get_v1: FlexKV CPU cache -> layout transform -> Host pool (+ cross-node P2P in distributed mode)
    • batch_exists: Radix tree query (local + distributed via match_all())
    • _engine_lock: Thread safety for concurrent SGLang backup/prefetch/query threads
    • Auto-detects model params from SGLang mem_pool_host
  • patch_sglang.py: One-click SGLang patch tool (flexkv-patch-sglang CLI)
  • test_hicache_storage_adapter.py: 26 unit tests

Documentation

  • docs/sglang_adapter/README.md: Architecture, thread safety, data flow, configuration

Test Plan

  • Unit tests: 26/26 pass (token extraction, init, MLA, round-trip, dedup, distributed mode, concurrency)
  • SGLang E2E: backup/prefetch cycle verified, GSM8K accuracy consistent
  • Multi-GPU: TP=4, auto-detected per-rank params
  • SSD persistence: H2DISK io_uring transfer verified
  • PD disaggregation: Prefill (TP=2) + Decode (TP=2), 3500+ requests at 100% success rate
  • Distributed mode: cross-node GET via Mooncake P2P transfer verified
  • Thread safety: _engine_lock serialises batch_set_v1/batch_get_v1/batch_exists

Add a new put_cpu() API that allows writing KV data from external CPU
memory into FlexKV's CPU cache without requiring GPU block registration.

Core changes:
- cache_engine.py: add put_cpu() and _put_impl_local_cpu_only() that
  allocate CPU blocks, insert into radix tree, and optionally create
  H2DISK transfer ops for SSD persistence
- kvtask.py: add put_cpu_match() to KVTaskEngine; detect FLEXKV_CPU_ONLY
  env var to use thread mode TransferManager (shared address space)
- kvmanager.py: add put_cpu(), launch_cpu(), get_cpu_cache_tensor()
  public methods for the CPU-only workflow

This enables SGLang HiCacheStorage integration where data lives in Host
memory and no GPU slots are available.

Signed-off-by: staryxchen <staryxchen@tencent.com>
Implement FlexKVHiCacheStorage adapter for SGLang's HiRadixCache,
using KVManager in thread mode (CPU-only PUT path).

Adapter features:
- batch_set_v1: write KV data from SGLang Host pool to FlexKV CPU
  cache via put_cpu(), with layout transform and dedup handling
- batch_get_v1: read from FlexKV CPU cache, transform layout, write
  back to SGLang Host pool
- batch_exists: query CPU cache engine radix tree directly
- Auto-detect model params (num_layers, num_kv_heads, head_size)
  from SGLang's mem_pool_host at runtime
- Support all SGLang layouts (layer_first, page_first, etc.)

Unit tests (8 cases):
- Token ID extraction (plain list, RadixKey, None)
- Adapter initialization, MLA skip, graceful degradation
- Full set -> exists -> get round-trip with data validation
- Deduplication correctness, statistics collection

Signed-off-by: staryxchen <staryxchen@tencent.com>
- docs/sglang_adapter/README.md: architecture overview, data flow,
  configuration guide, known limitations, verification checklist

Signed-off-by: staryxchen <staryxchen@tencent.com>
…ia P2P

Add distributed mode to the SGLang HiCacheStorage adapter, enabling
multi-Prefill KV Cache sharing through FlexKV's Distributed RadixTree
and Mooncake Transfer Engine P2P transfers.

Key changes:
- batch_exists: query distributed index (match_all) to discover remote blocks
- batch_get_v1: fetch remote blocks via prefetch_async + wait, then read locally
- batch_set_v1: write to local CPU cache with metadata published to Redis GMS
- Add _fetch_remote_blocks() for synchronous P2P block retrieval
- Add remote fetch statistics (get_remote_fetches/successes/failures)
- Add distributed mode unit tests (7 mode configuration test cases)
- Update README with distributed mode architecture and configuration
- Add test script and Mooncake config examples for same-host verification

Verified: dual-Prefill same-host E2E test passes with 231 blocks (3696 tokens)
fetched cross-node at 4.12 GB/s via TCP P2P transfer.

Signed-off-by: staryxchen <staryxchen@tencent.com>
Remove redundant .clone() before layout transform (permute+contiguous
already copies), eliminate duplicate SequenceMeta allocation in
distributed GET path, consolidate _stats lock acquisitions from 5+ to 2
per batch_get_v1 call, and hoist loop-invariant num_deduped computation.

Signed-off-by: staryxchen <staryxchen@tencent.com>
@linhu-nv linhu-nv requested review from linhu-nv and zhuofan1123 April 6, 2026 05:44
- Remove custom logging setup and imports
- Import shared logger from flexkv.common.debug module

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Integrate metrics recording for cache operations in SGLang integration

Signed-off-by: staryxchen <staryxchen@tencent.com>
@staryxchen staryxchen force-pushed the feat/sglang-distributed-mode branch 4 times, most recently from be22913 to cebd369 Compare April 7, 2026 12:55
- Auto-detect kv_lora_rank + qk_rope_head_dim from mem_pool_host
- Handle MLA 4D layout transforms (L,T,1,D) <-> (L,1,T,1,D)
- Move Initializing log after auto-detect to show correct values
- Add MLA unit tests for auto-detect, layout transform, set/get/dedup

Signed-off-by: staryxchen <staryxchen@tencent.com>
get_stats() previously returned a custom Dict which failed SGLang's
`assert isinstance(storage_metrics, StorageMetrics)` check. Replace
with the standard StorageMetrics dataclass (prefetch_pgs, backup_pgs,
prefetch_bandwidth, backup_bandwidth) using the sliding-window pattern
consistent with HF3FS and MooncakeStore backends.

Uses try/except import to support both sglang.srt.metrics.collector
(older versions) and sglang.srt.observability.metrics_collector (newer).
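
The import fallback described above can be sketched as follows. The two module paths come from this commit message; the order in which they are tried and the final stand-in dataclass are assumptions added so the sketch runs without SGLang installed, and the field types in the stand-in are guesses.

```python
from dataclasses import dataclass, field
from typing import List

try:
    # Older SGLang layout named in the commit message.
    from sglang.srt.metrics.collector import StorageMetrics
except ImportError:
    try:
        # Newer SGLang layout named in the commit message.
        from sglang.srt.observability.metrics_collector import StorageMetrics
    except ImportError:
        # Hypothetical stand-in (not SGLang's class) so the sketch is runnable
        # on a machine without SGLang. Field names follow the commit message.
        @dataclass
        class StorageMetrics:
            prefetch_pgs: List[int] = field(default_factory=list)
            backup_pgs: List[int] = field(default_factory=list)
            prefetch_bandwidth: List[float] = field(default_factory=list)
            backup_bandwidth: List[float] = field(default_factory=list)
```

Whichever branch succeeds, the rest of the module refers to a single `StorageMetrics` name, so `get_stats()` does not need to know which SGLang version is installed.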

Signed-off-by: staryxchen <staryxchen@tencent.com>
@staryxchen staryxchen force-pushed the feat/sglang-distributed-mode branch 2 times, most recently from cd2ea62 to d7ec982 Compare April 7, 2026 14:14
…k shape

- Extract mode literals ("local"/"distributed") into module-level constants
  MODE_LOCAL, MODE_DISTRIBUTED, _VALID_MODES to prevent typo-induced bugs
- Extract error operation labels ("get"/"set"/"exists") into _OP_GET,
  _OP_SET, _OP_EXISTS constants for consistent Prometheus label usage
- Cache block shape tuple as self._block_shape at init time instead of
  recomputing kv_dim and constructing the shape on every _get_block_shaped()
  call (hot path in batch_get_v1 per-block loop)
- Remove dead field _started (set but never read)
- Update tests to import and use the new constants

Signed-off-by: staryxchen <staryxchen@tencent.com>
@staryxchen staryxchen force-pushed the feat/sglang-distributed-mode branch from d7ec982 to 35aced5 Compare April 7, 2026 14:15
Refactor batch_get_v1/batch_set_v1 from whole-page copy (permute+flatten)
to per-layer copy (direct kv_buffer slice write). The benefits apply even
when the layerwise pipeline is not enabled:
- Eliminates permute(1,0,2,3,4).contiguous() temp allocation (~18MB/page)
- Zero-copy view via block_data[layer_id] instead of full tensor copy
- Direct kv_buffer write bypasses set_from_flat_data_page overhead

New methods: _write_layer_to_host(), _read_layer_from_host()
Optional layer_ready_callback for future pipeline integration.
Non-layer_first layouts fall back to original whole-page path.

Signed-off-by: staryxchen <staryxchen@tencent.com>
…ption

SGLang runs backup (batch_set_v1) and prefetch (batch_get_v1) on
separate threads, both accessing the shared CPU cache tensor.
Without protection, a concurrent SET can overwrite a block that GET
is reading, producing corrupted KV data that causes CUDA illegal
memory access on the GPU side.

Fix: clone block data at the two batch_get_v1 call sites only.
The SET path continues using views so writes go directly into the
CPU cache tensor as intended.

Also add bounds checking in _get_block_view, _write_layer_to_host,
and _read_layer_from_host to catch invalid block/host references
early.
Provide a `flexkv-patch-sglang` command that auto-locates the SGLang
install and applies the FlexKV integration patch (unified diff).
Users no longer need to manually git-apply a patch file.

- Add patches/sglang_flexkv.patch (3 SGLang source files)
- Add patch_sglang.py with --check / --revert / --sglang-path
- Register console_scripts entry point in setup.py
- Include .patch files in package_data
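
For illustration, the command-line surface listed above could be parsed roughly like this. Only the program name and the three flag names come from this commit message; the help texts, defaults, and the `build_parser` helper are assumptions, not the contents of patch_sglang.py.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags named in the commit message."""
    parser = argparse.ArgumentParser(prog="flexkv-patch-sglang")
    parser.add_argument("--check", action="store_true",
                        help="report patch status without modifying files (assumed meaning)")
    parser.add_argument("--revert", action="store_true",
                        help="undo a previously applied patch (assumed meaning)")
    parser.add_argument("--sglang-path", default=None,
                        help="SGLang install directory; auto-located when omitted")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv, as a console script would.
    return build_parser().parse_args(argv)
```

The `typing.Optional` and `typing.List` annotations are used here instead of `str | None` and `list[str]` so the sketch stays importable on older Python versions.
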
@staryxchen staryxchen force-pushed the feat/sglang-distributed-mode branch from 2a55ff6 to 0045b24 Compare April 8, 2026 14:02
Update data flow, layout transform, file tables, parameter docs, and
test counts to match the current adapter code.
@staryxchen staryxchen force-pushed the feat/sglang-distributed-mode branch from 6d5dc08 to 0397861 Compare April 9, 2026 06:38
_write_layer_to_host and _read_layer_from_host checked host_start
against kv_buf.shape[2], which is the token axis for MHA (2,L,T,H,D)
but the num_kv_heads axis (size 1) for MLA (L,T,1,D). This caused
all MLA layerwise operations to silently bail out on the OOB guard.

Fix: use shape[1] for MLA (token axis) and shape[2] for MHA.
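
The axis selection can be sketched on plain shape tuples. The helper names and the example sizes are hypothetical; only the two layouts, MHA (2,L,T,H,D) and MLA (L,T,1,D), come from this commit message.

```python
from typing import Tuple


def token_axis(is_mla: bool) -> int:
    """Index of the token axis: 1 for MLA (L,T,1,D), 2 for MHA (2,L,T,H,D)."""
    return 1 if is_mla else 2


def host_range_in_bounds(kv_buf_shape: Tuple[int, ...], host_start: int,
                         num_tokens: int, is_mla: bool) -> bool:
    """True when [host_start, host_start + num_tokens) fits on the token axis."""
    limit = kv_buf_shape[token_axis(is_mla)]
    return 0 <= host_start and host_start + num_tokens <= limit


# Example sizes are made up for illustration.
mha_shape = (2, 32, 4096, 8, 128)
mla_shape = (32, 4096, 1, 576)
mha_ok = host_range_in_bounds(mha_shape, 4000, 64, is_mla=False)
mla_ok = host_range_in_bounds(mla_shape, 64, 16, is_mla=True)
# The old check compared against shape[2], which is 1 for MLA,
# so any range past token 0 was rejected.
old_mla_limit = mla_shape[2]
```
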
…re-write race

The CPU-only PUT path (commit 6452176) inserted CPU blocks into the
radix tree with is_ready=True before the caller had filled data into
those blocks.  This created a window where concurrent batch_get_v1
could match and read partially-written or uninitialized blocks.

Root cause: _put_impl_local_cpu_only() set is_ready=True at insert
time, but put_cpu() returns cpu_block_ids for the caller to fill
afterwards — the data is not yet present when the tree is updated.

Fix: insert with is_ready=False and return a data_ready_callback that
the caller invokes after filling data.  batch_set_v1 now calls this
callback immediately after the data copy loop, before launch_cpu().
Also defer _process_empty_graph from put_cpu_match to launch_cpu to
prevent premature task completion before data filling.
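
The deferred-visibility protocol can be sketched with a toy index. `TinyIndex` is a hypothetical stand-in for the radix tree and uses a plain dict; only the idea of inserting with is_ready=False and returning a data_ready_callback comes from this commit message, and the real put_cpu() signature may differ.

```python
import threading
from typing import Callable, Dict, List, Optional, Sequence, Tuple


class TinyIndex:
    """Toy index: key -> [block_id, is_ready]. Not FlexKV's radix tree."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, List] = {}
        self._next_block = 0

    def put_cpu(self, keys: Sequence[str]) -> Tuple[List[int], Callable[[], None]]:
        keys = list(keys)  # private copy so the callback is not affected by caller edits
        block_ids: List[int] = []
        with self._lock:
            for key in keys:
                block_id = self._next_block
                self._next_block += 1
                # Inserted but not yet visible to readers.
                self._entries[key] = [block_id, False]
                block_ids.append(block_id)

        def data_ready_callback() -> None:
            # Called by the writer only after the block data has been filled.
            with self._lock:
                for key in keys:
                    self._entries[key][1] = True

        return block_ids, data_ready_callback

    def match_ready(self, key: str) -> Optional[int]:
        """Return the block id only when the block is marked ready."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is None or not entry[1]:
                return None
            return entry[0]


index = TinyIndex()
storage: Dict[int, bytes] = {}
ids, ready = index.put_cpu(["page-0"])
seen_before_fill = index.match_ready("page-0")   # reader arrives mid-write
storage[ids[0]] = b"kv-bytes"                     # writer fills the block
ready()                                           # publish
seen_after_fill = index.match_ready("page-0")
```

A reader that arrives between `put_cpu()` and `ready()` gets no match, which is the window the original is_ready=True insert left open.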

Signed-off-by: staryxchen <staryxchen@tencent.com>
Add 4 tests that verify the temporal correctness of the deferred
block visibility fix:

1. test_cpu_put_blocks_not_ready_before_data_fill
   Core regression test: after put_cpu(), radix tree has blocks
   (num_matched_blocks > 0) but they are NOT visible to readers
   (num_ready_matched_blocks == 0) until data_ready_callback().

2. test_cpu_put_visibility_after_data_ready_callback
   End-to-end: manually drives put_cpu → fill data → data_ready_cb
   → launch_cpu, then verifies batch_get_v1 reads correct data.

3. test_batch_set_v1_makes_blocks_ready
   Verifies batch_set_v1 (which calls data_ready_cb internally)
   leaves blocks in a fully ready state in the radix tree.

4. test_concurrent_get_during_put_sees_no_partial_data
   Simulates a reader arriving while writer is mid-fill; reader
   must see nothing.  After fill completes, reader sees correct data.

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Add data_ready_callback step to PUT data flow diagram
- Update test count from 22 to 26 (4 new concurrency tests)
- Add verification checklist item for deferred block visibility

Signed-off-by: staryxchen <staryxchen@tencent.com>
…access

SGLang's HiCacheController calls batch_set_v1, batch_get_v1, and
batch_exists from 3 independent threads (backup_thread, prefetch_thread,
prefetch_io_aux) on the same storage backend instance without
synchronization.

FlexKV's radix tree, mempool, and CPU cache tensor are not thread-safe,
so concurrent access can cause data corruption when batch_set_v1
triggers LRU eviction that recycles blocks being read by batch_get_v1.

Changes:
- Add threading.Lock (_engine_lock) to serialise cache-engine operations
- Remove .clone() in batch_get_v1 (no longer needed under lock)
- Update _get_block_shaped docstring to document lock requirement

The lock is a simple mutex — sufficient because SGLang uses only one
backup_thread (writer) and the prefetch threads are effectively
serialised.  A future C++ layer change (std::shared_mutex in
CRadixTreeIndex) can provide finer-grained concurrency.
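
A minimal sketch of the serialisation strategy, assuming a dict in place of the cache engine. The method names mirror the adapter's, but the signatures and return values here are simplified and are not SGLang's HiCacheStorage interface.

```python
import threading
from typing import Dict, List, Optional, Sequence


class SerialisedStore:
    """Toy store in which every operation takes the same mutex."""

    def __init__(self) -> None:
        self._engine_lock = threading.Lock()
        self._blocks: Dict[str, List[int]] = {}

    def batch_set_v1(self, items: Dict[str, int]) -> None:
        with self._engine_lock:
            for key, value in items.items():
                block = self._blocks.setdefault(key, [0, 0])
                # Two-step write: without the lock a reader could observe
                # the first half updated and the second half stale.
                block[0] = value
                block[1] = value

    def batch_get_v1(self, keys: Sequence[str]) -> List[Optional[List[int]]]:
        with self._engine_lock:
            return [list(self._blocks[k]) if k in self._blocks else None
                    for k in keys]

    def batch_exists(self, keys: Sequence[str]) -> int:
        """Count consecutive hits from the start of keys (toy semantics)."""
        with self._engine_lock:
            hits = 0
            for key in keys:
                if key not in self._blocks:
                    break
                hits += 1
            return hits
```

Because all three entry points share one lock, a write can never interleave with a read of the same block; the cost is that reads also block each other, which is the trade-off the commit message accepts.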

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Add "Thread Safety" section explaining SGLang's 3-thread model
  and the adapter's _engine_lock serialisation strategy
- Update data flow diagrams to annotate lock holding
- Note that batch_get_v1 reads block views directly (no clone)
- Add thread safety item to verification checklist

Signed-off-by: staryxchen <staryxchen@tencent.com>
…ompat

Address three review issues from PR taco-project#137:

1. Add singleton guard to FlexKVHiCacheStorage._init_kv_manager() to
   prevent multiple instances from silently corrupting process-global
   env vars (FLEXKV_CPU_ONLY / FLEXKV_INSTANCE_NUM). Raises RuntimeError
   if a second instance tries to start a KVManager. The guard is released
   via shutdown() so the slot can be reused.

2. Release _engine_lock before remote P2P fetch in batch_get_v1(). The
   distributed path previously held the lock during the entire network
   round-trip (up to prefetch_timeout=5s), blocking all other SGLang
   threads. Now uses a 3-phase approach: brief lock for distributed
   discovery, lock-free remote fetch, then re-acquire for local read.

3. Replace Python 3.10+ type hints (str | None, list[str]) with
   typing.Optional[str] and typing.List[str] in patch_sglang.py to
   match setup.py's python_requires=">=3.6".
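
The 3-phase approach in item 2 can be sketched as below. `ThreePhaseGet` and its dict-backed "remote" are hypothetical stand-ins; the real adapter performs a P2P transfer in the middle phase.

```python
import threading
from typing import Dict, List, Optional, Sequence


class ThreePhaseGet:
    """Toy model: local dict plus a dict standing in for remote peers."""

    def __init__(self, local: Dict[str, bytes], remote: Dict[str, bytes]) -> None:
        self._engine_lock = threading.Lock()
        self._local = dict(local)
        self._remote = dict(remote)
        self.lock_held_during_fetch: Optional[bool] = None

    def _fetch_remote(self, keys: Sequence[str]) -> Dict[str, bytes]:
        # Runs without the engine lock; record that fact for inspection.
        self.lock_held_during_fetch = self._engine_lock.locked()
        return {k: self._remote[k] for k in keys if k in self._remote}

    def batch_get(self, keys: Sequence[str]) -> List[Optional[bytes]]:
        # Phase 1: brief lock, only to discover what is missing locally.
        with self._engine_lock:
            missing = [k for k in keys if k not in self._local]

        # Phase 2: lock-free remote fetch, so other threads are not blocked
        # for the duration of the network round-trip.
        fetched = self._fetch_remote(missing) if missing else {}

        # Phase 3: re-acquire, install fetched blocks, read everything.
        with self._engine_lock:
            self._local.update(fetched)
            return [self._local.get(k) for k in keys]


getter = ThreePhaseGet(local={"a": b"A"}, remote={"b": b"B"})
result = getter.batch_get(["a", "b", "c"])
```

Since local state can change between phases 1 and 3, the final read looks every key up again under the lock instead of trusting the phase 1 snapshot.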

Signed-off-by: staryxchen <staryxchen@tencent.com>