
Commit 22bc183

Author: phaedonsun
feat(dist_reuse): KV cache sharing across TP/PP/CP + single-node radix match
Squashes 5 commits (20a65d5 + d97db4d + 72632ec + b9230c6 + 7cd04ad) into a
single landed feature. This is the full dist_reuse stack on top of PR #165
(RankInfo refactor), validated end-to-end on a 2-machine GPU setup
(gpu-146.56.224.46 master / gpu-129.211.162.213 peer):

  S1 (single-node TP=1)  cached_ratio 99.65%  PASS
  S2 (single-node TP=2)  cached_ratio 99.65%  PASS
  S3 (cross-node TP=1)   master cold->warm 99.63%, peer crosshit 99.63%
                         storage=272 backend= FlexKVConnector
                         (PEERH2H @ 6.22 GB/s via mooncake/RDMA)  PASS x3
  357/357 unit tests on both nodes  PASS

== Original commits (in chronological order) ==

[20a65d5] feat(dist_reuse): KV cache sharing across TP/PP/CP + single-node
radix match

Initial dist_reuse stack: master coordinator, sharing-domain key, aggregate
radix, redis-meta namespace, multi-node policy, P2P transfer types
(PEERH2H/H2PEERH/PEERSSD2H/H2PEERSSD), failure detector, four S{1..4}
sglang+FlexKV e2e benchmark scripts.

[d97db4d] fix(dist_reuse): unblock cross-instance KV cache sharing on
s3_cross_node_tp1

Three runtime bugs blocked the s3 (master prime / peer crosshit) flow:

1) GPUCPUTransferWorker._transfer_impl had positional-arg drift on the
   transfer_kv_blocks pybind: C++ added 'start_layer_id' between
   'chunk_size_in_bytes' and 'num_layers' (transfer.cu 2025-07-10), which
   silently mapped is_h2d=False onto transfer_num_cta and launched D2H
   kernels with gridDim(0) -> cudaErrorInvalidConfiguration on every D2H.
   Fix: bind every value to the C++ pybind name with kwargs and add
   'start_layer_id=0' explicitly.

2) GlobalCacheEngine._maybe_attach_multi_sd_peerh2h_ops carried a dead
   'layer_num' parameter which the only caller in _get_impl_local passed
   undefined -> NameError on first cross-instance reuse hit.
   Fix: drop the dead parameter and 6 call sites in
   tests/test_d3_filter_and_get_clones.py.

3) merge_to_batch_graph raised NotImplementedError on PEERH2H / H2PEERH /
   PEERSSD2H / H2PEERSSD as soon as a real cross-instance hit produced a
   P2P op.
   Fix: whitelist the four types as P2P passthrough (preserves per-block
   src_block_node_ids and per-op target_node_ids from D-3 multi-SD
   broadcast clones), wire dependencies on merged_h2d_op (GET) /
   merged_d2h_op (PUT).

[72632ec] fix(memory_handle): propagate _import_tensor_handle exceptions

Previously _import_tensor_handle logged the error and returned
torch.empty(0) on import failure, which silently dropped the wrapper into a
0-element tensor and surfaced as an unrelated IndexError later in
worker.py::_get_layer_ptrs (layer_blocks[lay_id][0] out of range). Now
always re-raise, keeping the original traceback so cross-node CUDA IPC
handle device-id mismatches surface at their source. Consistent with
_import_cuda_ipc_handle which never swallowed.

[b9230c6] fix(config): move tp_node_idx from ModelConfig to RankInfo

PR #165 removed tp_rank from ModelConfig but ModelConfig.tp_node_idx still
referenced self.tp_rank, raising AttributeError. Two pre-existing
test_cache_config_batch_b.py cases failed because of this.
Fix: remove ModelConfig.tp_node_idx (replaced with a migration comment);
add RankInfo.tp_node_idx (tp_rank // tp_size_per_node) to complement
RankInfo.tp_rank_per_node (tp_rank % tp_size_per_node); update the two
TP-node-count tests to construct a RankInfo for tp_node_idx assertions.

[7cd04ad] docs(monitoring): document the new flexkv_py_dist_reuse_* metrics

Added user-facing documentation for the 5 cross-instance reuse metrics in
docs/monitoring/README_{en,zh}.md (kept in sync):

* §2.3 'Cross-instance Reuse Metrics' table with type, labels, severity
  and KNOWN_ISSUE-derived alert thresholds.
* 'Instrumentation status' subtable that flags the two metrics
  (lease_meta_nullptr_total / about_to_evict_total) whose Python collector
  hooks are ready but whose C++ master-side trigger has not yet landed,
  with a callout that '0' on these two does NOT mean 'system healthy'.
* §1.1 environment variable table now documents PROMETHEUS_MULTIPROC_DIR
  (the sample dir used by prometheus_client across sglang TP/PP workers,
  KVManager subprocess and transfer workers).
* §3.5 'Multiprocess Scrape Notes' explaining the MultiProcessCollector
  aggregation path and the recommended /dev/shm/flexkv_prom tmpfs override.
* §3.6 'Recommended PromQL alerts' section with 4 ready-to-paste
  Prometheus alert rules:
  - FlexKVDistReuseLeaseMetaNullptr (critical, any positive)
  - FlexKVDistReusePeerReadFailureRate (critical, > 0.1%)
  - FlexKVDistReusePeerReadP99High (warning, > 500ms)
  - FlexKVDistReuseEvictPressure (warning, ratio > 10)
* The /metrics curl verification snippet now also greps
  flexkv_py_dist_reuse_.
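
The positional-argument drift described under [d97db4d] item 1 can be illustrated with a minimal Python sketch. The function `transfer_new` and its parameter list are hypothetical stand-ins built only from the names quoted in the commit message; they are not the real `transfer_kv_blocks` pybind signature.

```python
# Hypothetical stand-in for a binding AFTER 'start_layer_id' was inserted
# between 'chunk_size_in_bytes' and 'num_layers'. Not the real signature.
def transfer_new(chunk_size_in_bytes, start_layer_id, num_layers,
                 transfer_num_cta, is_h2d=True):
    return {
        "chunk_size_in_bytes": chunk_size_in_bytes,
        "start_layer_id": start_layer_id,
        "num_layers": num_layers,
        "transfer_num_cta": transfer_num_cta,
        "is_h2d": is_h2d,
    }

# A caller still written against the OLD positional order:
# (chunk_size_in_bytes, num_layers, transfer_num_cta, is_h2d)
stale_args = (4096, 32, 4, False)

# Every value after the first shifts one slot to the right, so is_h2d=False
# lands in transfer_num_cta, and False compares equal to 0.
drifted = transfer_new(*stale_args)

# Binding by keyword makes the call immune to parameter reordering.
fixed = transfer_new(
    chunk_size_in_bytes=4096,
    start_layer_id=0,
    num_layers=32,
    transfer_num_cta=4,
    is_h2d=False,
)

print("drifted:", drifted)
print("fixed:  ", fixed)
```

In this sketch the drifted call raises no error at the Python level, which is why the commit describes the mapping as silent; the failure only appears later when the zero-sized launch configuration is used.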
1 parent 00cc828 commit 22bc183

67 files changed

Lines changed: 15393 additions & 473 deletions

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
# FlexKV Two-Node Direct-Mode e2e: How to Run

This harness validates **distributed KV cache sharing** across two physical
hosts (146 ↔ 129) **without touching sglang**: it drives FlexKV directly
via ``KVManager`` + ``KVTPClient`` inside ``benchmark_dist_direct.py``.

## Why direct mode (not ``benchmark_dist_kvcache.py``)

* No ``KVServer`` subprocess → simpler lifecycle, no residual IPC socket at
  ``/tmp/flexkv_server`` to clean up.
* Uses the same ``get_match`` / ``put_async`` main path that §2.1 wires
  through to ``_sharing_domain_gate_get`` and ``_notify_sd_ready_on_put``,
  so any breakage here is a real breakage.
* GPU footprint is tiny (~40 MB / GPU), so it is safe to run alongside a
  live sglang server that already owns most of the memory.

## Conflict isolation with the running sglang process

| Resource | sglang (GLM-5-FP8) | this benchmark |
|---|---|---|
| Mooncake engine TCP port | **5555** (sglang Transfer) | **5556 on 146** / **5557 on 129** |
| Redis logical DB | 0 (mooncake keys ``mooncake/*``) | 0 (mooncake) + **DB 1 for flexkv keys** |
| Redis key prefixes (DB 1) | | ``sd:*``, ``instance:*``, ``node:*`` |
| GPU VRAM | most of it | one GPU, ~40 MB |
| IPC sockets | ``/tmp/flexkv_server`` | **none** (direct mode) |

So the only shared resources are the physical Redis server (different DB)
and the RDMA NICs (different QP). Neither overlaps in state.

## Quick checklist before launching

1. **Redis reachable + password OK**:
   ```bash
   redis-cli -h 10.206.0.9 -p 6379 -a 123456 PING   # → PONG
   ```
2. **DB 1 is clean (first time only)**:
   ```bash
   redis-cli -h 10.206.0.9 -p 6379 -a 123456 -n 1 DBSIZE   # → (integer) 0
   # If non-zero and you know it's our leftover state, wipe it:
   #   redis-cli -h 10.206.0.9 -p 6379 -a 123456 -n 1 FLUSHDB
   # Do NOT touch DB 0: mooncake + sglang live there.
   ```
3. **Mooncake ports 5556 / 5557 free** on the respective hosts:
   ```bash
   ss -ltn | grep -E ':5556|:5557'   # should be empty
   ```
4. **FlexKV built with FLEXKV_ENABLE_P2P=1** on both hosts:
   ```bash
   python3 -c 'from flexkv.cache.redis_meta import dist_available; print(dist_available())'
   # → True
   ```
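
The port check in item 3 can also be scripted. The following is a minimal sketch, assuming that a plain bind attempt is an acceptable stand-in for `ss -ltn`; the helper name `port_is_free` is hypothetical and not part of FlexKV or Mooncake.

```python
import socket

def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            # SO_REUSEADDR is deliberately left unset so that an existing
            # listener on the port makes the bind fail.
            s.bind((host, port))
        except OSError:
            return False
        return True

if __name__ == "__main__":
    # Ports taken from the checklist: 5556 on host 146, 5557 on host 129.
    for port in (5556, 5557):
        print(port, "free" if port_is_free(port) else "IN USE")
```

The check is conservative: it reports a port as busy whenever the bind fails for any reason, so a failure is a prompt to inspect the host with `ss`, not proof that another Mooncake engine is running.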

## Run

### Step A: on host **146** (10.206.0.9), start **PUT-only**

```bash
cd /data1/phaedonsun/flexkv/FlexKV

export PYTHONPATH=/data1/phaedonsun/flexkv/FlexKV
export LD_LIBRARY_PATH=/data1/phaedonsun/flexkv/FlexKV/build/lib
export CUDA_VISIBLE_DEVICES=0   # pick any single free GPU

python3 benchmarks/dist_benchmark/benchmark_dist_direct.py \
    --config benchmarks/dist_benchmark/twonode_direct_146.yml \
    --mode put-only \
    --batch-size 1 --sequence-length 256 \
    --seed 42 \
    --rebuild-interval-ms 20
```

Expected final line before the process idles:
```
Data published to Redis. Press Enter to shutdown (keep running for other nodes to GET)...
```

### Step B: on host **129** (10.206.0.13), start **GET-only with the same seed**

```bash
cd /data1/phaedonsun/flexkv/FlexKV

export PYTHONPATH=/data1/phaedonsun/flexkv/FlexKV
export LD_LIBRARY_PATH=/data1/phaedonsun/flexkv/FlexKV/build/lib
export CUDA_VISIBLE_DEVICES=0   # different physical GPU, same index is fine

python3 benchmarks/dist_benchmark/benchmark_dist_direct.py \
    --config benchmarks/dist_benchmark/twonode_direct_129.yml \
    --mode get-only \
    --batch-size 1 --sequence-length 256 \
    --seed 42 \
    --rebuild-interval-ms 20
```

## Success criteria

In the 129 log, look for:

```
--- GET Phase ---
GET: 256/256 tokens, data_size: 0.000 GB, cache_ratio: 100.00% ...
```

A non-zero ``cache_ratio`` means the 129 instance:
1. Found the 146 instance via the shared Redis (``instance:*`` discovery)
2. Resolved the 146 peer SD's ``node_id`` from the aggregate radix
3. Issued a Mooncake RDMA read against 146's mooncake engine @ 5556
4. Received KV data that matches byte-for-byte what 146 PUT
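
The pass/fail decision above can be automated by parsing the GET summary line. This is a minimal sketch, assuming the log keeps the `cache_ratio: NN.NN%` shape shown in the sample; `parse_cache_ratio` is a hypothetical helper, not something the benchmark ships.

```python
import re
from typing import Optional

# Matches e.g. "cache_ratio: 100.00%" and captures the numeric part.
_CACHE_RATIO = re.compile(r"cache_ratio:\s*([0-9]+(?:\.[0-9]+)?)%")

def parse_cache_ratio(line: str) -> Optional[float]:
    """Return the cache ratio in percent, or None if the line has none."""
    match = _CACHE_RATIO.search(line)
    return float(match.group(1)) if match else None

sample = "GET: 256/256 tokens, data_size: 0.000 GB, cache_ratio: 100.00% ..."
ratio = parse_cache_ratio(sample)

# Any non-zero ratio counts as a cross-node hit per the criteria above.
print("PASS" if ratio is not None and ratio > 0 else "FAIL", ratio)
```

Returning `None` for lines without a ratio keeps "no GET line found" distinguishable from a genuine `0.00%` miss.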

If ``cache_ratio: 0.00%``:
* Check the 129 log for ``[DistReuse]`` lines to see whether the §2.1 gate
  ruled the match out.
* Check ``KEYS sd:*`` in Redis DB 1: the 146 side should have published
  ``sd:<…>:block:<nid>:<hash>`` keys.
* Check Mooncake connectivity by running the ``transfer_engine_bench``
  binary between 146:5556 and 129:5557.

## Teardown

* 146: press Ctrl-C in the PUT-only terminal (the Ctrl-C handler calls
  ``kvmanager.shutdown()``, which releases Mooncake + Redis state).
* 129: the GET-only run exits on its own; its ``atexit`` hook tears
  KVManager down.
* Optionally wipe Redis DB 1 between runs:
  ```bash
  redis-cli -h 10.206.0.9 -p 6379 -a 123456 -n 1 FLUSHDB
  ```

benchmarks/dist_benchmark/benchmark_dist_direct.py

Lines changed: 8 additions & 0 deletions
@@ -132,6 +132,14 @@ def load_dist_direct_config(config_path: str):
         user_config.local_ip = config["local_ip"]
     if "redis_password" in config:
         user_config.redis_password = config["redis_password"]
+    # Optional: pick a non-default Redis logical DB so FlexKV keys don't collide
+    # with other tenants (e.g. Mooncake meta, or another running FlexKV /
+    # sglang instance on the same physical Redis). The flexkv keys (``sd:*``,
+    # ``instance:*``, ``node:*`` …) all live in the selected DB; the mooncake
+    # backend continues to use whatever DB its ``metadata_server`` URL
+    # implies (default 0).
+    if "flexkv_redis_db" in config:
+        user_config.flexkv_redis_db = int(config["flexkv_redis_db"])
 
     # Auto-generate mooncake config JSON and set MOONCAKE_CONFIG_PATH if P2P is enabled
     if config.get("enable_p2p_cpu", False) or config.get("enable_p2p_ssd", False):
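
The optional-key handling added in this hunk can be exercised in isolation. This is a minimal sketch, assuming a plain dict stands in for the parsed YAML and a `SimpleNamespace` stands in for the real `user_config` object; `apply_redis_db` is a hypothetical helper, and the starting value of 0 is an assumption, since the real default lives in FlexKV's config class.

```python
from types import SimpleNamespace

def apply_redis_db(user_config, config: dict) -> None:
    """Copy the optional 'flexkv_redis_db' key onto user_config as an int."""
    if "flexkv_redis_db" in config:
        user_config.flexkv_redis_db = int(config["flexkv_redis_db"])

# Key absent: the pre-existing value (assumed 0 here) is left untouched.
cfg_default = SimpleNamespace(flexkv_redis_db=0)
apply_redis_db(cfg_default, {"local_ip": "10.206.0.13"})

# Key present: the value is coerced with int(), so a quoted "1" in the
# YAML and a bare 1 both end up as the integer 1.
cfg_db1 = SimpleNamespace(flexkv_redis_db=0)
apply_redis_db(cfg_db1, {"flexkv_redis_db": "1"})

print(cfg_default.flexkv_redis_db, cfg_db1.flexkv_redis_db)
```

A non-numeric value would raise `ValueError` from `int()` at config-load time, which surfaces a typo in the YAML early.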
