[Do Not Merge] - Add LMCacheConnectorV1 PD diagnostic harness #329

Open: AAbouzeid wants to merge 9 commits into ovg-project:main from AAbouzeid:pd-disagg-LMCacheConnectorV1-1

@AAbouzeid AAbouzeid commented May 12, 2026

Summary

This PR adds a diagnostic harness and evidence collector for validating vLLM LMCacheConnectorV1 in a 1-prefill / 1-decode disaggregated-prefill setup, then uses that evidence to isolate the kvcached compatibility issue and verify the simplest fix.

The validated fix is to run LMCacheConnectorV1 with kvcached's non-compound per-layer KV layout:

KVCACHED_CONTIGUOUS_LAYOUT=false

With the default compound layout, kvcached exposes per-layer KV tensors as non-contiguous layer-interleaved views that LMCache cannot normalize. With KVCACHED_CONTIGUOUS_LAYOUT=false, LMCache sees contiguous per-layer KV tensors, the prefiller store path succeeds, decoder retrieval succeeds, and the run reports non-zero LMCache hit tokens.

Added

  • experiments/12_lmcache_connector_v1_debug.sh

    • Starts a local LMCache disagg proxy, one prefiller, and one decoder.
    • Uses LMCacheConnectorV1 with LMCache PD/NIXL config.
    • Sends deterministic long prompts so LMCache chunk hits can be observed.
    • Attaches prefiller-side kv_transfer_params.disagg_spec so the prefiller has decoder receiver metadata.
    • Supports RUN_WITH_KVCACHED=0 plain baseline and RUN_WITH_KVCACHED=1 kvcached mode.
    • Supports KV_LAYOUT_DIAG=1 to log KV tensor shape, stride, storage offset, storage pointer, and contiguity.
  • experiments/collect_lmcache_connector_v1_evidence.sh

    • Packages run logs, configs, request/response JSON, package versions, system info, and layout diagnostics into an evidence directory and tarball.
  • experiments/lmcache_connector_v1_validation.md

    • Documents how to run the validation and summarizes the evidence.
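The layout fields that KV_LAYOUT_DIAG=1 is described as logging can be gathered from tensor metadata alone. The following is a minimal sketch, not the harness's actual code: `describe_kv_layout` is a hypothetical helper, and it assumes the object passed in behaves like a `torch.Tensor` (it only calls `stride()`, `storage_offset()`, `is_contiguous()`, and `untyped_storage()`).

```python
from typing import Any


def describe_kv_layout(name: str, tensor: Any) -> str:
    """Format shape, stride, storage offset, contiguity, and storage pointer.

    Hypothetical helper: `tensor` only needs the torch.Tensor metadata
    methods used below, so no tensor data is read or copied.
    """
    storage = tensor.untyped_storage()
    fields = {
        "shape": tuple(tensor.shape),
        "stride": tuple(tensor.stride()),
        "storage_offset": tensor.storage_offset(),
        "is_contiguous": tensor.is_contiguous(),
        "storage_data_ptr": storage.data_ptr(),
        "storage_nbytes": storage.nbytes(),
    }
    lines = [f"{name}:"] + [f"{key}={value}" for key, value in fields.items()]
    return "\n".join(lines)
```

With PyTorch installed, calling `describe_kv_layout("layer0", kv_cache)` on each per-layer tensor would produce blocks in the same key=value style as the diagnostics quoted in this description.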

Fix / Compatibility Mode

LMCacheConnectorV1 should not use kvcached's current compound layout, which is the default and is what KVCACHED_CONTIGUOUS_LAYOUT=true selects. It should run with:

KVCACHED_CONTIGUOUS_LAYOUT=false

Why:

  • kvcached's default compound layout optimizes VMM page mapping by backing all layers with one shared allocation.
  • That layout gives vLLM valid attention tensors, but the per-layer tensor views are still layer-interleaved at the stride/storage level.
  • LMCache's GPU connector supports KV tensors that are already contiguous or recoverable by metadata-only permutation.
  • The compound kvcached view is neither, so LMCache fails before PD/NIXL transfer can complete.
  • Non-compound layout gives LMCache independent contiguous per-layer tensors and the full LMCacheConnectorV1 path works.
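The distinction drawn above (contiguous, recoverable by a metadata-only permutation, or neither) can be illustrated from shape and stride alone. This is a sketch under stated assumptions, not LMCache's implementation: `contiguous_strides` and `classify_layout` are hypothetical names, strides are in elements, and size-1 dimensions are not special-cased.

```python
from __future__ import annotations

from typing import Sequence


def contiguous_strides(shape: Sequence[int]) -> tuple[int, ...]:
    """Row-major element strides for a dense tensor of this shape."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


def classify_layout(shape: Sequence[int], stride: Sequence[int]) -> str:
    """Return 'contiguous', 'permuted', or 'unrecoverable'."""
    if tuple(stride) == contiguous_strides(shape):
        return "contiguous"
    # Order dimensions from largest to smallest stride. A pure permutation
    # of a dense tensor is row-major contiguous under that ordering.
    order = sorted(range(len(shape)), key=lambda d: stride[d], reverse=True)
    permuted_shape = [shape[d] for d in order]
    permuted_stride = tuple(stride[d] for d in order)
    if permuted_stride == contiguous_strides(permuted_shape):
        return "permuted"
    # Gaps between elements (slicing, interleaving) cannot be undone by
    # reordering dimensions; only a copy would produce a dense tensor.
    return "unrecoverable"


plain = classify_layout((2, 75835, 16, 2, 128), (310620160, 4096, 256, 128, 1))
compound = classify_layout((2, 185088, 16, 2, 128), (4096, 229376, 256, 128, 1))
print(plain, compound)
```

Fed the shapes and strides logged in this PR, the plain vLLM tensor classifies as contiguous and the compound kvcached view as unrecoverable, which matches the ValueError that LMCache raises.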

Implementation options after this diagnostic PR:

  1. Auto-select non-compound layout when vLLM is configured with kv_connector='LMCacheConnectorV1'.
  2. Document KVCACHED_CONTIGUOUS_LAYOUT=false as required for LMCacheConnectorV1 until auto-detection is added.
  3. Add a fail-fast guard if LMCacheConnectorV1 is used with the compound layout.
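Options 1 and 3 above could share one decision function. The sketch below is hypothetical and is not kvcached code: the function name, the set of accepted "false" spellings, and the `auto_select` switch are all assumptions made for illustration. Only the environment variable name, the connector name, and the fact that the compound layout is the default come from this PR.

```python
from __future__ import annotations

import os
from typing import Mapping

LAYOUT_ENV = "KVCACHED_CONTIGUOUS_LAYOUT"
# Connectors shown by this PR's evidence to reject the compound layout.
INCOMPATIBLE_CONNECTORS = frozenset({"LMCacheConnectorV1"})
# Assumed spellings for "false"; kvcached's real parsing may differ.
_FALSE_VALUES = frozenset({"0", "false", "no", "off"})


def resolve_contiguous_layout(
    kv_connector: str | None,
    env: Mapping[str, str] | None = None,
    auto_select: bool = True,
) -> bool:
    """Return True for the compound layout, False for per-layer tensors."""
    env = os.environ if env is None else env
    raw = env.get(LAYOUT_ENV)
    requested = None if raw is None else raw.strip().lower() not in _FALSE_VALUES

    if kv_connector not in INCOMPATIBLE_CONNECTORS:
        # No known conflict: honor the request, defaulting to compound.
        return True if requested is None else requested
    if requested is False or (requested is None and auto_select):
        # Option 1: pick the non-compound layout when nothing else was asked.
        return False
    # Option 3: fail fast instead of failing later inside the KV store path.
    raise RuntimeError(
        f"{kv_connector} requires {LAYOUT_ENV}=false; "
        "the compound layout exposes non-contiguous per-layer KV views."
    )
```

Raising when the user explicitly asks for the compound layout, rather than silently overriding it, keeps the failure at startup instead of at the first prefiller request.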

Commands Used

Plain baseline:

RUN_WITH_KVCACHED=0 \
INSTALL_DEPS=0 \
GPU_MEM_UTIL=0.45 \
RUN_ID=plain_lmcache_hits_1 \
TIMEOUT_REQUEST=300 \
KEEP_ALIVE_ON_FAIL=0 \
./experiments/12_lmcache_connector_v1_debug.sh

kvcached default compound layout, expected failure:

KV_LAYOUT_DIAG=1 \
RUN_WITH_KVCACHED=1 \
INSTALL_DEPS=0 \
GPU_MEM_UTIL=0.45 \
RUN_ID=kvcached_layout_diag_1 \
TIMEOUT_REQUEST=300 \
KEEP_ALIVE_ON_FAIL=0 \
./experiments/12_lmcache_connector_v1_debug.sh

kvcached non-compound layout, verified pass:

KVCACHED_CONTIGUOUS_LAYOUT=false \
KV_LAYOUT_DIAG=1 \
RUN_WITH_KVCACHED=1 \
INSTALL_DEPS=1 \
GPU_MEM_UTIL=0.45 \
RUN_ID=kvcached_noncontig_lmcache_1 \
TIMEOUT_REQUEST=300 \
KEEP_ALIVE_ON_FAIL=0 \
./experiments/12_lmcache_connector_v1_debug.sh

Evidence collection:

./experiments/collect_lmcache_connector_v1_evidence.sh \
  experiments/logs_lmcache_v1_debug/plain_lmcache_hits_1

./experiments/collect_lmcache_connector_v1_evidence.sh \
  experiments/logs_lmcache_v1_debug/kvcached_layout_diag_1

./experiments/collect_lmcache_connector_v1_evidence.sh \
  experiments/logs_lmcache_v1_debug/kvcached_noncontig_lmcache_1

Environment

vLLM: 0.19.0
LMCache: 0.4.4
NIXL: 1.1.0
Model: Qwen/Qwen2.5-1.5B-Instruct
Observed GPUs: 2x NVIDIA H100 80GB HBM3 and 2x NVIDIA A100-SXM4-80GB across pods

Evidence Bundles

These evidence tarballs are committed under experiments/evidence/lmcache_connector_v1/. Their filenames and top-level archive directories are timestamp-free so the artifacts are stable in the repository and PR discussion.

  • experiments/evidence/lmcache_connector_v1/lmcache_connector_v1_plain_hits.tar.gz - plain vLLM baseline with LMCacheConnectorV1 completing and reporting 512-token LMCache hits.
  • experiments/evidence/lmcache_connector_v1/lmcache_connector_v1_kvcached_default_failure.tar.gz - initial kvcached default-layout failure in LMCache GPU KV store.
  • experiments/evidence/lmcache_connector_v1/lmcache_connector_v1_plain_layout_diag.tar.gz - plain vLLM layout diagnostic control showing contiguous per-layer KV tensors.
  • experiments/evidence/lmcache_connector_v1/lmcache_connector_v1_kvcached_compound_layout_failure_diag.tar.gz - kvcached default compound-layout diagnostic showing layer-interleaved non-contiguous KV views and the LMCache ValueError.
  • experiments/evidence/lmcache_connector_v1/lmcache_connector_v1_kvcached_noncompound_layout_fix_pass.tar.gz - passing fix proof with KVCACHED_CONTIGUOUS_LAYOUT=false, contiguous LMCache-visible per-layer KV tensors, successful prefiller/decoder requests, and 512-token LMCache hits.

Evidence: Plain vLLM LMCacheConnectorV1 Works

Run ID: plain_lmcache_hits_1

The plain baseline completed and reached real LMCache retrieve paths with non-zero hit tokens.

Prefiller:

Reqid: ..., Total tokens 546, LMCache hit tokens: 0, need to load: 0
[req_id=...] Stored 546 out of total 546 tokens.
Reqid: ..., Total tokens 545, LMCache hit tokens: 512, need to load: 512
[req_id=...] Retrieved 512 out of 512 required tokens
[req_id=...] Stored 545 out of total 545 tokens.

Decoder:

Reqid: ..., Total tokens 546, LMCache hit tokens: 512, need to load: 512
[req_id=...] Retrieved 512 out of 512 required tokens
Reqid: ..., Total tokens 545, LMCache hit tokens: 512, need to load: 512
[req_id=...] Retrieved 512 out of 512 required tokens
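One reading consistent with the 512-token hits on 546-token and 545-token prompts is that hits are counted in whole chunks. The sketch below assumes a chunk size of 256 tokens, which this PR does not state, and `chunk_aligned_hit_tokens` is a hypothetical helper rather than an LMCache function.

```python
def chunk_aligned_hit_tokens(cached_prefix_tokens: int, chunk_size: int = 256) -> int:
    """Largest whole-chunk token count not exceeding the cached prefix.

    chunk_size=256 is an assumption, chosen because 512 = 2 * 256.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return (cached_prefix_tokens // chunk_size) * chunk_size


for total_tokens in (546, 545):
    print(total_tokens, chunk_aligned_hit_tokens(total_tokens))
```

Under that assumption both prompt lengths round down to two full chunks, so a prompt shorter than one chunk would report zero hit tokens even if its prefix had been stored.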

Plain per-layer KV tensors are already contiguous:

shape=(2, 75835, 16, 2, 128)
stride=(310620160, 4096, 256, 128, 1)
storage_offset=0
is_contiguous=True

Evidence: kvcached Compound Layout Fails

Run ID: kvcached_layout_diag_1

The kvcached default run starts both vLLM instances, but the first prefiller request returns HTTP 500.

Proxy:

prefill_done request_id=... status=500 bytes=153

The failure is not caused by missing PD request metadata. The failing scheduler output includes:

extra_args={'kv_transfer_params': {'ret_first_tok': True, 'disagg_spec': {
  'req_id': 'lmcache-disagg-...',
  'receiver_host': 'localhost',
  'receiver_init_port': [7300],
  'receiver_alloc_port': [7400]}}}
disagg_spec=DisaggSpec(... receiver_init_port=[7300], receiver_alloc_port=[7400])
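The metadata shape in the log above can be mirrored with a small builder. This is an illustrative sketch only: `build_kv_transfer_params` is a hypothetical name, the UUID-based request ID is an assumption modeled on the `lmcache-disagg-` prefix seen in the proxy logs, and how the harness attaches the dictionary to the HTTP request is not shown in this PR.

```python
from __future__ import annotations

import json
import uuid


def build_kv_transfer_params(
    receiver_host: str,
    receiver_init_port: int,
    receiver_alloc_port: int,
    req_id: str | None = None,
) -> dict:
    """Build a dict with the same keys as the logged kv_transfer_params."""
    if req_id is None:
        # Assumed ID scheme; only the "lmcache-disagg-" prefix is from the logs.
        req_id = f"lmcache-disagg-{uuid.uuid4().hex}"
    return {
        "ret_first_tok": True,
        "disagg_spec": {
            "req_id": req_id,
            "receiver_host": receiver_host,
            # The log shows ports as single-element lists.
            "receiver_init_port": [receiver_init_port],
            "receiver_alloc_port": [receiver_alloc_port],
        },
    }


params = build_kv_transfer_params("localhost", 7300, 7400)
print(json.dumps(params, indent=2))
```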

Root stack:

lmcache_engine.store
  -> gpu_connector.batched_from_gpu
  -> initialize_kvcaches_ptr
  -> permute_kv_caches_to_contiguous
ValueError: tensor is non-contiguous for reasons other than permutation

kvcached was asked for the normal vLLM logical KV shape:

kvcache_shape=(2, 75835, 16, 2, 128)
block_size=16
num_layers=28
contiguous_layout=True

But it returned layer views into a shared compound backing allocation:

kvcached.alloc_kv_cache result[0]:
shape=(2, 185088, 16, 2, 128)
stride=(4096, 229376, 256, 128, 1)
storage_offset=0
is_contiguous=False
storage_data_ptr=34084860461056

kvcached.alloc_kv_cache result[1]:
shape=(2, 185088, 16, 2, 128)
stride=(4096, 229376, 256, 128, 1)
storage_offset=8192
is_contiguous=False
storage_data_ptr=34084860461056

kvcached.alloc_kv_cache result[27]:
shape=(2, 185088, 16, 2, 128)
stride=(4096, 229376, 256, 128, 1)
storage_offset=221184
is_contiguous=False
storage_data_ptr=34084860461056

LMCache receives the same view and rejects it:

lmcache.permute_to_contiguous input:
shape=(2, 185088, 16, 2, 128)
stride=(4096, 229376, 256, 128, 1)
storage_offset=0
is_contiguous=False

lmcache.permute_to_contiguous raised ValueError:
tensor is non-contiguous for reasons other than permutation
(e.g., slicing or as_strided). Cannot recover contiguous view.

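The key mismatch is the hidden layer interleaving. For a logical NHD tensor shaped [2, NB, 16, 2, 128], one block holds 16 * 2 * 128 = 4096 elements, so the expected contiguous block stride is 4096. The compound kvcached view has a block stride of 229376 = 4096 * 56, where 56 is 28 layers times 2 KV buffers: consecutive blocks of one layer are separated by the K and V blocks of every layer.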
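The logged strides and storage offsets follow from a simple addressing rule. The sketch below reconstructs that rule from the diagnostics in this PR; it is an inference from the numbers, not kvcached's allocator code, and `compound_element_offset` is a hypothetical name.

```python
NUM_LAYERS = 28            # num_layers from the diagnostic log
KV_BUFFERS = 2             # K and V
BLOCK_ELEMS = 16 * 2 * 128  # block_size * kv_heads * head_dim = 4096

KV_STRIDE = BLOCK_ELEMS                           # 4096
LAYER_STRIDE = KV_BUFFERS * BLOCK_ELEMS           # 8192
BLOCK_STRIDE = NUM_LAYERS * KV_BUFFERS * BLOCK_ELEMS  # 229376


def compound_element_offset(layer: int, kv: int, block: int) -> int:
    """Element offset of a (layer, K-or-V, block) slot in the shared buffer."""
    return block * BLOCK_STRIDE + layer * LAYER_STRIDE + kv * KV_STRIDE


# Storage offsets of the per-layer views are the offsets of block 0, K.
print(compound_element_offset(1, 0, 0))   # layer 1 view
print(compound_element_offset(27, 0, 0))  # layer 27 view
print(compound_element_offset(0, 0, 1))   # second block of layer 0
```

Every (layer, kv) pair occupies a distinct 4096-element slot inside one 229376-element stripe, so a single layer's blocks are never adjacent in memory and no reordering of dimensions can make the view dense.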

Evidence: Non-Compound Layout Fix Works

Run ID: kvcached_noncontig_lmcache_1

Evidence bundle:

experiments/evidence/lmcache_connector_v1/lmcache_connector_v1_kvcached_noncompound_layout_fix_pass.tar.gz

The same kvcached + LMCacheConnectorV1 run passes when launched with KVCACHED_CONTIGUOUS_LAYOUT=false.

Harness result:

[INFO] Classifier: decoder reached LMCache retrieve path.
[INFO] Classifier: decoder reported non-zero LMCache hit tokens.
[PASS] LMCacheConnectorV1 debug harness completed
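The two classifier messages above correspond to checks that can be made on the decoder log text. The following is a minimal sketch of such checks, assuming the log line formats quoted in this description; it is not the harness's actual classifier, and `classify_decoder_log` is a hypothetical name.

```python
import re

# Patterns copied from the LMCache log lines quoted in this PR.
HIT_RE = re.compile(r"LMCache hit tokens: (\d+)")
RETRIEVED_RE = re.compile(r"Retrieved (\d+) out of (\d+) required tokens")


def classify_decoder_log(text: str) -> dict:
    """Summarize whether the retrieve path ran and whether any hits occurred."""
    hits = [int(value) for value in HIT_RE.findall(text)]
    retrieved = RETRIEVED_RE.findall(text)
    return {
        "reached_retrieve_path": bool(retrieved),
        "nonzero_hit_tokens": any(value > 0 for value in hits),
        "max_hit_tokens": max(hits, default=0),
    }


sample = """\
Reqid: ..., Total tokens 546, LMCache hit tokens: 512, need to load: 512
[req_id=...] Retrieved 512 out of 512 required tokens
"""
print(classify_decoder_log(sample))
```

Matching on hit-token counts rather than on error lines avoids being misled by log noise that appears in both passing and failing runs.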

Requests completed through the proxy:

prefill_done request_id=lmcache-disagg-1b5bf6882faa4f3b87421e0fb27b40e3 status=200 bytes=482
decode_done request_id=lmcache-disagg-1b5bf6882faa4f3b87421e0fb27b40e3 status=200 bytes=568
prefill_done request_id=lmcache-disagg-6abf3ca4395241549d9fbd2cdaa6aa23 status=200 bytes=479
decode_done request_id=lmcache-disagg-6abf3ca4395241549d9fbd2cdaa6aa23 status=200 bytes=572

The prefiller stores the first long prompt and hits on the shared prefix for the second:

Reqid: ..., Total tokens 546, LMCache hit tokens: 0, need to load: 0
[req_id=...] Stored 546 out of total 546 tokens.
Reqid: ..., Total tokens 545, LMCache hit tokens: 512, need to load: 512
[req_id=...] Retrieved 512 out of 512 required tokens
[req_id=...] Stored 545 out of total 545 tokens.

The decoder retrieves from LMCache for both requests:

Reqid: ..., Total tokens 546, LMCache hit tokens: 512, need to load: 512
[req_id=...] Retrieved 512 out of 512 required tokens
Reqid: ..., Total tokens 545, LMCache hit tokens: 512, need to load: 512
[req_id=...] Retrieved 512 out of 512 required tokens

Most importantly, LMCache now sees contiguous per-layer tensors:

lmcache.permute_kv_caches_to_contiguous input[0]:
shape=(2, 51456, 16, 2, 128)
stride=(210763776, 4096, 256, 128, 1)
storage_offset=0
is_contiguous=True
storage_data_ptr=34084860461056
storage_nbytes=843055104

lmcache.permute_kv_caches_to_contiguous input[1]:
shape=(2, 51456, 16, 2, 128)
stride=(210763776, 4096, 256, 128, 1)
storage_offset=0
is_contiguous=True
storage_data_ptr=34085732876288

lmcache.permute_kv_caches_to_contiguous input[27]:
shape=(2, 51456, 16, 2, 128)
stride=(210763776, 4096, 256, 128, 1)
storage_offset=0
is_contiguous=True
storage_data_ptr=34108415672320

This is the proof of the fix: the layout incompatibility disappears, LMCache's store/retrieve paths execute, and the run reports real 512-token LMCache hits.
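The non-compound diagnostics can be cross-checked arithmetically. The sketch below assumes a 2-byte element type (for example bfloat16 or float16), which this PR does not state but which matches the logged storage_nbytes; the helper name `dense_nbytes` is hypothetical.

```python
import math

SHAPE = (2, 51456, 16, 2, 128)
ITEMSIZE = 2  # assumed bytes per element; not stated in the PR

# storage_data_ptr values logged for layers 0, 1, and 27.
DATA_PTRS = {0: 34084860461056, 1: 34085732876288, 27: 34108415672320}


def dense_nbytes(shape, itemsize):
    """Bytes needed to hold a dense tensor of this shape."""
    return math.prod(shape) * itemsize


layer_nbytes = dense_nbytes(SHAPE, ITEMSIZE)
spacing = DATA_PTRS[1] - DATA_PTRS[0]

print(layer_nbytes)             # compare with the logged storage_nbytes
print(spacing >= layer_nbytes)  # layer 0 ends before layer 1 begins
print(DATA_PTRS[27] - DATA_PTRS[0] == 27 * spacing)
```

Under that assumption each layer's storage is exactly the size of one dense tensor, and the three logged base pointers sit at a uniform spacing larger than that size, so the logged layers do not overlap. The uniform spacing is only confirmed for the three layers that appear in the log.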

Notes

The repeated LMCache log line below appears in both failing and passing runs and is not the root failure:

LMCache ERROR: PrometheusLogger instance already created with different metadata.

The actual failing condition was the compound-layout tensor stride mismatch. That condition is absent in the KVCACHED_CONTIGUOUS_LAYOUT=false passing run.

@AAbouzeid AAbouzeid changed the title Add LMCacheConnectorV1 PD diagnostic harness [Do Not Merge] - Add LMCacheConnectorV1 PD diagnostic harness May 12, 2026