[None][feat] Support request-scoped capacity-only KV cache compaction#15697

Open
Hudayday wants to merge 2 commits into NVIDIA:main from Hudayday:kvcache-v2-nongrowing-reclaim

Conversation

@Hudayday

@Hudayday Hudayday commented Jun 28, 2026

Collaborator

Description

KV-cache compression can physically compact a request's live KV into a
smaller dense prefix. During generation, the V2 runtime adapter normally
passes the request's monotonically growing logical token count as
history_length. That behavior is correct for ordinary full attention, but it
causes a compacted request to grow back toward its logical length and prevents
the reclaimed pages from remaining available to the KV pool.

The V2 core already supports the required operation: shrink capacity while
preserving the existing committed history with
resize(capacity, history_length=None).
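
To make the difference between the two call shapes concrete, here is a minimal toy model. `ToyKVCache` is a hypothetical stand-in, not the V2 core class, and the rule that an explicit history pulls capacity up to cover it is an assumption made only to mirror the "grows back" behavior described above.

```python
from typing import Optional


class ToyKVCache:
    """Hypothetical stand-in for a V2 KV cache; not the real TensorRT-LLM class."""

    def __init__(self, capacity: int, history_length: int) -> None:
        self.capacity = capacity              # tokens of physical space reserved
        self.history_length = history_length  # tokens committed as history

    def resize(self, capacity: Optional[int], history_length: Optional[int] = None) -> bool:
        """Resize the cache; a None argument leaves that value unchanged."""
        new_capacity = self.capacity if capacity is None else capacity
        new_history = self.history_length if history_length is None else history_length
        if new_capacity < 0 or new_history < 0:
            return False
        if history_length is not None:
            # Toy assumption: an explicit history forces capacity to cover it.
            new_capacity = max(new_capacity, new_history)
        self.capacity, self.history_length = new_capacity, new_history
        return True


cache = ToyKVCache(capacity=4096, history_length=1000)

# Capacity-only shrink: the committed history is left as it was.
cache.resize(1024, history_length=None)
print(cache.capacity, cache.history_length)  # 1024 1000

# Passing the logical token count grows the request back.
cache.resize(1024, history_length=4000)
print(cache.capacity, cache.history_length)  # 4000 4000
```

The second call illustrates why the adapter needs a way to stop passing the logical token count for compacted requests.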

This PR adds only the request-scoped runtime-adapter plumbing needed by
KV-cache compression. It does not change the V2 core and does not add a
fork/rewind or public manager API.

Changes

  • Preserve the existing path unless a request explicitly sets
    py_kv_cache_generation_capacity_only=True.
  • For an opted-in generation request, pass history_length=None, preserving
    the core's current committed history while allowing physical capacity to
    shrink.
  • Consume an optional
    (target_capacity, published_capacity, event) compaction marker.
  • Wait for the producer CUDA event before reclaimed pages can be reused.
  • Preserve capacity growth that occurred after marker publication and apply
    the current rewind, so the resized capacity is
    target_capacity + (live_capacity - published_capacity) - rewind.
  • Clear the compaction marker only after resize() succeeds so a failed resize
    can be retried.
  • Add focused tests and register them in the A10 pre-merge test list.
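
The marker handling in the list above can be sketched as follows. This is an illustration under stated assumptions, not the adapter code from this PR: `CompactionMarker`, `reclaimed_capacity`, `consume_marker`, and the `state` dict are hypothetical names, and a plain callable stands in for waiting on the producer CUDA event (in a real adapter that could be something like `torch.cuda.Event.synchronize()`).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CompactionMarker:
    """Hypothetical container for the (target, published, event) marker."""
    target_capacity: int
    published_capacity: int
    wait_event: Callable[[], None]  # stand-in for a producer CUDA event wait


def reclaimed_capacity(marker: CompactionMarker, live_capacity: int, rewind: int) -> int:
    """target + (live_capacity - published_capacity) - rewind."""
    growth_since_publish = live_capacity - marker.published_capacity
    return marker.target_capacity + growth_since_publish - rewind


def consume_marker(
    state: dict,
    resize: Callable[[Optional[int], Optional[int]], bool],
    live_capacity: int,
    rewind: int,
) -> bool:
    """Apply a pending marker; keep it in place if the resize fails."""
    marker = state.get("marker")
    if marker is None:
        return True  # nothing to consume
    # Reclaimed pages must not be reused before the producer has finished.
    marker.wait_event()
    new_capacity = reclaimed_capacity(marker, live_capacity, rewind)
    # history_length=None keeps the committed history untouched.
    if not resize(new_capacity, None):
        return False  # marker stays, so the next step can retry
    state["marker"] = None  # clear only after a successful resize
    return True


marker = CompactionMarker(target_capacity=1024, published_capacity=4096,
                          wait_event=lambda: None)
# Four tokens were appended after publication and one is rewound.
print(reclaimed_capacity(marker, live_capacity=4100, rewind=1))  # 1027
```

Clearing the marker last is what makes a failed `resize()` retryable: the next call sees the same marker and recomputes the capacity from the then-current live capacity.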

Compatibility

Requests that do not explicitly opt in call resize() with the same capacity
and history arguments as before. The opt-in check is request scoped and
fail-closed.
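
One way to read "request scoped and fail-closed" is sketched below, assuming the flag only counts when it is literally `True` on the request object. The helper names `is_capacity_only` and `history_argument` are hypothetical; only the attribute name comes from this PR.

```python
from types import SimpleNamespace
from typing import Optional


def is_capacity_only(request: object) -> bool:
    """Opt in only when the flag is literally True; anything else fails closed."""
    return getattr(request, "py_kv_cache_generation_capacity_only", False) is True


def history_argument(request: object, logical_tokens: int) -> Optional[int]:
    """Pick the history_length to pass to resize() for this request."""
    return None if is_capacity_only(request) else logical_tokens


plain = SimpleNamespace()  # flag never set: unchanged path
truthy = SimpleNamespace(py_kv_cache_generation_capacity_only=1)  # not a real opt-in
opted_in = SimpleNamespace(py_kv_cache_generation_capacity_only=True)

print(history_argument(plain, 100))     # 100
print(history_argument(truthy, 100))    # 100
print(history_argument(opted_in, 100))  # None
```

A missing attribute, `None`, or a merely truthy value all fall back to the existing behavior in this sketch, which is what keeps non-opted-in requests on the old path.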

There is no public API change, no V2 core change, and no manager-level
compression state. The compression implementation owns publishing the
request marker and clearing its generation capacity-only flag at request
completion.

Validation

  • python3 -m compileall: passed
  • git diff --check: passed
  • ruff check: passed
  • ruff format --check: passed
  • Focused V2 unit test: 5 passed
  • Integrated TriAttention/V2 focused suite: 71 passed
  • GPT-OSS 20B/120B dense and union smoke: 4/4 passed

@Hudayday Hudayday requested a review from a team as a code owner June 28, 2026 16:33
@Hudayday Hudayday requested a review from lowsfer June 28, 2026 16:33
@Hudayday

Collaborator Author

/bot run --disable-fail-fast

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Contributor

Walkthrough

KVCacheManagerV2.update_resources gains a compression reclaim branch: when py_kv_evicted_tokens > 0 and the request is not completing, it calls kv_cache.fork(max_beam - evicted), falling back to resize(None, max_beam - evicted) on exception. A new unit test file covers all five dispatch scenarios and is added to the A10 pre-merge test list.

Changes

KV-cache compression reclaim

| Layer / File(s) | Summary |
|---|---|
| Compression reclaim branch in update_resources: tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py | Adds completing flag and py_kv_evicted_tokens check; routes evicted non-completing requests through kv_cache.fork(max_beam - evicted) with exception-triggered fallback to resize(None, max_beam - evicted). Non-evicted path retains prior new_capacity assignment logic. |
| Unit tests and CI registration: tests/unittest/_torch/pyexecutor/test_kv_cache_v2_compression_reclaim.py, tests/integration/test_lists/test-db/l0_a10.yml | New test module with _fake_manager and mock helpers covers five branches: fork-on-eviction, original resize on unevicted, None-capacity on completing, fork-failure fallback, and inactive-cache skip. File added to A10 pre-merge list. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

api-compatible

Suggested reviewers

  • byshiue
  • heyuhhh
  • lfr-0531
  • Superjomn
  • PerkzZheng
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 54.55% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly matches the main change: request-scoped KV cache compaction and capacity-only handling. |
| Description check | ✅ Passed | The PR description is detailed and relevant, with clear problem, changes, compatibility, and validation sections, though it omits the template's Test Coverage section. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py`:
- Around line 2497-2502: The no-free fallback in
kv_cache_manager_v2._KVCacheManagerV2 should not ignore the boolean result from
kv_cache.resize(None, req.max_beam_num_tokens - evicted). After the warning in
the fallback branch, check the return value just like the normal resize path and
treat a failed resize as fatal or otherwise handle it consistently so the
request state stays aligned with the live _KVCache state.

In `@tests/unittest/_torch/pyexecutor/test_kv_cache_v2_compression_reclaim.py`:
- Around line 77-83: The current test coverage only exercises completion with
evicted=0, so the combined evicted-and-completing path in the reclaim logic is
still untested. Add a test in test_kv_cache_v2_compression_reclaim.py using
_fake_manager, _run, and _req with evicted > 0 and a completing state such as
LlmRequestState.GENERATION_COMPLETE or CONTEXT_INIT, and assert that the request
still calls resize(None, max_beam - 1) while fork() is not called. This should
verify the guard in the kv cache reclaim behavior when both conditions are
present.
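
Both findings can be illustrated with a small mock-based sketch. It assumes the branch shape given in the walkthrough for the reviewed revision; `reclaim_evicted` is a hypothetical reduction of that logic rather than the code in `kv_cache_manager_v2.py`, and raising on a failed resize is only one of the options the first comment allows.

```python
from unittest.mock import MagicMock


def reclaim_evicted(kv_cache, max_beam: int, evicted: int, completing: bool) -> None:
    """Hypothetical reduction of the reclaim branch for a request with evicted > 0."""
    if completing:
        # Guard: a completing request never forks, even with evicted tokens.
        ok = kv_cache.resize(None, max_beam - 1)
    else:
        try:
            kv_cache.fork(max_beam - evicted)
            return
        except Exception:
            # Fallback path: its boolean result must not be ignored.
            ok = kv_cache.resize(None, max_beam - evicted)
    if not ok:
        raise RuntimeError("KV cache resize failed; request state would diverge")


# Evicted and completing: resize(None, max_beam - 1) is called, fork() is not.
cache = MagicMock()
cache.resize.return_value = True
reclaim_evicted(cache, max_beam=10, evicted=3, completing=True)
cache.resize.assert_called_once_with(None, 9)
cache.fork.assert_not_called()
```

The same helper can be driven with `cache.fork.side_effect` set to an exception and `cache.resize.return_value = False` to confirm that a failed fallback resize is surfaced instead of silently dropped.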

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f47017d9-645a-4086-ab61-1067b72e308b

📥 Commits

Reviewing files that changed from the base of the PR and between 5ec0c84 and c1c180c.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/unittest/_torch/pyexecutor/test_kv_cache_v2_compression_reclaim.py

Comment thread tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py Outdated
Comment thread tests/unittest/_torch/pyexecutor/test_kv_cache_v2_compression_reclaim.py Outdated
@tensorrt-cicd

Collaborator

PR_Github #56237 [ run ] triggered by Bot. Commit: c1c180c Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56237 [ run ] completed with state SUCCESS. Commit: c1c180c
/LLM/main/L0_MergeRequest_PR pipeline #45098 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Hudayday

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #56247 [ run ] triggered by Bot. Commit: c1c180c Link to invocation

@Hudayday Hudayday marked this pull request as draft June 29, 2026 02:32
@tensorrt-cicd

Collaborator

PR_Github #56247 [ run ] completed with state SUCCESS. Commit: c1c180c
/LLM/main/L0_MergeRequest_PR pipeline #45107 completed with status: 'SUCCESS'

CI Report

Link to invocation

@Hudayday Hudayday force-pushed the kvcache-v2-nongrowing-reclaim branch 2 times, most recently from 5cdbbf3 to 095f9ff Compare June 29, 2026 03:32
@Hudayday

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #56292 [ run ] triggered by Bot. Commit: 095f9ff Link to invocation

@Hudayday Hudayday force-pushed the kvcache-v2-nongrowing-reclaim branch 2 times, most recently from 2f10293 to c11ca19 Compare June 29, 2026 05:19
@Hudayday Hudayday marked this pull request as ready for review June 29, 2026 05:21
@Hudayday

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #56307 [ run ] triggered by Bot. Commit: c11ca19 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56292 [ run ] completed with state ABORTED. Commit: 095f9ff

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56307 [ run ] completed with state SUCCESS. Commit: c11ca19
/LLM/main/L0_MergeRequest_PR pipeline #45158 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Hudayday

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #56337 [ run ] triggered by Bot. Commit: c11ca19 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56337 [ run ] completed with state SUCCESS. Commit: c11ca19
/LLM/main/L0_MergeRequest_PR pipeline #45186 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Hudayday

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #56371 [ run ] triggered by Bot. Commit: c11ca19 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56371 [ run ] completed with state SUCCESS. Commit: c11ca19
/LLM/main/L0_MergeRequest_PR pipeline #45216 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Hudayday

Collaborator Author

/bot run --disable-fail-fast

@Hudayday

Collaborator Author

/bot run --disable-fail-fast

2 similar comments
@Hudayday

Collaborator Author

/bot run --disable-fail-fast

@Hudayday

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #56516 [ run ] triggered by Bot. Commit: 9f3148e Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56516 [ run ] completed with state SUCCESS. Commit: 9f3148e
/LLM/main/L0_MergeRequest_PR pipeline #45354 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Hudayday

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #56605 [ run ] triggered by Bot. Commit: 9f3148e Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #56605 [ run ] completed with state SUCCESS. Commit: 9f3148e
/LLM/main/L0_MergeRequest_PR pipeline #45432 completed with status: 'SUCCESS'

CI Report

Link to invocation

@nvpohanh

nvpohanh commented Jul 1, 2026

Collaborator

[by Codex] @lowsfer Could you review this PR? Thanks!

@Hudayday Hudayday marked this pull request as draft July 2, 2026 03:07
@Hudayday Hudayday force-pushed the kvcache-v2-nongrowing-reclaim branch 2 times, most recently from ef05e64 to b053331 Compare July 2, 2026 09:28
@Hudayday Hudayday changed the title from "[None][feat] Support non-growing KV history in KVCacheManagerV2 for KV-cache compression" to "[None][feat] Support request-scoped capacity-only KV cache compaction" Jul 2, 2026
@Hudayday Hudayday changed the title [None][feat] Support request-scoped capacity-only KV cache compaction [None][feat] Support request-scoped capacity-only KV cache compaction Jul 2, 2026
Signed-off-by: Hudayday <32944717+Hudayday@users.noreply.github.com>
@Hudayday Hudayday force-pushed the kvcache-v2-nongrowing-reclaim branch from b053331 to 5ba3c17 Compare July 2, 2026 09:54
@Hudayday Hudayday marked this pull request as ready for review July 2, 2026 09:58
@Hudayday

Hudayday commented Jul 2, 2026

Collaborator Author

/bot run --disable-fail-fast

1 similar comment
@Hudayday

Hudayday commented Jul 2, 2026

Collaborator Author

/bot run --disable-fail-fast

@lowsfer lowsfer requested a review from yizhang-nv July 2, 2026 13:30
@Hudayday

Hudayday commented Jul 2, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #57196 [ run ] triggered by Bot. Commit: 5ba3c17 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #57196 [ run ] completed with state SUCCESS. Commit: 5ba3c17
/LLM/main/L0_MergeRequest_PR pipeline #45968 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@Hudayday

Hudayday commented Jul 3, 2026

Collaborator Author

/bot run --disable-fail-fast

Signed-off-by: Hudayday <32944717+Hudayday@users.noreply.github.com>
@Hudayday

Hudayday commented Jul 3, 2026

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #57359 [ run ] triggered by Bot. Commit: 2607a43 Link to invocation

3 participants