
[None][fix] Draft KV cache should not allocate host memory #12756

Open
Shang-Pin wants to merge 1 commit into NVIDIA:main from Shang-Pin:fix-draft-kv-host-memory

Conversation

@Shang-Pin Shang-Pin commented Apr 3, 2026

When using one-model speculative decoding with a separate draft KV cache (e.g., EAGLE3), the draft cache inherits the target's KvCacheConfig, which may have a non-zero host_cache_size. This causes unnecessary host memory allocation for the draft cache. Only the target model should use host offloading, since draft tokens are transient and may be rejected during verification.

Fix: set host_cache_size=0 on the draft KV cache config before creating the draft KV cache manager.
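The shape of the fix can be sketched as follows. Note this is an illustrative sketch only: `KvCacheConfig` here is a minimal stand-in for the real TensorRT-LLM config object, and `make_draft_kv_cache_config` is a hypothetical helper name; the actual change lives inside `_create_one_model_draft_kv_cache_manager` in `tensorrt_llm/_torch/pyexecutor/_util.py`.

```python
import copy
from dataclasses import dataclass


@dataclass
class KvCacheConfig:
    """Simplified stand-in for the real KV cache config."""
    free_gpu_memory_fraction: float = 0.9
    host_cache_size: int = 0  # bytes of host memory reserved for KV offloading


def make_draft_kv_cache_config(target_config: KvCacheConfig) -> KvCacheConfig:
    """Clone the target's config and disable host offloading for the draft cache."""
    draft_config = copy.copy(target_config)
    draft_config.host_cache_size = 0  # draft tokens are transient; no host offload
    return draft_config


target = KvCacheConfig(host_cache_size=4 << 30)  # target uses a 4 GiB host cache
draft = make_draft_kv_cache_config(target)
assert draft.host_cache_size == 0          # draft cache allocates no host memory
assert target.host_cache_size == 4 << 30   # target config is left untouched
```

Cloning before mutation matters: the draft and target share one inherited config object, so zeroing `host_cache_size` in place would also disable host offloading on the target.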

Summary by CodeRabbit

  • Bug Fixes
    • Fixed KV cache memory management to prevent unnecessary host memory allocation during draft model execution, improving memory efficiency.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@Shang-Pin Shang-Pin requested a review from a team as a code owner April 3, 2026 23:50
@Shang-Pin Shang-Pin requested a review from achartier April 3, 2026 23:50

coderabbitai bot commented Apr 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 419c649c-96af-4cf0-b997-bfab293f03b7

📥 Commits

Reviewing files that changed from the base of the PR and between c3de562 and 79148ba.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/_util.py

📝 Walkthrough

Walkthrough

The change modifies the _create_one_model_draft_kv_cache_manager function to clone and update the draft KV cache configuration, setting host_cache_size to 0. This prevents the draft KV cache path from allocating host memory that would be controlled by the original configuration.

Changes

  • tensorrt_llm/_torch/pyexecutor/_util.py (Draft KV Cache Configuration): Modified _create_one_model_draft_kv_cache_manager to clone the KV cache configuration and set host_cache_size to 0 before creating the draft KV cache manager, preventing unintended host memory allocation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks (✅ 2 passed, ❌ 1 inconclusive)

❌ Failed checks (1 inconclusive)
  • Description check: ❓ Inconclusive. The PR description provides a clear explanation of the problem and the fix, but the template sections (Description, Test Coverage, and PR Checklist) are not filled in beyond the initial explanation. Resolution: complete the Description and Test Coverage sections explicitly, and confirm the PR Checklist items have been reviewed before merging.

✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly describes the main change, preventing the draft KV cache from allocating host memory, which directly addresses the issue fixed in the PR.
  • Docstring coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) Apr 4, 2026
@Funatiq Funatiq requested review from a team and zheyuf and removed request for a team April 7, 2026 09:15

@zheyuf zheyuf left a comment


LGTM

@zheyuf zheyuf force-pushed the fix-draft-kv-host-memory branch from 79148ba to 2575f4d on April 9, 2026 23:57

zheyuf commented Apr 9, 2026

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #42595 [ run ] triggered by Bot. Commit: 2575f4d


Labels

Community want to contribute: PRs initiated from Community


5 participants