[None][fix] Draft KV cache should not allocate host memory by Shang-Pin · Pull Request #12756 · NVIDIA/TensorRT-LLM

Shang-Pin · 2026-04-03T23:50:25Z

When using one-model speculative decoding with separate draft KV cache (e.g. EAGLE3), the draft cache inherits the target's KvCacheConfig which may have a non-zero host_cache_size. This causes unnecessary host memory allocation for the draft cache. Only the target model should use host offloading since draft tokens are transient and may be rejected during verification.

Fix: set host_cache_size=0 on the draft KV cache config before creating the draft KV cache manager.

Summary by CodeRabbit

Bug Fixes
- Fixed KV cache memory management to prevent unnecessary host memory allocation during draft model execution, improving memory efficiency.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-04-03T23:53:29Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 419c649c-96af-4cf0-b997-bfab293f03b7

📥 Commits

Reviewing files that changed from the base of the PR and between c3de562 and 79148ba.

📒 Files selected for processing (1)

tensorrt_llm/_torch/pyexecutor/_util.py

📝 Walkthrough

Walkthrough

The change modifies the _create_one_model_draft_kv_cache_manager function to clone and update the draft KV cache configuration, setting host_cache_size to 0. This prevents the draft KV cache path from allocating host memory that would be controlled by the original configuration.

Changes

Cohort / File(s)	Summary
Draft KV Cache Configuration `tensorrt_llm/_torch/pyexecutor/_util.py`	Modified `_create_one_model_draft_kv_cache_manager` to clone the KV cache configuration and set `host_cache_size` to `0` before creating the draft KV cache manager, preventing unintended host memory allocation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The PR description provides a clear explanation of the problem and the fix, but the template sections (Description, Test Coverage, and Checklist) are not properly filled in beyond the initial explanation.	Complete the Description and Test Coverage sections explicitly, and confirm the PR Checklist items have been reviewed before merging.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main change: preventing draft KV cache from allocating host memory, which directly addresses the issue fixed in the PR.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

zheyuf

LGTM

When using one-model speculative decoding with separate draft KV cache (e.g. EAGLE3), the draft cache inherits the target's KvCacheConfig which may have a non-zero host_cache_size. This causes unnecessary host memory allocation for the draft cache. Only the target model should use host offloading since draft tokens are transient and may be rejected during verification. Fix: set host_cache_size=0 on the draft KV cache config before creating the draft KV cache manager. Signed-off-by: Shang-Pin Sheng <shang-pin@tmatehq.com>

zheyuf · 2026-04-09T23:58:05Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-10T00:03:39Z

PR_Github #42595 [ run ] triggered by Bot. Commit: 2575f4d Link to invocation

Shang-Pin requested a review from a team as a code owner April 3, 2026 23:50

Shang-Pin requested a review from achartier April 3, 2026 23:50

github-actions bot assigned Shang-Pin Apr 3, 2026

svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 4, 2026

Funatiq requested review from a team and zheyuf and removed request for a team April 7, 2026 09:15

zheyuf approved these changes Apr 9, 2026

View reviewed changes

zheyuf force-pushed the fix-draft-kv-host-memory branch from 79148ba to 2575f4d Compare April 9, 2026 23:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[None][fix] Draft KV cache should not allocate host memory#12756

[None][fix] Draft KV cache should not allocate host memory#12756
Shang-Pin wants to merge 1 commit intoNVIDIA:mainfrom
Shang-Pin:fix-draft-kv-host-memory

Shang-Pin commented Apr 3, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 3, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

zheyuf left a comment

Uh oh!

zheyuf commented Apr 9, 2026

Uh oh!

tensorrt-cicd commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

Shang-Pin commented Apr 3, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai bot commented Apr 3, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

zheyuf left a comment

Choose a reason for hiding this comment

Uh oh!

zheyuf commented Apr 9, 2026

Uh oh!

tensorrt-cicd commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Shang-Pin commented Apr 3, 2026 •

edited by coderabbitai bot

Loading