[None][feat] Add PyTorch reset_prefix_cache API by milesial · Pull Request #15313 · NVIDIA/TensorRT-LLM

milesial · 2026-06-12T17:10:57Z

Description

This relands #14970 after CI fixes for the RLHF Ray worker extension conflict.

Following vLLM's reset_prefix_cache and SGLang's flush_cache, this adds a Python API and an HTTP endpoint to reset the local KV cache prefix-reuse state.
This is useful during benchmarking, for example to reset the state between runs in a concurrency sweep.
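A minimal sketch of that benchmarking pattern, assuming a Python harness. `run_concurrency_sweep` and `run_benchmark` are hypothetical helpers. `reset_prefix_cache` stands for any zero-argument callable, such as the bound `llm.reset_prefix_cache` method on a PyTorch-backend `LLM` or a small function that POSTs to `/reset_prefix_cache`.

```python
from typing import Callable, Dict, List


def run_concurrency_sweep(
    concurrencies: List[int],
    run_benchmark: Callable[[int], float],
    reset_prefix_cache: Callable[[], None],
) -> Dict[int, float]:
    """Run one benchmark per concurrency level, resetting the prefix cache first.

    Resetting before each run keeps KV blocks cached by an earlier run from
    inflating the prefix-hit rate (and thus the throughput) of a later one.
    """
    results: Dict[int, float] = {}
    for concurrency in concurrencies:
        # The reset is rejected while requests are active or queued, so it is
        # issued between runs, never while a run is in flight.
        reset_prefix_cache()
        results[concurrency] = run_benchmark(concurrency)
    return results
```

Resetting before each run, rather than after, also clears any state left behind by warm-up requests issued before the sweep starts.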

Test Coverage

Added unit tests to tests/unittest/llmapi/test_llm.py

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions)
- If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
- Any new dependencies have been scanned for license and vulnerabilities
- CODEOWNERS updated if ownership changes
- Documentation updated as needed
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.

Summary by CodeRabbit

Release Notes

New Features
- Added beta reset_prefix_cache() method to LLM API for invalidating local KV cache prefix-reuse state in PyTorch backend.
- Added /reset_prefix_cache POST endpoint to OpenAI server.
- Enhanced validation to prevent cache reset when active or queued requests exist.
Improvements
- Refactored worker control endpoints with standardized error handling and collective RPC dispatch support.

Signed-off-by: milesial <milesial@users.noreply.github.com>

milesial · 2026-06-12T17:34:42Z

/bot run

tensorrt-cicd · 2026-06-12T17:41:34Z

PR_Github #53931 [ run ] triggered by Bot. Commit: 7daa9fc

tensorrt-cicd · 2026-06-12T20:10:56Z

PR_Github #53931 [ run ] completed with state SUCCESS. Commit: 7daa9fc
/LLM/main/L0_MergeRequest_PR pipeline #43024 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

milesial · 2026-06-12T22:59:39Z

/bot run

tensorrt-cicd · 2026-06-12T23:05:27Z

PR_Github #53969 [ run ] triggered by Bot. Commit: 3db30c5

tensorrt-cicd · 2026-06-13T01:54:51Z

PR_Github #53969 [ run ] completed with state FAILURE. Commit: 3db30c5
/LLM/main/L0_MergeRequest_PR pipeline #43059 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

milesial · 2026-06-15T17:05:48Z

/bot run

coderabbitai · 2026-06-15T17:07:25Z

📝 Walkthrough

Walkthrough

Adds reset_prefix_cache() through the full execution stack: PyExecutor (with idle precondition), BaseWorker, _TorchLLM LLM API (beta), and a new /reset_prefix_cache OpenAI server endpoint. Removes the method from WorkerExtension. Refactors /release_memory, /resume_memory, and /update_weights endpoints into shared RPC dispatch helpers.

Changes

reset_prefix_cache feature and RL endpoint refactor

| Layer | File(s) | Summary |
|---|---|---|
| PyExecutor and BaseWorker idle-guard implementation | `tensorrt_llm/_torch/pyexecutor/py_executor.py`, `tensorrt_llm/executor/base_worker.py` | `PyExecutor.reset_prefix_cache()` raises `RuntimeError` when active or queued requests are present before calling `kv_cache_manager.reset_reuse_state()`. `BaseWorker.reset_prefix_cache()` wraps the call inside `engine.control_action()` with a capability check and `NotImplementedError` fallback. |
| _TorchLLM API method and WorkerExtension cleanup | `tensorrt_llm/llmapi/llm.py`, `tensorrt_llm/llmapi/rlhf_utils.py` | `_TorchLLM.reset_prefix_cache()` (beta) validates `encode_only` and executor presence, dispatches via `_collective_rpc` when available or falls back to `executor.reset_prefix_cache()`. `WorkerExtension.reset_prefix_cache()` is removed so the base class method is inherited. |
| OpenAI server `/reset_prefix_cache` route and RL endpoint dispatch refactor | `tensorrt_llm/serve/openai_server.py` | Registers POST `/reset_prefix_cache`, maps `NotImplementedError` to 501 and `RuntimeError`/`ValueError` to 409. Introduces shared `_run_worker_control_rpc`/`_handle_worker_control_rpc` helpers for unified executor → `AsyncLLM.collective_rpc` → `_collective_rpc` dispatch, replacing per-method ad-hoc logic in `/release_memory`, `/resume_memory`, and `/update_weights`. |
| PyExecutor and Ray worker extension tests | `tests/unittest/_torch/executor/test_py_executor.py`, `tests/unittest/_torch/ray_orchestrator/single_gpu/test_llm_update_weights.py` | Three tests cover `PyExecutor.reset_prefix_cache()` idle, active-requests, and queued-requests cases. One test verifies `WorkerExtension` does not override `reset_prefix_cache` on `RayGPUWorker`. |
| LLM API tests, OpenAI server endpoint tests, and API stability schema | `tests/unittest/llmapi/test_llm.py`, `tests/unittest/api_stability/references/llm.yaml` | Fake executors/generators cover `_TorchLLM.reset_prefix_cache()` dispatch and error paths. OpenAI server tests validate the new endpoint and refactored memory/update-weights handlers. API stability YAML adds `reset_prefix_cache` as beta. |
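The dispatch described for `_TorchLLM.reset_prefix_cache()` can be sketched as a free function. This is an illustration, not the PR's code: the function name `dispatch_reset_prefix_cache`, the `collective_rpc` parameter, and the exception types raised for `encode_only` and for a missing executor are all assumptions.

```python
from typing import Any, Callable, List, Optional


def dispatch_reset_prefix_cache(
    executor: Optional[Any],
    encode_only: bool = False,
    collective_rpc: Optional[Callable[[str], List[Any]]] = None,
) -> None:
    """Hypothetical sketch of the _TorchLLM.reset_prefix_cache() dispatch.

    Exception types below are assumptions chosen to match the server's
    409 mapping for RuntimeError/ValueError.
    """
    if encode_only:
        # Assumed: encoder-only runs are rejected outright.
        raise ValueError("reset_prefix_cache() is not supported with encode_only=True")
    if executor is None:
        raise RuntimeError("reset_prefix_cache() requires a running executor")
    if collective_rpc is not None:
        # Fan the call out by method name so every worker resets its own state.
        collective_rpc("reset_prefix_cache")
    else:
        # Fallback: call the executor directly.
        executor.reset_prefix_cache()
```

Dispatching by method name is what lets the Ray path work once `WorkerExtension` stops overriding the method: the inherited `BaseWorker.reset_prefix_cache()` is what each worker resolves.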

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant Client
  participant OpenAIServer
  participant _TorchLLM
  participant BaseWorker
  participant PyExecutor

  Client->>OpenAIServer: POST /reset_prefix_cache
  OpenAIServer->>_TorchLLM: reset_prefix_cache()
  _TorchLLM->>_TorchLLM: validate encode_only / executor present
  alt collective RPC supported
    _TorchLLM->>BaseWorker: _collective_rpc("reset_prefix_cache")
  else
    _TorchLLM->>BaseWorker: executor.reset_prefix_cache()
  end
  BaseWorker->>PyExecutor: engine.control_action(reset_prefix_cache)
  PyExecutor->>PyExecutor: raise RuntimeError if active or queued requests
  PyExecutor-->>BaseWorker: kv_cache_manager.reset_reuse_state()
  BaseWorker-->>OpenAIServer: success
  OpenAIServer-->>Client: {"status": "success"}
```
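The endpoint's status mapping (501 for `NotImplementedError`, 409 for `RuntimeError`/`ValueError`, and a `{"status": "success"}` body otherwise) can be sketched without any web framework. `handle_reset_prefix_cache` and the `{"error": ...}` body shape are hypothetical; the actual handler in `openai_server.py` may be structured differently.

```python
from http import HTTPStatus
from typing import Callable, Dict, Tuple


def handle_reset_prefix_cache(reset: Callable[[], None]) -> Tuple[int, Dict[str, str]]:
    """Map the outcome of a reset call to an HTTP status code and JSON body."""
    try:
        reset()
    except NotImplementedError as e:
        # Must come first: NotImplementedError is a subclass of RuntimeError.
        return HTTPStatus.NOT_IMPLEMENTED, {"error": str(e)}
    except (RuntimeError, ValueError) as e:
        # Busy executor or invalid configuration: the request conflicts with state.
        return HTTPStatus.CONFLICT, {"error": str(e)}
    return HTTPStatus.OK, {"status": "success"}
```

Because `NotImplementedError` subclasses `RuntimeError` in Python, swapping the two `except` clauses would report an unsupported backend as a 409 conflict instead of a 501.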

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#15306: Directly conflicts with this PR — it reverts the reset_prefix_cache feature by removing BaseWorker.reset_prefix_cache, _TorchLLM.reset_prefix_cache, the OpenAI server endpoint, and related tests on the same code paths.
NVIDIA/TensorRT-LLM#14970: Implements the same reset_prefix_cache end-to-end feature — BaseWorker, _TorchLLM dispatch (including _collective_rpc), the /reset_prefix_cache endpoint, and tests/unittest/llmapi/test_llm.py coverage.

Suggested reviewers

suyoggupta
hchings
DomBrown
achartier
chzblych
shuyixiong

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.75%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title '[None][feat] Add PyTorch reset_prefix_cache API' clearly and specifically summarizes the main feature addition across multiple files, following the repository's title template format. |
| Description check | ✅ Passed | The PR description explains the purpose (relands `#14970`, adds reset_prefix_cache API following vLLM and SGLang patterns), mentions test coverage, and includes a completed checklist. However, it lacks detailed explanation of what the feature does and why the RLHF conflict required fixes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1


Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around lines 5240-5243: The guard condition in the reset_prefix_cache() method only checks active_requests and waiting_queue, but misses requests that may be pending in executor_request_queue or request_accumulated. Extend the RuntimeError condition to also verify that executor_request_queue and request_accumulated are empty, so that the precondition truly guarantees no queued work exists before the kv_cache_manager.reset_reuse_state() call proceeds.
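A sketch of the stricter guard this comment asks for. It assumes each of the four request collections can be measured with `len()`; in the real `PyExecutor` these attributes may have other types (for example a queue wrapper object), so the emptiness checks would need adapting. `ExecutorStateSketch` and `assert_idle_for_prefix_reset` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List


@dataclass
class ExecutorStateSketch:
    """Hypothetical holder for the four request collections named in the comment."""
    active_requests: List[Any] = field(default_factory=list)
    waiting_queue: Deque[Any] = field(default_factory=deque)
    executor_request_queue: Deque[Any] = field(default_factory=deque)
    request_accumulated: List[Any] = field(default_factory=list)


def assert_idle_for_prefix_reset(state: ExecutorStateSketch) -> None:
    """Raise RuntimeError unless every request collection is empty."""
    names = ("active_requests", "waiting_queue",
             "executor_request_queue", "request_accumulated")
    # Record only the non-empty collections so the error says what is pending.
    busy = {name: len(getattr(state, name)) for name in names if len(getattr(state, name))}
    if busy:
        raise RuntimeError(f"Cannot reset prefix cache while requests are pending: {busy}")
```

Naming the non-empty collections in the message makes the resulting 409 response actionable for a caller that raced a reset against in-flight traffic.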


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a0779ce2-f81f-47d7-b3ce-336288610ad6

📥 Commits

Reviewing files that changed from the base of the PR and between 130ae82 and 92019b1.

📒 Files selected for processing (9)

tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/executor/base_worker.py
tensorrt_llm/llmapi/llm.py
tensorrt_llm/llmapi/rlhf_utils.py
tensorrt_llm/serve/openai_server.py
tests/unittest/_torch/executor/test_py_executor.py
tests/unittest/_torch/ray_orchestrator/single_gpu/test_llm_update_weights.py
tests/unittest/api_stability/references/llm.yaml
tests/unittest/llmapi/test_llm.py

💤 Files with no reviewable changes (1)

tensorrt_llm/llmapi/rlhf_utils.py

milesial · 2026-06-23T22:33:58Z

/bot run

tensorrt-cicd · 2026-06-23T22:40:47Z

PR_Github #55336 [ run ] triggered by Bot. Commit: 1d8fa34

tensorrt-cicd · 2026-06-23T23:21:28Z

PR_Github #55336 [ run ] completed with state FAILURE. Commit: 1d8fa34
/LLM/main/L0_MergeRequest_PR pipeline #44287 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

milesial · 2026-06-24T19:36:36Z

/bot run

milesial · 2026-06-24T19:48:54Z

/bot run

milesial · 2026-06-24T21:05:55Z

/bot run

tensorrt-cicd · 2026-06-24T21:13:01Z

PR_Github #55591 [ run ] triggered by Bot. Commit: c3e1cb2

tensorrt-cicd · 2026-06-24T22:50:57Z

PR_Github #55591 [ run ] completed with state FAILURE. Commit: c3e1cb2
/LLM/main/L0_MergeRequest_PR pipeline #44509 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

milesial · 2026-06-25T00:07:20Z

/bot run

tensorrt-cicd · 2026-06-25T00:13:32Z

PR_Github #55621 [ run ] triggered by Bot. Commit: c92c909

tensorrt-cicd · 2026-06-25T02:17:11Z

PR_Github #55621 [ run ] completed with state FAILURE. Commit: c92c909
/LLM/main/L0_MergeRequest_PR pipeline #44540 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

github-actions Bot assigned milesial Jun 12, 2026

milesial added the api-compatible Accepted LLM API contract change that is backwards-compatible label Jun 12, 2026

[None][fix] Restore PyTorch reset_prefix_cache API

4a3554b

Signed-off-by: milesial <milesial@users.noreply.github.com>

milesial force-pushed the codex/reland-reset-prefix-cache branch from c9a58d5 to 4a3554b June 12, 2026 17:30

Merge branch 'main' into codex/reland-reset-prefix-cache

7daa9fc

Merge branch 'main' into codex/reland-reset-prefix-cache

3db30c5

milesial marked this pull request as ready for review June 15, 2026 17:05

milesial requested review from a team as code owners June 15, 2026 17:05

milesial requested review from pcastonguay and suyoggupta June 15, 2026 17:05

Merge branch 'main' into codex/reland-reset-prefix-cache

92019b1

coderabbitai Bot reviewed Jun 15, 2026


Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py

Merge branch 'main' into codex/reland-reset-prefix-cache

1d8fa34

Merge branch 'main' into codex/reland-reset-prefix-cache

5f24bcf

Merge branch 'main' into codex/reland-reset-prefix-cache

c3e1cb2

Merge branch 'main' into codex/reland-reset-prefix-cache

c92c909
