[None][feat] Add PyTorch reset_prefix_cache API #15313

Open
milesial wants to merge 8 commits into NVIDIA:main from milesial:codex/reland-reset-prefix-cache

Conversation

@milesial commented Jun 12, 2026

Collaborator

Description

This relands #14970 after CI fixes for the RLHF Ray worker extension conflict.

Following vLLM's reset_prefix_cache and SGLang's flush_cache, add a Python API and an HTTP endpoint to reset the local KV cache state.
This is useful during benchmarking, for example to reset the state between runs in a concurrency sweep.
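
A minimal sketch of the benchmarking use case, assuming a server that exposes the POST /reset_prefix_cache route added by this PR. The helper names (reset_prefix_cache, concurrency_sweep), the injectable opener, the base URL, and the run_benchmark callback are illustrative assumptions, not part of TensorRT-LLM.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable


def reset_prefix_cache(base_url: str,
                       opener: Callable = urllib.request.urlopen) -> dict:
    """POST to <base_url>/reset_prefix_cache and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{base_url}/reset_prefix_cache", data=b"", method="POST")
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))


def concurrency_sweep(base_url: str,
                      levels: Iterable[int],
                      run_benchmark: Callable[[int], object],
                      opener: Callable = urllib.request.urlopen) -> Dict[int, object]:
    """Run one benchmark per concurrency level, resetting the cache before each."""
    results = {}
    for level in levels:
        # Reset prefix-reuse state so an earlier run cannot warm the cache
        # for the next one. The previous run must have fully drained first.
        reset_prefix_cache(base_url, opener)
        results[level] = run_benchmark(level)
    return results
```

With the default opener, a non-2xx response surfaces as urllib.error.HTTPError, so a reset attempted while requests are still in flight would fail loudly rather than silently skew the next run.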

Test Coverage

Added unit tests to tests/unittest/llmapi/test_llm.py.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added beta reset_prefix_cache() method to LLM API for invalidating local KV cache prefix-reuse state in PyTorch backend.
    • Added /reset_prefix_cache POST endpoint to OpenAI server.
    • Enhanced validation to prevent cache reset when active or queued requests exist.
  • Improvements

    • Refactored worker control endpoints with standardized error handling and collective RPC dispatch support.

@milesial added the api-compatible label (Accepted LLM API contract change that is backwards-compatible) on Jun 12, 2026
Signed-off-by: milesial <milesial@users.noreply.github.com>
@milesial force-pushed the codex/reland-reset-prefix-cache branch from c9a58d5 to 4a3554b on June 12, 2026 17:30
@milesial

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #53931 [ run ] triggered by Bot. Commit: 7daa9fc Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53931 [ run ] completed with state SUCCESS. Commit: 7daa9fc
/LLM/main/L0_MergeRequest_PR pipeline #43024 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@milesial

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #53969 [ run ] triggered by Bot. Commit: 3db30c5 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #53969 [ run ] completed with state FAILURE. Commit: 3db30c5
/LLM/main/L0_MergeRequest_PR pipeline #43059 completed with status: 'FAILURE'

@milesial marked this pull request as ready for review on June 15, 2026 17:05
@milesial requested review from a team as code owners on June 15, 2026 17:05
@milesial

Collaborator Author

/bot run

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Contributor

📝 Walkthrough

Adds reset_prefix_cache() through the full execution stack: PyExecutor (with idle precondition), BaseWorker, _TorchLLM LLM API (beta), and a new /reset_prefix_cache OpenAI server endpoint. Removes the method from WorkerExtension. Refactors /release_memory, /resume_memory, and /update_weights endpoints into shared RPC dispatch helpers.

Changes

reset_prefix_cache feature and RL endpoint refactor

Layer / File(s) Summary
PyExecutor and BaseWorker idle-guard implementation
tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/executor/base_worker.py
PyExecutor.reset_prefix_cache() raises RuntimeError when active or queued requests are present before calling kv_cache_manager.reset_reuse_state(). BaseWorker.reset_prefix_cache() wraps the call inside engine.control_action() with a capability check and NotImplementedError fallback.
_TorchLLM API method and WorkerExtension cleanup
tensorrt_llm/llmapi/llm.py, tensorrt_llm/llmapi/rlhf_utils.py
_TorchLLM.reset_prefix_cache() (beta) validates encode_only and executor presence, dispatches via _collective_rpc when available or falls back to executor.reset_prefix_cache(). WorkerExtension.reset_prefix_cache() is removed so the base class method is inherited.
OpenAI server /reset_prefix_cache route and RL endpoint dispatch refactor
tensorrt_llm/serve/openai_server.py
Registers POST /reset_prefix_cache, mapping NotImplementedError to 501 and RuntimeError/ValueError to 409. Introduces shared _run_worker_control_rpc/_handle_worker_control_rpc helpers for unified executor → AsyncLLM.collective_rpc → _collective_rpc dispatch, replacing per-method ad-hoc logic in /release_memory, /resume_memory, and /update_weights.
PyExecutor and Ray worker extension tests
tests/unittest/_torch/executor/test_py_executor.py, tests/unittest/_torch/ray_orchestrator/single_gpu/test_llm_update_weights.py
Three tests cover PyExecutor.reset_prefix_cache() idle, active-requests, and queued-requests cases. One test verifies WorkerExtension does not override reset_prefix_cache on RayGPUWorker.
LLM API tests, OpenAI server endpoint tests, and API stability schema
tests/unittest/llmapi/test_llm.py, tests/unittest/api_stability/references/llm.yaml
Fake executors/generators cover _TorchLLM.reset_prefix_cache() dispatch and error paths. OpenAI server tests validate the new endpoint and refactored memory/update-weights handlers. API stability YAML adds reset_prefix_cache as beta.
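
The dispatch rule summarized for _TorchLLM.reset_prefix_cache() can be sketched as below. This is a stand-in written from the walkthrough text only: SketchLLM and CountingExecutor are hypothetical classes, the exception types used for the two validation failures are assumptions, and the real method may differ in detail.

```python
from typing import Callable, Optional


class CountingExecutor:
    """Fake executor that only records how often it was asked to reset."""

    def __init__(self) -> None:
        self.resets = 0

    def reset_prefix_cache(self) -> None:
        self.resets += 1


class SketchLLM:
    """Stand-in for the LLM-API layer; not the real _TorchLLM class."""

    def __init__(self, executor, encode_only: bool = False,
                 collective_rpc: Optional[Callable[[str], object]] = None) -> None:
        self._executor = executor
        self._encode_only = encode_only
        # None models a deployment without collective RPC support.
        self._collective_rpc = collective_rpc

    def reset_prefix_cache(self) -> None:
        # Validation first. The exception types are assumptions.
        if self._encode_only:
            raise ValueError("reset_prefix_cache is unavailable in encode-only mode")
        if self._executor is None:
            raise RuntimeError("executor is not initialized")
        # Prefer the collective RPC path, otherwise call the executor directly.
        if self._collective_rpc is not None:
            self._collective_rpc("reset_prefix_cache")
        else:
            self._executor.reset_prefix_cache()
```

Passing the method name as a string to the RPC callable mirrors the _collective_rpc("reset_prefix_cache") call shown in the sequence diagram.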

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant OpenAIServer
  participant _TorchLLM
  participant BaseWorker
  participant PyExecutor

  Client->>OpenAIServer: POST /reset_prefix_cache
  OpenAIServer->>_TorchLLM: reset_prefix_cache()
  _TorchLLM->>_TorchLLM: validate encode_only / executor present
  alt collective RPC supported
    _TorchLLM->>BaseWorker: _collective_rpc("reset_prefix_cache")
  else
    _TorchLLM->>BaseWorker: executor.reset_prefix_cache()
  end
  BaseWorker->>PyExecutor: engine.control_action(reset_prefix_cache)
  PyExecutor->>PyExecutor: raise RuntimeError if active or queued requests
  PyExecutor-->>BaseWorker: kv_cache_manager.reset_reuse_state()
  BaseWorker-->>OpenAIServer: success
  OpenAIServer-->>Client: {"status": "success"}
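
The endpoint's error mapping (NotImplementedError to 501, RuntimeError/ValueError to 409) can be illustrated without a web framework. In this sketch, handle_reset_prefix_cache is a hypothetical helper, and the 200 success code and the shape of the error body are assumptions; only the 501/409 mapping and the {"status": "success"} payload come from the summary above.

```python
from typing import Callable, Dict, Tuple


def handle_reset_prefix_cache(
        reset: Callable[[], None]) -> Tuple[int, Dict[str, str]]:
    """Call `reset` and translate its outcome into (status_code, json_body)."""
    try:
        reset()
    except NotImplementedError as exc:
        # This clause must come first: NotImplementedError is a subclass of
        # RuntimeError, so the broader clause below would otherwise catch it.
        return 501, {"error": str(exc)}
    except (RuntimeError, ValueError) as exc:
        # Covers the "active or queued requests" precondition failure.
        return 409, {"error": str(exc)}
    return 200, {"status": "success"}
```

The ordering of the except clauses is the one design point worth noting: swapping them would turn every unsupported-backend error into a 409.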

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#15306: Directly conflicts with this PR — it reverts the reset_prefix_cache feature by removing BaseWorker.reset_prefix_cache, _TorchLLM.reset_prefix_cache, the OpenAI server endpoint, and related tests on the same code paths.
  • NVIDIA/TensorRT-LLM#14970: Implements the same reset_prefix_cache end-to-end feature — BaseWorker, _TorchLLM dispatch (including _collective_rpc), the /reset_prefix_cache endpoint, and tests/unittest/llmapi/test_llm.py coverage.

Suggested reviewers

  • suyoggupta
  • hchings
  • DomBrown
  • achartier
  • chzblych
  • shuyixiong
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.75% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title '[None][feat] Add PyTorch reset_prefix_cache API' clearly and specifically summarizes the main feature addition across multiple files, following the repository's title template format.
Description check | ✅ Passed | The PR description explains the purpose (relands #14970, adds reset_prefix_cache API following vLLM and SGLang patterns), mentions test coverage, and includes a completed checklist. However, it lacks detailed explanation of what the feature does and why the RLHF conflict required fixes.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 5240-5243: The guard condition in reset_prefix_cache() method only
checks active_requests and waiting_queue but misses requests that may be pending
in executor_request_queue or request_accumulated. Extend the RuntimeError
condition to also verify that executor_request_queue and request_accumulated are
empty, ensuring the precondition truly enforces that no queued work exists
before allowing the kv_cache_manager.reset_reuse_state() call to proceed.
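
A minimal sketch of the stricter guard this comment asks for. The four attribute names are taken from the comment itself, but modelling each of them as a plain list is an assumption made for illustration, and SketchExecutor and FakeKVCacheManager are hypothetical stand-ins rather than the real PyExecutor and KV cache manager.

```python
class FakeKVCacheManager:
    """Records reset calls in place of a real KV cache manager."""

    def __init__(self) -> None:
        self.reset_calls = 0

    def reset_reuse_state(self) -> None:
        self.reset_calls += 1


class SketchExecutor:
    # Every container that may hold work not yet finished or not yet scheduled.
    PENDING_ATTRS = ("active_requests", "waiting_queue",
                     "executor_request_queue", "request_accumulated")

    def __init__(self, kv_cache_manager: FakeKVCacheManager) -> None:
        self.kv_cache_manager = kv_cache_manager
        # Assumption: plain lists stand in for the real queue objects.
        self.active_requests = []
        self.waiting_queue = []
        self.executor_request_queue = []
        self.request_accumulated = []

    def reset_prefix_cache(self) -> None:
        busy = [name for name in self.PENDING_ATTRS
                if len(getattr(self, name)) > 0]
        if busy:
            raise RuntimeError(
                "Cannot reset prefix cache while work is pending in: "
                + ", ".join(busy))
        self.kv_cache_manager.reset_reuse_state()
```

Listing the pending containers in one tuple keeps the precondition and its error message in sync if another queue needs to be added later.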

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a0779ce2-f81f-47d7-b3ce-336288610ad6

📥 Commits

Reviewing files that changed from the base of the PR and between 130ae82 and 92019b1.

📒 Files selected for processing (9)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/executor/base_worker.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/llmapi/rlhf_utils.py
  • tensorrt_llm/serve/openai_server.py
  • tests/unittest/_torch/executor/test_py_executor.py
  • tests/unittest/_torch/ray_orchestrator/single_gpu/test_llm_update_weights.py
  • tests/unittest/api_stability/references/llm.yaml
  • tests/unittest/llmapi/test_llm.py
💤 Files with no reviewable changes (1)
  • tensorrt_llm/llmapi/rlhf_utils.py

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
@milesial

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #55336 [ run ] triggered by Bot. Commit: 1d8fa34 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55336 [ run ] completed with state FAILURE. Commit: 1d8fa34
/LLM/main/L0_MergeRequest_PR pipeline #44287 completed with status: 'FAILURE'

@milesial

Collaborator Author

/bot run

@milesial

Collaborator Author

/bot run

@milesial

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #55591 [ run ] triggered by Bot. Commit: c3e1cb2 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55591 [ run ] completed with state FAILURE. Commit: c3e1cb2
/LLM/main/L0_MergeRequest_PR pipeline #44509 completed with status: 'FAILURE'

@milesial

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #55621 [ run ] triggered by Bot. Commit: c92c909 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55621 [ run ] completed with state FAILURE. Commit: c92c909
/LLM/main/L0_MergeRequest_PR pipeline #44540 completed with status: 'FAILURE'

Labels

api-compatible Accepted LLM API contract change that is backwards-compatible

2 participants