Skip to content

[TRTLLM-12622][feat] Add native post-processing hook to trtllm-serve#15239

Open
xwang233 wants to merge 19 commits into
NVIDIA:mainfrom
xwang233:trtllm-12622-post-processor-hook
Open

[TRTLLM-12622][feat] Add native post-processing hook to trtllm-serve#15239
xwang233 wants to merge 19 commits into
NVIDIA:mainfrom
xwang233:trtllm-12622-post-processor-hook

Conversation

@xwang233

@xwang233 xwang233 commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds a native, per-request post-processing hook to trtllm-serve (TRTLLM-12622). A user supplies an importable, picklable callable class via --post_processor (or the equivalent post_processor field in --extra_llm_api_options YAML). The hook runs at a single chokepoint after detokenization and before the per-endpoint response formatter, and may rewrite, suppress, or terminate the output. Each hook instance owns its own per-request state, keyed by request_id.

The hook works across the chat and completions endpoints (streaming and non-streaming) and with the post-processing worker pool both enabled and disabled.

Not client-bypassable (the hook is a server-side guardrail):

  • It runs even when a request sets detokenize=false — the server detokenizes for the hook regardless.
  • suppress/terminate withhold every client-visible channel — text, token ids, and logprobs — on both streaming and non-streaming paths.
  • Configurations the hook fundamentally cannot inspect are rejected up front: harmony/gpt-oss models (output reconstructed from raw token ids) at server startup, and skip_tokenizer_init (no detokenized text) at args validation.

Design

  • New module tensorrt_llm/executor/postprocessor_hook.py: PostProcChunk, PostProcVerdict, the PostProcAction enum (EMIT/SUPPRESS/TERMINATE, validated at construction so an unknown action cannot be smuggled in), the emit / suppress / terminate helpers, the PostProcessorHook protocol, the import-path loader (mirroring --custom_tokenizer), and apply_post_processor_hook.
  • Single chokepoint in DetokenizedGenerationResultBase._handle_response, which (via MRO) covers both the in-proxy and post-processing-worker detok paths.
  • Per-instance ownership (mirrors Triton's instance-owns-state model): each LLM and each post-processing worker builds its own hook instance from the import path — no process globals. BaseLLM builds self._post_processor_hook, threads it onto RequestOutput; PostprocWorker builds its own and attaches it per record. True no-op when unset; multiple LLMs in one process are independent.
  • emit(text) rewrites the text channel only; suppress/terminate withhold all channels. terminate cancels the engine request through the existing abort path (output-side only; no engine back-edge). emit/suppress act per output sequence; terminate cancels the whole request (the documented behavior under n>1/beam — no extra guard).
  • Error policy is fail-closed: a hook exception is re-raised (the worker path converts it to an ErrorResponse; the in-proxy path surfaces it to the serving handler). The server and sibling requests stay alive. For a safety guardrail, failing open would emit exactly the un-vetted content the hook exists to block.
  • proxy.py: late engine responses arriving after a terminate-driven record pop are dropped instead of raising KeyError.
  • Config surface: --post_processor CLI flag + BaseLlmArgs.post_processor (prototype) + an api_stability reference update.
  • Docs: docs/source/features/post-processor-hook.md.

Test Coverage

  • Unit — tests/unittest/executor/test_postprocessor_hook.py: verdict semantics (rewrite/suppress/terminate), all-channel withholding (streaming + non-streaming), per-request state isolation, enum action rejected at construction, fail-closed re-raise, import-path loader, independent per-instance state, harmony-rejection guard.
  • Unit — tests/unittest/executor/test_proxy_postproc_terminate.py: a late response after a terminate-driven pop is dropped without KeyError; a live final response is delivered and popped.
  • Unit — tests/unittest/llmapi/test_llm_args.py: post_processor + skip_tokenizer_init is rejected.
  • Integration (e2e) — tests/unittest/llmapi/apps/_test_openai_post_processor.py: chat & completions × streaming/non-streaming × num_postprocess_workers ∈ {0, 2} × rewrite/suppress/terminate, plus a detokenize=false-does-not-bypass-the-hook case.
  • api_stability reference updated for the new post_processor field.
  • Full suite passes on GB200: unit 21, api_stability 35, llm_args (post_processor) 1, integration 30 — 87 passed, 0 failed.

Notes

This is a prototype feature; the interface may change. Documented limitations: emit (rewrite) controls only the text channel and does not scrub the underlying token ids / logprobs, so a client reading those channels can still recover the original text — hard guardrails must use suppress/terminate, which withhold all channels. The reasoning/tool-call parsers run after the hook. Only the generation endpoints are covered (encode/embeddings/VisualGen are separate paths).

Summary by CodeRabbit

  • New Features

    • Added --post_processor option to customize output processing using user-supplied Python code
    • Supported operations: rewrite, suppress, or terminate generated output per request
  • Documentation

    • Added comprehensive guide for post-processor hook feature with usage examples and API reference
  • Tests

    • Added integration and unit tests validating post-processor behavior across streaming and non-streaming modes

xwang233 added 5 commits June 10, 2026 16:48
Add a user-pluggable, per-request, stateful post-processing hook for
trtllm-serve, equivalent to a Triton python-backend post-processor. The
hook runs after detokenization and before the per-endpoint response
formatter, and may rewrite, suppress, or terminate the streamed output.

- New executor/postprocessor_hook.py: PostProcChunk / PostProcVerdict,
  emit/suppress/terminate, the PostProcessorHook protocol, an import-path
  loader with a process build-once cache, and apply_post_processor_hook.
- Single chokepoint in DetokenizedGenerationResultBase._handle_response,
  covering both the postproc-worker path and the in-proxy RequestOutput
  path. The hook path is configured per process (set in BaseLLM.__init__
  and postproc_worker_main) via a new PostprocWorkerConfig.post_processor_hook.
- Config surface: BaseLlmArgs.post_processor (prototype) plus a
  trtllm-serve --post_processor flag (also settable via --extra_llm_api_options).
- terminate cancels the engine via result.abort() and forces the worker
  record done; suppress/terminate withhold the raw token-id/logprob channel
  too; hook exceptions are isolated per request (fail-open + logged).
- Reject --post_processor with harmony/gpt-oss models at startup (the
  harmony path rebuilds output from raw tokens and would bypass the hook).
- Unit tests for verdict semantics, per-request state isolation, the
  loader, and the process global; api_stability reference updated.

The hook is text-based and operates post-detok; a hook that rewrites or
suppresses text may desync stateful reasoning/tool parsers, and rewriting
text does not rewrite the underlying token ids.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
Launch a real trtllm-serve with --post_processor and assert the
client-visible effect (rewrite / suppress / terminate) across the chat
and completions endpoints, streaming and non-streaming, with the postproc
worker pool both disabled (in-proxy detok) and enabled (worker-process
detok).

- _postproc_hook_samples.py: stateless deterministic sample hooks
  (UppercaseHook / SuppressHook / TerminateHook).
- _test_openai_post_processor.py: the endpoint matrix on TinyLlama-1.1B.
- test_e2e.py::test_openai_post_processor wrapper + l0_a10 test-list entry.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
…ok terminate

When a post-processing hook returns `terminate`, the result is marked done and
popped from the proxy's `_results` map, but the engine can still emit in-flight
responses for the same client_id (abort is async, and the postproc worker
recreates a record for any late response). Those late responses reached
`process_res` after the result was removed, raising KeyError and recording a
fatal engine error that tore down the whole engine.

Look the result up with `.get()` and drop late responses for already-finalized
client_ids; make the corresponding pop idempotent. Verified on GB200: the
previously-failing test_chat_streaming[terminate-enable_processpool] now passes
(post-processor hook e2e suite 24/24).

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
Add a feature doc covering the native post-processing hook: how to enable it
via --post_processor (CLI and YAML), the PostProcChunk / verdict interface, three
worked examples (rewrite, stateful guardrail with terminate, suppress), per-request
state guidance, supported endpoints and limitations, and pointers to the unit and
e2e tests. Register it in the features toctree.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
- test: streaming terminate cases now assert finish_reason == "stop", so the
  hook (not an empty generation) is verified to be what stopped the stream
  (the cell that previously hit the proxy KeyError race).
- refactor: hoist the post-processor-hook import in result.py to module scope,
  off the per-chunk detok hot path (postprocessor_hook is stdlib-only, so there
  is no circular import).
- docs: clarify that PostProcChunk.is_final is request-level.

No behavior change to the serving path. GPU re-validation pending.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
@xwang233

Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53409 [ run ] triggered by Bot. Commit: b46c4e7 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53409 [ run ] completed with state FAILURE. Commit: b46c4e7
/LLM/main/L0_MergeRequest_PR pipeline #42581 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@xwang233

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@xwang233 xwang233 added the api-compatible Accepted LLM API contract change that is backwards-compatible label Jun 11, 2026
@xwang233 xwang233 marked this pull request as ready for review June 11, 2026 18:33
@xwang233 xwang233 requested review from a team as code owners June 11, 2026 18:33
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53644 [ run ] triggered by Bot. Commit: b46c4e7 Link to invocation

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR implements a post-processor hook system for trtllm-serve that allows users to supply a Python callable to rewrite, suppress, or terminate output chunks in real time across all backends and streaming/non-streaming modes.

Changes

Post-processor hook feature

Layer / File(s) Summary
Hook contract, data structures, and execution engine
tensorrt_llm/executor/postprocessor_hook.py
Core module defines process-scoped hook configuration, per-process hook instance cache, PostProcChunk and PostProcVerdict data contracts, emit/suppress/terminate verdict helpers, and apply_post_processor_hook implementation with exception isolation, text rewriting, stop/abort handling, and token channel suppression for suppressed/terminated chunks.
Hook integration into result detokenization and postproc workers
tensorrt_llm/executor/result.py, tensorrt_llm/executor/postproc_worker.py
Integrates hook invocation into DetokenizedGenerationResultBase after detokenization completion; configures PostprocWorkerConfig with hook import path; updates postproc_worker_main to initialize hook at startup; forces stream finalization when hook-induced termination occurs.
Configuration schema, initialization, and CLI wiring
tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/llmapi/llm.py, tensorrt_llm/commands/serve.py
Adds post_processor field to BaseLlmArgs as optional import path with prototype status; wires hook through LLM.init with eager resolution; configures hooks in TensorRT and PyTorch backend worker configs; adds --post_processor CLI option with override propagation.
Worker process pool integration
tensorrt_llm/executor/worker.py
Passes post_processor_hook configuration to postproc_worker_main when submitting postprocess worker tasks.
Robustness and constraint checks
tensorrt_llm/executor/proxy.py, tensorrt_llm/serve/openai_server.py
Hardens result dispatch to tolerate late responses via safe lookups; adds harmony/gpt-oss guardrail preventing text-based hooks from being silently bypassed by token-id-based reconstruction.
Unit tests for postprocessor_hook module
tests/unittest/executor/test_postprocessor_hook.py
Comprehensive test coverage for hook loading, caching, verdict actions, streaming/non-streaming diff rewriting, token channel suppression, per-request state isolation, and fail-open exception handling.
Sample hook implementations
tests/unittest/llmapi/apps/_postproc_hook_samples.py
Three deterministic sample hooks: UppercaseHook (text rewriting), SuppressHook (suppression), TerminateHook (termination).
End-to-end integration tests for OpenAI server
tests/unittest/llmapi/apps/_test_openai_post_processor.py
End-to-end tests validating hook behavior across OpenAI-compatible completions and chat endpoints in streaming and non-streaming modes.
Documentation, API reference, and CI configuration
docs/source/features/post-processor-hook.md, docs/source/index.rst, tests/unittest/api_stability/references/llm.yaml, tests/integration/defs/test_e2e.py, tests/integration/test_lists/test-db/l0_a10.yml
End-user feature guide covering configuration, hook interface, payload fields, verdict helpers, state management, and limitations; updates API stability schema and integration test configuration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • moraxu
  • Superjomn
  • venkywonka
  • chang-l
  • ruodil
  • Wanli-Jiang
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.06% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and concisely describes the main feature: adding a native post-processing hook to trtllm-serve. It follows the repository's title format with JIRA ticket and type prefix.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed PR description is comprehensive, covering the feature, design, test coverage, and notes on limitations. It follows the spirit of the template with clear sections.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/executor/worker.py (1)

278-285: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Postproc workers still defer hook validation to the first request.

This now forwards the import path, but tensorrt_llm/executor/postproc_worker.py:279-294 only stores it with set_configured_post_processor_hook(...). With postproc workers enabled, a bad hook import/constructor failure will still show up on the first chunk instead of during startup, and each worker misses the cache warm-up that BaseLLM.__init__ does in the proxy process. Resolve the hook once in postproc_worker_main when the worker starts.

Suggested follow-up in tensorrt_llm/executor/postproc_worker.py
 def postproc_worker_main(feedin_ipc_addr: tuple[str, Optional[bytes]],
                          feedout_ipc_addr: tuple[str, Optional[bytes]],
                          tokenizer_dir: str,
                          record_creator: Callable,
                          post_processor_hook: Optional[str] = None):
-    from .postprocessor_hook import set_configured_post_processor_hook
+    from .postprocessor_hook import (
+        get_post_processor_hook,
+        set_configured_post_processor_hook,
+    )
     set_configured_post_processor_hook(post_processor_hook)
+    if post_processor_hook:
+        get_post_processor_hook(post_processor_hook)
     worker = PostprocWorker(feedin_ipc_addr,
                             feedout_ipc_addr,
                             tokenizer_dir=tokenizer_dir,
                             record_creator=record_creator)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/executor/worker.py` around lines 278 - 285, postproc_worker_main
currently just stores the hook import path via
set_configured_post_processor_hook and defers import/constructor errors to the
first request; modify postproc_worker_main to resolve and instantiate the
post-processor hook during worker startup (using the
postproc_worker_config.post_processor_hook import path), validate it, and run
the same warm-up/caching that BaseLLM.__init__ performs so any
import/constructor failures surface immediately and each PostprocWorker gets the
warmed instance instead of delaying to the first chunk; update references around
postproc_worker_main, PostprocWorker.default_record_creator, and
set_configured_post_processor_hook to use the resolved/validated hook instance
rather than the raw import path.
🧹 Nitpick comments (1)
docs/source/features/post-processor-hook.md (1)

29-29: ⚡ Quick win

Use --config instead of --extra_llm_api_options in documentation.

As per coding guidelines, when documenting CLI commands for trtllm-serve, prefer using --config over --extra_llm_api_options for specifying configuration files. The --config flag is the preferred, shorter alias.

📝 Suggested change
-Equivalently, set it in a YAML config passed via `--extra_llm_api_options`:
+Equivalently, set it in a YAML config passed via `--config`:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/source/features/post-processor-hook.md` at line 29, Replace the CLI flag
usage in the docs: change the example that says "set it in a YAML config passed
via `--extra_llm_api_options`" to use the preferred shorter alias `--config`
instead (for the `trtllm-serve` CLI), i.e. update any occurrence of
`--extra_llm_api_options` to `--config` in the sentence referencing trtllm-serve
so the documentation uses the correct preferred flag.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/executor/postprocessor_hook.py`:
- Around line 25-37: The __all__ list in the module is not alphabetically
sorted; update the __all__ variable (the list containing "PostProcChunk",
"PostProcVerdict", "PostProcessorHook", "emit", "suppress", "terminate",
"apply_post_processor_hook", "load_post_processor_hook",
"get_post_processor_hook", "set_configured_post_processor_hook",
"get_configured_post_processor_hook") so its items are in ASCII/lexicographic
order (e.g., alphabetize by string) to satisfy RUF022 and keep imports
deterministic.
- Around line 83-88: When loading the hook from import_path, instantiate the
class (hook_class()) then validate that the resulting instance is callable; if
not, raise a TypeError so it fails at load time instead of silently becoming a
passthrough at runtime. Update the try/except branch that currently returns
hook_class() to create instance = hook_class(), check callable(instance) and
raise a descriptive TypeError (including import_path/class_name) when not
callable; this touches the import_path parsing and hook_class instantiation used
by apply_post_processor_hook.

In `@tensorrt_llm/llmapi/llm.py`:
- Around line 233-245: Detect and prevent process-global hook conflicts by
checking the currently configured global hook before calling
set_configured_post_processor_hook(self.args.post_processor); if a different
hook is already set (via get_post_processor_hook or equivalent) and this LLM
instance intends to register a different post_processor, fail fast
(raise/configuration error) or instead keep the hook scoped to the instance by
storing self._post_processor_hook and ensuring
DetokenizedGenerationResultBase._apply_post_processor_hook reads the
instance-level hook (fall back to the global only if instance hook is None).
Update the constructor logic around set_configured_post_processor_hook and
get_post_processor_hook so it either 1) compares and errors on conflicts, or 2)
abandons modifying the process-global and preserves per-instance hook stored on
the BaseLLM (self._post_processor_hook) that downstream code uses.

---

Outside diff comments:
In `@tensorrt_llm/executor/worker.py`:
- Around line 278-285: postproc_worker_main currently just stores the hook
import path via set_configured_post_processor_hook and defers import/constructor
errors to the first request; modify postproc_worker_main to resolve and
instantiate the post-processor hook during worker startup (using the
postproc_worker_config.post_processor_hook import path), validate it, and run
the same warm-up/caching that BaseLLM.__init__ performs so any
import/constructor failures surface immediately and each PostprocWorker gets the
warmed instance instead of delaying to the first chunk; update references around
postproc_worker_main, PostprocWorker.default_record_creator, and
set_configured_post_processor_hook to use the resolved/validated hook instance
rather than the raw import path.

---

Nitpick comments:
In `@docs/source/features/post-processor-hook.md`:
- Line 29: Replace the CLI flag usage in the docs: change the example that says
"set it in a YAML config passed via `--extra_llm_api_options`" to use the
preferred shorter alias `--config` instead (for the `trtllm-serve` CLI), i.e.
update any occurrence of `--extra_llm_api_options` to `--config` in the sentence
referencing trtllm-serve so the documentation uses the correct preferred flag.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3eac517f-99e0-427c-9520-f548b414057c

📥 Commits

Reviewing files that changed from the base of the PR and between 7301075 and b46c4e7.

📒 Files selected for processing (17)
  • docs/source/features/post-processor-hook.md
  • docs/source/index.rst
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/executor/postproc_worker.py
  • tensorrt_llm/executor/postprocessor_hook.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/executor/result.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/llmapi/llm.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/serve/openai_server.py
  • tests/integration/defs/test_e2e.py
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/unittest/api_stability/references/llm.yaml
  • tests/unittest/executor/test_postprocessor_hook.py
  • tests/unittest/llmapi/apps/_postproc_hook_samples.py
  • tests/unittest/llmapi/apps/_test_openai_post_processor.py

Comment thread tensorrt_llm/executor/postprocessor_hook.py
Comment thread tensorrt_llm/executor/postprocessor_hook.py
Comment thread tensorrt_llm/llmapi/llm.py Outdated
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #53644 [ run ] completed with state SUCCESS. Commit: b46c4e7
/LLM/main/L0_MergeRequest_PR pipeline #42785 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

xwang233 added 5 commits June 15, 2026 13:17
Address PR review feedback on the trtllm-serve post-processing hook:

- Fail fast on conflicting in-process hook registration: a second LLM in
  the same process with a different --post_processor now raises instead of
  silently clobbering the process-global and applying the wrong hook to the
  already-running instance. Re-registering the same path is a no-op; clearing
  to None (e.g. on shutdown) is always allowed.
- Validate the import path BEFORE recording the process-global in
  BaseLLM.__init__, so a bad path leaves no stale registration; clear the
  global in BaseLLM.shutdown() so sequential LLMs still work.
- Add a focused unit test for the proxy late-response drop (the .get() at
  dispatch_result_task): a response for an already-popped client_id after a
  terminate is dropped without KeyError.
- Doc: clarify that engine batching is transparent (one call per request,
  keyed by request_id).
- Drop an internal study-plan section reference from a code comment.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
Replace the process-global post-processing hook ownership with per-instance
ownership, so the hook is a property of the LLM (and of each post-processing
worker process), like the tokenizer it sits next to in BaseLlmArgs.

- postprocessor_hook.py: drop the process globals and their accessors
  (_CONFIGURED_HOOK_PATH, set_/get_configured_post_processor_hook,
  _HOOK_INSTANCE_CACHE, get_post_processor_hook) and the fail-fast conflict
  guard. Keep load_post_processor_hook / apply_post_processor_hook.
- result.py: DetokenizedGenerationResultBase carries its own
  _post_processor_hook; the detok chokepoint applies that instance attribute
  instead of reading a process global.
- llm.py: BaseLLM builds its own hook instance (eager import validation
  retained) and threads it onto the RequestOutput via _from_generation_result;
  remove the global registration and shutdown-clear.
- postproc_worker.py: each PostprocWorker builds and owns one hook instance
  (mirroring the tokenizer) and injects it onto every record it creates.
- Tests: replace the global/conflict/cache tests with instance-ownership
  tests; update the docs to describe per-instance ownership.

Independent LLM instances in one process now stay isolated without any
fail-fast guard, and the hook instance never crosses an IPC boundary (each
owner builds it from the import path).

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
…cessor-hook

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
Tighten the post-processing hook changes after the per-instance refactor:

- postprocessor_hook.py: fix the now-stale "built once per process" wording in
  the PostProcessorHook docstring (it is per-owner), drop a duplicated
  ownership paragraph in the module docstring and a redundant token-id NOTE,
  and shorten the is_final comment.
- postproc_worker.py: shorten the hook-injection comment.
- post-processor-hook.md: condense the batching note.
- test_postprocessor_hook.py: drop test_independent_instances_keep_separate_state
  — two distinct hook instances having separate state is tautological and the
  read-site wiring is already covered by the apply-method test.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
… flag

Address review of the post-processing hook branch (two-agent review).

Main fix: a `/v1/completions` request with `detokenize=false` previously
bypassed the hook entirely, because the hook ran inside the detokenize guard.
The hook is a server-side guardrail and must not be bypassable by a client
flag (matching Triton, where the post-processor is a mandatory ensemble stage).
The detok+hook block now runs when detokenize is set OR a post_processor is
configured; the returned channel is unchanged (the response formatter honors
the request's detokenize flag separately), and suppress/terminate still
withhold the token-id channel.

Also:
- Extract the harmony fail-fast into OpenAIServer._ensure_post_processor_supported
  and unit-test it directly.
- Add an e2e test that detokenize=false does not bypass the terminate hook.
- proxy.py / postprocessor_hook.py comment accuracy; postproc_worker.py uses
  pop(client_id, None) like its hardened siblings.
- Docs: not-bypassable guarantee, disagg note, soften the untested responses
  endpoint claim.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
xwang233 added 3 commits June 15, 2026 15:30
Close two more guardrail-bypass gaps found in review, by making the hook's
withholding contract complete and explicit rather than patching channels one
at a time.

- All-channel withholding: a channel×endpoint audit showed suppress/terminate
  only blanked the streaming diff views; the non-streaming formatters emit the
  FULL token_ids (completions) and FULL logprobs (chat + completions), which
  still leaked withheld content. _withhold_token_channel now truncates the full
  token_ids/logprobs to the presented prefix for non-streaming (mirroring how
  .text is blanked), keeping the proven streaming watermark path unchanged.
- Require a tokenizer: post_processor + skip_tokenizer_init is now rejected in
  BaseLlmArgs (the text-based hook has no text without a tokenizer), mirroring
  the harmony fail-fast.
- Docs: verdict table + limitations clarified (suppress/terminate withhold all
  channels; emit is text-only; tokenizer required).
- Tests: non-streaming full-channel suppress unit test; skip_tokenizer_init
  rejection test; e2e asserts token_ids are withheld under detokenize=false.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
Third review pass confirmed no bypass: the non-streaming response delivers the
content the hook emitted before the first suppress/terminate — identical to
what a streaming client receives, so a client cannot obtain more output by
choosing non-streaming. Only the wording was imprecise.

- _withhold_token_channel docstring: "already-presented" -> "already-emitted"
  prefix, and note the streaming-consistent semantics.
- Doc: state that verdicts are per-chunk (suppress withholds this chunk,
  terminate keeps prior chunks), consistent across streaming/non-streaming, and
  that all-or-nothing withholding means suppressing from the first chunk.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
Make the post-processing hook fail closed: a hook exception (or an
invalid verdict) now re-raises so the request errors instead of serving
the un-vetted chunk, matching Triton's per-request model. The in-proxy
path surfaces it to the serving handler; the worker path converts it to
an ErrorResponse. Server and sibling requests stay alive.

Replace the free-form verdict action string with a PostProcAction enum
validated in PostProcVerdict.__post_init__, so an unknown action can no
longer be smuggled past the dispatch.

Document the n>1 / beam behavior (emit/suppress are per output sequence;
terminate cancels the whole request) instead of assuming a single output,
and trim the over-verbose comments and JIRA references across the feature.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
@xwang233

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54400 [ run ] triggered by Bot. Commit: aa3a324 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54400 [ run ] completed with state SUCCESS. Commit: aa3a324
/LLM/main/L0_MergeRequest_PR pipeline #43475 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

The dedicated unit test files were not enumerated in any test-db list,
so CI never collected them. Add them to the executor unittest block in
l0_a10.yml so they run in the A10-PyTorch stage.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
@xwang233

Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2" --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54635 [ run ] triggered by Bot. Commit: f66eb4d Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54635 [ run ] completed with state SUCCESS. Commit: f66eb4d
/LLM/main/L0_MergeRequest_PR pipeline #43677 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@JunyiXu-nv JunyiXu-nv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Only one possible improvement.

Comment thread tensorrt_llm/executor/postprocessor_hook.py
xwang233 added 2 commits June 22, 2026 14:44
…cessor-hook

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
load_post_processor_hook now checks callable() on the instantiated hook
and raises a descriptive ValueError, so a non-callable import path fails
at load/startup instead of per-chunk at runtime (benefits CI). Addresses
the CodeRabbit suggestion acknowledged on PR NVIDIA#15239 (thread r3425180772).

Updates the loader unit tests to use a callable stand-in (MagicMock) and
adds a non-callable negative case (OrderedDict).

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
@xwang233

Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2" --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55090 [ run ] triggered by Bot. Commit: 689c8db Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55090 [ run ] completed with state SUCCESS. Commit: 689c8db
/LLM/main/L0_MergeRequest_PR pipeline #44077 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

Comment thread tests/unittest/executor/test_postprocessor_hook.py

@kaiyux kaiyux left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving on behalf of trt-llm-doc-owners.

Comment thread docs/source/features/post-processor-hook.md Outdated
xwang233 added 3 commits June 23, 2026 09:46
…licate docs

Address PR review feedback:
- Rename PostProcChunk/PostProcVerdict/PostProcAction to
  PostProcessorHookChunk/PostProcessorHookVerdict/PostProcessorHookAction so
  they read distinctly from the existing postproc-worker types (PostprocParams,
  PostprocWorker, PostprocWorkerConfig).
- Replace the duplicated PostProcessorHookChunk field table in the docs with a
  pointer to the authoritative dataclass to avoid drift.
- Use repo-relative paths for the source-file links instead of absolute
  github.com blob URLs.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
…cessor-hook

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
…ocessor_hook

Address PR review feedback (one level up from the type rename): the
user-facing CLI flag --post_processor and the BaseLlmArgs field
post_processor collided head-on with the pre-existing
PostprocParams.post_processor (the internal per-endpoint response
formatter), in the same postproc-worker subsystem.

Rename the user-facing surface to post_processor_hook, matching the
PostProcessorHook* types, the PostProcessorHook Protocol, the internal
PostprocWorkerConfig.post_processor_hook, and the doc file name. The
existing formatter PostprocParams.post_processor is left untouched.

Touches: CLI flag + plumbing (serve.py), BaseLlmArgs field + validation
(llm_args.py), eager-load (llm.py), harmony guard rename
(openai_server._ensure_post_processor_hook_supported), api_stability
reference, unit/e2e tests, and docs.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
@xwang233

Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2" --disable-fail-fast

1 similar comment
@tburt-nv

Copy link
Copy Markdown
Collaborator

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2" --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55342 [ run ] triggered by Bot. Commit: 5fd660e Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55342 [ run ] completed with state SUCCESS. Commit: 5fd660e
/LLM/main/L0_MergeRequest_PR pipeline #44296 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-compatible Accepted LLM API contract change that is backwards-compatible

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants