[TRTLLM-12622][feat] Add native post-processing hook to trtllm-serve by xwang233 · Pull Request #15239 · NVIDIA/TensorRT-LLM

xwang233 · 2026-06-11T00:30:50Z

Summary

Adds a native, per-request post-processing hook to trtllm-serve (TRTLLM-12622). A user supplies an importable, picklable callable class via --post_processor (or the equivalent post_processor field in --extra_llm_api_options YAML). The hook runs at a single chokepoint after detokenization and before the per-endpoint response formatter, and may rewrite, suppress, or terminate the output. Each hook instance owns its own per-request state, keyed by request_id.

The hook works across the chat and completions endpoints (streaming and non-streaming) and with the post-processing worker pool both enabled and disabled.

Not client-bypassable (the hook is a server-side guardrail):

It runs even when a request sets detokenize=false — the server detokenizes for the hook regardless.
suppress/terminate withhold every client-visible channel — text, token ids, and logprobs — on both streaming and non-streaming paths.
Configurations the hook fundamentally cannot inspect are rejected up front: harmony/gpt-oss models (output reconstructed from raw token ids) at server startup, and skip_tokenizer_init (no detokenized text) at args validation.

Design

New module tensorrt_llm/executor/postprocessor_hook.py: PostProcChunk, PostProcVerdict, the PostProcAction enum (EMIT/SUPPRESS/TERMINATE, validated at construction so an unknown action cannot be smuggled in), the emit / suppress / terminate helpers, the PostProcessorHook protocol, the import-path loader (mirroring --custom_tokenizer), and apply_post_processor_hook.
Single chokepoint in DetokenizedGenerationResultBase._handle_response, which (via MRO) covers both the in-proxy and post-processing-worker detok paths.
Per-instance ownership (mirrors Triton's instance-owns-state model): each LLM and each post-processing worker builds its own hook instance from the import path — no process globals. BaseLLM builds self._post_processor_hook, threads it onto RequestOutput; PostprocWorker builds its own and attaches it per record. True no-op when unset; multiple LLMs in one process are independent.
emit(text) rewrites the text channel only; suppress/terminate withhold all channels. terminate cancels the engine request through the existing abort path (output-side only; no engine back-edge). emit/suppress act per output sequence; terminate cancels the whole request (the documented behavior under n>1/beam — no extra guard).
Error policy is fail-closed: a hook exception is re-raised (the worker path converts it to an ErrorResponse; the in-proxy path surfaces it to the serving handler). The server and sibling requests stay alive. For a safety guardrail, failing open would emit exactly the un-vetted content the hook exists to block.
proxy.py: late engine responses arriving after a terminate-driven record pop are dropped instead of raising KeyError.
Config surface: --post_processor CLI flag + BaseLlmArgs.post_processor (prototype) + an api_stability reference update.
Docs: docs/source/features/post-processor-hook.md.

Test Coverage

Unit — tests/unittest/executor/test_postprocessor_hook.py: verdict semantics (rewrite/suppress/terminate), all-channel withholding (streaming + non-streaming), per-request state isolation, enum action rejected at construction, fail-closed re-raise, import-path loader, independent per-instance state, harmony-rejection guard.
Unit — tests/unittest/executor/test_proxy_postproc_terminate.py: a late response after a terminate-driven pop is dropped without KeyError; a live final response is delivered and popped.
Unit — tests/unittest/llmapi/test_llm_args.py: post_processor + skip_tokenizer_init is rejected.
Integration (e2e) — tests/unittest/llmapi/apps/_test_openai_post_processor.py: chat & completions × streaming/non-streaming × num_postprocess_workers ∈ {0, 2} × rewrite/suppress/terminate, plus a detokenize=false-does-not-bypass-the-hook case.
api_stability reference updated for the new post_processor field.
Full suite passes on GB200: unit 21, api_stability 35, llm_args (post_processor) 1, integration 30 — 87 passed, 0 failed.

Notes

This is a prototype feature; the interface may change. Documented limitations: emit (rewrite) controls only the text channel and does not scrub the underlying token ids / logprobs, so a client reading those channels can still recover the original text — hard guardrails must use suppress/terminate, which withhold all channels. The reasoning/tool-call parsers run after the hook. Only the generation endpoints are covered (encode/embeddings/VisualGen are separate paths).

Summary by CodeRabbit

New Features
- Added --post_processor option to customize output processing using user-supplied Python code
- Supported operations: rewrite, suppress, or terminate generated output per request
Documentation
- Added comprehensive guide for post-processor hook feature with usage examples and API reference
Tests
- Added integration and unit tests validating post-processor behavior across streaming and non-streaming modes

Add a user-pluggable, per-request, stateful post-processing hook for trtllm-serve, equivalent to a Triton python-backend post-processor. The hook runs after detokenization and before the per-endpoint response formatter, and may rewrite, suppress, or terminate the streamed output. - New executor/postprocessor_hook.py: PostProcChunk / PostProcVerdict, emit/suppress/terminate, the PostProcessorHook protocol, an import-path loader with a process build-once cache, and apply_post_processor_hook. - Single chokepoint in DetokenizedGenerationResultBase._handle_response, covering both the postproc-worker path and the in-proxy RequestOutput path. The hook path is configured per process (set in BaseLLM.__init__ and postproc_worker_main) via a new PostprocWorkerConfig.post_processor_hook. - Config surface: BaseLlmArgs.post_processor (prototype) plus a trtllm-serve --post_processor flag (also settable via --extra_llm_api_options). - terminate cancels the engine via result.abort() and forces the worker record done; suppress/terminate withhold the raw token-id/logprob channel too; hook exceptions are isolated per request (fail-open + logged). - Reject --post_processor with harmony/gpt-oss models at startup (the harmony path rebuilds output from raw tokens and would bypass the hook). - Unit tests for verdict semantics, per-request state isolation, the loader, and the process global; api_stability reference updated. The hook is text-based and operates post-detok; a hook that rewrites or suppresses text may desync stateful reasoning/tool parsers, and rewriting text does not rewrite the underlying token ids. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

Launch a real trtllm-serve with --post_processor and assert the client-visible effect (rewrite / suppress / terminate) across the chat and completions endpoints, streaming and non-streaming, with the postproc worker pool both disabled (in-proxy detok) and enabled (worker-process detok). - _postproc_hook_samples.py: stateless deterministic sample hooks (UppercaseHook / SuppressHook / TerminateHook). - _test_openai_post_processor.py: the endpoint matrix on TinyLlama-1.1B. - test_e2e.py::test_openai_post_processor wrapper + l0_a10 test-list entry. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

…ok terminate When a post-processing hook returns `terminate`, the result is marked done and popped from the proxy's `_results` map, but the engine can still emit in-flight responses for the same client_id (abort is async, and the postproc worker recreates a record for any late response). Those late responses reached `process_res` after the result was removed, raising KeyError and recording a fatal engine error that tore down the whole engine. Look the result up with `.get()` and drop late responses for already-finalized client_ids; make the corresponding pop idempotent. Verified on GB200: the previously-failing test_chat_streaming[terminate-enable_processpool] now passes (post-processor hook e2e suite 24/24). Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

Add a feature doc covering the native post-processing hook: how to enable it via --post_processor (CLI and YAML), the PostProcChunk / verdict interface, three worked examples (rewrite, stateful guardrail with terminate, suppress), per-request state guidance, supported endpoints and limitations, and pointers to the unit and e2e tests. Register it in the features toctree. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

- test: streaming terminate cases now assert finish_reason == "stop", so the hook (not an empty generation) is verified to be what stopped the stream (the cell that previously hit the proxy KeyError race). - refactor: hoist the post-processor-hook import in result.py to module scope, off the per-chunk detok hot path (postprocessor_hook is stdlib-only, so there is no circular import). - docs: clarify that PostProcChunk.is_final is request-level. No behavior change to the serving path. GPU re-validation pending. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

xwang233 · 2026-06-11T00:32:11Z

/bot run

tensorrt-cicd · 2026-06-11T00:38:27Z

PR_Github #53409 [ run ] triggered by Bot. Commit: b46c4e7 Link to invocation

tensorrt-cicd · 2026-06-11T04:06:22Z

PR_Github #53409 [ run ] completed with state FAILURE. Commit: b46c4e7
/LLM/main/L0_MergeRequest_PR pipeline #42581 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

xwang233 · 2026-06-11T18:32:51Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-11T18:39:25Z

PR_Github #53644 [ run ] triggered by Bot. Commit: b46c4e7 Link to invocation

coderabbitai · 2026-06-11T19:07:41Z

📝 Walkthrough

Walkthrough

This PR implements a post-processor hook system for trtllm-serve that allows users to supply a Python callable to rewrite, suppress, or terminate output chunks in real time across all backends and streaming/non-streaming modes.

Changes

Post-processor hook feature

Layer / File(s)	Summary
Hook contract, data structures, and execution engine `tensorrt_llm/executor/postprocessor_hook.py`	Core module defines process-scoped hook configuration, per-process hook instance cache, PostProcChunk and PostProcVerdict data contracts, emit/suppress/terminate verdict helpers, and apply_post_processor_hook implementation with exception isolation, text rewriting, stop/abort handling, and token channel suppression for suppressed/terminated chunks.
Hook integration into result detokenization and postproc workers `tensorrt_llm/executor/result.py`, `tensorrt_llm/executor/postproc_worker.py`	Integrates hook invocation into DetokenizedGenerationResultBase after detokenization completion; configures PostprocWorkerConfig with hook import path; updates postproc_worker_main to initialize hook at startup; forces stream finalization when hook-induced termination occurs.
Configuration schema, initialization, and CLI wiring `tensorrt_llm/llmapi/llm_args.py`, `tensorrt_llm/llmapi/llm.py`, `tensorrt_llm/commands/serve.py`	Adds post_processor field to BaseLlmArgs as optional import path with prototype status; wires hook through LLM.init with eager resolution; configures hooks in TensorRT and PyTorch backend worker configs; adds --post_processor CLI option with override propagation.
Worker process pool integration `tensorrt_llm/executor/worker.py`	Passes post_processor_hook configuration to postproc_worker_main when submitting postprocess worker tasks.
Robustness and constraint checks `tensorrt_llm/executor/proxy.py`, `tensorrt_llm/serve/openai_server.py`	Hardens result dispatch to tolerate late responses via safe lookups; adds harmony/gpt-oss guardrail preventing text-based hooks from being silently bypassed by token-id-based reconstruction.
Unit tests for postprocessor_hook module `tests/unittest/executor/test_postprocessor_hook.py`	Comprehensive test coverage for hook loading, caching, verdict actions, streaming/non-streaming diff rewriting, token channel suppression, per-request state isolation, and fail-open exception handling.
Sample hook implementations `tests/unittest/llmapi/apps/_postproc_hook_samples.py`	Three deterministic sample hooks: UppercaseHook (text rewriting), SuppressHook (suppression), TerminateHook (termination).
End-to-end integration tests for OpenAI server `tests/unittest/llmapi/apps/_test_openai_post_processor.py`	End-to-end tests validating hook behavior across OpenAI-compatible completions and chat endpoints in streaming and non-streaming modes.
Documentation, API reference, and CI configuration `docs/source/features/post-processor-hook.md`, `docs/source/index.rst`, `tests/unittest/api_stability/references/llm.yaml`, `tests/integration/defs/test_e2e.py`, `tests/integration/test_lists/test-db/l0_a10.yml`	End-user feature guide covering configuration, hook interface, payload fields, verdict helpers, state management, and limitations; updates API stability schema and integration test configuration.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

moraxu
Superjomn
venkywonka
chang-l
ruodil
Wanli-Jiang

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.06% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title clearly and concisely describes the main feature: adding a native post-processing hook to trtllm-serve. It follows the repository's title format with JIRA ticket and type prefix.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	PR description is comprehensive, covering the feature, design, test coverage, and notes on limitations. It follows the spirit of the template with clear sections.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/executor/worker.py (1)

278-285: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Postproc workers still defer hook validation to the first request.

This now forwards the import path, but tensorrt_llm/executor/postproc_worker.py:279-294 only stores it with set_configured_post_processor_hook(...). With postproc workers enabled, a bad hook import/constructor failure will still show up on the first chunk instead of during startup, and each worker misses the cache warm-up that BaseLLM.__init__ does in the proxy process. Resolve the hook once in postproc_worker_main when the worker starts.

Suggested follow-up in tensorrt_llm/executor/postproc_worker.py

 def postproc_worker_main(feedin_ipc_addr: tuple[str, Optional[bytes]],
                          feedout_ipc_addr: tuple[str, Optional[bytes]],
                          tokenizer_dir: str,
                          record_creator: Callable,
                          post_processor_hook: Optional[str] = None):
-    from .postprocessor_hook import set_configured_post_processor_hook
+    from .postprocessor_hook import (
+        get_post_processor_hook,
+        set_configured_post_processor_hook,
+    )
     set_configured_post_processor_hook(post_processor_hook)
+    if post_processor_hook:
+        get_post_processor_hook(post_processor_hook)
     worker = PostprocWorker(feedin_ipc_addr,
                             feedout_ipc_addr,
                             tokenizer_dir=tokenizer_dir,
                             record_creator=record_creator)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/executor/worker.py` around lines 278 - 285, postproc_worker_main
currently just stores the hook import path via
set_configured_post_processor_hook and defers import/constructor errors to the
first request; modify postproc_worker_main to resolve and instantiate the
post-processor hook during worker startup (using the
postproc_worker_config.post_processor_hook import path), validate it, and run
the same warm-up/caching that BaseLLM.__init__ performs so any
import/constructor failures surface immediately and each PostprocWorker gets the
warmed instance instead of delaying to the first chunk; update references around
postproc_worker_main, PostprocWorker.default_record_creator, and
set_configured_post_processor_hook to use the resolved/validated hook instance
rather than the raw import path.

🧹 Nitpick comments (1)

docs/source/features/post-processor-hook.md (1)
29-29: ⚡ Quick win

Use --config instead of --extra_llm_api_options in documentation.

As per coding guidelines, when documenting CLI commands for trtllm-serve, prefer using --config over --extra_llm_api_options for specifying configuration files. The --config flag is the preferred, shorter alias.
📝 Suggested change
-Equivalently, set it in a YAML config passed via `--extra_llm_api_options`:
+Equivalently, set it in a YAML config passed via `--config`:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/source/features/post-processor-hook.md` at line 29, Replace the CLI flag
usage in the docs: change the example that says "set it in a YAML config passed
via `--extra_llm_api_options`" to use the preferred shorter alias `--config`
instead (for the `trtllm-serve` CLI), i.e. update any occurrence of
`--extra_llm_api_options` to `--config` in the sentence referencing trtllm-serve
so the documentation uses the correct preferred flag.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/executor/postprocessor_hook.py`:
- Around line 25-37: The __all__ list in the module is not alphabetically
sorted; update the __all__ variable (the list containing "PostProcChunk",
"PostProcVerdict", "PostProcessorHook", "emit", "suppress", "terminate",
"apply_post_processor_hook", "load_post_processor_hook",
"get_post_processor_hook", "set_configured_post_processor_hook",
"get_configured_post_processor_hook") so its items are in ASCII/lexicographic
order (e.g., alphabetize by string) to satisfy RUF022 and keep imports
deterministic.
- Around line 83-88: When loading the hook from import_path, instantiate the
class (hook_class()) then validate that the resulting instance is callable; if
not, raise a TypeError so it fails at load time instead of silently becoming a
passthrough at runtime. Update the try/except branch that currently returns
hook_class() to create instance = hook_class(), check callable(instance) and
raise a descriptive TypeError (including import_path/class_name) when not
callable; this touches the import_path parsing and hook_class instantiation used
by apply_post_processor_hook.

In `@tensorrt_llm/llmapi/llm.py`:
- Around line 233-245: Detect and prevent process-global hook conflicts by
checking the currently configured global hook before calling
set_configured_post_processor_hook(self.args.post_processor); if a different
hook is already set (via get_post_processor_hook or equivalent) and this LLM
instance intends to register a different post_processor, fail fast
(raise/configuration error) or instead keep the hook scoped to the instance by
storing self._post_processor_hook and ensuring
DetokenizedGenerationResultBase._apply_post_processor_hook reads the
instance-level hook (fall back to the global only if instance hook is None).
Update the constructor logic around set_configured_post_processor_hook and
get_post_processor_hook so it either 1) compares and errors on conflicts, or 2)
abandons modifying the process-global and preserves per-instance hook stored on
the BaseLLM (self._post_processor_hook) that downstream code uses.

---

Outside diff comments:
In `@tensorrt_llm/executor/worker.py`:
- Around line 278-285: postproc_worker_main currently just stores the hook
import path via set_configured_post_processor_hook and defers import/constructor
errors to the first request; modify postproc_worker_main to resolve and
instantiate the post-processor hook during worker startup (using the
postproc_worker_config.post_processor_hook import path), validate it, and run
the same warm-up/caching that BaseLLM.__init__ performs so any
import/constructor failures surface immediately and each PostprocWorker gets the
warmed instance instead of delaying to the first chunk; update references around
postproc_worker_main, PostprocWorker.default_record_creator, and
set_configured_post_processor_hook to use the resolved/validated hook instance
rather than the raw import path.

---

Nitpick comments:
In `@docs/source/features/post-processor-hook.md`:
- Line 29: Replace the CLI flag usage in the docs: change the example that says
"set it in a YAML config passed via `--extra_llm_api_options`" to use the
preferred shorter alias `--config` instead (for the `trtllm-serve` CLI), i.e.
update any occurrence of `--extra_llm_api_options` to `--config` in the sentence
referencing trtllm-serve so the documentation uses the correct preferred flag.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3eac517f-99e0-427c-9520-f548b414057c

📥 Commits

Reviewing files that changed from the base of the PR and between 7301075 and b46c4e7.

📒 Files selected for processing (17)

docs/source/features/post-processor-hook.md
docs/source/index.rst
tensorrt_llm/commands/serve.py
tensorrt_llm/executor/postproc_worker.py
tensorrt_llm/executor/postprocessor_hook.py
tensorrt_llm/executor/proxy.py
tensorrt_llm/executor/result.py
tensorrt_llm/executor/worker.py
tensorrt_llm/llmapi/llm.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/serve/openai_server.py
tests/integration/defs/test_e2e.py
tests/integration/test_lists/test-db/l0_a10.yml
tests/unittest/api_stability/references/llm.yaml
tests/unittest/executor/test_postprocessor_hook.py
tests/unittest/llmapi/apps/_postproc_hook_samples.py
tests/unittest/llmapi/apps/_test_openai_post_processor.py

tensorrt-cicd · 2026-06-12T02:57:15Z

PR_Github #53644 [ run ] completed with state SUCCESS. Commit: b46c4e7
/LLM/main/L0_MergeRequest_PR pipeline #42785 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Address PR review feedback on the trtllm-serve post-processing hook: - Fail fast on conflicting in-process hook registration: a second LLM in the same process with a different --post_processor now raises instead of silently clobbering the process-global and applying the wrong hook to the already-running instance. Re-registering the same path is a no-op; clearing to None (e.g. on shutdown) is always allowed. - Validate the import path BEFORE recording the process-global in BaseLLM.__init__, so a bad path leaves no stale registration; clear the global in BaseLLM.shutdown() so sequential LLMs still work. - Add a focused unit test for the proxy late-response drop (the .get() at dispatch_result_task): a response for an already-popped client_id after a terminate is dropped without KeyError. - Doc: clarify that engine batching is transparent (one call per request, keyed by request_id). - Drop an internal study-plan section reference from a code comment. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

Replace the process-global post-processing hook ownership with per-instance ownership, so the hook is a property of the LLM (and of each post-processing worker process), like the tokenizer it sits next to in BaseLlmArgs. - postprocessor_hook.py: drop the process globals and their accessors (_CONFIGURED_HOOK_PATH, set_/get_configured_post_processor_hook, _HOOK_INSTANCE_CACHE, get_post_processor_hook) and the fail-fast conflict guard. Keep load_post_processor_hook / apply_post_processor_hook. - result.py: DetokenizedGenerationResultBase carries its own _post_processor_hook; the detok chokepoint applies that instance attribute instead of reading a process global. - llm.py: BaseLLM builds its own hook instance (eager import validation retained) and threads it onto the RequestOutput via _from_generation_result; remove the global registration and shutdown-clear. - postproc_worker.py: each PostprocWorker builds and owns one hook instance (mirroring the tokenizer) and injects it onto every record it creates. - Tests: replace the global/conflict/cache tests with instance-ownership tests; update the docs to describe per-instance ownership. Independent LLM instances in one process now stay isolated without any fail-fast guard, and the hook instance never crosses an IPC boundary (each owner builds it from the import path). Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

…cessor-hook Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

Tighten the post-processing hook changes after the per-instance refactor: - postprocessor_hook.py: fix the now-stale "built once per process" wording in the PostProcessorHook docstring (it is per-owner), drop a duplicated ownership paragraph in the module docstring and a redundant token-id NOTE, and shorten the is_final comment. - postproc_worker.py: shorten the hook-injection comment. - post-processor-hook.md: condense the batching note. - test_postprocessor_hook.py: drop test_independent_instances_keep_separate_state — two distinct hook instances having separate state is tautological and the read-site wiring is already covered by the apply-method test. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

… flag Address review of the post-processing hook branch (two-agent review). Main fix: a `/v1/completions` request with `detokenize=false` previously bypassed the hook entirely, because the hook ran inside the detokenize guard. The hook is a server-side guardrail and must not be bypassable by a client flag (matching Triton, where the post-processor is a mandatory ensemble stage). The detok+hook block now runs when detokenize is set OR a post_processor is configured; the returned channel is unchanged (the response formatter honors the request's detokenize flag separately), and suppress/terminate still withhold the token-id channel. Also: - Extract the harmony fail-fast into OpenAIServer._ensure_post_processor_supported and unit-test it directly. - Add an e2e test that detokenize=false does not bypass the terminate hook. - proxy.py / postprocessor_hook.py comment accuracy; postproc_worker.py uses pop(client_id, None) like its hardened siblings. - Docs: not-bypassable guarantee, disagg note, soften the untested responses endpoint claim. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

Close two more guardrail-bypass gaps found in review, by making the hook's withholding contract complete and explicit rather than patching channels one at a time. - All-channel withholding: a channel×endpoint audit showed suppress/terminate only blanked the streaming diff views; the non-streaming formatters emit the FULL token_ids (completions) and FULL logprobs (chat + completions), which still leaked withheld content. _withhold_token_channel now truncates the full token_ids/logprobs to the presented prefix for non-streaming (mirroring how .text is blanked), keeping the proven streaming watermark path unchanged. - Require a tokenizer: post_processor + skip_tokenizer_init is now rejected in BaseLlmArgs (the text-based hook has no text without a tokenizer), mirroring the harmony fail-fast. - Docs: verdict table + limitations clarified (suppress/terminate withhold all channels; emit is text-only; tokenizer required). - Tests: non-streaming full-channel suppress unit test; skip_tokenizer_init rejection test; e2e asserts token_ids are withheld under detokenize=false. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

Third review pass confirmed no bypass: the non-streaming response delivers the content the hook emitted before the first suppress/terminate — identical to what a streaming client receives, so a client cannot obtain more output by choosing non-streaming. Only the wording was imprecise. - _withhold_token_channel docstring: "already-presented" -> "already-emitted" prefix, and note the streaming-consistent semantics. - Doc: state that verdicts are per-chunk (suppress withholds this chunk, terminate keeps prior chunks), consistent across streaming/non-streaming, and that all-or-nothing withholding means suppressing from the first chunk. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

Make the post-processing hook fail closed: a hook exception (or an invalid verdict) now re-raises so the request errors instead of serving the un-vetted chunk, matching Triton's per-request model. The in-proxy path surfaces it to the serving handler; the worker path converts it to an ErrorResponse. Server and sibling requests stay alive. Replace the free-form verdict action string with a PostProcAction enum validated in PostProcVerdict.__post_init__, so an unknown action can no longer be smuggled past the dispatch. Document the n>1 / beam behavior (emit/suppress are per output sequence; terminate cancels the whole request) instead of assuming a single output, and trim the over-verbose comments and JIRA references across the feature. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

xwang233 · 2026-06-16T01:31:21Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-16T01:38:19Z

PR_Github #54400 [ run ] triggered by Bot. Commit: aa3a324 Link to invocation

tensorrt-cicd · 2026-06-16T08:28:27Z

PR_Github #54400 [ run ] completed with state SUCCESS. Commit: aa3a324
/LLM/main/L0_MergeRequest_PR pipeline #43475 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

The dedicated unit test files were not enumerated in any test-db list, so CI never collected them. Add them to the executor unittest block in l0_a10.yml so they run in the A10-PyTorch stage. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

xwang233 · 2026-06-16T16:47:21Z

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2" --disable-fail-fast

tensorrt-cicd · 2026-06-16T16:53:49Z

PR_Github #54635 [ run ] triggered by Bot. Commit: f66eb4d Link to invocation

tensorrt-cicd · 2026-06-16T19:37:34Z

PR_Github #54635 [ run ] completed with state SUCCESS. Commit: f66eb4d
/LLM/main/L0_MergeRequest_PR pipeline #43677 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

JunyiXu-nv

Overall LGTM. Only one possible improvement.

…cessor-hook Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

load_post_processor_hook now checks callable() on the instantiated hook and raises a descriptive ValueError, so a non-callable import path fails at load/startup instead of per-chunk at runtime (benefits CI). Addresses the CodeRabbit suggestion acknowledged on PR NVIDIA#15239 (thread r3425180772). Updates the loader unit tests to use a callable stand-in (MagicMock) and adds a non-callable negative case (OrderedDict). Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

xwang233 · 2026-06-22T22:21:14Z

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2" --disable-fail-fast

tensorrt-cicd · 2026-06-22T22:27:05Z

PR_Github #55090 [ run ] triggered by Bot. Commit: 689c8db Link to invocation

tensorrt-cicd · 2026-06-23T00:35:42Z

PR_Github #55090 [ run ] completed with state SUCCESS. Commit: 689c8db
/LLM/main/L0_MergeRequest_PR pipeline #44077 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

kaiyux

Approving on behalf of trt-llm-doc-owners.

…licate docs Address PR review feedback: - Rename PostProcChunk/PostProcVerdict/PostProcAction to PostProcessorHookChunk/PostProcessorHookVerdict/PostProcessorHookAction so they read distinctly from the existing postproc-worker types (PostprocParams, PostprocWorker, PostprocWorkerConfig). - Replace the duplicated PostProcessorHookChunk field table in the docs with a pointer to the authoritative dataclass to avoid drift. - Use repo-relative paths for the source-file links instead of absolute github.com blob URLs. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

…cessor-hook Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

…ocessor_hook Address PR review feedback (one level up from the type rename): the user-facing CLI flag --post_processor and the BaseLlmArgs field post_processor collided head-on with the pre-existing PostprocParams.post_processor (the internal per-endpoint response formatter), in the same postproc-worker subsystem. Rename the user-facing surface to post_processor_hook, matching the PostProcessorHook* types, the PostProcessorHook Protocol, the internal PostprocWorkerConfig.post_processor_hook, and the doc file name. The existing formatter PostprocParams.post_processor is left untouched. Touches: CLI flag + plumbing (serve.py), BaseLlmArgs field + validation (llm_args.py), eager-load (llm.py), harmony guard rename (openai_server._ensure_post_processor_hook_supported), api_stability reference, unit/e2e tests, and docs. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

xwang233 · 2026-06-23T22:00:09Z

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2" --disable-fail-fast

tburt-nv · 2026-06-23T23:08:12Z

/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2" --disable-fail-fast

tensorrt-cicd · 2026-06-23T23:17:39Z

PR_Github #55342 [ run ] triggered by Bot. Commit: 5fd660e Link to invocation

tensorrt-cicd · 2026-06-24T01:30:50Z

PR_Github #55342 [ run ] completed with state SUCCESS. Commit: 5fd660e
/LLM/main/L0_MergeRequest_PR pipeline #44296 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

xwang233 added 5 commits June 10, 2026 16:48

github-actions Bot assigned xwang233 Jun 11, 2026

xwang233 added the api-compatible Accepted LLM API contract change that is backwards-compatible label Jun 11, 2026

xwang233 marked this pull request as ready for review June 11, 2026 18:33

xwang233 requested review from a team as code owners June 11, 2026 18:33

xwang233 requested review from chang-l, laikhtewari, mikeiovine and suyoggupta June 11, 2026 18:33

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread tensorrt_llm/executor/postprocessor_hook.py

Comment thread tensorrt_llm/executor/postprocessor_hook.py

Comment thread tensorrt_llm/llmapi/llm.py Outdated

xinhe-nv approved these changes Jun 12, 2026

View reviewed changes

xwang233 added 5 commits June 15, 2026 13:17

Merge remote-tracking branch 'origin/main' into trtllm-12622-post-pro…

696fa4e

…cessor-hook Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

xwang233 added 3 commits June 15, 2026 15:30

xwang233 requested review from JunyiXu-nv, QiJune, kaiyux and venkywonka June 16, 2026 01:35

JunyiXu-nv approved these changes Jun 17, 2026

View reviewed changes

Comment thread tensorrt_llm/executor/postprocessor_hook.py

xwang233 added 2 commits June 22, 2026 14:44

Merge remote-tracking branch 'origin/main' into trtllm-12622-post-pro…

4dbe9f0

…cessor-hook Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

hchings approved these changes Jun 23, 2026

View reviewed changes

Comment thread tests/unittest/executor/test_postprocessor_hook.py

kaiyux approved these changes Jun 23, 2026

View reviewed changes

Comment thread docs/source/features/post-processor-hook.md Outdated

xwang233 added 3 commits June 23, 2026 09:46

Merge remote-tracking branch 'origin/main' into trtllm-12622-post-pro…

d4c096f

…cessor-hook Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>

Conversation

xwang233 commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design

Test Coverage

Notes

Summary by CodeRabbit

Uh oh!

xwang233 commented Jun 11, 2026

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

xwang233 commented Jun 11, 2026

Uh oh!

tensorrt-cicd commented Jun 11, 2026

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tensorrt-cicd commented Jun 12, 2026

Uh oh!

xwang233 commented Jun 16, 2026

Uh oh!

tensorrt-cicd commented Jun 16, 2026

Uh oh!

tensorrt-cicd commented Jun 16, 2026

Uh oh!

xwang233 commented Jun 16, 2026

Uh oh!

tensorrt-cicd commented Jun 16, 2026

Uh oh!

tensorrt-cicd commented Jun 16, 2026

Uh oh!

JunyiXu-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xwang233 commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 22, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

Uh oh!

kaiyux left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xwang233 commented Jun 23, 2026

Uh oh!

tburt-nv commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 23, 2026

Uh oh!

tensorrt-cicd commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

xwang233 commented Jun 11, 2026 •

edited

Loading

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading