[TRTLLM-12622][feat] Add native post-processing hook to trtllm-serve #15239
xwang233 wants to merge 19 commits into
Add a user-pluggable, per-request, stateful post-processing hook for trtllm-serve, equivalent to a Triton python-backend post-processor. The hook runs after detokenization and before the per-endpoint response formatter, and may rewrite, suppress, or terminate the streamed output.

- New executor/postprocessor_hook.py: PostProcChunk / PostProcVerdict, emit/suppress/terminate, the PostProcessorHook protocol, an import-path loader with a process build-once cache, and apply_post_processor_hook.
- Single chokepoint in DetokenizedGenerationResultBase._handle_response, covering both the postproc-worker path and the in-proxy RequestOutput path. The hook path is configured per process (set in BaseLLM.__init__ and postproc_worker_main) via a new PostprocWorkerConfig.post_processor_hook.
- Config surface: BaseLlmArgs.post_processor (prototype) plus a trtllm-serve --post_processor flag (also settable via --extra_llm_api_options).
- terminate cancels the engine via result.abort() and forces the worker record done; suppress/terminate withhold the raw token-id/logprob channel too; hook exceptions are isolated per request (fail-open + logged).
- Reject --post_processor with harmony/gpt-oss models at startup (the harmony path rebuilds output from raw tokens and would bypass the hook).
- Unit tests for verdict semantics, per-request state isolation, the loader, and the process global; api_stability reference updated.

The hook is text-based and operates post-detok; a hook that rewrites or suppresses text may desync stateful reasoning/tool parsers, and rewriting text does not rewrite the underlying token ids.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
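The hook shape described in this commit can be sketched roughly as follows. This is a self-contained illustration that does not import `tensorrt_llm`: the `Chunk` and `Verdict` classes, their field names, and the `hook(chunk) -> verdict` call signature are stand-in assumptions, and the real `PostProcChunk` / `PostProcVerdict` definitions in `executor/postprocessor_hook.py` may differ.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for PostProcChunk (field names are assumptions)."""
    request_id: int
    text: str
    is_final: bool = False


@dataclass
class Verdict:
    """Hypothetical stand-in for PostProcVerdict."""
    action: str  # "emit", "suppress", or "terminate"
    text: Optional[str] = None


def emit(text: str) -> Verdict:
    return Verdict("emit", text)


def suppress() -> Verdict:
    return Verdict("suppress")


def terminate() -> Verdict:
    return Verdict("terminate")


class BlocklistHook:
    """Stateful guardrail: uppercases text, terminates once a banned word appears."""

    def __init__(self, banned: str = "secret") -> None:
        self.banned = banned
        self._seen: Dict[int, str] = {}  # accumulated text, keyed by request_id

    def __call__(self, chunk: Chunk) -> Verdict:
        seen = self._seen.get(chunk.request_id, "") + chunk.text
        self._seen[chunk.request_id] = seen
        if self.banned in seen.lower():
            # The banned word may span several chunks, hence the accumulation.
            self._seen.pop(chunk.request_id, None)
            return terminate()
        if chunk.is_final:
            self._seen.pop(chunk.request_id, None)  # release per-request state
        if not chunk.text:
            return suppress()
        return emit(chunk.text.upper())
```

Keeping the state in a dict keyed by `request_id` is one way to get the per-request isolation the commit mentions; the sketch clears the entry on the final chunk and on terminate so state does not accumulate across requests.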
Launch a real trtllm-serve with --post_processor and assert the client-visible effect (rewrite / suppress / terminate) across the chat and completions endpoints, streaming and non-streaming, with the postproc worker pool both disabled (in-proxy detok) and enabled (worker-process detok).

- _postproc_hook_samples.py: stateless deterministic sample hooks (UppercaseHook / SuppressHook / TerminateHook).
- _test_openai_post_processor.py: the endpoint matrix on TinyLlama-1.1B.
- test_e2e.py::test_openai_post_processor wrapper + l0_a10 test-list entry.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
…ok terminate

When a post-processing hook returns `terminate`, the result is marked done and popped from the proxy's `_results` map, but the engine can still emit in-flight responses for the same client_id (abort is async, and the postproc worker recreates a record for any late response). Those late responses reached `process_res` after the result was removed, raising KeyError and recording a fatal engine error that tore down the whole engine.

Look the result up with `.get()` and drop late responses for already-finalized client_ids; make the corresponding pop idempotent.

Verified on GB200: the previously-failing test_chat_streaming[terminate-enable_processpool] now passes (post-processor hook e2e suite 24/24).

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
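The fix pattern in this commit (a tolerant lookup plus an idempotent pop) can be sketched in isolation. `ResultRegistry` and its method names are hypothetical stand-ins for the proxy's `_results` handling, not the actual `proxy.py` code.

```python
from typing import Dict, List


class ResultRegistry:
    """Hypothetical stand-in for the proxy's client_id -> result map."""

    def __init__(self) -> None:
        self._results: Dict[int, List[str]] = {}
        self.dropped = 0  # count of late responses that were discarded

    def register(self, client_id: int) -> None:
        self._results[client_id] = []

    def finalize(self, client_id: int) -> None:
        # Idempotent: popping an already-removed client_id is a no-op.
        self._results.pop(client_id, None)

    def dispatch(self, client_id: int, payload: str) -> bool:
        # .get() instead of [] so a late response cannot raise KeyError.
        result = self._results.get(client_id)
        if result is None:
            self.dropped += 1
            return False
        result.append(payload)
        return True
```

With this shape, a response that arrives after `finalize` is counted and dropped rather than escalating into a fatal error for every other request.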
Add a feature doc covering the native post-processing hook: how to enable it via --post_processor (CLI and YAML), the PostProcChunk / verdict interface, three worked examples (rewrite, stateful guardrail with terminate, suppress), per-request state guidance, supported endpoints and limitations, and pointers to the unit and e2e tests. Register it in the features toctree. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
- test: streaming terminate cases now assert finish_reason == "stop", so the hook (not an empty generation) is verified to be what stopped the stream (the cell that previously hit the proxy KeyError race).
- refactor: hoist the post-processor-hook import in result.py to module scope, off the per-chunk detok hot path (postprocessor_hook is stdlib-only, so there is no circular import).
- docs: clarify that PostProcChunk.is_final is request-level.

No behavior change to the serving path. GPU re-validation pending.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
/bot run
PR_Github #53409 [ run ] triggered by Bot. Commit:
PR_Github #53409 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #53644 [ run ] triggered by Bot. Commit:
Walkthrough
This PR implements a post-processor hook system for
Changes: Post-processor hook feature
Estimated code review effort: 4 (Complex), ~60 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/executor/worker.py (1)
278-285: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Postproc workers still defer hook validation to the first request.
This now forwards the import path, but `tensorrt_llm/executor/postproc_worker.py:279-294` only stores it with `set_configured_post_processor_hook(...)`. With postproc workers enabled, a bad hook import/constructor failure will still show up on the first chunk instead of during startup, and each worker misses the cache warm-up that `BaseLLM.__init__` does in the proxy process. Resolve the hook once in `postproc_worker_main` when the worker starts.
Suggested follow-up in `tensorrt_llm/executor/postproc_worker.py`:

     def postproc_worker_main(feedin_ipc_addr: tuple[str, Optional[bytes]],
                              feedout_ipc_addr: tuple[str, Optional[bytes]],
                              tokenizer_dir: str,
                              record_creator: Callable,
                              post_processor_hook: Optional[str] = None):
    -    from .postprocessor_hook import set_configured_post_processor_hook
    +    from .postprocessor_hook import (
    +        get_post_processor_hook,
    +        set_configured_post_processor_hook,
    +    )
         set_configured_post_processor_hook(post_processor_hook)
    +    if post_processor_hook:
    +        get_post_processor_hook(post_processor_hook)
         worker = PostprocWorker(feedin_ipc_addr,
                                 feedout_ipc_addr,
                                 tokenizer_dir=tokenizer_dir,
                                 record_creator=record_creator)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/executor/worker.py` around lines 278 - 285, postproc_worker_main currently just stores the hook import path via set_configured_post_processor_hook and defers import/constructor errors to the first request; modify postproc_worker_main to resolve and instantiate the post-processor hook during worker startup (using the postproc_worker_config.post_processor_hook import path), validate it, and run the same warm-up/caching that BaseLLM.__init__ performs so any import/constructor failures surface immediately and each PostprocWorker gets the warmed instance instead of delaying to the first chunk; update references around postproc_worker_main, PostprocWorker.default_record_creator, and set_configured_post_processor_hook to use the resolved/validated hook instance rather than the raw import path.
🧹 Nitpick comments (1)
docs/source/features/post-processor-hook.md (1)
29-29: ⚡ Quick win
Use `--config` instead of `--extra_llm_api_options` in documentation.
As per coding guidelines, when documenting CLI commands for `trtllm-serve`, prefer using `--config` over `--extra_llm_api_options` for specifying configuration files. The `--config` flag is the preferred, shorter alias.
📝 Suggested change

    -Equivalently, set it in a YAML config passed via `--extra_llm_api_options`:
    +Equivalently, set it in a YAML config passed via `--config`:

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/source/features/post-processor-hook.md` at line 29, Replace the CLI flag usage in the docs: change the example that says "set it in a YAML config passed via `--extra_llm_api_options`" to use the preferred shorter alias `--config` instead (for the `trtllm-serve` CLI), i.e. update any occurrence of `--extra_llm_api_options` to `--config` in the sentence referencing trtllm-serve so the documentation uses the correct preferred flag.
Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/executor/postprocessor_hook.py`:
- Around line 25-37: The __all__ list in the module is not alphabetically
sorted; update the __all__ variable (the list containing "PostProcChunk",
"PostProcVerdict", "PostProcessorHook", "emit", "suppress", "terminate",
"apply_post_processor_hook", "load_post_processor_hook",
"get_post_processor_hook", "set_configured_post_processor_hook",
"get_configured_post_processor_hook") so its items are in ASCII/lexicographic
order (e.g., alphabetize by string) to satisfy RUF022 and keep imports
deterministic.
- Around line 83-88: When loading the hook from import_path, instantiate the
class (hook_class()) then validate that the resulting instance is callable; if
not, raise a TypeError so it fails at load time instead of silently becoming a
passthrough at runtime. Update the try/except branch that currently returns
hook_class() to create instance = hook_class(), check callable(instance) and
raise a descriptive TypeError (including import_path/class_name) when not
callable; this touches the import_path parsing and hook_class instantiation used
by apply_post_processor_hook.
In `@tensorrt_llm/llmapi/llm.py`:
- Around line 233-245: Detect and prevent process-global hook conflicts by
checking the currently configured global hook before calling
set_configured_post_processor_hook(self.args.post_processor); if a different
hook is already set (via get_post_processor_hook or equivalent) and this LLM
instance intends to register a different post_processor, fail fast
(raise/configuration error) or instead keep the hook scoped to the instance by
storing self._post_processor_hook and ensuring
DetokenizedGenerationResultBase._apply_post_processor_hook reads the
instance-level hook (fall back to the global only if instance hook is None).
Update the constructor logic around set_configured_post_processor_hook and
get_post_processor_hook so it either 1) compares and errors on conflicts, or 2)
abandons modifying the process-global and preserves per-instance hook stored on
the BaseLLM (self._post_processor_hook) that downstream code uses.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3eac517f-99e0-427c-9520-f548b414057c
📒 Files selected for processing (17)
- docs/source/features/post-processor-hook.md
- docs/source/index.rst
- tensorrt_llm/commands/serve.py
- tensorrt_llm/executor/postproc_worker.py
- tensorrt_llm/executor/postprocessor_hook.py
- tensorrt_llm/executor/proxy.py
- tensorrt_llm/executor/result.py
- tensorrt_llm/executor/worker.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/serve/openai_server.py
- tests/integration/defs/test_e2e.py
- tests/integration/test_lists/test-db/l0_a10.yml
- tests/unittest/api_stability/references/llm.yaml
- tests/unittest/executor/test_postprocessor_hook.py
- tests/unittest/llmapi/apps/_postproc_hook_samples.py
- tests/unittest/llmapi/apps/_test_openai_post_processor.py
PR_Github #53644 [ run ] completed with state
Address PR review feedback on the trtllm-serve post-processing hook:

- Fail fast on conflicting in-process hook registration: a second LLM in the same process with a different --post_processor now raises instead of silently clobbering the process-global and applying the wrong hook to the already-running instance. Re-registering the same path is a no-op; clearing to None (e.g. on shutdown) is always allowed.
- Validate the import path BEFORE recording the process-global in BaseLLM.__init__, so a bad path leaves no stale registration; clear the global in BaseLLM.shutdown() so sequential LLMs still work.
- Add a focused unit test for the proxy late-response drop (the .get() at dispatch_result_task): a response for an already-popped client_id after a terminate is dropped without KeyError.
- Doc: clarify that engine batching is transparent (one call per request, keyed by request_id).
- Drop an internal study-plan section reference from a code comment.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
Replace the process-global post-processing hook ownership with per-instance ownership, so the hook is a property of the LLM (and of each post-processing worker process), like the tokenizer it sits next to in BaseLlmArgs.

- postprocessor_hook.py: drop the process globals and their accessors (_CONFIGURED_HOOK_PATH, set_/get_configured_post_processor_hook, _HOOK_INSTANCE_CACHE, get_post_processor_hook) and the fail-fast conflict guard. Keep load_post_processor_hook / apply_post_processor_hook.
- result.py: DetokenizedGenerationResultBase carries its own _post_processor_hook; the detok chokepoint applies that instance attribute instead of reading a process global.
- llm.py: BaseLLM builds its own hook instance (eager import validation retained) and threads it onto the RequestOutput via _from_generation_result; remove the global registration and shutdown-clear.
- postproc_worker.py: each PostprocWorker builds and owns one hook instance (mirroring the tokenizer) and injects it onto every record it creates.
- Tests: replace the global/conflict/cache tests with instance-ownership tests; update the docs to describe per-instance ownership.

Independent LLM instances in one process now stay isolated without any fail-fast guard, and the hook instance never crosses an IPC boundary (each owner builds it from the import path).

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
…cessor-hook Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
Tighten the post-processing hook changes after the per-instance refactor:

- postprocessor_hook.py: fix the now-stale "built once per process" wording in the PostProcessorHook docstring (it is per-owner), drop a duplicated ownership paragraph in the module docstring and a redundant token-id NOTE, and shorten the is_final comment.
- postproc_worker.py: shorten the hook-injection comment.
- post-processor-hook.md: condense the batching note.
- test_postprocessor_hook.py: drop test_independent_instances_keep_separate_state, since two distinct hook instances having separate state is tautological and the read-site wiring is already covered by the apply-method test.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
… flag

Address review of the post-processing hook branch (two-agent review).

Main fix: a `/v1/completions` request with `detokenize=false` previously bypassed the hook entirely, because the hook ran inside the detokenize guard. The hook is a server-side guardrail and must not be bypassable by a client flag (matching Triton, where the post-processor is a mandatory ensemble stage). The detok+hook block now runs when detokenize is set OR a post_processor is configured; the returned channel is unchanged (the response formatter honors the request's detokenize flag separately), and suppress/terminate still withhold the token-id channel.

Also:

- Extract the harmony fail-fast into OpenAIServer._ensure_post_processor_supported and unit-test it directly.
- Add an e2e test that detokenize=false does not bypass the terminate hook.
- proxy.py / postprocessor_hook.py comment accuracy; postproc_worker.py uses pop(client_id, None) like its hardened siblings.
- Docs: not-bypassable guarantee, disagg note, soften the untested responses endpoint claim.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
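The guard change described above can be illustrated with a simplified sketch. Everything here is an assumption made for illustration: `handle_chunk`, the `decode` callable, and the hook signature (`str -> Optional[str]`, where `None` means "withhold") are not the actual `result.py` code.

```python
from typing import Callable, Dict, List, Optional


def handle_chunk(
    token_ids: List[int],
    detokenize: bool,
    decode: Callable[[List[int]], str],
    hook: Optional[Callable[[str], Optional[str]]] = None,
) -> Dict[str, object]:
    # Detokenize when the client asked for text OR a hook is configured,
    # so a client-side detokenize=false cannot skip the hook.
    run_detok = detokenize or hook is not None
    text = decode(token_ids) if run_detok else None

    if hook is not None:
        text = hook(text)
        if text is None:
            # Withheld: blank the token-id channel along with the text.
            return {"text": None, "token_ids": []}

    # The formatter still honors the client's flag for what is returned.
    return {"text": text if detokenize else None, "token_ids": list(token_ids)}
```

The point of the sketch is that the decision to run the hook and the decision about which channel to return are made separately.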
Close two more guardrail-bypass gaps found in review, by making the hook's withholding contract complete and explicit rather than patching channels one at a time.

- All-channel withholding: a channel×endpoint audit showed suppress/terminate only blanked the streaming diff views; the non-streaming formatters emit the FULL token_ids (completions) and FULL logprobs (chat + completions), which still leaked withheld content. _withhold_token_channel now truncates the full token_ids/logprobs to the presented prefix for non-streaming (mirroring how .text is blanked), keeping the proven streaming watermark path unchanged.
- Require a tokenizer: post_processor + skip_tokenizer_init is now rejected in BaseLlmArgs (the text-based hook has no text without a tokenizer), mirroring the harmony fail-fast.
- Docs: verdict table + limitations clarified (suppress/terminate withhold all channels; emit is text-only; tokenizer required).
- Tests: non-streaming full-channel suppress unit test; skip_tokenizer_init rejection test; e2e asserts token_ids are withheld under detokenize=false.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
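The prefix truncation described for the non-streaming path might look roughly like this. `withhold_to_prefix` and its `emitted_len` parameter are hypothetical names; the actual `_withhold_token_channel` implementation and how it tracks the emitted prefix are not shown in this conversation.

```python
from typing import List, Sequence, Tuple


def withhold_to_prefix(
    token_ids: Sequence[int],
    logprobs: Sequence[float],
    emitted_len: int,
) -> Tuple[List[int], List[float]]:
    """Keep only the tokens/logprobs that were already emitted to the client."""
    # Clamp so a bad watermark can never expose more than exists or go negative.
    n = max(0, min(emitted_len, len(token_ids)))
    return list(token_ids[:n]), list(logprobs[:n])
```

Truncating both channels with the same length keeps token ids and logprobs aligned, so the non-streaming response carries no more than a streaming client would have received.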
Third review pass confirmed no bypass: the non-streaming response delivers the content the hook emitted before the first suppress/terminate, identical to what a streaming client receives, so a client cannot obtain more output by choosing non-streaming. Only the wording was imprecise.

- _withhold_token_channel docstring: "already-presented" -> "already-emitted" prefix, and note the streaming-consistent semantics.
- Doc: state that verdicts are per-chunk (suppress withholds this chunk, terminate keeps prior chunks), consistent across streaming/non-streaming, and that all-or-nothing withholding means suppressing from the first chunk.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
Make the post-processing hook fail closed: a hook exception (or an invalid verdict) now re-raises so the request errors instead of serving the un-vetted chunk, matching Triton's per-request model. The in-proxy path surfaces it to the serving handler; the worker path converts it to an ErrorResponse. Server and sibling requests stay alive. Replace the free-form verdict action string with a PostProcAction enum validated in PostProcVerdict.__post_init__, so an unknown action can no longer be smuggled past the dispatch. Document the n>1 / beam behavior (emit/suppress are per output sequence; terminate cancels the whole request) instead of assuming a single output, and trim the over-verbose comments and JIRA references across the feature. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
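A rough sketch of the two ideas in this commit: an enum-typed action validated in `__post_init__`, and a fail-closed apply step that lets hook errors propagate. The class and function names below are illustrative stand-ins, not the actual `PostProcAction` / `PostProcVerdict` / `apply_post_processor_hook` definitions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    EMIT = "emit"
    SUPPRESS = "suppress"
    TERMINATE = "terminate"


@dataclass(frozen=True)
class Verdict:
    action: Action
    text: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject free-form strings such as "emit"; only enum members pass.
        if not isinstance(self.action, Action):
            raise ValueError(f"invalid action: {self.action!r}")


def apply_hook(hook: Callable[[str], Verdict], text: str) -> Verdict:
    """Fail closed: hook exceptions and invalid verdicts propagate to the caller."""
    verdict = hook(text)  # deliberately no try/except around the hook call
    if not isinstance(verdict, Verdict):
        raise TypeError(f"hook returned {type(verdict).__name__}, expected Verdict")
    return verdict
```

Because nothing is caught here, the caller decides how to turn the error into a per-request failure, which matches the commit's description of the request erroring while the server stays alive.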
/bot run --disable-fail-fast
PR_Github #54400 [ run ] triggered by Bot. Commit:
PR_Github #54400 [ run ] completed with state
The dedicated unit test files were not enumerated in any test-db list, so CI never collected them. Add them to the executor unittest block in l0_a10.yml so they run in the A10-PyTorch stage. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2" --disable-fail-fast
PR_Github #54635 [ run ] triggered by Bot. Commit:
PR_Github #54635 [ run ] completed with state
JunyiXu-nv left a comment
Overall LGTM. Only one possible improvement.
…cessor-hook Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
load_post_processor_hook now checks callable() on the instantiated hook and raises a descriptive ValueError, so a non-callable import path fails at load/startup instead of per-chunk at runtime (benefits CI). Addresses the CodeRabbit suggestion acknowledged on PR NVIDIA#15239 (thread r3425180772). Updates the loader unit tests to use a callable stand-in (MagicMock) and adds a non-callable negative case (OrderedDict). Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
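A minimal sketch of an import-path loader with the load-time `callable()` check this commit describes. The dotted `package.module.ClassName` path format and the function name `load_hook` are assumptions for illustration; the actual path syntax accepted by `load_post_processor_hook` is not shown here.

```python
import importlib
from typing import Any


def load_hook(import_path: str) -> Any:
    """Import `module.ClassName`, instantiate it, and require a callable instance."""
    module_name, _, class_name = import_path.rpartition(".")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module.ClassName', got {import_path!r}")

    module = importlib.import_module(module_name)
    hook_class = getattr(module, class_name)
    instance = hook_class()

    # Fail at load/startup rather than on the first chunk at runtime.
    if not callable(instance):
        raise ValueError(
            f"{import_path!r} produced a non-callable {type(instance).__name__}"
        )
    return instance
```

As in the commit's unit tests, `unittest.mock.MagicMock` works as a callable stand-in and `collections.OrderedDict` as a non-callable negative case.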
/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2" --disable-fail-fast
PR_Github #55090 [ run ] triggered by Bot. Commit:
PR_Github #55090 [ run ] completed with state
kaiyux left a comment
Approving on behalf of trt-llm-doc-owners.
…licate docs

Address PR review feedback:

- Rename PostProcChunk/PostProcVerdict/PostProcAction to PostProcessorHookChunk/PostProcessorHookVerdict/PostProcessorHookAction so they read distinctly from the existing postproc-worker types (PostprocParams, PostprocWorker, PostprocWorkerConfig).
- Replace the duplicated PostProcessorHookChunk field table in the docs with a pointer to the authoritative dataclass to avoid drift.
- Use repo-relative paths for the source-file links instead of absolute github.com blob URLs.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
…cessor-hook Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
…ocessor_hook Address PR review feedback (one level up from the type rename): the user-facing CLI flag --post_processor and the BaseLlmArgs field post_processor collided head-on with the pre-existing PostprocParams.post_processor (the internal per-endpoint response formatter), in the same postproc-worker subsystem. Rename the user-facing surface to post_processor_hook, matching the PostProcessorHook* types, the PostProcessorHook Protocol, the internal PostprocWorkerConfig.post_processor_hook, and the doc file name. The existing formatter PostprocParams.post_processor is left untouched. Touches: CLI flag + plumbing (serve.py), BaseLlmArgs field + validation (llm_args.py), eager-load (llm.py), harmony guard rename (openai_server._ensure_post_processor_hook_supported), api_stability reference, unit/e2e tests, and docs. Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
/bot run --stage-list "A10-PyTorch-1, A10-PyTorch-2" --disable-fail-fast
PR_Github #55342 [ run ] triggered by Bot. Commit:
PR_Github #55342 [ run ] completed with state
Summary
Adds a native, per-request post-processing hook to `trtllm-serve` (TRTLLM-12622). A user supplies an importable, picklable callable class via `--post_processor` (or the equivalent `post_processor` field in `--extra_llm_api_options` YAML). The hook runs at a single chokepoint after detokenization and before the per-endpoint response formatter, and may rewrite, suppress, or terminate the output. Each hook instance owns its own per-request state, keyed by `request_id`.

The hook works across the chat and completions endpoints (streaming and non-streaming) and with the post-processing worker pool both enabled and disabled.

Not client-bypassable (the hook is a server-side guardrail):
- `detokenize=false`: the server detokenizes for the hook regardless.
- `suppress`/`terminate` withhold every client-visible channel (text, token ids, and logprobs) on both streaming and non-streaming paths.
- `post_processor` with `skip_tokenizer_init` (no detokenized text) is rejected at args validation.

Design
- `tensorrt_llm/executor/postprocessor_hook.py`: `PostProcChunk`, `PostProcVerdict`, the `PostProcAction` enum (`EMIT`/`SUPPRESS`/`TERMINATE`, validated at construction so an unknown action cannot be smuggled in), the `emit`/`suppress`/`terminate` helpers, the `PostProcessorHook` protocol, the import-path loader (mirroring `--custom_tokenizer`), and `apply_post_processor_hook`.
- Single chokepoint in `DetokenizedGenerationResultBase._handle_response`, which (via MRO) covers both the in-proxy and post-processing-worker detok paths.
- Each `LLM` and each post-processing worker builds its own hook instance from the import path, with no process globals. `BaseLLM` builds `self._post_processor_hook` and threads it onto `RequestOutput`; `PostprocWorker` builds its own and attaches it per record. True no-op when unset; multiple `LLM`s in one process are independent.
- `emit(text)` rewrites the text channel only; `suppress`/`terminate` withhold all channels. `terminate` cancels the engine request through the existing abort path (output-side only; no engine back-edge).
- `emit`/`suppress` act per output sequence; `terminate` cancels the whole request (the documented behavior under `n>1`/beam; no extra guard).
- Fail closed: a hook exception (or an invalid verdict) re-raises so the request errors (the worker path converts it to an `ErrorResponse`; the in-proxy path surfaces it to the serving handler). The server and sibling requests stay alive. For a safety guardrail, failing open would emit exactly the un-vetted content the hook exists to block.
- `proxy.py`: late engine responses arriving after a `terminate`-driven record pop are dropped instead of raising `KeyError`.
- `--post_processor` CLI flag + `BaseLlmArgs.post_processor` (prototype) + an `api_stability` reference update.
- `docs/source/features/post-processor-hook.md`.

Test Coverage
- `tests/unittest/executor/test_postprocessor_hook.py`: verdict semantics (rewrite/suppress/terminate), all-channel withholding (streaming + non-streaming), per-request state isolation, enum action rejected at construction, fail-closed re-raise, import-path loader, independent per-instance state, harmony-rejection guard.
- `tests/unittest/executor/test_proxy_postproc_terminate.py`: a late response after a terminate-driven pop is dropped without `KeyError`; a live final response is delivered and popped.
- `tests/unittest/llmapi/test_llm_args.py`: `post_processor` + `skip_tokenizer_init` is rejected.
- `tests/unittest/llmapi/apps/_test_openai_post_processor.py`: chat & completions × streaming/non-streaming × `num_postprocess_workers` ∈ {0, 2} × rewrite/suppress/terminate, plus a `detokenize=false`-does-not-bypass-the-hook case.
- `api_stability` reference updated for the new `post_processor` field.
- … `post_processor`) 1, integration 30: 87 passed, 0 failed.

Notes
This is a prototype feature; the interface may change. Documented limitations:
- `emit` (rewrite) controls only the text channel and does not scrub the underlying token ids / logprobs, so a client reading those channels can still recover the original text. Hard guardrails must use `suppress`/`terminate`, which withhold all channels.
- The reasoning/tool-call parsers run after the hook.
- Only the generation endpoints are covered (encode/embeddings/VisualGen are separate paths).

Summary by CodeRabbit
New Features
- `--post_processor` option to customize output processing using user-supplied Python code
Documentation
Tests