(post-processor-hook)=
trtllm-serve supports a user-supplied post-processing hook: a native, per-request seam that runs
on each generated output after detokenization and before the per-endpoint response formatter. It
lets a deployment rewrite, redact, suppress, or terminate model output — including stateful logic that
spans the chunks of a streamed response — without modifying TensorRT LLM source.
The hook is a plain Python callable class supplied by import path, mirroring --custom_tokenizer. It
is owned by the LLM instance (and built once in each post-processing worker process when those are
enabled) and invoked once per output, per streaming chunk (plus a final call), so it can hold its own
per-request state. Independent LLM instances in one process each own a separate hook instance.
This feature is a prototype and its interface may change in a future release.
For the interface definitions referenced below, see
tensorrt_llm/executor/postprocessor_hook.py.
Pass the dotted import path of your hook class to --post_processor_hook:
trtllm-serve <model> --post_processor_hook my_pkg.guardrail.MyPostProcessorHookEquivalently, set it in a YAML config passed via --extra_llm_api_options:
post_processor_hook: my_pkg.guardrail.MyPostProcessorHookThe class must be:
- Importable — installed (
pip install) or onPYTHONPATHwhen the server (and its post-processing worker processes, if any) start. - Picklable and no-argument-constructible — the hook is reconstructed by reference inside each
process;
__init__takes no required arguments and is the one-time setup point.
The hook works the same with or without the out-of-process post-processing worker pool
(--num_postprocess_workers).
A hook implements a single method, __call__(self, chunk) -> verdict:
from tensorrt_llm.executor.postprocessor_hook import (
PostProcessorHookChunk,
PostProcessorHookVerdict,
emit,
suppress,
terminate,
)
class MyPostProcessorHook:
def __call__(self, chunk: PostProcessorHookChunk) -> PostProcessorHookVerdict:
return emit(chunk.text_diff) # pass through unchangedThe payload handed to the hook for one output chunk — request_id, output_index, text_diff,
text, token_ids_diff, is_final, aborted, and streaming. See the PostProcessorHookChunk
dataclass in postprocessor_hook.py
for the authoritative field-by-field descriptions.
Return one of the following helpers from __call__:
| Helper | Effect |
|---|---|
emit(text) |
Emit text for this chunk. Use it to pass output through unchanged (emit(chunk.text_diff)) or to rewrite/redact it. Affects the text channel only — it does not synthesize matching token_ids (see Text vs. token ids below). |
suppress() |
Withhold this chunk entirely across all client-visible channels — text, token_ids, and logprobs (so detokenize=false token output is withheld too). |
terminate(reason) |
Stop the stream for this request, withholding the terminating chunk on all channels. reason is surfaced as the response stop_reason, and the engine request is cancelled. |
Verdicts are per chunk: suppress() withholds the current chunk, and terminate() stops generation while keeping the chunks already emitted before it. This is consistent across streaming and non-streaming — a non-streaming response contains exactly the content the hook emitted before the first suppress/terminate, i.e. the same content a streaming client would have received. A hook that must withhold the entire output (all-or-nothing) should suppress() from the first chunk (it sees every chunk) rather than emitting and then terminating.
A stateless hook that upper-cases every chunk:
from tensorrt_llm.executor.postprocessor_hook import PostProcessorHookChunk, PostProcessorHookVerdict, emit
class UpperCaseHook:
def __call__(self, chunk: PostProcessorHookChunk) -> PostProcessorHookVerdict:
return emit(chunk.text_diff.upper())This hook accumulates text per request (keyed by request_id), stops the stream as soon as a banned
phrase appears, and releases its state when the request finishes:
from tensorrt_llm.executor.postprocessor_hook import (
PostProcessorHookChunk, PostProcessorHookVerdict, emit, terminate,
)
class BannedPhraseGuard:
_BANNED = ("forbidden phrase",)
def __init__(self):
# Per-request accumulators owned entirely by the hook.
self._buffers: dict[int, str] = {}
def __call__(self, chunk: PostProcessorHookChunk) -> PostProcessorHookVerdict:
buffer = self._buffers.get(chunk.request_id, "") + chunk.text_diff.lower()
self._buffers[chunk.request_id] = buffer
if any(phrase in buffer for phrase in self._BANNED):
self._buffers.pop(chunk.request_id, None)
return terminate("banned_phrase")
if chunk.is_final:
self._buffers.pop(chunk.request_id, None)
return emit(chunk.text_diff)A hook that withholds all client-visible text:
from tensorrt_llm.executor.postprocessor_hook import PostProcessorHookChunk, PostProcessorHookVerdict, suppress
class SuppressHook:
def __call__(self, chunk: PostProcessorHookChunk) -> PostProcessorHookVerdict:
return suppress()The hook instance is owned by the LLM (built once in each post-processing worker process when the
pool is enabled) and shared across all requests it handles, so any per-request state must be keyed by
chunk.request_id and released when chunk.is_final is seen (or after a terminate). State is not
shared across processes or across separate LLM instances; when the post-processing worker pool is
enabled, all chunks of a single request are still routed to the same worker, so per-request state
remains consistent for that request.
Engine-level batching is transparent to the hook: even when requests are batched together in the
engine, the hook is invoked once per request (per output, per chunk) — there is no batched-call
form — so keying state on chunk.request_id is sufficient to keep concurrent requests isolated.
- Endpoints:
chat/completionsandcompletions, both streaming and non-streaming. The hook is expected to apply to theresponsesendpoint for non-harmony models as well (shared detokenization path), though that endpoint is not covered by the current end-to-end tests. - Not client-bypassable: the hook is a server-side guardrail, so it runs on every response even
when a
completionsrequest setsdetokenize=false. The server detokenizes for the hook regardless; thedetokenizeflag still controls only the returned channel (text vs. token ids). Asuppress/terminateverdict withholds all client-visible channels — text,token_ids, andlogprobs— on both the streaming and non-streaming paths, so a client cannot recover withheld content through any channel. - Requires a tokenizer: the hook needs detokenized text to inspect, so
--post_processor_hookcombined withskip_tokenizer_initis rejected at startup rather than silently disabled. - harmony / gpt-oss models: not supported. Because the harmony output path is reconstructed from
raw token ids, it would bypass the text-based hook, so the server fails fast at startup when
--post_processor_hookis combined with a harmony model. - Disaggregated serving: the context and generation servers are separate processes, each running
the hook on its own phase under a different
request_id; per-request state cannot be correlated across the two. Aterminateon one phase does not propagate to the other. - Text vs. token ids:
emitrewrites the text channel only — it does not rewrite the underlyingtoken_ids/logprobs, so a client reading both should expect them to diverge. (suppress/terminatewithhold all channels, so they do not diverge.) - Reasoning / tool parsing: the hook runs before the reasoning and tool-call parsers. A hook that
rewrites or suppresses text may desync those parsers; prefer
terminate, or apply such hooks to plain-text requests. - Hook errors fail closed: if the hook raises, the request fails with an error rather than serving the un-vetted chunk (the server and other requests stay alive). A returned verdict with an unknown action is rejected the same way.
n> 1 / beam search:emitandsuppressact per output sequence, butterminatecancels the whole request (all sequences), because the engine request is the unit of cancellation — aterminateon one candidate ends the others too. Hooks needing per-sequence state should key on(request_id, output_index).
The hook's unit and end-to-end tests double as runnable usage examples:
# Unit tests for the hook contract (rewrite / suppress / terminate, per-request state, loader)
pytest tests/unittest/executor/test_postprocessor_hook.py -v- End-to-end serving tests across endpoints, streaming modes, and worker-pool settings:
tests/unittest/llmapi/apps/_test_openai_post_processor.py. - The deterministic sample hooks used by those tests:
tests/unittest/llmapi/apps/_postproc_hook_samples.py.