Commit 08e8113
committed
extension/llm/server: token-ID prompt segments for tool-use resume (V2b.1.5)
Warm resume (V2b.1) misses on agent loops: an assistant turn re-rendered from
its parsed tool call almost never re-tokenizes to the tokens the model actually
generated, so the resident state isn't an exact prefix and the worker resets.
On BFCL multi_turn the warm-resume hit rate was 0%.
Fix: carry the exact tokens instead of re-deriving them from text. The worker
returns generated_token_ids on `done` and accepts a `prompt_segments` form of
the prompt -- an ordered list of {"text"} chunks to tokenize and {"ids"} runs of
literal token ids (mutually exclusive with the plain `prompt` string); the
WorkerClient/SessionRuntime transport for that form was introduced with the
SessionRuntime boundary, and this commit makes the worker assemble and emit it.
The adapter-specific transcript glue lives in a new module, openai_transcript.py
(OpenAITranscriptState): it stores one record per assistant turn ({fingerprint,
ids}) and, on the next request, rebuilds the prompt as segments -- each prior
assistant turn is replaced with a unique sentinel, the conversation is rendered
once, and the rendered text is split on the sentinels with the stored ids spliced
back in. Tool results stay text (they re-tokenize deterministically). This logic
is the OpenAI adapter's concern, not the runtime's: SessionRuntime only sees a
PromptInput (text or segments).
Splicing is guarded so stale ids are never injected: a turn is substituted only
when the incoming assistant message fingerprint-matches the response we returned
(an edited or branched history, or a session reused for another conversation ->
text fallback; splicing stops at the first divergence and the now-stale tail is
pruned, and a regenerated turn is recorded at its position so it replaces the
stale record instead of shadowing later hits), and only when its ids faithfully
decode to what the client saw -- a stop-string trim kept post-stop tokens
resident but dropped them from the output, so the worker omits the ids and the
turn is re-rendered as text. Sentinel collisions / dropped sentinels also fall
back to text, and the worker's exact-token prefix check backstops the rest.
The context-window preflight counts what the worker actually assembles: for a
segment prompt it sums the literal {ids} run lengths and the tokenized {text}
chunks (not the rendered string), so a near-limit request agrees with the worker
rather than false-rejecting or failing mid-decode.
On BFCL multi_turn (per-conversation sessions) this moves the reuse fraction
from 0% to ~50% (exact_prefix hits where there were none); the single-turn AST
suite is unchanged (no prior assistant turn -> plain text prompt).
Review order: worker_loop.h (segment assembly + faithful generated_token_ids);
then the control plane (the new openai_transcript.py store + fingerprint-guarded
sentinel rendering, and the serving_chat wiring that builds the segments and
counts them for the context preflight); then tests and docs.
Part of #20001
ghstack-source-id: 8f15d6a
ghstack-comment-id: 4661784137
Pull-Request: #201611 parent c360dff commit 08e8113
8 files changed
Lines changed: 1049 additions & 22 deletions
File tree
- examples/models/qwen3_5_moe
- extension/llm/server
- cpp
- python
- tests
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
243 | 243 | | |
244 | 244 | | |
245 | 245 | | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
246 | 256 | | |
247 | 257 | | |
248 | 258 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
95 | 95 | | |
96 | 96 | | |
97 | 97 | | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
0 commit comments