extension/llm/server: serving docs and comment cleanup
Documentation/comment-only hardening (no control-flow change), in two parts.
First, document two limitations that matter for local-agent / subagent use and
were previously only implied. (1) Cancellation is best-effort and subject to
head-of-line blocking: WorkerClient.stop() is a no-op and the worker holds the
single in-flight slot to completion, so a disconnected client doesn't interrupt
generation and a long generation blocks other sessions until it finishes; real
interruption needs a protocol change. (2) Warm resume requires true turn
terminators surfaced as terminal/EOS token ids; a string-only terminator marks
every turn dirty and never resumes. Both are stated in worker_loop.h,
python/README.md, spec/README.md, and the WorkerClient.stop / SessionRuntime
cancel-path comments.
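As a rough illustration of the head-of-line limitation, the toy model below
computes when each request finishes if a single slot is held to completion and
cancellation has no effect. It is a sketch only: `finish_times`, the session
ids, and the token-step clock are hypothetical and are not part of the worker
or of WorkerClient.

```python
def finish_times(requests, cancelled=()):
    """Return {session_id: step at which its generation finishes}.

    `requests` is a list of (session_id, n_tokens) in arrival order.
    One request runs at a time (the single in-flight slot).
    """
    clock = 0
    finished = {}
    for session_id, n_tokens in requests:
        # `cancelled` is deliberately ignored: in this toy model, as in the
        # documented limitation, a stop request does not shorten the run.
        clock += n_tokens
        finished[session_id] = clock
    return finished


requests = [("long", 1000), ("short", 5)]
# The short request waits behind the long one even if "long" was cancelled.
print(finish_times(requests, cancelled={"long"}))
```

In this model the short session's latency is dominated by the long one, which
is the behaviour a caller should plan for until the protocol supports real
interruption.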
Second, make the stack read as a stable architecture rather than a migration
diary: remove the intermediate phase labels (V1 / V2 / V2a / V2b.1 / V2b.1.5 and
work-item tags) from all serving and Qwen-worker-example comments, docstrings,
READMEs, and tests; shorten the worker_loop.h top narrative and the
openai_transcript.py module and helper docstrings to their durable contracts;
tighten the hot-path splice and stop-handling comments; de-duplicate the JSONL
protocol (cpp/worker_loop.h is the canonical reference, with worker_client.py and
the READMEs pointing to it) and replace the stale protocol snippet in
python/README.md; clarify prefix/KV reuse in spec/README.md (no global
cross-session prefix cache, but per-session append-only warm resume is
implemented worker-side); and trim the Qwen README session section to
user-facing facts.
Kept: the JSONL/wire protocol contract, the exact-token warm-resume invariant
(mismatch resets), stop-string-trim non-resumability, generated_token_ids
excluding the terminal EOS, the resident_token_ids == session.position()
invariant, the CUDA mutable-state rationale, and the user-visible cancellation /
head-of-line and terminator-vs-stop limitations.
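The warm-resume invariants listed above can be pictured with a small toy
model. This is a sketch under stated assumptions, not the worker's code:
`ToySession`, `plan_prefill`, `finish_turn`, and `EOS_ID` are hypothetical
names, `position` is modelled as a plain token count, and the terminal EOS is
assumed to stay resident because the next prompt re-renders it.

```python
from dataclasses import dataclass, field

EOS_ID = 2  # hypothetical terminal token id, chosen only for this sketch


@dataclass
class ToySession:
    resident_token_ids: list = field(default_factory=list)
    position: int = 0    # number of tokens the toy session currently holds
    dirty: bool = False  # True once the last turn cannot be resumed


def plan_prefill(session, prompt_ids):
    """Return the token ids still to prefill, resetting on any mismatch."""
    resident = session.resident_token_ids
    # Exact-token check: the new prompt must extend the resident tokens.
    warm = not session.dirty and prompt_ids[: len(resident)] == resident
    if not warm:
        session.resident_token_ids = []
        session.position = 0
        session.dirty = False
        return list(prompt_ids)
    return list(prompt_ids[len(resident):])


def finish_turn(session, prefilled_ids, sampled_ids, trimmed_by_stop_string=False):
    """Record a turn and return generated_token_ids without the terminal EOS."""
    ended_on_eos = bool(sampled_ids) and sampled_ids[-1] == EOS_ID
    generated = list(sampled_ids[:-1]) if ended_on_eos else list(sampled_ids)
    session.resident_token_ids += list(prefilled_ids) + list(sampled_ids)
    session.position += len(prefilled_ids) + len(sampled_ids)
    # A turn that ended on a stop string (or without EOS) is not resumable.
    session.dirty = trimmed_by_stop_string or not ended_on_eos
    assert len(session.resident_token_ids) == session.position
    return generated


s = ToySession()
todo = plan_prefill(s, [10, 11])                # cold start: prefill everything
finish_turn(s, todo, [20, 21, EOS_ID])          # clean turn, ends on EOS
print(plan_prefill(s, [10, 11, 20, 21, EOS_ID, 30]))  # only the new suffix
```

A turn that ends on a trimmed stop string sets `dirty`, so the following call
to `plan_prefill` resets the session and prefills the whole prompt again.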
Behavior-preserving: the full Python serving suite passes; the only non-comment
edits are two diagnostic strings (an error message and a CLI help description).
Part of #20001
ghstack-source-id: f98f407
ghstack-comment-id: 4672992038
Pull-Request: #20193