
Commit 509e9b7

Codec: token-native binary transport for OpenAI completions + chat streaming
Adds a binary streaming wire format (MessagePack and length-prefixed Protocol Buffers) for /v1/completions and /v1/chat/completions. Opt-in via the request body field `stream_format`. Default behaviour is byte-identical to current vllm.

## Why

Real cross-stack measurements (run 2026-05-15T20-00-00Z, 2K-token completion on Qwen2.5-0.5B-Instruct; full matrix at https://github.com/wdunn001/Codec/blob/main/packages/bench/results/2026-05-15T20-00-00Z/MATRIX.md):

| Engine | JSON-SSE | Best Codec | Reduction |
|---|---|---|---|
| sglang | 485.2 KB | 291 B (msgpack + dict-zstd) | 1,707× |
| **vllm** | 517.8 KB | 3.9 KB (msgpack + gzip) | **137×** |
| llama.cpp | 528.8 KB | 140 B (msgpack + dict-zstd, fp16) | 3,868× |

vllm's 137× headline is gzip-only because the model's output at temp=0 is content-bound, not protocol-bound. When a dict-zstd path is wired in (Codec v0.5 adds discoverable .well-known/codec/dicts/), vllm moves to the same multi-thousand-× range as sglang + llama.cpp.

The savings come from:

1. Not detokenizing at the serving server (no UTF-8 / JSON envelope per chunk).
2. Letting the client opt into HTTP-level compression (gzip / br / zstd-with-dict) on the binary stream.
3. Skipping the re-tokenize round-trip in agent-to-agent and tool-dispatch hops; the consumer reads uint32 IDs directly.

## What this PR adds (additive only)

13 files, +2,032 / -16 lines. All under `vllm/entrypoints/`.

New modules:

- `codec_frame.py`: MessagePack + Protocol Buffers encoders for CodecFrame{ids[], done, finish_reason?, tool_calls?}. Hand-rolled protobuf (no codegen step). Accepts Union[Sequence[int], np.ndarray [uint32], array.array('I'), bytes] on ids so future CODEC_OPENAI_BYPASS work (skip the PyLong-list allocation per token in tokenizer_manager) can be wired in without further encoder churn.
- `codec_compression.py`: Accept-Encoding negotiation for zstd (with pre-trained dictionary), br, gzip, identity. Emits Codec-Zstd-Dict: sha256:<hex> on every zstd response.
- `codec_agent.py`: ToolWatcher, a uint32-compare state machine detecting delimited regions (tool calls, reasoning blocks, multimodal spans) in the raw token stream without detokenizing. ~100× faster than detokenize+regex.
- `codec_dispatcher.py`: bolt-on tool dispatch (CODEC_BOLT_ON_DISPATCH=1, default off). Reads tool manifests from CODEC_TOOL_MANIFEST_URLS at boot, hash-validates each manifest's tokenizerHash against the active model's tokenizer, POSTs CodecToolCall (msgpack-framed), reinjects response_ids into the generation stream.
- `codec_version.py`: protocol-version negotiation (Codec-Client-Version, Codec-Min-Version headers + 426 Upgrade Required + VERSION_INCOMPATIBLE frame).

Modified existing files:

- `openai/completion/serving.py` + `openai/chat_completion/serving.py`: dispatch to binary generator when stream_format != "json"; preserve JSON-SSE path byte-for-byte when unset.
- `openai/completion/protocol.py` + `openai/chat_completion/protocol.py`: stream_format, tool_watcher, tool_watcher_start, tool_watcher_end fields on request types.
- `openai/completion/api_router.py` + `openai/chat_completion/api_router.py`: route registration.
- `openai/server_utils.py`: codec-aware response helpers.

Plus 1 new test file: `test_codec_compression.py`.

## Trust posture / opt-in

- Default behaviour byte-identical to current vllm. Client that doesn't set stream_format gets JSON-SSE exactly as today.
- No new mandatory dependencies. Wire emit uses msgspec (already in vllm's reqs); compression uses brotli + zstandard (graceful fallthrough to identity if missing).
- Engine boot unchanged unless CODEC_BOLT_ON_DISPATCH=1 set; dispatcher loads manifests lazily.

## Cross-stack reference + spec

- Wire format spec: https://github.com/wdunn001/Codec/blob/main/spec/versions/v0.5.md
- .well-known/codec/ discovery surface: https://github.com/wdunn001/Codec/blob/main/spec/WELL_KNOWN_DISCOVERY.md
- 6 client-language reference implementations (TS / Python / Rust / .NET / Java / C) consume this wire byte-equally: https://github.com/wdunn001/Codec/tree/main/packages
- Cross-stack bench: https://github.com/wdunn001/Codec/blob/main/packages/bench/RESULTS.md

Companion PR against sgl-project/sglang filed in parallel (sgl-project/sglang#25544). The wdunn001/vllm fork has carried this code in production-grade wdunn001/codec-vllm Docker images for two releases; cross-client byte-equality is 24/24 cells unanimous as of v0.4.1.

Signed-off-by: William Dunn <wdunn001@gmail.com>
1 parent 966903e commit 509e9b7

13 files changed

Lines changed: 2032 additions & 16 deletions

vllm/entrypoints/codec_agent.py

Lines changed: 266 additions & 0 deletions
@@ -0,0 +1,266 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Server-side agentic primitives for Codec streaming responses.

Two pieces, layered cleanly on top of the wire format defined in
codec_frame.py:

  - ToolWatcher: a uint32-compare state machine that detects delimited
    regions (tool calls, reasoning blocks, vision spans, sandbox runs)
    in the output token stream without ever decoding. Mirrors the
    libcodec / @codecai/web / codecai / Codec.Net implementations
    bit-identically: same edge cases, same buffering, same nested-
    start handling.

  - parse_tool_call: when a region completes, render its body through
    the tokenizer, parse as JSON (the convention every chat-tuned
    model in current use follows), and surface name + arguments_json
    on the next frame.

Why server-side: orchestrators don't have to detokenize on every
frame just to scan for marker text. The server already has the
tokenizer. The server already has the IDs. This PR exposes the
detection result directly in the Codec wire format so clients get
structured tool_call data alongside the raw token stream.

Disabled by default. Activated per-request via the `tool_watcher`
field on ChatCompletionRequest / CompletionRequest.

No new external dependencies: only stdlib + the codec_frame module
in this same package.
"""
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


# ---------------------------------------------------------------------------
# Tool-call data model (mirrors openai-style { id, name, arguments } shape)
# ---------------------------------------------------------------------------


@dataclass
class ToolCallEvent:
    """One tool call detected in the model's output stream.

    `arguments_json` is the raw JSON string between the start/end markers.
    `name` is parsed from that JSON when the model uses the standard
    `{"name": "...", "arguments": {...}}` shape; otherwise None.
    """

    name: Optional[str]
    arguments_json: str
    id: Optional[str] = None  # server-generated, e.g. "tc_0000002a" (see make_call_id)

    def to_wire_dict(self) -> dict:
        """Serialise to the dict shape encoded into msgpack frames and
        the protobuf ToolCall message."""
        out: dict = {"arguments_json": self.arguments_json}
        if self.name is not None:
            out["name"] = self.name
        if self.id is not None:
            out["id"] = self.id
        return out


# ---------------------------------------------------------------------------
# Watcher state machine
# ---------------------------------------------------------------------------


@dataclass
class _WatcherState:
    """Minimal per-request state. Cheap to instantiate; one per stream."""

    start_id: int
    end_id: int
    inside: bool = False
    region_ids: List[int] = field(default_factory=list)


class ToolWatcher:
    """Stateful detector for delimited regions in a token-ID stream.

    The hot path is `feed(ids)`, a single linear pass that:
      - emits passthrough IDs (everything outside a region) untouched
      - on region close, returns the buffered body IDs for downstream
        parsing
      - never invokes the tokenizer

    This mirrors codec_tool_watcher (libcodec) and ToolWatcher
    (@codecai/web, codecai, Codec.Net). Same state-machine semantics
    so client-side and server-side detection produce identical results.

    Edge cases (matched bit-for-bit with the other implementations):
      - stray end marker: passes through as a regular ID
      - nested start marker: inner ignored, outer end closes the region
      - region split across feeds: body buffered, emitted on close
    """

    def __init__(self, start_id: int, end_id: int) -> None:
        self._st = _WatcherState(start_id=start_id, end_id=end_id)

    @property
    def inside(self) -> bool:
        return self._st.inside

    def reset(self) -> None:
        self._st.inside = False
        self._st.region_ids = []

    def feed(self, ids: List[int]) -> Tuple[List[int], List[List[int]]]:
        """Process a batch of newly-emitted token IDs.

        Returns:
            passthrough_ids: IDs that should be forwarded as the frame's
                `ids` field (markers consumed; region body IDs withheld
                until the region closes).
            completed_regions: list of region bodies (each a list of
                uint32s, markers excluded) that closed during this feed.

        Both are returned per-feed so the caller can attach completed
        regions to the same frame whose passthrough IDs come from this
        feed, which keeps tool-call surfaces aligned with their stream
        position.
        """
        out_ids: List[int] = []
        completed: List[List[int]] = []
        st = self._st
        for tok in ids:
            if not st.inside:
                if tok == st.start_id:
                    st.inside = True
                    st.region_ids = []
                    # Marker itself is NOT forwarded: orchestrators
                    # don't want the "begin tool call" token in the
                    # outbound stream.
                else:
                    out_ids.append(tok)
            else:
                if tok == st.end_id:
                    completed.append(list(st.region_ids))
                    st.region_ids = []
                    st.inside = False
                    # End marker also withheld.
                elif tok == st.start_id:
                    # Nested start: ignore (same as the other ports).
                    pass
                else:
                    st.region_ids.append(tok)
        return out_ids, completed


# ---------------------------------------------------------------------------
# Body → ToolCallEvent
# ---------------------------------------------------------------------------


def parse_tool_call(
    region_body_text: str, *, call_id: Optional[str] = None
) -> ToolCallEvent:
    """Parse the body of a tool-call region (already detokenized) into
    a structured event.

    The convention every chat-tuned model in current use follows:
        { "name": "<function>", "arguments": { ... } }

    We accept both pretty-printed and compact JSON. If parsing fails
    (malformed body, partial JSON, etc.) we still return an event with
    name=None and arguments_json set to the raw body, so the caller can
    surface that to the client, which can then return an
    "invalid_arguments" error to the model.

    Empty / whitespace-only bodies produce an event with name=None
    and arguments_json="": same shape, distinguishable downstream.
    """
    body = region_body_text.strip()
    if not body:
        return ToolCallEvent(name=None, arguments_json="", id=call_id)

    name: Optional[str] = None
    try:
        parsed: Any = json.loads(body)
        if isinstance(parsed, dict):
            n = parsed.get("name")
            if isinstance(n, str):
                name = n
    except json.JSONDecodeError:
        # Keep the raw body so the caller can decide how to handle it.
        pass

    return ToolCallEvent(name=name, arguments_json=body, id=call_id)


# ---------------------------------------------------------------------------
# Helpers for the serving layer
# ---------------------------------------------------------------------------


def detokenize_region(tokenizer, region_ids: List[int]) -> str:
    """Convenience wrapper around the tokenizer's decode that skips
    special tokens, since tool-call body text is pure JSON with no
    chat template chrome.

    Tokenizer compatibility: works with any tokenizer exposing a
    .decode(ids, skip_special_tokens=bool) method (HF AutoTokenizer,
    vLLM's AnyTokenizer, the MistralTokenizer wrapper). We don't import
    transformers directly to keep this module dependency-free;
    duck-typing on .decode() is enough.
    """
    return tokenizer.decode(region_ids, skip_special_tokens=True)


def make_call_id(seq_no: int) -> str:
    """Server-generated tool call id. Stable shape; sequence-numbered
    rather than UUID so test fixtures stay deterministic."""
    return f"tc_{seq_no:08x}"


# ---------------------------------------------------------------------------
# Marker resolution helpers (vLLM-specific, not present in sglang)
# ---------------------------------------------------------------------------


def resolve_marker_id(tokenizer, marker: str) -> Optional[int]:
    """Resolve a special-token string like ``<tool_call>`` to its single
    integer ID in the loaded tokenizer's vocab.

    Returns None if the marker doesn't exist as a single token. The
    caller should disable the watcher in that case (the model can't
    emit a single-token boundary, so ID-level detection is impossible).

    Tries three lookup paths in order, since vLLM ships several
    tokenizer flavours and they don't all expose the same surface:
      1. ``added_tokens_encoder`` (HF fast / slow)
      2. ``get_vocab()`` returning a dict-like mapping
      3. ``encode(marker, add_special_tokens=False)`` returning a list
         whose length must be 1
    """
    enc = getattr(tokenizer, "added_tokens_encoder", None)
    if isinstance(enc, dict) and marker in enc:
        return int(enc[marker])

    get_vocab = getattr(tokenizer, "get_vocab", None)
    if callable(get_vocab):
        try:
            vocab = get_vocab()
            if marker in vocab:
                return int(vocab[marker])
        except Exception:
            pass

    encode = getattr(tokenizer, "encode", None)
    if callable(encode):
        try:
            ids = encode(marker, add_special_tokens=False)
            # Some tokenizers return tensors; coerce to list.
            ids = list(ids) if not isinstance(ids, list) else ids
            if len(ids) == 1:
                return int(ids[0])
        except Exception:
            pass

    return None
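
A minimal sketch of how a serving layer might chain the pieces in codec_agent.py: feed token IDs through the watcher, detokenize each completed region, and build the wire dict. To stay runnable on its own it restates the state machine in condensed form instead of importing the module. `MiniWatcher`, `StubTokenizer`, `to_event`, the marker IDs 1001/1002, and the two-entry vocabulary are hypothetical stand-ins for illustration and are not part of this commit.

```python
import json
from typing import Dict, List, Optional, Tuple

# Hypothetical marker IDs, chosen only for this illustration.
START_ID, END_ID = 1001, 1002


class MiniWatcher:
    """Condensed restatement of ToolWatcher.feed so the sketch runs standalone."""

    def __init__(self, start_id: int, end_id: int) -> None:
        self.start_id, self.end_id = start_id, end_id
        self.inside = False
        self.region: List[int] = []

    def feed(self, ids: List[int]) -> Tuple[List[int], List[List[int]]]:
        out: List[int] = []
        done: List[List[int]] = []
        for tok in ids:
            if not self.inside:
                if tok == self.start_id:
                    self.inside, self.region = True, []  # start marker withheld
                else:
                    out.append(tok)  # includes stray end markers
            elif tok == self.end_id:
                done.append(self.region)  # region closes; end marker withheld
                self.inside, self.region = False, []
            elif tok != self.start_id:  # nested start markers are ignored
                self.region.append(tok)
        return out, done


class StubTokenizer:
    """Hypothetical stand-in exposing only .decode(), as detokenize_region expects."""

    VOCAB: Dict[int, str] = {
        7: '{"name": "get_weather", ',
        8: '"arguments": {"city": "Oslo"}}',
    }

    def decode(self, ids: List[int], skip_special_tokens: bool = True) -> str:
        return "".join(self.VOCAB[i] for i in ids)


def to_event(body_text: str, seq_no: int) -> dict:
    """Mirror parse_tool_call + make_call_id + to_wire_dict in one step."""
    body = body_text.strip()
    name: Optional[str] = None
    try:
        parsed = json.loads(body)
        if isinstance(parsed, dict) and isinstance(parsed.get("name"), str):
            name = parsed["name"]
    except json.JSONDecodeError:
        pass  # keep the raw body; name stays None
    event = {"arguments_json": body, "id": f"tc_{seq_no:08x}"}
    if name is not None:
        event["name"] = name
    return event


watcher = MiniWatcher(START_ID, END_ID)
tokenizer = StubTokenizer()

# The region is split across two feeds: the body is buffered until it closes.
first_ids, first_regions = watcher.feed([5, START_ID, 7])
second_ids, second_regions = watcher.feed([8, END_ID, 6, END_ID])

events = [
    to_event(tokenizer.decode(region, skip_special_tokens=True), seq_no)
    for seq_no, region in enumerate(second_regions)
]
print(first_ids, second_ids, events)
```

The second feed ends with an end marker outside any region, so that ID is forwarded like an ordinary token, matching the "stray end marker" edge case in the ToolWatcher docstring.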

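
The three lookup paths in resolve_marker_id can be exercised with duck-typed stubs, one per path. This is a sketch under stated assumptions: the function body is a condensed restatement with the try/except guards omitted, and `AddedTokensStub`, `VocabStub`, `EncodeStub`, and the ID 1001 are hypothetical and do not correspond to any real tokenizer.

```python
from typing import Dict, List, Optional


def resolve_marker_id(tokenizer, marker: str) -> Optional[int]:
    """Condensed restatement of the lookup order; error guards omitted."""
    enc = getattr(tokenizer, "added_tokens_encoder", None)
    if isinstance(enc, dict) and marker in enc:
        return int(enc[marker])  # path 1: added-tokens table
    get_vocab = getattr(tokenizer, "get_vocab", None)
    if callable(get_vocab):
        vocab = get_vocab()
        if marker in vocab:
            return int(vocab[marker])  # path 2: full vocabulary
    encode = getattr(tokenizer, "encode", None)
    if callable(encode):
        ids = list(encode(marker, add_special_tokens=False))
        if len(ids) == 1:
            return int(ids[0])  # path 3: marker encodes to exactly one token
    return None


class AddedTokensStub:
    """Hypothetical tokenizer exposing only an added-tokens table."""

    added_tokens_encoder: Dict[str, int] = {"<tool_call>": 1001}


class VocabStub:
    """Hypothetical tokenizer exposing only get_vocab()."""

    def get_vocab(self) -> Dict[str, int]:
        return {"<tool_call>": 1001, "hello": 7}


class EncodeStub:
    """Hypothetical tokenizer exposing only encode(); unknown text splits per character."""

    def encode(self, text: str, add_special_tokens: bool = False) -> List[int]:
        return [1001] if text == "<tool_call>" else [ord(c) for c in text]


stubs = (AddedTokensStub(), VocabStub(), EncodeStub())
resolved = [resolve_marker_id(t, "<tool_call>") for t in stubs]

# A marker that splits into several tokens cannot act as a single-ID boundary,
# so the caller would leave the watcher disabled for this request.
missing = resolve_marker_id(EncodeStub(), "<no_such_marker>")
watcher_enabled = missing is not None
print(resolved, missing, watcher_enabled)
```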