Codec: token-native binary transport for completions streaming #25544

Open
wdunn001 wants to merge 5 commits into sgl-project:main from wdunn001:pr/codec-binary-transport

@wdunn001 wdunn001 commented May 17, 2026

Summary

Adds a binary streaming wire format (MessagePack and length-prefixed Protocol Buffers) that ships raw uint32 token IDs end-to-end instead of detokenizing every chunk to UTF-8 and wrapping it in JSON Server-Sent Events. Opt-in via the request-body field stream_format. Default behaviour is byte-identical to current sglang.

Why

Real cross-stack measurements (run id 2026-05-15T20-00-00Z, 2,000-token completion on Qwen2.5-0.5B-Instruct; full matrix at https://github.com/wdunn001/Codec/blob/main/packages/bench/results/2026-05-15T20-00-00Z/MATRIX.md):

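| Engine    | JSON-SSE | Best Codec                         | Reduction |
|-----------|---------:|-----------------------------------:|----------:|
| sglang    | 485.2 KB | 291 B (msgpack + dict-zstd)        | 1,707×    |
| vllm      | 517.8 KB | 3.9 KB (msgpack + gzip)            | 137×      |
| llama.cpp | 528.8 KB | 140 B (msgpack + dict-zstd, fp16)  | 3,868×    |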

The savings come from:

  1. Not detokenizing at the serving server (no UTF-8 / JSON envelope per chunk).
  2. Letting the client opt into HTTP-level compression (gzip, br, zstd with pre-trained dictionaries) on the binary stream.
  3. Skipping the re-tokenize round-trip in agent-to-agent and tool-dispatch hops — the consumer reads uint32 IDs directly and feeds them to the next model.
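
To make the first point concrete, the sketch below compares the size of one streamed chunk as a JSON-SSE event against the same token shipped as a raw little-endian uint32. The chunk shape, the request id, and the token ID are illustrative assumptions; this is neither sglang's exact response schema nor the Codec wire format.

```python
import json
import struct
from typing import List


def json_sse_chunk(text: str) -> bytes:
    """Build one SSE event carrying a hypothetical completions-style chunk."""
    payload = {
        "id": "cmpl-123",  # hypothetical request id
        "object": "text_completion",
        "choices": [{"index": 0, "text": text, "finish_reason": None}],
    }
    return b"data: " + json.dumps(payload).encode("utf-8") + b"\n\n"


def raw_id_chunk(token_ids: List[int]) -> bytes:
    """Pack token IDs as little-endian uint32 (4 bytes each), no envelope."""
    return struct.pack(f"<{len(token_ids)}I", *token_ids)


sse = json_sse_chunk(" world")
raw = raw_id_chunk([1879])  # made-up token ID standing in for " world"
print(len(sse), len(raw))
```

The JSON envelope is repeated on every chunk, which is why it dominates the uncompressed stream size for short per-token deltas.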

What this PR adds (additive only)

13 files, +2,690 / -2 lines. All under python/sglang/srt/entrypoints/.

New modules:

  • codec_frame.py — MessagePack + Protocol Buffers encoders for CodecFrame {ids[], done, finish_reason?, tool_calls?}. Hand-rolled protobuf (no codegen step). Accepts Union[List[int], numpy.ndarray[uint32], array.array('I'), bytes] on the ids parameter so a future upstream change to surface output_ids as a buffer can be wired in without further encoder churn.
  • codec_compression.py — Accept-Encoding negotiation for zstd (with pre-trained dictionary), br, gzip, identity. Emits Codec-Zstd-Dict: sha256:<hex> on every zstd response so clients can verify they have the matching dict loaded before decompressing.
  • codec_agent.py — ToolWatcher: a uint32-compare state machine that detects delimited regions (tool calls, reasoning blocks, multimodal spans) in the raw token stream without detokenizing. ~100× faster than detokenize+regex in the steady state. Per-request opt-in via the tool_watcher request field.
  • codec_dispatcher.py — bolt-on tool dispatch (default off, CODEC_BOLT_ON_DISPATCH=1). Reads tool manifests from CODEC_TOOL_MANIFEST_URLS at boot, hash-validates each manifest's tokenizerHash against the active model's tokenizer, POSTs CodecToolCall (msgpack-framed) to registered tools when ToolWatcher fires, reinjects response_ids into the generation stream — full <tool_call>...</tool_call> loop without ever decoding the model's stream to text.
  • codec_version.py — protocol-version negotiation (Codec-Client-Version, Codec-Min-Version headers + 426 Upgrade Required + VERSION_INCOMPATIBLE frame).
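
The ToolWatcher idea can be illustrated with a small state machine that compares integers only. This is a sketch under stated assumptions, not the code in codec_agent.py: it assumes each delimiter is a single token ID, and the class name, method name, and the IDs 1001/1002 are placeholders.

```python
from enum import Enum
from typing import List, Optional


class State(Enum):
    OUTSIDE = 0
    INSIDE = 1


class ToolWatcherSketch:
    """Detects a region delimited by single start/end token IDs (assumed)."""

    def __init__(self, start_id: int, end_id: int) -> None:
        self.start_id = start_id
        self.end_id = end_id
        self.state = State.OUTSIDE
        self.buffer: List[int] = []

    def feed(self, token_id: int) -> Optional[List[int]]:
        """Consume one token ID; return the region body when it closes."""
        if self.state is State.OUTSIDE:
            if token_id == self.start_id:
                self.state = State.INSIDE
                self.buffer = []
            return None
        if token_id == self.end_id:
            self.state = State.OUTSIDE
            body, self.buffer = self.buffer, []
            return body
        self.buffer.append(token_id)
        return None


watcher = ToolWatcherSketch(start_id=1001, end_id=1002)  # placeholder IDs
stream = [5, 1001, 7, 8, 1002, 9]
regions = [body for tid in stream if (body := watcher.feed(tid)) is not None]
print(regions)  # [[7, 8]]
```

Each token costs one or two integer comparisons, which is the property the speed claim above relies on; multi-token delimiters would need a longer match window.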

5 modified existing files:

  • openai/serving_completions.py + openai/serving_chat.py — dispatch to the binary generator when stream_format != "json"; preserve the existing JSON-SSE path byte-for-byte when unset or "json".
  • openai/protocol.py — stream_format and tool_watcher fields on the request types.
  • engine.py + http_server.py — /codec/schema endpoint registration (returns the protobuf schema text for client code generation).

Plus 3 new test files: test_codec_agent.py, test_codec_compression.py, test_codec_version.py.

Trust posture

  • Default behaviour byte-identical to current sglang. A client that does NOT set stream_format gets JSON-SSE exactly as today.
  • No new mandatory dependencies. The wire emit uses msgspec (already in sglang's reqs); compression uses brotli + zstandard (missing → graceful fallthrough to identity per the Accept-Encoding negotiation rules).
  • Engine boot unchanged unless opt-in. CODEC_BOLT_ON_DISPATCH=1 loads manifests lazily; without it nothing in the codec path runs at boot.

Test plan

  • pytest python/sglang/srt/entrypoints/test_codec_*.py green (3 new test files in this PR).
  • stream_format=msgpack round-trip via curl against a Qwen2.5-0.5B-Instruct engine; response Content-Type application/x-msgpack, decode reproduces token IDs end-to-end.
  • Existing stream_format unset / stream_format=json requests produce byte-identical JSON-SSE to current main (responses regression-tested in the bench harness referenced above).
  • Accept-Encoding: zstd, br, gzip negotiates correctly per RFC 7231 §5.3.4 preference order; Codec-Zstd-Dict header present on zstd responses.
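
The negotiation exercised by the last test item can be sketched as follows. This is a simplified illustration, not the logic in codec_compression.py: the server preference order, the function names, and the choice to always fall back to identity (rather than answer 406 when the client refuses it) are assumptions.

```python
from typing import Dict, Iterable

SERVER_PREFERENCE = ("zstd", "br", "gzip", "identity")  # assumed server order


def parse_accept_encoding(header: str) -> Dict[str, float]:
    """Parse 'gzip;q=0.5, br' into {coding: qvalue}; a missing q means 1.0."""
    weights: Dict[str, float] = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        coding, _, params = part.partition(";")
        q = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable weight as not acceptable
        weights[coding.strip().lower()] = q
    return weights


def negotiate(header: str, available: Iterable[str] = SERVER_PREFERENCE) -> str:
    """Pick the highest-weighted coding; ties go to the server's order."""
    weights = parse_accept_encoding(header)
    best, best_q = "identity", 0.0
    for coding in available:
        q = weights.get(coding, weights.get("*", 0.0))
        if q > best_q:  # strict '>' keeps the earlier (preferred) coding on ties
            best, best_q = coding, q
    return best


print(negotiate("zstd, br, gzip"))
print(negotiate("gzip;q=1.0, zstd;q=0.5"))
```

Passing a shorter `available` tuple models the case where an optional compression library is not installed, so the choice falls through to the next acceptable coding.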

Cross-stack reference + spec

Companion PR against vllm-project/vllm filed in parallel. The wdunn001/sglang fork has been carrying this code in production-grade wdunn001/codec-sglang Docker images for two releases (v0.4 and v0.5); cross-client byte-equality is 24/24 cells unanimous as of v0.4.1.

🤖 Generated with Claude Code


CI States

Latest PR Test (Base): ❌ Run #26120396284
Latest PR Test (Extra): ❌ Blocked -- run-ci is required first.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the 'Codec' binary transport protocol to SGLang, enabling efficient token-ID streaming via MessagePack and Protobuf to eliminate text detokenization overhead. Key features include server-side tool-call detection, negotiated transport compression (zstd with dictionaries, Brotli, Gzip), and a version negotiation system for graceful downgrades. Feedback focuses on critical performance and concurrency issues, specifically the synchronous fetching of tool manifests and blocking HTTP calls within the asynchronous request loop. Additionally, improvements were suggested for the efficiency of binary unpacking and the robustness of varint decoding to prevent potential overflows or infinite loops.

Comment thread python/sglang/srt/entrypoints/openai/serving_completions.py Outdated
Comment thread python/sglang/srt/entrypoints/codec_dispatcher.py
Comment thread python/sglang/srt/entrypoints/codec_frame.py Outdated
Comment thread python/sglang/srt/entrypoints/codec_frame.py
wdunn001 added a commit to wdunn001/sglang that referenced this pull request May 17, 2026
…varint

Four fixes from the gemini-code-assist review on sgl-project#25544:

1. codec_frame.py: replace the numpy-free LE-uint32 byte-buffer
   unpack with `struct.unpack(f'<{n}I', b)`, where n = len(b) // 4
   (~10× faster than the per-element list comprehension), and
   reject buffers whose length is not a multiple of 4 instead of
   silently corrupting the IDs.

2. codec_frame.py: inline-replace the manual varint loop in the
   packed `prompt_ids` branch of decode_protobuf_request with one
   that has the same overflow + truncation protection as the
   outer `read_varint` helper. Malformed input now raises
   ValueError instead of running off the buffer end or producing
   a silently-wrong value.

3. codec_dispatcher.py: add `dispatch_call_async`, a
   `asyncio.to_thread`-wrapping variant of `dispatch_call`. The
   sync form does a blocking `urllib.request.urlopen` POST that
   would freeze the event loop if called from an `async def`
   request handler. The sync form stays for non-async callers
   (CLI tools, batch eval drivers).

4. openai/serving_completions.py: cache the `ToolRegistry` on
   `OpenAIServingCompletion` rather than calling
   `ToolRegistry.from_env` per request. `from_env` performs
   blocking HTTP fetches against manifest URLs; the first request
   after process start pays the cost (off-loop via
   `asyncio.to_thread`) and every subsequent request reuses the
   cached registry. Same site now uses `dispatch_call_async` so
   tool dispatches don't block the worker either.

Wire format unchanged. No new dependencies. Existing tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
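
A minimal sketch of the unpack described in item 1, assuming the element count is interpolated into the `struct` format string (a count followed by `I` under the `<` little-endian prefix). The function name is hypothetical and not taken from codec_frame.py.

```python
import struct
from typing import List


def unpack_u32_le(buf: bytes) -> List[int]:
    """Decode a buffer of little-endian uint32 token IDs in one call."""
    if len(buf) % 4 != 0:
        # A ragged tail would otherwise be dropped or misread silently.
        raise ValueError(
            f"uint32 buffer length {len(buf)} is not a multiple of 4"
        )
    count = len(buf) // 4
    return list(struct.unpack(f"<{count}I", buf))


ids = unpack_u32_le(struct.pack("<3I", 1, 2, 70000))
print(ids)  # [1, 2, 70000]
```
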
wdunn001 added a commit to wdunn001/vllm that referenced this pull request May 17, 2026
…varint

Mirrors the sglang fork PR fixes (sgl-project/sglang#25544):

1. codec_frame.py: numpy-free LE-uint32 unpack now uses
   `struct.unpack(f'<{n}I', b)`, where n = len(b) // 4 (~10×
   faster than the per-element list comprehension), and rejects
   buffers whose length is not a multiple of 4 instead of silently
   corrupting the IDs.

2. codec_frame.py: `_decode_varint` gains bounds-check +
   shift-cap (35 bits = 5 bytes, the max uint32 varint width).
   Used by every length-delimited field decode including the
   packed `prompt_ids` loop. Malformed or malicious input fails
   fast with a clear ValueError instead of looping unbounded.

3. codec_dispatcher.py: add `dispatch_call_async`, a
   `asyncio.to_thread`-wrapping variant of `dispatch_call`. The
   sync form does a blocking `urllib.request.urlopen` POST that
   would freeze the event loop if called from an `async def`
   request handler.

4. openai/chat_completion/serving.py: cache the `ToolRegistry`
   on `OpenAIServingChat` rather than calling
   `ToolRegistry.from_env` per request. `from_env` performs
   blocking HTTP fetches against manifest URLs; the first request
   after process start pays the cost (off-loop via
   `asyncio.to_thread`) and every subsequent request reuses the
   cached registry. Same site now uses `dispatch_call_async` so
   tool dispatches don't block the worker either.

Wire format unchanged. No new dependencies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
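
The bounds-check and shift-cap from item 2 can be sketched like this. It is an illustration under assumptions, not the fork's `_decode_varint`: the function name, the return shape, and the extra check against values above 2^32 - 1 are choices made for this example.

```python
from typing import Tuple

MAX_SHIFT = 35  # 5 bytes x 7 payload bits, the widest a uint32 varint can be


def decode_varint_u32(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_pos); raise ValueError on malformed input."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")  # ran off the buffer end
        if shift >= MAX_SHIFT:
            raise ValueError("varint longer than 5 bytes")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):  # high bit clear marks the last byte
            break
    if result > 0xFFFFFFFF:
        raise ValueError("varint overflows uint32")
    return result, pos


print(decode_varint_u32(b"\xac\x02"))  # (300, 2)
```
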
wdunn001 added a commit to wdunn001/vllm that referenced this pull request May 17, 2026
…reaming

Adds a binary streaming wire format (MessagePack and length-prefixed
Protocol Buffers) for /v1/completions and /v1/chat/completions. Opt-in
via the request body field `stream_format`. Default behaviour is
byte-identical to current vllm.

## Why

Real cross-stack measurements (run 2026-05-15T20-00-00Z, 2K-token
completion on Qwen2.5-0.5B-Instruct; full matrix at
https://github.com/wdunn001/Codec/blob/main/packages/bench/results/2026-05-15T20-00-00Z/MATRIX.md):

  | Engine     | JSON-SSE   | Best Codec                       | Reduction  |
  | sglang     | 485.2 KB   | 291 B  (msgpack + dict-zstd)     | 1,707×     |
  | **vllm**   | 517.8 KB   | 3.9 KB (msgpack + gzip)          | **137×**   |
  | llama.cpp  | 528.8 KB   | 140 B  (msgpack + dict-zstd, fp16)| 3,868×    |

vllm's 137× headline is gzip-only because the model's output at temp=0
is content-bound, not protocol-bound — when a dict-zstd path is wired
in (Codec v0.5 adds discoverable .well-known/codec/dicts/), vllm
moves to the same multi-thousand-× range as sglang + llama.cpp.

The savings come from:

1. Not detokenizing at the serving server (no UTF-8 / JSON envelope
   per chunk).
2. Letting the client opt into HTTP-level compression (gzip / br /
   zstd-with-dict) on the binary stream.
3. Skipping the re-tokenize round-trip in agent-to-agent and
   tool-dispatch hops — the consumer reads uint32 IDs directly.

## What this PR adds (additive only)

13 files, +2,032 / -16 lines. All under `vllm/entrypoints/`.

New modules:

- `codec_frame.py` — MessagePack + Protocol Buffers encoders for
  CodecFrame{ids[], done, finish_reason?, tool_calls?}. Hand-rolled
  protobuf (no codegen step). Accepts Union[Sequence[int], np.ndarray
  [uint32], array.array('I'), bytes] on ids so future CODEC_OPENAI_BYPASS
  work (skip the PyLong-list allocation per token in tokenizer_manager)
  can be wired in without further encoder churn.
- `codec_compression.py` — Accept-Encoding negotiation for zstd
  (with pre-trained dictionary), br, gzip, identity. Emits
  Codec-Zstd-Dict: sha256:<hex> on every zstd response.
- `codec_agent.py` — ToolWatcher: uint32-compare state machine
  detecting delimited regions (tool calls, reasoning blocks,
  multimodal spans) in the raw token stream without detokenizing.
  ~100× faster than detokenize+regex.
- `codec_dispatcher.py` — bolt-on tool dispatch
  (CODEC_BOLT_ON_DISPATCH=1, default off). Reads tool manifests
  from CODEC_TOOL_MANIFEST_URLS at boot, hash-validates each
  manifest's tokenizerHash against the active model's tokenizer,
  POSTs CodecToolCall (msgpack-framed), reinjects response_ids into
  the generation stream.
- `codec_version.py` — protocol-version negotiation (Codec-Client-
  Version, Codec-Min-Version headers + 426 Upgrade Required +
  VERSION_INCOMPATIBLE frame).

Modified existing files:

- `openai/completion/serving.py` + `openai/chat_completion/serving.py`
  — dispatch to binary generator when stream_format != "json";
  preserve JSON-SSE path byte-for-byte when unset.
- `openai/completion/protocol.py` + `openai/chat_completion/protocol.py`
  — stream_format, tool_watcher, tool_watcher_start, tool_watcher_end
  fields on request types.
- `openai/completion/api_router.py` +
  `openai/chat_completion/api_router.py` — route registration.
- `openai/server_utils.py` — codec-aware response helpers.

Plus 1 new test file: `test_codec_compression.py`.

## Trust posture / opt-in

- Default behaviour byte-identical to current vllm. Client that
  doesn't set stream_format gets JSON-SSE exactly as today.
- No new mandatory dependencies. Wire emit uses msgspec (already in
  vllm's reqs); compression uses brotli + zstandard (graceful
  fallthrough to identity if missing).
- Engine boot unchanged unless CODEC_BOLT_ON_DISPATCH=1 set;
  dispatcher loads manifests lazily.

## Cross-stack reference + spec

- Wire format spec: https://github.com/wdunn001/Codec/blob/main/spec/versions/v0.5.md
- .well-known/codec/ discovery surface: https://github.com/wdunn001/Codec/blob/main/spec/WELL_KNOWN_DISCOVERY.md
- 6 client-language reference implementations (TS / Python / Rust /
  .NET / Java / C) consume this wire byte-equally:
  https://github.com/wdunn001/Codec/tree/main/packages
- Cross-stack bench: https://github.com/wdunn001/Codec/blob/main/packages/bench/RESULTS.md

Companion PR against sgl-project/sglang filed in parallel
(sgl-project/sglang#25544). The wdunn001/vllm fork has carried this
code in production-grade wdunn001/codec-vllm Docker images for two
releases; cross-client byte-equality is 24/24 cells unanimous as of
v0.4.1.

Signed-off-by: William Dunn <wdunn001@gmail.com>
wdunn001 added a commit to wdunn001/vllm that referenced this pull request May 17, 2026
…varint

Mirrors the sglang fork PR fixes (sgl-project/sglang#25544):

1. codec_frame.py: numpy-free LE-uint32 unpack now uses
   `struct.unpack(f'<{n}I', b)`, where n = len(b) // 4 (~10×
   faster than the per-element list comprehension), and rejects
   buffers whose length is not a multiple of 4 instead of silently
   corrupting the IDs.

2. codec_frame.py: `_decode_varint` gains bounds-check +
   shift-cap (35 bits = 5 bytes, the max uint32 varint width).
   Used by every length-delimited field decode including the
   packed `prompt_ids` loop. Malformed or malicious input fails
   fast with a clear ValueError instead of looping unbounded.

3. codec_dispatcher.py: add `dispatch_call_async`, a
   `asyncio.to_thread`-wrapping variant of `dispatch_call`. The
   sync form does a blocking `urllib.request.urlopen` POST that
   would freeze the event loop if called from an `async def`
   request handler.

4. openai/chat_completion/serving.py: cache the `ToolRegistry`
   on `OpenAIServingChat` rather than calling
   `ToolRegistry.from_env` per request. `from_env` performs
   blocking HTTP fetches against manifest URLs; the first request
   after process start pays the cost (off-loop via
   `asyncio.to_thread`) and every subsequent request reuses the
   cached registry. Same site now uses `dispatch_call_async` so
   tool dispatches don't block the worker either.

Wire format unchanged. No new dependencies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: William Dunn <wdunn001@gmail.com>
wdunn001 and others added 3 commits May 17, 2026 17:34
Adds a binary streaming wire format (MessagePack and length-prefixed
Protocol Buffers) that ships raw uint32 token IDs end-to-end instead of
detokenizing every chunk to UTF-8 and wrapping it in JSON Server-Sent
Events. Opt-in via the request body field `stream_format`.

## Why

Real cross-stack measurements against sgl-project/sglang (this fork's
upstream), vllm-project/vllm, and ggml-org/llama.cpp on Qwen2.5-0.5B-
Instruct, 2,000-token completion (run id 2026-05-15T20-00-00Z, full
matrix at https://github.com/wdunn001/Codec/blob/main/packages/bench/results/2026-05-15T20-00-00Z/MATRIX.md):

  | Engine     | JSON-SSE   | Best Codec                       | Reduction  |
  |------------|-----------:|---------------------------------:|-----------:|
  | sglang     | 485.2 KB   | 291 B  (msgpack + dict-zstd)     | **1,707×** |
  | vllm       | 517.8 KB   | 3.9 KB (msgpack + gzip)          |   137×     |
  | llama.cpp  | 528.8 KB   | 140 B  (msgpack + dict-zstd, fp16)| 3,868×    |

The savings come from:

1. Not detokenizing at the serving server (no UTF-8 / JSON envelope
   per chunk).
2. Letting the client opt into HTTP-level compression (`gzip`, `br`,
   `zstd` with pre-trained dictionaries) on the binary stream.
3. Skipping the re-tokenize round-trip in agent-to-agent and
   tool-dispatch hops — the consumer can read uint32 IDs directly
   and feed them to the next model.

## What this PR adds (additive only; default behaviour unchanged)

New modules under `python/sglang/srt/entrypoints/`:

- `codec_frame.py` — MessagePack + Protocol Buffers encoders for
  `CodecFrame {ids[], done, finish_reason?, tool_calls?}`. Hand-rolled
  protobuf (no codegen step). Accepts `Union[List[int],
  numpy.ndarray[uint32], array.array('I'), bytes]` on the ids
  parameter so a future upstream change to surface output_ids as a
  buffer (avoiding the PyLong-list allocation per token) can be wired
  in without further encoder churn (`CODEC_OPENAI_BYPASS=1`).
- `codec_compression.py` — Accept-Encoding negotiation for `zstd`
  (with pre-trained dictionary), `br`, `gzip`, `identity`. Honours
  the standard preference order; emits `Codec-Zstd-Dict: sha256:<hex>`
  on every zstd response so clients can verify they have the matching
  dict loaded before decompressing.
- `codec_agent.py` — `ToolWatcher`: a uint32-compare state machine
  that detects delimited regions (tool calls, reasoning blocks,
  multimodal spans) in the raw token stream without detokenizing.
  ~100× faster than detokenize+regex in the steady state. Per-request
  opt-in via the `tool_watcher` field.
- `codec_dispatcher.py` — bolt-on tool dispatch (default off,
  `CODEC_BOLT_ON_DISPATCH=1`). Reads tool manifests from
  `CODEC_TOOL_MANIFEST_URLS` at boot, hash-validates each manifest's
  `tokenizerHash` against the active model's tokenizer, POSTs
  `CodecToolCall` (msgpack-framed) to registered tools when
  ToolWatcher fires, reinjects `response_ids` into the generation
  stream — full <tool_call>...</tool_call> loop without ever
  decoding the model's stream to text.
- `codec_version.py` — protocol-version negotiation surface
  (`Codec-Client-Version`, `Codec-Min-Version` headers + 426 Upgrade
  Required + `VERSION_INCOMPATIBLE` frame).
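
As an illustration of what a hand-rolled, length-prefixed protobuf encoder for such a frame might look like, here is a stdlib-only sketch. The field numbers (ids = 1, done = 2, finish_reason = 3), the varint length prefix, and the function names are assumptions made for this example; the authoritative layout is in the wire format spec, and `tool_calls` is omitted.

```python
from typing import Iterable, Optional


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_frame(
    ids: Iterable[int],
    done: bool = False,
    finish_reason: Optional[str] = None,
) -> bytes:
    """Encode one frame as <varint length><protobuf body> (assumed layout)."""
    body = bytearray()
    packed = b"".join(encode_varint(i) for i in ids)
    if packed:
        # Field 1, wire type 2: packed repeated uint32.
        body += b"\x0a" + encode_varint(len(packed)) + packed
    if done:
        # Field 2, wire type 0: bool.
        body += b"\x10\x01"
    if finish_reason is not None:
        raw = finish_reason.encode("utf-8")
        # Field 3, wire type 2: string.
        body += b"\x1a" + encode_varint(len(raw)) + raw
    return encode_varint(len(body)) + bytes(body)


print(encode_frame([1, 300]).hex())  # 050a0301ac02
```

The length prefix lets a reader split a continuous HTTP body into frames without a delimiter scan, which is the usual reason for length-prefixing protobuf messages on a stream.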

5 modified existing files:

- `openai/serving_completions.py` — dispatches to the binary
  generator when `stream_format != "json"`; preserves the existing
  JSON-SSE path byte-for-byte when `stream_format` is unset or
  `"json"`.
- `openai/serving_chat.py` — same dispatch on chat completions.
- `openai/protocol.py` — `stream_format` and `tool_watcher` fields
  on the request types.
- `engine.py` + `http_server.py` — protocol-discovery endpoint
  registration (`GET /codec/schema` returns the protobuf schema text
  for client code generation).

Plus unit tests for the new modules (`test_codec_agent.py`,
`test_codec_compression.py`, `test_codec_version.py`).

## Trust posture / opt-in

- Default behaviour is byte-identical to current sglang. A client
  that does NOT set `stream_format` gets JSON-SSE exactly as today.
- No new mandatory dependencies. The wire emit uses `msgspec`
  (already in sglang's reqs); compression uses `brotli` + `zstandard`
  (stdlib-adjacent; missing → falls through to identity per the
  Accept-Encoding negotiation rules).
- Engine boot is unchanged unless `CODEC_BOLT_ON_DISPATCH=1` is set;
  the dispatcher loads manifests lazily.

## Cross-stack reference + spec

- Wire format spec: https://github.com/wdunn001/Codec/blob/main/spec/versions/v0.5.md
- Compression negotiation: https://github.com/wdunn001/Codec/blob/main/spec/versions/v0.5.md#transport-compression-optional
- `.well-known/codec/` discovery surface: https://github.com/wdunn001/Codec/blob/main/spec/WELL_KNOWN_DISCOVERY.md
- 6 client-language reference implementations (TypeScript, Python,
  Rust, .NET, Java, C) consume this wire format: https://github.com/wdunn001/Codec/tree/main/packages
- Cross-stack bench (the table above): https://github.com/wdunn001/Codec/blob/main/packages/bench/RESULTS.md

This PR ships the sglang half of the protocol. Companion PRs against
vllm-project/vllm and the llama.cpp server already exist (see the
Codec repo's bench results — 24/24 cells per engine pass cross-client
byte-equality at v0.4.1; v0.5 adds an opt-in delta-varint axis +
discoverable zstd dicts).

Signed-off-by: William Dunn <wdunn001@gmail.com>
…varint

Four fixes from the gemini-code-assist review on sgl-project#25544:

1. codec_frame.py: replace the numpy-free LE-uint32 byte-buffer
   unpack with `struct.unpack(f'<{n}I', b)`, where n = len(b) // 4
   (~10× faster than the per-element list comprehension), and
   reject buffers whose length is not a multiple of 4 instead of
   silently corrupting the IDs.

2. codec_frame.py: inline-replace the manual varint loop in the
   packed `prompt_ids` branch of decode_protobuf_request with one
   that has the same overflow + truncation protection as the
   outer `read_varint` helper. Malformed input now raises
   ValueError instead of running off the buffer end or producing
   a silently-wrong value.

3. codec_dispatcher.py: add `dispatch_call_async`, a
   `asyncio.to_thread`-wrapping variant of `dispatch_call`. The
   sync form does a blocking `urllib.request.urlopen` POST that
   would freeze the event loop if called from an `async def`
   request handler. The sync form stays for non-async callers
   (CLI tools, batch eval drivers).

4. openai/serving_completions.py: cache the `ToolRegistry` on
   `OpenAIServingCompletion` rather than calling
   `ToolRegistry.from_env` per request. `from_env` performs
   blocking HTTP fetches against manifest URLs; the first request
   after process start pays the cost (off-loop via
   `asyncio.to_thread`) and every subsequent request reuses the
   cached registry. Same site now uses `dispatch_call_async` so
   tool dispatches don't block the worker either.

Wire format unchanged. No new dependencies. Existing tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: William Dunn <wdunn001@gmail.com>
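
Items 3 and 4 above describe two uses of `asyncio.to_thread`: wrapping a blocking call, and loading a cached object once off the event loop. The sketch below shows the pattern with stand-in classes; `ToolRegistrySketch`, `ServingSketch`, the `time.sleep` calls that simulate blocking I/O, and the lock guarding the first load are all assumptions for illustration, not the PR's code.

```python
import asyncio
import time
from typing import Optional


class ToolRegistrySketch:
    """Stand-in for a registry whose construction does blocking I/O."""

    loads = 0  # counts how many times from_env actually ran

    @classmethod
    def from_env(cls) -> "ToolRegistrySketch":
        time.sleep(0.01)  # simulates blocking manifest fetches
        cls.loads += 1
        return cls()


class ServingSketch:
    """Caches the registry on the serving object after the first request."""

    def __init__(self) -> None:
        self._registry: Optional[ToolRegistrySketch] = None
        self._lock = asyncio.Lock()

    async def get_registry(self) -> ToolRegistrySketch:
        if self._registry is None:
            async with self._lock:  # avoid duplicate loads under concurrency
                if self._registry is None:
                    self._registry = await asyncio.to_thread(
                        ToolRegistrySketch.from_env
                    )
        return self._registry


def dispatch_call(payload: bytes) -> bytes:
    """Blocking stand-in for the synchronous tool POST."""
    time.sleep(0.01)
    return payload[::-1]


async def dispatch_call_async(payload: bytes) -> bytes:
    """Run the blocking call in a worker thread so the loop stays responsive."""
    return await asyncio.to_thread(dispatch_call, payload)
```

Constructing `ServingSketch` inside the running coroutine (for example at the top of the function passed to `asyncio.run`) keeps the lock bound to the right event loop on older Python versions.
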
Per vllm PR (vllm-project/vllm#42896) bot review — same fix here for
parity. If a manifest URL returns a JSON list, scalar, or null, the
existing `if required not in parsed` check raises TypeError with an
unhelpful message. Reject non-dict shapes up front with a clear
ValueError that names the URL and actual type.

Also document why `_fetch_manifest` stays synchronous: it's only
called from `ToolRegistry.from_env`, which the engine wraps in
`asyncio.to_thread`, so the blocking urlopen never runs on the
request event loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: William Dunn <wdunn001@gmail.com>
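
The up-front shape check described in this commit could look roughly like the sketch below. The function name, the example URL, and the required-key tuple are hypothetical; `tokenizerHash` is the only manifest key named in this PR, so it is the only one listed.

```python
import json
from typing import Any, Dict

REQUIRED_KEYS = ("tokenizerHash",)  # only key named in the PR text; others unknown


def validate_manifest(url: str, raw: bytes) -> Dict[str, Any]:
    """Parse a manifest body and reject non-object JSON with a clear error."""
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError(
            f"manifest at {url} must be a JSON object, "
            f"got {type(parsed).__name__}"
        )
    for required in REQUIRED_KEYS:
        if required not in parsed:
            raise ValueError(f"manifest at {url} is missing {required!r}")
    return parsed


manifest = validate_manifest(
    "https://tools.example/manifest.json",  # hypothetical URL
    b'{"tokenizerHash": "sha256:abc"}',
)
print(manifest["tokenizerHash"])
```
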
@wdunn001 wdunn001 force-pushed the pr/codec-binary-transport branch from c400209 to 8b2d2f8 Compare May 17, 2026 21:34
wdunn001 added a commit to wdunn001/codec-website that referenced this pull request May 18, 2026
- Hero eyebrow: v0.4.1 shipping -> v0.5.0 shipping
- Benchmarks card image refs: codec-sglang:v0.4.1 -> :v0.5.0,
  (all v0.4.1) -> (all v0.5.0)
- /changelog/ gains 2026-05-18-v0-5-efficiency-observability.md
  covering the 4 new opt-in wire surfaces (delta-varint,
  discoverable zstd dicts, GPU latent quantize, bolt-on tool
  dispatcher), the 11-artifact cohort, the engine cohort change
  (TGI dropped), bench unchanged at byte level (wire-additive
  invariant), upstream PRs at sgl-project/sglang#25544 +
  vllm-project/vllm#42896, IETF I-D status.

Historical v0.4.1 references in bench card subtitles / page-
section comments / protocol-map descriptions left in place; they
document when features landed and remain accurate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
wdunn001 added 2 commits May 19, 2026 15:16
Pure-style fixes flagged by sglang's lint job on PR sgl-project#25544:

  isort: codec_version, http_server, test_codec_version
  ruff (F401): drop unused hashlib (codec_dispatcher),
              dispatch_call (serving_completions),
              FastAPI/TestClient (test_codec_version)
  black:  codec_dispatcher, codec_frame, http_server,
          openai/serving_completions,
          test_codec_compression, test_codec_version

No behavioural changes.