feat(router): openai-harmony bypass for gpt-oss tool calls (closes #444 #455 #468 #480, #513)#515
Merged
Merged
Conversation
#455 #468 #480 Issue #513 — the custom Gemma 4–style OutputRouter state machine cannot model harmony's tool-call protocol reliably. Production ``commentary`` is two tokens (``comment`` + ``ary``) on gpt-oss-20b, so a single-token channel-type match never fires; the recipient (``functions.<name>``) and constrain directive (``<|constrain|>json``) are multi-token literals the naive state machine swallows or leaks. PR #514 confirmed this empirically and deferred the full e2e fix to a follow-up. This PR lands the SOTA fix: delegate harmony state tracking to ``openai-harmony.StreamableParser`` — the same library vLLM and SGLang delegate to for gpt-oss tool calling. The new ``HarmonyStreamingRouter`` shim mirrors ``OutputRouter``'s public surface (``feed`` / ``finalize`` / ``reset`` / ``feed_sequence`` / ``map``), so the engine streaming and non-stream sequence paths pick it up without changes. The legacy custom state machine remains in ``OutputRouter`` as the documented fallback when the optional ``openai-harmony`` dep is unavailable. Compatibility gate: ``HarmonyStreamingRouter`` is only selected when the model's harmony special-token IDs match the openai-harmony encoding's IDs (verified for upstream gpt-oss family at PR-time; encoded ``<|channel|>``, ``<|message|>``, ``<|call|>``, ``<|end|>``, ``<|return|>``, ``<|start|>``, ``<|constrain|>`` through both the model HF tokenizer and the harmony encoding and asserted set equality). For synthetic test vocabs that map structural markers to different IDs, the legacy router still runs — keeping the legacy test suite green without coercion. Routing rules (delegated to StreamableParser): - analysis channel → Channel.REASONING with last_content_delta - final channel → Channel.CONTENT with last_content_delta - commentary channel + recipient: body deltas buffered; emit a single Channel.TOOL_CALL event on message close, carrying the reconstructed wire-format text so the downstream HarmonyToolParser consumes it unchanged. - finalize() runs process_eos so truncated commentary messages still surface as tool calls instead of being dropped silently. New entry point: ``OutputRouter.from_tokenizer_for_streaming(tokenizer)`` — production factory used by ``BatchedEngine._create_output_router``. Returns ``HarmonyStreamingRouter`` for matched-vocab harmony models, legacy ``OutputRouter`` for everything else. ``from_tokenizer`` remains unchanged so legacy tests instantiate the legacy router directly. Regression file ``tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py`` pins closure of #444/#480 (commentary tool call no marker leak), #455 (analysis + commentary channel separation), #468 (compound analysis + commentary + final), plus general invariants (per-token streaming, truncated-message finalize drain). Skips gracefully when the optional ``openai-harmony`` dep is missing. Dependency: ``openai-harmony >= 0.0.6`` added to project deps. Pure Python + a Rust extension, ~600KB on disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hitespace + sanitize recipient Three BLOCKING findings from pr_validate round-1: 1. `is_openai_harmony_compatible` only verified marker IDs; a tokenizer with matching markers but a different body-token vocabulary would feed `StreamableParser` IDs that decode through the wrong vocab, corrupting streamed content and tool-call arguments. Tighten the gate to also probe a representative set of body strings through both encoders and require ID equality. This auto-resolves the 7 unit regressions in `test_engine_router_non_stream.py` and `test_batched_engine_output_router.py` whose synthetic vocabs had the smoking-gun mismatch (`Reason`=1 / `ing`=2 decoded to `"#` under harmony encoding). 2. `feed_sequence().strip()` silently mutated content/reasoning whose harmony body legitimately begins or ends with whitespace (markdown blocks, code fences, formatted output). Preserve verbatim; map only the exact empty string to `None`. 3. `_reconstruct_tool_call_text()` interpolated `recipient` verbatim into the `to=<recipient>` wire text consumed by `HarmonyToolParser`, so a recipient containing whitespace, marker-like characters, or an unexpected shape could break the downstream parser. Validate against `^functions\.<safe-name>$` and raise on malformed input. Updated `from_tokenizer_for_streaming` to pass the tokenizer to the gate. Added 4 invariant tests (whitespace preservation, recipient shape rejection, gate rejects no-encode tokenizer, gate rejects mismatched body vocab) under `tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…515 round-1 follow-up) Live verify on gpt-oss-20b uncovered a non-stream regression masked by unit tests: `_route_tokens_for_channels`'s content override was designed for the legacy router, which could not classify multi-token ``commentary`` and therefore left ``routed.content`` non-None on tool calls. The new ``HarmonyStreamingRouter`` correctly classifies commentary as a TOOL_CALL event, so ``routed.content`` becomes None even though the model DID emit a structural sequence the downstream ``HarmonyToolParser`` needs to parse. The override then clobbered ``fallback_text`` (which still carries the commentary wire text — ``_clean_gpt_oss_output`` bails on commentary), and the route's ``_parse_tool_calls_with_parser`` received an empty string. Fix: skip the override when ``routed.tool_calls`` is non-empty. The override remains in force for the original #442 case (analysis-only, no final, no tool call → return empty content so the analysis body does not leak into the user-visible message). Verified post-fix: - Non-stream tool call on gpt-oss-20b: surfaces correctly - Streaming tool call: still correct - Non-stream analysis + final (no tools): still correct, no marker leak - pydantic_ai_full: 6/6 pass - anthropic_sdk: 5/5 pass - langchain: 6/6 pass - 28-test unit panel (tests/test_engine_router_non_stream.py + tests/test_batched_engine_output_router.py + the harmony bypass regressions) all green. Pinning invariant added: ``test_tool_call_routing_preserves_fallback_text`` — fakes a router returning ``tool_calls=[wire_text]`` and asserts the fallback_text survives intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n + perf/log nits
Two BLOCKING and two NIT findings from pr_validate round-2:
1. BLOCKING — recipient regex too strict
`^functions\.[A-Za-z_]...` rejected OpenAI-spec valid tool names
whose first character is a digit (e.g. `functions.2fa_lookup`).
OpenAI documents `[a-zA-Z0-9_-]{1,64}` for function names; match
that exactly so any tool the upstream API accepts also round-trips
through the router.
2. BLOCKING — body-vocab probe set alone insufficient
Three probe strings cannot prove full-vocab parity. A tokenizer
with matching markers AND matching probes but a remapped uncommon
token could still corrupt content later. Add tokenizer-identity
allowlist as the first gate layer: require `tokenizer.name_or_path`
to contain `gpt-oss` (case-insensitive substring) so quant-suffix
renames (`-MXFP4-Q8`) and org prefixes (`openai/` vs
`mlx-community/`) all resolve correctly while strangers are
rejected up-front. The marker + probe checks remain as
defense-in-depth.
3. NIT — `feed_sequence` O(n²) string concat
Replace `content += event.text` / `reasoning += event.text` in the
per-token loop with list-append + `"".join(...)` at return time.
On long non-stream generations the old form copied the
accumulator on every token.
4. NIT — `logger.info` per-request spam
The streaming factory runs on every `/v1/chat/completions` call.
Downgrade the "upgraded harmony router" message to DEBUG so it
doesn't flood production logs.
Plus ruff format on the three affected files (round-1 commit left
them at non-canonical line shapes).
Pinning invariants added:
- `test_compat_gate_rejects_unknown_tokenizer_identity` — Mistral-named
tokenizer with matching encoder is still rejected.
- `test_compat_gate_accepts_gpt_oss_quant_suffix_variants` — all
`mlx-community/gpt-oss-*` and `openai/gpt-oss-*` variants accepted;
arbitrary names rejected.
- `test_recipient_shape_accepts_digit_start_names` — `functions.2fa_lookup`,
`functions.0_index`, `functions.123`, `functions.tool-with-dash`.
Live verify on gpt-oss-20b post-fix:
- non-stream tool call: still surfaces with correct args
- analysis + final: still routes correctly, zero marker leak
- log spam: NONE during normal request (was 1 INFO line per request)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uncated bodies + cache compat probe
Round-3 findings from pr_validate:
1. BLOCKING — non-stream path discarded `routed["tool_calls"]`
When `HarmonyStreamingRouter.finalize()` synthesized a TOOL_CALL
event for truncated commentary (no `<|call|>` in the raw output),
`_clean_gpt_oss_output`'s bail-out condition might not match —
the strip regex then removed the commentary header from
`fallback_text`, and the route's text-based `HarmonyToolParser`
saw no tool call at all. Append the reconstructed wire text from
`routed["tool_calls"]` to `fallback_text` so the downstream parser
can extract the call. Suppressed when `fallback_text` already
carries a commentary marker (avoids double-parsing the same call).
2. BLOCKING — finalize() surfaced mid-JSON truncations as completed
`StreamableParser.process_eos` appends the in-progress commentary
message to `messages` even when the body was cut off mid-stream
(max_tokens limit hit in the middle of `{"city":"NY`). The router
then synthesized a TOOL_CALL event with corrupt arguments. Add
`_tool_call_body_is_valid` — when the message's `content_type`
indicates JSON, require `json.loads(body)` to succeed; empty
bodies and JSON-decode failures cause `finalize()` to drop the
event. Sibling test `test_finalize_drains_truncated_commentary_message`
already pins that a COMPLETE-but-unterminated body still surfaces;
the new tests pin that mid-JSON / empty bodies do NOT.
3. NIT — per-request compat probe + harmony encoding load
`is_openai_harmony_compatible` ran the three-layer gate (marker
IDs + body-vocab probes) on every `/v1/chat/completions` call, and
each call re-loaded the harmony encoding. Cache both: a
module-level `_COMPAT_RESULT_CACHE` keyed by lowercased
`name_or_path` short-circuits subsequent calls, and a
`_HARMONY_ENCODING_CACHE` reuses the loaded encoding. Both fall
back to the slow path if the cache is cleared.
New invariants pinned:
- `test_finalize_drops_mid_json_truncation` — `{"city":"NY` → no event
- `test_finalize_drops_truncated_commentary_with_empty_body` —
empty body → no event
- `test_compat_gate_caches_per_tokenizer_identity` — repeat calls
hit the cache (no encode invocations)
- `test_finalize_drained_tool_call_supplements_fallback_text` —
engine appends wire text when fallback_text lacks markers;
idempotent when commentary is already present.
Existing tests with reused fixture identities re-keyed to unique
`name_or_path` strings so the new result cache doesn't cross-
contaminate them.
Live verify on gpt-oss-20b post-fix:
- non-stream tool call still surfaces
- analysis + final still routes correctly
- log spam: still silenced
35/35 router-related unit tests green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y marker IDs Round-4 findings from pr_validate: 1. BLOCKING — gate skipped missing marker IDs via ``continue`` ``if model_id is None: continue`` meant a tokenizer with only a subset of harmony markers in its vocab could pass the gate. StreamableParser needs ALL seven (``<|channel|>`` / ``<|message|>`` / ``<|call|>`` / ``<|end|>`` / ``<|return|>`` / ``<|start|>`` / ``<|constrain|>``) to function — a partial tokenizer would crash inside the parser when the model emitted a marker the gate didn't check. Replace ``continue`` with ``return False``. Any production gpt-oss tokenizer has all seven; tokenizers that don't are not harmony-compatible. 2. BLOCKING — finalize() didn't drain post-EOS final/analysis delta ``process_eos()`` may surface a ``last_content_delta`` from a truncated final or analysis message that was buffered inside StreamableParser. The previous finalize() only checked for freshly-closed commentary messages and discarded any delta. Add a post-EOS drain that routes the delta to CONTENT (final) or REASONING (analysis) using the ``pre_eos_channel`` snapshot (the post-EOS channel becomes ``None``). Empirically the current openai-harmony build emits everything during ``feed()``, but the contract is pinned so future builds can't silently regress. 3. NIT — compat cache key was lowercased ``name_or_path`` only Two tokenizer instances with the same identity string but distinct marker IDs (e.g. test mocks, or a model loaded with a vocab override) would share a stale cached True/False. Include the marker-ID tuple in the cache key. New invariants pinned: - ``test_finalize_does_not_double_emit_completed_final`` — a normal final-channel response with ``<|return|>`` emits via per-token feed only; finalize() must return None. - ``test_finalize_drains_truncated_final_buffered_delta`` — when EOS surfaces a buffered final/analysis delta, finalize() routes it to the correct channel. - ``test_compat_gate_rejects_missing_marker_ids`` — TokenMap with ``harmony_call=None`` is rejected. - ``test_compat_gate_cache_segregates_by_marker_ids`` — same identity + distinct marker tuples → distinct cache entries. Live verify post-fix: - non-stream tool call still surfaces correctly - 39/39 router-related unit tests green Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… test Round-5 codex findings (2 BLOCKING, 0 NIT — converging fast): 1. BLOCKING — engine bulk guard suppressed multi-tool finalize drain ``"<|channel|>commentary" not in fallback_text`` rejected the entire ``routed["tool_calls"]`` append when ANY commentary marker was already present. In a multi-tool turn (model emits call #1 cleanly with ``<|call|>``, gets truncated mid-call-#2), the fallback already carries call #1's marker so call #2 was silently dropped. Replace with per-wire-text substring check: append each reconstructed tool call only when its exact text is missing from fallback_text. Idempotent on call #1 + appends call #2. 2. BLOCKING — compat-cache test was non-discriminating ``test_compat_gate_caches_per_tokenizer_identity`` used an identity outside the allowlist, so the gate short-circuited BEFORE ``encode`` was ever invoked. The test would pass even if the cache were removed entirely. Redesign with a real gpt-oss identity + matching marker IDs so the gate reaches the body-vocab probe step; only the mismatching probe response causes False, and the second call MUST be short-circuited by the cache (asserted via encode-call counter). Updated test_finalize_drained_tool_call_supplements_fallback_text to match the new per-wire-text contract: Case B now pins the "same call must not be doubled" invariant, while the new ``test_multi_tool_fallback_with_existing_marker_still_appends_new`` pins the multi-tool drain. 40/40 router-related unit tests green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ynthesis
Adopt vLLM / SGLang same default: HarmonyStreamingRouter.finalize() now
only flushes parser state via process_eos and returns None. It no longer:
* synthesizes a TOOL_CALL from a truncated commentary message (no
<|call|>) — a max_tokens / stop cutoff must not be executed as if
the model had emitted the call terminator, even if the partial
body parses as JSON. Rescuing the rare recoverable case is dwarfed
by the cost of one wrongly-invoked side-effecting tool.
* drains a post-EOS last_content_delta for final / analysis — per-
token feed() already produced every body byte; emitting another
delta risks duplicating the last token if a future openai-harmony
build does not clear last_content_delta between flushes.
Both behaviors were the last two codex round-5/6 BLOCKING findings on
PR #515 (closes #444 #455 #468 #480, lands #513). They were originally
added to recover edge cases but the analysis above shows the safer
default is to drop them; the helper _tool_call_body_is_valid + the
json import are removed alongside since they are no longer reachable.
Tests:
* test_finalize_drains_truncated_commentary_message → rename to
test_finalize_never_synthesizes_truncated_commentary, assert None.
* test_finalize_drains_truncated_final_buffered_delta → rename to
test_finalize_does_not_drain_post_eos_buffered_delta, assert None
on both final and analysis channels.
* test_finalize_drops_mid_json_truncation /
test_finalize_drops_truncated_commentary_with_empty_body /
test_finalize_does_not_double_emit_completed_final already pinned
drained == None — docstrings refreshed to attribute the contract
to the safer-default rule.
29 targeted tests still pass; engine-side multi-tool dedup tests are
unaffected because they exercise the routed["tool_calls"] append loop,
which still fires from feed()-time TOOL_CALL events (completed tool
calls with <|call|> closing the message).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ural dedup Two fixes flagged by codex on PR #515 round-7: BLOCKING (output_router_harmony.py:175): _reconstruct_tool_call_text reinserted the tool-call body into a delimiter-based harmony wire text consumed by HarmonyToolParser. A body containing literal harmony sentinel strings (``<|call|>``, ``<|message|>``, etc.) would let the downstream parser anchor on the embedded sentinel and truncate the JSON arguments. On gpt-oss those sentinels are normally single special tokens never emitted as body text, but an adversarial prompt could echo the raw characters back via body vocab. Reject (raise ValueError) rather than escape — the engine's outer try/except then falls back to the legacy text-based parser instead of feeding corrupt args downstream. Consistent with the recipient-validation BLOCKING from round-1. ``content_type`` also gets a sentinel scan, but the legitimate ``<|constrain|>`` prefix is stripped first so the framing carrier doesn't trip its own gate. NIT (engine/batched.py:1373): De-duping reconstructed tool calls with verbatim substring match doubled calls when the model emit and router reconstruction differ only by whitespace runs (e.g. ``to=foo <|constrain|>`` vs ``to=foo <|constrain|>``). Replace with a structural compare on the ``(recipient, normalized-body)`` tuple extracted via the new ``_HARMONY_TC_SIGNATURE_RE``. Robust to any spacing the model picks while still distinguishing different arguments for the same recipient. Falls back to verbatim match on non-canonical reconstructions (better to risk one duplicate than silently drop). Tests: * test_reconstruct_tool_call_rejects_body_carrying_harmony_sentinel pins each of the seven forbidden sentinels and a smuggled content_type. Sanity: clean JSON + ``<|constrain|>json`` ctype still reconstruct. * test_fallback_dedup_normalizes_whitespace_variance pins the spacing-variant case (same call) and the same-recipient-different- body case (must still append). 31 targeted tests pass (parsers regression + engine non-stream). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…encode validation Three BLOCKINGs + one NIT from codex on PR #515 round-8: BLOCKING engine/batched.py:1313 — Body whitespace collapse broke distinct legitimate calls. The signature was returning ``" ".join(body.split())`` so ``{"q":"a b"}`` and ``{"q":"a b"}`` deduped as the same call and the later one was silently dropped. Use the exact body bytes captured between ``<|message|>`` and ``<|call|>``; the regex's ``\s*`` already absorbs the only place spacing can legitimately differ (between ``to=`` and ``<|constrain|>``). BLOCKING engine/batched.py:1397 — When ``_harmony_tool_call_signature`` returned ``None``, the previous "best-effort append" fed non-canonical wire text to HarmonyToolParser, which either fails to parse (best case) or anchors on the wrong markers and surfaces a corrupt tool call (worst case). Skip with a debug log instead — the router has already lost the tool-call signal, nothing safe to recover. BLOCKING output_router_harmony.py:334 — ``list(model_ids)`` assumed ``tokenizer.encode()`` returns a flat ``list[int]``, but HF ``BatchEncoding`` and tensor-shaped returns either raise on ``list()`` or yield opaque non-int entries. Validate that the encode result is a flat int sequence; on anything else fall back to ``False`` so the engine uses the legacy router instead of crashing. NIT output_router_harmony.py:274 — ``is_openai_harmony_compatible`` called both ``is_openai_harmony_available()`` and ``_get_harmony_encoding()`` to check the dep; the latter already returns ``None`` on import failure. Drop the redundant probe. Tests: * test_compat_gate_rejects_non_int_encode_result — covers both BatchEncoding-style iter (dict keys) and tuple-of-strings. * test_fallback_dedup_preserves_internal_body_whitespace — pins that ``{"q":"a b"}`` and ``{"q":"a b"}`` are NOT collapsed. * test_non_canonical_router_wire_text_is_dropped — pins that a missing ``<|call|>`` terminator causes the wire text to be dropped, not appended. 34 targeted tests pass. Round-7 ``test_fallback_dedup_normalizes_whitespace_variance`` still passes because the spacing variance lives in the wrapper, not the body — that regex ``\s*`` still absorbs it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gex + sentinel abstain Three BLOCKINGs from codex on PR #515 round-9: BLOCKING engine/batched.py:1446 — Set dedup silently dropped a legitimate second identical tool call in the same turn (e.g. user asks for the same weather lookup twice). Switch to ``Counter`` so each existing match in fallback_text consumes ONE routed call; multiplicity is preserved for both sides. BLOCKING engine/batched.py:1311 — Signature regex matched bare ``to=functions.X<|message|>...<|call|>`` without the ``<|channel|>commentary`` prefix. A fallback_text where ``clean_output_text`` stripped the commentary header but left the tail (or any ordinary content carrying those substrings) would false-suppress the router's real wire text. Anchor the regex on the full commentary frame so only canonical tool-call shapes count as existing signatures. BLOCKING output_router_harmony.py:217 — Round-7's hard ``raise ValueError`` on body-sentinel substrings dropped legitimate calls whose OpenAI tool-call JSON values legally contain ``<|...|>`` characters. Flip the contract: ``_reconstruct_tool_call_text`` now returns ``str | None`` and ABSTAINS (returns None) on sentinel-laden body or content_type. The caller (``feed()``) emits no router event; the engine's outer plumbing leaves the model's raw fallback_text untouched and the legacy text-based HarmonyToolParser handles the call (with the same parse limitation every text-based harmony consumer has — no worse than baseline). Recipient still ``raise``s because recipient corruption is structural, not body-content- dependent. Tests: * test_reconstruct_tool_call_abstains_on_body_carrying_harmony_sentinel (renamed from rejects_..., now asserts None instead of raises). * test_identical_calls_twice_in_turn_both_survive_dedup pins the multiplicity contract on both balanced (2+2 → 2) and unbalanced (1+2 → 2) cases. * test_signature_regex_requires_commentary_frame pins that a stripped fallback_text bare tail does NOT false-suppress. 36 targeted tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aming visibility Codex round-10 BLOCKING (output_router_harmony.py:292): when ``_reconstruct_tool_call_text`` abstains for a body containing a sentinel substring, the per-token feed loop had already suppressed commentary body deltas (we buffer until close), so emitting nothing on close — the round-9 behavior — made the entire tool call invisible to the streaming user. That regressed against baseline, where the PR-#515-replaced legacy router leaked commentary body into CONTENT verbatim (one of the original bugs being fixed, but at least visible). Fix: surface the accumulated body bytes as a CONTENT event when abstaining. The streaming user sees the tool call's intent (``{"text":"use <|call|>"}``) — semantically wrong channel, but matches baseline behavior and breaks no other contract. The engine non-stream path is unaffected: it relies on ``fallback_text``, which still carries the model's raw emit and reaches HarmonyToolParser the same way it would for any harmony tool call. Honors all three preceding rounds: * Round-7 (don't emit corrupt structured event) — no TOOL_CALL event is emitted; the structured-path contract is preserved. * Round-9 (don't drop legitimate calls with sentinel chars in JSON values) — the call is still observable via the streaming CONTENT delta and the non-stream fallback_text. * Round-10 (don't drop streaming visibility) — body now flushed. Tests: * test_reconstruct_tool_call_abstains_on_body_carrying_harmony_sentinel docstring updated to attribute the contract to round-10. * test_sentinel_body_emits_content_event_for_streaming_visibility pins the new ``feed()`` contract: TOOL_CALL stays None on abstain, but CONTENT carries the body bytes verbatim. 37 targeted tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… + ctype shape
Codex round-11 surfaced two BLOCKINGs and two NITs in fresh territory
(no oscillation with rounds 7/9/10):
BLOCKING output_router_harmony.py:541 — ``known in name_lc`` substring
matching let any identity carrying ``gpt-oss`` anywhere in its name
pass the allowlist (``my-not-gpt-oss``, ``foo/bar-gpt-oss-malicious``).
Switch to anchored regex ``(?:^|/)gpt-oss(?:-|$)`` so the basename
must START with ``gpt-oss-`` (after a path separator or string start)
and either continue with ``-`` or end. Canonical names
(``openai/gpt-oss-20b``, ``mlx-community/gpt-oss-20b-MXFP4-Q8``,
bare ``gpt-oss-20b``) still pass; tail-substring fakes don't.
BLOCKING output_router_harmony.py:507 — Cache key
``(name_or_path, marker_ids)`` allowed two distinct tokenizer
instances with the same name + markers but a remapped body vocab to
share a single cached True. Fold ``id(tokenizer)`` into the key —
distinct instances re-probe authoritatively, same-instance reuses
cleanly with zero-encode cache hits (cheaper than a body-probe
fingerprint and answers codex's "tokenizer instance" suggestion
directly).
NIT output_router_harmony.py:226 — ``content_type`` was emitted
verbatim, so a bare enum value like ``"json"`` would reconstruct
non-canonical wire text. Validate against
``^<\|constrain\|>[A-Za-z0-9_\-]{1,32}$`` and abstain (return None)
on anything outside that shape. The regex implicitly rejects embedded
sentinels too, subsuming the round-7 smuggled-sentinel content_type
scan.
NIT tests/test_engine_router_non_stream.py:225 — Round-6 made
``HarmonyStreamingRouter.finalize()`` always return None, so the
"finalize-synthesized truncated tool call" framing in this test's
docstring was stale (the fake-router stub actually stands in for the
``feed_sequence`` path that surfaces a completed ``<|call|>``-closed
tool call). Renamed to
``test_routed_tool_call_supplements_fallback_text`` and rewrote the
docstring to attribute the contract to ``feed_sequence`` not
``finalize``.
Tests:
* test_compat_gate_anchored_allowlist_rejects_tail_substring_fake
pins 4 rejected tail-substring names + 4 accepted canonical names.
* test_compat_cache_key_segregates_by_tokenizer_instance pins that
two distinct instances with the same name+markers but mismatched
body vocab segregate correctly.
* test_reconstruct_tool_call_abstains_on_body_carrying_harmony_sentinel
adds bare-``"json"`` ctype + bad-char ctype abstain coverage.
39 targeted tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-12 surfaced 2 BLOCKINGs and 1 NIT that compound into a
real production bug (full_unit caught it via
``test_compat_cache_key_segregates_by_tokenizer_instance``).
BLOCKING / NIT output_router_harmony.py:551 — Round-11 keyed the
cache on ``(name_lc, marker_ids, id(tokenizer))``. ``id()`` is the
Python object's memory address; CPython is free to reuse it after the
object is garbage-collected, so a tokenizer freed between requests
could have its address reused by a NEW (potentially incompatible)
tokenizer that then hits the stale ``True`` and is wrongly upgraded
to the harmony router. The cache also grew unbounded across model
reloads. Replace with a ``WeakKeyDictionary`` keyed on the tokenizer
object itself, with an inner ``dict[(name_lc, marker_ids), bool]``
for per-TokenMap segregation. Entries auto-clear on GC — no id-reuse
hazard, bounded growth. Non-weak-referenceable tokenizers (some test
mocks, ``__slots__`` classes without ``__weakref__``) fall through
to recompute on every call — correctness preserved at a small cost.
BLOCKING output_router_harmony.py:106 — Round-11's anchored basename
allowlist ``(?:^|/)gpt-oss(?:-|$)`` still accepted arbitrary owners
(``some-user/gpt-oss-remapped``, ``evil-org/gpt-oss-20b``). Restrict
to known owner prefixes: ``openai/``, ``mlx-community/``,
``unsloth/``, plus the bare basename form. Any other owner is
rejected even if its basename starts with ``gpt-oss-`` — owner
provenance is now part of the identity gate, not just basename
shape. Combined with the body-vocab probe set this triples the
defense.
Tests:
* test_compat_gate_anchored_allowlist_rejects_tail_substring_fake
grows to include ``some-user/gpt-oss-remapped``,
``evil-org/gpt-oss-20b``, ``anonymous/gpt-oss`` in the reject
list; ``unsloth/gpt-oss-20b-MLX-8bit`` joins the accept list.
* test_compat_cache_key_segregates_by_tokenizer_instance no longer
leaks across full_unit runs (WeakKeyDict + instances kept in
scope per test segregate cleanly).
* test_compat_gate_cache_segregates_by_marker_ids rewritten for
the new WeakKeyDict layout (single tokenizer instance probed
with two distinct TokenMaps must produce two inner entries).
* test_compat_gate_caches_per_tokenizer_identity no longer needs
manual cache cleanup — fresh tokenizer instance is naturally
isolated by the WeakKeyDict.
39 targeted tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xact-count test
Codex round-13 found 1 BLOCKING + 2 NITs after the round-12 allowlist
tightening:
BLOCKING output_router_harmony.py:593 — the round-12 owner-prefix
allowlist was anchored at ``^owner/...``, which silently rejected
LOCAL filesystem paths (``/models/gpt-oss-20b``,
``~/.cache/.../gpt-oss-20b``, ``./gpt-oss-20b-quantized``). Production
deployments that load models by absolute path then fell back to the
legacy leaking router. Replace the flat regex list with a two-tier
``_is_known_harmony_identity`` helper:
* Bare basename (no ``/``) — accept iff basename matches gpt-oss-*.
* Local filesystem path (``/``, ``~``, ``./``, ``../`` prefix) —
trust the basename match (user-controlled artifact, body-probe
set is the authoritative vocab check).
* Remote HF ID (``owner/name`` shape) — owner must be in
{openai, mlx-community, unsloth}.
Identical defense surface as round-12 for remote IDs, plus the local-
path support the round-12 tightening accidentally broke.
NIT output_router_harmony.py:213 — ``_reconstruct_tool_call_text``
built the body with quadratic ``+=`` accumulation. Switch to list +
``"".join`` once; same pattern as the round-2 NIT fix in
``feed_sequence``. Avoidable copy cost on large JSON tool arguments.
NIT tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py:213
— ``test_per_token_streaming_routes_one_event_per_body_token`` claimed
"one event per body token" but only asserted ``>= 2``. Coalescing or
dropped intermediate deltas would have passed silently. Compute the
exact body-token count from the encoded sequence (positions between
``<|message|>`` and ``<|return|>``) and assert event count EQUALS
body-token count.
Tests:
* test_compat_gate_anchored_allowlist_rejects_tail_substring_fake
extends ``accepted_names`` with 5 local-path shapes (absolute,
HF-cache, tilde, dot-relative, parent-relative).
* test_per_token_streaming_routes_one_event_per_body_token
strengthened to exact-count assertion.
39 targeted tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-15 refactor closing PR #515 codex round-12 / round-14 BLOCKING: eliminate the sentinel-delimited wire-text round-trip for harmony tool calls; surface structured {"name", "arguments"} all the way through the engine and routes. Matches vLLM / SGLang exactly. Why this round: The previous design reconstructed harmony commentary messages back into wire text (<|channel|>commentary to=functions.X<|message|> {body}<|call|>) so the downstream text-based HarmonyToolParser could regex-parse it again. That round-trip lost tool calls whose JSON arguments contained literal harmony marker substrings (e.g. {"text":"<|call|>"}) — the regex anchored on the embedded sentinel and silently truncated the call. Codex flagged this for the third time in round 14; no amount of body-scrubbing on the reconstruction side could fix it without dropping legitimate calls. The fix is the one codex proposed (and the one vLLM and SGLang already ship): pass structured (name, arguments) data through the pipeline. The text-based parser stays in place for legacy / non- router paths; the router-fed path bypasses it entirely. Changes: - GenerationOutput: new tool_calls: list[dict] | None field. Appended after raw_text / reasoning_text to preserve positional-arg compatibility. - RouterEvent: new tool_call: dict | None field. Populated on Channel.TOOL_CALL events by routers that natively parse tool calls. - HarmonyStreamingRouter: feed() emits RouterEvent with the structured payload on commentary message close; feed_sequence() returns list[dict] entries on routed["tool_calls"]. Removed the wire-text reconstruction + sentinel-body abstain logic (no more reconstruction, no more sentinels to worry about). - HarmonyStreamingRouter.identity gate: tolerate HuggingFace cache snapshot paths like ~/.cache/huggingface/hub/models--openai-- gpt-oss-20b/snapshots/<sha> via a new models--<owner>--<name> pattern. Round-14 BLOCKING #1: previous basename-only check rejected the SHA snapshot dir and fell back to the leaking legacy router for every HF-cache-loaded gpt-oss model. - BatchedEngine._route_tokens_for_channels: returns (reasoning, content, structured_tool_calls). Removed the Counter-based wire-text dedup machinery (no more wire text to dedup). When structured tool calls are present, force content to the router's CONTENT-channel result so un-cleaned commentary headers in fallback_text don't bleed into user-facing content. - BatchedEngine._make_routed_output: propagate structured tool_call payload from RouterEvent → GenerationOutput.tool_calls for the streaming path. - _parse_tool_calls_with_parser: new structured_tool_calls= kwarg. When non-None the text-based parser is bypassed entirely; the structured entries are converted to ToolCall objects directly. - routes/chat.py + routes/anthropic.py (non-stream): pass output.tool_calls into the helper. - routes/anthropic.py (stream): accumulate engine-surfaced tool_calls on a per-chunk basis and pass the accumulated list to the end-of-stream parser call. - service/postprocessor.py (stream): on Channel.TOOL_CALL events with output.tool_calls populated, emit StreamEvent(type="tool_call", ...) directly instead of regex-parsing the delta text. Tests: - tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py: rewritten for the structured contract. New test_structured_payload_preserves_body_with_harmony_sentinels replays the BLOCKING #2 scenario (7 sentinel-bearing JSON bodies) and asserts each surfaces intact via the structured payload. New test_compat_gate_accepts_hf_cache_snapshot_path pins BLOCKING #1 fix. 27 tests, all green. - tests/test_engine_router_non_stream.py: rewritten for the (reasoning, content, structured_tool_calls) triple return. New test_structured_tool_call_passthrough_preserves_sentinel_bearing_arguments pins the same BLOCKING #2 scenario at the engine layer. 13 tests, all green. - tests/test_parallel_tool_calls.py: fake parser signature update. - tests/test_server_utils.py: field-order test extended to include tool_calls in the appended-fields list. Full unit suite: 4475 passed, 19 skipped, 16 deselected, 7 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-15 BLOCKING ×2 on the new structured-tool-call streaming fast-path introduced for HarmonyStreamingRouter passthrough: 1. ``tool_calls[*].index`` restarted at 0 on every router event, so clients merging deltas on ``index`` would overwrite earlier calls. Added per-response ``_structured_tool_call_count`` counter (initialized in __init__ + reset()) and assign monotonic indices across both intra-chunk and inter-chunk emissions. 2. The early structured-tool-call return bypassed the ``parallel_tool_calls=false`` cap that the non-streaming path enforces (routes/chat.py post-parse trim). Added ``_parallel_tool_calls_allowed()`` helper (dict + pydantic shapes) and trim ``allowed_calls`` to one before structuring; suppressed chunks still emit a finish event when ``output.finished`` to keep the route's buffered_finish gate firing. Tests (TestStructuredToolCallStreaming, 9 cases): - index monotonic across chunks - index monotonic within single chunk - parallel_tool_calls=false cap across chunks - parallel_tool_calls=false cap within single chunk - parallel_tool_calls=true (explicit) does not cap - suppressed-by-cap chunk preserves finish semantics - structured event auto-assigns id when missing - structured event preserves caller-provided id - reset() clears the structured counter Also updated ``_make_output`` helper to default ``tool_calls=None`` so the new fast-path stays inert in tests that don't opt in (MagicMock auto-attrs are truthy and would otherwise trip the channel-routed structured branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
raullenchai
added a commit
that referenced
this pull request
Jun 5, 2026
PR #515 — openai-harmony StreamableParser passthrough for gpt-oss tool calls. Closes #444 #455 #468 #480 #513 (harmony streaming tool calls leak channel markers / commentary into content). Architecture: HarmonyStreamingRouter delegates to the official harmony parser (same library vLLM and SGLang use), surfaces structured (name, arguments) tool calls through the engine without round-tripping through wire-text — eliminating the entire class of sentinel-anchored regex parsing bugs that the cluster #444-#480 stemmed from. Streaming fast-path enforces tool_calls[*].index monotonicity across router chunks and the parallel_tool_calls=false external contract cap. Release-SOP gates walked: G1 release-smoke OK, G2 codex×16 rounds OK, G3 CLI-fidelity OK, G4 4484 unit tests OK, G5-G9 pr_validate 3×3 matrix (qwen3.5-35B-A3B-8bit / qwen3.6-27B-MLX-8bit / gpt-oss-20b -MXFP4-Q8) OK. G10 N/A (no mlx-* dep bump). G11 deferred to #516 (target v0.6.76) — auto-routing escape hatch flag pair for HarmonyStreamingRouter selection; waiver justified by safe-fallback to legacy in-production harmony router. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Jun 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace the custom harmony state machine in OutputRouter with a thin shim over openai-harmony's
StreamableParser— the same library vLLM and SGLang delegate to for gpt-oss tool calling — and surface structured tool-call data through the engine and routes (no more sentinel-delimited wire-text round-trip).Closes #444 / #480 (harmony streaming tool call leak).
Closes #455 (commentary channel routing).
Closes #468 (tool_choice=required compound).
Lands #513.
Why
PR #514 partially fixed #444/#480 at the parser layer (prefix-hold + count-based
<|call|>detection) but explicitly deferred the END-TO-END production fix because the router can't model harmony's tool-call protocol with a single-token state machine.Live verification on
mlx-community/gpt-oss-20b-MXFP4-Q8confirmed thatcommentarytokenizes ascomment(12606) +ary(815) — TWO tokens. The custom router's AWAITING_CHANNEL_TYPE branch only recognizesanalysisandfinalas single-token channel words.commentaryfalls through to CONTENT and leaks the whole markered tool-call sequence into the content stream.The proper fix is what vLLM and SGLang already do: delegate to the official harmony streaming parser.
StreamableParserhandles multi-token channel/recipient/constrain parsing, message boundaries, and state transitions natively.Architecture (round-15 — bytes-faithful structured passthrough)
Earlier revisions of this PR reconstructed parsed harmony messages back into sentinel-delimited wire text so the downstream text-based
HarmonyToolParsercould regex-parse them again. That round-trip lost tool calls whose JSON arguments contained literal harmony marker substrings (e.g.{"text":"<|call|>"}) — the regex anchored on the embedded sentinel and silently truncated the call. Round-15 eliminates the round-trip;HarmonyStreamingRoutersurfaces structured{"name", "arguments"}payloads that flow through the engine via a newGenerationOutput.tool_callsfield. Routes consume the structured data directly, bypassing the regex pass for the router-fed path. This matches vLLM and SGLang exactly.HarmonyStreamingRouter(vllm_mlx/output_router_harmony.py) — duck-typed replacement forOutputRouteron harmony format. Internally wrapsStreamableParser. Samefeed/finalize/reset/feed_sequencesurface so engine streaming + non-stream sequence paths pick it up without changes.RouterEventinstances carry the parsed{"name", "arguments"}dict on a newRouterEvent.tool_callfield;feed_sequencereturns the same shape viarouted["tool_calls"].GenerationOutput.tool_calls: list[dict] | Nonecarries the structured payload through both streaming and non-streaming paths._route_tokens_for_channelsreturns a(reasoning, content, structured_tool_calls)triple; the wire-text dedup machinery is gone._parse_tool_calls_with_parser(..., structured_tool_calls=...)short-circuits the regex extraction when the engine surfaced structured calls. The text-based parser stays for legacy paths (gemma4 + non-harmony) which still produce wire-text TOOL_CALL events.models--<owner>--<name>/snapshots/<sha>). Mismatched body vocabs / unknown owners fall back to the legacy router.process_eos; never synthesizes a tool call from truncated commentary (vLLM / SGLang behavior).feed_sequencereturns accumulated text verbatim; only the exact empty string maps toNone.Test coverage
tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py(27 tests):test_structured_payload_preserves_body_with_harmony_sentinels— pins the round-15 architectural fix: 7 sentinel-bearing JSON bodies ({"text":"<|call|>"}etc.) each surface intact via the structured payload.test_compat_gate_accepts_hf_cache_snapshot_path— pins HF cache snapshot dir acceptance.finalizesafer-default, recipient-shape, per-token streaming contract all covered.tests/test_engine_router_non_stream.py(13 tests) — pins the new(reasoning, content, structured_tool_calls)triple.test_structured_tool_call_passthrough_preserves_sentinel_bearing_argumentsmirrors the BLOCKING feat: Tier 1 optimizations — streaming tool fix, frequency-aware cache, block reuse #2 scenario at the engine layer.tests/test_anthropic_*(122 tests) — anthropic streaming + non-streaming flows green.Skips gracefully when
openai-harmonyis missing.What was verified live (gpt-oss-20b)
finish_reason=tool_calls, content empty, tool surfaced w/ correct argsDependency
Adds
openai-harmony>=0.0.6to project deps. Pure Python + small Rust extension, ~600 KB. Soft-imported at runtime — missing dep / mismatched vocab gracefully falls back to legacy state machine.Test plan
🤖 Generated with Claude Code