feat(router): openai-harmony bypass for gpt-oss tool calls (closes #444 #455 #468 #480, #513) by raullenchai · Pull Request #515 · raullenchai/Rapid-MLX

raullenchai · 2026-06-04T20:30:05Z

Summary

Replace the custom harmony state machine in OutputRouter with a thin shim over openai-harmony's StreamableParser — the same library vLLM and SGLang delegate to for gpt-oss tool calling — and surface structured tool-call data through the engine and routes (no more sentinel-delimited wire-text round-trip).

Closes #444 / #480 (harmony streaming tool call leak).
Closes #455 (commentary channel routing).
Closes #468 (tool_choice=required compound).
Lands #513.

Why

PR #514 partially fixed #444/#480 at the parser layer (prefix-hold + count-based <|call|> detection) but explicitly deferred the END-TO-END production fix because the router can't model harmony's tool-call protocol with a single-token state machine.

Live verification on mlx-community/gpt-oss-20b-MXFP4-Q8 confirmed that commentary tokenizes as comment(12606) + ary(815) — TWO tokens. The custom router's AWAITING_CHANNEL_TYPE branch only recognizes analysis and final as single-token channel words. commentary falls through to CONTENT and leaks the whole markered tool-call sequence into the content stream.

The proper fix is what vLLM and SGLang already do: delegate to the official harmony streaming parser. StreamableParser handles multi-token channel/recipient/constrain parsing, message boundaries, and state transitions natively.

Architecture (round-15 — bytes-faithful structured passthrough)

Earlier revisions of this PR reconstructed parsed harmony messages back into sentinel-delimited wire text so the downstream text-based HarmonyToolParser could regex-parse them again. That round-trip lost tool calls whose JSON arguments contained literal harmony marker substrings (e.g. {"text":"<|call|>"}) — the regex anchored on the embedded sentinel and silently truncated the call. Round-15 eliminates the round-trip; HarmonyStreamingRouter surfaces structured {"name", "arguments"} payloads that flow through the engine via a new GenerationOutput.tool_calls field. Routes consume the structured data directly, bypassing the regex pass for the router-fed path. This matches vLLM and SGLang exactly.

New HarmonyStreamingRouter (vllm_mlx/output_router_harmony.py) — duck-typed replacement for OutputRouter on harmony format. Internally wraps StreamableParser. Same feed/finalize/reset/feed_sequence surface so engine streaming + non-stream sequence paths pick it up without changes.
Structured event payload — TOOL_CALL RouterEvent instances carry the parsed {"name", "arguments"} dict on a new RouterEvent.tool_call field; feed_sequence returns the same shape via routed["tool_calls"].
Engine plumbing — GenerationOutput.tool_calls: list[dict] | None carries the structured payload through both streaming and non-streaming paths. _route_tokens_for_channels returns a (reasoning, content, structured_tool_calls) triple; the wire-text dedup machinery is gone.
Route bypass — _parse_tool_calls_with_parser(..., structured_tool_calls=...) short-circuits the regex extraction when the engine surfaced structured calls. The text-based parser stays for legacy paths (gemma4 + non-harmony) which still produce wire-text TOOL_CALL events.
Compatibility gate — selected only when the model's harmony special-token IDs AND a representative set of body-vocab probe strings round-trip to the same IDs through both the model tokenizer and the harmony encoding. Tokenizer-identity allowlist accepts known-owner remote IDs, local filesystem paths, AND HuggingFace cache snapshot dirs (models--<owner>--<name>/snapshots/<sha>). Mismatched body vocabs / unknown owners fall back to the legacy router.
finalize() safer-default — only flushes process_eos; never synthesizes a tool call from truncated commentary (vLLM / SGLang behavior).
Whitespace preservation — feed_sequence returns accumulated text verbatim; only the exact empty string maps to None.

Test coverage

tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py (27 tests):
- Closes harmony streaming: tool calls leak as raw content deltas instead of tool_calls events #444 / Anthropic /v1/messages streaming + tools: harmony commentary leaks as text block instead of tool_use (worse than #444) #455 / tool_choice="required" not enforced + harmony channel markers leak as content #468 / Harmony streaming tool calls leak channel syntax into content (gpt-oss-20b) #480 by replaying production-shape token sequences through the real harmony encoding and asserting structured tool-call surfacing + zero marker leak.
- test_structured_payload_preserves_body_with_harmony_sentinels — pins the round-15 architectural fix: 7 sentinel-bearing JSON bodies ({"text":"<|call|>"} etc.) each surface intact via the structured payload.
- test_compat_gate_accepts_hf_cache_snapshot_path — pins HF cache snapshot dir acceptance.
- Identity-allowlist, marker-parity, body-vocab probe, weak-key cache, finalize safer-default, recipient-shape, per-token streaming contract all covered.
tests/test_engine_router_non_stream.py (13 tests) — pins the new (reasoning, content, structured_tool_calls) triple. test_structured_tool_call_passthrough_preserves_sentinel_bearing_arguments mirrors the BLOCKING feat: Tier 1 optimizations — streaming tool fix, frequency-aware cache, block reuse #2 scenario at the engine layer.
tests/test_anthropic_* (122 tests) — anthropic streaming + non-streaming flows green.
Full unit suite: 4475 passed, 19 skipped, 16 deselected, 7 xfailed.

Skips gracefully when openai-harmony is missing.

What was verified live (gpt-oss-20b)

✅ Streaming tool call: finish_reason=tool_calls, content empty, tool surfaced w/ correct args
✅ Non-stream tool call: same shape
✅ Analysis + final (no tools): reasoning and content split, zero marker leak
✅ pydantic_ai_full: 6/6 pass
✅ anthropic_sdk: 5/5 pass
✅ langchain: 6/6 pass

Dependency

Adds openai-harmony>=0.0.6 to project deps. Pure Python + small Rust extension, ~600 KB. Soft-imported at runtime — missing dep / mismatched vocab gracefully falls back to legacy state machine.

Test plan

CI green
Live verify gpt-oss-20b commentary tool call surfaces with no content leak
Live verify gpt-oss-20b analysis + final still route correctly post-bypass
Regression: gpt-oss-20b agent tests pass (pydantic_ai / anthropic_sdk / langchain)
Round-15: sentinel-in-JSON-args bodies preserved bytes-faithfully (codex round-12 / round-14 BLOCKING closure)

🤖 Generated with Claude Code

#455 #468 #480 Issue #513 — the custom Gemma 4–style OutputRouter state machine cannot model harmony's tool-call protocol reliably. Production ``commentary`` is two tokens (``comment`` + ``ary``) on gpt-oss-20b, so a single-token channel-type match never fires; the recipient (``functions.<name>``) and constrain directive (``<|constrain|>json``) are multi-token literals the naive state machine swallows or leaks. PR #514 confirmed this empirically and deferred the full e2e fix to a follow-up. This PR lands the SOTA fix: delegate harmony state tracking to ``openai-harmony.StreamableParser`` — the same library vLLM and SGLang delegate to for gpt-oss tool calling. The new ``HarmonyStreamingRouter`` shim mirrors ``OutputRouter``'s public surface (``feed`` / ``finalize`` / ``reset`` / ``feed_sequence`` / ``map``), so the engine streaming and non-stream sequence paths pick it up without changes. The legacy custom state machine remains in ``OutputRouter`` as the documented fallback when the optional ``openai-harmony`` dep is unavailable. Compatibility gate: ``HarmonyStreamingRouter`` is only selected when the model's harmony special-token IDs match the openai-harmony encoding's IDs (verified for upstream gpt-oss family at PR-time; encoded ``<|channel|>``, ``<|message|>``, ``<|call|>``, ``<|end|>``, ``<|return|>``, ``<|start|>``, ``<|constrain|>`` through both the model HF tokenizer and the harmony encoding and asserted set equality). For synthetic test vocabs that map structural markers to different IDs, the legacy router still runs — keeping the legacy test suite green without coercion. Routing rules (delegated to StreamableParser): - analysis channel → Channel.REASONING with last_content_delta - final channel → Channel.CONTENT with last_content_delta - commentary channel + recipient: body deltas buffered; emit a single Channel.TOOL_CALL event on message close, carrying the reconstructed wire-format text so the downstream HarmonyToolParser consumes it unchanged. - finalize() runs process_eos so truncated commentary messages still surface as tool calls instead of being dropped silently. New entry point: ``OutputRouter.from_tokenizer_for_streaming(tokenizer)`` — production factory used by ``BatchedEngine._create_output_router``. Returns ``HarmonyStreamingRouter`` for matched-vocab harmony models, legacy ``OutputRouter`` for everything else. ``from_tokenizer`` remains unchanged so legacy tests instantiate the legacy router directly. Regression file ``tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py`` pins closure of #444/#480 (commentary tool call no marker leak), #455 (analysis + commentary channel separation), #468 (compound analysis + commentary + final), plus general invariants (per-token streaming, truncated-message finalize drain). Skips gracefully when the optional ``openai-harmony`` dep is missing. Dependency: ``openai-harmony >= 0.0.6`` added to project deps. Pure Python + a Rust extension, ~600KB on disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hitespace + sanitize recipient Three BLOCKING findings from pr_validate round-1: 1. `is_openai_harmony_compatible` only verified marker IDs; a tokenizer with matching markers but a different body-token vocabulary would feed `StreamableParser` IDs that decode through the wrong vocab, corrupting streamed content and tool-call arguments. Tighten the gate to also probe a representative set of body strings through both encoders and require ID equality. This auto-resolves the 7 unit regressions in `test_engine_router_non_stream.py` and `test_batched_engine_output_router.py` whose synthetic vocabs had the smoking-gun mismatch (`Reason`=1 / `ing`=2 decoded to `"#` under harmony encoding). 2. `feed_sequence().strip()` silently mutated content/reasoning whose harmony body legitimately begins or ends with whitespace (markdown blocks, code fences, formatted output). Preserve verbatim; map only the exact empty string to `None`. 3. `_reconstruct_tool_call_text()` interpolated `recipient` verbatim into the `to=<recipient>` wire text consumed by `HarmonyToolParser`, so a recipient containing whitespace, marker-like characters, or an unexpected shape could break the downstream parser. Validate against `^functions\.<safe-name>$` and raise on malformed input. Updated `from_tokenizer_for_streaming` to pass the tokenizer to the gate. Added 4 invariant tests (whitespace preservation, recipient shape rejection, gate rejects no-encode tokenizer, gate rejects mismatched body vocab) under `tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…515 round-1 follow-up) Live verify on gpt-oss-20b uncovered a non-stream regression masked by unit tests: `_route_tokens_for_channels`'s content override was designed for the legacy router, which could not classify multi-token ``commentary`` and therefore left ``routed.content`` non-None on tool calls. The new ``HarmonyStreamingRouter`` correctly classifies commentary as a TOOL_CALL event, so ``routed.content`` becomes None even though the model DID emit a structural sequence the downstream ``HarmonyToolParser`` needs to parse. The override then clobbered ``fallback_text`` (which still carries the commentary wire text — ``_clean_gpt_oss_output`` bails on commentary), and the route's ``_parse_tool_calls_with_parser`` received an empty string. Fix: skip the override when ``routed.tool_calls`` is non-empty. The override remains in force for the original #442 case (analysis-only, no final, no tool call → return empty content so the analysis body does not leak into the user-visible message). Verified post-fix: - Non-stream tool call on gpt-oss-20b: surfaces correctly - Streaming tool call: still correct - Non-stream analysis + final (no tools): still correct, no marker leak - pydantic_ai_full: 6/6 pass - anthropic_sdk: 5/5 pass - langchain: 6/6 pass - 28-test unit panel (tests/test_engine_router_non_stream.py + tests/test_batched_engine_output_router.py + the harmony bypass regressions) all green. Pinning invariant added: ``test_tool_call_routing_preserves_fallback_text`` — fakes a router returning ``tool_calls=[wire_text]`` and asserts the fallback_text survives intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n + perf/log nits Two BLOCKING and two NIT findings from pr_validate round-2: 1. BLOCKING — recipient regex too strict `^functions\.[A-Za-z_]...` rejected OpenAI-spec valid tool names whose first character is a digit (e.g. `functions.2fa_lookup`). OpenAI documents `[a-zA-Z0-9_-]{1,64}` for function names; match that exactly so any tool the upstream API accepts also round-trips through the router. 2. BLOCKING — body-vocab probe set alone insufficient Three probe strings cannot prove full-vocab parity. A tokenizer with matching markers AND matching probes but a remapped uncommon token could still corrupt content later. Add tokenizer-identity allowlist as the first gate layer: require `tokenizer.name_or_path` to contain `gpt-oss` (case-insensitive substring) so quant-suffix renames (`-MXFP4-Q8`) and org prefixes (`openai/` vs `mlx-community/`) all resolve correctly while strangers are rejected up-front. The marker + probe checks remain as defense-in-depth. 3. NIT — `feed_sequence` O(n²) string concat Replace `content += event.text` / `reasoning += event.text` in the per-token loop with list-append + `"".join(...)` at return time. On long non-stream generations the old form copied the accumulator on every token. 4. NIT — `logger.info` per-request spam The streaming factory runs on every `/v1/chat/completions` call. Downgrade the "upgraded harmony router" message to DEBUG so it doesn't flood production logs. Plus ruff format on the three affected files (round-1 commit left them at non-canonical line shapes). Pinning invariants added: - `test_compat_gate_rejects_unknown_tokenizer_identity` — Mistral-named tokenizer with matching encoder is still rejected. - `test_compat_gate_accepts_gpt_oss_quant_suffix_variants` — all `mlx-community/gpt-oss-*` and `openai/gpt-oss-*` variants accepted; arbitrary names rejected. - `test_recipient_shape_accepts_digit_start_names` — `functions.2fa_lookup`, `functions.0_index`, `functions.123`, `functions.tool-with-dash`. Live verify on gpt-oss-20b post-fix: - non-stream tool call: still surfaces with correct args - analysis + final: still routes correctly, zero marker leak - log spam: NONE during normal request (was 1 INFO line per request) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…uncated bodies + cache compat probe Round-3 findings from pr_validate: 1. BLOCKING — non-stream path discarded `routed["tool_calls"]` When `HarmonyStreamingRouter.finalize()` synthesized a TOOL_CALL event for truncated commentary (no `<|call|>` in the raw output), `_clean_gpt_oss_output`'s bail-out condition might not match — the strip regex then removed the commentary header from `fallback_text`, and the route's text-based `HarmonyToolParser` saw no tool call at all. Append the reconstructed wire text from `routed["tool_calls"]` to `fallback_text` so the downstream parser can extract the call. Suppressed when `fallback_text` already carries a commentary marker (avoids double-parsing the same call). 2. BLOCKING — finalize() surfaced mid-JSON truncations as completed `StreamableParser.process_eos` appends the in-progress commentary message to `messages` even when the body was cut off mid-stream (max_tokens limit hit in the middle of `{"city":"NY`). The router then synthesized a TOOL_CALL event with corrupt arguments. Add `_tool_call_body_is_valid` — when the message's `content_type` indicates JSON, require `json.loads(body)` to succeed; empty bodies and JSON-decode failures cause `finalize()` to drop the event. Sibling test `test_finalize_drains_truncated_commentary_message` already pins that a COMPLETE-but-unterminated body still surfaces; the new tests pin that mid-JSON / empty bodies do NOT. 3. NIT — per-request compat probe + harmony encoding load `is_openai_harmony_compatible` ran the three-layer gate (marker IDs + body-vocab probes) on every `/v1/chat/completions` call, and each call re-loaded the harmony encoding. Cache both: a module-level `_COMPAT_RESULT_CACHE` keyed by lowercased `name_or_path` short-circuits subsequent calls, and a `_HARMONY_ENCODING_CACHE` reuses the loaded encoding. Both fall back to the slow path if the cache is cleared. New invariants pinned: - `test_finalize_drops_mid_json_truncation` — `{"city":"NY` → no event - `test_finalize_drops_truncated_commentary_with_empty_body` — empty body → no event - `test_compat_gate_caches_per_tokenizer_identity` — repeat calls hit the cache (no encode invocations) - `test_finalize_drained_tool_call_supplements_fallback_text` — engine appends wire text when fallback_text lacks markers; idempotent when commentary is already present. Existing tests with reused fixture identities re-keyed to unique `name_or_path` strings so the new result cache doesn't cross- contaminate them. Live verify on gpt-oss-20b post-fix: - non-stream tool call still surfaces - analysis + final still routes correctly - log spam: still silenced 35/35 router-related unit tests green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…y marker IDs Round-4 findings from pr_validate: 1. BLOCKING — gate skipped missing marker IDs via ``continue`` ``if model_id is None: continue`` meant a tokenizer with only a subset of harmony markers in its vocab could pass the gate. StreamableParser needs ALL seven (``<|channel|>`` / ``<|message|>`` / ``<|call|>`` / ``<|end|>`` / ``<|return|>`` / ``<|start|>`` / ``<|constrain|>``) to function — a partial tokenizer would crash inside the parser when the model emitted a marker the gate didn't check. Replace ``continue`` with ``return False``. Any production gpt-oss tokenizer has all seven; tokenizers that don't are not harmony-compatible. 2. BLOCKING — finalize() didn't drain post-EOS final/analysis delta ``process_eos()`` may surface a ``last_content_delta`` from a truncated final or analysis message that was buffered inside StreamableParser. The previous finalize() only checked for freshly-closed commentary messages and discarded any delta. Add a post-EOS drain that routes the delta to CONTENT (final) or REASONING (analysis) using the ``pre_eos_channel`` snapshot (the post-EOS channel becomes ``None``). Empirically the current openai-harmony build emits everything during ``feed()``, but the contract is pinned so future builds can't silently regress. 3. NIT — compat cache key was lowercased ``name_or_path`` only Two tokenizer instances with the same identity string but distinct marker IDs (e.g. test mocks, or a model loaded with a vocab override) would share a stale cached True/False. Include the marker-ID tuple in the cache key. New invariants pinned: - ``test_finalize_does_not_double_emit_completed_final`` — a normal final-channel response with ``<|return|>`` emits via per-token feed only; finalize() must return None. - ``test_finalize_drains_truncated_final_buffered_delta`` — when EOS surfaces a buffered final/analysis delta, finalize() routes it to the correct channel. - ``test_compat_gate_rejects_missing_marker_ids`` — TokenMap with ``harmony_call=None`` is rejected. - ``test_compat_gate_cache_segregates_by_marker_ids`` — same identity + distinct marker tuples → distinct cache entries. Live verify post-fix: - non-stream tool call still surfaces correctly - 39/39 router-related unit tests green Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… test Round-5 codex findings (2 BLOCKING, 0 NIT — converging fast): 1. BLOCKING — engine bulk guard suppressed multi-tool finalize drain ``"<|channel|>commentary" not in fallback_text`` rejected the entire ``routed["tool_calls"]`` append when ANY commentary marker was already present. In a multi-tool turn (model emits call #1 cleanly with ``<|call|>``, gets truncated mid-call-#2), the fallback already carries call #1's marker so call #2 was silently dropped. Replace with per-wire-text substring check: append each reconstructed tool call only when its exact text is missing from fallback_text. Idempotent on call #1 + appends call #2. 2. BLOCKING — compat-cache test was non-discriminating ``test_compat_gate_caches_per_tokenizer_identity`` used an identity outside the allowlist, so the gate short-circuited BEFORE ``encode`` was ever invoked. The test would pass even if the cache were removed entirely. Redesign with a real gpt-oss identity + matching marker IDs so the gate reaches the body-vocab probe step; only the mismatching probe response causes False, and the second call MUST be short-circuited by the cache (asserted via encode-call counter). Updated test_finalize_drained_tool_call_supplements_fallback_text to match the new per-wire-text contract: Case B now pins the "same call must not be doubled" invariant, while the new ``test_multi_tool_fallback_with_existing_marker_still_appends_new`` pins the multi-tool drain. 40/40 router-related unit tests green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ynthesis Adopt vLLM / SGLang same default: HarmonyStreamingRouter.finalize() now only flushes parser state via process_eos and returns None. It no longer: * synthesizes a TOOL_CALL from a truncated commentary message (no <|call|>) — a max_tokens / stop cutoff must not be executed as if the model had emitted the call terminator, even if the partial body parses as JSON. Rescuing the rare recoverable case is dwarfed by the cost of one wrongly-invoked side-effecting tool. * drains a post-EOS last_content_delta for final / analysis — per- token feed() already produced every body byte; emitting another delta risks duplicating the last token if a future openai-harmony build does not clear last_content_delta between flushes. Both behaviors were the last two codex round-5/6 BLOCKING findings on PR #515 (closes #444 #455 #468 #480, lands #513). They were originally added to recover edge cases but the analysis above shows the safer default is to drop them; the helper _tool_call_body_is_valid + the json import are removed alongside since they are no longer reachable. Tests: * test_finalize_drains_truncated_commentary_message → rename to test_finalize_never_synthesizes_truncated_commentary, assert None. * test_finalize_drains_truncated_final_buffered_delta → rename to test_finalize_does_not_drain_post_eos_buffered_delta, assert None on both final and analysis channels. * test_finalize_drops_mid_json_truncation / test_finalize_drops_truncated_commentary_with_empty_body / test_finalize_does_not_double_emit_completed_final already pinned drained == None — docstrings refreshed to attribute the contract to the safer-default rule. 29 targeted tests still pass; engine-side multi-tool dedup tests are unaffected because they exercise the routed["tool_calls"] append loop, which still fires from feed()-time TOOL_CALL events (completed tool calls with <|call|> closing the message). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ural dedup Two fixes flagged by codex on PR #515 round-7: BLOCKING (output_router_harmony.py:175): _reconstruct_tool_call_text reinserted the tool-call body into a delimiter-based harmony wire text consumed by HarmonyToolParser. A body containing literal harmony sentinel strings (``<|call|>``, ``<|message|>``, etc.) would let the downstream parser anchor on the embedded sentinel and truncate the JSON arguments. On gpt-oss those sentinels are normally single special tokens never emitted as body text, but an adversarial prompt could echo the raw characters back via body vocab. Reject (raise ValueError) rather than escape — the engine's outer try/except then falls back to the legacy text-based parser instead of feeding corrupt args downstream. Consistent with the recipient-validation BLOCKING from round-1. ``content_type`` also gets a sentinel scan, but the legitimate ``<|constrain|>`` prefix is stripped first so the framing carrier doesn't trip its own gate. NIT (engine/batched.py:1373): De-duping reconstructed tool calls with verbatim substring match doubled calls when the model emit and router reconstruction differ only by whitespace runs (e.g. ``to=foo <|constrain|>`` vs ``to=foo <|constrain|>``). Replace with a structural compare on the ``(recipient, normalized-body)`` tuple extracted via the new ``_HARMONY_TC_SIGNATURE_RE``. Robust to any spacing the model picks while still distinguishing different arguments for the same recipient. Falls back to verbatim match on non-canonical reconstructions (better to risk one duplicate than silently drop). Tests: * test_reconstruct_tool_call_rejects_body_carrying_harmony_sentinel pins each of the seven forbidden sentinels and a smuggled content_type. Sanity: clean JSON + ``<|constrain|>json`` ctype still reconstruct. * test_fallback_dedup_normalizes_whitespace_variance pins the spacing-variant case (same call) and the same-recipient-different- body case (must still append). 31 targeted tests pass (parsers regression + engine non-stream). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…encode validation Three BLOCKINGs + one NIT from codex on PR #515 round-8: BLOCKING engine/batched.py:1313 — Body whitespace collapse broke distinct legitimate calls. The signature was returning ``" ".join(body.split())`` so ``{"q":"a b"}`` and ``{"q":"a b"}`` deduped as the same call and the later one was silently dropped. Use the exact body bytes captured between ``<|message|>`` and ``<|call|>``; the regex's ``\s*`` already absorbs the only place spacing can legitimately differ (between ``to=`` and ``<|constrain|>``). BLOCKING engine/batched.py:1397 — When ``_harmony_tool_call_signature`` returned ``None``, the previous "best-effort append" fed non-canonical wire text to HarmonyToolParser, which either fails to parse (best case) or anchors on the wrong markers and surfaces a corrupt tool call (worst case). Skip with a debug log instead — the router has already lost the tool-call signal, nothing safe to recover. BLOCKING output_router_harmony.py:334 — ``list(model_ids)`` assumed ``tokenizer.encode()`` returns a flat ``list[int]``, but HF ``BatchEncoding`` and tensor-shaped returns either raise on ``list()`` or yield opaque non-int entries. Validate that the encode result is a flat int sequence; on anything else fall back to ``False`` so the engine uses the legacy router instead of crashing. NIT output_router_harmony.py:274 — ``is_openai_harmony_compatible`` called both ``is_openai_harmony_available()`` and ``_get_harmony_encoding()`` to check the dep; the latter already returns ``None`` on import failure. Drop the redundant probe. Tests: * test_compat_gate_rejects_non_int_encode_result — covers both BatchEncoding-style iter (dict keys) and tuple-of-strings. * test_fallback_dedup_preserves_internal_body_whitespace — pins that ``{"q":"a b"}`` and ``{"q":"a b"}`` are NOT collapsed. * test_non_canonical_router_wire_text_is_dropped — pins that a missing ``<|call|>`` terminator causes the wire text to be dropped, not appended. 34 targeted tests pass. Round-7 ``test_fallback_dedup_normalizes_whitespace_variance`` still passes because the spacing variance lives in the wrapper, not the body — that regex ``\s*`` still absorbs it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gex + sentinel abstain Three BLOCKINGs from codex on PR #515 round-9: BLOCKING engine/batched.py:1446 — Set dedup silently dropped a legitimate second identical tool call in the same turn (e.g. user asks for the same weather lookup twice). Switch to ``Counter`` so each existing match in fallback_text consumes ONE routed call; multiplicity is preserved for both sides. BLOCKING engine/batched.py:1311 — Signature regex matched bare ``to=functions.X<|message|>...<|call|>`` without the ``<|channel|>commentary`` prefix. A fallback_text where ``clean_output_text`` stripped the commentary header but left the tail (or any ordinary content carrying those substrings) would false-suppress the router's real wire text. Anchor the regex on the full commentary frame so only canonical tool-call shapes count as existing signatures. BLOCKING output_router_harmony.py:217 — Round-7's hard ``raise ValueError`` on body-sentinel substrings dropped legitimate calls whose OpenAI tool-call JSON values legally contain ``<|...|>`` characters. Flip the contract: ``_reconstruct_tool_call_text`` now returns ``str | None`` and ABSTAINS (returns None) on sentinel-laden body or content_type. The caller (``feed()``) emits no router event; the engine's outer plumbing leaves the model's raw fallback_text untouched and the legacy text-based HarmonyToolParser handles the call (with the same parse limitation every text-based harmony consumer has — no worse than baseline). Recipient still ``raise``s because recipient corruption is structural, not body-content- dependent. Tests: * test_reconstruct_tool_call_abstains_on_body_carrying_harmony_sentinel (renamed from rejects_..., now asserts None instead of raises). * test_identical_calls_twice_in_turn_both_survive_dedup pins the multiplicity contract on both balanced (2+2 → 2) and unbalanced (1+2 → 2) cases. * test_signature_regex_requires_commentary_frame pins that a stripped fallback_text bare tail does NOT false-suppress. 36 targeted tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…aming visibility Codex round-10 BLOCKING (output_router_harmony.py:292): when ``_reconstruct_tool_call_text`` abstains for a body containing a sentinel substring, the per-token feed loop had already suppressed commentary body deltas (we buffer until close), so emitting nothing on close — the round-9 behavior — made the entire tool call invisible to the streaming user. That regressed against baseline, where the PR-#515-replaced legacy router leaked commentary body into CONTENT verbatim (one of the original bugs being fixed, but at least visible). Fix: surface the accumulated body bytes as a CONTENT event when abstaining. The streaming user sees the tool call's intent (``{"text":"use <|call|>"}``) — semantically wrong channel, but matches baseline behavior and breaks no other contract. The engine non-stream path is unaffected: it relies on ``fallback_text``, which still carries the model's raw emit and reaches HarmonyToolParser the same way it would for any harmony tool call. Honors all three preceding rounds: * Round-7 (don't emit corrupt structured event) — no TOOL_CALL event is emitted; the structured-path contract is preserved. * Round-9 (don't drop legitimate calls with sentinel chars in JSON values) — the call is still observable via the streaming CONTENT delta and the non-stream fallback_text. * Round-10 (don't drop streaming visibility) — body now flushed. Tests: * test_reconstruct_tool_call_abstains_on_body_carrying_harmony_sentinel docstring updated to attribute the contract to round-10. * test_sentinel_body_emits_content_event_for_streaming_visibility pins the new ``feed()`` contract: TOOL_CALL stays None on abstain, but CONTENT carries the body bytes verbatim. 37 targeted tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… + ctype shape Codex round-11 surfaced two BLOCKINGs and two NITs in fresh territory (no oscillation with rounds 7/9/10): BLOCKING output_router_harmony.py:541 — ``known in name_lc`` substring matching let any identity carrying ``gpt-oss`` anywhere in its name pass the allowlist (``my-not-gpt-oss``, ``foo/bar-gpt-oss-malicious``). Switch to anchored regex ``(?:^|/)gpt-oss(?:-|$)`` so the basename must START with ``gpt-oss-`` (after a path separator or string start) and either continue with ``-`` or end. Canonical names (``openai/gpt-oss-20b``, ``mlx-community/gpt-oss-20b-MXFP4-Q8``, bare ``gpt-oss-20b``) still pass; tail-substring fakes don't. BLOCKING output_router_harmony.py:507 — Cache key ``(name_or_path, marker_ids)`` allowed two distinct tokenizer instances with the same name + markers but a remapped body vocab to share a single cached True. Fold ``id(tokenizer)`` into the key — distinct instances re-probe authoritatively, same-instance reuses cleanly with zero-encode cache hits (cheaper than a body-probe fingerprint and answers codex's "tokenizer instance" suggestion directly). NIT output_router_harmony.py:226 — ``content_type`` was emitted verbatim, so a bare enum value like ``"json"`` would reconstruct non-canonical wire text. Validate against ``^<\|constrain\|>[A-Za-z0-9_\-]{1,32}$`` and abstain (return None) on anything outside that shape. The regex implicitly rejects embedded sentinels too, subsuming the round-7 smuggled-sentinel content_type scan. NIT tests/test_engine_router_non_stream.py:225 — Round-6 made ``HarmonyStreamingRouter.finalize()`` always return None, so the "finalize-synthesized truncated tool call" framing in this test's docstring was stale (the fake-router stub actually stands in for the ``feed_sequence`` path that surfaces a completed ``<|call|>``-closed tool call). Renamed to ``test_routed_tool_call_supplements_fallback_text`` and rewrote the docstring to attribute the contract to ``feed_sequence`` not ``finalize``. Tests: * test_compat_gate_anchored_allowlist_rejects_tail_substring_fake pins 4 rejected tail-substring names + 4 accepted canonical names. * test_compat_cache_key_segregates_by_tokenizer_instance pins that two distinct instances with the same name+markers but mismatched body vocab segregate correctly. * test_reconstruct_tool_call_abstains_on_body_carrying_harmony_sentinel adds bare-``"json"`` ctype + bad-char ctype abstain coverage. 39 targeted tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Codex round-12 surfaced 2 BLOCKINGs and 1 NIT that compound into a real production bug (full_unit caught it via ``test_compat_cache_key_segregates_by_tokenizer_instance``). BLOCKING / NIT output_router_harmony.py:551 — Round-11 keyed the cache on ``(name_lc, marker_ids, id(tokenizer))``. ``id()`` is the Python object's memory address; CPython is free to reuse it after the object is garbage-collected, so a tokenizer freed between requests could have its address reused by a NEW (potentially incompatible) tokenizer that then hits the stale ``True`` and is wrongly upgraded to the harmony router. The cache also grew unbounded across model reloads. Replace with a ``WeakKeyDictionary`` keyed on the tokenizer object itself, with an inner ``dict[(name_lc, marker_ids), bool]`` for per-TokenMap segregation. Entries auto-clear on GC — no id-reuse hazard, bounded growth. Non-weak-referenceable tokenizers (some test mocks, ``__slots__`` classes without ``__weakref__``) fall through to recompute on every call — correctness preserved at a small cost. BLOCKING output_router_harmony.py:106 — Round-11's anchored basename allowlist ``(?:^|/)gpt-oss(?:-|$)`` still accepted arbitrary owners (``some-user/gpt-oss-remapped``, ``evil-org/gpt-oss-20b``). Restrict to known owner prefixes: ``openai/``, ``mlx-community/``, ``unsloth/``, plus the bare basename form. Any other owner is rejected even if its basename starts with ``gpt-oss-`` — owner provenance is now part of the identity gate, not just basename shape. Combined with the body-vocab probe set this triples the defense. Tests: * test_compat_gate_anchored_allowlist_rejects_tail_substring_fake grows to include ``some-user/gpt-oss-remapped``, ``evil-org/gpt-oss-20b``, ``anonymous/gpt-oss`` in the reject list; ``unsloth/gpt-oss-20b-MLX-8bit`` joins the accept list. * test_compat_cache_key_segregates_by_tokenizer_instance no longer leaks across full_unit runs (WeakKeyDict + instances kept in scope per test segregate cleanly). * test_compat_gate_cache_segregates_by_marker_ids rewritten for the new WeakKeyDict layout (single tokenizer instance probed with two distinct TokenMaps must produce two inner entries). * test_compat_gate_caches_per_tokenizer_identity no longer needs manual cache cleanup — fresh tokenizer instance is naturally isolated by the WeakKeyDict. 39 targeted tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…xact-count test Codex round-13 found 1 BLOCKING + 2 NITs after the round-12 allowlist tightening: BLOCKING output_router_harmony.py:593 — the round-12 owner-prefix allowlist was anchored at ``^owner/...``, which silently rejected LOCAL filesystem paths (``/models/gpt-oss-20b``, ``~/.cache/.../gpt-oss-20b``, ``./gpt-oss-20b-quantized``). Production deployments that load models by absolute path then fell back to the legacy leaking router. Replace the flat regex list with a two-tier ``_is_known_harmony_identity`` helper: * Bare basename (no ``/``) — accept iff basename matches gpt-oss-*. * Local filesystem path (``/``, ``~``, ``./``, ``../`` prefix) — trust the basename match (user-controlled artifact, body-probe set is the authoritative vocab check). * Remote HF ID (``owner/name`` shape) — owner must be in {openai, mlx-community, unsloth}. Identical defense surface as round-12 for remote IDs, plus the local- path support the round-12 tightening accidentally broke. NIT output_router_harmony.py:213 — ``_reconstruct_tool_call_text`` built the body with quadratic ``+=`` accumulation. Switch to list + ``"".join`` once; same pattern as the round-2 NIT fix in ``feed_sequence``. Avoidable copy cost on large JSON tool arguments. NIT tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py:213 — ``test_per_token_streaming_routes_one_event_per_body_token`` claimed "one event per body token" but only asserted ``>= 2``. Coalescing or dropped intermediate deltas would have passed silently. Compute the exact body-token count from the encoded sequence (positions between ``<|message|>`` and ``<|return|>``) and assert event count EQUALS body-token count. Tests: * test_compat_gate_anchored_allowlist_rejects_tail_substring_fake extends ``accepted_names`` with 5 local-path shapes (absolute, HF-cache, tilde, dot-relative, parent-relative). * test_per_token_streaming_routes_one_event_per_body_token strengthened to exact-count assertion. 39 targeted tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Round-15 refactor closing PR #515 codex round-12 / round-14 BLOCKING: eliminate the sentinel-delimited wire-text round-trip for harmony tool calls; surface structured {"name", "arguments"} all the way through the engine and routes. Matches vLLM / SGLang exactly. Why this round: The previous design reconstructed harmony commentary messages back into wire text (<|channel|>commentary to=functions.X<|message|> {body}<|call|>) so the downstream text-based HarmonyToolParser could regex-parse it again. That round-trip lost tool calls whose JSON arguments contained literal harmony marker substrings (e.g. {"text":"<|call|>"}) — the regex anchored on the embedded sentinel and silently truncated the call. Codex flagged this for the third time in round 14; no amount of body-scrubbing on the reconstruction side could fix it without dropping legitimate calls. The fix is the one codex proposed (and the one vLLM and SGLang already ship): pass structured (name, arguments) data through the pipeline. The text-based parser stays in place for legacy / non- router paths; the router-fed path bypasses it entirely. Changes: - GenerationOutput: new tool_calls: list[dict] | None field. Appended after raw_text / reasoning_text to preserve positional-arg compatibility. - RouterEvent: new tool_call: dict | None field. Populated on Channel.TOOL_CALL events by routers that natively parse tool calls. - HarmonyStreamingRouter: feed() emits RouterEvent with the structured payload on commentary message close; feed_sequence() returns list[dict] entries on routed["tool_calls"]. Removed the wire-text reconstruction + sentinel-body abstain logic (no more reconstruction, no more sentinels to worry about). - HarmonyStreamingRouter.identity gate: tolerate HuggingFace cache snapshot paths like ~/.cache/huggingface/hub/models--openai-- gpt-oss-20b/snapshots/<sha> via a new models--<owner>--<name> pattern. Round-14 BLOCKING #1: previous basename-only check rejected the SHA snapshot dir and fell back to the leaking legacy router for every HF-cache-loaded gpt-oss model. - BatchedEngine._route_tokens_for_channels: returns (reasoning, content, structured_tool_calls). Removed the Counter-based wire-text dedup machinery (no more wire text to dedup). When structured tool calls are present, force content to the router's CONTENT-channel result so un-cleaned commentary headers in fallback_text don't bleed into user-facing content. - BatchedEngine._make_routed_output: propagate structured tool_call payload from RouterEvent → GenerationOutput.tool_calls for the streaming path. - _parse_tool_calls_with_parser: new structured_tool_calls= kwarg. When non-None the text-based parser is bypassed entirely; the structured entries are converted to ToolCall objects directly. - routes/chat.py + routes/anthropic.py (non-stream): pass output.tool_calls into the helper. - routes/anthropic.py (stream): accumulate engine-surfaced tool_calls on a per-chunk basis and pass the accumulated list to the end-of-stream parser call. - service/postprocessor.py (stream): on Channel.TOOL_CALL events with output.tool_calls populated, emit StreamEvent(type="tool_call", ...) directly instead of regex-parsing the delta text. Tests: - tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py: rewritten for the structured contract. New test_structured_payload_preserves_body_with_harmony_sentinels replays the BLOCKING #2 scenario (7 sentinel-bearing JSON bodies) and asserts each surfaces intact via the structured payload. New test_compat_gate_accepts_hf_cache_snapshot_path pins BLOCKING #1 fix. 27 tests, all green. - tests/test_engine_router_non_stream.py: rewritten for the (reasoning, content, structured_tool_calls) triple return. New test_structured_tool_call_passthrough_preserves_sentinel_bearing_arguments pins the same BLOCKING #2 scenario at the engine layer. 13 tests, all green. - tests/test_parallel_tool_calls.py: fake parser signature update. - tests/test_server_utils.py: field-order test extended to include tool_calls in the appended-fields list. Full unit suite: 4475 passed, 19 skipped, 16 deselected, 7 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Codex round-15 BLOCKING ×2 on the new structured-tool-call streaming fast-path introduced for HarmonyStreamingRouter passthrough: 1. ``tool_calls[*].index`` restarted at 0 on every router event, so clients merging deltas on ``index`` would overwrite earlier calls. Added per-response ``_structured_tool_call_count`` counter (initialized in __init__ + reset()) and assign monotonic indices across both intra-chunk and inter-chunk emissions. 2. The early structured-tool-call return bypassed the ``parallel_tool_calls=false`` cap that the non-streaming path enforces (routes/chat.py post-parse trim). Added ``_parallel_tool_calls_allowed()`` helper (dict + pydantic shapes) and trim ``allowed_calls`` to one before structuring; suppressed chunks still emit a finish event when ``output.finished`` to keep the route's buffered_finish gate firing. Tests (TestStructuredToolCallStreaming, 9 cases): - index monotonic across chunks - index monotonic within single chunk - parallel_tool_calls=false cap across chunks - parallel_tool_calls=false cap within single chunk - parallel_tool_calls=true (explicit) does not cap - suppressed-by-cap chunk preserves finish semantics - structured event auto-assigns id when missing - structured event preserves caller-provided id - reset() clears the structured counter Also updated ``_make_output`` helper to default ``tool_calls=None`` so the new fast-path stays inert in tests that don't opt in (MagicMock auto-attrs are truthy and would otherwise trip the channel-routed structured branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #515 — openai-harmony StreamableParser passthrough for gpt-oss tool calls. Closes #444 #455 #468 #480 #513 (harmony streaming tool calls leak channel markers / commentary into content). Architecture: HarmonyStreamingRouter delegates to the official harmony parser (same library vLLM and SGLang use), surfaces structured (name, arguments) tool calls through the engine without round-tripping through wire-text — eliminating the entire class of sentinel-anchored regex parsing bugs that the cluster #444-#480 stemmed from. Streaming fast-path enforces tool_calls[*].index monotonicity across router chunks and the parallel_tool_calls=false external contract cap. Release-SOP gates walked: G1 release-smoke OK, G2 codex×16 rounds OK, G3 CLI-fidelity OK, G4 4484 unit tests OK, G5-G9 pr_validate 3×3 matrix (qwen3.5-35B-A3B-8bit / qwen3.6-27B-MLX-8bit / gpt-oss-20b -MXFP4-Q8) OK. G10 N/A (no mlx-* dep bump). G11 deferred to #516 (target v0.6.76) — auto-routing escape hatch flag pair for HarmonyStreamingRouter selection; waiver justified by safe-fallback to legacy in-production harmony router. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

raullenchai and others added 17 commits June 4, 2026 13:29

raullenchai merged commit 529fdd3 into main Jun 5, 2026
8 checks passed

raullenchai deleted the feat/openai-harmony-bypass-513 branch June 5, 2026 04:52

raullenchai mentioned this pull request Jun 5, 2026

Auto-routing flag gap: add --no-openai-harmony-streaming escape hatch (release G11) #516

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(router): openai-harmony bypass for gpt-oss tool calls (closes #444 #455 #468 #480, #513)#515

feat(router): openai-harmony bypass for gpt-oss tool calls (closes #444 #455 #468 #480, #513)#515
raullenchai merged 17 commits into
mainfrom
feat/openai-harmony-bypass-513

raullenchai commented Jun 4, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

raullenchai commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Architecture (round-15 — bytes-faithful structured passthrough)

Test coverage

What was verified live (gpt-oss-20b)

Dependency

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

raullenchai commented Jun 4, 2026 •

edited

Loading