Skip to content

feat(router): openai-harmony bypass for gpt-oss tool calls (closes #444 #455 #468 #480, #513)#515

Merged
raullenchai merged 17 commits into
mainfrom
feat/openai-harmony-bypass-513
Jun 5, 2026
Merged

feat(router): openai-harmony bypass for gpt-oss tool calls (closes #444 #455 #468 #480, #513)#515
raullenchai merged 17 commits into
mainfrom
feat/openai-harmony-bypass-513

Conversation

@raullenchai
Copy link
Copy Markdown
Owner

@raullenchai raullenchai commented Jun 4, 2026

Summary

Replace the custom harmony state machine in OutputRouter with a thin shim over openai-harmony's StreamableParser — the same library vLLM and SGLang delegate to for gpt-oss tool calling — and surface structured tool-call data through the engine and routes (no more sentinel-delimited wire-text round-trip).

Closes #444 / #480 (harmony streaming tool call leak).
Closes #455 (commentary channel routing).
Closes #468 (tool_choice=required compound).
Lands #513.

Why

PR #514 partially fixed #444/#480 at the parser layer (prefix-hold + count-based <|call|> detection) but explicitly deferred the END-TO-END production fix because the router can't model harmony's tool-call protocol with a single-token state machine.

Live verification on mlx-community/gpt-oss-20b-MXFP4-Q8 confirmed that commentary tokenizes as comment(12606) + ary(815) — TWO tokens. The custom router's AWAITING_CHANNEL_TYPE branch only recognizes analysis and final as single-token channel words. commentary falls through to CONTENT and leaks the whole markered tool-call sequence into the content stream.

The proper fix is what vLLM and SGLang already do: delegate to the official harmony streaming parser. StreamableParser handles multi-token channel/recipient/constrain parsing, message boundaries, and state transitions natively.

Architecture (round-15 — bytes-faithful structured passthrough)

Earlier revisions of this PR reconstructed parsed harmony messages back into sentinel-delimited wire text so the downstream text-based HarmonyToolParser could regex-parse them again. That round-trip lost tool calls whose JSON arguments contained literal harmony marker substrings (e.g. {"text":"<|call|>"}) — the regex anchored on the embedded sentinel and silently truncated the call. Round-15 eliminates the round-trip; HarmonyStreamingRouter surfaces structured {"name", "arguments"} payloads that flow through the engine via a new GenerationOutput.tool_calls field. Routes consume the structured data directly, bypassing the regex pass for the router-fed path. This matches vLLM and SGLang exactly.

  • New HarmonyStreamingRouter (vllm_mlx/output_router_harmony.py) — duck-typed replacement for OutputRouter on harmony format. Internally wraps StreamableParser. Same feed/finalize/reset/feed_sequence surface so engine streaming + non-stream sequence paths pick it up without changes.
  • Structured event payload — TOOL_CALL RouterEvent instances carry the parsed {"name", "arguments"} dict on a new RouterEvent.tool_call field; feed_sequence returns the same shape via routed["tool_calls"].
  • Engine plumbingGenerationOutput.tool_calls: list[dict] | None carries the structured payload through both streaming and non-streaming paths. _route_tokens_for_channels returns a (reasoning, content, structured_tool_calls) triple; the wire-text dedup machinery is gone.
  • Route bypass_parse_tool_calls_with_parser(..., structured_tool_calls=...) short-circuits the regex extraction when the engine surfaced structured calls. The text-based parser stays for legacy paths (gemma4 + non-harmony) which still produce wire-text TOOL_CALL events.
  • Compatibility gate — selected only when the model's harmony special-token IDs AND a representative set of body-vocab probe strings round-trip to the same IDs through both the model tokenizer and the harmony encoding. Tokenizer-identity allowlist accepts known-owner remote IDs, local filesystem paths, AND HuggingFace cache snapshot dirs (models--<owner>--<name>/snapshots/<sha>). Mismatched body vocabs / unknown owners fall back to the legacy router.
  • finalize() safer-default — only flushes process_eos; never synthesizes a tool call from truncated commentary (vLLM / SGLang behavior).
  • Whitespace preservationfeed_sequence returns accumulated text verbatim; only the exact empty string maps to None.

Test coverage

Skips gracefully when openai-harmony is missing.

What was verified live (gpt-oss-20b)

  • ✅ Streaming tool call: finish_reason=tool_calls, content empty, tool surfaced w/ correct args
  • ✅ Non-stream tool call: same shape
  • ✅ Analysis + final (no tools): reasoning and content split, zero marker leak
  • ✅ pydantic_ai_full: 6/6 pass
  • ✅ anthropic_sdk: 5/5 pass
  • ✅ langchain: 6/6 pass

Dependency

Adds openai-harmony>=0.0.6 to project deps. Pure Python + small Rust extension, ~600 KB. Soft-imported at runtime — missing dep / mismatched vocab gracefully falls back to legacy state machine.

Test plan

  • CI green
  • Live verify gpt-oss-20b commentary tool call surfaces with no content leak
  • Live verify gpt-oss-20b analysis + final still route correctly post-bypass
  • Regression: gpt-oss-20b agent tests pass (pydantic_ai / anthropic_sdk / langchain)
  • Round-15: sentinel-in-JSON-args bodies preserved bytes-faithfully (codex round-12 / round-14 BLOCKING closure)

🤖 Generated with Claude Code

raullenchai and others added 17 commits June 4, 2026 13:29
 #455 #468 #480

Issue #513 — the custom Gemma 4–style OutputRouter state machine cannot
model harmony's tool-call protocol reliably. Production ``commentary``
is two tokens (``comment`` + ``ary``) on gpt-oss-20b, so a single-token
channel-type match never fires; the recipient (``functions.<name>``)
and constrain directive (``<|constrain|>json``) are multi-token
literals the naive state machine swallows or leaks. PR #514 confirmed
this empirically and deferred the full e2e fix to a follow-up.

This PR lands the SOTA fix: delegate harmony state tracking to
``openai-harmony.StreamableParser`` — the same library vLLM and
SGLang delegate to for gpt-oss tool calling. The new
``HarmonyStreamingRouter`` shim mirrors ``OutputRouter``'s public
surface (``feed`` / ``finalize`` / ``reset`` / ``feed_sequence`` /
``map``), so the engine streaming and non-stream sequence paths pick
it up without changes. The legacy custom state machine remains in
``OutputRouter`` as the documented fallback when the optional
``openai-harmony`` dep is unavailable.

Compatibility gate: ``HarmonyStreamingRouter`` is only selected when
the model's harmony special-token IDs match the openai-harmony
encoding's IDs (verified for upstream gpt-oss family at PR-time;
encoded ``<|channel|>``, ``<|message|>``, ``<|call|>``, ``<|end|>``,
``<|return|>``, ``<|start|>``, ``<|constrain|>`` through both
the model HF tokenizer and the harmony encoding and asserted set
equality). For synthetic test vocabs that map structural markers to
different IDs, the legacy router still runs — keeping the legacy
test suite green without coercion.

Routing rules (delegated to StreamableParser):
- analysis channel → Channel.REASONING with last_content_delta
- final channel → Channel.CONTENT with last_content_delta
- commentary channel + recipient: body deltas buffered; emit a
  single Channel.TOOL_CALL event on message close, carrying the
  reconstructed wire-format text so the downstream HarmonyToolParser
  consumes it unchanged.
- finalize() runs process_eos so truncated commentary messages still
  surface as tool calls instead of being dropped silently.

New entry point:
``OutputRouter.from_tokenizer_for_streaming(tokenizer)`` — production
factory used by ``BatchedEngine._create_output_router``. Returns
``HarmonyStreamingRouter`` for matched-vocab harmony models, legacy
``OutputRouter`` for everything else. ``from_tokenizer`` remains
unchanged so legacy tests instantiate the legacy router directly.

Regression file
``tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py``
pins closure of #444/#480 (commentary tool call no marker leak),
#455 (analysis + commentary channel separation), #468 (compound
analysis + commentary + final), plus general invariants (per-token
streaming, truncated-message finalize drain). Skips gracefully when
the optional ``openai-harmony`` dep is missing.

Dependency: ``openai-harmony >= 0.0.6`` added to project deps. Pure
Python + a Rust extension, ~600KB on disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hitespace + sanitize recipient

Three BLOCKING findings from pr_validate round-1:

1. `is_openai_harmony_compatible` only verified marker IDs; a tokenizer
   with matching markers but a different body-token vocabulary would
   feed `StreamableParser` IDs that decode through the wrong vocab,
   corrupting streamed content and tool-call arguments. Tighten the
   gate to also probe a representative set of body strings through
   both encoders and require ID equality. This auto-resolves the 7
   unit regressions in `test_engine_router_non_stream.py` and
   `test_batched_engine_output_router.py` whose synthetic vocabs had
   the smoking-gun mismatch (`Reason`=1 / `ing`=2 decoded to `"#`
   under harmony encoding).

2. `feed_sequence().strip()` silently mutated content/reasoning whose
   harmony body legitimately begins or ends with whitespace (markdown
   blocks, code fences, formatted output). Preserve verbatim; map
   only the exact empty string to `None`.

3. `_reconstruct_tool_call_text()` interpolated `recipient` verbatim
   into the `to=<recipient>` wire text consumed by `HarmonyToolParser`,
   so a recipient containing whitespace, marker-like characters, or
   an unexpected shape could break the downstream parser. Validate
   against `^functions\.<safe-name>$` and raise on malformed input.

Updated `from_tokenizer_for_streaming` to pass the tokenizer to the
gate. Added 4 invariant tests (whitespace preservation, recipient
shape rejection, gate rejects no-encode tokenizer, gate rejects
mismatched body vocab) under
`tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…515 round-1 follow-up)

Live verify on gpt-oss-20b uncovered a non-stream regression masked by
unit tests: `_route_tokens_for_channels`'s content override was
designed for the legacy router, which could not classify multi-token
``commentary`` and therefore left ``routed.content`` non-None on tool
calls. The new ``HarmonyStreamingRouter`` correctly classifies
commentary as a TOOL_CALL event, so ``routed.content`` becomes None
even though the model DID emit a structural sequence the downstream
``HarmonyToolParser`` needs to parse. The override then clobbered
``fallback_text`` (which still carries the commentary wire text —
``_clean_gpt_oss_output`` bails on commentary), and the route's
``_parse_tool_calls_with_parser`` received an empty string.

Fix: skip the override when ``routed.tool_calls`` is non-empty. The
override remains in force for the original #442 case (analysis-only,
no final, no tool call → return empty content so the analysis body
does not leak into the user-visible message).

Verified post-fix:
- Non-stream tool call on gpt-oss-20b: surfaces correctly
- Streaming tool call: still correct
- Non-stream analysis + final (no tools): still correct, no marker leak
- pydantic_ai_full: 6/6 pass
- anthropic_sdk: 5/5 pass
- langchain: 6/6 pass
- 28-test unit panel (tests/test_engine_router_non_stream.py +
  tests/test_batched_engine_output_router.py + the harmony bypass
  regressions) all green.

Pinning invariant added:
``test_tool_call_routing_preserves_fallback_text`` —
fakes a router returning ``tool_calls=[wire_text]`` and asserts the
fallback_text survives intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n + perf/log nits

Two BLOCKING and two NIT findings from pr_validate round-2:

1. BLOCKING — recipient regex too strict
   `^functions\.[A-Za-z_]...` rejected OpenAI-spec valid tool names
   whose first character is a digit (e.g. `functions.2fa_lookup`).
   OpenAI documents `[a-zA-Z0-9_-]{1,64}` for function names; match
   that exactly so any tool the upstream API accepts also round-trips
   through the router.

2. BLOCKING — body-vocab probe set alone insufficient
   Three probe strings cannot prove full-vocab parity. A tokenizer
   with matching markers AND matching probes but a remapped uncommon
   token could still corrupt content later. Add tokenizer-identity
   allowlist as the first gate layer: require `tokenizer.name_or_path`
   to contain `gpt-oss` (case-insensitive substring) so quant-suffix
   renames (`-MXFP4-Q8`) and org prefixes (`openai/` vs
   `mlx-community/`) all resolve correctly while strangers are
   rejected up-front. The marker + probe checks remain as
   defense-in-depth.

3. NIT — `feed_sequence` O(n²) string concat
   Replace `content += event.text` / `reasoning += event.text` in the
   per-token loop with list-append + `"".join(...)` at return time.
   On long non-stream generations the old form copied the
   accumulator on every token.

4. NIT — `logger.info` per-request spam
   The streaming factory runs on every `/v1/chat/completions` call.
   Downgrade the "upgraded harmony router" message to DEBUG so it
   doesn't flood production logs.

Plus ruff format on the three affected files (round-1 commit left
them at non-canonical line shapes).

Pinning invariants added:
- `test_compat_gate_rejects_unknown_tokenizer_identity` — Mistral-named
  tokenizer with matching encoder is still rejected.
- `test_compat_gate_accepts_gpt_oss_quant_suffix_variants` — all
  `mlx-community/gpt-oss-*` and `openai/gpt-oss-*` variants accepted;
  arbitrary names rejected.
- `test_recipient_shape_accepts_digit_start_names` — `functions.2fa_lookup`,
  `functions.0_index`, `functions.123`, `functions.tool-with-dash`.

Live verify on gpt-oss-20b post-fix:
- non-stream tool call: still surfaces with correct args
- analysis + final: still routes correctly, zero marker leak
- log spam: NONE during normal request (was 1 INFO line per request)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uncated bodies + cache compat probe

Round-3 findings from pr_validate:

1. BLOCKING — non-stream path discarded `routed["tool_calls"]`
   When `HarmonyStreamingRouter.finalize()` synthesized a TOOL_CALL
   event for truncated commentary (no `<|call|>` in the raw output),
   `_clean_gpt_oss_output`'s bail-out condition might not match —
   the strip regex then removed the commentary header from
   `fallback_text`, and the route's text-based `HarmonyToolParser`
   saw no tool call at all. Append the reconstructed wire text from
   `routed["tool_calls"]` to `fallback_text` so the downstream parser
   can extract the call. Suppressed when `fallback_text` already
   carries a commentary marker (avoids double-parsing the same call).

2. BLOCKING — finalize() surfaced mid-JSON truncations as completed
   `StreamableParser.process_eos` appends the in-progress commentary
   message to `messages` even when the body was cut off mid-stream
   (max_tokens limit hit in the middle of `{"city":"NY`). The router
   then synthesized a TOOL_CALL event with corrupt arguments. Add
   `_tool_call_body_is_valid` — when the message's `content_type`
   indicates JSON, require `json.loads(body)` to succeed; empty
   bodies and JSON-decode failures cause `finalize()` to drop the
   event. Sibling test `test_finalize_drains_truncated_commentary_message`
   already pins that a COMPLETE-but-unterminated body still surfaces;
   the new tests pin that mid-JSON / empty bodies do NOT.

3. NIT — per-request compat probe + harmony encoding load
   `is_openai_harmony_compatible` ran the three-layer gate (marker
   IDs + body-vocab probes) on every `/v1/chat/completions` call, and
   each call re-loaded the harmony encoding. Cache both: a
   module-level `_COMPAT_RESULT_CACHE` keyed by lowercased
   `name_or_path` short-circuits subsequent calls, and a
   `_HARMONY_ENCODING_CACHE` reuses the loaded encoding. Both fall
   back to the slow path if the cache is cleared.

New invariants pinned:
- `test_finalize_drops_mid_json_truncation` — `{"city":"NY` → no event
- `test_finalize_drops_truncated_commentary_with_empty_body` —
  empty body → no event
- `test_compat_gate_caches_per_tokenizer_identity` — repeat calls
  hit the cache (no encode invocations)
- `test_finalize_drained_tool_call_supplements_fallback_text` —
  engine appends wire text when fallback_text lacks markers;
  idempotent when commentary is already present.

Existing tests with reused fixture identities re-keyed to unique
`name_or_path` strings so the new result cache doesn't cross-
contaminate them.

Live verify on gpt-oss-20b post-fix:
- non-stream tool call still surfaces
- analysis + final still routes correctly
- log spam: still silenced

35/35 router-related unit tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y marker IDs

Round-4 findings from pr_validate:

1. BLOCKING — gate skipped missing marker IDs via ``continue``
   ``if model_id is None: continue`` meant a tokenizer with only a
   subset of harmony markers in its vocab could pass the gate.
   StreamableParser needs ALL seven (``<|channel|>`` / ``<|message|>``
   / ``<|call|>`` / ``<|end|>`` / ``<|return|>`` / ``<|start|>`` /
   ``<|constrain|>``) to function — a partial tokenizer would crash
   inside the parser when the model emitted a marker the gate didn't
   check. Replace ``continue`` with ``return False``. Any production
   gpt-oss tokenizer has all seven; tokenizers that don't are not
   harmony-compatible.

2. BLOCKING — finalize() didn't drain post-EOS final/analysis delta
   ``process_eos()`` may surface a ``last_content_delta`` from a
   truncated final or analysis message that was buffered inside
   StreamableParser. The previous finalize() only checked for
   freshly-closed commentary messages and discarded any delta. Add
   a post-EOS drain that routes the delta to CONTENT (final) or
   REASONING (analysis) using the ``pre_eos_channel`` snapshot (the
   post-EOS channel becomes ``None``). Empirically the current
   openai-harmony build emits everything during ``feed()``, but the
   contract is pinned so future builds can't silently regress.

3. NIT — compat cache key was lowercased ``name_or_path`` only
   Two tokenizer instances with the same identity string but
   distinct marker IDs (e.g. test mocks, or a model loaded with a
   vocab override) would share a stale cached True/False. Include
   the marker-ID tuple in the cache key.

New invariants pinned:
- ``test_finalize_does_not_double_emit_completed_final`` — a normal
  final-channel response with ``<|return|>`` emits via per-token
  feed only; finalize() must return None.
- ``test_finalize_drains_truncated_final_buffered_delta`` — when EOS
  surfaces a buffered final/analysis delta, finalize() routes it to
  the correct channel.
- ``test_compat_gate_rejects_missing_marker_ids`` — TokenMap with
  ``harmony_call=None`` is rejected.
- ``test_compat_gate_cache_segregates_by_marker_ids`` — same
  identity + distinct marker tuples → distinct cache entries.

Live verify post-fix:
- non-stream tool call still surfaces correctly
- 39/39 router-related unit tests green

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… test

Round-5 codex findings (2 BLOCKING, 0 NIT — converging fast):

1. BLOCKING — engine bulk guard suppressed multi-tool finalize drain
   ``"<|channel|>commentary" not in fallback_text`` rejected the
   entire ``routed["tool_calls"]`` append when ANY commentary marker
   was already present. In a multi-tool turn (model emits call #1
   cleanly with ``<|call|>``, gets truncated mid-call-#2), the
   fallback already carries call #1's marker so call #2 was silently
   dropped. Replace with per-wire-text substring check: append each
   reconstructed tool call only when its exact text is missing from
   fallback_text. Idempotent on call #1 + appends call #2.

2. BLOCKING — compat-cache test was non-discriminating
   ``test_compat_gate_caches_per_tokenizer_identity`` used an
   identity outside the allowlist, so the gate short-circuited
   BEFORE ``encode`` was ever invoked. The test would pass even if
   the cache were removed entirely. Redesign with a real gpt-oss
   identity + matching marker IDs so the gate reaches the body-vocab
   probe step; only the mismatching probe response causes False,
   and the second call MUST be short-circuited by the cache (asserted
   via encode-call counter).

Updated test_finalize_drained_tool_call_supplements_fallback_text
to match the new per-wire-text contract: Case B now pins the
"same call must not be doubled" invariant, while the new
``test_multi_tool_fallback_with_existing_marker_still_appends_new``
pins the multi-tool drain.

40/40 router-related unit tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ynthesis

Adopt vLLM / SGLang same default: HarmonyStreamingRouter.finalize() now
only flushes parser state via process_eos and returns None. It no longer:

  * synthesizes a TOOL_CALL from a truncated commentary message (no
    <|call|>) — a max_tokens / stop cutoff must not be executed as if
    the model had emitted the call terminator, even if the partial
    body parses as JSON. Rescuing the rare recoverable case is dwarfed
    by the cost of one wrongly-invoked side-effecting tool.
  * drains a post-EOS last_content_delta for final / analysis — per-
    token feed() already produced every body byte; emitting another
    delta risks duplicating the last token if a future openai-harmony
    build does not clear last_content_delta between flushes.

Both behaviors were the last two codex round-5/6 BLOCKING findings on
PR #515 (closes #444 #455 #468 #480, lands #513). They were originally
added to recover edge cases but the analysis above shows the safer
default is to drop them; the helper _tool_call_body_is_valid + the
json import are removed alongside since they are no longer reachable.

Tests:
  * test_finalize_drains_truncated_commentary_message → rename to
    test_finalize_never_synthesizes_truncated_commentary, assert None.
  * test_finalize_drains_truncated_final_buffered_delta → rename to
    test_finalize_does_not_drain_post_eos_buffered_delta, assert None
    on both final and analysis channels.
  * test_finalize_drops_mid_json_truncation /
    test_finalize_drops_truncated_commentary_with_empty_body /
    test_finalize_does_not_double_emit_completed_final already pinned
    drained == None — docstrings refreshed to attribute the contract
    to the safer-default rule.

29 targeted tests still pass; engine-side multi-tool dedup tests are
unaffected because they exercise the routed["tool_calls"] append loop,
which still fires from feed()-time TOOL_CALL events (completed tool
calls with <|call|> closing the message).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ural dedup

Two fixes flagged by codex on PR #515 round-7:

BLOCKING (output_router_harmony.py:175):
  _reconstruct_tool_call_text reinserted the tool-call body into a
  delimiter-based harmony wire text consumed by HarmonyToolParser. A
  body containing literal harmony sentinel strings (``<|call|>``,
  ``<|message|>``, etc.) would let the downstream parser anchor on the
  embedded sentinel and truncate the JSON arguments. On gpt-oss those
  sentinels are normally single special tokens never emitted as body
  text, but an adversarial prompt could echo the raw characters back
  via body vocab. Reject (raise ValueError) rather than escape — the
  engine's outer try/except then falls back to the legacy text-based
  parser instead of feeding corrupt args downstream. Consistent with
  the recipient-validation BLOCKING from round-1.

  ``content_type`` also gets a sentinel scan, but the legitimate
  ``<|constrain|>`` prefix is stripped first so the framing carrier
  doesn't trip its own gate.

NIT (engine/batched.py:1373):
  De-duping reconstructed tool calls with verbatim substring match
  doubled calls when the model emit and router reconstruction differ
  only by whitespace runs (e.g. ``to=foo  <|constrain|>`` vs
  ``to=foo <|constrain|>``). Replace with a structural compare on the
  ``(recipient, normalized-body)`` tuple extracted via the new
  ``_HARMONY_TC_SIGNATURE_RE``. Robust to any spacing the model picks
  while still distinguishing different arguments for the same
  recipient. Falls back to verbatim match on non-canonical
  reconstructions (better to risk one duplicate than silently drop).

Tests:
  * test_reconstruct_tool_call_rejects_body_carrying_harmony_sentinel
    pins each of the seven forbidden sentinels and a smuggled
    content_type. Sanity: clean JSON + ``<|constrain|>json`` ctype
    still reconstruct.
  * test_fallback_dedup_normalizes_whitespace_variance pins the
    spacing-variant case (same call) and the same-recipient-different-
    body case (must still append).

31 targeted tests pass (parsers regression + engine non-stream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…encode validation

Three BLOCKINGs + one NIT from codex on PR #515 round-8:

BLOCKING engine/batched.py:1313 — Body whitespace collapse broke
distinct legitimate calls. The signature was returning
``" ".join(body.split())`` so ``{"q":"a b"}`` and ``{"q":"a  b"}``
deduped as the same call and the later one was silently dropped.
Use the exact body bytes captured between ``<|message|>`` and
``<|call|>``; the regex's ``\s*`` already absorbs the only place
spacing can legitimately differ (between ``to=`` and
``<|constrain|>``).

BLOCKING engine/batched.py:1397 — When ``_harmony_tool_call_signature``
returned ``None``, the previous "best-effort append" fed non-canonical
wire text to HarmonyToolParser, which either fails to parse (best
case) or anchors on the wrong markers and surfaces a corrupt tool
call (worst case). Skip with a debug log instead — the router has
already lost the tool-call signal, nothing safe to recover.

BLOCKING output_router_harmony.py:334 — ``list(model_ids)`` assumed
``tokenizer.encode()`` returns a flat ``list[int]``, but HF
``BatchEncoding`` and tensor-shaped returns either raise on ``list()``
or yield opaque non-int entries. Validate that the encode result is a
flat int sequence; on anything else fall back to ``False`` so the
engine uses the legacy router instead of crashing.

NIT output_router_harmony.py:274 — ``is_openai_harmony_compatible``
called both ``is_openai_harmony_available()`` and ``_get_harmony_encoding()``
to check the dep; the latter already returns ``None`` on import
failure. Drop the redundant probe.

Tests:
  * test_compat_gate_rejects_non_int_encode_result — covers both
    BatchEncoding-style iter (dict keys) and tuple-of-strings.
  * test_fallback_dedup_preserves_internal_body_whitespace — pins
    that ``{"q":"a b"}`` and ``{"q":"a  b"}`` are NOT collapsed.
  * test_non_canonical_router_wire_text_is_dropped — pins that a
    missing ``<|call|>`` terminator causes the wire text to be
    dropped, not appended.

34 targeted tests pass. Round-7 ``test_fallback_dedup_normalizes_whitespace_variance``
still passes because the spacing variance lives in the wrapper, not
the body — that regex ``\s*`` still absorbs it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gex + sentinel abstain

Three BLOCKINGs from codex on PR #515 round-9:

BLOCKING engine/batched.py:1446 — Set dedup silently dropped a
legitimate second identical tool call in the same turn (e.g. user
asks for the same weather lookup twice). Switch to ``Counter`` so
each existing match in fallback_text consumes ONE routed call;
multiplicity is preserved for both sides.

BLOCKING engine/batched.py:1311 — Signature regex matched bare
``to=functions.X<|message|>...<|call|>`` without the
``<|channel|>commentary`` prefix. A fallback_text where
``clean_output_text`` stripped the commentary header but left the
tail (or any ordinary content carrying those substrings) would
false-suppress the router's real wire text. Anchor the regex on the
full commentary frame so only canonical tool-call shapes count as
existing signatures.

BLOCKING output_router_harmony.py:217 — Round-7's hard ``raise
ValueError`` on body-sentinel substrings dropped legitimate calls
whose OpenAI tool-call JSON values legally contain ``<|...|>``
characters. Flip the contract: ``_reconstruct_tool_call_text`` now
returns ``str | None`` and ABSTAINS (returns None) on sentinel-laden
body or content_type. The caller (``feed()``) emits no router event;
the engine's outer plumbing leaves the model's raw fallback_text
untouched and the legacy text-based HarmonyToolParser handles the
call (with the same parse limitation every text-based harmony
consumer has — no worse than baseline). Recipient still ``raise``s
because recipient corruption is structural, not body-content-
dependent.

Tests:
  * test_reconstruct_tool_call_abstains_on_body_carrying_harmony_sentinel
    (renamed from rejects_..., now asserts None instead of raises).
  * test_identical_calls_twice_in_turn_both_survive_dedup pins the
    multiplicity contract on both balanced (2+2 → 2) and unbalanced
    (1+2 → 2) cases.
  * test_signature_regex_requires_commentary_frame pins that a
    stripped fallback_text bare tail does NOT false-suppress.

36 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aming visibility

Codex round-10 BLOCKING (output_router_harmony.py:292): when
``_reconstruct_tool_call_text`` abstains for a body containing a
sentinel substring, the per-token feed loop had already suppressed
commentary body deltas (we buffer until close), so emitting nothing on
close — the round-9 behavior — made the entire tool call invisible to
the streaming user. That regressed against baseline, where the
PR-#515-replaced legacy router leaked commentary body into CONTENT
verbatim (one of the original bugs being fixed, but at least visible).

Fix: surface the accumulated body bytes as a CONTENT event when
abstaining. The streaming user sees the tool call's intent
(``{"text":"use <|call|>"}``) — semantically wrong channel, but
matches baseline behavior and breaks no other contract. The engine
non-stream path is unaffected: it relies on ``fallback_text``, which
still carries the model's raw emit and reaches HarmonyToolParser the
same way it would for any harmony tool call.

Honors all three preceding rounds:
  * Round-7 (don't emit corrupt structured event) — no TOOL_CALL
    event is emitted; the structured-path contract is preserved.
  * Round-9 (don't drop legitimate calls with sentinel chars in JSON
    values) — the call is still observable via the streaming CONTENT
    delta and the non-stream fallback_text.
  * Round-10 (don't drop streaming visibility) — body now flushed.

Tests:
  * test_reconstruct_tool_call_abstains_on_body_carrying_harmony_sentinel
    docstring updated to attribute the contract to round-10.
  * test_sentinel_body_emits_content_event_for_streaming_visibility
    pins the new ``feed()`` contract: TOOL_CALL stays None on abstain,
    but CONTENT carries the body bytes verbatim.

37 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… + ctype shape

Codex round-11 surfaced two BLOCKINGs and two NITs in fresh territory
(no oscillation with rounds 7/9/10):

BLOCKING output_router_harmony.py:541 — ``known in name_lc`` substring
matching let any identity carrying ``gpt-oss`` anywhere in its name
pass the allowlist (``my-not-gpt-oss``, ``foo/bar-gpt-oss-malicious``).
Switch to anchored regex ``(?:^|/)gpt-oss(?:-|$)`` so the basename
must START with ``gpt-oss-`` (after a path separator or string start)
and either continue with ``-`` or end. Canonical names
(``openai/gpt-oss-20b``, ``mlx-community/gpt-oss-20b-MXFP4-Q8``,
bare ``gpt-oss-20b``) still pass; tail-substring fakes don't.

BLOCKING output_router_harmony.py:507 — Cache key
``(name_or_path, marker_ids)`` allowed two distinct tokenizer
instances with the same name + markers but a remapped body vocab to
share a single cached True. Fold ``id(tokenizer)`` into the key —
distinct instances re-probe authoritatively, same-instance reuses
cleanly with zero-encode cache hits (cheaper than a body-probe
fingerprint and answers codex's "tokenizer instance" suggestion
directly).

NIT output_router_harmony.py:226 — ``content_type`` was emitted
verbatim, so a bare enum value like ``"json"`` would reconstruct
non-canonical wire text. Validate against
``^<\|constrain\|>[A-Za-z0-9_\-]{1,32}$`` and abstain (return None)
on anything outside that shape. The regex implicitly rejects embedded
sentinels too, subsuming the round-7 smuggled-sentinel content_type
scan.

NIT tests/test_engine_router_non_stream.py:225 — Round-6 made
``HarmonyStreamingRouter.finalize()`` always return None, so the
"finalize-synthesized truncated tool call" framing in this test's
docstring was stale (the fake-router stub actually stands in for the
``feed_sequence`` path that surfaces a completed ``<|call|>``-closed
tool call). Renamed to
``test_routed_tool_call_supplements_fallback_text`` and rewrote the
docstring to attribute the contract to ``feed_sequence`` not
``finalize``.

Tests:
  * test_compat_gate_anchored_allowlist_rejects_tail_substring_fake
    pins 4 rejected tail-substring names + 4 accepted canonical names.
  * test_compat_cache_key_segregates_by_tokenizer_instance pins that
    two distinct instances with the same name+markers but mismatched
    body vocab segregate correctly.
  * test_reconstruct_tool_call_abstains_on_body_carrying_harmony_sentinel
    adds bare-``"json"`` ctype + bad-char ctype abstain coverage.

39 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-12 surfaced 2 BLOCKINGs and 1 NIT that compound into a
real production bug (full_unit caught it via
``test_compat_cache_key_segregates_by_tokenizer_instance``).

BLOCKING / NIT output_router_harmony.py:551 — Round-11 keyed the
cache on ``(name_lc, marker_ids, id(tokenizer))``. ``id()`` is the
Python object's memory address; CPython is free to reuse it after the
object is garbage-collected, so a tokenizer freed between requests
could have its address reused by a NEW (potentially incompatible)
tokenizer that then hits the stale ``True`` and is wrongly upgraded
to the harmony router. The cache also grew unbounded across model
reloads. Replace with a ``WeakKeyDictionary`` keyed on the tokenizer
object itself, with an inner ``dict[(name_lc, marker_ids), bool]``
for per-TokenMap segregation. Entries auto-clear on GC — no id-reuse
hazard, bounded growth. Non-weak-referenceable tokenizers (some test
mocks, ``__slots__`` classes without ``__weakref__``) fall through
to recompute on every call — correctness preserved at a small cost.

BLOCKING output_router_harmony.py:106 — Round-11's anchored basename
allowlist ``(?:^|/)gpt-oss(?:-|$)`` still accepted arbitrary owners
(``some-user/gpt-oss-remapped``, ``evil-org/gpt-oss-20b``). Restrict
to known owner prefixes: ``openai/``, ``mlx-community/``,
``unsloth/``, plus the bare basename form. Any other owner is
rejected even if its basename starts with ``gpt-oss-`` — owner
provenance is now part of the identity gate, not just basename
shape. Combined with the body-vocab probe set this triples the
defense.

Tests:
  * test_compat_gate_anchored_allowlist_rejects_tail_substring_fake
    grows to include ``some-user/gpt-oss-remapped``,
    ``evil-org/gpt-oss-20b``, ``anonymous/gpt-oss`` in the reject
    list; ``unsloth/gpt-oss-20b-MLX-8bit`` joins the accept list.
  * test_compat_cache_key_segregates_by_tokenizer_instance no longer
    leaks across full_unit runs (WeakKeyDict + instances kept in
    scope per test segregate cleanly).
  * test_compat_gate_cache_segregates_by_marker_ids rewritten for
    the new WeakKeyDict layout (single tokenizer instance probed
    with two distinct TokenMaps must produce two inner entries).
  * test_compat_gate_caches_per_tokenizer_identity no longer needs
    manual cache cleanup — fresh tokenizer instance is naturally
    isolated by the WeakKeyDict.

39 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xact-count test

Codex round-13 found 1 BLOCKING + 2 NITs after the round-12 allowlist
tightening:

BLOCKING output_router_harmony.py:593 — the round-12 owner-prefix
allowlist was anchored at ``^owner/...``, which silently rejected
LOCAL filesystem paths (``/models/gpt-oss-20b``,
``~/.cache/.../gpt-oss-20b``, ``./gpt-oss-20b-quantized``). Production
deployments that load models by absolute path then fell back to the
legacy leaking router. Replace the flat regex list with a two-tier
``_is_known_harmony_identity`` helper:
  * Bare basename (no ``/``) — accept iff basename matches gpt-oss-*.
  * Local filesystem path (``/``, ``~``, ``./``, ``../`` prefix) —
    trust the basename match (user-controlled artifact, body-probe
    set is the authoritative vocab check).
  * Remote HF ID (``owner/name`` shape) — owner must be in
    {openai, mlx-community, unsloth}.
Identical defense surface as round-12 for remote IDs, plus the local-
path support the round-12 tightening accidentally broke.

NIT output_router_harmony.py:213 — ``_reconstruct_tool_call_text``
built the body with quadratic ``+=`` accumulation. Switch to list +
``"".join`` once; same pattern as the round-2 NIT fix in
``feed_sequence``. Avoidable copy cost on large JSON tool arguments.

NIT tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py:213
— ``test_per_token_streaming_routes_one_event_per_body_token`` claimed
"one event per body token" but only asserted ``>= 2``. Coalescing or
dropped intermediate deltas would have passed silently. Compute the
exact body-token count from the encoded sequence (positions between
``<|message|>`` and ``<|return|>``) and assert event count EQUALS
body-token count.

Tests:
  * test_compat_gate_anchored_allowlist_rejects_tail_substring_fake
    extends ``accepted_names`` with 5 local-path shapes (absolute,
    HF-cache, tilde, dot-relative, parent-relative).
  * test_per_token_streaming_routes_one_event_per_body_token
    strengthened to exact-count assertion.

39 targeted tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-15 refactor closing PR #515 codex round-12 / round-14 BLOCKING:
eliminate the sentinel-delimited wire-text round-trip for harmony
tool calls; surface structured {"name", "arguments"} all the way
through the engine and routes. Matches vLLM / SGLang exactly.

Why this round:

The previous design reconstructed harmony commentary messages back
into wire text (<|channel|>commentary to=functions.X<|message|>
{body}<|call|>) so the downstream text-based HarmonyToolParser could
regex-parse it again. That round-trip lost tool calls whose JSON
arguments contained literal harmony marker substrings (e.g.
{"text":"<|call|>"}) — the regex anchored on the embedded sentinel
and silently truncated the call. Codex flagged this for the third
time in round 14; no amount of body-scrubbing on the reconstruction
side could fix it without dropping legitimate calls.

The fix is the one codex proposed (and the one vLLM and SGLang
already ship): pass structured (name, arguments) data through the
pipeline. The text-based parser stays in place for legacy / non-
router paths; the router-fed path bypasses it entirely.

Changes:

- GenerationOutput: new tool_calls: list[dict] | None field. Appended
  after raw_text / reasoning_text to preserve positional-arg
  compatibility.
- RouterEvent: new tool_call: dict | None field. Populated on
  Channel.TOOL_CALL events by routers that natively parse tool calls.
- HarmonyStreamingRouter: feed() emits RouterEvent with the
  structured payload on commentary message close; feed_sequence()
  returns list[dict] entries on routed["tool_calls"]. Removed the
  wire-text reconstruction + sentinel-body abstain logic (no more
  reconstruction, no more sentinels to worry about).
- HarmonyStreamingRouter.identity gate: tolerate HuggingFace cache
  snapshot paths like ~/.cache/huggingface/hub/models--openai--
  gpt-oss-20b/snapshots/<sha> via a new models--<owner>--<name>
  pattern. Round-14 BLOCKING #1: previous basename-only check
  rejected the SHA snapshot dir and fell back to the leaking legacy
  router for every HF-cache-loaded gpt-oss model.
- BatchedEngine._route_tokens_for_channels: returns
  (reasoning, content, structured_tool_calls). Removed the
  Counter-based wire-text dedup machinery (no more wire text to
  dedup). When structured tool calls are present, force content to
  the router's CONTENT-channel result so un-cleaned commentary
  headers in fallback_text don't bleed into user-facing content.
- BatchedEngine._make_routed_output: propagate structured tool_call
  payload from RouterEvent → GenerationOutput.tool_calls for the
  streaming path.
- _parse_tool_calls_with_parser: new structured_tool_calls= kwarg.
  When non-None the text-based parser is bypassed entirely; the
  structured entries are converted to ToolCall objects directly.
- routes/chat.py + routes/anthropic.py (non-stream): pass
  output.tool_calls into the helper.
- routes/anthropic.py (stream): accumulate engine-surfaced tool_calls
  on a per-chunk basis and pass the accumulated list to the
  end-of-stream parser call.
- service/postprocessor.py (stream): on Channel.TOOL_CALL events
  with output.tool_calls populated, emit StreamEvent(type="tool_call",
  ...) directly instead of regex-parsing the delta text.

Tests:

- tests/parsers/regressions/test_issue_513_harmony_streamable_parser.py:
  rewritten for the structured contract. New
  test_structured_payload_preserves_body_with_harmony_sentinels
  replays the BLOCKING #2 scenario (7 sentinel-bearing JSON bodies)
  and asserts each surfaces intact via the structured payload. New
  test_compat_gate_accepts_hf_cache_snapshot_path pins BLOCKING #1
  fix. 27 tests, all green.
- tests/test_engine_router_non_stream.py: rewritten for the
  (reasoning, content, structured_tool_calls) triple return. New
  test_structured_tool_call_passthrough_preserves_sentinel_bearing_arguments
  pins the same BLOCKING #2 scenario at the engine layer. 13 tests,
  all green.
- tests/test_parallel_tool_calls.py: fake parser signature update.
- tests/test_server_utils.py: field-order test extended to include
  tool_calls in the appended-fields list.

Full unit suite: 4475 passed, 19 skipped, 16 deselected, 7 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-15 BLOCKING ×2 on the new structured-tool-call streaming
fast-path introduced for HarmonyStreamingRouter passthrough:

1. ``tool_calls[*].index`` restarted at 0 on every router event, so
   clients merging deltas on ``index`` would overwrite earlier calls.
   Added per-response ``_structured_tool_call_count`` counter
   (initialized in __init__ + reset()) and assign monotonic indices
   across both intra-chunk and inter-chunk emissions.
2. The early structured-tool-call return bypassed the
   ``parallel_tool_calls=false`` cap that the non-streaming path
   enforces (routes/chat.py post-parse trim). Added
   ``_parallel_tool_calls_allowed()`` helper (dict + pydantic shapes)
   and trim ``allowed_calls`` to one before structuring; suppressed
   chunks still emit a finish event when ``output.finished`` to keep
   the route's buffered_finish gate firing.

Tests (TestStructuredToolCallStreaming, 9 cases):
- index monotonic across chunks
- index monotonic within single chunk
- parallel_tool_calls=false cap across chunks
- parallel_tool_calls=false cap within single chunk
- parallel_tool_calls=true (explicit) does not cap
- suppressed-by-cap chunk preserves finish semantics
- structured event auto-assigns id when missing
- structured event preserves caller-provided id
- reset() clears the structured counter

Also updated ``_make_output`` helper to default
``tool_calls=None`` so the new fast-path stays inert in tests that
don't opt in (MagicMock auto-attrs are truthy and would otherwise
trip the channel-routed structured branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raullenchai raullenchai merged commit 529fdd3 into main Jun 5, 2026
8 checks passed
@raullenchai raullenchai deleted the feat/openai-harmony-bypass-513 branch June 5, 2026 04:52
raullenchai added a commit that referenced this pull request Jun 5, 2026
PR #515 — openai-harmony StreamableParser passthrough for gpt-oss tool
calls. Closes #444 #455 #468 #480 #513 (harmony streaming tool calls
leak channel markers / commentary into content).

Architecture: HarmonyStreamingRouter delegates to the official harmony
parser (same library vLLM and SGLang use), surfaces structured
(name, arguments) tool calls through the engine without round-tripping
through wire-text — eliminating the entire class of sentinel-anchored
regex parsing bugs that the cluster #444-#480 stemmed from.

Streaming fast-path enforces tool_calls[*].index monotonicity across
router chunks and the parallel_tool_calls=false external contract cap.

Release-SOP gates walked: G1 release-smoke OK, G2 codex×16 rounds OK,
G3 CLI-fidelity OK, G4 4484 unit tests OK, G5-G9 pr_validate 3×3
matrix (qwen3.5-35B-A3B-8bit / qwen3.6-27B-MLX-8bit / gpt-oss-20b
-MXFP4-Q8) OK. G10 N/A (no mlx-* dep bump). G11 deferred to #516
(target v0.6.76) — auto-routing escape hatch flag pair for
HarmonyStreamingRouter selection; waiver justified by safe-fallback
to legacy in-production harmony router.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant