fix(routes): default empty content to "" so message.content key always present#467
Open
raullenchai wants to merge 5 commits into
Open
fix(routes): default empty content to "" so message.content key always present#467raullenchai wants to merge 5 commits into
raullenchai wants to merge 5 commits into
Conversation
…s present
When the engine returned an empty completion (e.g. prefix-cache hit that
sampled EOS first, or a reasoning-only response), routes/chat.py left
``final_content = None``. Combined with ``model_dump_json(exclude_none=True)``,
the ``content`` field was silently DROPPED from the response JSON, breaking
OpenAI clients (LangChain, Vercel AI SDK) that key off
``choices[0].message.content`` with KeyError / AttributeError.
OpenAI's spec requires ``message.content`` to always be present in
non-tool-call responses — empty string ``""`` when there is no text.
Tool-call responses are exempt (content may legitimately be null when
tool_calls is populated).
Fix: default ``final_content = ""`` when None and there are no tool_calls
so the key always survives serialization. Surfaced by iter15 onboarding
sweep on unsloth/Qwen3.6-27B-MLX-8bit — same prompt+seed replay produced
``{"role":"assistant"}`` with no content key on the second call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on test Three DeepSeek V4 Pro adversarial findings on PR #467: 1. **Add test for tool_call exemption** — OpenAI permits content:null when tool_calls is populated. Pre-fix behavior already satisfied this (content stayed None → dropped) but no test pinned the invariant. Added ``test_tool_call_response_content_may_be_null`` that monkeypatches the tool parser to return a tool_call and asserts content remains null (not empty string). 2. **Move guard closer to AssistantMessage construction** — DeepSeek noted the fix relies on ``tool_calls`` being final at the guard point. Verified no mutation after line 860 today, but moved guard from line 918 to immediately before ``AssistantMessage`` construction so a future refactor that adds tool_call mutation in the intervening logprobs/response_format block cannot silently regress. 3. **Streaming path** — DeepSeek asked whether the stream chat completion has the same bug. Investigated: streaming uses SSE deltas per OpenAI's spec; clients accumulate ``content`` across chunks. The role chunk itself has no content key (intentional, matches OpenAI). Well-behaved streaming clients use ``.get("content", "")`` per delta. Adding the default on streaming would emit empty ``"content":""`` keys on every reasoning/tool_call delta, which is wrong per spec. Documented this decision inline in the comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntinel + reasoning decision) DeepSeek round-2 raised 5 findings on the iter1 review changes. Addressed: **#2 (accepted)** — assertion ambiguity: ``msg.get("content") in (None,)`` returned True for both omitted-key and explicit-null. Changed to ``"content" not in msg`` so a future flip from omitted → explicit null (or worse, ``""``) fails loudly. **#1 (accepted via sentinel)** — verify monkeypatched parser actually fires. Added ``parse_call_count`` counter + assertion so if a future refactor adds an ``if cfg.tool_parser is not None`` guard to routes/chat.py:850, the test fails loudly instead of silently passing on the wrong code path. **#4 (declined with clarified comment)** — DeepSeek suggested adding ``not reasoning_content`` to the guard. Declined: OpenAI's o1/o3 reasoning models always emit ``content`` (often empty string) alongside reasoning, and the user-facing bug we're fixing was specifically reasoning-only responses dropping the key. Adding the exemption would re-introduce the crash for reasoning models. Clarified the comment to explain why ``reasoning_content`` is intentionally NOT an exemption. **#3, #5 (declined)** — double-reset and minimal tokenizer mock are defensive style only. Current behavior is correct; defensive padding adds test complexity without catching a real regression class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DeepSeek round-3 caught a genuine spec misread: my round-2 test asserted ``"content" not in msg`` for tool-call responses, claiming this matched "OpenAI's own response shape." That is wrong — api.openai.com emits ``"content": null`` (key present, value null), not key-absent. Clients that distinguish ``content is None`` from ``"content" not in msg`` would see a divergence between rapid-mlx and OpenAI. Fix: - Route now does a post-dump fixup: after ``chat_response.model_dump(exclude_none=True)``, walk ``choices[*].message`` and re-inject ``content: None`` when the ``exclude_none`` pass dropped it. Tool-call responses now emit ``"content": null`` matching OpenAI exactly. - Test renamed ``test_tool_call_response_content_is_explicit_null`` and flipped to assert ``"content" in msg AND msg["content"] is None``. - Comments and module docstring updated to reflect the actual OpenAI behavior (key present + null, not key-omitted). Non-tool-call empty responses continue to emit ``"content": ""`` (set by the pre-existing default at line 939) — also matches OpenAI for non-tool-call empty completions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fixup loop) Four findings on the round-3 post-dump fixup: **#1 (accepted, functional)** — ``json.dumps(response_dict, ...)`` bypasses Pydantic's JSON encoder chain (datetimes, UUIDs, Decimals, ``@field_serializer``). Today's response schema is all primitives so it worked, but a future field with a non-primitive type would TypeError at runtime with no test coverage. Switched to ``model_dump(mode="json", exclude_none=True)`` — same JSON-safe encoding that ``model_dump_json`` uses internally, just yielding a dict so we can post-process before serializing. **#2 (accepted, defensive)** — replaced silent ``isinstance(_msg, dict)`` guard with ``assert isinstance(_msg, dict)``. If a future Pydantic version ever returns a non-dict for nested model dumps, fail loudly instead of silently reverting to the dropped-content-key bug. **#3 (accepted, defensive)** — ``response_dict.get("choices", ())`` returns ``None`` if ``choices`` is explicitly None (not absent); for-loop would crash with TypeError. Changed to ``get("choices") or ()`` so explicit None collapses to empty iteration like absent does. **#4 (declined)** — ``_client()`` calls ``reset_config()`` before the caller's try/finally. Theoretical leak if engine constructor raises; no observable bug since ``reset_config()`` is idempotent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
raullenchai
added a commit
that referenced
this pull request
May 24, 2026
Round 4 of self-validation surfaced one trivial NIT: `^update\.?$` and `^updates?\.?$` are redundant — the `s?` form covers both. Consolidate into the single inclusive pattern with an inline comment so the intent is grep-able. Round 4 verdict: PASS (1 nit, 0 blocking). Tiering converged in 4 rounds: 1→1→0 blocking findings across rounds 1/2/3/4. Without tiering the same 4 rounds would have been 5/5/2/1 "failures" by the old binary verdict — exactly the spiral PR #467 hit. Mission accomplished. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
raullenchai
added a commit
that referenced
this pull request
May 24, 2026
…#474) * feat(pr_validate): integrate Google eng-practices code-review tiering Adds two new gates inspired by Google's eng-practices and refactors the DeepSeek-review verdict to converge instead of spiral. ## What changed 1. **DeepSeek prompt now requires `[BLOCKING]` or `[NIT]` on every finding** (prompts/deepseek_review.md). Only `[BLOCKING]` fails the gate; `[NIT]`s surface in the scorecard so the author can decide. Caps: 5 BLOCKING + 5 NIT per review forces triage over padding. Untagged findings default to `[BLOCKING]` so a forgotten prefix can't silently downgrade a real bug. 2. **New `cl_description_quality` step** (steps/cl_description_quality.py). Title + body hygiene gate from Google's CL-descriptions guidance: - Title not empty, >=3 words after CC-prefix strip, not in the known-bad blacklist (`fix bug`, `wip`, `update`, `tweaks`, ...). - Body not empty. - Body contains a rationale signal: `## Why`/`## Summary`/etc heading, `Closes #NNN`, inline `Why:`, or `because`-clause. - Override: `PR_VALIDATE_SKIP_DESC=1` for trivial dep-bumps. ## Why PR #467 (chat-route empty-content fix) spent 5 DeepSeek rounds in a spiral because every reply produced fresh "could be more defensive" findings and the previous prompt had no way to mark them as nits. Google's standard explicitly addresses this: "Reviewers should favor approving a CL once it is in a state where it definitely improves the overall code health, even if the CL isn't perfect." Tiered findings encode that principle — substantive bugs still block, but style preferences and future-proofing suggestions don't hold up merges. The CL-description gate addresses a parallel issue: PRs with bad titles or empty bodies are hard to bisect later, and the cost of fixing it (30 seconds) is much smaller than the cost of code archaeology forever. ## Test plan - [x] `pytest tests/test_pr_validate_google_review.py` — 42 new tests covering tier-split (untagged default, mixed cases, case-insensitive prefix) and CLDescriptionQualityStep (bad-pattern blacklist, CC-prefix stripping, rationale-signal detection, override env). - [x] `pytest tests/test_pr_validate_*` — all 79 pr_validate tests pass (no regressions in existing test_pr_validate_deepseek.py, test_pr_validate_runner.py, test_pr_validate_test_plan_check.py). - [x] `ruff check` + `ruff format --check` clean across all modified files. Closes the convergence-cost feedback loop that #467 surfaced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(pr_validate): tolerate indented rationale + harden override env check Two fixes from DeepSeek round 1 of PR #474's own self-validation: 1. Rationale-signal regexes were anchored at column 0 only — an indented `Why:` or `**Why:**` inside a bullet list / block quote (e.g. ` - Why: foo`) was not detected. Add an optional leading `[\s>*+\-]*` prefix so common markdown nesting still satisfies the gate. 2. The override check used `os.environ.get(_OVERRIDE_ENV)` which treats `PR_VALIDATE_SKIP_DESC=0` as truthy (the string "0" is truthy in Python). Switch to the repo-wide `env_truthy` helper so only `1`/`true`/`yes`/`on` actually skip the gate. 6 new tests pin both fixes (4 indented-rationale shapes + 2 =0 / =false override cases). Skipped DeepSeek's third finding (strip-then-re-prefix in `_split_findings_by_tier`) — intentional design choice so the helper returns clean strings usable in non-scorecard contexts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(pr_validate): handle conventional-commit breaking marker + pin 3-word threshold Round 2 of PR #474's self-validation surfaced 1 mis-classified [BLOCKING] + 4 NITs. Triage: * **Took NIT-2** — `_CC_PREFIX` regex didn't strip the `!` in conventional-commits breaking-change form (`feat!:`, `feat(api)!:`). Such titles passed through unstripped, so the bad-pattern blacklist silently never applied to breaking-change PRs. One-char regex fix + 3-case test. * **Overrode the [BLOCKING] threshold finding** — DeepSeek flagged the 3-word minimum as overly strict (e.g. blocks "fix: memory leak" = 2 words after strip). This is a threshold/policy choice, not a correctness bug, so [NIT]-class at most. Repo's actual title corpus is 5-10 words; Google's "informative for future grep" principle wants specificity ("memory leak" doesn't tell future searchers where the leak was — "memory leak on websocket close" does). Added a parametrized test pinning the policy so a future relaxation has to be intentional, not accidental. * **Skipped NIT-3** (strip-then-re-prefix) — intentional design already addressed in round 1; defensible. * **Skipped NIT-5** (`fix tests` blacklist) — authors can use `fix(tests): explain what specifically broke` form which strips to a descriptive title. This is the convergence pattern the new tiering system was designed for: round 1 produced 1 real bug + 2 NITs (fixed 1), round 2 produced 0 real bugs + 1 mis-classified NIT + 4 actual NITs (took 1, pinned 1, skipped 3). Without tiering both rounds would have "failed" on subjective findings; with tiering we converge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(pr_validate): pin _extract_findings → _split_findings_by_tier pipeline DeepSeek round 3 surfaced a real integration gap: the tier splitter and the numbered-list extractor are tested in isolation, but the contract between them (extractor strips leading "1. ", splitter sees "[BLOCKING]" at start) was never asserted end-to-end. If someone later refactors `_extract_findings` to preserve the leading list number, every finding would default to BLOCKING and tiering would silently break. Added a single test that feeds a realistic 4-finding model review (mixed 2 BLOCKING + 2 NIT) through both helpers and asserts the split is 2/2, not 4/0. Test fails loudly with a diagnostic if the contract regresses. This was DeepSeek's only legitimate concern across both round-3 findings (the other was the same point about test coverage — addressed by this same test). Convergence achieved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(pr_validate): dedupe redundant bad-title patterns Round 4 of self-validation surfaced one trivial NIT: `^update\.?$` and `^updates?\.?$` are redundant — the `s?` form covers both. Consolidate into the single inclusive pattern with an inline comment so the intent is grep-able. Round 4 verdict: PASS (1 nit, 0 blocking). Tiering converged in 4 rounds: 1→1→0 blocking findings across rounds 1/2/3/4. Without tiering the same 4 rounds would have been 5/5/2/1 "failures" by the old binary verdict — exactly the spiral PR #467 hit. Mission accomplished. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
routes/chat.pyleftfinal_content = None. Combined withmodel_dump_json(exclude_none=True), thecontentfield was silently DROPPED from the response JSON.choices[0].message.contentcrash withKeyError/AttributeErroron the missing field.final_content = ""when None and there are notool_calls(tool-call responses are exempt per OpenAI spec — content may legitimately be null when tool_calls is populated).Repro (pre-fix)
Why this matters
Confirmed via Pydantic v2 trace:
The empty-completion path is rare in normal usage but reproducible 5/5 on cache replay with seed sampling. Any prefix-cache hit that samples EOS first triggers it.
Test plan
python3.12 -m pytest tests/test_empty_content_key_present.py -v— 3/3 passpython3.12 -m pytest tests/test_routes.py tests/test_parallel_tool_calls.py tests/test_legacy_functions_field.py tests/test_text_only_media_gate.py tests/test_chat_logprobs_channel_routing.py tests/test_anthropic_stop_sequences.py— 79/79 pass (no regressions in adjacent route suites)ruff check+ruff format --checkon both files — clean🤖 Generated with Claude Code