Skip to content

fix(routes): default empty content to "" so message.content key always present#467

Open
raullenchai wants to merge 5 commits into
mainfrom
fix/empty-content-key-present
Open

fix(routes): default empty content to "" so message.content key always present#467
raullenchai wants to merge 5 commits into
mainfrom
fix/empty-content-key-present

Conversation

@raullenchai
Copy link
Copy Markdown
Owner

Summary

  • When the engine returned an empty completion, routes/chat.py left final_content = None. Combined with model_dump_json(exclude_none=True), the content field was silently DROPPED from the response JSON.
  • OpenAI clients (LangChain, Vercel AI SDK) that key off choices[0].message.content crash with KeyError / AttributeError on the missing field.
  • Fix: default final_content = "" when None and there are no tool_calls (tool-call responses are exempt per OpenAI spec — content may legitimately be null when tool_calls is populated).
  • 3 regression tests pin the invariant.

Repro (pre-fix)

# Surfaced by iter15 onboarding on unsloth/Qwen3.6-27B-MLX-8bit.
# Same prompt+seed sent twice — second call hits prefix cache, samples
# EOS first, finish_reason="stop" with no content tokens.
curl -X POST http://127.0.0.1:8501/v1/chat/completions \
  -H "Authorization: Bearer SEKRET" \
  -H "Content-Type: application/json" \
  -d '{"model":"unsloth/Qwen3.6-27B-MLX-8bit","messages":[{"role":"user","content":"Give me a creative 20-word story."}],"seed":42,"temperature":0.7,"max_tokens":50}'

# pre-fix run #2: {"choices":[{"message":{"role":"assistant"},...}]}
#                                                    ^^^ "content" key absent
# post-fix run #2: {"choices":[{"message":{"role":"assistant","content":"",...}}]}

Why this matters

Confirmed via Pydantic v2 trace:

AssistantMessage(content=None).model_dump_json(exclude_none=True)
# → '{"role":"assistant"}'   ← content key dropped
AssistantMessage(content="").model_dump_json(exclude_none=True)
# → '{"role":"assistant","content":""}'   ← OK

The empty-completion path is rare in normal usage but reproducible 5/5 on cache replay with seed sampling. Any prefix-cache hit that samples EOS first triggers it.

Test plan

  • python3.12 -m pytest tests/test_empty_content_key_present.py -v — 3/3 pass
  • python3.12 -m pytest tests/test_routes.py tests/test_parallel_tool_calls.py tests/test_legacy_functions_field.py tests/test_text_only_media_gate.py tests/test_chat_logprobs_channel_routing.py tests/test_anthropic_stop_sequences.py — 79/79 pass (no regressions in adjacent route suites)
  • ruff check + ruff format --check on both files — clean

🤖 Generated with Claude Code

raullenchai and others added 5 commits May 23, 2026 04:19
…s present

When the engine returned an empty completion (e.g. prefix-cache hit that
sampled EOS first, or a reasoning-only response), routes/chat.py left
``final_content = None``. Combined with ``model_dump_json(exclude_none=True)``,
the ``content`` field was silently DROPPED from the response JSON, breaking
OpenAI clients (LangChain, Vercel AI SDK) that key off
``choices[0].message.content`` with KeyError / AttributeError.

OpenAI's spec requires ``message.content`` to always be present in
non-tool-call responses — empty string ``""`` when there is no text.
Tool-call responses are exempt (content may legitimately be null when
tool_calls is populated).

Fix: default ``final_content = ""`` when None and there are no tool_calls
so the key always survives serialization. Surfaced by iter15 onboarding
sweep on unsloth/Qwen3.6-27B-MLX-8bit — same prompt+seed replay produced
``{"role":"assistant"}`` with no content key on the second call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on test

Three DeepSeek V4 Pro adversarial findings on PR #467:

1. **Add test for tool_call exemption** — OpenAI permits content:null when
   tool_calls is populated. Pre-fix behavior already satisfied this (content
   stayed None → dropped) but no test pinned the invariant. Added
   ``test_tool_call_response_content_may_be_null`` that monkeypatches the
   tool parser to return a tool_call and asserts content remains null
   (not empty string).

2. **Move guard closer to AssistantMessage construction** — DeepSeek noted
   the fix relies on ``tool_calls`` being final at the guard point. Verified
   no mutation after line 860 today, but moved guard from line 918 to
   immediately before ``AssistantMessage`` construction so a future refactor
   that adds tool_call mutation in the intervening logprobs/response_format
   block cannot silently regress.

3. **Streaming path** — DeepSeek asked whether the stream chat completion
   has the same bug. Investigated: streaming uses SSE deltas per OpenAI's
   spec; clients accumulate ``content`` across chunks. The role chunk
   itself has no content key (intentional, matches OpenAI). Well-behaved
   streaming clients use ``.get("content", "")`` per delta. Adding the
   default on streaming would emit empty ``"content":""`` keys on every
   reasoning/tool_call delta, which is wrong per spec. Documented this
   decision inline in the comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntinel + reasoning decision)

DeepSeek round-2 raised 5 findings on the iter1 review changes. Addressed:

**#2 (accepted)** — assertion ambiguity: ``msg.get("content") in (None,)``
returned True for both omitted-key and explicit-null. Changed to
``"content" not in msg`` so a future flip from omitted → explicit null
(or worse, ``""``) fails loudly.

**#1 (accepted via sentinel)** — verify monkeypatched parser actually
fires. Added ``parse_call_count`` counter + assertion so if a future
refactor adds an ``if cfg.tool_parser is not None`` guard to
routes/chat.py:850, the test fails loudly instead of silently passing
on the wrong code path.

**#4 (declined with clarified comment)** — DeepSeek suggested adding
``not reasoning_content`` to the guard. Declined: OpenAI's o1/o3 reasoning
models always emit ``content`` (often empty string) alongside reasoning,
and the user-facing bug we're fixing was specifically reasoning-only
responses dropping the key. Adding the exemption would re-introduce the
crash for reasoning models. Clarified the comment to explain why
``reasoning_content`` is intentionally NOT an exemption.

**#3, #5 (declined)** — double-reset and minimal tokenizer mock are
defensive style only. Current behavior is correct; defensive padding adds
test complexity without catching a real regression class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DeepSeek round-3 caught a genuine spec misread: my round-2 test asserted
``"content" not in msg`` for tool-call responses, claiming this matched
"OpenAI's own response shape." That is wrong — api.openai.com emits
``"content": null`` (key present, value null), not key-absent. Clients
that distinguish ``content is None`` from ``"content" not in msg`` would
see a divergence between rapid-mlx and OpenAI.

Fix:
- Route now does a post-dump fixup: after ``chat_response.model_dump(exclude_none=True)``,
  walk ``choices[*].message`` and re-inject ``content: None`` when the
  ``exclude_none`` pass dropped it. Tool-call responses now emit
  ``"content": null`` matching OpenAI exactly.
- Test renamed ``test_tool_call_response_content_is_explicit_null`` and
  flipped to assert ``"content" in msg AND msg["content"] is None``.
- Comments and module docstring updated to reflect the actual OpenAI
  behavior (key present + null, not key-omitted).

Non-tool-call empty responses continue to emit ``"content": ""`` (set
by the pre-existing default at line 939) — also matches OpenAI for
non-tool-call empty completions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fixup loop)

Four findings on the round-3 post-dump fixup:

**#1 (accepted, functional)** — ``json.dumps(response_dict, ...)`` bypasses
Pydantic's JSON encoder chain (datetimes, UUIDs, Decimals,
``@field_serializer``). Today's response schema is all primitives so it
worked, but a future field with a non-primitive type would TypeError at
runtime with no test coverage. Switched to
``model_dump(mode="json", exclude_none=True)`` — same JSON-safe encoding
that ``model_dump_json`` uses internally, just yielding a dict so we can
post-process before serializing.

**#2 (accepted, defensive)** — replaced silent ``isinstance(_msg, dict)``
guard with ``assert isinstance(_msg, dict)``. If a future Pydantic version
ever returns a non-dict for nested model dumps, fail loudly instead of
silently reverting to the dropped-content-key bug.

**#3 (accepted, defensive)** — ``response_dict.get("choices", ())`` returns
``None`` if ``choices`` is explicitly None (not absent); for-loop would
crash with TypeError. Changed to ``get("choices") or ()`` so explicit None
collapses to empty iteration like absent does.

**#4 (declined)** — ``_client()`` calls ``reset_config()`` before the
caller's try/finally. Theoretical leak if engine constructor raises; no
observable bug since ``reset_config()`` is idempotent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request May 24, 2026
Round 4 of self-validation surfaced one trivial NIT: `^update\.?$`
and `^updates?\.?$` are redundant — the `s?` form covers both.
Consolidate into the single inclusive pattern with an inline
comment so the intent is grep-able.

Round 4 verdict: PASS (1 nit, 0 blocking). Tiering converged in
4 rounds: 1→1→0 blocking findings across rounds 1/2/3/4. Without
tiering the same 4 rounds would have been 5/5/2/1 "failures" by
the old binary verdict — exactly the spiral PR #467 hit. Mission
accomplished.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
raullenchai added a commit that referenced this pull request May 24, 2026
…#474)

* feat(pr_validate): integrate Google eng-practices code-review tiering

Adds two new gates inspired by Google's eng-practices and refactors the
DeepSeek-review verdict to converge instead of spiral.

## What changed

1. **DeepSeek prompt now requires `[BLOCKING]` or `[NIT]` on every
   finding** (prompts/deepseek_review.md). Only `[BLOCKING]` fails the
   gate; `[NIT]`s surface in the scorecard so the author can decide.
   Caps: 5 BLOCKING + 5 NIT per review forces triage over padding.
   Untagged findings default to `[BLOCKING]` so a forgotten prefix
   can't silently downgrade a real bug.

2. **New `cl_description_quality` step** (steps/cl_description_quality.py).
   Title + body hygiene gate from Google's CL-descriptions guidance:
   - Title not empty, >=3 words after CC-prefix strip, not in the
     known-bad blacklist (`fix bug`, `wip`, `update`, `tweaks`, ...).
   - Body not empty.
   - Body contains a rationale signal: `## Why`/`## Summary`/etc
     heading, `Closes #NNN`, inline `Why:`, or `because`-clause.
   - Override: `PR_VALIDATE_SKIP_DESC=1` for trivial dep-bumps.

## Why

PR #467 (chat-route empty-content fix) spent 5 DeepSeek rounds in a
spiral because every reply produced fresh "could be more defensive"
findings and the previous prompt had no way to mark them as nits.
Google's standard explicitly addresses this: "Reviewers should favor
approving a CL once it is in a state where it definitely improves the
overall code health, even if the CL isn't perfect." Tiered findings
encode that principle — substantive bugs still block, but style
preferences and future-proofing suggestions don't hold up merges.

The CL-description gate addresses a parallel issue: PRs with bad
titles or empty bodies are hard to bisect later, and the cost of
fixing it (30 seconds) is much smaller than the cost of code
archaeology forever.

## Test plan

- [x] `pytest tests/test_pr_validate_google_review.py` — 42 new
  tests covering tier-split (untagged default, mixed cases,
  case-insensitive prefix) and CLDescriptionQualityStep (bad-pattern
  blacklist, CC-prefix stripping, rationale-signal detection,
  override env).
- [x] `pytest tests/test_pr_validate_*` — all 79 pr_validate tests
  pass (no regressions in existing test_pr_validate_deepseek.py,
  test_pr_validate_runner.py, test_pr_validate_test_plan_check.py).
- [x] `ruff check` + `ruff format --check` clean across all
  modified files.

Closes the convergence-cost feedback loop that #467 surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pr_validate): tolerate indented rationale + harden override env check

Two fixes from DeepSeek round 1 of PR #474's own self-validation:

1. Rationale-signal regexes were anchored at column 0 only — an
   indented `Why:` or `**Why:**` inside a bullet list / block quote
   (e.g. `  - Why: foo`) was not detected. Add an optional leading
   `[\s>*+\-]*` prefix so common markdown nesting still satisfies
   the gate.

2. The override check used `os.environ.get(_OVERRIDE_ENV)` which
   treats `PR_VALIDATE_SKIP_DESC=0` as truthy (the string "0" is
   truthy in Python). Switch to the repo-wide `env_truthy` helper so
   only `1`/`true`/`yes`/`on` actually skip the gate.

6 new tests pin both fixes (4 indented-rationale shapes + 2
=0 / =false override cases). Skipped DeepSeek's third finding
(strip-then-re-prefix in `_split_findings_by_tier`) — intentional
design choice so the helper returns clean strings usable in
non-scorecard contexts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pr_validate): handle conventional-commit breaking marker + pin 3-word threshold

Round 2 of PR #474's self-validation surfaced 1 mis-classified
[BLOCKING] + 4 NITs. Triage:

* **Took NIT-2** — `_CC_PREFIX` regex didn't strip the `!` in
  conventional-commits breaking-change form (`feat!:`, `feat(api)!:`).
  Such titles passed through unstripped, so the bad-pattern blacklist
  silently never applied to breaking-change PRs. One-char regex fix
  + 3-case test.

* **Overrode the [BLOCKING] threshold finding** — DeepSeek flagged
  the 3-word minimum as overly strict (e.g. blocks "fix: memory
  leak" = 2 words after strip). This is a threshold/policy choice,
  not a correctness bug, so [NIT]-class at most. Repo's actual title
  corpus is 5-10 words; Google's "informative for future grep"
  principle wants specificity ("memory leak" doesn't tell future
  searchers where the leak was — "memory leak on websocket close"
  does). Added a parametrized test pinning the policy so a future
  relaxation has to be intentional, not accidental.

* **Skipped NIT-3** (strip-then-re-prefix) — intentional design
  already addressed in round 1; defensible.

* **Skipped NIT-5** (`fix tests` blacklist) — authors can use
  `fix(tests): explain what specifically broke` form which strips
  to a descriptive title.

This is the convergence pattern the new tiering system was designed
for: round 1 produced 1 real bug + 2 NITs (fixed 1), round 2
produced 0 real bugs + 1 mis-classified NIT + 4 actual NITs (took
1, pinned 1, skipped 3). Without tiering both rounds would have
"failed" on subjective findings; with tiering we converge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(pr_validate): pin _extract_findings → _split_findings_by_tier pipeline

DeepSeek round 3 surfaced a real integration gap: the tier splitter
and the numbered-list extractor are tested in isolation, but the
contract between them (extractor strips leading "1. ", splitter sees
"[BLOCKING]" at start) was never asserted end-to-end. If someone
later refactors `_extract_findings` to preserve the leading list
number, every finding would default to BLOCKING and tiering would
silently break.

Added a single test that feeds a realistic 4-finding model review
(mixed 2 BLOCKING + 2 NIT) through both helpers and asserts the
split is 2/2, not 4/0. Test fails loudly with a diagnostic if the
contract regresses.

This was DeepSeek's only legitimate concern across both round-3
findings (the other was the same point about test coverage —
addressed by this same test). Convergence achieved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(pr_validate): dedupe redundant bad-title patterns

Round 4 of self-validation surfaced one trivial NIT: `^update\.?$`
and `^updates?\.?$` are redundant — the `s?` form covers both.
Consolidate into the single inclusive pattern with an inline
comment so the intent is grep-able.

Round 4 verdict: PASS (1 nit, 0 blocking). Tiering converged in
4 rounds: 1→1→0 blocking findings across rounds 1/2/3/4. Without
tiering the same 4 rounds would have been 5/5/2/1 "failures" by
the old binary verdict — exactly the spiral PR #467 hit. Mission
accomplished.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant