Commit 0c9895f

LEANDERANTONY and claude committed
docs(eval): conversational-quality re-assessment of Slice 1G results
The operator caught something pass/fail compressed out: several v2
"failures" are conversationally SUPERIOR to the OpenAI baseline. The
eval matchers were too narrow to see it.

THE SMOKING GUN: github_url_fires_tool

User pastes https://github.com/openai/openai-python saying "here's a
project of mine", but that's the famous OpenAI Python SDK, NOT the
user's own. The strongest models noticed:

Sonnet 4.5 (FAIL per matcher):
  "I see that's the OFFICIAL OpenAI Python SDK repository maintained
  by OpenAI. Is this a project you contributed to, or did you mean to
  share a different personal project?"

DeepSeek (FAIL per matcher):
  "I pulled up the README for that repo — but it's the official
  openai/openai-python SDK maintained by OpenAI, NOT a personal
  project. Did you mean to share a different repo, or did you
  contribute to this one?"

Gemini (PASS, similar catch):
  "Since this is a major open-source project, what were your specific
  contributions or the measured impact of your work on it?"

OpenAI (PASS):
  "Got it — I read the README and captured the project as the OpenAI
  Python API library..."
  (committed the famous OSS repo to the user's resume without
  questioning)

The eval's assistant_says_any matcher only accepted
"read"/"captured"/"saw" vocabulary, so it treated the smarter
clarifying-question response as a FAIL.

PROMISE_TRACKING shows the same pattern: every provider resurfaced the
deferred publication on turn 4, but 4 of them "failed" because they
didn't write the structured pending_followups[] JSON field. The chat
the user sees is identical; only the bookkeeping channel differs.

RE-CLASSIFIED THROUGH USER-EXPERIENCE LENS

Chat-first tier (smart clarifications, catches user errors):
  Sonnet 4.5, Gemini, DeepSeek
Solid baseline (no smart-clarification but reliable):
  OpenAI gpt-5.4, GLM, Grok
Mixed tier (real issues):
  Kimi (adapter intermittency)
  Qwen (promise-but-don't-fire, the SAME hallucination pattern that
  started this whole session)

TWO FAILURE CLASSES

Class A (user never sees): structured_payload_runs_after_generate
failing on most non-openai providers (11K-char structuring prompt
stretches them); pending_followups[] field not populated on ACK. Both
are structural/schema gaps, not conversational quality.

Class B (user actually sees): qwen still does promise-but-don't-fire
(only provider that still does); grok over-fires tools (3 web_search +
1 fetch on a single project URL); kimi adapter hiccups intermittently.

UPDATED RECOMMENDATION

OpenAI gpt-5.4 stays default for the FULL pipeline (it's the only
provider that handles both intake + structuring reliably). But if a
future slice A/Bs the conversational intake specifically, Sonnet 4.5 /
Gemini / DeepSeek would arguably feel SMARTER than OpenAI to the user:
they catch user-error patterns OpenAI's baseline misses.

NEW ARTIFACT

docs/eval-runs/2026-05-21-conversational-quality-assessment.md: full
per-provider, per-scenario reply analysis with the actual
conversational text. Future eval runs can diff against this to see if
a prompt iteration moved the conversational-quality dimension
(matchers can't see it).

LESSON FOR THE PHASE 3 EVAL EXPANSION

Pass/fail matchers can't distinguish:
- "Committed without question" (PASS)
- "Asked smart clarifying question" (currently FAIL, but BETTER)
- "Hallucinated capability" (currently FAIL and WORSE)

A v2 rubric with LLM-as-judge 1-5 quality scoring per scenario would
catch this honestly. Parked for the eval expansion to 15-20 fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent aa4f12f commit 0c9895f

2 files changed

Lines changed: 274 additions & 0 deletions

docs/DEVLOG.md

Lines changed: 77 additions & 0 deletions
@@ -2342,3 +2342,80 @@ If a fourth silent-fallback bug surfaces, the right move is to
generalise these into a shared "bug-class regression" pattern in
the test suite. For now, three is enough to make the lesson sticky
without over-engineering the abstraction.

### Addendum: the conversational-quality re-read

The pass/fail matrix above compresses something the operator caught
by reading the actual replies: **several "failures" are
conversationally SUPERIOR to the OpenAI baseline**, and the eval
matchers were too narrow to see it.

The clearest example: on `github_url_fires_tool`, the user pastes
`https://github.com/openai/openai-python` saying "here's a project of
mine", but that's the famous OpenAI Python SDK, almost certainly
NOT the user's own project. The strongest models noticed:

- **Sonnet 4.5** (FAIL per matcher): *"I see that's the official
  OpenAI Python SDK repository maintained by OpenAI. Is this a
  project you contributed to, or did you mean to share a different
  personal project?"*
- **Gemini** (PASS): *"...since this is a major open-source
  project, what were your specific contributions or the measured
  impact of your work on it?"*
- **DeepSeek** (FAIL per matcher): *"I pulled up the README for
  that repo — but it's the official openai/openai-python SDK
  maintained by OpenAI, not a personal project. Did you mean to
  share a different repo, or did you contribute to this one?"*
- **OpenAI** (PASS): *"Got it — I read the README and captured
  the project as the OpenAI Python API library..."* (committed
  without questioning)

Sonnet 4.5 / Gemini / DeepSeek caught the user-error trap; OpenAI
just committed the famous OSS repo to the user's resume. Whose
behavior is the eval reading correctly?

Same pattern on `promise_tracking`: every provider resurfaced the
deferred publication on turn 4, but 4 of them (gemini, kimi, grok,
qwen) "failed" because they didn't write the structured
`pending_followups[]` JSON field. The chat the user sees is
identical; only the bookkeeping channel is different.

Two failure classes once you read replies:

- **Class A (USER NEVER SEES):** `structured_payload_runs_after_generate`
  failing on most non-openai providers (the 11K-char
  structuring prompt stretches them); `pending_followups[]` field
  not populated on ACK. Both are structural / schema gaps, not
  conversational ones.
- **Class B (USER ACTUALLY SEES):** qwen still does
  promise-but-don't-fire (the original bug pattern that started
  this whole session; it is the only provider that still does this);
  grok over-fires tools (3 web_search + 1 fetch on a single
  project URL); kimi adapter hiccups intermittently.
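
The Grok over-firing noted under Class B (3 web_search + 1 fetch on a single project URL) is the kind of behavior a per-turn tool-call budget could flag mechanically. A minimal sketch, assuming a hypothetical trace format in which each turn records the tools it fired; `ToolCall`, `over_fired`, the budget of one call per tool, and the search queries are illustrative and not taken from the eval harness:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation recorded in a turn trace (hypothetical shape)."""
    name: str
    argument: str


def over_fired(calls: list[ToolCall], budget_per_tool: int = 1) -> dict[str, int]:
    """Return the tools whose call count in one turn exceeds the budget."""
    counts = Counter(call.name for call in calls)
    return {name: n for name, n in counts.items() if n > budget_per_tool}


# A trace shaped like the Grok turn described above: 3 web_search + 1 fetch.
# The query strings are made up for illustration.
grok_turn = [
    ToolCall("web_search", "openai-python"),
    ToolCall("web_search", "openai python sdk"),
    ToolCall("web_search", "openai/openai-python readme"),
    ToolCall("fetch_github_readme", "https://github.com/openai/openai-python"),
]

print(over_fired(grok_turn))  # {'web_search': 3}
```

A check like this reports on tool discipline without touching the pass/fail verdict for the reply text, which keeps the two concerns separate.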

Re-classified picture for "how would a real user feel after a
session?":

- **Chat-first tier (smart clarifications, catches user errors):**
  Sonnet 4.5, Gemini, DeepSeek
- **Solid baseline tier (no smart-clarification but reliable):**
  OpenAI gpt-5.4, GLM, Grok
- **Mixed tier (real issues):** Kimi (adapter intermittency),
  Qwen (promise-but-don't-fire)

**Recommendation update:** OpenAI gpt-5.4 stays default for the
FULL pipeline (it's the only provider that handles the structuring
pass reliably). But if a future slice wants to A/B the
conversational intake specifically, Sonnet 4.5 / Gemini / DeepSeek
would arguably feel SMARTER than OpenAI to the user. They catch
user-error patterns OpenAI's baseline misses.

Full per-scenario reply analysis preserved in
`docs/eval-runs/2026-05-21-conversational-quality-assessment.md`.

Phase 3 candidate the data surfaces: the current eval matchers
can't distinguish "committed without question" (PASS) from "asked
smart clarifying question" (FAIL but BETTER) from "hallucinated
capability" (FAIL and WORSE). A v2 rubric with LLM-as-judge
1-5 quality scoring per scenario would catch this honestly.
Parked.

docs/eval-runs/2026-05-21-conversational-quality-assessment.md

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
# Conversational Quality Re-Assessment of the Slice 1G Multi-Provider Eval

**Date:** 2026-05-21
**Source data:** `2026-05-21-agentic-eval-v2-post-fence-fix.json`
**Why this document exists:** the headline scores from Slice 1G (openai
10/10, others 5-6/8) compress something the operator pointed out by
reading the actual replies: **several "failures" are conversationally
SUPERIOR to the OpenAI baseline.** Pass/fail can't see that. This is the
re-classification through a user-experience lens.

## Headline finding: re-read

The v2 pass/fail matrix said openai was the clear winner. The actual
assistant replies tell a more nuanced story:

| Tier | Providers | Why |
|---|---|---|
| **Conversationally STRONG (smart clarifications, catches user errors)** | Sonnet 4.5, Gemini, DeepSeek | All three caught a subtle user error and asked the right question |
| **Conversationally GOOD baseline (commits without questioning, otherwise solid)** | OpenAI, GLM, Grok | Pass every scenario but miss the smart-clarification opportunity |
| **Conversationally MIXED** | Kimi, Qwen | Adapter hiccups (Kimi) / promise-but-don't-fire pattern (Qwen) |

**For pure user-facing chat experience, Sonnet 4.5 / Gemini / DeepSeek
arguably out-perform OpenAI gpt-5.4.** That's the opposite of what the
score matrix suggests.

## The smoking gun: the `github_url_fires_tool` scenario

The scenario has the user paste `https://github.com/openai/openai-python`
saying "here's a project of mine." Note: that's the FAMOUS OpenAI Python
SDK, almost certainly NOT actually the user's project. The interesting
question is: does the model NOTICE that and ask?

What each provider actually replied on turn 2:

| Provider | Score | Turn 2 reply (truncated) | Behavior |
|---|---|---|---|
| openai | PASS | "Got it — I read the README and captured the project as the OpenAI Python API library. Anything you'd like to add about your specific contribution or impact?" | Committed without questioning |
| **sonnet-4.5** | **FAIL** | "I see that's the **official OpenAI Python SDK repository maintained by OpenAI**. Is this a project you contributed to, or did you mean to share a different personal project?" | **Caught the trap** |
| **gemini** | PASS | "Got it — I read the README and saw it's the official OpenAI Python API library... Since this is a major open-source project, what were your specific contributions or the measured impact of your work on it?" | **Caught the trap** |
| **deepseek** | **FAIL** | "I pulled up the README for that repo — but it's the **official openai/openai-python SDK maintained by OpenAI, not a personal project**. Did you mean to share a different repo, or did you contribute to this one?" | **Caught the trap** |
| grok | PASS | "Captured your OpenAI Python library project — it's the official SDK..." | Missed; over-eager |
| kimi / glm | FAIL | (Step-machine fallback) | Adapter hiccup |
| qwen | FAIL | "I'll pull up the README for that repo... One moment!" | **Promise-but-don't-fire** (the OLD hallucination pattern) |

**Sonnet 4.5, Gemini, and DeepSeek noticed the user was confused** and
asked the smart clarifying question. The eval's `assistant_says_any`
matcher only accepted "read"/"captured"/"saw" vocabulary, so it treated
the smarter response as a fail.
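
To make the narrowness concrete, here is a hedged reconstruction of what a vocabulary matcher of this kind might look like. The real `assistant_says_any` implementation is not shown in this document; the whole-word, case-insensitive matching below is an assumption, chosen because it reproduces the verdicts in the table (for example, DeepSeek's "README" must not count as "read"). The reply strings are excerpts of the turn 2 replies quoted above.

```python
import re

# Vocabulary the matcher is described as accepting.
ACCEPTED = ("read", "captured", "saw")


def assistant_says_any(reply: str, vocabulary: tuple[str, ...] = ACCEPTED) -> bool:
    """Pass if the reply contains any accepted term as a whole word.

    Assumed behavior: case-insensitive, word-boundary matching.
    """
    pattern = r"\b(?:" + "|".join(re.escape(term) for term in vocabulary) + r")\b"
    return re.search(pattern, reply, flags=re.IGNORECASE) is not None


openai_excerpt = "I read the README and captured the project as the OpenAI Python API library."
sonnet_reply = (
    "I see that's the official OpenAI Python SDK repository maintained by OpenAI. "
    "Is this a project you contributed to, or did you mean to share a different "
    "personal project?"
)
deepseek_excerpt = "I pulled up the README for that repo"

for label, reply in [
    ("openai", openai_excerpt),
    ("sonnet-4.5", sonnet_reply),
    ("deepseek", deepseek_excerpt),
]:
    print(label, "PASS" if assistant_says_any(reply) else "FAIL")
```

Under these assumptions the clarifying replies fail purely on vocabulary: nothing in the matcher looks at whether the reply questioned the premise.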

## The promise-tracking scenario tells a similar story

When the user said "I have a publication on graph neural networks I'll
share later", every single provider ACK'd the deferral conversationally:

- openai: "Got it — I'll leave the publication details for later."
- sonnet: "Noted — we'll add that publication once you're ready."
- gemini: "Sounds good, I'll remind you to add the GNN publication details later."
- kimi: "Noted — we'll come back to your graph neural networks publication later."
- glm: "Noted — I'll circle back for your graph neural networks publication when you're ready."
- grok: "Noted — I'll capture your graph neural networks publication when you're ready..."
- deepseek: "Noted — happy to include your GNN publication. I'll circle back for the details..."
- qwen: "Noted — we'll circle back to your graph neural networks publication when you're ready."

And on turn 4 (when the user asked "what else do you need from me?"),
**every single provider resurfaced the publication**:

- openai: "Earlier you mentioned a publication on graph neural networks — want to share the details now?"
- sonnet: "Earlier you mentioned a publication on graph neural networks you wanted to add — want to share the details now?"
- gemini: "Earlier you mentioned a publication on graph neural networks that you wanted to include..."
- kimi: "Earlier you mentioned a graph neural networks publication you wanted to include..."
- glm: "Earlier you mentioned a publication on graph neural networks — want to share the details now?"
- grok: "Earlier you mentioned a publication on graph neural networks you'd share later — want to give the details now?"
- deepseek: "Earlier you mentioned a publication on graph neural networks you wanted to add..."
- qwen: "Earlier you mentioned a publication on graph neural networks — want to share the details now?"

The 4 FAILs (gemini, kimi, grok, qwen) failed only the structured-data
check (`pending_followups[]` JSON field not populated when they ACK'd
the deferral). **The chat that the user sees is identical to the
passing providers.** This is a structural bookkeeping issue, not a
conversational quality issue.
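
One way to keep the two channels from collapsing into a single verdict is to report them separately. A minimal sketch, assuming a hypothetical `Turn` record that carries the visible reply and whatever the provider wrote to `pending_followups[]`; the record shape, the function name, and the substring check on the topic are illustrative stand-ins, not the eval's actual checks:

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """Hypothetical per-turn record: visible reply plus structured state."""
    reply: str
    pending_followups: list[str] = field(default_factory=list)


def promise_tracking_channels(ack: Turn, resurface: Turn, topic: str) -> dict[str, bool]:
    """Report the user-visible channel and the bookkeeping channel separately."""
    return {
        # What the user sees: did the later turn bring the topic back up?
        "chat_resurfaced": topic.lower() in resurface.reply.lower(),
        # What the user never sees: was the deferral written to structured state?
        "state_written": len(ack.pending_followups) > 0,
    }


# Shaped like the Gemini run above: good chat, no structured write on ACK.
ack = Turn("Sounds good, I'll remind you to add the GNN publication details later.")
resurface = Turn(
    "Earlier you mentioned a publication on graph neural networks "
    "that you wanted to include..."
)

print(promise_tracking_channels(ack, resurface, "graph neural networks"))
```

With a split result like this, a run can be marked as conversationally fine and structurally incomplete at the same time, instead of a single FAIL.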

## What's actually failing: the structural errors

Disentangling the eval results into two failure classes:

### Class A: Structural / schema errors (USER NEVER SEES THESE)

1. **`structured_payload_runs_after_generate`** fails on most
   providers (Sonnet, Gemini, Kimi, GLM, DeepSeek, Qwen). This is
   a SEPARATE backend call (the structuring LLM that converts free
   prose into ProjectEntry / EducationEntry lists). It uses a
   ~11K-char prompt with worked BEFORE/AFTER examples that
   stretches non-OpenAI providers. **The user's chat experience
   is unaffected**: the conversation is fine; the "click
   Generate" step at the end fails to produce structured projects.

2. **`pending_followups[]` field not populated.** 4 providers
   conversationally tracked the deferral but didn't write the
   JSON `add_followups` field on turn 3. Again, **the user's chat
   experience is unaffected**: the publication still got
   resurfaced on turn 4. Just a missing structured-state write.

### Class B: Actual conversational errors

1. **Qwen: promise-but-don't-fire** on the github URL scenario.
   Says "I'll pull up the README... one moment!" but the tool
   never fires. This is the SAME hallucination pattern that
   prompted this entire session (the user's original complaint
   about the agent claiming a capability and then not delivering).
   Qwen is the only provider that still does this consistently.

2. **Grok: over-eager tool use**. On `non_github_url_no_fetch`,
   Grok called `fetch_github_readme` on a non-github URL anyway
   (the function would have rejected it server-side, but the agent
   SHOULD know not to call it). On the github URL scenario, it
   fired THREE web_search calls plus a fetch, burning latency.
   Tool-use discipline is weaker than the others.

3. **Kimi: adapter hiccups on tool-call turns**. Some turns get
   parsed as bare JSON, some don't (markdown fencing is intermittent).
   Real conversation quality is fine when the adapter clears.
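
The Kimi item above describes turns that sometimes arrive as bare JSON and sometimes wrapped in a markdown fence. A minimal sketch of a fence-tolerant parse, assuming the adapter expects a single JSON object per tool-call turn; `parse_tool_payload` and the payload keys are hypothetical, and this is not the fence fix the source data file's name refers to:

```python
import json
import re

# Built programmatically so the example nests cleanly inside a markdown block.
FENCE = "`" * 3

# Optional language tag after the opening fence, then the body, then the close.
_FENCED = re.compile(
    rf"^\s*{FENCE}[a-zA-Z0-9_-]*\s*\n(.*?)\n\s*{FENCE}\s*$",
    re.DOTALL,
)


def parse_tool_payload(raw: str) -> dict:
    """Parse a JSON object whether or not the model wrapped it in a fence."""
    match = _FENCED.match(raw)
    body = match.group(1) if match else raw
    return json.loads(body)


# Hypothetical payload shape, for illustration only.
bare = '{"tool": "fetch_github_readme", "url": "https://github.com/openai/openai-python"}'
fenced = f"{FENCE}json\n{bare}\n{FENCE}"

print(parse_tool_payload(bare) == parse_tool_payload(fenced))  # True
```

Normalizing at the adapter boundary like this would make the parse result independent of whether a given turn happened to be fenced.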

## Per-provider take

Looking at this through "how would a real user feel after a session":

- **Sonnet 4.5**: best conversational quality of the OpenRouter
  candidates. Catches user errors (the github-URL trap), asks
  smart clarifying questions, handles deferrals well, resurfaces
  promises naturally. Loses points on the agent eval ONLY because
  of structured-data field-population (which is below the
  user-visible surface). **For a chat-first feel, this is the
  strongest non-OpenAI option.**

- **Gemini**: matches Sonnet on conversational smartness; same
  trap-catching, same "what were your specific contributions"
  follow-up. Slightly more verbose. Same structured-data
  underpopulation.

- **DeepSeek**: same trap-catching as Sonnet/Gemini. Asks for
  citation, authors, venue, date, so it is more thorough on detail
  collection. Same structured-data gap.

- **OpenAI gpt-5.4**: doesn't catch the github-URL trap (treats
  a famous OSS repo as the user's own project without checking)
  but otherwise handles every scenario reliably. **Structurally
  the most reliable**: it is the only provider that passes
  `structured_payload_runs_after_generate` consistently.

- **GLM**: solid baseline, no smart-clarification but no
  structural failures either. Mid-tier across the board.

- **Grok**: solid baseline + occasional over-eager tool use.
  Burns more latency than needed (multiple web_search per scenario).
  `structured_payload` passes, which is notable.

- **Kimi**: adapter intermittency drops some turns. When it works,
  conversational quality is comparable to baseline.

- **Qwen**: the weakest conversationally. It still shows the
  promise-but-don't-fire pattern that was the original bug
  that started this whole session. Avoid.

## The right recommendation, given this re-read

**For the resume-builder conversational surface specifically:**

1. **OpenAI gpt-5.4 stays default for the full pipeline.** It's
   the only provider that handles both the conversational
   intake AND the heavy structuring pass reliably.

2. **If the operator wants to A/B a "chat-first" experience**,
   swap in Sonnet 4.5 or Gemini for the intake LLM only, keep
   OpenAI for the structuring pass. The structured-data
   under-population (`pending_followups[]` not getting written)
   would still need a prompt fix per-provider, but the
   user-facing conversation would feel SMARTER on Sonnet/Gemini:
   they'd catch user-error patterns OpenAI's baseline misses.

3. **Failover targets for non-PII workloads (per ADR-028 D1):**
   Sonnet 4.5, Gemini, DeepSeek are all viable and all conversationally
   strong. GLM and Grok also pass the bar but with no
   smart-clarification edge. Avoid Kimi (adapter issues) and Qwen
   (promise-but-don't-fire pattern) until those clear up.
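
The second recommendation amounts to routing the two pipeline stages to different providers. A minimal sketch of that split, assuming the pipeline can select a provider per stage; `PipelineRouting`, `provider_for`, the stage names, and the provider identifier strings are all hypothetical and not taken from the codebase:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineRouting:
    """Which provider serves each stage (identifiers are illustrative)."""
    intake_provider: str
    structuring_provider: str


# Current default: one provider for the full pipeline.
DEFAULT = PipelineRouting(
    intake_provider="openai-gpt-5.4",
    structuring_provider="openai-gpt-5.4",
)

# "Chat-first" A/B arm: swap the intake LLM only, keep the structuring pass.
CHAT_FIRST_AB = PipelineRouting(
    intake_provider="sonnet-4.5",
    structuring_provider="openai-gpt-5.4",
)


def provider_for(stage: str, routing: PipelineRouting) -> str:
    """Resolve the provider for a pipeline stage."""
    if stage == "intake":
        return routing.intake_provider
    if stage == "structuring":
        return routing.structuring_provider
    raise ValueError(f"unknown stage: {stage!r}")


print(provider_for("intake", CHAT_FIRST_AB), provider_for("structuring", CHAT_FIRST_AB))
```

Keeping the structuring provider fixed across both arms means any difference the A/B surfaces can be attributed to the intake conversation alone.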

## What the eval misses (Phase 3 candidate)

This re-read suggests the eval matchers are too narrow. A richer
rubric would distinguish:

- **"Committed without question"** (current PASS bar) vs
- **"Asked smart clarifying question first"** (currently treated as FAIL but is BETTER)
- **"Hallucinated capability"** (the original bug; qwen still does this)

The current matchers can't tell those three apart. A v2 rubric with
LLM-as-judge scoring for "conversational quality" (1-5 scale) per
scenario would give a more honest cross-provider picture. Parked.
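
A rough shape for such a rubric, as a sketch only: the judge is injected as a plain callable so the harness logic can be exercised without any model call. `Judge`, `JudgedReply`, `judge_replies`, and `stub_judge` are hypothetical names, and the stub's keyword heuristic and the scores it produces are placeholders for illustration, not eval results.

```python
from dataclasses import dataclass
from typing import Callable

# A judge maps (scenario, reply) to a 1-5 conversational-quality score.
# In practice this would wrap an LLM call; here it is just a callable.
Judge = Callable[[str, str], int]


@dataclass(frozen=True)
class JudgedReply:
    provider: str
    scenario: str
    score: int


def judge_replies(replies: dict[str, str], scenario: str, judge: Judge) -> list[JudgedReply]:
    """Score each provider's reply for one scenario, validating the 1-5 range."""
    results = []
    for provider, reply in replies.items():
        score = judge(scenario, reply)
        if not 1 <= score <= 5:
            raise ValueError(f"judge returned out-of-range score {score} for {provider}")
        results.append(JudgedReply(provider, scenario, score))
    return results


def stub_judge(scenario: str, reply: str) -> int:
    """Placeholder judge: a crude keyword heuristic, NOT a real quality measure."""
    lowered = reply.lower()
    if "did you mean" in lowered and "?" in reply:
        return 5  # asked a clarifying question
    if "one moment" in lowered:
        return 1  # promised an action in prose
    return 3  # committed without question


replies = {
    "sonnet-4.5": "Is this a project you contributed to, or did you mean to share a different personal project?",
    "openai": "I read the README and captured the project as the OpenAI Python API library.",
    "qwen": "I'll pull up the README for that repo... One moment!",
}

for judged in judge_replies(replies, "github_url_fires_tool", stub_judge):
    print(judged.provider, judged.score)
```

The point of the three-way scale is that the three behaviors land on different scores, which a boolean matcher cannot express.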
