Conversation
Four-case periodic-tier eval that captures the verbatim AskUserQuestion text /plan-ceo-review and /plan-eng-review produce, then asserts the format rule is honored: RECOMMENDATION always, Completeness: N/10 only on coverage-differentiated options, and an explicit "options differ in kind" note on kind-differentiated options.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Classified periodic because behavior depends on Opus non-determinism — gate-tier would flake and block merges.

The test harness instructs the agent to write its would-be AskUserQuestion text to $OUT_FILE rather than invoke a real tool (MCP AskUserQuestion isn't wired in the test subprocess). Regex predicates then validate the captured content.

Cost: ~$2 per full run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
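The capture-to-file pattern the commit describes can be sketched as below. This is an illustrative reconstruction, not the repo's actual harness: `OUT_FILE`, `captureInstruction`, `checkFormat`, and `checkCapture` are hypothetical names, and the regexes only approximate the eval's real predicates.

```typescript
import { readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const OUT_FILE = join(tmpdir(), "ask-user-capture.txt");

// Prompt suffix appended to each eval case in place of a real tool wire-up.
const captureInstruction =
  "Do not call any tool. Instead, write the exact text you would have " +
  `passed to AskUserQuestion to the file ${OUT_FILE}.`;

// Regex predicates over the captured content; returns a list of failures.
function checkFormat(text: string, kindDifferentiated: boolean): string[] {
  const failures: string[] = [];
  if (!/RECOMMENDATION/.test(text)) failures.push("missing RECOMMENDATION");
  const hasScore = /Completeness:\s*\d+\/10/.test(text);
  if (kindDifferentiated) {
    if (hasScore) failures.push("fabricated score on kind question");
    if (!/options differ in kind/i.test(text)) failures.push("missing kind note");
  } else if (!hasScore) {
    failures.push("missing Completeness score on coverage question");
  }
  return failures;
}

// After the subprocess exits, validate whatever the agent wrote.
function checkCapture(kindDifferentiated: boolean): string[] {
  return checkFormat(readFileSync(OUT_FILE, "utf8"), kindDifferentiated);
}
```

Because the agent writes plain text, the assertions stay model-agnostic: the same predicates can validate any host's captured output.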
…stion type
Opus 4.7 users reported /plan-ceo-review and /plan-eng-review stopped
emitting the RECOMMENDATION line and per-option Completeness: X/10
scores. E2E capture showed the real failure mode: on kind-differentiated
questions (mode selection, architectural A-vs-B, cherry-pick), Opus 4.7
either fabricated filler scores (10/10 on every option — conveys nothing)
or dropped the format entirely when the metric didn't fit.
Fix is at two layers:
1. scripts/resolvers/preamble/generate-ask-user-format.ts splits the old
run-on step 3 into:
- Step 3 "Recommend (ALWAYS)": RECOMMENDATION is required on every
question, coverage- or kind-differentiated.
- Step 4 "Score completeness (when meaningful)": emit Completeness: N/10
only when options differ in coverage. When options differ in kind,
skip the score and include a one-line explanatory note. Do not
fabricate scores.
2. scripts/resolvers/preamble/generate-completeness-section.ts updates
the Completeness Principle tail to match. Without this, the preamble
contained two rules (one conditional, one unconditional) and the
model hedged.
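The conditional rule in steps 3 and 4 above can be sketched as a small renderer. This is a hypothetical illustration (the real preamble is generated prose, not code); `Option`, `renderQuestionTail`, and the `differIn` discriminator are invented names, though the note text is the one the fix specifies.

```typescript
type Option = { label: string; coverageScore?: number };

// Always emit RECOMMENDATION; score completeness only when options
// differ in coverage, never when they differ in kind.
function renderQuestionTail(
  recommended: string,
  options: Option[],
  differIn: "coverage" | "kind",
): string {
  const lines: string[] = [`RECOMMENDATION: Choose ${recommended}`];
  if (differIn === "coverage") {
    for (const o of options) {
      lines.push(`${o.label}: Completeness: ${o.coverageScore ?? 0}/10`);
    }
  } else {
    // Kind-differentiated: skip the score rather than fabricate one.
    lines.push("Note: options differ in kind, not coverage — no completeness score.");
  }
  return lines.join("\n");
}
```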
Template anchors reinforce the distinction where agent judgment is most
likely to drift:
- plan-ceo-review Section 0C-bis (approach menu) gets the
coverage-differentiated anchor.
- plan-ceo-review Section 0F (mode selection) gets the kind-differentiated
anchor.
- plan-eng-review CRITICAL RULE section gets the coverage-vs-kind rule
for every per-issue AskUserQuestion raised during the review.
Regenerated SKILL.md for all T2 skills + golden fixtures refreshed. Every
skill using the T2 preamble now has the same conditional scoring rule.
Verified via new periodic-tier eval (test/skill-e2e-plan-format.test.ts):
all 4 cases fail on prior behavior, all 4 pass with this fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
E2E Evals: ✅ PASS | 68/68 tests passed | $8.74 total cost | 12 parallel runners
12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite
Four-case periodic-tier eval mirrors test/skill-e2e-plan-format.test.ts but drives the plan review skills via codex exec instead of claude -p.

Context: Codex under the gpt.md "No preamble / Prefer doing over listing" overlay tends to skip the Simplify/ELI10 paragraph and the RECOMMENDATION line on AskUserQuestion calls. Users have to manually re-prompt "ELI10 and don't forget to recommend" almost every time. This test pins the behavior so regressions surface.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Assertions on captured AskUserQuestion text:
- RECOMMENDATION: Choose present (all cases)
- Completeness: N/10 present on coverage, absent on kind
- "options differ in kind" note present on kind
- ELI10 length floor (>400 chars) — catches bare options-only output

Cost: ~$2-4 per full run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
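The ELI10 length-floor assertion above might look roughly like this sketch. The function name, the line-prefix filter, and the exact 400-character threshold applied to prose (rather than the whole capture) are assumptions, not the eval's actual code.

```typescript
// A bare options-only reply is short; treat any capture with under
// ~400 characters of explanatory prose as a skipped ELI10.
function hasEli10Explanation(captured: string): boolean {
  const prose = captured
    .split("\n")
    // Drop structured lines so only explanatory prose is measured.
    .filter((l) => !/^(RECOMMENDATION|Completeness:|Note:|- )/.test(l.trim()))
    .join("\n");
  return prose.length > 400;
}
```

A length floor is a blunt instrument, but it reliably catches the failure mode described here: a reply that is nothing but an options list.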
Follow-up to v1.6.2.0. Codex (GPT-5.4) under the gpt.md overlay treated "No preamble / Prefer doing over listing" as license to skip the Simplify paragraph and the RECOMMENDATION line on AskUserQuestion calls. Users had to manually re-prompt "ELI10 and don't forget to recommend" almost every time.

Two layers:
1. model-overlays/gpt.md — adds an explicit "AskUserQuestion is NOT preamble" carve-out. The "No preamble" rule applies to direct answers; AskUserQuestion content must emit the full format (Re-ground, Simplify/ELI10, Recommend, Options). Tells the model: if you find yourself about to skip any of these, back up and emit them — the user will ask anyway, so do it the first time.
2. scripts/resolvers/preamble/generate-ask-user-format.ts — step 2 renamed to "Simplify (ELI10, ALWAYS)" with explicit "not optional verbosity, not preamble" framing. Step 3 "Recommend (ALWAYS)" hardened: "Never omit, never collapse into the options list."

All T2 skills regenerated across all hosts. Golden fixtures refreshed (claude-ship, codex-ship, factory-ship). Updated the ELI10 assertion in test/gen-skill-docs.test.ts to match the new wording.

Codex compliance to be verified empirically via test/codex-e2e-plan-format.test.ts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two test infrastructure bugs in the initial Codex eval landed in the prior commit:
1. sandbox: 'read-only' (the default) blocked Codex from writing $OUT_FILE. The test reported "STATUS: BLOCKED" and exited 0 without a capture file. Fixed: sandbox: 'workspace-write' for all 4 cases, allowing writes inside the tempdir.
2. recordCodexResult called a non-existent evalCollector.record() API (I invented it). The real surface is addTest() with a different field schema. Aligned with the test/codex-e2e.test.ts pattern.

With both fixed, the eval now actually measures Codex AskUserQuestion format compliance. All 4 cases pass on v1.6.2.0 with the gpt.md carve-out: RECOMMENDATION always, Completeness: N/10 only on coverage, "options differ in kind" note on kind, ELI10 explanation present.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
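The sandbox fix can be illustrated by shelling out to the Codex CLI rather than through the repo's harness API (which isn't shown here). This sketch assumes the CLI's `exec` subcommand and `--sandbox` flag; `codexArgs` and `runCodexCase` are invented helper names.

```typescript
import { spawnSync } from "node:child_process";

// Build the argv separately so the sandbox choice is testable in isolation.
// Under the default read-only sandbox the agent cannot write $OUT_FILE, so
// each eval case must opt into workspace-write.
function codexArgs(prompt: string): string[] {
  return ["exec", "--sandbox", "workspace-write", prompt];
}

function runCodexCase(prompt: string, cwd: string) {
  return spawnSync("codex", codexArgs(prompt), { cwd, encoding: "utf8" });
}
```

Running with cwd set to the tempdir keeps the write inside the sandbox's writable workspace.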
Expands the scope to include the Codex ELI10 + RECOMMENDATION carve-out landed after v1.6.2.0's Claude-verified fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
garrytan added a commit that referenced this pull request on Apr 23, 2026
….6.4.0

Main shipped v1.6.3.0 (Codex ELI10 + RECOMMENDATION fix, #1149) and also took the v1.6.2.0 version slot (plan-reviews RECOMMENDATION + Completeness split) while this branch was at 1.6.2.0 without a CHANGELOG entry. Version-number collision resolved per CLAUDE.md: branch bumps above main's latest, accepts main's two new CHANGELOG entries.

- VERSION: 1.6.4.0 (above main's 1.6.3.0).
- package.json: synced to 1.6.4.0.
- CHANGELOG: main's v1.6.3.0 + v1.6.2.0 entries accepted, placed above our v1.5.2.0 entry in reverse-chronological order.
- Auto-merged: many SKILL.md regenerations from main's preamble changes. No real conflicts in security source files.
- Security test suite: 87 pass, 0 fail post-merge (security.test.ts + content-security.test.ts).
Summary
Two-part fix for AskUserQuestion format regressions in /plan-ceo-review and /plan-eng-review, measured on both Claude Opus 4.7 and Codex (GPT-5.4).

v1.6.2.0 — Claude regression. A user on Opus 4.7 reported /plan-ceo-review and /plan-eng-review stopped showing the RECOMMENDATION: Choose X line and the Completeness: N/10 per-option score. Investigation showed the real failure mode: on kind-differentiated questions (mode selection, architectural A-vs-B, cherry-pick Add/Defer/Skip), Opus 4.7 was fabricating filler scores (10/10 on every option, conveys nothing) or dropping the format when the metric didn't fit. The fix splits Completeness: N/10 application by question type: coverage-differentiated options get scores; kind-differentiated options get "Note: options differ in kind, not coverage — no completeness score." instead.

v1.6.3.0 — Codex follow-up. A user reported Codex (GPT-5.4) was failing the same pattern 10/10 times — skipping the ELI10 explanation and the RECOMMENDATION line on AskUserQuestion calls, forcing manual "ELI10 and don't forget to recommend" re-prompts every time. Root cause: the gpt.md model overlay's "No preamble / Prefer doing over listing" rule was training Codex to skip the exact prose the user needs for decision-making. The fix adds an "AskUserQuestion is NOT preamble" carve-out to gpt.md and hardens step 2 of the AskUserQuestion Format rule ("Simplify (ELI10, ALWAYS)" with explicit "not optional verbosity" framing).

Test Coverage
Two new periodic-tier eval files, 4 cases each, pinned to the model family under test:

- Claude — test/skill-e2e-plan-format.test.ts (claude-opus-4-7): 10/10 on all 4 modes, **bolded** Completeness: 5/7/10, 9/9/5 on kind question
- Codex — test/codex-e2e-plan-format.test.ts (codex-cli via codex exec): Completeness: 5/7/10

Eval pass record
skill-e2e-plan.test.ts

Pre-Landing Review
Three plan-phase reviews completed:
- Completeness: X/10 on kind-differentiated questions.

Plan Completion
All phases shipped:
- gpt.md carve-out, 4 new Codex eval cases, 4/4 pass.
- Phase 4 (literal in-template scaffolding fallback) not needed.
Verification Results
- bun test — 448+ passing, 0 failing after golden fixture refresh.
- gen-skill-docs --host all — clean across all hosts (claude, codex, factory, gbrain, gpt-5.4, hermes, kiro, opencode, openclaw, slate, cursor).
- skill-e2e-plan.test.ts
- codex exec

Test plan
🤖 Generated with Claude Code