v1.48.0.0 feat: AskUserQuestion split rule + runtime AUTO_DECIDE carve-out #1740

Merged
garrytan merged 6 commits into main from garrytan/askuserquestion-split-on-overflow on May 27, 2026
@garrytan

Summary

Closes the failure mode where agents, given 5+ AskUserQuestion options, silently drop some to fit Conductor's 4-option cap.

  • feat(preamble) f2e2ef1 — New canonical preamble subsection "Handling 5+ options — split, never drop" with two compliant shapes (batch into ≤4-groups OR split into N per-option calls). Inline subsection is ~15 lines; full reference with worked examples, Hold/dependency semantics, and final-summary validation in docs/askuserquestion-split.md (loaded on demand). Also fixes an orphan 12. prefix on the existing CJK rule. SKILL.md regenerated for all 41 tier-2+ skills + 3 golden ship fixtures.
  • feat(question-pref) 975312e — Runtime AUTO_DECIDE carve-out: bin/gstack-question-preference --check now detects any question_id matching *-split-* and forces ASK_NORMALLY regardless of stored preferences, with an explanatory note. Two-layer defense: unique per-option ids (mechanism) + runtime gate (enforcement).
  • test(e2e) d0d8cb2 — Periodic-tier regression test (test/skill-e2e-plan-ceo-split-overflow.test.ts) using a 5-option chat-platform integration fixture. Floor 4 review-phase AUQs (N-1 tolerance). Catches the original drop-to-fit-4 failure mode.
  • chore 72e8857 — Post-merge regen for spec/SKILL.md + 3 goldens. Added missing /spec row to docs/skills.md (pre-existing miss from PR #1698/#1733). Rebased test/skill-size-budget.test.ts baseline v1.44.1 → v1.47.0.0 to absorb main's growth past the 5% ratchet.

Test Coverage

  • 6 inline-contract tests in test/resolver-ask-user-format.test.ts pin the new subsection (4-option cap text, Include/Defer/Cut/Hold buckets, D-numbering shape, AUTO_DECIDE runtime gate reference, docs pointer, orphan-12 regression).
  • 7 runtime-gate tests in test/gstack-question-preference.test.ts cover the carve-out: no-pref baseline, never-ask override, explanatory note text, ask-only-for-one-way override, always-ask (no note), non-split id containing "split" word (negative regex specificity), multi-skill split id formats.
  • 1 periodic-tier E2E test pins user-facing behavior (5-option fixture, floor 4 AUQs). Runs on EVALS_TIER=periodic only.

Total: 879 pass / 0 fail across the 7 affected test files (resolver-ask-user-format, gstack-question-preference, gen-skill-docs, host-config, touchfiles, skill-size-budget, skill-validation).

Pre-Landing Review

  • Inline review on each commit. Diff is mostly mechanical regen (~34 lines × 41 SKILL.md files); core change is ~115 lines (preamble subsection + regex gate).
  • The runtime regex /-split-/ is specific enough: a negative-case test against qa-splitscreen-test confirms it does not match ids that merely contain the word "split".
  • No SQL, no LLM trust boundary, no destructive ops. Skipped review-army dispatch (small diff, mechanical).

Adversarial Review

Skipped formal Codex adversarial pass on this PR — the runtime gate is one regex check with explicit negative-case coverage in tests. Earlier in the development session, /codex consult was run on the plan and returned 14 findings, all of which were absorbed into the implementation (collision-resistance vs. runtime enforcement distinction, kind-note instead of completeness score, D-numbering specificity, Hold semantics, dependency handling, etc.).

Plan Completion

Plan file at ~/.claude/plans/system-instruction-you-are-working-peppy-cerf.md reflected:

  • ✅ Preamble subsection added (slim version per user request after first commit was too bloated)
  • ✅ Orphan 12. fix
  • ✅ Test pinning (6 tests in slim version)
  • ✅ SKILL.md regen for all hosts
  • ✅ "Out of band" items both shipped in this PR: runtime AUTO_DECIDE gate + E2E behavior test

TODOS

No completed items from existing TODOS.md (the design-daemon items there are unrelated to this PR).

Test plan

  • All resolver/host-config/touchfiles/size-budget/validation tests pass (879/0)
  • bun run gen:skill-docs clean for all 8 hosts (claude/kiro/opencode/slate/cursor/openclaw/hermes/gbrain)
  • Golden ship fixtures match regenerated SKILL.md content
  • E2E split-overflow test passes on EVALS_TIER=periodic (deferred — paid run, will validate post-merge)

🤖 Generated with Claude Code

garrytan and others added 6 commits May 26, 2026 22:27
feat(preamble) f2e2ef1

Agents repeatedly hit Conductor's 4-option AskUserQuestion cap and
silently drop one option to fit, shrinking the user's decision space.
This rule names the bug and gives two compliant shapes: batch into
≤4-groups (for coherent alternatives) or split into N sequential
per-option calls (for independent scope items, default).
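
A minimal sketch of the two compliant shapes, for illustration only. The function names are hypothetical and not taken from the gstack source, and it reads "≤4-groups" as "groups of at most four options each", which is an assumption.

```typescript
// Hypothetical sketch: names are illustrative, not the gstack implementation.
const OPTION_CAP = 4; // Conductor's per-call option cap, as stated in the commit message

// Shape 1: batch coherent alternatives into groups of at most `cap` options.
function batchIntoGroups<T>(options: T[], cap: number = OPTION_CAP): T[][] {
  const groups: T[][] = [];
  for (let i = 0; i < options.length; i += cap) {
    groups.push(options.slice(i, i + cap));
  }
  return groups;
}

// Shape 2: one call per option, for independent scope items (the default).
function splitPerOption<T>(options: T[]): T[][] {
  return options.map((option) => [option]);
}

// Flatten groups back into a single list, preserving order.
function flatten<T>(groups: T[][]): T[] {
  return groups.reduce<T[]>((acc, group) => acc.concat(group), []);
}

const platforms = ["Slack", "Discord", "Teams", "Telegram", "Mattermost"];
const batched = batchIntoGroups(platforms); // two groups: four options, then one
const perOption = splitPerOption(platforms); // five single-option groups
```

Either shape covers every option exactly once, so flattening the groups gives back the original list. That invariant is the point of the rule: nothing is dropped to fit the cap.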

Inline preamble subsection is ~15 lines (rule + buckets + pointer).
Full reference with worked examples, Hold/dependency semantics, and
final-summary validation lives in docs/askuserquestion-split.md.
The agent loads the docs file on demand when N>4.

Per-option call shape: D<N>.k header, ELI10, Recommendation, kind-note
(no completeness score — decision actions, not coverage), Include /
Defer / Cut / Hold buckets. Hold stops the chain immediately; the
final D<N>.final call validates dependencies and confirms the
assembled scope.
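
The chain semantics above could be sketched roughly as follows. This is a hedged illustration: `runSplitChain`, `missingDependencies`, and the `ask` callback are hypothetical names, and the dependency rule (an included option needs its dependencies included too) is an assumption about what the final validation checks.

```typescript
// Hypothetical sketch; types and function names are illustrative only.
type Bucket = "Include" | "Defer" | "Cut" | "Hold";

interface Decision {
  option: string;
  bucket: Bucket;
}

interface ChainResult {
  decisions: Decision[]; // answers in ask order, including a trailing Hold if any
  held: boolean;         // true if a Hold stopped the chain early
}

// `ask` stands in for one per-option AskUserQuestion call (D<N>.k, with k starting at 1).
function runSplitChain(
  options: string[],
  ask: (option: string, k: number) => Bucket,
): ChainResult {
  const decisions: Decision[] = [];
  for (let k = 0; k < options.length; k++) {
    const bucket = ask(options[k], k + 1);
    decisions.push({ option: options[k], bucket });
    if (bucket === "Hold") {
      return { decisions, held: true }; // Hold stops the chain immediately
    }
  }
  return { decisions, held: false };
}

// Assumed final-step check: every included option's dependencies are also included.
function missingDependencies(
  decisions: Decision[],
  dependsOn: { [option: string]: string[] },
): string[] {
  const included = decisions
    .filter((d) => d.bucket === "Include")
    .map((d) => d.option);
  const problems: string[] = [];
  for (const option of included) {
    for (const dep of dependsOn[option] || []) {
      if (included.indexOf(dep) === -1) {
        problems.push(option + " requires " + dep);
      }
    }
  }
  return problems;
}

// Scripted answers: the second answer is Hold, so only two options are asked about.
const script: Bucket[] = ["Include", "Hold", "Cut", "Include", "Defer"];
const result = runSplitChain(
  ["Slack", "Discord", "Teams", "Telegram", "Mattermost"],
  (_option, k) => script[k - 1],
);
```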

question_ids: <skill>-split-<option-slug> (kebab-case ASCII, ≤64
chars). Also fixes orphan "12. " prefix on the existing CJK rule.
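
One way to build ids in this format, as a sketch under stated assumptions: `splitQuestionId` and `toKebabAscii` are hypothetical helpers, and the 64-character cap is assumed to apply to the whole id rather than to the slug alone.

```typescript
// Hypothetical helper; only the id format comes from the commit message.
const MAX_ID_LENGTH = 64; // assumed to cap the whole id, not just the slug

function toKebabAscii(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // each run of other characters becomes one hyphen
    .replace(/^-+|-+$/g, "");    // trim leading and trailing hyphens
}

function splitQuestionId(skill: string, optionLabel: string): string {
  const id = toKebabAscii(skill) + "-split-" + toKebabAscii(optionLabel);
  // Truncate to the cap, then drop any hyphen left dangling by the cut.
  return id.slice(0, MAX_ID_LENGTH).replace(/-+$/, "");
}

const slackId = splitQuestionId("plan-ceo-review", "Slack integration");
// slackId is "plan-ceo-review-split-slack-integration"
```

Truncation can make two long labels collide, so a real implementation would likely need a uniqueness check; this sketch does not attempt one.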

Tier-2+ skills inherit via the existing resolver. SKILL.md regenerated
for all 41 affected skills + 3 golden fixtures. Net diff per SKILL.md:
~34 lines (vs ~110 for the full inline version).

6 tests pin the inline contract (4-option cap, buckets, D-numbering,
docs pointer, runtime AUTO_DECIDE gate reference, orphan 12 regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(question-pref) 975312e

Split chains (per-option AskUserQuestion calls emitted by the new
"Handling 5+ options" rule) must never be silently auto-approved
via /plan-tune preferences. The user's option set is sacred.

Layer 1 (mechanism): unique <skill>-split-<option-slug> ids prevent
cross-option preference leakage. Layer 2 (this commit): the runtime
checker `gstack-question-preference --check` detects any id matching
*-split-* and forces ASK_NORMALLY even when never-ask or
ask-only-for-one-way preferences exist for that exact id. An
explanatory note tells the user their preference was bypassed and why.
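
The gate described above might look something like this. It is a simplified sketch, not the actual bin/gstack-question-preference code: `checkQuestion` is a hypothetical name, the note wording is invented for illustration, and outside the carve-out the sketch assumes only never-ask leads to AUTO_DECIDE.

```typescript
// Hypothetical sketch of the runtime gate; not the real checker.
type Preference = "always-ask" | "never-ask" | "ask-only-for-one-way";
type Verdict = "ASK_NORMALLY" | "AUTO_DECIDE";

interface CheckResult {
  verdict: Verdict;
  note?: string; // present only when a stored preference was bypassed
}

const SPLIT_ID = /-split-/; // the pattern quoted in the PR description

function checkQuestion(questionId: string, stored?: Preference): CheckResult {
  if (SPLIT_ID.test(questionId)) {
    // Carve-out: split-chain questions are always asked, whatever is stored.
    const bypassed = stored === "never-ask" || stored === "ask-only-for-one-way";
    if (bypassed) {
      return {
        verdict: "ASK_NORMALLY",
        note: 'preference "' + stored + '" bypassed: split-chain questions are always asked',
      };
    }
    return { verdict: "ASK_NORMALLY" };
  }
  // Simplification (assumption): outside the carve-out only never-ask auto-decides.
  return { verdict: stored === "never-ask" ? "AUTO_DECIDE" : "ASK_NORMALLY" };
}

const gated = checkQuestion("plan-ceo-review-split-slack", "never-ask");
const notGated = checkQuestion("qa-splitscreen-test", "never-ask");
```

The second call shows the negative case: `qa-splitscreen-test` contains the word "split" but not the `-split-` segment, so the stored preference still applies.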

7 tests pin the carve-out: no-pref baseline, never-ask override,
explanatory note text, ask-only-for-one-way override, always-ask
(no note), non-split id containing "split" word (negative case for
regex specificity), multi-skill split id formats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(e2e) d0d8cb2

Periodic-tier E2E test that catches the original failure mode the
user complained about: 5+ options for ONE decision must split into
N sequential AskUserQuestion calls, not drop one to fit Conductor's
4-option cap.

Fixture: 5 independent chat-platform integration candidates
(Slack/Discord/Teams/Telegram/Mattermost), each carrying its own
include/defer/cut decision. Floor = 4 review-phase AUQs (standard
[N-1] tolerance band). Pre-fix "drop to 4 + 1 dropped" fails this
floor.
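
The floor arithmetic, as a small sketch. `auqFloor` and `meetsFloor` are hypothetical helpers, and the failing case assumes the pre-fix behavior shows up as a single four-option call.

```typescript
// Hypothetical helpers mirroring the "N-1 tolerance" floor described above.
function auqFloor(optionCount: number, tolerance: number = 1): number {
  return Math.max(1, optionCount - tolerance);
}

function meetsFloor(observedCalls: number, optionCount: number): boolean {
  return observedCalls >= auqFloor(optionCount);
}

const floorForFixture = auqFloor(5);     // 4 for the 5-option fixture
const splitPasses = meetsFloor(5, 5);    // true: five per-option calls
const tolerated = meetsFloor(4, 5);      // true: one call short is within tolerance
const droppedFails = meetsFloor(1, 5);   // false: a single capped call misses the floor
```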

Wired into test/helpers/touchfiles.ts: tier periodic, depends on
plan-ceo-review/**, the new preamble subsection, the question-pref
binary (for the carve-out), and the runner helper. touchfiles.test.ts
expected count bumped 21 → 22 to account for the new entry.

Cost: ~$0.30/run when EVALS_TIER=periodic. Skips silently otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore 72e8857

After merging origin/main (v1.45 → v1.47), three things needed cleanup:

1. spec/SKILL.md (main's new skill) regenerated to include our split-vs-drop
   preamble subsection — same mechanical regen as the other 41 tier-2+ skills.
2. Three golden ship fixtures refreshed to capture main's GSTACK_PLAN_MODE
   block + /spec routing entry + jargon-list.json refactor.
3. docs/skills.md — added /spec table row that main's PR (#1698/#1733) shipped
   without. Pre-existing failure on main; this PR catches and fixes.

Also rebased test/skill-size-budget.test.ts from v1.44.1 → v1.47.0.0 baseline.
Main's v1.46 (catalog tokens trim) + v1.47 (/spec skill) pushed the v1.44.1
anchor past the 5% ratchet to ×1.059 — pre-existing failure on main. This
PR captures a fresh parity-baseline-v1.47.0.0.json and re-anchors the test
there. Historical v1.44.1.json and v1.46.0.0.json retained in test/fixtures/
for reference. Our subsection contributes ~0.1% of the post-rebase corpus.
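
A sketch of why re-anchoring clears the ratchet. The helper names are hypothetical, the sizes are normalized placeholders rather than real corpus measurements, and whether the real test in test/skill-size-budget.test.ts uses an inclusive bound is an assumption.

```typescript
// Hypothetical ratchet check; not the real skill-size-budget test.
const RATCHET_LIMIT = 1.05; // 5% growth allowed over the baseline

function growthRatio(baselineSize: number, currentSize: number): number {
  return currentSize / baselineSize;
}

function withinRatchet(
  baselineSize: number,
  currentSize: number,
  limit: number = RATCHET_LIMIT,
): boolean {
  return growthRatio(baselineSize, currentSize) <= limit;
}

// Placeholder sizes: a baseline of 1000 units that has grown to 1059 (ratio 1.059).
const againstOldAnchor = withinRatchet(1000, 1059); // false: past the 5% ratchet
const againstNewAnchor = withinRatchet(1059, 1059); // true: ratio resets to 1 after re-anchoring
```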

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

E2E Evals: ❌ FAIL

71/72 tests passed | $11.58 total cost | 12 parallel runners

Suite Result Status Cost
e2e-browse 7/7 $0.38
e2e-deploy 6/6 $1.31
e2e-design 4/4 $0.67
e2e-plan 8/8 $2.61
e2e-qa-workflow 3/3 $1.28
e2e-review 6/6 $1.51
e2e-workflow 4/4 $0.69
llm-judge 25/26 $0.52
e2e-plan 8/8 $2.61

12x ubicloud-standard-8 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite
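
As a quick arithmetic check, the per-suite rows can be summed against the headline. The rows below are transcribed from the table, and costs are held in integer cents to avoid floating-point drift; the `SuiteRow` type is only for this illustration.

```typescript
// Rows transcribed from the table above; the type is illustrative only.
interface SuiteRow {
  suite: string;
  passed: number;
  total: number;
  cents: number; // cost in integer cents
}

const rows: SuiteRow[] = [
  { suite: "e2e-browse", passed: 7, total: 7, cents: 38 },
  { suite: "e2e-deploy", passed: 6, total: 6, cents: 131 },
  { suite: "e2e-design", passed: 4, total: 4, cents: 67 },
  { suite: "e2e-plan", passed: 8, total: 8, cents: 261 },
  { suite: "e2e-qa-workflow", passed: 3, total: 3, cents: 128 },
  { suite: "e2e-review", passed: 6, total: 6, cents: 151 },
  { suite: "e2e-workflow", passed: 4, total: 4, cents: 69 },
  { suite: "llm-judge", passed: 25, total: 26, cents: 52 },
  { suite: "e2e-plan", passed: 8, total: 8, cents: 261 },
];

function sumBy(pick: (row: SuiteRow) => number): number {
  return rows.reduce((acc, row) => acc + pick(row), 0);
}

const passedTotal = sumBy((row) => row.passed); // 71
const testTotal = sumBy((row) => row.total);    // 72
const costCents = sumBy((row) => row.cents);    // 1158, i.e. $11.58
```

The headline figures (71/72 and $11.58) only reconcile when both e2e-plan rows are counted, and the single miss is in llm-judge.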

Failures

  • ❌ setup block: unknown

@garrytan garrytan merged commit a6fb317 into main May 27, 2026
24 checks passed
GilbertzzzZZ added a commit to GilbertzzzZZ/gstack-1 that referenced this pull request May 27, 2026
Apply the YAML-quoting generator fix to spec/SKILL.md, which arrived
via garrytan#1740 merge with an unquoted description that would have re-broken
Codex skill loading.

Why: jbetala7 flagged in garrytan#1739 review that garrytan#1740 would land first
without touching scripts/gen-skill-docs.ts, reintroducing the bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>