Skip to content

Commit cab774c

Browse files
garrytanclaude
andauthored
v1.56.0.0 Token-reduction Phase B + AUQ paranoid safety net (#1849)
* refactor(plan-ceo-review): carve review body into on-demand section Carve the largest skill (138,838 B) into a skeleton + one on-demand section, the documented next Phase B target after /ship (v2_PLAN.md:216). - sections/review-sections.md(.tmpl): the 11-section deep review, codex/ outside-voice rules, how-to-ask, Required Outputs, registries, Completion Summary, Review Log, REVIEW_DASHBOARD, PLAN_FILE_REVIEW_REPORT, Next Steps, docs/designs promotion, Formatting Rules, and the Mode Quick Reference. - sections/manifest.json: passive registry (CM2), one entry. - SKILL.md.tmpl: {{SECTION_INDEX}} after the system audit, a single {{SECTION:review-sections}} STOP-Read after Step 0 mode selection, and a Section self-check. All of Step 0 (the scope/mode conversation) stays in the always-loaded skeleton; only EXIT_PLAN_MODE_GATE follows the section. Measured: always-loaded skeleton 138,838 -> 80,731 B (-42%, ~14.4K tokens off every invocation). Union (skeleton + section) 139,110 B, behavior held. Boundary honors Codex P1: nothing review-governing (formatting rules, mode reference, how-to-ask, required outputs) sits in the skeleton below the STOP. Housekeeping resolvers ride in the section, matching the ship precedent (adversarial.md carries LEARNINGS_LOG + GBRAIN_SAVE_RESULTS). Tests (atomic with the carve — skill-docs.yml gates gen:skill-docs freshness on every push, so source + regen + tests must land together): - parity-harness: plan-ceo flipped to sectioned, maxSkeletonBytes 90_000 (measured 80,731 + headroom); content/minBytes run against the union. - skill-size-budget: plan-ceo-review added to SECTIONS_EXTRACTED. - section-manifest-consistency: generalized to discover every carved skill, vars computed per-skill-case (Codex P2). - skill-ceo-section-ordering (new, gate): per-PR static guard — STOP after Step 0, review body absent from skeleton, report writer in the section, nothing review-governing below the STOP. - skill-e2e-plan-ceo-review-section-loading (new, periodic): refreshes the installed skill first (Codex P1), drives full Step 0, asserts the section is Read before the report. - gen-skill-docs + skill-validation: read the skeleton+sections union for carved skills so relocated prose still counts. - touchfiles: plan-ceo-section-loading registered (periodic). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump VERSION + CHANGELOG for plan-ceo-review carve (v1.56.0.0) MINOR: carves the largest skill into skeleton + on-demand section, dropping plan-ceo-review's always-loaded cost 42% (138,838 -> 80,731 B, ~14.4K tokens off every invocation). User-facing release notes lead with the measured token win. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(todos): file P3 follow-up — carve the shared {{PREAMBLE}} reference blocks Surfaced by /plan-eng-review on the plan-ceo-review carve: per-skill section carves stay modest because the ~40-50KB shared preamble dominates the always-loaded surface. A single preamble-reference carve would help every tier->=2 skill at once. Records the why, the cold-vs-hot split to measure, and the guards it needs. Not implemented this PR. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): Layer 0 — guarantee AUQ format spec is always-loaded Deterministic, free, per-PR keystone for the token-reduction era. For every interactive (tier>=2) skill, asserts the full AskUserQuestion decision-brief format (ELI10/Recommendation/Pros-cons/checks/Net/(recommended)/Stakes/ self-check) lives in the always-loaded SKILL.md skeleton, NOT only in an on-demand section. Plus a roster guard (a carve can't silently drop the block) and per-skill rule survival in the skeleton+sections union. 51 cases + a negative control. Fails the instant a future carve strands AUQ-governing text where it won't be loaded when a question fires. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): SDK capture engine + verbose-vs-carved no-degradation A/B Adds the reusable SDK $OUT_FILE capture engine (auq-sdk-capture.ts): drives a skill to its AUQ and captures the verbatim text the model GENERATES, cleanly (real-PTY mangles plan-mode AUQs via cursor escapes). Pins the skill to an absolute path with Read/Write-only tools so the agent can't wander to the global install. gradeAuqRecommendation normalizes a non-"because" connective before grading so substantive reasons aren't false-flagged (without touching the pinned shared judge). The A/B drives the same prompt through the carved 80KB skeleton and the pre-carve 137KB monolith and fails if carved scores worse. Result: both 7/7 format, substance 5 — proven no degradation, transcript-verified each side read its own planted SKILL.md. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): consistency — same trigger N runs, stable format + substance Drives the carved /plan-ceo-review AUQ N=3 times and fails if any format element appears in one run but not another, or substance craters. Targets the "fine one run, broken the next" failure class a single snapshot can't see. Result: 3/3 stable, 7/7 + substance 5 every run. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): behavioral matrix across AUQ-heavy skills Data-driven test that drives each AUQ-heavy skill (plan-eng/design/devex, office-hours, cso, spec, design-consultation) to its first AskUserQuestion and grades it to the plan-ceo bar: 7/7 decision-brief format + recommendation substance >=4. One case per skill (isolated failures), env-subsettable via AUQ_MATRIX_ONLY. Browser/design-binary skills are intentionally excluded (comparison boards, not format-AUQs; Layer 0 covers their spec). All targeted skills pass 7/7 with substance 4-5. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(codex): live recommendation-substance grade for /codex Closes the gap where /codex's synthesis recommendation was only checked statically (template grep) and via fixtures. Drives the real /codex skill over a flawed diff and grades the emitted "Recommendation: ... because ..." line with judgeRecommendation (present/commits/has_because/substance>=4). The named weak spot holds up: substance 5. Periodic tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): deterministic trigger for format-compliance gate A bare /plan-ceo-review against a repo whose work is already implemented makes the model improvise an off-script "what should I review?" scope question that skips the decision-brief format, which the gate test then times out waiting for. Hand it a concrete plan to review (FORCING_FLOOR_CEO) so it reaches the real Step 0 mode-selection AUQ that is the intended format check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(office-hours): carve Phase 5+6 into on-demand section Third Phase B carve (v2_PLAN.md:216, after ship and plan-ceo-review). Moves Phase 5 (Design Doc templates) + Phase 6 (tiered relationship handoff) — the session's output + closing tail, only reached after the conversation and alternatives are done — into sections/design-and-handoff.md, behind a single STOP-Read after Phase 4.5. The live conversation (Phases 1-4.5) and the always-run Important Rules stay in the always-loaded skeleton. Measured: always-loaded skeleton 118,280 -> 88,975 B (-24.8%). Union preserved. The carved AUQ is identical to pre-carve (matrix: 7/7 format, substance 5), and Layer 0 confirms the AUQ format spec stays in the skeleton — the AUQ paranoid suite de-risked this carve end to end. Atomic with tests + regen (skill-docs.yml gates gen:skill-docs freshness on every push, so source + regen + tests land together; --host all regenerates the inlined non-Claude variants): - sections/manifest.json: passive registry, one entry. - parity-harness: office-hours flipped to sectioned, maxSkeletonBytes 96_000 (measured 88,975 + headroom); content/minBytes run against the union. - skill-size-budget: office-hours added to SECTIONS_EXTRACTED. - gen-skill-docs + skill-validation: read the skeleton+sections union for office-hours so relocated Phase 5/6 prose still counts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump VERSION + CHANGELOG for office-hours carve + AUQ suite (v1.57.0.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(preamble): carve CJK-escaping manual to on-demand doc The AskUserQuestion format block is inlined into every interactive skill (~33). It carried the full multi-paragraph non-ASCII/CJK escaping manual inline, but that rationale only matters when a question contains CJK text and the operative rule already lives in the always-loaded self-check. Moved the justification to docs/askuserquestion-cjk.md (read on demand); kept the rule + a pointer. Corpus: Claude-host SKILL.md total 3,087,499 -> 3,057,975 B (-29,524 B, ~900 B x ~33 skills). Layer 0 still passes — the core decision-brief format stays always-loaded; only the rare CJK rationale moved. Atomic with the all-host regen (skill-docs.yml freshness gate). VERSION + package.json -> 1.58.0.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(plan-eng-review): carve review body into on-demand section Fourth Phase B carve (v2_PLAN.md:220). Moves the 4-section review (Architecture, Code Quality, Tests, Performance), outside voice, required outputs, and review report — everything after Step 0 scope — into sections/review-sections.md behind a single STOP-Read. Step 0 (scope challenge) and EXIT_PLAN_MODE_GATE stay in the always-loaded skeleton. Measured: skeleton 106,984 -> 54,892 B (-48.7%). Union preserved. Atomic with tests + all-host regen (freshness gate): parity flipped to sectioned (maxSkeletonBytes 62K), plan-eng-review added to SECTIONS_EXTRACTED, gen-skill-docs reads the union for relocated review/TEST_COVERAGE/dashboard prose. Layer 0 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(plan-design-review): carve review body into on-demand section Fifth Phase B carve (v2_PLAN.md:220, bundled with plan-eng). Moves the 7 design passes, required outputs, and review report — everything after Step 0 scope and the mockup/rating phase — into sections/review-sections.md behind a STOP-Read. Step 0, Step 0.5 mockups, the rating method, and EXIT_PLAN_MODE_GATE stay in the always-loaded skeleton. Measured: skeleton 112,057 -> 76,024 B (-32.2%). Union preserved. Atomic with tests + all-host regen: parity sectioned (maxSkeletonBytes 82K), added to SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(plan-devex-review): carve review body into on-demand section Sixth Phase B carve. Moves the 8 DX passes, required outputs, and review report — everything after the Step 0 DX investigation — into sections/review-sections.md behind a STOP-Read. All of Step 0 (persona, empathy, benchmark, journey trace, roleplay) + the rating method + EXIT_PLAN_MODE_GATE stay always-loaded. Measured: skeleton 110,621 -> 69,658 B (-37%). Union preserved. Atomic with tests + all-host regen: added to SECTIONS_EXTRACTED, gen-skill-docs reads the union. Layer 0 green. (No parity invariant entry for plan-devex-review.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump VERSION + CHANGELOG for plan-* family carves (v1.59.0.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: refresh ship golden baselines + gbrain-detection union after carves Two follow-ups the carve commits should have carried (caught by the full suite, missed by targeted subsets): - ship golden baselines (claude/codex/factory) regenerated: the preamble CJK trim (v1.58) changed ship's always-loaded AskUserQuestion block. - gbrain-detection-override probes the office-hours skeleton+section union: GBRAIN_SAVE_RESULTS moved into sections/design-and-handoff.md when office-hours was carved, so the detection assertions now check both files. Full `bun test` green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): grade format-compliance gate from SDK capture, not the TUI The real-PTY version grepped the stripAnsi'd interactive AUQ picker. Verified directly that this cannot work: plan-mode AUQs render as a cursor picker whose cursor-positioning escapes stripAnsi can't flatten — the picker renders fine for a human (cursorSeen=45) but the flattened text drops ELI10:/(recommended) and parseNumberedOptions returns 0. The test was grading a lossy projection and failed by construction. Rewritten to drive /plan-ceo-review via the SDK $OUT_FILE capture (the agent writes the verbatim question it would have shown — clean text, no rendering loss) and grade 7/7 format + kind-note + recommendation substance >=4. Same property, reliable, environment-independent; shares the engine with the periodic A/B and matrix evals. Result: 7/7 format, substance 5. Touchfiles key renamed ask-user-question-format-pty -> auq-format-gate (no longer a PTY test). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: fix carve-broken CI evals (union reads + section fixtures) Two CI eval jobs failed on the carved plan-* skills because they read content that moved into sections/: - llm-judge (skill-llm-eval): runWorkflowJudge sliced SKILL.md between markers like "## Review Sections" / "## CRITICAL RULE" that now live in sections/review-sections.md. The markers vanished from the skeleton, so the judge scored empty/wrong content. Fix: read the skeleton+sections union. Verified: plan-ceo modes / plan-eng sections / plan-design passes all PASS (25/25). - e2e-plan (skill-e2e-plan): setupPlanDir copied only <skill>/SKILL.md into the fixture, not sections/. The carved skill's STOP pointed at a section file that was absent, so the model improvised a compressed report table instead of the canonical "| Review | Trigger | Why | Runs | Status | Findings |". Fix: copy sections/ alongside SKILL.md in all 6 setup sites. Verified: report test PASS, canonical table emitted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: copy carved sections into all e2e fixtures (prevent more carve-blind CI fails) Proactive sweep beyond the two CI logs: every e2e test that copies a carved skill's SKILL.md into a temp fixture must also copy its sections/, or the model hits a STOP pointing at a missing section file and improvises/degrades. - skill-e2e.test.ts: plan-ceo/plan-eng/plan-design/office-hours copies across planDir/reviewDir/ohDir/benefitsDir dests now copy sections/. - skill-e2e-plan.test.ts: the office-hours copy + the 4-skill codex-offering loop now copy sections/. - skill-e2e-design.test.ts: plan-design-review copy now copies sections/. - skill-e2e-office-hours.test.ts: both office-hours copies now copy sections/. - skill-e2e-office-hours-brain-writeback.test.ts: GBRAIN_SAVE_RESULTS moved into the section, so check the regenerated skeleton+section UNION for the gbrain put block, ship both into the workdir, and restore both (the section regen was also leaking into the working tree — finally now restores it). ship copies (single-file Step-0 slices) and review/retro (not carved) untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: migrate section-loading E2E to lossless SDK tool-stream detection The /ship and /plan-ceo-review section-loading tests drove a real PTY and scraped the ANSI screen buffer for sections/<file>.md paths. That silently saw nothing in a Conductor PTY (cursor-positioned tool renders and an unanswered Step 0 question loop both defeat the regex), so both reported read: [] even when the agent did the work. They now run the skill through claude -p (the same SDK path the AUQ matrix uses) and detect section reads from the tool-use stream — Read calls whose file_path contains sections/<file>.md — with no rendering layer to mangle. The run is also hermetic: the freshly-generated worktree skeleton + sections are copied into a throwaway fixture with the absolute path pinned, so the test validates this branch's carve without mutating the user's ~/.claude install. Validated EVALS_TIER=periodic: both pass (plan-ceo Reads review-sections.md; ship Reads review-army.md + changelog.md), ~6.5 min for both vs ~23 min combined on the old PTY path where both were failing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: consolidate branch to v1.56.0.0 (single MINOR above main) The branch bumped VERSION several times during development (1.56 → 1.57 → 1.58 → 1.59), but none of those landed on main (main is at 1.55.1.0). Per the "never orphan branch-internal versions" discipline, collapse all four into a single 1.56.0.0 entry — one MINOR release covering the whole branch: five skills carved (plan-ceo, office-hours, plan-eng, plan-design, plan-devex), the shared AskUserQuestion preamble CJK trim, and the paranoid AUQ no-degradation test suite + lossless section-loading tests. VERSION and package.json set to 1.56.0.0; main's 1.55.1.0 entry preserved below the consolidated entry. No SKILL.md drift (VERSION is not embedded in generated bodies). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent c43c850 commit cab774c

95 files changed

Lines changed: 7701 additions & 6781 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

CHANGELOG.md

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,65 @@
11
# Changelog
22

3+
## [1.56.0.0] - 2026-06-03
4+
5+
## **Five heavy skills now load their bulk on demand, the shared question preamble slimmed corpus-wide, and a paranoid test suite proves the questions never got worse.**
6+
7+
The token-reduction program lands its biggest wave. Five of the heaviest skills — `/plan-ceo-review`, `/office-hours`, `/plan-eng-review`, `/plan-design-review`, and `/plan-devex-review` — are now a small always-loaded skeleton plus an on-demand `sections/` file the agent opens only when it reaches the work. The conversational front half (Step 0 scope, the live interview) stays always-loaded; the deep review bodies, design-doc templates, outside-voice rules, and required-output writers move behind a single STOP-Read. The shared AskUserQuestion preamble also shed its rarely-needed CJK escaping manual, so every interactive skill is a little lighter at once. And because the most user-facing surface in gstack is the question it asks you, a new paranoid test suite proves that slimming a skill never strands or degrades that question.
8+
9+
### The numbers that matter
10+
11+
Measured from the generated skeletons (`wc -c <skill>/SKILL.md`), regenerated for all hosts:
12+
13+
| Skill | Before | After | Δ |
14+
|-------|--------|-------|---|
15+
| plan-ceo-review | 138,838 B | 80,731 B | -42.0% |
16+
| office-hours | 118,280 B | 88,975 B | -24.8% |
17+
| plan-eng-review | 106,984 B | 54,892 B | -48.7% |
18+
| plan-design-review | 112,057 B | 76,024 B | -32.2% |
19+
| plan-devex-review | 110,621 B | 69,658 B | -37.0% |
20+
21+
On top of the per-skill carves, the shared AskUserQuestion preamble dropped its inline CJK manual to a one-line rule + a doc pointer, trimming ~29,524 B across the Claude-host corpus (every interactive skill, ~900 B each).
22+
23+
The AskUserQuestion proof, measured by SDK capture (`test/skill-e2e-auq-matrix.test.ts`):
24+
25+
| Guarantee | Result |
26+
|-----------|--------|
27+
| Carved vs verbose, same trigger | 7/7 format, substance 5 == 7/7 format, substance 5 (no degradation) |
28+
| Matrix across 7 AUQ-heavy skills | all 7/7 format, substance 4-5 |
29+
| Same trigger 3× (consistency) | stable, every format element every run |
30+
| AUQ format spec always-loaded | guaranteed in every skeleton (Layer 0) |
31+
32+
Every review runs identical pass for pass; the only thing that changed is what sits in context before the work begins.
33+
34+
### What this means for you
35+
36+
The skills you run most start markedly lighter: a `/plan-ceo-review` opens ~42% smaller, the others 25-49%, and they pull in their review bodies only when they reach them. You will not notice any behavior change in the reviews themselves; they run section for section as before. What you get is more of the context window left for your actual work, paid back on every invocation. And every skill that asks you questions now carries a guarantee, enforced by tests: the decision brief (plain-English ELI10, an explicit recommendation with a real reason, pros and cons, the stakes) is provably in context the instant any question fires. External hosts (codex, factory, kiro, opencode) still receive the full inline skill, so nothing regresses off Claude.
37+
38+
### Itemized changes
39+
40+
#### Added
41+
- `plan-ceo-review/sections/review-sections.md` — the 11-section deep review, outside-voice rules, required-output registries, completion summary, review report writer, next-step chaining, and mode quick reference, behind a STOP-Read pointer with a passive `manifest.json`.
42+
- `office-hours/sections/design-and-handoff.md` — Phase 5 design-doc templates + Phase 6 tiered handoff, behind a STOP-Read pointer with a passive `manifest.json`.
43+
- `/plan-eng-review`, `/plan-design-review`, `/plan-devex-review` each gain a `sections/review-sections.md` carved behind a post-Step-0 STOP-Read.
44+
- `docs/askuserquestion-cjk.md` — full non-ASCII / CJK escaping rationale + worked example, read on demand.
45+
- `test/auq-format-always-loaded.test.ts` — free per-PR keystone: every interactive skill must carry the full AskUserQuestion format spec in its always-loaded skeleton, never stranded in a section. 51 cases plus a negative control.
46+
- `test/skill-e2e-auq-matrix.test.ts` — drives each AUQ-heavy skill to its first question and grades it (7/7 format, substance >=4).
47+
- `test/skill-e2e-auq-verbose-vs-carved-ab.test.ts` — proves a carved skill's question is not worse than the pre-carve monolith's, on the same trigger.
48+
- `test/skill-e2e-auq-consistency.test.ts` — same trigger N times, fails on any format element that flickers between runs.
49+
- `test/codex-e2e-recommendation-substance.test.ts` — grades `/codex`'s live recommendation substance.
50+
- `test/skill-ceo-section-ordering.test.ts` — gate-tier static guard: the STOP fires after Step 0, the review body is absent from the skeleton, the report writer lives in the section, and nothing review-governing sits below the STOP.
51+
- `test/skill-e2e-plan-ceo-review-section-loading.test.ts` — periodic backstop that asserts the carved section is Read before the report.
52+
53+
#### Changed
54+
- `/plan-ceo-review`, `/office-hours`, `/plan-eng-review`, `/plan-design-review`, `/plan-devex-review` are each a skeleton + one on-demand section on Claude; the conversational front (Step 0 / Phases 1-4.5) stays always-loaded; external hosts still receive the full inline skill (no behavior change off Claude).
55+
- The AskUserQuestion preamble trims the inline CJK manual to the operative rule + a doc pointer; the always-loaded self-check is unchanged.
56+
- Parity, size-budget, section-manifest, gen-skill-docs, and skill-validation treat every carved skill consistently (content + size floors run against the skeleton + section union; skeleton-shrink assertions guard the always-loaded win).
57+
58+
#### For contributors
59+
- `test/helpers/auq-sdk-capture.ts` — reusable SDK capture engine: drives a skill to its AUQ and captures the verbatim generated text cleanly (real-PTY mangles plan-mode questions), grades format + recommendation substance robust to the connective, and detects section reads losslessly from the tool-use stream.
60+
- `section-manifest-consistency` discovers every carved skill automatically, so the next carve is covered the moment its manifest lands.
61+
- The `/ship` and `/plan-ceo-review` section-loading E2E tests detect section reads from the `claude -p` tool-use stream instead of scraping the real-PTY screen buffer, so they are reliable (the PTY path silently saw nothing in some terminals) and run hermetically against the worktree carve without mutating the installed skill.
62+
363
## [1.55.1.0] - 2026-06-02
464

565
## **Telemetry now tells you exactly what it records and where it stays. The project-slug helper hands the shell a safe identifier on every path.**

TODOS.md

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,38 @@ v1.47.0.0 baselines retained in `test/fixtures/` for the v1→v2 audit trail. Th
1919
captured skill bytes match `origin/main` exactly (the rebasing branch left every
2020
SKILL.md untouched). `bun test` is green again.
2121

22+
## Token-reduction follow-ups (Phase B, filed via /plan-eng-review on the plan-ceo-review carve)
23+
24+
### P3: Carve the always-loaded `{{PREAMBLE}}` reference blocks into an on-demand doc
25+
26+
**What:** The per-skill section carves (`/ship` v1.54, `/plan-ceo-review` v1.56) yield
27+
real but bounded wins (-42% to -59% on the carved skill) because the shared
28+
`{{PREAMBLE}}` (~40-50KB on every tier-3/4 skill) is the dominant always-loaded cost
29+
and stays inline. Move the rarely-needed preamble REFERENCE blocks (the AskUserQuestion
30+
split-rules and the CJK / lone-surrogate escaping reference) into an on-demand
31+
section-style doc the agent reads only when it hits those edge cases, leaving the hot
32+
path (voice, completeness principle, recommendation format) inline.
33+
34+
**Why:** Highest-ROI remaining token target. One preamble carve helps EVERY tier-≥2
35+
skill at once, not one skill per PR. The eng-review on the plan-ceo carve flagged that
36+
per-skill carves stay modest precisely because the preamble dominates the always-loaded
37+
surface.
38+
39+
**Pros:** A single change reduces always-loaded cost across the whole skill pack.
40+
**Cons:** The preamble is load-bearing and shared; a botched carve regresses every skill.
41+
Needs the same union-parity + per-push freshness guards the section carves use, applied
42+
corpus-wide.
43+
44+
**Context:** Builds on the v2 section pipeline (`scripts/resolvers/sections.ts`,
45+
`{{SECTION:id}}` / `{{SECTION_INDEX}}`). The preamble source is
46+
`scripts/resolvers/preamble.ts`. Measure which sub-blocks are cold (escaping reference,
47+
split-rules) vs hot (voice, recommendation format) before cutting. Validate on one skill,
48+
then roll corpus-wide.
49+
50+
**Effort estimate:** L (human team) → M (CC+gstack)
51+
**Priority:** P3
52+
**Depends on / blocked by:** The section pipeline (shipped v1.54). No hard blocker.
53+
2254
## gbrowser memory follow-ups (filed via /plan-eng-review + /codex on the v1.49 leak-fix PR)
2355

2456
These four items came out of the memory-leak investigation that shipped

VERSION

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
1.55.1.0
1+
1.56.0.0

autoplan/SKILL.md

Lines changed: 6 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -371,25 +371,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
371371
**Full rule + worked examples + Hold/dependency semantics:** see
372372
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.
373373

374-
**Non-ASCII characters — write directly, never \u-escape.** When any
375-
string field (question, option label, option description) contains
376-
Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
377-
the literal UTF-8 characters in the JSON string. **Never escape them
378-
as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
379-
and passes characters through unchanged. Manually escaping requires
380-
recalling each codepoint from training, which is unreliable for long
381-
CJK strings — the model regularly emits the wrong codepoint (e.g.
382-
writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
383-
actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
384-
The trigger is long, multi-line questions with hundreds of CJK
385-
characters: that is exactly when reflexive escaping kicks in and
386-
exactly when miscoding is most damaging. Long ≠ escape. Keep
387-
characters literal.
388-
389-
Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
390-
Right: `"question": "請選擇管理工具"`
391-
392-
Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
374+
**Non-ASCII characters — write directly, never \u-escape.** When any string
375+
field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
376+
emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
377+
UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
378+
`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
379+
`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.
393380

394381
### Self-check before emitting
395382

canary/SKILL.md

Lines changed: 6 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -363,25 +363,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
363363
**Full rule + worked examples + Hold/dependency semantics:** see
364364
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.
365365

366-
**Non-ASCII characters — write directly, never \u-escape.** When any
367-
string field (question, option label, option description) contains
368-
Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
369-
the literal UTF-8 characters in the JSON string. **Never escape them
370-
as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
371-
and passes characters through unchanged. Manually escaping requires
372-
recalling each codepoint from training, which is unreliable for long
373-
CJK strings — the model regularly emits the wrong codepoint (e.g.
374-
writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
375-
actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
376-
The trigger is long, multi-line questions with hundreds of CJK
377-
characters: that is exactly when reflexive escaping kicks in and
378-
exactly when miscoding is most damaging. Long ≠ escape. Keep
379-
characters literal.
380-
381-
Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
382-
Right: `"question": "請選擇管理工具"`
383-
384-
Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
366+
**Non-ASCII characters — write directly, never \u-escape.** When any string
367+
field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
368+
emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
369+
UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
370+
`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
371+
`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.
385372

386373
### Self-check before emitting
387374

codex/SKILL.md

Lines changed: 6 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -366,25 +366,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
366366
**Full rule + worked examples + Hold/dependency semantics:** see
367367
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.
368368

369-
**Non-ASCII characters — write directly, never \u-escape.** When any
370-
string field (question, option label, option description) contains
371-
Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
372-
the literal UTF-8 characters in the JSON string. **Never escape them
373-
as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
374-
and passes characters through unchanged. Manually escaping requires
375-
recalling each codepoint from training, which is unreliable for long
376-
CJK strings — the model regularly emits the wrong codepoint (e.g.
377-
writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
378-
actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
379-
The trigger is long, multi-line questions with hundreds of CJK
380-
characters: that is exactly when reflexive escaping kicks in and
381-
exactly when miscoding is most damaging. Long ≠ escape. Keep
382-
characters literal.
383-
384-
Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
385-
Right: `"question": "請選擇管理工具"`
386-
387-
Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
369+
**Non-ASCII characters — write directly, never \u-escape.** When any string
370+
field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
371+
emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
372+
UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
373+
`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
374+
`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.
388375

389376
### Self-check before emitting
390377

context-restore/SKILL.md

Lines changed: 6 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -367,25 +367,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
367367
**Full rule + worked examples + Hold/dependency semantics:** see
368368
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.
369369

370-
**Non-ASCII characters — write directly, never \u-escape.** When any
371-
string field (question, option label, option description) contains
372-
Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
373-
the literal UTF-8 characters in the JSON string. **Never escape them
374-
as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
375-
and passes characters through unchanged. Manually escaping requires
376-
recalling each codepoint from training, which is unreliable for long
377-
CJK strings — the model regularly emits the wrong codepoint (e.g.
378-
writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
379-
actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
380-
The trigger is long, multi-line questions with hundreds of CJK
381-
characters: that is exactly when reflexive escaping kicks in and
382-
exactly when miscoding is most damaging. Long ≠ escape. Keep
383-
characters literal.
384-
385-
Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
386-
Right: `"question": "請選擇管理工具"`
387-
388-
Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
370+
**Non-ASCII characters — write directly, never \u-escape.** When any string
371+
field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
372+
emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
373+
UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
374+
`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
375+
`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.
389376

390377
### Self-check before emitting
391378

context-save/SKILL.md

Lines changed: 6 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -366,25 +366,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
366366
**Full rule + worked examples + Hold/dependency semantics:** see
367367
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.
368368

369-
**Non-ASCII characters — write directly, never \u-escape.** When any
370-
string field (question, option label, option description) contains
371-
Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
372-
the literal UTF-8 characters in the JSON string. **Never escape them
373-
as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
374-
and passes characters through unchanged. Manually escaping requires
375-
recalling each codepoint from training, which is unreliable for long
376-
CJK strings — the model regularly emits the wrong codepoint (e.g.
377-
writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
378-
actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
379-
The trigger is long, multi-line questions with hundreds of CJK
380-
characters: that is exactly when reflexive escaping kicks in and
381-
exactly when miscoding is most damaging. Long ≠ escape. Keep
382-
characters literal.
383-
384-
Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
385-
Right: `"question": "請選擇管理工具"`
386-
387-
Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
369+
**Non-ASCII characters — write directly, never \u-escape.** When any string
370+
field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
371+
emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
372+
UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
373+
`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
374+
`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.
388375

389376
### Self-check before emitting
390377

cso/SKILL.md

Lines changed: 6 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -369,25 +369,12 @@ so split chains are never AUTO_DECIDE-eligible — the user's option set is sacr
369369
**Full rule + worked examples + Hold/dependency semantics:** see
370370
`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4.
371371

372-
**Non-ASCII characters — write directly, never \u-escape.** When any
373-
string field (question, option label, option description) contains
374-
Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit
375-
the literal UTF-8 characters in the JSON string. **Never escape them
376-
as `\uXXXX`.** Claude Code's tool parameter pipe is UTF-8 native
377-
and passes characters through unchanged. Manually escaping requires
378-
recalling each codepoint from training, which is unreliable for long
379-
CJK strings — the model regularly emits the wrong codepoint (e.g.
380-
writes `\u3103` thinking it is 管 U+7BA1, but `\u3103` is
381-
actually ㄃, so the user sees `管理工具` rendered as `㄃3用箱`).
382-
The trigger is long, multi-line questions with hundreds of CJK
383-
characters: that is exactly when reflexive escaping kicks in and
384-
exactly when miscoding is most damaging. Long ≠ escape. Keep
385-
characters literal.
386-
387-
Wrong: `"question": "請選擇\uXXXX\uXXXX\uXXXX\uXXXX"`
388-
Right: `"question": "請選擇管理工具"`
389-
390-
Only JSON-mandatory escapes remain allowed: `\n`, `\t`, `\"`, `\\`.
372+
**Non-ASCII characters — write directly, never \u-escape.** When any string
373+
field contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text,
374+
emit the literal UTF-8 characters; never escape them as `\uXXXX` (the pipe is
375+
UTF-8 native, and manual escaping miscodes long CJK strings). Only `\n`,
376+
`\t`, `\"`, `\\` remain allowed. Full rationale + worked example: see
377+
`docs/askuserquestion-cjk.md`. Read on demand when a question contains CJK.
391378

392379
### Self-check before emitting
393380

0 commit comments

Comments
 (0)