diff --git a/README.md b/README.md index 6f1b697a5..bcec9a16b 100644 --- a/README.md +++ b/README.md @@ -105,7 +105,7 @@ The first pass tightens recent branch changes before review. The targeted pass i After installing, run `/ce-setup` in any project. It checks repo-local config, reports optional tool capabilities, and helps keep machine-local CE settings safely gitignored. -The `compound-engineering` plugin currently ships 27 skills and 0 standalone agents. Specialist review, research, and workflow behavior lives inside the owning skills as skill-local prompt assets. +The `compound-engineering` plugin currently ships 26 skills and 0 standalone agents. Specialist review, research, and workflow behavior lives inside the owning skills as skill-local prompt assets. ### Full Skill Inventory @@ -136,7 +136,6 @@ The `compound-engineering` plugin currently ships 27 skills and 0 standalone age | `/ce-polish` | Start a dev server and iterate on UX polish | | `/ce-proof` | Create, edit, and share Proof documents | | `/ce-dogfood-beta` | Diff-scoped browser QA of the active branch | -| `/ce-work-beta` | Experimental execution workflow with Codex delegation mode | | `/lfg` | Full autonomous engineering workflow | --- diff --git a/docs/plans/2026-06-26-001-chore-remove-ce-work-beta-plan.html b/docs/plans/2026-06-26-001-chore-remove-ce-work-beta-plan.html new file mode 100644 index 000000000..c859cc09d --- /dev/null +++ b/docs/plans/2026-06-26-001-chore-remove-ce-work-beta-plan.html @@ -0,0 +1,449 @@ + + + + + +Remove ce-work-beta Skill - Plan + + + +
+ + Compound Engineering · Implementation Plan +

Remove ce-work-beta Skill - Plan

+

+ Retire the experimental Codex external-delegation skill and complete the repo's required removal cleanup, keeping the reusable delegation learnings.

+ +
+
Type
chore
+
Date
+
Topic
remove-ce-work-beta
+
Artifact
ce-unified-plan/v1
+
Readiness
implementation-ready
+
Source
ce-brainstorm
+
Execution
code
+
+ + + +
+

Goal Capsule

+
+
+
Objective
+
Remove the unmaintained ce-work-beta skill (the Codex shell-out delegation experiment) and complete every cleanup the repo's conventions require, while preserving the reusable external-delegation learnings.
+
Authority
+
Trevin Chow (plugin maintainer). The Product Contract is the source of truth; planning corrections to it are noted below.
+
Execution profile
+
Single small PR off a feature branch. Mechanical removal plus one documentation edit; no architectural risk.
+
Stop conditions
+
Stop and surface if removing the skill breaks a test whose intent is to verify live delegation behavior in a way the plan didn't anticipate, or if release:validate fails for a reason other than the skill count. Do not expand scope into the historical plan docs or CHANGELOG.
+
Tail ownership
+
Implementer runs bun test and bun run release:validate to green, then opens a PR with a chore(ce-work-beta):-scoped title. All changes to main go through the PR; no direct push.
+
Open blockers
+
None.
+
+
+
+ +
+ +
+

Product Contract

+ +
+ Product Contract preservation
+ Requirements unchanged in intent. R5 clarified: there is no skill count stored in a release manifest. The count lives in README prose (R4) and one test assertion (tests/release-metadata.test.ts), and release:validate auto-derives counts from disk. R5 is reframed from "sync the release manifest count" to "fix the README + test count and run release:validate to confirm sync." No product-scope change.
+ +

Summary

+

+ Delete the ce-work-beta skill from the plugin and perform the full set of cleanups its removal triggers: stale-install registries, the README inventory and skill count, the delegation-parity contract tests, and the release-metadata test count. The skill is beta, sets disable-model-invocation: true, and has no downstream handoffs, so removal is a hard delete rather than a deprecation. The Codex token-economics learnings are kept as a standalone reference; the now-moot promotion checklist is deleted.

+ +

Problem Frame

+

+ ce-work-beta was a beta surface for routing implementation work to the Codex CLI via codex exec. The delegation machinery (config parsing, per-batch effort selection, sandbox routing, and the ~23 KB delegation-workflow reference) has an ongoing carrying cost. Keeping the shell-out path correct as ce-work evolves means maintaining a parallel orchestration surface that stayed beta, manual-only, and isolated from planning/workflow handoffs by design. The maintenance burden now outweighs the value the experiment returns, so the experiment ends.

+ +

Requirements

+ +
Skill removal
+ + + + + +
| ID | Requirement |
|----|-------------|
| R1 | Delete the entire skills/ce-work-beta/ directory tree, including SKILL.md and all files under references/. |
+ +
Stale-install cleanup
+ + + + + + +
| ID | Requirement |
|----|-------------|
| R2 | Add ce-work-beta (dash variant) to STALE_SKILL_DIRS in src/utils/legacy-cleanup.ts so upgrades sweep the flat-install skill directory. The colon variant ce:work-beta is already present and stays. |
| R3 | Add ce-work-beta to EXTRA_LEGACY_ARTIFACTS_BY_PLUGIN["compound-engineering"].skills in src/data/plugin-legacy-artifacts.ts for the same reason. |
+ +
Inventory and release sync
+ + + + + + +
| ID | Requirement |
|----|-------------|
| R4 | Update README.md: remove the /ce-work-beta inventory row and change the stated skill count from 27 to 26. |
| R5 | Fix the hardcoded skill-count assertion in tests/release-metadata.test.ts (27 to 26) and run bun run release:validate to confirm metadata is in sync. The count auto-derives from skill directories; do not hand-bump release-owned versions. |
+ +
Knowledge preservation
+ + + + + + +
| ID | Requirement |
|----|-------------|
| R6 | Edit docs/solutions/best-practices/codex-delegation-best-practices.md to read as a standalone external-delegation reference: keep the token-economics and batching learnings, remove the parts that depend on ce-work-beta existing as a live skill. |
| R7 | Delete docs/solutions/skill-design/ce-work-beta-promotion-checklist.md, which no longer applies once the skill is removed rather than promoted. |
+ +
Tests
+ + + + + + +
| ID | Requirement |
|----|-------------|
| R8 | Remove the ce-work-beta delegation-parity assertions in tests/pipeline-review-contract.test.ts (the tests that read the deleted skill to verify it mirrors ce-work). |
| R9 | Repair the remaining test references so bun test passes: fix the stale-install cleanup tests to stop reading the deleted SKILL.md while keeping their sweep assertions, drop ce-work-beta from the user-invoked-skills set, and repoint the converter example off the dead name. |
+ +

Scope Boundaries

+
+ Not in scope +
  • The historical plan docs in docs/plans/ that mention ce-work-beta as an active work surface stay untouched: they record past state, and version control holds history.
  • Existing CHANGELOG.md entries referencing ce-work-beta are left as-is; they are release-owned history, not hand-edited here.
+
+
+ Explicitly rejected
+ Promoting Codex delegation into stable ce-work. The decision is to end the experiment, not graduate it; the promotion path is being deleted, not taken.
+
+ +
+ +
+

Planning Contract

+ +

Key Technical Decisions

+
+ KTD1. Hard delete, not a deprecation stub
+ The skill is beta, disable-model-invocation: true, and no other skill (planning, brainstorm, lfg, compound) hands off to it, so there is no user path to soften. The directory is removed outright and the name is added to the cleanup registries so existing flat installs get swept on upgrade.
+
+ KTD2. Add the dash variant only to the two registries that lack it
+ legacy-cleanup.ts already carries ce-work-beta in STALE_PROMPT_FILES, LEGACY_SKILL_DESCRIPTION_ALIASES, LEGACY_PROMPT_CURRENT_SKILL_FOR_FILE, and LEGACY_PROMPT_DESCRIPTION_ALIASES from the earlier ce: to ce- rename. Only STALE_SKILL_DIRS (R2) and the artifacts skills array (R3) are missing the dash variant. Add only those two; do not duplicate the existing entries.
+
+ KTD3. Keep the stale-install cleanup tests; delete only the delegation-parity tests
+ The registries continue to target ce-work-beta, so "a leftover ce-work-beta install gets swept" stays a tested behavior. The legacy-cleanup tests keep that assertion but must stop reading the deleted skills/ce-work-beta/SKILL.md (replace the runtime pluginDescription(...) read with the literal historical description already encoded in LEGACY_SKILL_DESCRIPTION_ALIASES). The pipeline-review-contract tests verify a capability that no longer exists and are deleted whole.
+
+ KTD4. The skill count is README prose plus one test assertion, not a release manifest value
+ getCompoundEngineeringCounts in src/release/metadata.ts derives the count from on-disk skill directories. release:validate reports it but does not fail on a README mismatch and does not write the count back. The only enforced copy is tests/release-metadata.test.ts:150 (caught by bun test). So the count fix is two hand-edits (README prose + that assertion), not a manifest bump.
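The split KTD4 describes, a count derived from disk versus a count hand-written in README prose, can be sketched roughly as below. This is a minimal illustration and not the repo's code: `countSkillDirs`, `readmeSkillCount`, and `readmeCountInSync` are hypothetical names, and treating a directory as a skill only when it contains `SKILL.md` is an assumption.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical stand-in for a disk-derived count: the number of immediate
// subdirectories of `skillsRoot` that contain a SKILL.md file (an assumption
// about what makes a directory count as a skill).
function countSkillDirs(skillsRoot: string): number {
  return fs
    .readdirSync(skillsRoot, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .filter((entry) => fs.existsSync(path.join(skillsRoot, entry.name, "SKILL.md")))
    .length;
}

// Hypothetical helper: pulls N out of README prose of the form
// "currently ships N skills". Returns null when that sentence is missing.
function readmeSkillCount(readme: string): number | null {
  const match = /currently ships (\d+) skills/.exec(readme);
  return match ? Number(match[1]) : null;
}

// True when the hand-written README number agrees with the disk-derived one.
function readmeCountInSync(skillsRoot: string, readme: string): boolean {
  return readmeSkillCount(readme) === countSkillDirs(skillsRoot);
}
```

A check like `readmeCountInSync` is exactly what the plan says nothing enforces today, which is why the README edit in R4 has to be made by hand.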
+ +

Assumptions

+ + +

Sources & Research

+ +
+ +
+ +
+

Implementation Units

+ + +
+

U1. Delete the ce-work-beta skill tree

+
+
Goal
Remove the skill itself.
+
Requirements
R1
+
Dependencies
None
+
Files
skills/ce-work-beta/ (entire tree: SKILL.md + references/)
+
+
Execution note: Land U1 together with U3 in the same PR. Deleting the tree breaks the tests that read it until U3 lands. The PR tip must be green.
+
Approach & verification +

Pure deletion of the directory. After this, countSkillDirectories returns 26 automatically. No code references the skill at import time — all remaining references are string literals in tests and registries handled by other units.

+

Test expectation: none — pure file deletion; the absence is proven by U3's suite repair and the Verification Contract grep.

+

Verification: skills/ce-work-beta/ no longer exists.

+
+
+ +
+

U2. Add dash variant to the two cleanup registries

+
+
Goal
Ensure existing installs sweep the now-stale ce-work-beta skill directory on upgrade.
+
Requirements
R2, R3
+
Dependencies
None
+
Files
src/utils/legacy-cleanup.ts, src/data/plugin-legacy-artifacts.ts
+
+
Approach, patterns & test scenarios +

Approach: Add the string "ce-work-beta" to STALE_SKILL_DIRS (next to the existing "ce:work-beta") and to EXTRA_LEGACY_ARTIFACTS_BY_PLUGIN["compound-engineering"].skills (next to its "ce:work-beta"). Per KTD2, do not touch the prompt-file / description-alias registries — they already carry the dash variant.

+

Patterns to follow: mirror the adjacent ce:work-beta entries; keep alphabetical/grouped ordering consistent with neighbors.

+

Test scenarios:

+
  • Cleanup sweep: a fixture containing a stale ce-work-beta skill directory is removed by the cleanup routine (extends the existing prompt-file sweep coverage to the skill-dir registry entry).
  • Registry membership: STALE_SKILL_DIRS and the artifacts skills array each contain both ce:work-beta and ce-work-beta.
+

Verification: the cleanup tests in U3 pass with the new entries.
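The registry change in U2 might look roughly like the sketch below. The array shape and the `staleSkillDirsIn` helper are hypothetical; the real `STALE_SKILL_DIRS` in src/utils/legacy-cleanup.ts may hold many more entries or use a different container.

```typescript
// Hypothetical shape: the real STALE_SKILL_DIRS may differ in type and contents.
const STALE_SKILL_DIRS: readonly string[] = [
  "ce:work-beta", // colon variant, already present before this change
  "ce-work-beta", // dash variant added by R2
];

// Hypothetical helper: given the directory names found in an install's skills
// folder, return the ones a cleanup pass would sweep.
function staleSkillDirsIn(installed: readonly string[]): string[] {
  const stale = new Set(STALE_SKILL_DIRS);
  return installed.filter((name) => stale.has(name));
}
```

Keeping both spellings in the registry is the point of the "registry membership" scenario: an install created before or after the rename is swept either way.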

+
+
+ +
+

U3. Repair the test suite

+
+
Goal
Make bun test green after the skill and its parity surface are gone.
+
Requirements
R5, R8, R9
+
Dependencies
U1, U2
+
Files
tests/pipeline-review-contract.test.ts, tests/legacy-cleanup.test.ts, tests/skill-conventions.test.ts, tests/copilot-converter.test.ts, tests/release-metadata.test.ts
+
+
Approach & test scenarios +

Approach (per file):

+
  • pipeline-review-contract.test.ts: delete the four named ce-work-beta tests and the ce:work-beta codex delegation contract describe block (all read the deleted skill to assert delegation parity). (R8)
  • legacy-cleanup.test.ts: keep both tests that assert a stale ce-work-beta.md wrapper is removed, but replace the pluginDescription("skills/ce-work-beta/SKILL.md") runtime read with the literal historical description string (already encoded in LEGACY_SKILL_DESCRIPTION_ALIASES), so the test no longer depends on the deleted file. (R9, KTD3)
  • skill-conventions.test.ts: remove "ce-work-beta" from EXPECTED_USER_INVOKED_SKILLS. (R9)
  • copilot-converter.test.ts: repoint the multi-colon transform example off the dead skill name (use a live or neutral example) so no test references a removed skill. (R9)
  • release-metadata.test.ts: change the skills: 27 assertion to 26. (R5)
+

Test scenarios:

+
  • Full suite green: bun test passes with no ce-work-beta SKILL.md reads.
  • Count assertion: getCompoundEngineeringCounts reports skills: 26.
  • Stale-install sweep still asserted: the ce-work-beta.md removal tests pass without touching the filesystem skill dir.
  • Converter transform still correct after the example swap (colon→dash output unchanged in behavior).
+

Verification: bun test green.
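For the converter example swap, a neutral multi-colon input is enough to keep the colon-to-dash behavior covered. The sketch below is a hypothetical stand-in, assuming the transform simply replaces every colon with a dash; the real converter in this repo may apply further rules, and `colonToDash` is not its actual name.

```typescript
// Hypothetical stand-in for the converter's name transform: every colon in a
// skill name becomes a dash.
function colonToDash(name: string): string {
  return name.replace(/:/g, "-");
}

// Neutral multi-colon example that does not name a removed skill.
const neutralExample = colonToDash("group:sub:skill");
```

Using a made-up name such as `group:sub:skill` means the test never has to change again when a real skill is renamed or removed.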

+
+
+ +
+

U4. Update README inventory and count

+
+
Goal
Keep the public inventory accurate.
+
Requirements
R4
+
Dependencies
None
+
Files
README.md
+
+
Approach & verification +

Approach: Remove the | /ce-work-beta | ... | inventory row and change "currently ships 27 skills" to "26 skills".

+

Test expectation: none — documentation prose; no test asserts README contents (verified).

+

Verification: no ce-work-beta row remains; count reads 26.

+
+
+ +
+

U5. Decouple the best-practices doc

+
+
Goal
Preserve the delegation learnings as a standalone reference.
+
Requirements
R6
+
Dependencies
None
+
Files
docs/solutions/best-practices/codex-delegation-best-practices.md
+
+
Approach & verification +

Approach: Keep the token-economics crossover (~5–7 unit threshold) and batching guidance. Remove or rewrite passages that frame the doc as ce-work-beta usage (references to the live skill, its config keys, the promotion path) so it reads as general external-delegation knowledge. Update frontmatter tags if they name the skill.

+

Test expectation: none — documentation.

+

Verification: the doc stands alone; no dangling references to ce-work-beta as a live skill.

+
+
+ +
+

U6. Delete the promotion checklist

+
+
Goal
Remove the now-moot promotion doc.
+
Requirements
R7
+
Dependencies
None
+
Files
docs/solutions/skill-design/ce-work-beta-promotion-checklist.md
+
+
Approach & verification +

Approach: Delete the file. Grep docs/solutions/ for inbound links to it and fix any (likely none).

+

Test expectation: none — documentation.

+

Verification: file gone; no broken links to it.

+
+
+
+ +
+ +
+

Verification Contract

+
+
  • bun test: full suite green, including the repaired cleanup, conventions, converter, and release-metadata tests.
  • bun run release:validate: reports compound-engineering currently has 0 agents, 26 skills, ... and exits in sync (no metadata errors).
  • rg -n "ce-work-beta" .: apart from the cleanup registries and the sweep tests that exercise them, the only surviving matches are historical files in docs/plans/, CHANGELOG.md history, and this plan document. No live skill, README row, or test reading the deleted SKILL.md remains.
+
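The grep check above can be turned into a small allowlist filter, sketched here under stated assumptions. The input is assumed to be a list of file paths such as `rg -l "ce-work-beta" .` would print; `ALLOWED_PREFIXES` and `unexpectedMatches` are hypothetical names. The docs/plans/ and CHANGELOG.md entries come from the plan, while the src/ and tests/ entries are an assumption drawn from U2 and KTD3, which keep the name in the registries and sweep tests.

```typescript
// Paths where a surviving "ce-work-beta" match is expected (see the
// assumptions in the lead-in; adjust to the actual repo layout).
const ALLOWED_PREFIXES: readonly string[] = [
  "docs/plans/",
  "CHANGELOG.md",
  "src/utils/legacy-cleanup.ts",
  "src/data/plugin-legacy-artifacts.ts",
  "tests/legacy-cleanup.test.ts",
];

// Returns the matched files that are NOT covered by the allowlist, after
// stripping the "./" prefix that a search rooted at "." puts on each path.
function unexpectedMatches(files: readonly string[]): string[] {
  return files
    .map((file) => file.replace(/^\.\//, ""))
    .filter((file) => !ALLOWED_PREFIXES.some((prefix) => file.startsWith(prefix)));
}
```

An empty result means every remaining mention is in a place the plan deliberately leaves alone; any other deliberate mention would need its own allowlist entry.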
+
+ +
+ +
+

Definition of Done

+
+
  • R1–R9 satisfied.
  • bun test and bun run release:validate both green; release:validate reports 26 skills.
  • No reference to ce-work-beta as a live skill anywhere except historical plan docs, CHANGELOG history, and this plan.
  • The best-practices doc reads standalone; the promotion checklist is deleted.
  • No abandoned or experimental code left in the diff; changes land in one PR with a chore(ce-work-beta): title via the standard PR flow (no direct push to main).
+
+
+ + + +
+ + diff --git a/docs/solutions/best-practices/codex-delegation-best-practices.md b/docs/solutions/best-practices/codex-delegation-best-practices.md index ad5fefd75..154bb848e 100644 --- a/docs/solutions/best-practices/codex-delegation-best-practices.md +++ b/docs/solutions/best-practices/codex-delegation-best-practices.md @@ -18,16 +18,17 @@ tags: - batching - orchestration-cost - prompt-engineering - - ce-work-beta --- # Codex Delegation Best Practices ## Context -Over six iterations of evaluation building Codex delegation into `ce-work-beta`, we collected quantitative data on the token economics of orchestrating work between Claude Code (the orchestrator) and Codex (the delegated executor). The core question: when does delegating plan units to Codex actually save Claude tokens, and what architectural patterns control the cost? +> **Note:** This is a retrospective. The experimental delegation skill it studied (`ce-work-beta`) has since been removed from the plugin. The findings below are preserved as general guidance for designing external-model delegation in any orchestrator skill, not as documentation for a live feature. -The delegation model: `ce-work-beta` receives a plan with N implementation units, then decides whether to execute them directly (standard mode) or delegate them to Codex via `codex exec`. Delegation has a fixed orchestration overhead per batch (prompt file write, codex exec invocation, result classification, commit) of approximately 4-5k Claude tokens. Each unit of code Claude does not write saves roughly 3-5k tokens. The crossover depends on how many units are batched per delegation call. +Over six iterations of evaluation building Codex delegation into an experimental `ce-work` delegation mode, we collected quantitative data on the token economics of orchestrating work between Claude Code (the orchestrator) and Codex (the delegated executor). 
The core question: when does delegating plan units to Codex actually save Claude tokens, and what architectural patterns control the cost? + +The delegation model: the delegating skill receives a plan with N implementation units, then decides whether to execute them directly (standard mode) or delegate them to Codex via `codex exec`. Delegation has a fixed orchestration overhead per batch (prompt file write, codex exec invocation, result classification, commit) of approximately 4-5k Claude tokens. Each unit of code Claude does not write saves roughly 3-5k tokens. The crossover depends on how many units are batched per delegation call. The evaluation spanned iterations 1-6, testing small (1-2 units), medium (4 units), large (7 units), and extra-large (10 units) plans in both delegation and standard modes, with real code implementation and test verification in isolated worktrees. diff --git a/docs/solutions/skill-design/ce-work-beta-promotion-checklist.md b/docs/solutions/skill-design/ce-work-beta-promotion-checklist.md deleted file mode 100644 index 8b09bf4ca..000000000 --- a/docs/solutions/skill-design/ce-work-beta-promotion-checklist.md +++ /dev/null @@ -1,106 +0,0 @@ ---- -title: "ce-work-beta promotion needs manual-handoff cleanup and contract migration" -category: skill-design -date: 2026-03-31 -module: skills -component: SKILL.md -tags: - - skill-design - - beta-testing - - workflow - - rollout-safety -severity: medium -description: "Promoting ce-work-beta requires more than copying SKILL.md content: stable handoffs, contract tests, beta-only wording, and planning neutrality must all flip together." -related: - - docs/solutions/skill-design/beta-skills-framework.md - - docs/solutions/skill-design/beta-promotion-orchestration-contract.md ---- - -## Problem - -`ce-work-beta` is intentionally a manual-invocation beta skill. 
During beta, `ce-plan`, `ce-brainstorm`, `lfg`, and other workflow handoffs remain pointed at stable `ce-work` so the repo does not need to support two execution paths at once. - -That means promoting `ce-work-beta` to stable is not just a content copy. The rollout flips multiple contracts at once: - -- the active implementation surface moves from `ce-work-beta` to `ce-work` -- beta-only manual invocation caveats become wrong -- planner and workflow handoffs can start acknowledging the promoted path -- tests need to assert the stable surface, not the beta surface - -If those changes do not happen together, the repo ends up teaching the wrong skill, keeping stale beta caveats, or preserving duplicate active paths that drift apart. - -## Current Beta Limitation - -During beta, the intended behavior is: - -- `ce-work-beta` contains the experimental implementation -- users invoke `ce-work-beta` manually when they want the new behavior -- `ce-plan` stays neutral and continues to offer stable `ce-work` -- workflow orchestrators stay pointed at stable `ce-work` - -This limitation is deliberate. It avoids pushing beta-specific branching into every planning and orchestration surface. - -## Promotion Checklist - -When `ce-work-beta` is ready to promote: - -1. Copy the validated implementation from `skills/ce-work-beta/SKILL.md` into `skills/ce-work/SKILL.md`. -2. Restore stable frontmatter on `ce-work`: - - stable `name:` - - stable description without `[BETA]` - - remove `disable-model-invocation: true` -3. Remove beta-only manual invocation wording from the promoted stable skill. -4. Rework or remove `ce-work-beta` so it no longer looks like an active parallel implementation: - - delete it, or - - reduce it to a thin redirect/deprecation note -5. Update planning and workflow handoffs atomically: - - `ce-plan` - - `ce-brainstorm` - - any other skills or workflows that recommend or invoke `ce-work` -6. 
Revisit planner wording so it can safely mention the promoted stable behavior if needed. -7. Move contract tests from the beta surface to the stable surface. -8. Re-run release validation and any workflow-level tests that exercise the handoff chain. - -## Unique Gotchas - -### Manual-invocation caveats must be removed - -The beta skill intentionally says it must be invoked manually and that handoffs remain pointed at stable `ce-work`. After promotion, that wording becomes false and will actively mislead users. - -### `ce-plan` should stay neutral during beta, then flip intentionally - -While beta is manual-only, `ce-plan` should not teach beta-only invocation details. After promotion, the planner can acknowledge the promoted stable path, but that should happen in the promotion PR, not earlier. - -### Test ownership must migrate - -During beta, contract tests should assert delegation behavior on `ce-work-beta`. After promotion, those assertions belong on `ce-work`. Copying the skill content without moving the tests leaves the wrong surface protected. - -### Do not leave two active delegation paths - -If both `ce-work` and `ce-work-beta` retain live delegation logic after promotion, they will drift. Promotion should end with exactly one canonical implementation surface. - -### Promotion is both a beta-to-stable change and an orchestration change - -This promotion is unusual because the beta skill was intentionally isolated from workflow handoffs. The promotion PR must therefore do both: - -- normal beta-to-stable file/content promotion -- workflow contract cleanup now that the stable surface can own the feature - -See `docs/solutions/skill-design/beta-promotion-orchestration-contract.md` for the caller-update principle. 
- -## Verification - -Before merging the promotion PR, confirm: - -- stable `ce-work` contains the implementation -- `ce-work-beta` no longer reads like the active implementation path -- no beta-only manual invocation caveats remain on the stable path -- workflow handoffs point where intended -- contract tests assert the right surface -- release validation passes - -## Prevention - -- Treat `ce-work-beta` promotion as a coordinated workflow change, not just a text replacement. -- Update skill content, planner wording, workflow handoffs, and tests in the same PR. -- Leave a durable note like this one at beta time so later promotion work does not rely on memory. diff --git a/skills/ce-work-beta/SKILL.md b/skills/ce-work-beta/SKILL.md deleted file mode 100644 index 5e922ddf6..000000000 --- a/skills/ce-work-beta/SKILL.md +++ /dev/null @@ -1,438 +0,0 @@ ---- -name: ce-work-beta -description: "[BETA] Execute ce-work with external delegate support." -disable-model-invocation: true -argument-hint: "[Plan doc path or description of work. Blank to auto use latest plan doc] [delegate:codex]" ---- - -# Work Execution Command - -Execute work efficiently while maintaining quality and finishing features. - -## Introduction - -This command takes a work document (plan or specification) or a bare prompt describing the work, and executes it systematically. The focus is on **shipping complete features** by understanding requirements quickly, following existing patterns, and maintaining quality throughout. - -**Beta rollout note:** Invoke `ce-work-beta` manually when you want to trial Codex delegation. During the beta period, planning and workflow handoffs remain pointed at stable `ce-work` to avoid dual-path orchestration complexity. - -## Input Document - - #$ARGUMENTS - -## Argument Parsing - -Parse `$ARGUMENTS` for the following optional tokens. Strip each recognized token before interpreting the remainder as the plan file path or bare prompt. 
- -| Token | Example | Effect | -|-------|---------|--------| -| `delegate:codex` | `delegate:codex` | Activate Codex delegation mode for plan execution | -| `delegate:local` | `delegate:local` | Deactivate delegation even if enabled in config | - -All tokens are optional. When absent, fall back to the resolution chain below. - -**Fuzzy activation:** Also recognize imperative delegation-intent phrases such as "use codex", "delegate to codex", "codex mode", or "delegate mode" as equivalent to `delegate:codex`. A bare mention of "codex" in a prompt (e.g., "fix codex converter bugs") must NOT activate delegation -- only clear delegation intent triggers it. - -**Fuzzy deactivation:** Also recognize phrases such as "no codex", "local mode", "standard mode" as equivalent to `delegate:local`. - -### Settings Resolution Chain - -After extracting tokens from arguments, resolve the delegation state using this precedence chain: - -1. **Argument flag** -- `delegate:codex` or `delegate:local` from the current invocation (highest priority) -2. **Config file** -- extract settings from the config block below. Value `codex` for `work_delegate` activates delegation; `false` deactivates. -3. **Hard default** -- `false` (delegation off) - -**Read config.** The repo root is pre-resolved at skill load: -!`git rev-parse --show-toplevel 2>/dev/null || true` - -If the line above is an absolute path, use it as ``. If it is empty or still shows a backtick command string (a non-Claude harness that did not run the pre-resolution), resolve `` at runtime by running `git rev-parse --show-toplevel` with the shell tool. Then read `/.compound-engineering/config.local.yaml` with the native file-read tool (e.g., Read in Claude Code, read_file in Codex). If the root cannot be resolved or the file does not exist, all settings fall through to defaults. Otherwise extract values for the keys listed below. - -If any setting has an unrecognized value, fall through to the hard default for that setting. 
For optional settings without a hard default (`work_delegate_model`, `work_delegate_effort`), an unrecognized or unparseable value resolves to **unset** — the corresponding flag is omitted from the `codex exec` invocation so Codex resolves from `~/.codex/config.toml`. Never substitute an invalid value into the CLI flags. - -Config keys: -- `work_delegate` -- `codex` or default `false` -- `work_delegate_consent` -- `true` or default `false` -- `work_delegate_sandbox` -- `yolo` (default) or `full-auto` -- `work_delegate_decision` -- `auto` (default) or `ask` -- `work_delegate_model` -- Codex model to use. Optional — when unset or unparseable, defers to the user's `~/.codex/config.toml` default. Passthrough — any non-empty string is accepted as valid; only YAML parse failures or empty values resolve to unset. -- `work_delegate_effort` -- one of `minimal`, `low`, `medium`, `high`, or `xhigh`. Optional — when unset or set to a value outside this enum, resolves to unset and defers to the user's `~/.codex/config.toml` default. - -Store the resolved state for downstream consumption: -- `delegation_active` -- boolean, whether delegation mode is on -- `delegation_source` -- `argument` or `config` or `default` -- how delegation was resolved (used by environment guard to decide notification verbosity) -- `sandbox_mode` -- `yolo` or `full-auto` (from config or default `yolo`) -- `consent_granted` -- boolean (from config `work_delegate_consent`) -- `delegate_model` -- string from config, or unset (defer to Codex config) -- `delegate_effort` -- string from config, or unset (defer to Codex config). Floor for per-batch effort selection; not passed directly to `codex exec`. -- `effective_effort` -- per-batch derived value (`default | medium | high | xhigh`), computed before each batch from `delegate_effort` and the picked level per `references/codex-delegation-workflow.md` ("Per-Batch Effort"). Feeds the `codex exec` invocation in place of `delegate_effort`. 
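The precedence chain described above (argument flag, then config, then hard default) and the effort enum can be sketched as follows. This is an illustration only: the function and type names are hypothetical, and the inputs are assumed to be an already-extracted argument token and already-parsed config values.

```typescript
type DelegateToken = "delegate:codex" | "delegate:local";
type DelegationSource = "argument" | "config" | "default";
type Effort = "minimal" | "low" | "medium" | "high" | "xhigh";

interface DelegationState {
  delegationActive: boolean;
  delegationSource: DelegationSource;
}

const EFFORT_VALUES: readonly string[] = ["minimal", "low", "medium", "high", "xhigh"];

// Resolves delegation using the documented precedence:
// argument flag > config file > hard default (off).
function resolveDelegation(token: DelegateToken | undefined, workDelegate: unknown): DelegationState {
  // 1. An argument flag from the current invocation always wins.
  if (token === "delegate:codex") return { delegationActive: true, delegationSource: "argument" };
  if (token === "delegate:local") return { delegationActive: false, delegationSource: "argument" };
  // 2. Config: "codex" activates, false deactivates.
  if (workDelegate === "codex") return { delegationActive: true, delegationSource: "config" };
  if (workDelegate === false) return { delegationActive: false, delegationSource: "config" };
  // 3. Missing or unrecognized values fall through to the hard default.
  return { delegationActive: false, delegationSource: "default" };
}

// Optional setting without a hard default: anything outside the enum resolves
// to unset (undefined), so the corresponding CLI flag would be omitted.
function resolveEffort(value: unknown): Effort | undefined {
  return typeof value === "string" && EFFORT_VALUES.indexOf(value) !== -1
    ? (value as Effort)
    : undefined;
}
```

Returning `undefined` for an invalid effort, rather than substituting a fallback, mirrors the rule that an invalid value is never passed into the CLI flags.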
- ---- - -## Execution Workflow - -### Phase 0: Input Triage - -Determine how to proceed based on what was provided in ``. - -**Plan document** (input is a file path to an existing plan or specification): read the plan's metadata first — YAML frontmatter for a markdown plan, or the visible header text for an HTML plan (both formats carry the same fields). If it carries `execution: knowledge-work`, this is a **non-code plan** — read `references/non-code-execution.md` and follow that carve-out instead of the rest of this workflow. Otherwise (the field is absent or `execution: code`) → skip to Phase 1 and run the normal code lifecycle. (The marker check lives here, inside plan-document handling, because detecting the marker requires already having a file; "Bare prompt" below is unaffected.) - -**Bare prompt** (input is a description of work, not a file path): - -1. **Scan the work area** - - - Identify files likely to change based on the prompt - - Find existing test files for those areas (search for test/spec files that import, reference, or share names with the implementation files) - - Note local patterns and conventions in the affected areas - -2. **Assess complexity and route** - - | Complexity | Signals | Action | - |-----------|---------|--------| - | **Trivial** | 1-2 files, no behavioral change (typo, config, rename) | Proceed to Phase 1 step 2 (environment setup), then implement directly — no task list, no execution loop. Apply Test Discovery if the change touches behavior-bearing code | - | **Small / Medium** | Clear scope, under ~10 files | Build a task list from discovery. Proceed to Phase 1 step 2 | - | **Large** | Cross-cutting, architectural decisions, 10+ files, touches auth/payments/migrations | Inform the user this would benefit from `/ce-brainstorm` or `/ce-plan` to surface edge cases and scope boundaries. Honor their choice. If proceeding, build a task list and continue to Phase 1 step 2 | - ---- - -### Phase 1: Quick Start - -1. 
**Read Plan and Clarify** _(skip if arriving from Phase 0 with a bare prompt)_ - - - Read the work document completely - - Treat the plan as a decision artifact, not an execution script - - If the plan includes sections such as `Implementation Units`, `Work Breakdown`, `Requirements` (or legacy `Requirements Trace`), `Files`, `Test Scenarios`, or `Verification`, use those as the primary source material for execution - - Check for `Execution note` on each implementation unit — these carry the plan's execution posture signal for that unit (for example, test-first or characterization-first). Note them when creating tasks. - - Check for a `Deferred to Implementation` or `Implementation-Time Unknowns` section — these are questions the planner intentionally left for you to resolve during execution. Note them before starting so they inform your approach rather than surprising you mid-task - - Check for a `Scope Boundaries` section — these are explicit non-goals. Refer back to them if implementation starts pulling you toward adjacent work - - Review any references or links provided in the plan - - If the user explicitly asks for TDD, test-first, or characterization-first execution in this session, honor that request even if the plan has no `Execution note` - - If anything is unclear or ambiguous, ask clarifying questions now - - If clarifying questions were needed above, get user approval on the resolved answers. If no clarifications were needed, proceed without a separate approval step — plan scope is the plan's authority, not something to renegotiate - - **Do not skip this** - better to ask questions now than build the wrong thing - - **Do not edit the plan body during execution.** The plan is a decision artifact; progress lives in git commits and the task tracker, not the plan. `ce-work` does not mutate the plan — whether it shipped is derived from git, not recorded in the doc. 
Legacy plans may contain `- [ ]` / `- [x]` marks on unit headings or a `status:` field — ignore them as state; per-unit completion is determined during execution by reading the current file state. - -2. **Setup Environment** - - First, check the current branch: - - ```bash - current_branch=$(git branch --show-current) - default_branch=$(git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's@^refs/remotes/origin/@@') - - # Fallback if remote HEAD isn't set - if [ -z "$default_branch" ]; then - default_branch=$(git rev-parse --verify origin/main >/dev/null 2>&1 && echo "main" || echo "master") - fi - ``` - - **If already on a feature branch** (not the default branch): - - First, check whether the branch name is **meaningful** — a name like `feat/crowd-sniff` or `fix/email-validation` tells future readers what the work is about. Auto-generated worktree names (e.g., `worktree-jolly-beaming-raven`) or other opaque names do not. - - If the branch name is meaningless or auto-generated, suggest renaming it before continuing: - ```bash - git branch -m - ``` - Derive the new name from the plan title or work description (e.g., `feat/crowd-sniff`). Present the rename as a recommended option alongside continuing as-is. - - Then ask: "Continue working on `[current_branch]`, or create a new branch?" - - If continuing (with or without rename), proceed to step 3 - - If creating new, follow Option A or B below - - **If on the default branch**, choose how to proceed: - - **Option A: Create a new branch** - ```bash - git pull origin [default_branch] - git checkout -b feature-branch-name - ``` - Use a meaningful name based on the work (e.g., `feat/user-authentication`, `fix/email-validation`). 
- - **Option B: Use a worktree (recommended for parallel development)** - ```bash - skill: ce-worktree - # Ensures isolation: detects an existing worktree, prefers the harness's - # native worktree tool, else creates one from the default branch - ``` - - **Option C: Continue on the default branch** - - Requires explicit user confirmation - - Only proceed after user explicitly says "yes, commit to [default_branch]" - - Never commit directly to the default branch without explicit permission - - **Recommendation**: Use worktree if: - - You want to work on multiple features simultaneously - - You want to keep the default branch clean while experimenting - - You plan to switch between branches frequently - -3. **Create Task List** _(skip if Phase 0 already built one, or if Phase 0 routed as Trivial)_ - - Use the platform's task tracking tool (`TaskCreate`/`TaskUpdate`/`TaskList` in Claude Code, `update_plan` in Codex, or the equivalent on other harnesses) to break the plan into actionable tasks - - Derive tasks from the plan's implementation units, dependencies, files, test targets, and verification criteria - - When the plan defines U-IDs for Implementation Units, preserve the unit's U-ID as a prefix in the task subject (e.g., "U3: Add parser coverage"). 
This keeps blocker references, deferred-work notes, and final summaries anchored to the same identifier the plan uses, so progress and traceability remain unambiguous across plan edits - - Carry each unit's `Execution note` into the task when present - - For each unit, read the `Patterns to follow` field before implementing — these point to specific files or conventions to mirror - - Use each unit's `Verification` field as the primary "done" signal for that task - - Do not expect the plan to contain implementation code, micro-step TDD instructions, or exact shell commands - - Include dependencies between tasks - - Prioritize based on what needs to be done first - - Include testing and quality check tasks - - Keep tasks specific and completable - -4. **Choose Execution Strategy** - - **Delegation routing gate:** If `delegation_active` is true AND the input is a plan file (not a bare prompt), read `references/codex-delegation-workflow.md` and follow its Pre-Delegation Checks and Delegation Decision flow. If all checks pass and delegation proceeds, force **serial execution** and proceed directly to Phase 2 using the workflow's batched execution loop. If any check disables delegation, fall through to the standard strategy table below. If delegation is active but the input is a bare prompt (no plan file), set `delegation_active` to false with a brief note: "Codex delegation requires a plan file -- using standard mode." and continue with the standard strategy selection below. - - After creating the task list, decide how to execute based on the plan's size and dependency structure: - - | Strategy | When to use | - |----------|-------------| - | **Inline** | 1-2 small tasks, or tasks needing user interaction mid-flight. **Default for bare-prompt work** — bare prompts rarely produce enough structured context to justify subagent dispatch | - | **Serial subagents** | 3+ tasks with dependencies between them. 
Each subagent gets a fresh context window focused on one unit — prevents context degradation across many tasks. Requires plan-unit metadata (Goal, Files, Approach, Test scenarios) | - | **Parallel subagents** | 3+ tasks that pass the Parallel Safety Check (below). Dispatch independent units simultaneously, run dependent units after their prerequisites complete. Requires plan-unit metadata | - - **Parallel Safety Check** — required before choosing parallel dispatch: - - 1. Build a file-to-unit mapping from every candidate unit's `Files:` section (Create, Modify, and Test paths) - 2. Check for intersection — any file path appearing in 2+ units means overlap - 3. **If overlap is found AND worktree isolation is unavailable**: downgrade to serial subagents. Log the reason (e.g., "Units 2 and 4 share `config/routes.rb` — using serial dispatch"). Serial subagents still provide context-window isolation without shared-directory write races. - 4. **If overlap is found AND worktree isolation is available**: parallel dispatch is still safe — subagents work in isolation, and the overlap surfaces as a predictable merge conflict the orchestrator handles via the post-batch flow below. Log the predicted overlap so the post-batch flow knows which merges to expect conflicts on. - - Even with no file overlap, parallel subagents sharing the orchestrator's working directory face git index contention (concurrent staging/committing corrupts the index) and test interference (concurrent test runs pick up each other's in-progress changes). Worktree isolation eliminates both; the shared-directory fallback constraints below mitigate them. - - **Subagent isolation** — give each parallel subagent its own working tree: - - **Claude Code (`Agent` tool):** pass `isolation: "worktree"` and `run_in_background: true`. The harness creates a per-subagent worktree under `.claude/worktrees/agent-` on its own branch. Verify `.claude/worktrees/` is gitignored before relying on this. 
- - **Other platforms** without built-in worktree isolation: subagents share the orchestrator's directory. - - **Subagent dispatch** uses your available subagent or task spawning mechanism. For each unit, give the subagent: - - The full plan file path (for overall context) - - The specific unit's Goal, Files, Approach, Execution note, Patterns, Test scenarios, and Verification - - Any resolved deferred questions relevant to that unit - - Instruction to check whether the unit's test scenarios cover all applicable categories (happy paths, edge cases, error paths, integration) and supplement gaps before writing tests - - **Shared-directory fallback constraints** — apply only when worktree isolation is unavailable: - - Instruct each subagent: "Do not stage files (`git add`), create commits, or run the project test suite. The orchestrator handles testing, staging, and committing after all parallel units complete." - - These constraints prevent git index contention and test interference between concurrent subagents. - - With worktree isolation active, omit these constraints — subagents may stage, commit, and run their unit's tests within their own worktree branch. - - **Permission mode:** Omit the `mode` parameter when dispatching subagents so the user's configured permission settings apply. Do not pass `mode: "auto"` — it overrides user-level settings like `bypassPermissions`. - - **After each subagent completes (serial mode):** - 1. Review the subagent's diff — verify changes match the unit's scope and `Files:` list - 2. Run the relevant test suite to confirm the tree is healthy - 3. If tests fail, diagnose and fix before proceeding — do not dispatch dependent units on a broken tree - 4. Update the task list (do not edit the plan body — progress is carried by the commit) - 5. Dispatch the next unit - - **After all parallel subagents in a batch complete (worktree-isolated mode):** - 1. Wait for every subagent in the current parallel batch to finish. - 2. 
For each completed subagent, in dependency order: review the worktree's diff against the orchestrator's branch. If the subagent did not commit its own work, stage and commit it inside that worktree. - 3. Merge each subagent's branch into the orchestrator's branch sequentially in dependency order. **If a merge conflict surfaces, abort the merge (`git merge --abort`) and re-dispatch the conflicting unit serially against the now-merged tree** — hand-resolving silently picks a side and discards one unit's intent. (Predicted overlap from the Parallel Safety Check surfaces here as a conflict, not as silent data loss in shared-directory mode.) - 4. After each merge, run the relevant test suite. If tests fail, diagnose and fix before merging the next branch. - 5. Update the task list (progress is carried by the merge commits). - 6. After merging, remove each subagent's worktree and delete its branch. Use the absolute path and branch name returned in the subagent's result. - - Unlock the worktree first — the harness locks per-subagent worktrees: `git worktree unlock ` - - Remove the worktree: `git worktree remove ` - - Delete the branch: `git branch -d ` (the branch outlives the worktree by default and accumulates as orphans if not cleaned up; `-d` lowercase refuses to delete unmerged branches, which is the safety we want — if it fails, investigate before forcing) - 7. Dispatch the next batch of independent units, or the next dependent unit. - - **After all parallel subagents in a batch complete (shared-directory fallback):** - 1. Wait for every subagent in the current parallel batch to finish before acting on any of their results - 2. Cross-check for discovered file collisions: compare the actual files modified by all subagents in the batch (not just their declared `Files:` lists). Subagents may create or modify files not anticipated during planning — this is expected, since plans describe *what* not *how*. 
A collision only matters when 2+ subagents in the same batch modified the same file. In a shared working directory, only the last writer's version survives — the other unit's changes to that file are lost. If a collision is detected: commit all non-colliding files from all units first, then re-run the affected units serially for the shared file so each builds on the other's committed work - 3. For each completed unit, in dependency order: review the diff, run the relevant test suite, stage only that unit's files, and commit with a conventional message derived from the unit's Goal - 4. If tests fail after committing a unit's changes, diagnose and fix before committing the next unit - 5. Update the task list (do not edit the plan body — progress is carried by the commits just made) - 6. Dispatch the next batch of independent units, or the next dependent unit - -### Phase 2: Execute - -1. **Task Execution Loop** - - For each task in priority order: - - ``` - while (tasks remain): - - Mark task as in-progress - - Read any referenced files from the plan or discovered during Phase 0 - - **If the unit's work is already present and matches the plan's intent** (files exist with the expected capability, or the unit's `Verification` criteria are already satisfied by the current code), the work has likely shipped on a prior branch or session. Verify it matches, mark the task complete, and move on. Do not silently reimplement. - - Look for similar patterns in codebase - - Find existing test files for implementation files being changed (Test Discovery — see below) - - If delegation_active: branch to the Codex Delegation Execution Loop - (see `references/codex-delegation-workflow.md`) - - Otherwise: implement following existing conventions - - Add, update, or remove tests to match implementation changes (see Test Discovery below) - - Run System-Wide Test Check (see below) - - Run tests after changes - - Assess testing coverage: did this task change behavior? 
If yes, were tests written or updated? If no tests were added, is the justification deliberate (e.g., pure config, no behavioral change)? - - Mark task as completed - - Evaluate for incremental commit (see below) - ``` - - When a unit carries an `Execution note`, honor it. For test-first units, write the failing test before implementation for that unit. For characterization-first units, capture existing behavior before changing it. For units without an `Execution note`, proceed pragmatically. - - Guardrails for execution posture: - - Do not write the test and implementation in the same step when working test-first - - Do not skip verifying that a new test fails before implementing the fix or feature - - Do not over-implement beyond the current behavior slice when working test-first - - Skip test-first discipline for trivial renames, pure configuration, and pure styling work - - **Test Discovery** — Before implementing changes to a file, find its existing test files (search for test/spec files that import, reference, or share naming patterns with the implementation file). When a plan specifies test scenarios or test files, start there, then check for additional test coverage the plan may not have enumerated. Changes to implementation files should be accompanied by corresponding test updates — new tests for new behavior, modified tests for changed behavior, removed or updated tests for deleted behavior. - - **Test Scenario Completeness** — Before writing tests for a feature-bearing unit, check whether the plan's `Test scenarios` cover all categories that apply to this unit. 
If a category is missing or scenarios are vague (e.g., "validates correctly" without naming inputs and expected outcomes), supplement from the unit's own context before writing tests: - - | Category | When it applies | How to derive if missing | - |----------|----------------|------------------------| - | **Happy path** | Always for feature-bearing units | Read the unit's Goal and Approach for core input/output pairs | - | **Edge cases** | When the unit has meaningful boundaries (inputs, state, concurrency) | Identify boundary values, empty/nil inputs, and concurrent access patterns | - | **Error/failure paths** | When the unit has failure modes (validation, external calls, permissions) | Enumerate invalid inputs the unit should reject, permission/auth denials it should enforce, and downstream failures it should handle | - | **Integration** | When the unit crosses layers (callbacks, middleware, multi-service) | Identify the cross-layer chain and write a scenario that exercises it without mocks | - - **System-Wide Test Check** — Before marking a task done, pause and ask: - - | Question | What to do | - |----------|------------| - | **What fires when this runs?** Callbacks, middleware, observers, event handlers — trace two levels out from your change. | Read the actual code (not docs) for callbacks on models you touch, middleware in the request chain, `after_*` hooks. | - | **Do my tests exercise the real chain?** If every dependency is mocked, the test proves your logic works *in isolation* — it says nothing about the interaction. | Write at least one integration test that uses real objects through the full callback/middleware chain. No mocks for the layers that interact. | - | **Can failure leave orphaned state?** If your code persists state (DB row, cache, file) before calling an external service, what happens when the service fails? Does retry create duplicates? | Trace the failure path with real objects. 
If state is created before the risky call, test that failure cleans up or that retry is idempotent. | - | **What other interfaces expose this?** Mixins, DSLs, alternative entry points (Agent vs Chat vs ChatMethods). | Grep for the method/behavior in related classes. If parity is needed, add it now — not as a follow-up. | - | **Do error strategies align across layers?** Retry middleware + application fallback + framework error handling — do they conflict or create double execution? | List the specific error classes at each layer. Verify your rescue list matches what the lower layer actually raises. | - - **When to skip:** Leaf-node changes with no callbacks, no state persistence, no parallel interfaces. If the change is purely additive (new helper method, new view partial), the check takes 10 seconds and the answer is "nothing fires, skip." - - **When this matters most:** Any change that touches models with callbacks, error handling with fallback/retry, or functionality exposed through multiple interfaces. - - -2. **Incremental Commits** - - After completing each task, evaluate whether to create an incremental commit: - - | Commit when... | Don't commit when... | - |----------------|---------------------| - | Logical unit complete (model, service, component) | Small part of a larger unit | - | Tests pass + meaningful progress | Tests failing | - | About to switch contexts (backend → frontend) | Purely scaffolding with no behavior | - | About to attempt risky/uncertain changes | Would need a "WIP" commit message | - - **Heuristic:** "Can I write a commit message that describes a complete, valuable change? If yes, commit. If the message would be 'WIP' or 'partial X', wait." - - If the plan has Implementation Units, use them as a starting guide for commit boundaries — but adapt based on what you find during implementation. A unit might need multiple commits if it's larger than expected, or small related units might land together. 
Use each unit's Goal to inform the commit message. - - **Commit workflow:** - ```bash - # 1. Verify tests pass (use project's test command) - # Examples: bin/rails test, npm test, pytest, go test, etc. - - # 2. Stage only files related to this logical unit (not `git add .`) - git add - - # 3. Commit with conventional message - git commit -m "feat(scope): description of this unit" - ``` - - **Handling merge conflicts:** If conflicts arise during rebasing or merging, resolve them immediately. Incremental commits make conflict resolution easier since each commit is small and focused. - - **Note:** Incremental commits use clean conventional messages without attribution footers. The final Phase 4 commit/PR includes the full attribution. - - **Parallel subagent mode:** Commit ownership is split by isolation mode (see Phase 1 Step 4): - - **Worktree-isolated:** subagents may stage and commit inside their own worktree branch; the orchestrator merges those branches in dependency order after the batch. - - **Shared-directory fallback:** subagents do not commit; the orchestrator stages and commits each unit after the entire parallel batch completes. - -3. **Follow Existing Patterns** - - - The plan should reference similar code - read those files first - - Match naming conventions exactly - - Reuse existing components where possible - - Follow the project's coding standards already in your context - - When in doubt, grep for similar implementations - -4. **Test Continuously** - - - Run relevant tests after each significant change - - Don't wait until the end to test - - Fix failures immediately - - Add new tests for new behavior, update tests for changed behavior, remove tests for deleted behavior - - **Unit tests with mocks prove logic in isolation. Integration tests with real objects prove the layers work together.** If your change touches callbacks, middleware, or error handling — you need both. - -5. 
**Simplify as You Go** - - After completing a cluster of related implementation units (or every 2-3 units), review recently changed files for simplification opportunities — consolidate duplicated patterns, extract shared helpers, and improve code reuse and efficiency. This is especially valuable when using subagents, since each agent works with isolated context and can't see patterns emerging across units. - - Don't simplify after every single unit — early patterns may look duplicated but diverge intentionally in later units. Wait for a natural phase boundary or when you notice accumulated complexity. - - If **`ce-simplify-code`** is available, invoke it at phase boundaries (especially before Phase 3 when the diff is >=30 lines). Otherwise, review the changed files yourself for reuse and consolidation opportunities. - -6. **Figma Design Sync** (if applicable) - - For UI work with Figma designs: - - - Implement components following design specs - - Read `references/agents/figma-design-sync.md` and dispatch a generic subagent seeded with that local prompt to compare implementation against the Figma design. Do not dispatch a standalone agent by type/name. - - Fix visual differences identified - - Repeat until implementation matches design - -7. **Frontend Design Guidance** (if applicable) - - For UI tasks without a Figma design -- where the implementation touches view, template, component, layout, or page files, creates user-visible routes, or the plan contains explicit UI/frontend/design language: - - - Apply the frontend guidance embedded in this skill and the active repo instructions: preserve existing design-system conventions, use real UI controls and states, keep layouts responsive, and verify text does not overflow or overlap. - - When browser tooling is available, inspect the changed UI at desktop and mobile widths before final validation. 
If no browser access is available, do a code-level responsive/layout review and record that browser verification was unavailable. - - Phase 4's screenshot capture still applies when the change is user-visible. - -8. **Track Progress** - - Keep the task list updated as you complete tasks - - Note any blockers or unexpected discoveries - - Create new tasks if scope expands - - Keep user informed of major milestones - - When the plan defines U-IDs for Implementation Units, or the plan or origin document carries stable R-IDs (and optionally A/F/AE IDs), reference them in blockers, deferred-work notes, task summaries, and final verification — not routine status updates. U-IDs anchor units across plan edits; R/A/F/AE anchor product intent across the brainstorm-plan handoff. Use the IDs the plan supplies and do not invent ones it does not. This preserves traceability without burying signal under noise. - -### Phase 3-4: Quality Check and Finishing Work - -When all Phase 2 tasks are complete and execution transitions to quality check, read `references/shipping-workflow.md` for the full shipping workflow: quality checks, code review, final validation, PR creation, and notification. - ---- - -## Codex Delegation Mode - -When `delegation_active` is true after argument parsing, read `references/codex-delegation-workflow.md` for the complete delegation workflow: pre-checks, batching, prompt template, execution loop, and result classification. 
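The delegation routing gate from Phase 1 Step 4 can be sketched as a small shell function. This is only an illustration under stated assumptions: `route_delegation` and its two arguments are hypothetical names, not part of the skill, and the Pre-Delegation Checks in `references/codex-delegation-workflow.md` can still switch the result back to standard mode afterwards.

```shell
# Hypothetical helper (not part of the skill): route_delegation <delegation_active> <input_kind>
#   delegation_active: "true" or "false"
#   input_kind:        "plan" (a plan file path was given) or "prompt" (bare prompt)
# Prints the resulting mode on stdout: "delegate" or "standard".
route_delegation() {
  local delegation_active="$1"
  local input_kind="$2"

  # Delegation was never requested: use the standard strategy table.
  if [ "$delegation_active" != "true" ]; then
    echo "standard"
    return 0
  fi

  # Delegation needs a plan file; a bare prompt falls back with a brief note.
  if [ "$input_kind" != "plan" ]; then
    echo "Codex delegation requires a plan file -- using standard mode." >&2
    echo "standard"
    return 0
  fi

  # Pre-Delegation Checks and the Delegation Decision still run after this
  # point, and any of them may disable delegation again.
  echo "delegate"
}
```

For example, `route_delegation true prompt` writes the note to stderr and reports standard mode on stdout, so a caller can capture the mode without parsing the message.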
- ---- - -## Key Principles - -### Start Fast, Execute Faster - -- Get clarification once at the start, then execute -- Don't wait for perfect understanding - ask questions and move -- The goal is to **finish the feature**, not to create a perfect process - -### The Plan is Your Guide - -- Work documents should reference similar code and patterns -- Load those references and follow them -- Don't reinvent - match what exists - -### Test As You Go - -- Run tests after each change, not at the end -- Fix failures immediately -- Continuous testing prevents big surprises - -### Quality is Built In - -- Review when Tier 1 is available or Tier 2 criteria match (see `shipping-workflow.md`) - -### Ship Complete Features - -- Mark all tasks completed before moving on -- Don't leave features 80% done -- A finished feature that ships beats a perfect feature that doesn't - -## Common Pitfalls to Avoid - -- **Analysis paralysis** - Don't overthink; read the plan and execute -- **Skipping clarifying questions** - Ask now, not after building the wrong thing -- **Ignoring plan references** - The plan has links for a reason -- **Testing at the end** - Test continuously or suffer later -- **Forgetting to track progress** - Update task status as you go or lose track of what's done -- **80% done syndrome** - Finish the feature, don't move on early -- **Skipping review without reason** - Use Tier 1 when available; escalate to Tier 2 only on criteria in `shipping-workflow.md`; document when both are skipped -- **Re-scoping the plan into human-time phases** - The plan's Implementation Units define the scope of execution. Do not estimate human-hours per unit, propose multi-day breakdowns, or ask the user to pick a subset of units for "this session". Agents execute at agent speed, and context-window pressure is addressed by subagent dispatch (Phase 1 Step 4), not by phased sessions. 
If a plan-file input is genuinely too large for a single execution, say so plainly and suggest the user return to `/ce-plan` to reduce scope — don't invent session phases as a workaround. For bare-prompt input, Phase 0's Large routing already handles oversized work diff --git a/skills/ce-work-beta/references/agents/figma-design-sync.md b/skills/ce-work-beta/references/agents/figma-design-sync.md deleted file mode 100644 index 9a4117913..000000000 --- a/skills/ce-work-beta/references/agents/figma-design-sync.md +++ /dev/null @@ -1,165 +0,0 @@ -You are an expert design-to-code synchronization specialist with deep expertise in visual design systems, web development, CSS/Tailwind styling, and automated quality assurance. Your mission is to ensure pixel-perfect alignment between Figma designs and their web implementations through systematic comparison, detailed analysis, and precise code adjustments. - -## Your Core Responsibilities - -1. **Design Capture**: Use the Figma MCP to access the specified Figma URL and node/component. Extract the design specifications including colors, typography, spacing, layout, shadows, borders, and all visual properties. Also take a screenshot and load it into the agent. - -2. **Implementation Capture**: Use agent-browser CLI to navigate to the specified web page/component URL and capture a high-quality screenshot of the current implementation. - - ```bash - agent-browser open [url] - agent-browser snapshot -i - agent-browser screenshot implementation.png - ``` - -3. 
**Systematic Comparison**: Perform a meticulous visual comparison between the Figma design and the screenshot, analyzing: - - - Layout and positioning (alignment, spacing, margins, padding) - - Typography (font family, size, weight, line height, letter spacing) - - Colors (backgrounds, text, borders, shadows) - - Visual hierarchy and component structure - - Responsive behavior and breakpoints - - Interactive states (hover, focus, active) if visible - - Shadows, borders, and decorative elements - - Icon sizes, positioning, and styling - - Max width, max height, etc. - -4. **Detailed Difference Documentation**: For each discrepancy found, document: - - - Specific element or component affected - - Current state in implementation - - Expected state from Figma design - - Severity of the difference (critical, moderate, minor) - - Recommended fix with exact values - -5. **Precise Implementation**: Make the necessary code changes to fix all identified differences: - - - Modify CSS/Tailwind classes following the responsive design patterns described below - - Prefer Tailwind default values when close to Figma specs (within 2-4px) - - Ensure components are full width (`w-full`) without max-width constraints - - Move any width constraints and horizontal padding to wrapper divs in parent HTML/ERB - - Update component props or configuration - - Adjust layout structures if needed - - Ensure changes follow the project's coding standards: use the conventions already in your context or, if you were dispatched without them, read the project's root agent-instruction file for this harness (e.g., `AGENTS.md`, `CLAUDE.md`, `GEMINI.md`, or `.cursor/rules`) - - Use mobile-first responsive patterns (e.g., `flex-col lg:flex-row`) - - Preserve dark mode support - -6. **Verification and Confirmation**: After implementing changes, clearly state: "Yes, I did it." followed by a summary of what was fixed. 
Also, if you worked on a component or element, check how it fits into the overall design and how it looks alongside the other parts of the design. It should flow with the surrounding elements, with a background and width that match them. - -## Responsive Design Patterns and Best Practices - -### Component Width Philosophy -- **Components should ALWAYS be full width** (`w-full`) and NOT contain `max-width` constraints -- **Components should NOT have padding** at the outer section level (no `px-*` on the section element) -- **All width constraints and horizontal padding** should be handled by wrapper divs in the parent HTML/ERB file - -### Responsive Wrapper Pattern -When wrapping components in parent HTML/ERB files, use: -```erb -
- <%= render SomeComponent.new(...) %> -
-``` - -This pattern provides: -- `w-full`: Full width on all screens -- `max-w-screen-xl`: Maximum width constraint (1280px, use Tailwind's default breakpoint values) -- `mx-auto`: Center the content -- `px-5 md:px-8 lg:px-[30px]`: Responsive horizontal padding - -### Prefer Tailwind Default Values -Use Tailwind's default spacing scale when the Figma design is close enough: -- **Instead of** `gap-[40px]`, **use** `gap-10` (40px) when appropriate -- **Instead of** `text-[45px]`, **use** `text-3xl` on mobile and `md:text-[45px]` on larger screens -- **Instead of** `text-[20px]`, **use** `text-lg` (18px) or `md:text-[20px]` -- **Instead of** `w-[56px] h-[56px]`, **use** `w-14 h-14` - -Only use arbitrary values like `[45px]` when: -- The exact pixel value is critical to match the design -- No Tailwind default is close enough (within 2-4px) - -Common Tailwind values to prefer: -- **Spacing**: `gap-2` (8px), `gap-4` (16px), `gap-6` (24px), `gap-8` (32px), `gap-10` (40px) -- **Text**: `text-sm` (14px), `text-base` (16px), `text-lg` (18px), `text-xl` (20px), `text-2xl` (24px), `text-3xl` (30px) -- **Width/Height**: `w-10` (40px), `w-14` (56px), `w-16` (64px) - -### Responsive Layout Pattern -- Use `flex-col lg:flex-row` to stack on mobile and go horizontal on large screens -- Use `gap-10 lg:gap-[100px]` for responsive gaps -- Use `w-full lg:w-auto lg:flex-1` to make sections responsive -- Don't use `flex-shrink-0` unless absolutely necessary -- Remove `overflow-hidden` from components - handle overflow at wrapper level if needed - -### Example of Good Component Structure -```erb - -
- <%= render SomeComponent.new(...) %> -
- - -
-
- -
-
-``` - -### Common Anti-Patterns to Avoid -**❌ DON'T do this in components:** -```erb - -
- -
-``` - -**✅ DO this instead:** -```erb - -
- -
-``` - -**❌ DON'T use arbitrary values when Tailwind defaults are close:** -```erb - -
-``` - -**✅ DO prefer Tailwind defaults:** -```erb - -
-``` - -## Quality Standards - -- **Precision**: Use exact values from Figma (e.g., "16px" not "about 15-17px"), but prefer Tailwind defaults when close enough -- **Completeness**: Address all differences, no matter how minor -- **Code Quality**: Follow the project's frontend conventions — from the project instructions already in your context, or its root agent-instruction file (e.g., `AGENTS.md`/`CLAUDE.md`/`GEMINI.md`/`.cursor/rules`) if they aren't already loaded -- **Communication**: Be specific about what changed and why -- **Iteration-Ready**: Design your fixes to allow the agent to run again for verification -- **Responsive First**: Always implement mobile-first responsive designs with appropriate breakpoints - -## Handling Edge Cases - -- **Missing Figma URL**: Request the Figma URL and node ID from the user -- **Missing Web URL**: Request the local or deployed URL to compare -- **MCP Access Issues**: Clearly report any connection problems with Figma or Playwright MCPs -- **Ambiguous Differences**: When a difference could be intentional, note it and ask for clarification -- **Breaking Changes**: If a fix would require significant refactoring, document the issue and propose the safest approach -- **Multiple Iterations**: After each run, suggest whether another iteration is needed based on remaining differences - -## Success Criteria - -You succeed when: - -1. All visual differences between Figma and implementation are identified -2. All differences are fixed with precise, maintainable code -3. The implementation follows project coding standards -4. You clearly confirm completion with "Yes, I did it." -5. The agent can be run again iteratively until perfect alignment is achieved - -Remember: You are the bridge between design and implementation. Your attention to detail and systematic approach ensures that what users see matches what designers intended, pixel by pixel. 
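The "Prefer Tailwind Default Values" rule in this file can be sketched as a nearest-value lookup. This is a rough illustration, not part of the agent prompt: `tailwind_gap` is a hypothetical helper, the class list is limited to the spacing values the file names, and the 4px default tolerance is an assumption taken from the upper end of the "within 2-4px" guidance.

```shell
# Hypothetical helper (not part of the prompt): tailwind_gap <px> [tolerance_px]
# Prints the nearest default gap class, or an arbitrary-value class when no
# default is within the tolerance (assumed default: 4px).
tailwind_gap() {
  local px="$1"
  local tol="${2:-4}"
  local best_class=""
  local best_diff=-1
  local entry class value diff

  # class:px pairs from the "Common Tailwind values to prefer" list
  for entry in gap-2:8 gap-4:16 gap-6:24 gap-8:32 gap-10:40; do
    class="${entry%%:*}"
    value="${entry##*:}"

    # Absolute distance between the Figma value and this default
    if [ "$px" -ge "$value" ]; then
      diff=$(( px - value ))
    else
      diff=$(( value - px ))
    fi

    # Keep the closest default seen so far (first one wins on a tie)
    if [ "$best_diff" -lt 0 ] || [ "$diff" -lt "$best_diff" ]; then
      best_diff="$diff"
      best_class="$class"
    fi
  done

  if [ "$best_diff" -le "$tol" ]; then
    echo "$best_class"
  else
    echo "gap-[${px}px]"
  fi
}
```

With these assumptions, a 38px gap from Figma maps to `gap-10`, while a 100px gap stays an arbitrary value, which matches the `gap-10 lg:gap-[100px]` usage in the Responsive Layout Pattern.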
diff --git a/skills/ce-work-beta/references/codex-delegation-workflow.md b/skills/ce-work-beta/references/codex-delegation-workflow.md deleted file mode 100644 index 04d2fdcd0..000000000 --- a/skills/ce-work-beta/references/codex-delegation-workflow.md +++ /dev/null @@ -1,394 +0,0 @@ -# Codex Delegation Workflow - -When `delegation_active` is true, code implementation is delegated to the Codex CLI (`codex exec`) instead of being implemented directly. The orchestrating Claude Code agent retains control of planning, review, git operations, and orchestration. - -## Delegation Decision - -If `work_delegate_decision` is `ask`, present the recommendation and wait for the user's choice before proceeding. - -**When recommending Codex delegation:** - -> "Codex delegation active. [N] implementation units -- delegating in one batch." -> 1. Delegate to Codex *(recommended)* -> 2. Execute with Claude Code instead - -**When recommending Codex delegation, multiple batches:** - -> "Codex delegation active. [N] implementation units -- delegating in [X] batches." -> 1. Delegate to Codex *(recommended)* -> 2. Execute with Claude Code instead - -**When recommending Claude Code (all units are trivial):** - -> "Codex delegation active, but these are small changes where the cost of delegating outweighs having Claude Code do them." -> 1. Execute with Claude Code *(recommended)* -> 2. Delegate to Codex anyway - -If the user chooses the delegation option, proceed to Pre-Delegation Checks below. If the user chooses the Claude Code option, set `delegation_active` to false and return to standard execution in the parent skill. - -If `work_delegate_decision` is `auto` (the default), state the execution plan in one line and proceed without waiting: "Codex delegation active. Delegating [N] units in [X] batch(es)." If all units are trivial, set `delegation_active` to false and proceed: "Codex delegation active. All units are trivial -- executing with Claude Code." 
- -## Pre-Delegation Checks - -Run these checks **once before the first batch**. If any check fails, fall back to standard mode for the remainder of the plan execution. Do not re-run on subsequent batches. - -**0. Platform Gate** - -Codex delegation is only supported when the orchestrating agent is running in Claude Code. If the current session is Codex, Antigravity CLI (`agy`), OpenCode, or any other platform, set `delegation_active` to false and proceed in standard mode. - -**1. Environment Guard** - -Check whether the current agent is already running inside a Codex sandbox: - -```bash -if [ -n "$CODEX_SANDBOX" ] || [ -n "$CODEX_SESSION_ID" ]; then - echo "inside_sandbox=true" -else - echo "inside_sandbox=false" -fi -``` - -If `inside_sandbox` is true, delegation would recurse or fail. - -- If `delegation_source` is `argument`: emit "Already inside Codex sandbox -- using standard mode." and set `delegation_active` to false. -- If `delegation_source` is `config` or `default`: set `delegation_active` to false silently. - -**2. Availability Check** - -**Codex CLI path (pre-resolved):** -!`command -v codex 2>/dev/null || true` - -If the line above shows an absolute path (starts with `/`, e.g., `/opt/homebrew/bin/codex`), the Codex CLI is available — proceed to the next check. -Otherwise — empty, an unresolved command string like `command -v codex 2>/dev/null` left in place by a non-Claude harness that doesn't process `!` pre-resolution, or any other non-path value — run `command -v codex` via the shell/Bash tool to verify at runtime. If that prints an absolute path, the Codex CLI is available; proceed. If it fails or prints nothing, emit "Codex CLI not found (install via `npm install -g @openai/codex` or `brew install codex`) -- using standard mode." and set `delegation_active` to false. - -**3. 
Consent Flow** - -If `consent_granted` is not true (from config `work_delegate_consent`): - -Present a one-time consent warning using the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_question` in Antigravity CLI (`agy`), `ask_user` in Pi (requires the `pi-ask-user` extension)). The consent warning explains: -- Delegation sends implementation units to `codex exec` as a structured prompt -- **yolo mode** (`--dangerously-bypass-approvals-and-sandbox`): Full system access including network. Required for verification steps that run tests or install dependencies. **Recommended.** -- **full-auto mode** (`-s workspace-write`): Workspace-write sandbox, no network access by default. Network can be re-enabled by setting `network_access = true` under `[sandbox_workspace_write]` in `~/.codex/config.toml`. - -Present the sandbox mode choice: (1) yolo (recommended), (2) full-auto. - -On acceptance: -- Resolve the repo root: `git rev-parse --show-toplevel`. Write `work_delegate_consent: true` and `work_delegate_sandbox: ` to `/.compound-engineering/config.local.yaml` -- To write: (1) if file or directory does not exist, create `/.compound-engineering/` and write the YAML file; (2) if file exists, merge new keys preserving existing keys -- Update `consent_granted` and `sandbox_mode` in the resolved state - -On decline: -- Ask whether to disable delegation entirely for this project -- If yes: write `work_delegate: false` to `/.compound-engineering/config.local.yaml` (using the same repo root resolved above). To write: (1) if file or directory does not exist, create `/.compound-engineering/` and write the YAML file; (2) if file exists, merge new keys preserving existing keys. 
Set `delegation_active` to false, proceed in standard mode -- If no: set `delegation_active` to false for this invocation only, proceed in standard mode - -**Headless consent:** If running in a headless or non-interactive context, delegation proceeds only if `work_delegate_consent` is already `true` in the config file. If consent is not recorded, set `delegation_active` to false silently. - -## Batching - -Delegate all units in one batch. If the plan exceeds 5 units, split into batches at the plan's own phase boundaries, or in groups of roughly 5 -- never splitting units that share files. Skip delegation entirely if every unit is trivial. - -## Per-Batch Effort - -Each batch picks an effort level proportional to its complexity, then resolves against the config floor before invocation. - -**Effort levels — guidelines, not predicates** - -Pick the level that best fits the batch. These are signals to weigh, not boxes to tick — use judgment. - -- **default (no flag)** — trivial work with no behavioral change: a one-line config tweak, a rename, a typo or comment-only fix, a pure documentation update. Defers to the user's `~/.codex/config.toml` default (which is `medium` on a stock Codex install). -- **`medium`** — small, well-scoped behavioral changes that stay clear of high-risk areas. A handful of files, a single concern, no novel architecture. -- **`high`** — work that touches a high-risk area (auth/session logic, payments, database migrations, external API contracts, error handling with retries/fallbacks), or work spanning enough surface area that one mistake could cascade. -- **`xhigh`** — architectural work: cross-cutting refactors, multiple high-risk areas in the same batch, changes that propagate broadly, or anywhere a wrong call meaningfully degrades the project. - -When in doubt, lean up one level — under-resourcing risky work costs more than over-resourcing routine work. 
Briefly note the picked level and the signal that drove it (e.g., "`high` — touches db/migrations") so the choice is auditable. - -A few edge cases worth handling explicitly: -- **Test-only batches:** classify by what the tests *exercise*, not by file paths. Tests for auth flows, payment logic, or migrations get the same level the equivalent implementation work would get. -- **Mixed-complexity batches:** the batch picks one level. If a single batch combines a typo unit and a payments rewrite, pick the higher level. If the spread feels wasteful, prefer splitting at the batching step (see Batching above) over averaging it out. -- **Deletion-only batches:** classify by the risk of what is being removed, not by counts of remaining content. Removing an auth module is `high` even if the batch produces zero `Modify` content. -- **Documentation- or comment-only batches:** `default`. - -**Floor and resolution — hard rules** - -Effort levels are ordered: `minimal < low < medium < high < xhigh`. - -Compute `effective_effort`: - -- If `delegate_effort` is unset: `effective_effort = picked_level`. -- If `delegate_effort` is set: substitute `default` → `medium` in `picked_level`, then `effective_effort = max(picked_level, delegate_effort)`. - -Emit based on `effective_effort`: - -- `medium`, `high`, or `xhigh` → emit `-c 'model_reasoning_effort=""'`. -- `default` → omit the flag (defer to `~/.codex/config.toml`). Reachable only when `delegate_effort` is unset and the pick is `default`. - -Never pass the literal string `"default"` to `codex exec`. - -Store `effective_effort` as a per-batch derived state value (alongside the session-level `delegate_effort`) and use it in place of `delegate_effort` throughout the Execution Loop. - -## Prompt Template - -At the start of delegated execution, create a per-run OS-temp scratch directory via `mktemp -d` and capture its **absolute path** for all downstream use. All scratch files for this invocation live under that directory. 
Do not use `.context/` — these scratch files are per-run throwaway that get cleaned up when delegated execution ends (see Cleanup below), matching the repo Scratch Space convention for one-shot artifacts. Do not pass unresolved shell-variable strings to non-shell tools (Write, Read); use the absolute path returned by `mktemp -d`. - -```bash -SCRATCH_DIR="$(mktemp -d -t ce-work-codex-XXXXXX)" -echo "$SCRATCH_DIR" -``` - -Refer to the echoed absolute path as `` throughout the rest of this workflow. - -Before each batch, write a prompt file to `/prompt-batch-.md`. - -Build the prompt from the batch's implementation units using these XML-tagged sections: - -```xml - -[For a single-unit batch: Goal from the implementation unit. -For a multi-unit batch: list each unit with its Goal, stating the concrete -job, repository context, and expected end state for each.] - - - -[Combined file list from all units in the batch -- files to create, modify, or read.] - - - -[File paths from all units' "Patterns to follow" fields. If no patterns: -"No explicit patterns referenced -- follow existing conventions in the -modified files."] - - - -[For a single-unit batch: Approach from the unit. -For a multi-unit batch: list each unit's approach, noting dependencies -and suggested ordering.] - - - -[For a single-unit batch: the unit's Execution note. If the user gave a -session-level posture request (e.g., "do it test-first"), use that when -the unit has no Execution note. Otherwise "None". -For a multi-unit batch: list each unit as "U: " (or -"Unit :" if the plan lacks U-IDs; do not invent U-IDs), one per line, -same ordering as and . Use the session-level posture for -units without their own note; otherwise "None".] - -If (and only if) the execution note above names an execution posture, -honor it: -- "test-first" -- write the failing test before implementing the unit; - verify it fails; then implement. Do not over-implement beyond the - test's current behavior slice. 
Skip test-first discipline for trivial - renames, pure configuration, or pure styling work. Test-first still - follows the scenario completeness check in ; it only constrains - test-vs-implementation ordering, not whether to write tests. -- "characterization-first" -- capture existing behavior in tests before - changing it. -- Any other non-empty note: treat it as binding per-unit guidance and - follow it unless it conflicts with any other section of this prompt - (especially , , , or ). - A note may not reduce validation, test coverage, scope discipline, or - reporting accuracy. Report any conflict via the issues field of the - output contract. - -For units with "None" or an empty note, proceed pragmatically. - - - -- Do NOT run git commit, git push, or create PRs -- the orchestrating agent handles all git operations -- Restrict all modifications to files within the repository root -- Keep changes tightly scoped to the stated task -- avoid unrelated refactors, renames, or cleanup -- Resolve the task fully before stopping -- do not stop at the first plausible answer -- If you discover mid-execution that you need to modify files outside the repo root, complete what you can within the repo and report what you could not do via the result schema issues field - - - -Before writing tests, check whether the plan's test scenarios cover all -categories that apply to each unit. Supplement gaps before writing tests: -- Happy path: core input/output pairs from each unit's goal -- Edge cases: boundary values, empty/nil inputs, type mismatches -- Error/failure paths: invalid inputs, permission denials, downstream failures -- Integration: cross-layer scenarios that mocks alone won't prove - -Write tests that name specific inputs and expected outcomes. If your changes -touch code with callbacks, middleware, or event handlers, verify the -interaction chain works end-to-end. - - - -After implementing, run ALL test files together in a single command (not -per-file). 
Cross-file contamination (e.g., mocked globals leaking between -test files) only surfaces when tests run in the same process. If tests -fail, fix the issues and re-run until they pass. Do not report status -"completed" unless verification passes. This is your responsibility -- -the orchestrator will not re-run verification independently. - -[Test and lint commands from the project. Use the union of all units' -verification commands as a single combined invocation.] - - - -Report your result via the --output-schema mechanism. Fill in every field: -- status: "completed" ONLY if all changes were made AND verification passes, - "partial" if incomplete, "failed" if no meaningful progress -- files_modified: array of file paths you changed -- issues: array of strings describing any problems, gaps, or out-of-scope - work discovered -- summary: one-paragraph description of what was done -- verification_summary: what you ran to verify (command and outcome). - Example: "Ran `bun test` -- 14 tests passed, 0 failed." - If no verification was possible, say why. - -``` - -## Result Schema - -Write the result schema to `/result-schema.json` (using the absolute path captured at the start) once at the start of delegated execution: - -```json -{ - "type": "object", - "properties": { - "status": { "enum": ["completed", "partial", "failed"] }, - "files_modified": { "type": "array", "items": { "type": "string" } }, - "issues": { "type": "array", "items": { "type": "string" } }, - "summary": { "type": "string" }, - "verification_summary": { "type": "string" } - }, - "required": ["status", "files_modified", "issues", "summary", "verification_summary"], - "additionalProperties": false -} -``` - -Each batch's result is written to `/result-batch-.json` via the `-o` flag. On plan failure, files are left in place for debugging. - -If the result JSON is absent or malformed after a successful exit code, classify as task failure. 
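Editor's sketch (not part of the removed file): the "absent or malformed" rule for the result schema above could be checked roughly as follows. This is a minimal sketch, assuming the orchestrator has already read the result file into a string; `parse_batch_result` is a hypothetical helper name, and only the field names and allowed values come from the schema itself.

```python
import json

# Field names and types are taken from the result schema shown above.
REQUIRED_FIELDS = {
    "status": str,
    "files_modified": list,
    "issues": list,
    "summary": str,
    "verification_summary": str,
}
ALLOWED_STATUS = {"completed", "partial", "failed"}


def parse_batch_result(raw_text):
    """Return the parsed result dict, or None when it is absent or malformed.

    Hypothetical helper for illustration; the skill itself does not define it.
    """
    if not raw_text or not raw_text.strip():
        return None  # result file missing or empty
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None  # not valid JSON
    if not isinstance(data, dict):
        return None
    # "required" plus "additionalProperties": false means the key set must match exactly.
    if set(data) != set(REQUIRED_FIELDS):
        return None
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[key], expected_type):
            return None
    if data["status"] not in ALLOWED_STATUS:
        return None
    # Both array fields are declared with string items.
    for key in ("files_modified", "issues"):
        if not all(isinstance(item, str) for item in data[key]):
            return None
    return data
```

A `None` return after exit code 0 corresponds to the task-failure case described above; any other return value can be inspected for its `status`.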
- -## Execution Loop - -Initialize a `consecutive_failures` counter at 0 before the first batch. - -**Clean-baseline preflight:** Before the first batch, verify there are no uncommitted changes to tracked files: - -```bash -git diff --quiet HEAD -``` - -This intentionally ignores untracked files. Only staged or unstaged modifications to tracked files make rollback unsafe. However, if untracked files exist at paths in the batch's planned Files list, rollback (`git clean -fd -- `) would delete them. If such overlaps are detected, warn the user and recommend committing or stashing those files before proceeding. - -If tracked files are dirty, stop and present options: (1) commit current changes, (2) stash explicitly (`git stash push -m "pre-delegation"`), (3) continue in standard mode (sets `delegation_active` to false). Do not auto-stash user changes. - -**Delegation invocation:** For each batch, execute these as **separate Bash tool calls** (not combined into one): - -**Step A — Launch (background, separate Bash call):** - -Write the prompt file, then make a single Bash tool call with `run_in_background: true` set on the tool parameter. This call returns immediately and has no timeout ceiling. - -Substitute the literal absolute path captured at setup for every `` below. Each Bash tool call starts a fresh shell, so the `$SCRATCH_DIR` variable from the setup snippet is not preserved — an unresolved `$SCRATCH_DIR` would expand empty and break result detection. 
- -```bash -# Substitute the resolved sandbox_mode value (yolo or full-auto) from the skill state -SANDBOX_MODE="" - -# Resolve sandbox flag -if [ "$SANDBOX_MODE" = "full-auto" ]; then - SANDBOX_FLAG="-s workspace-write" -else - SANDBOX_FLAG="--dangerously-bypass-approvals-and-sandbox" -fi - -codex exec \ - $SANDBOX_FLAG \ - --output-schema "/result-schema.json" \ - -o "/result-batch-.json" \ - - < "/prompt-batch-.md" -``` - -**Conditional flags** — only include each line when the corresponding skill-state value is set: - -- If `delegate_model` is set, insert ` -m "" \` as a line before `$SANDBOX_FLAG`. -- If `effective_effort` is `medium`, `high`, or `xhigh` (resolved via Per-Batch Effort above), insert ` -c 'model_reasoning_effort=""' \` as a line before `$SANDBOX_FLAG`. When `effective_effort` is `default` (only possible when `delegate_effort` is unset and the pick is `default`), omit the line — never pass the literal string `"default"`. - -When either value is unset, omit its line entirely — Codex resolves the default from the user's `~/.codex/config.toml` (and ultimately the CLI's own built-in default). Do not substitute a placeholder string for unset values. - -Critical: `run_in_background: true` must be set as a **Bash tool parameter**, not as a shell `&` suffix. The tool parameter is what removes the timeout ceiling. A shell `&` inside a foreground Bash call still hits the 2-minute default timeout. - -Quoting is critical for the `-c` flag when present: use single quotes around the entire key=value and double quotes around the TOML string value inside. Example: `-c 'model_reasoning_effort="high"'`. - -Do not improvise CLI flags or modify this invocation template beyond the documented conditional insertions. - -**Step B — Poll (foreground, separate Bash calls):** - -After the launch call returns, make a **new, separate** foreground Bash tool call that polls for the result file. 
This keeps the agent's turn active so the user cannot interfere with the working tree. - -Substitute the literal absolute path captured at setup for ``. The shell variable from Step A does not survive across separate Bash tool calls. - -```bash -RESULT_FILE="/result-batch-.json" -for i in $(seq 1 6); do - test -s "$RESULT_FILE" && echo "DONE" && exit 0 - sleep 10 -done -echo "Waiting for Codex..." -``` - -If the output is "Waiting for Codex...", issue the same polling command again as another separate Bash call. Repeat until the output is "DONE", then read the result file and proceed to classification. - -**Polling termination conditions:** Stop polling when any of these conditions is met: - -- **Result file appears** (output is "DONE") -- proceed to result classification normally. -- **Background process exits with non-zero code** -- classify as CLI failure (row 1). Rollback and fall back to standard mode. -- **Background process exits with zero code but result file is absent** -- classify as task failure (row 2: exit 0, result JSON missing). Rollback and increment `consecutive_failures`. -- **5 polling rounds** elapse (~5 minutes) without the result file appearing and without a background process notification -- treat as a hung process. Classify as CLI failure (row 1). Rollback and fall back to standard mode. - -**Result classification:** Codex is responsible for running verification internally and fixing failures before reporting -- the orchestrator does not re-run verification independently. - -| # | Signal | Classification | Action | -|---|--------|---------------|--------| -| 1 | Exit code != 0 | CLI failure | Rollback to HEAD. Fall back to standard mode for ALL remaining work. | -| 2 | Exit code 0, result JSON missing or malformed | Task failure | Rollback to HEAD. Increment `consecutive_failures`. | -| 3 | Exit code 0, `status: "failed"` | Task failure | Rollback to HEAD. Increment `consecutive_failures`. 
| -| 4 | Exit code 0, `status: "partial"` | Partial success | Keep the diff. Complete remaining work locally, verify, and commit. Increment `consecutive_failures`. | -| 5 | Exit code 0, `status: "completed"` | Success | Commit changes. Reset `consecutive_failures` to 0. | - -**Result handoff — surface to user:** After reading the result JSON and before committing or rolling back, display a summary so the user sees what happened. Format: - -> **Codex batch ** -> -> -> **Files:** -> **Verification:** -> **Issues:** - -On failure or partial results, include the classification reason (e.g., "status: failed", "result JSON missing") so the user understands why the orchestrator is rolling back or completing locally. - -Keep this brief — the goal is transparency, not a wall of text. One short block per batch. - -**Rollback procedure:** - -```bash -git checkout -- . -git clean -fd -- -``` - -Do NOT use bare `git clean -fd` without path arguments. - -**Commit on success:** - -```bash -git add $(git diff --name-only HEAD; git ls-files --others --exclude-standard) -git commit -m "feat(): " -``` - -**Between batches** (plans split into multiple batches): Report what completed, test results, and what's next. Continue immediately unless the user intervenes -- the checkpoint exists so the user *can* steer, not so they *must*. - -**Circuit breaker:** After 3 consecutive failures, set `delegation_active` to false and emit: "Codex delegation disabled after 3 consecutive failures -- completing remaining units in standard mode." - -**Scratch cleanup:** No explicit cleanup needed — OS temp handles eventual cleanup (macOS `$TMPDIR` periodic purge; Linux/WSL `/tmp` reboot or periodic cleanup). Leaving `` in place after the run also preserves intermediate artifacts for debugging if anything went wrong. 
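Editor's sketch (not part of the removed file): the result classification table and the circuit breaker in this section can be read as a small decision function. The sketch below is one possible encoding, assuming the result JSON has already been parsed into a dict (or `None` when missing or malformed). The function names and the string labels such as `"cli_failure"` are hypothetical; only the row logic and the limit of 3 come from the text.

```python
CIRCUIT_BREAKER_LIMIT = 3  # "After 3 consecutive failures" in the text above


def classify_batch(exit_code, result):
    """Map a finished batch to (classification, action), following rows 1-5.

    `result` is the parsed result JSON, or None when it is missing or malformed.
    """
    if exit_code != 0:
        return ("cli_failure", "rollback_and_fall_back")         # row 1
    if result is None:
        return ("task_failure", "rollback")                      # row 2
    status = result.get("status")
    if status == "failed":
        return ("task_failure", "rollback")                      # row 3
    if status == "partial":
        return ("partial_success", "keep_and_complete_locally")  # row 4
    if status == "completed":
        return ("success", "commit")                             # row 5
    # An unknown status is treated here as malformed JSON (row 2); this is an assumption.
    return ("task_failure", "rollback")


def next_failure_count(consecutive_failures, classification):
    """Reset on success; increment on task failure or partial success."""
    if classification == "success":
        return 0
    if classification in ("task_failure", "partial_success"):
        return consecutive_failures + 1
    return consecutive_failures  # a CLI failure ends delegation regardless of the count


def delegation_still_active(classification, consecutive_failures):
    """False after a CLI failure or once the circuit breaker trips."""
    if classification == "cli_failure":
        return False
    return consecutive_failures < CIRCUIT_BREAKER_LIMIT
```

Note that a partial result keeps the diff but still counts toward the breaker, so three partial batches in a row would disable delegation under this reading.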
- -## Mixed-Model Attribution - -When some units are executed by Codex and others locally: -- If all units used delegation: attribute to the Codex model -- If all units used standard mode: attribute to the current agent's model -- If mixed: note which units were delegated in the PR description and credit both models diff --git a/skills/ce-work-beta/references/non-code-execution.md b/skills/ce-work-beta/references/non-code-execution.md deleted file mode 100644 index 52b7593ff..000000000 --- a/skills/ce-work-beta/references/non-code-execution.md +++ /dev/null @@ -1,23 +0,0 @@ -# Non-Code Execution (Knowledge-Work Carve-Out) - -Loaded from Phase 0 Input Triage when the plan carries `execution: knowledge-work`. The plan is a **production plan** for a non-code deliverable (a synthesized document, a study artifact, a research write-up) — typically produced by `ce-plan`'s approach-altitude flow. Execute it to produce the deliverable. This is a minority-case branch; the normal code lifecycle does not apply and is not invoked here. - -## What this skips - -Do **not** run any of the code-shipping machinery — it does not fit knowledge work: - -- No branch/worktree setup (Phase 1 Step 2). -- No task-list-from-implementation-units, no execution-strategy/subagent dispatch keyed on `Files:`. -- No Test Discovery, no test-scenario completeness, no system-wide test check. -- No incremental code commits, and none of `references/shipping-workflow.md` (no PR, no CI). - -## Execute the production plan - -1. **Read the plan fully.** It is a decision artifact describing *how* the deliverable gets made: which sources to read, how to mine each, how they combine, the shape of the deliverable, and any forks the user already confirmed. Honor those decisions. -2. **Read the sources the plan names** — the actual inputs (PDFs, transcripts, docs, links). Treat user-named resources as authoritative; read them rather than working from memory. 
If a named source is missing, say so plainly rather than substituting. -3. **Synthesize and produce the deliverable** following the plan's intended shape and the confirmed forks. This is the work the approach-plan deliberately deferred. -4. **Save and report.** Write the deliverable to a durable, repo-tracked location — default to a sensible `docs/` subpath (or a path the user named at the checkpoint) — and report its absolute path so the user can find it. Whether to git-commit vs. leave it written is the user's call; offer, don't force. - -## Stay scoped to non-code deliverables - -The carve-out is for knowledge-work output. If producing the deliverable legitimately requires emitting code (a script, a config file, a data-transform), route that specific sub-step back through the normal code path so its safeguards (Test Discovery, review, commit hygiene) still apply — do not silently produce code under the carve-out. The deliverable itself stays non-code. diff --git a/skills/ce-work-beta/references/review-findings-followup.md b/skills/ce-work-beta/references/review-findings-followup.md deleted file mode 100644 index c33479dee..000000000 --- a/skills/ce-work-beta/references/review-findings-followup.md +++ /dev/null @@ -1,104 +0,0 @@ -# Apply Code Review Findings (after `ce-code-review`) - -Load this reference when Tier 2 `ce-code-review` has finished and **ce-work-beta** should apply fixes before the Residual Work Gate. - -`ce-code-review` is invoked here with `mode:agent`, so it is **review-only** in this context — it reports findings and writes artifacts and does not mutate the checkout, commit, push, or file tickets. **The caller owns apply/fix policy.** (In its own default/interactive mode the review applies safe fixes itself; that path does not apply here.) - -## Consume the completed review (do not re-run it) - -This reference loads **after** review has run. 
In the ce-work-beta Tier 2 path, step 2a already invoked `ce-code-review`; this apply step **consumes that output** — do not start a second review, which would waste reviewer dispatches and risk overwriting the artifact the Residual Work Gate reconciles. - -Reuse the review output already in hand: - -- Parsed JSON (`status`, `actionable_findings`, `findings`, `artifact_path`, `run_id`) **or** the markdown Actionable Findings summary captured by the caller -- Run artifact dir: `/tmp/compound-engineering/ce-code-review//` (`review.json`, per-reviewer JSON for `why_it_matters`) - -If `status` is `failed`, stop shipping and surface `reason`. If `degraded`, note partial reviewer coverage before applying anything. - -### Fallback — invoke review only for cold callers - -Only when the caller reached this file **without** already running review (no review output in hand): invoke `ce-code-review` once, then proceed to apply. Do not invoke when the caller already ran review (e.g., ce-work-beta Tier 2 step 2a). - -Invoke the skill explicitly — do not treat a casual "review my changes" prompt as a substitute unless the harness routed it to `ce-code-review`. - -``` -ce-code-review mode:agent plan: base: -``` - -- `mode:agent` — JSON output (`review.json` + primary JSON response) for programmatic parsing; same review pipeline as default. -- `plan:` — when Phase 1 used a plan file (requirements completeness). -- `base:` — when the diff base is already resolved on the current checkout; omit when reviewing a PR number/URL or standalone current branch. -- Do **not** pass deprecated `mode:autofix`. - -For human / interactive shipping, invoke `ce-code-review` without `mode:agent` if markdown tables are preferred. Capture the same JSON / Actionable Findings and artifact dir listed above before applying. 
- -## Inputs for apply - -- `actionable_findings` from JSON, or the Actionable Findings section from markdown -- Full finding detail when needed: `review.json` / artifact `findings`, or `{reviewer}.json` for `why_it_matters` and `evidence` -- Stable finding `#` — reuse in commits, residual sinks, and subagent prompts - -## What to apply - -Default to applying every actionable finding. Applying is a reversible edit to a tracked tree; diffs are reviewed before commit (below) and tests run after — so leaving a clear, reversible fix unapplied "to be safe" is the failure mode, not the safe choice. Bias to act: - -- **Apply** any finding with a concrete `suggested_fix` that is a clear improvement — the common case. `confidence` and `autofix_class` tell you what to prioritize and what to flag, not whether you may apply: `autofix_class` is signal, **never permission**. -- **Push back** — keep the finding, don't apply — when the reviewer is wrong; note why. -- **Flag, don't block, green-but-unverifiable edits** — when an applied fix touches auth/authz, a public or cross-service contract/schema, or concurrency, a passing test does not prove safety; apply it when there is a clear `suggested_fix` and confidence, and call it out prominently in the diff review. - -There is no precondition safety checklist and no deny-list — a code-review fix is a reversible edit, so downside is controlled after the fact (diff review + tests + the commit checkpoint), not by gating the apply. - -**Evidence still matches the code** — the fix subagent confirms at `file:line` before editing. The orchestrator does **not** open files just to decide eligibility or dispatch. - -## What to defer (to the Residual Work Gate) - -- `autofix_class: advisory` — report-only. -- Findings with no concrete `suggested_fix` to act on. -- Findings whose right fix depends on a design or product decision — architecture direction, contract shape, or a behavior change needing sign-off. 
These need a human call before code changes. - -Surface what was deferred and why; never silently drop. - -## Execution — orchestrator batches, subagents apply - -The orchestrator **does not investigate findings** (no pre-read of cited files to judge complexity or inline vs subagent). That would spend the context window you are trying to protect. - -**Orchestrator owns:** parse review output → **eligibility filter on JSON fields only** → build batches → dispatch fix subagents → review diffs → tests → commit → Residual Work Gate. - -**Fix subagents own:** read `file:line`, confirm evidence still matches, apply or skip with reason, return summary. - -### Default: batched fix subagents - -After eligibility filtering, **dispatch subagents for all remaining applicable findings** unless the optional inline shortcut below applies. Do not classify findings by complexity in the parent thread. - -**Batching (primary rule — group by file):** - -1. Sort applicable findings by severity (P0 first). -2. **Group by `file`.** All eligible findings on the same file → **one subagent** (it loads the file once and works through its `#` list in severity order). -3. **Parallel waves:** batches with **disjoint file sets** may run in parallel (same worktree / shared-directory rules as Phase 1 Step 4 in `ce-work-beta` SKILL.md). -4. **Same file, many findings:** keep one subagent per file. If the prompt would exceed a comfortable size (~8 findings), split into **serial** subagent passes on that file (first batch highest severity, then next batch after merge or after the prior agent returns). -5. **Cross-file coupling:** do not merge unrelated files into one subagent just to reduce agent count — file grouping is the default. Only co-batch multiple files when findings explicitly reference the same small edit surface (rare); when in doubt, separate by file. 
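Editor's sketch (not part of the removed file): the group-by-file batching rules above could be implemented along these lines. This is a rough sketch under stated assumptions: each finding is a dict with `id`, `severity`, and `file` keys (the key names are hypothetical, with `id` standing in for the stable finding `#`), and severities are strings such as `"P0"` through `"P3"` so that a plain string sort puts P0 first. Treating "wave k" as the k-th pass of every file is one interpretation of the parallel-waves rule, not something the source spells out.

```python
from collections import defaultdict

MAX_FINDINGS_PER_PASS = 8  # the "~8 findings" comfortable prompt size from rule 4


def batch_findings_by_file(findings):
    """Group findings by file, highest severity first, split into serial passes.

    Returns {file: [pass_1, pass_2, ...]}; passes for the same file run serially.
    """
    # Assumes "P0" < "P1" < ... under string comparison (single-digit levels only).
    ordered = sorted(findings, key=lambda f: (f["severity"], f["id"]))
    by_file = defaultdict(list)
    for finding in ordered:
        by_file[finding["file"]].append(finding)
    return {
        path: [
            items[i:i + MAX_FINDINGS_PER_PASS]
            for i in range(0, len(items), MAX_FINDINGS_PER_PASS)
        ]
        for path, items in by_file.items()
    }


def parallel_waves(batches):
    """Wave k holds the k-th pass of every file, so files within a wave are disjoint."""
    depth = max((len(passes) for passes in batches.values()), default=0)
    return [
        {path: passes[k] for path, passes in batches.items() if k < len(passes)}
        for k in range(depth)
    ]
```

Each entry in a wave would map to one fix subagent; the cross-file co-batching exception in rule 5 is deliberately left out because it depends on judgment about the edit surface.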
- -**Subagent prompt (per batch):** the assigned findings only (`#`, severity, file, line, title, `suggested_fix`, `requires_verification`; add `why_it_matters` from `{reviewer}.json` in the run artifact when useful), plus: -- Work through assigned `#` in severity order; at each `file:line`, skip with a one-line reason if evidence no longer matches -- Apply the mechanical bar from § What to apply / What not to apply — skip anything that needs design judgment -- Do not re-run `ce-code-review` -- Shared-directory fallback: do not stage or commit — return which `#` were applied or skipped and which files changed - -**After each wave:** orchestrator reviews diffs (scope = assigned `#` only), runs tests (`requires_verification: true` on any applied finding → at least targeted tests; multi-file → broader suite), commits (`fix(review): apply findings #…`) unless worktree-isolated subagents merge per Phase 1. Repeat until all batches complete. - -### Optional inline shortcut (skip subagent spawn) - -Use **only** when **all** of the following hold: - -- Exactly **one** eligible finding after JSON filtering, **and** -- The orchestrator **already** has that file's relevant region in context from Phase 2 work this session (no new Read/Grep expedition) - -Otherwise dispatch a subagent — even for a single finding. When unsure, dispatch. - -### Summary (required) - -Report: batches dispatched, `#` applied vs skipped (with reasons from subagents), artifact path, tests run. - -## Handoff to Residual Work Gate - -Any actionable finding not applied in this pass is **residual work** — proceed to the Residual Work Gate with an updated count. Do not re-invoke `ce-code-review` solely to re-apply the same findings unless the diff changed materially after fixes. 
diff --git a/skills/ce-work-beta/references/shipping-workflow.md b/skills/ce-work-beta/references/shipping-workflow.md deleted file mode 100644 index ba5d6904d..000000000 --- a/skills/ce-work-beta/references/shipping-workflow.md +++ /dev/null @@ -1,146 +0,0 @@ -# Shipping Workflow - -This file contains the shipping workflow (Phase 3-4). Load it only when all Phase 2 tasks are complete and execution transitions to quality check. - -## Phase 3: Quality Check - -1. **Run Core Quality Checks** - - Always run before submitting: - - ```bash - # Run full test suite (use project's test command) - # Examples: bin/rails test, npm test, pytest, go test, etc. - - # Run linting (per the project's configured lint command / active instructions) - # Use linting-agent before pushing to origin - ``` - -2. **Simplify** (conditional — separate from code review tiers) - - Before code review, invoke **`ce-simplify-code`** when the diff is non-mechanical and large enough to benefit (default: **>=30 changed lines**). Skip when the diff is purely mechanical (formatting, dependency bumps, lint-only fixes, generated artifacts). - - This step refines reuse, quality, and efficiency on the **current diff** so any later review sees cleaner code. It is not a substitute for Tier 1 or Tier 2 review. - - Pass `plan:` or a scope hint when the plan or user narrowed what changed. If the skill is unavailable on the harness, skip or do a brief manual pass for obvious duplicate/dead code — do not escalate to Tier 2 because simplify was skipped. - -3. **Code Review** - - Use **Tier 1** when the harness provides a built-in review. Use **Tier 2** only when escalation criteria below match — **not** because Tier 1 is missing. - - **Tier 1 -- harness-native review (default when available).** Run the harness built-in code review (e.g., `/review` in Claude Code). Address blocking and suggested findings inline before Final Validation. Skip the Residual Work Gate. 
- - **Tier 2 -- `ce-code-review` (escalation only).** Two steps — **review is not fix.** - - **2a. Review (read-only).** Invoke `ce-code-review` with `mode:agent` (and `plan:` when known; add `base:` when the diff base is already resolved). Parse JSON or Actionable Findings. Do not pass `mode:autofix`. - - **2b. Apply fixes (caller-owned).** Load `references/review-findings-followup.md`: filter on JSON, batch by file, dispatch fix subagents. Then proceed to the Residual Work Gate. - - **When Tier 1 is unavailable and Tier 2 criteria are not met:** skip a dedicated review step. Phase 2 testing, simplify (when run), lint, and Final Validation still apply. Note in the shipping summary: `Code review: skipped (no Tier 1 tool; Tier 2 criteria not met).` - - Escalate to Tier 2 when **any** of the following is true: - - - **Sensitive surface touched.** The diff modifies any of: authentication or authorization, payments or billing, data migrations or backfills, cryptography or secret handling, security-relevant configuration, public API or library contracts, or dependency manifests. - - **Large and diffuse change.** The diff exceeds >=400 changed lines **and** spans more than 3 directories or 2 distinct subsystems. Either alone is a soft signal; together they are an escalation trigger. - - **Very large change.** The diff exceeds >=1,000 changed lines regardless of diffusion. - - **Plan or task explicitly requests it.** The plan, the originating task, or another instruction in scope calls for a full / deep / thorough code review. - - When the change is small, concentrated, and outside the sensitive surface list, Tier 1 is sufficient -- do not escalate "to be safe." - -4. **Residual Work Gate** (REQUIRED when Tier 2 ran) - - After Tier 2 code review and review-findings followup, inspect the **Actionable Findings** summary (or read the run artifact at `/tmp/compound-engineering/ce-code-review//`). 
If one or more actionable `downstream-resolver` findings were not applied in followup, do not proceed to Final Validation until the user decides how to handle them. - - Ask the user using the platform's blocking question tool (`AskUserQuestion` in Claude Code with `ToolSearch select:AskUserQuestion` pre-loaded if needed, `request_user_input` in Codex, `ask_question` in Antigravity CLI (`agy`), `ask_user` in Pi (requires the `pi-ask-user` extension)). Fall back to numbered options in chat only when the harness genuinely lacks a blocking tool. Never silently skip the gate. - - Stem: `Code review left N actionable finding(s) not yet fixed. How should the agent proceed?` - - Options (four or fewer, self-contained labels): - - `Apply/fix now` — load `references/review-findings-followup.md`, dispatch batched fix subagents for remaining eligible findings, run tests, commit if needed. - - `File tickets via project tracker` — load `references/tracker-defer.md` in Interactive mode; the agent files tickets in the project's detected tracker (or `gh` fallback, or leaves them in the report if no sink exists) and proceeds to Final Validation. - - `Accept and proceed` — record the residual findings verbatim in a durable "Known Residuals" sink before shipping. If a PR will be created or updated in Phase 4, include them in the PR description's "Known Residuals" section (the agent owns this when calling `ce-commit-push-pr`). If the user later chooses the no-PR `ce-commit` path, create `docs/residual-review-findings/.md`, include the accepted findings and source review-run context, stage it with the implementation commit, and mention the file path in the final summary. The user has acknowledged the risk, but the findings must not live only in the transient session. - - `Stop — do not ship` — abort the shipping workflow. The user will handle findings manually before re-invoking. 
- - Skip this gate entirely when the review reported `Actionable findings: none.` (and followup applied everything mechanical) or when only Tier 1 was used. Do not proceed past this gate on an `Accept and proceed` decision until the agent has recorded whether the durable sink is `PR Known Residuals` or `docs/residual-review-findings/.md`. - -5. **Final Validation** - - All tasks marked completed - - Testing addressed -- tests pass and new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed) - - Linting passes - - Code follows existing patterns - - Figma designs match (if applicable) - - No console errors or warnings - - If the plan has a `Requirements` section (or legacy `Requirements Trace`), verify each requirement is satisfied by the completed work - - If any `Deferred to Implementation` questions were noted, confirm they were resolved during execution - -6. **Prepare Operational Validation Plan** (REQUIRED) - - Add a `## Post-Deploy Monitoring & Validation` section to the PR description for every change. - - Include concrete: - - Log queries/search terms - - Metrics or dashboards to watch - - Expected healthy signals - - Failure signals and rollback/mitigation trigger - - Validation window and owner - - If there is truly no production/runtime impact, still include the section with: `No additional operational monitoring required` and a one-line reason. - -## Phase 4: Ship It - -1. **Prepare Validation Context** - - Do not try to launch a dedicated CE evidence-capture workflow. Modern harnesses provide their own browser, screenshot, terminal recording, and artifact capture tools; use those directly only when the user asks or when the artifact already exists. - - Note whether the completed work has observable behavior (UI rendering, CLI output, API/library behavior with a runnable example, generated artifacts, or workflow output), and summarize any manual validation performed. 
If the user supplied evidence (URL, markdown embed, local artifact path), pass it to `ce-commit-push-pr` as PR-description context. - -2. **Commit and Create Pull Request** - - Load the `ce-commit-push-pr` skill to handle committing, pushing, and PR creation. The skill handles convention detection, branch safety, logical commit splitting, adaptive PR descriptions, and attribution badges. - - When providing context for the PR description, include: - - The plan's summary and key decisions - - Testing notes (tests added/modified, manual testing performed) - - Evidence context from step 1, so `ce-commit-push-pr` can decide whether to ask about capturing evidence - - Figma design link (if applicable) - - The Post-Deploy Monitoring & Validation section (see Phase 3 Step 6) - - Any "Known Residuals" accepted in the Phase 3 Residual Work Gate, rendered as a dedicated section in the PR body with severity, file:line, and title per finding - - If the user prefers to commit without creating a PR, load the `ce-commit` skill instead. - -3. 
**Notify User** - - Summarize what was completed - - Link to PR (if one was created) - - Note any follow-up work needed - - Suggest next steps if applicable - -## Quality Checklist - -Before creating PR, verify: - -- [ ] All clarifying questions asked and answered -- [ ] All tasks marked completed -- [ ] Testing addressed -- tests pass AND new/changed behavior has corresponding test coverage (or an explicit justification for why tests are not needed) -- [ ] Linting passes (use linting-agent) -- [ ] Code follows existing patterns -- [ ] Figma designs match implementation (if applicable) -- [ ] Validation/evidence context passed to `ce-commit-push-pr` when the change has observable behavior -- [ ] Commit messages follow conventional format -- [ ] PR description includes Post-Deploy Monitoring & Validation section (or explicit no-impact rationale) -- [ ] Simplify: `ce-simplify-code` when diff >=30 lines (or skipped with reason) -- [ ] Code review: Tier 1 completed, or Tier 2 when escalated, or skipped (no Tier 1 + Tier 2 criteria not met — note in summary) -- [ ] PR description includes summary, testing notes, and evidence when captured -- [ ] PR description includes Compound Engineered badge with accurate model and harness - -## Code Review Tiers - -**Tier 1** when the harness has built-in review. **Tier 2** (`ce-code-review` + followup) only when escalation criteria match — missing Tier 1 is not a reason to escalate. - -**Tier 1 -- harness-native review.** Built-in command or skill (e.g., `/review`). Fix findings inline. - -**Tier 2 -- `ce-code-review` (escalation).** (2a) Review-only via `mode:agent`. (2b) Batched fix subagents per `references/review-findings-followup.md`; residuals → Residual Work Gate. - -**Skip dedicated review** when no Tier 1 and Tier 2 criteria not met (document in summary). 
- -Escalate to Tier 2 when any of these holds: -- Sensitive surface touched (auth/authz, payments/billing, data migrations or backfills, cryptography or secrets, security-relevant config, public API or library contracts, dependency manifests) -- Large and diffuse change (>=400 changed lines AND >3 directories or 2 subsystems) -- Very large change (>=1,000 changed lines) -- Plan or task explicitly requests a full / deep / thorough code review diff --git a/skills/ce-work-beta/references/tracker-defer.md b/skills/ce-work-beta/references/tracker-defer.md deleted file mode 100644 index de1f38cd4..000000000 --- a/skills/ce-work-beta/references/tracker-defer.md +++ /dev/null @@ -1,149 +0,0 @@ -# Tracker Detection and Defer Execution - -This reference covers how residual actionable findings are filed in the project's tracker. Loaded by caller workflows (for example `ce-work` Residual Work Gate, or `lfg` residual handling) — not by `ce-code-review`, which stops after the report. - ---- - -## Execution Modes - -Tracker-defer has two execution modes. The caller selects one; the detection, fallback chain, and ticket composition are shared. - -### Interactive mode - -Used by `ce-work` Residual Work Gate and similar caller flows when the user chooses to file tickets. All user-facing prompts fire: - -- First Defer of the session with a generic (non-named) label confirms the effective tracker choice. -- Execution failures prompt with Retry / Fall back to next sink / Convert to Skip. -- Labels in the routing question reflect `named_sink_available` (name the tracker) vs fallback generics. - -### Non-interactive mode - -Used by autonomous callers like `lfg` that must not prompt. All blocking questions are skipped; the fallback chain is executed silently in order. Behavior: - -- No confirmation on the first generic-label Defer; proceed directly. -- On execution failure, automatically fall to the next tier without prompting. Record the failure. 
-- On total chain exhaustion (every tier failed or no sink available), return findings in the `no_sink` bucket so the caller can route them to another surface (e.g., inline them in a PR description). -- Return a structured result: `{ filed: [{ finding_id, tracker, url }], failed: [{ finding_id, tracker, reason }], no_sink: [{ finding_id, title, severity, file, line }] }`. - -The caller decides how to surface the result to the user. The non-interactive mode treats "no sink available" as a data-producing outcome, not a prompt trigger. - ---- - -## Detection - -The agent determines the project's tracker from whatever documentation is obvious. Primary source: the project's active instructions and conventions already in its context — no need to open or name specific instruction files. Read a file directly only when the relevant instructions aren't already in context: a subdirectory-scoped instruction file governing the area you're working in, or when you're a fresh subagent that wasn't given the project's instructions. Supplementary signals (when primary documentation is ambiguous): `CONTRIBUTING.md`, `README.md`, PR templates under `.github/`, visible tracker URLs in the repo. - -A tracker can be surfaced via MCP tool (e.g., a Linear MCP server), CLI (e.g., `gh`), or direct API. All are acceptable. 
The detection output is a tuple with two availability flags — one for the named tracker specifically (drives label confidence in Interactive mode) and one for the full fallback chain (drives whether Defer is offered at all): - -``` -{ tracker_name, confidence, named_sink_available, any_sink_available } -``` - -Where: -- `tracker_name` — human-readable name ("Linear", "GitHub Issues", "Jira"), or `null` when detection cannot identify a specific tracker -- `confidence` — `high` when the tracker is named explicitly in documentation (or via a linked URL to a specific project/workspace) and is unambiguously the project's canonical tracker; `low` when the signal is thin, conflicting, or implied only -- `named_sink_available` — `true` only when the agent can actually invoke the detected tracker (MCP tool is loaded, CLI is authenticated, or API credentials are in environment); `false` when the tracker is documented but no tool reaches it, or when no tracker is found at all. Drives label confidence: inline tracker naming requires this to be `true`. -- `any_sink_available` — `true` when any tier in the fallback chain (named tracker or GitHub Issues via `gh`) can be invoked this session. Drives whether Defer is offered in Interactive mode, and drives the `no_sink` bucket in Non-interactive mode. - -Detection is reasoning-based. Do not maintain an enumerated checklist of files to read. Read the obvious sources and form a confident conclusion; when the obvious sources don't resolve, the label falls back to generic wording and the agent confirms with the user before executing (Interactive mode only). - ---- - -## Probe timing and caching - -Availability probes run **at most once per session** and **only when Defer execution is imminent**. Never speculatively at review start, never per-Defer, never per-walk-through-finding. The cached tuple is reused for every Defer action in the same run. - -Typical probe sequence: - -1. 
Consult the project's instructions already in context for tracker references — don't open or name specific instruction files; read one directly only when the relevant instructions aren't in context (subdirectory scope, or a fresh subagent). If nothing found, set `tracker_name = null`, `confidence = low`. -2. **Probe the named tracker when one was found.** For GitHub Issues, run `gh auth status` and `gh repo view --json hasIssuesEnabled`. For Linear or other connector/MCP-backed trackers, first discover available tools via the platform's tool-discovery primitive (e.g., `ToolSearch` in Claude Code) rather than assuming absence from an unloaded tool, then verify the discovered tool is responsive. For API-backed trackers, verify credentials wherever the platform exposes them (environment, connector auth, or a documented secrets location) — not only shell env vars. Set `named_sink_available` from the probe result. -3. **Probe the GitHub Issues fallback to compute `any_sink_available`.** Even when the named tracker was found and probed, `gh` matters for the `no_sink` bucket decision so that a run with no documented tracker but working `gh` still offers Defer. - - If `named_sink_available = true`: `any_sink_available = true` (no further probes needed). - - Otherwise, probe GitHub Issues via `gh auth status` + `gh repo view --json hasIssuesEnabled` (skip if already probed in step 2). If it works, `any_sink_available = true`. - - Otherwise, `any_sink_available = false`. - -When Interactive mode's routing question is skipped entirely (R2 zero-findings case), no probes run. When the cached tuple is reused across a session, any `named_sink_available = true` from the session's first probe stays cached — do not re-probe per Defer. - ---- - -## Label logic (Interactive mode) - -- When `confidence = high` AND `named_sink_available = true`: the routing question's option C and the walk-through's per-finding Defer option both include the tracker name verbatim. 
Example: `File a Linear ticket per finding`, `Defer — file a Linear ticket`. -- When `any_sink_available = true` but either `confidence = low` or `named_sink_available = false` (a fallback tier is working instead): the labels read generically — `File an issue per finding`, `Defer — file a ticket`. Before executing the first Defer of the session, the agent confirms the effective tracker choice with the user using the platform's blocking question tool. -- When `any_sink_available = false`: option C is omitted from the routing question, option B (Defer) is omitted from the walk-through per-finding options, and the agent tells the user why in the routing question's stem. - -Non-interactive mode skips label decisions entirely — it acts silently on the detected sink. - ---- - -## Fallback chain - -When the named tracker is unavailable or no tracker is named, fall back in this order. Prefer the project's detected tracker; use `gh` only when no named tracker was found or the named one is unreachable. - -1. **Named tracker** (MCP tool, CLI, or API the agent can invoke directly, identified via Detection above) -2. **GitHub Issues via `gh`** — when `gh auth status` succeeds and the current repo has issues enabled (`gh repo view --json hasIssuesEnabled` returns `true`) -3. **No sink** — findings remain in the review report's residual-work section (Interactive mode) or are returned in the `no_sink` bucket for the caller to route (Non-interactive mode). The agent does not re-display them through a transient surface. - -Previously this chain included a third in-session fallback tier. That tier was removed because in-session tasks do not survive past the session and therefore do not meet the "durable filing" intent of a Defer action. When no durable tracker exists, the correct behavior is to leave findings in the report (Interactive) or return them to the caller (Non-interactive). 
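
As a rough illustration of the fallback chain in Non-interactive mode, the sketch below walks the tiers in order and fills the `filed` / `failed` / `no_sink` buckets. It is one possible reading, not the plugin's implementation: the `Probe` shape, `availableTiers`, `deferAll`, and the injected `createTicket` callback are hypothetical names, the `url` and finding detail fields are left out for brevity, and dropping any failed tier for the rest of the run generalizes what the text states for the named tracker.

```typescript
// Hypothetical types; only the tier order and the bucket names come from this reference.
type Tier = "named" | "gh"

interface Probe {
  namedSinkAvailable: boolean // named_sink_available from the cached detection tuple
  ghAvailable: boolean // assumption: outcome of the GitHub Issues probe
}

interface DeferResult {
  filed: { finding_id: string; tracker: Tier }[]
  failed: { finding_id: string; tracker: Tier; reason: string }[]
  no_sink: { finding_id: string }[]
}

// Tiers that can be invoked this session, in fallback order.
function availableTiers(probe: Probe): Tier[] {
  const tiers: Tier[] = []
  if (probe.namedSinkAvailable) tiers.push("named")
  if (probe.ghAvailable) tiers.push("gh")
  return tiers
}

// Non-interactive walk: never prompts, falls through on failure, records each failure.
function deferAll(
  findingIds: string[],
  probe: Probe,
  createTicket: (tier: Tier, findingId: string) => void, // hypothetical; throws on failure
): DeferResult {
  const result: DeferResult = { filed: [], failed: [], no_sink: [] }
  let tiers = availableTiers(probe)

  for (const findingId of findingIds) {
    let filed = false
    for (const tier of [...tiers]) {
      try {
        createTicket(tier, findingId)
        result.filed.push({ finding_id: findingId, tracker: tier })
        filed = true
        break
      } catch (error) {
        const reason = error instanceof Error ? error.message : String(error)
        result.failed.push({ finding_id: findingId, tracker: tier, reason })
        // Assumption: a tier that failed is not retried for later findings.
        tiers = tiers.filter((t) => t !== tier)
      }
    }
    // No durable sink accepted the finding: hand it back to the caller.
    if (!filed) result.no_sink.push({ finding_id: findingId })
  }
  return result
}
```

In this sketch `any_sink_available` corresponds to `availableTiers(probe).length > 0`. There is deliberately no in-session task tier, in line with the note above about durable filing.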
- ---- - -## Ticket composition - -Every Defer action creates a ticket with the following content, adapted to the tracker's capabilities: - -- **Title:** the merged finding's `title` (schema-capped at 10 words). -- **Body:** - - Plain-English problem statement — reads the persona-produced `why_it_matters` from the contributing reviewer's artifact file at `/tmp/compound-engineering/ce-code-review//{reviewer}.json`, using the same `file + line_bucket(line, +/-3) + normalize(title)` matching agent mode uses (see SKILL.md Stage 6 detail enrichment). Falls back to the merged finding's `title`, `severity`, `file`, and `suggested_fix` (when present) when no artifact match is available — these fields are guaranteed in the merge-tier compact return. - - Suggested fix (when present in the finding's `suggested_fix`). - - Evidence (direct quotes from the reviewer's artifact). - - Metadata block: `Severity: `, `Confidence: `, `Reviewer(s): `, `Finding ID: `. -- **Labels** (when the tracker supports labels): severity tag (`P0`, `P1`, `P2`, `P3`) and, when the tracker convention supports it, a category label sourced from the reviewer name. -- **Length cap:** when the composed body would exceed a tracker's body length limit, truncate with `... (continued in ce-code-review run artifact: /tmp/compound-engineering/ce-code-review//)` and include the finding_id in both the truncated body and the metadata block so the artifact is discoverable. - -The finding_id is a stable fingerprint composed as `normalize(file) + line_bucket(line, +/-3) + normalize(title)` — the same fingerprint used by the merge pipeline. - ---- - -## Failure path - -When ticket creation fails at execution (API error, auth expiry mid-session, rate limit, malformed body rejected, 4xx/5xx response): - -**Interactive mode:** surface the failure inline and ask the user using the platform's blocking question tool. - -Stem: -> Defer failed: returned . How should the agent handle this finding? 
- -Options: -- `Retry on ` — re-attempt the same tracker once more (useful for transient errors) -- `Fall back to next sink` — move this finding's Defer to the next tier in the fallback chain (e.g., from Linear to GitHub Issues) -- `Convert to Skip — record the failure` — abandon this Defer, note the failure in the completion report's failure section, and continue the walk-through or bulk flow - -**Non-interactive mode:** do not prompt. Automatically fall through to the next tier. If every tier fails, record the finding in the `failed` bucket of the structured return and continue. If the chain exhausts with no sink ever available, the finding ends up in the `no_sink` bucket. - -When a high-confidence named tracker fails at execution, the cached `named_sink_available` is set to `false` for the rest of the session. Subsequent Defer actions fall straight through to the next tier without retrying a confirmed-broken sink. `any_sink_available` is only downgraded to `false` when every tier has been confirmed broken — a failed Linear call that succeeds via `gh` keeps `any_sink_available = true`. - -Only when `ToolSearch` explicitly returns no match or the tool call errors — or on a platform with no blocking question tool — fall back to numbered options and waiting for the user's reply (Interactive mode only). - ---- - -## Per-tracker behavior - -Concrete behavior per tracker at execution time. The agent may invoke any of these through the appropriate interface (MCP, CLI, or API) — the choice depends on what is available in the current environment. 
- -| Tracker | Interface | Invocation sketch | Body format | Labels | -|---------|-----------|-------------------|-------------|--------| -| Linear | MCP (preferred) or API | Create issue in the project/workspace identified by documentation; assign to the reporter if the MCP tool exposes user context | Markdown | Severity priority field if the MCP exposes it; otherwise include severity in body | -| GitHub Issues | `gh issue create` | Repo defaults to the current repo. Use `--label` for severity tag when labels exist; omit `--label` if the repo has no label fixture. Fall back to a label-less issue on first failure. | Markdown | `--label P0` / `--label P1` / etc. when labels exist | -| Jira | MCP or API | Create issue in the project identified by documentation; Jira's markdown dialect differs from GitHub's — use plain text in the body when MCP does not handle conversion | Plain text when MCP does not handle markdown | Severity priority field | -| No sink available | — | Interactive: Defer option omitted, findings remain in the report's residual-work section. Non-interactive: findings returned in the `no_sink` bucket for caller routing. | — | — | - -When uncertain, prefer "drop with explicit user-facing notice" over "pass through silently and hope." A Defer that produces no durable artifact and no user message is data loss. - ---- - -## Cross-platform notes - -The question-tool name varies by platform. In Interactive mode, use the platform's blocking question tool (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_question` in Antigravity CLI (`agy`), `ask_user` in Pi (requires the `pi-ask-user` extension)). In Claude Code the tool should already be loaded from the Interactive-mode pre-load step — if it isn't, call `ToolSearch` with query `select:AskUserQuestion` now. 
Fall back to numbered options in chat only when the harness genuinely lacks a blocking tool — `ToolSearch` returns no match, the tool call explicitly fails, or the runtime mode does not expose it (e.g., Codex edit modes without `request_user_input`). A pending schema load is not a fallback trigger. Never silently skip the question. - -Non-interactive mode is platform-agnostic: it never prompts, so the platform's question tool is not relevant. diff --git a/src/converters/claude-to-copilot.ts b/src/converters/claude-to-copilot.ts index 0bb57623f..18c6fcd3a 100644 --- a/src/converters/claude-to-copilot.ts +++ b/src/converters/claude-to-copilot.ts @@ -127,7 +127,7 @@ export function transformContentForCopilot(body: string): string { // The lookbehind ensures we only match at word boundaries or after common delimiters, // avoiding corruption of URLs, code identifiers, or unrelated namespace:value patterns. // Note: / is intentionally excluded — slash commands are already handled in step 2. - // Captures colons in the name segment so multi-colon refs like ce:work:beta → ce-work-beta. + // Captures colons in the name segment so multi-colon refs like ce:foo:bar → ce-foo-bar. result = result.replace(/(?<=^|[\s,.()`'"])ce:([a-z*][a-z0-9_*:-]*)/gim, (_, name: string) => `ce-${name.replace(/:/g, "-")}`) // 4. 
Rewrite .claude/ paths to .github/ and ~/.claude/ to ~/.copilot/ diff --git a/src/data/plugin-legacy-artifacts.ts b/src/data/plugin-legacy-artifacts.ts index b7e494c55..eb56820f1 100644 --- a/src/data/plugin-legacy-artifacts.ts +++ b/src/data/plugin-legacy-artifacts.ts @@ -70,6 +70,7 @@ const EXTRA_LEGACY_ARTIFACTS_BY_PLUGIN: Record = "ce-session-extract", "ce-session-inventory", "ce-update", + "ce-work-beta", "changelog", "claude-permissions-optimizer", "compound-docs", @@ -253,6 +254,7 @@ const EXTRA_LEGACY_ARTIFACTS_BY_PLUGIN: Record = "ce:plan", "ce:review", "ce:work", + "ce:work-beta", "changelog", "codify", "compound", diff --git a/src/utils/legacy-cleanup.ts b/src/utils/legacy-cleanup.ts index 29d63f28b..9a07ab6a2 100644 --- a/src/utils/legacy-cleanup.ts +++ b/src/utils/legacy-cleanup.ts @@ -109,6 +109,7 @@ export const STALE_SKILL_DIRS = [ "ce-sessions", "ce-slack-research", "ce-update", + "ce-work-beta", // ce-session-inventory and ce-session-extract were script-host skills called // only from ce-session-historian via the Skill tool. That dispatch path @@ -401,6 +402,9 @@ const LEGACY_PROMPT_DESCRIPTION_ALIASES: Record = { "Transform feature descriptions or requirements into implementation plans grounded in repo patterns and research.", ], "ce-work-beta.md": [ + // Last shipped ce-work-beta description (the file was deleted, so this is + // the final live frontmatter description preserved for upgrade cleanup). + "[BETA] Execute ce-work with external delegate support.", "[BETA] Execute work with external delegate support. Same as ce-work but includes experimental Codex delegation mode for token-conserving code implementation.", "[BETA] Execute work with external delegate support. 
Same as ce:work but includes experimental Codex delegation mode for token-conserving code implementation.", ], @@ -468,6 +472,10 @@ const LEGACY_ONLY_SKILL_DESCRIPTIONS: Record = { "[BETA] Structured code review using tiered persona agents, confidence-gated findings, and a merge/dedup pipeline. Use when reviewing code changes before creating a PR.", "ce-review-beta": "[BETA] Structured code review using tiered persona agents, confidence-gated findings, and a merge/dedup pipeline. Use when reviewing code changes before creating a PR.", + "ce:work-beta": + "[BETA] Execute ce-work with external delegate support.", + "ce-work-beta": + "[BETA] Execute ce-work with external delegate support.", "ce-onboarding": "Generate or regenerate ONBOARDING.md to help new contributors understand a codebase. Use when the user asks to 'create onboarding docs', 'generate ONBOARDING.md', 'document this project for new developers', 'write onboarding documentation', 'vonboard', 'vonboarding', 'prepare this repo for a new contributor', 'refresh the onboarding doc', or 'update ONBOARDING.md'. Also use when someone needs to onboard a new team member and wants a written artifact, or when a codebase lacks onboarding documentation and the user wants to generate one.", "ce-andrew-kane-gem-writer": @@ -860,9 +868,22 @@ async function loadLegacyFingerprints(): Promise { for (const [fileName, skillName] of Object.entries(LEGACY_PROMPT_CURRENT_SKILL_FOR_FILE)) { const currentPath = skillIndex.get(skillName) - if (!currentPath) continue - const description = await readDescription(currentPath) - if (description) prompts.set(fileName, description) + if (currentPath) { + const description = await readDescription(currentPath) + if (description) prompts.set(fileName, description) + continue + } + // The mapped skill no longer ships (fully retired, e.g. ce-work-beta). + // Seed the prompt fingerprint with any historical alias so cleanup can + // still match and sweep the orphaned wrapper on upgrade. 
The specific
+    // value is not significant — isLegacyPromptWrapper unions the full
+    // LEGACY_PROMPT_DESCRIPTION_ALIASES list when matching; this only has to
+    // be a non-empty description to clear descriptionsMatch's guard. Mirrors
+    // the LEGACY_ONLY_SKILL_DESCRIPTIONS / LEGACY_ONLY_AGENT_DESCRIPTIONS
+    // fallbacks above; the prompts dir is cross-plugin, so a description
+    // fingerprint (not a name-only match) is required to sweep safely.
+    const historicalFingerprint = LEGACY_PROMPT_DESCRIPTION_ALIASES[fileName]?.[0]
+    if (historicalFingerprint) prompts.set(fileName, historicalFingerprint)
   }
   return { skills, agents, prompts }
diff --git a/tests/copilot-converter.test.ts b/tests/copilot-converter.test.ts
index 6ba260de9..d449fff36 100644
--- a/tests/copilot-converter.test.ts
+++ b/tests/copilot-converter.test.ts
@@ -478,9 +478,9 @@ Task best-practices-researcher(topic)`
   })

   test("replaces multi-colon ce: references fully", () => {
-    const input = "run ce:work:beta and ce:review:deep"
+    const input = "run ce:foo:bar and ce:baz:qux"
     const result = transformContentForCopilot(input)
-    expect(result).toBe("run ce-work-beta and ce-review-deep")
+    expect(result).toBe("run ce-foo-bar and ce-baz-qux")
     expect(result).not.toContain(":")
   })

diff --git a/tests/legacy-cleanup.test.ts b/tests/legacy-cleanup.test.ts
index 8f9e2a300..afd686f51 100644
--- a/tests/legacy-cleanup.test.ts
+++ b/tests/legacy-cleanup.test.ts
@@ -225,6 +225,24 @@ describe("cleanupStaleSkillDirs", () => {
     expect(await exists(path.join(root, "workflows:plan"))).toBe(false)
   })

+  test("removes a retired ce-work-beta skill dir via its last-shipped description", async () => {
+    // Regression: once ce-work-beta is removed from the plugin, loadLegacyFingerprints
+    // can no longer read its (deleted) SKILL.md, so the fingerprint comes from
+    // LEGACY_ONLY_SKILL_DESCRIPTIONS. Without that entry, skills.get("ce-work-beta")
+    // stays undefined and isLegacyPluginOwned returns false before deleting, leaving
+    // the stale install dir behind on upgrade.
+    const root = await fs.mkdtemp(path.join(os.tmpdir(), "cleanup-retired-skill-"))
+    await createDir(
+      path.join(root, "ce-work-beta"),
+      skillContent("ce-work-beta", "[BETA] Execute ce-work with external delegate support."),
+    )
+
+    const removed = await cleanupStaleSkillDirs(root)
+
+    expect(removed).toBe(1)
+    expect(await exists(path.join(root, "ce-work-beta"))).toBe(false)
+  })
+
   test("returns 0 when directory does not exist", async () => {
     const removed = await cleanupStaleSkillDirs("/tmp/nonexistent-cleanup-dir-12345")
     expect(removed).toBe(0)
@@ -621,10 +639,10 @@ describe("cleanupStalePrompts", () => {
       ),
     )
     await createFile(
-      path.join(root, "ce-work-beta.md"),
+      path.join(root, "ce-work.md"),
       legacyWorkflowPromptContent(
-        "ce:work-beta",
-        (await pluginDescription("skills/ce-work-beta/SKILL.md"))
+        "ce:work",
+        (await pluginDescription("skills/ce-work/SKILL.md"))
           .replaceAll("ce-", "ce:"),
       ),
     )
@@ -633,7 +651,7 @@ describe("cleanupStalePrompts", () => {

     expect(removed).toBe(2)
     expect(await exists(path.join(root, "ce-plan.md"))).toBe(false)
-    expect(await exists(path.join(root, "ce-work-beta.md"))).toBe(false)
+    expect(await exists(path.join(root, "ce-work.md"))).toBe(false)
   })

   test("removes wrappers whose description has drifted (matches a known historical alias)", async () => {
@@ -695,6 +713,27 @@ describe("cleanupStalePrompts", () => {
     expect(await exists(path.join(root, "ce-brainstorm.md"))).toBe(false)
   })

+  test("removes a retired ce-work-beta prompt wrapper built from the last shipped skill", async () => {
+    // Regression: a ce-work-beta.md wrapper generated from the final live skill
+    // carried the description "[BETA] Execute ce-work with external delegate
+    // support." After the skill is deleted, that exact description must still be
+    // recognized (seeded from LEGACY_PROMPT_DESCRIPTION_ALIASES) or the retired
+    // slash prompt is classified foreign and left behind on upgrade.
+    const root = await fs.mkdtemp(path.join(os.tmpdir(), "cleanup-retired-prompt-"))
+    await createFile(
+      path.join(root, "ce-work-beta.md"),
+      promptWrapperContent(
+        "ce-work-beta",
+        "[BETA] Execute ce-work with external delegate support.",
+      ),
+    )
+
+    const removed = await cleanupStalePrompts(root)
+
+    expect(removed).toBe(1)
+    expect(await exists(path.join(root, "ce-work-beta.md"))).toBe(false)
+  })
+
   test("preserves wrappers whose description was never shipped by compound-engineering", async () => {
     // Defense-in-depth against a sibling plugin installed into the same
     // `~/.codex/prompts/` directory. `renderPrompt` in
diff --git a/tests/pipeline-review-contract.test.ts b/tests/pipeline-review-contract.test.ts
index f2a0058c7..584f7ea28 100644
--- a/tests/pipeline-review-contract.test.ts
+++ b/tests/pipeline-review-contract.test.ts
@@ -50,38 +50,6 @@ describe("ce-work review contract", () => {
     expect(content).not.toContain("[HARNESS_URL]")
   })

-  test("ce-work-beta mirrors review and commit delegation", async () => {
-    const beta = await readRepoFile("skills/ce-work-beta/SKILL.md")
-    // Review/commit content extracted to references/shipping-workflow.md
-    const shipping = await readRepoFile("skills/ce-work-beta/references/shipping-workflow.md")
-
-    // Extracted content in reference file: Simplify step at position 2,
-    // Code Review at position 3
-    expect(shipping).toContain("2. **Simplify**")
-    expect(shipping).toContain("3. **Code Review**")
-    expect(shipping).toContain("`ce-commit-push-pr` skill")
-    expect(shipping).toContain("`ce-commit` skill")
-
-    // Negative assertions stay on SKILL.md
-    expect(beta).not.toContain("Consider Code Review")
-    expect(beta).not.toContain("gh pr create")
-  })
-
-  test("ce-work-beta mirrors residual work gate sentinel with ce-work", async () => {
-    const workShipping = await readRepoFile(
-      "skills/ce-work/references/shipping-workflow.md",
-    )
-    const betaShipping = await readRepoFile(
-      "skills/ce-work-beta/references/shipping-workflow.md",
-    )
-
-    expect(workShipping).toContain("Actionable findings: none.")
-    expect(betaShipping).toContain("Actionable findings: none.")
-    expect(betaShipping).not.toContain("Residual actionable work: none.")
-    expect(betaShipping).toContain("not yet fixed")
-    expect(betaShipping).not.toContain("skill did not auto-fix")
-  })
-
   test("includes per-task testing deliberation in execution loop", async () => {
     const content = await readRepoFile("skills/ce-work/SKILL.md")

@@ -110,23 +78,6 @@ describe("ce-work review contract", () => {
     expect(shipping).not.toContain("Tests pass (run project's test command)")
   })

-  test("ce-work-beta mirrors testing deliberation and checklist changes", async () => {
-    const beta = await readRepoFile("skills/ce-work-beta/SKILL.md")
-    // Checklist extracted to references/shipping-workflow.md
-    const shipping = await readRepoFile("skills/ce-work-beta/references/shipping-workflow.md")
-
-    // Testing deliberation stays in SKILL.md (Phase 2 content)
-    expect(beta).toContain("Assess testing coverage")
-
-    // New checklist language in reference file
-    expect(shipping).toContain("Testing addressed")
-
-    // Old language removed from both
-    expect(beta).not.toContain("Tests pass (run project's test command)")
-    expect(beta).not.toContain("- All tests pass")
-    expect(shipping).not.toContain("Tests pass (run project's test command)")
-  })
-
   test("SKILL.md stub points to shipping-workflow reference", async () => {
     const content = await readRepoFile("skills/ce-work/SKILL.md")

@@ -139,18 +90,6 @@ describe("ce-work review contract", () => {
     expect(content).not.toContain("## Code Review Tiers")
   })

-  test("ce:work-beta SKILL.md stub points to shipping-workflow reference", async () => {
-    const content = await readRepoFile("skills/ce-work-beta/SKILL.md")
-
-    // Stub references the shipping-workflow file
-    expect(content).toContain("`references/shipping-workflow.md`")
-
-    // Extracted content is not in SKILL.md
-    expect(content).not.toContain("3. **Code Review**")
-    expect(content).not.toContain("## Quality Checklist")
-    expect(content).not.toContain("## Code Review Tiers")
-  })
-
   test("ce:work remains the stable non-delegating surface", async () => {
     const content = await readRepoFile("skills/ce-work/SKILL.md")

@@ -160,131 +99,7 @@ describe("ce-work review contract", () => {
   })
 })

-describe("ce:work-beta codex delegation contract", () => {
-  test("has argument parsing with delegate tokens", async () => {
-    const content = await readRepoFile("skills/ce-work-beta/SKILL.md")
-
-    // Argument parsing section exists with delegation tokens
-    expect(content).toContain("## Argument Parsing")
-    expect(content).toContain("`delegate:codex`")
-    expect(content).toContain("`delegate:local`")
-
-    // Resolution chain present
-    expect(content).toContain("### Settings Resolution Chain")
-    expect(content).toContain("work_delegate")
-    expect(content).toContain("config.local.yaml")
-  })
-
-  test("argument-hint includes delegate:codex for discoverability", async () => {
-    const content = await readRepoFile("skills/ce-work-beta/SKILL.md")
-
-    expect(content).toContain("argument-hint:")
-    expect(content).toContain("delegate:codex")
-  })
-
-  test("remains manual-invocation beta during rollout", async () => {
-    const content = await readRepoFile("skills/ce-work-beta/SKILL.md")
-
-    expect(content).toContain("disable-model-invocation: true")
-    expect(content).toContain("Invoke `ce-work-beta` manually")
-    expect(content).toContain("planning and workflow handoffs remain pointed at stable `ce-work`")
-  })
-
-  test("SKILL.md has delegation routing stub pointing to reference", async () => {
-    const content = await readRepoFile("skills/ce-work-beta/SKILL.md")
-
-    expect(content).toContain("## Codex Delegation Mode")
-    expect(content).toContain("references/codex-delegation-workflow.md")
-    // Delegation details are NOT in SKILL.md body — they're in the reference
-    expect(content).not.toContain("### Pre-Delegation Checks")
-    expect(content).not.toContain("### Prompt Template")
-    expect(content).not.toContain("### Execution Loop")
-  })
-
-  test("delegation routing gate in Phase 1 Step 4", async () => {
-    const content = await readRepoFile("skills/ce-work-beta/SKILL.md")
-
-    const gateIdx = content.indexOf("Delegation routing gate")
-    const strategyTableIdx = content.indexOf("| **Inline**")
-    expect(gateIdx).toBeGreaterThan(0)
-    expect(gateIdx).toBeLessThan(strategyTableIdx)
-    expect(content).toContain("Codex delegation requires a plan file")
-  })
-
-  test("delegation branches in Phase 2 task loop", async () => {
-    const content = await readRepoFile("skills/ce-work-beta/SKILL.md")
-
-    expect(content).toContain("If delegation_active: branch to the Codex Delegation Execution Loop")
-  })
-
-  test("delegation reference has all required sections", async () => {
-    const content = await readRepoFile("skills/ce-work-beta/references/codex-delegation-workflow.md")
-
-    // Pre-delegation checks
-    expect(content).toContain("## Pre-Delegation Checks")
-    expect(content).toContain("Platform Gate")
-    expect(content).toContain("CODEX_SANDBOX")
-    expect(content).toContain("command -v codex")
-    expect(content).toContain("Consent Flow")
-
-    // Batching
-    expect(content).toContain("## Batching")
-
-    // Prompt template
-    expect(content).toContain("## Prompt Template")
-    expect(content).toContain("")
-    expect(content).toContain("")
-    expect(content).toContain("")
-    expect(content).toContain("")
-    expect(content).toContain("test-first")
-    expect(content).toContain("characterization-first")
-    expect(content).toContain("the orchestrator will not re-run verification independently")
-
-    // Result schema and execution loop
-    expect(content).toContain("## Result Schema")
-    expect(content).toContain("## Execution Loop")
-    expect(content).toContain("codex exec")
-
-    // Circuit breaker
-    expect(content).toContain("consecutive_failures")
-    expect(content).toContain("3 consecutive failures")
-
-    // Rollback safety
-    expect(content).toContain("git diff --quiet HEAD")
-    expect(content).toContain("git checkout -- .")
-    expect(content).toContain("Do NOT use bare `git clean -fd` without path arguments")
-
-    // Mixed-model attribution
-    expect(content).toContain("## Mixed-Model Attribution")
-  })
-
-  test("delegation reference has decision prompts for ask mode", async () => {
-    const content = await readRepoFile("skills/ce-work-beta/references/codex-delegation-workflow.md")
-
-    expect(content).toContain("## Delegation Decision")
-    expect(content).toContain("work_delegate_decision")
-    expect(content).toContain("Execute with Claude Code instead")
-    expect(content).toContain("Delegate to Codex anyway")
-    expect(content).toContain("the cost of delegating outweighs having Claude Code do them")
-  })
-
-  test("settings resolution includes delegation decision setting", async () => {
-    const content = await readRepoFile("skills/ce-work-beta/SKILL.md")
-
-    expect(content).toContain("work_delegate_decision")
-    expect(content).toContain("`auto`")
-    expect(content).toContain("`ask`")
-  })
-
-  test("has frontend design guidance ported from beta", async () => {
-    const content = await readRepoFile("skills/ce-work-beta/SKILL.md")
-
-    expect(content).toContain("**Frontend Design Guidance**")
-    expect(content).toContain("Apply the frontend guidance embedded in this skill")
-  })
-})
-
-describe("ce:plan remains neutral during ce:work-beta rollout", () => {
+describe("ce-plan stays neutral on delegation", () => {
   test("removes delegation-specific execution posture guidance", async () => {
     const content = await readRepoFile("skills/ce-plan/SKILL.md")

diff --git a/tests/plugin-legacy-artifacts.test.ts b/tests/plugin-legacy-artifacts.test.ts
index c6fbbac8c..4b836a711 100644
--- a/tests/plugin-legacy-artifacts.test.ts
+++ b/tests/plugin-legacy-artifacts.test.ts
@@ -50,6 +50,12 @@ describe("plugin legacy artifacts", () => {
     expect(artifacts.prompts).toContain("report-bug.md")
     expect(artifacts.prompts).toContain("workflows-review.md")
     expect(artifacts.prompts).toContain("technical_review.md")
+
+    // ce-work-beta is fully retired: both its skill dir and its Codex
+    // slash-prompt wrapper (~/.codex/prompts/ce-work-beta.md) must be
+    // enumerated so flat installs without a manifest still sweep them.
+    expect(artifacts.skills).toContain("ce-work-beta")
+    expect(artifacts.prompts).toContain("ce-work-beta.md")
   })

   test("Codex legacy detection ignores current bundle skills/agents not in the historical allow-list", () => {
diff --git a/tests/release-metadata.test.ts b/tests/release-metadata.test.ts
index 540d33ac2..72cd26bdf 100644
--- a/tests/release-metadata.test.ts
+++ b/tests/release-metadata.test.ts
@@ -147,7 +147,7 @@ describe("release metadata", () => {

     expect(counts).toEqual({
       agents: 0,
-      skills: 27,
+      skills: 26,
       mcpServers: 0,
     })
   })
diff --git a/tests/review-skill-contract.test.ts b/tests/review-skill-contract.test.ts
index e7c13d90c..172102ee2 100644
--- a/tests/review-skill-contract.test.ts
+++ b/tests/review-skill-contract.test.ts
@@ -625,7 +625,6 @@ describe("ce-code-review contract", () => {
   test("ce-work shipping-workflow enforces a residual-work gate after Tier 2 review", async () => {
     for (const path of [
       "skills/ce-work/references/shipping-workflow.md",
-      "skills/ce-work-beta/references/shipping-workflow.md",
     ]) {
       const workflow = await readRepoFile(path)
       await expect(readRepoFile(path.replace("shipping-workflow.md", "tracker-defer.md"))).resolves.toContain(
diff --git a/tests/skill-conventions.test.ts b/tests/skill-conventions.test.ts
index f79b61fef..53a72f4db 100644
--- a/tests/skill-conventions.test.ts
+++ b/tests/skill-conventions.test.ts
@@ -149,7 +149,6 @@ const EXPECTED_USER_INVOKED_SKILLS = new Set([
   "ce-promote",
   "ce-setup",
   "ce-test-xcode",
-  "ce-work-beta",
   "lfg",
 ])