Commit 6c4a5df

feat(core): add runtime verification and DCR escalation to build workflow
1 parent db46efd commit 6c4a5df

12 files changed

Lines changed: 227 additions & 49 deletions

README.md

Lines changed: 10 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -22,6 +22,8 @@ pb-spec follows a **harness-first** philosophy: reliability comes from process d
2222
| [Plan-and-Solve Prompting](https://arxiv.org/abs/2305.04091) | Plan first to reduce missing-step errors | `design.md` + `tasks.md` are mandatory artifacts |
2323
| [ReAct](https://arxiv.org/abs/2210.03629) | Interleave reasoning and actions with environment feedback | `/pb-build` executes task-by-task with test/tool feedback loops |
2424
| [Reflexion](https://arxiv.org/abs/2303.11366) | Learn from failure signals via iterative retries | Retry/skip/abort and DCR flow in `pb-build` |
25+
| [Harness Engineering (OpenAI, 2026-02-11)](https://openai.com/index/harness-engineering/) | Treat runtime signals and checklists as first-class harness inputs | `pb-plan` requires runtime verification hooks; `pb-build` validates logs/health evidence before task closure |
26+
| [openai/symphony](https://github.com/openai/symphony) | Long-running agents need explicit observability and deterministic escalation | `pb-build` enforces bounded retries and emits standardized DCR packets for `pb-refine` |
2527
| [Effective Harnesses for Long-Running Agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) | Grounding, context hygiene, recovery, observability | State checks, minimal context handoff, task-local rollback guidance |
2628
| [Building Effective Agents](https://www.anthropic.com/engineering/building-effective-agents) | Prefer simple composable workflows over framework complexity | Small adapter-based CLI + explicit workflow prompts |
2729
| [Stop Using /init for AGENTS.md](https://addyosmani.com/blog/agents-md/) | Keep AGENTS.md focused and maintainable | `/pb-init` updates a managed snapshot block in `AGENTS.md` while preserving all user-authored constraints outside that block |
@@ -30,7 +32,9 @@ pb-spec follows a **harness-first** philosophy: reliability comes from process d
3032

3133
- **Context Before Code:** `/pb-init` and `/pb-plan` establish project and requirement context before implementation starts.
3234
- **Verification by Design:** Planning requires explicit verification commands so completion is measurable.
35+
- **Observability as Context:** Service-facing tasks must capture runtime evidence (log tails and/or health probes), not only test output.
3336
- **Strict TDD Execution:** `/pb-build` enforces Red → Green → Refactor with per-task status tracking.
37+
- **Escalation Over Thrashing:** Three consecutive failures suspend the current task and route a standardized DCR packet to `/pb-refine`.
3438
- **Safe Failure Recovery:** Failed attempts use scoped recovery guidance to avoid polluting unrelated workspace state.
3539
- **Composable Architecture:** Platform differences stay in adapters; workflow semantics stay in shared templates.
3640

@@ -140,11 +144,11 @@ The spec directory follows the naming format `YYYY-MM-DD-NO-feature-name` (e.g.,
140144

141145
### 3. `/pb-refine <feature-name>` — Design Iteration (Optional)
142146

143-
Reads user feedback or Design Change Requests (from failed builds) and intelligently updates `design.md` and `tasks.md`. It maintains a revision history and cascades design changes to the task list without overwriting completed work. `AGENTS.md` remains read-only in this phase.
147+
Reads user feedback or Design Change Requests (from failed builds, including standardized 3-failure build-block packets) and intelligently updates `design.md` and `tasks.md`. It maintains a revision history and cascades design changes to the task list without overwriting completed work. `AGENTS.md` remains read-only in this phase.
144148

145149
### 4. `/pb-build <feature-name>` — Subagent-Driven Implementation
146150

147-
Reads `specs/<YYYY-MM-DD-NO-feature-name>/tasks.md` and implements each task sequentially. Every task is executed by a fresh subagent following strict TDD (Red → Green → Refactor). Supports **Design Change Requests** if the planned design proves infeasible during implementation. Only the `<feature-name>` part is needed when invoking — the agent resolves the full directory automatically. `AGENTS.md` is read-only unless the user explicitly requests an `AGENTS.md` change.
151+
Reads `specs/<YYYY-MM-DD-NO-feature-name>/tasks.md` and implements each task sequentially. Every task is executed by a fresh subagent following strict TDD (Red → Green → Refactor), then runtime verification (log/health evidence when applicable). Supports **Design Change Requests** if the planned design proves infeasible during implementation, and auto-escalates to DCR after three consecutive task failures. Only the `<feature-name>` part is needed when invoking — the agent resolves the full directory automatically. `AGENTS.md` is read-only unless the user explicitly requests an `AGENTS.md` change.
148152

149153
## Skills Overview
150154

@@ -168,13 +172,15 @@ pb-spec's prompt design is inspired by Anthropic's research on [Effective Harnes
168172
| **Context Hygiene** | Orchestrator passes only minimal, relevant context to each subagent — preventing context window pollution |
169173
| **Recovery Loop** | Failed tasks use pre-task snapshots + file-scoped recovery (`git restore` + task-local cleanup), and avoid workspace-wide restore in dirty trees |
170174
| **Verification Harness** | Design docs define explicit verification commands at planning time — subagents execute, not invent, verification |
175+
| **Observability as Context** | Task verification includes runtime signals (logs/health) for service-facing work, and build closure requires command-backed evidence |
176+
| **Escalation Loop** | Three consecutive failures trigger task suspension + standardized DCR handoff to `pb-refine` |
171177
| **Agent Rules** | `AGENTS.md` is treated as free-form policy context: `pb-init` manages only its marker block; `pb-plan`/`pb-refine`/`pb-build` read it without rewriting |
172178

173179
### Where Each Principle Lives
174180

175181
- **Worker (Implementer):** `implementer_prompt.md` enforces grounding-first workflow and error quoting
176-
- **Architect (Planner):** `design_template.md` includes Critical Path Verification table
177-
- **Orchestrator (Builder):** `pb-build` SKILL enforces context hygiene and task-local recovery with safe rollback rules
182+
- **Architect (Planner):** `design_template.md` + `tasks_template.md` enforce verification criteria, including runtime signals when applicable
183+
- **Orchestrator (Builder):** `pb-build` SKILL enforces context hygiene, runtime verification gates, bounded retries, and DCR escalation
178184
- **Foundation (Init):** `pb-init` updates only the managed marker block in `AGENTS.md`, preserving all external user-authored constraints
179185

180186
## Development
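
The escalation rule described in this README diff (three consecutive failures on one task, counted as the initial attempt plus up to two retries) can be sketched as a small counter. This is a minimal Python sketch, not code from the pb-spec repository: `TaskAttempts`, `record_failure`, `record_success`, and the `"prompt_user"` / `"escalate"` return values are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Budget taken from the commit text: initial attempt + up to 2 retries.
MAX_CONSECUTIVE_FAILURES = 3


@dataclass
class TaskAttempts:
    """Tracks consecutive failures for one task in one build run (hypothetical helper)."""

    task_id: str
    failures: list[str] = field(default_factory=list)

    def record_failure(self, summary: str) -> str:
        """Store a failed attempt and return the next orchestrator action."""
        self.failures.append(summary)
        if len(self.failures) >= MAX_CONSECUTIVE_FAILURES:
            # Third consecutive failure: suspend the task and hand off a DCR.
            return "escalate"
        # Failure 1 or 2: the user may still choose retry, skip, or abort.
        return "prompt_user"

    def record_success(self) -> None:
        """A passing attempt ends the streak, so the counter resets."""
        self.failures.clear()


attempts = TaskAttempts("Task 1.1")
actions = [attempts.record_failure(f"attempt {n} failed") for n in (1, 2, 3)]
print(actions)
```

Keeping the attempt summaries, and not only a count, means the same object could later supply the "What We Tried" section of the DCR packet.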

pyproject.toml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ build-backend = "uv_build"
44

55
[project]
66
name = "pb-spec"
7-
version = "0.5.0"
7+
version = "0.6.0"
88
description = "Plan-Build Spec (pb-spec): A CLI tool for managing AI coding assistant skills"
99
readme = "README.md"
1010
license = "Apache-2.0"
@@ -42,7 +42,7 @@ testpaths = ["tests"]
4242
dev = [
4343
"pytest>=9.0.2",
4444
"ruff>=0.15.4",
45-
"ty>=0.0.19",
45+
"ty>=0.0.20",
4646
]
4747

4848
[tool.ruff]

src/pb_spec/templates/prompts/pb-build.prompt.md

Lines changed: 45 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -8,8 +8,8 @@ Run this when the user invokes `/pb-build <feature-name>`.
88

99
- Complete unfinished tasks in `tasks.md` sequentially until done or explicitly blocked.
1010
- Use one fresh subagent per task with minimal, task-relevant context only.
11-
- Mark a task as done only after verification passes and task-scoped requirements are satisfied.
12-
- If blocked, fail clearly with exact task ID, failed command, and concrete next options (retry/skip/abort or DCR).
11+
- Mark a task as done only after tests pass, task verification passes, and runtime evidence is captured when applicable.
12+
- If blocked, fail clearly with exact task ID, failed command, and concrete next options (retry/skip/abort within budget, then DCR escalation).
1313

1414
---
1515

@@ -69,10 +69,10 @@ For each unfinished task, in order:
6969
- The `AGENTS.md` (project constraints and hard rules; do not assume any fixed template layout).
7070
- The `design.md` (Feature Spec).
7171
- **Summary of previous tasks** — a one-line-per-task summary (e.g., "Task 1.1 created `models.py` with `User` class."). Do NOT pass raw logs or full outputs.
72-
4. **Subagent executes** the TDD cycle (see Implementer Prompt section).
72+
4. **Subagent executes** the TDD + runtime verification cycle (see Implementer Prompt section).
7373
5. **Mark completed** — update `- [ ]` to `- [x]` and Status to `🟢 DONE` in `tasks.md`.
7474
- **Use precise editing:** Use `sed`, string-replacement, or line-targeted edits to update the specific Task ID heading and its checkboxes. Do NOT rewrite the entire `tasks.md` file — this risks truncation and content loss in large files.
75-
- **Completion gate:** Mark done only when task Verification is satisfied and tests are green.
75+
- **Completion gate:** Mark done only when task Verification is satisfied, tests are green, and runtime checks (when applicable) are evidence-backed.
7676

7777
> **⚠️ Context Reset:** After completing all tasks (or when context grows large), output: "Recommend starting a fresh session. Run `/pb-build <feature-name>` again to continue from where you left off."
7878
@@ -87,10 +87,36 @@ If a subagent fails:
8787
- If pre-task workspace was dirty: do NOT run workspace-wide restore commands. Report file-level cleanup options and wait for user choice.
8888
4. **Report** the failure — which task, what went wrong, specific error output.
8989
- Include the exact failing command and a short quoted error excerpt.
90-
5. Prompt the user:
91-
- **Retry** — new subagent, fresh context, pass previous error as a hint constraint. Maximum 2 retries per task.
90+
5. **Track consecutive failures per task** (same task, same build run).
91+
- Allowed budget is **3 consecutive failures total**: initial attempt + up to 2 retries.
92+
6. **If failure count is 1 or 2**, prompt the user:
93+
- **Retry** — new subagent, fresh context, pass previous error as a hint constraint.
9294
- **Skip** — mark as `⏭️ SKIPPED`, move to next task.
9395
- **Abort** — stop the build, report progress so far.
96+
7. **If failure count reaches 3**, suspend the task and stop the build loop. Do not continue to later tasks. Output a standardized DCR packet:
97+
98+
```text
99+
🛑 Build Blocked — Task X.Y: [Task Name]
100+
Reason: 3 consecutive failed attempts (initial + 2 retries)
101+
102+
What We Tried:
103+
- Attempt 1: [summary]
104+
- Attempt 2: [summary]
105+
- Attempt 3: [summary]
106+
107+
Failure Evidence:
108+
- [command] -> "[error excerpt]"
109+
- [command] -> "[error excerpt]"
110+
111+
Suggested Design Change:
112+
- [What should change in design.md/tasks.md]
113+
114+
Impact:
115+
- [Which tasks are affected]
116+
117+
Next Action:
118+
- Run /pb-refine <feature-name> with this block, then re-run /pb-build <feature-name>.
119+
```
94120

95121
### Design Change Requests
96122

@@ -103,12 +129,14 @@ If during implementation a subagent discovers that the design is **infeasible or
103129
🔄 Design Change Request — Task X.Y: [Task Name]
104130
105131
Problem: [What is infeasible and why]
132+
What We Tried: [Attempt summaries and failed commands]
133+
Failure Evidence: [Quoted errors from failed attempts]
106134
Suggested Change: [What should change in design.md]
107135
Impact: [Which other tasks are affected]
108136
```
109137

110138
3. The orchestrator pauses the build, reports the DCR to the user, and awaits a decision:
111-
- **Accept** — user updates `design.md` (or approves the suggested change), then retries the task.
139+
- **Accept** — user runs `/pb-refine <feature-name>` (or manually updates `design.md`/`tasks.md`), then retries the task.
112140
- **Override** — user provides an alternative approach.
113141
- **Abort** — stop the build.
114142

@@ -184,17 +212,21 @@ Update `tasks.md` in-place after each task using **precise edits** (target the s
184212
- Rewrite the entire `tasks.md` file — use targeted edits only.
185213
- Mark a task as done without satisfying its Verification criteria.
186214
- Claim tests passed without running them.
215+
- Exceed the retry budget (initial attempt + 2 retries) for a single task in one build run.
216+
- Continue to later tasks after the third consecutive failure on the current task.
187217

188218
### ALWAYS
189219

190220
- Mark completed tasks in `tasks.md` immediately.
191221
- Capture a pre-task workspace snapshot before spawning subagents.
192222
- Self-review before submitting each task.
193223
- Run full test suite after each task.
194-
- Report failures with retry/skip/abort options.
224+
- Run runtime verification checks for runtime-facing tasks and capture evidence (logs/probes).
225+
- Report failures with retry/skip/abort options within retry budget, then escalate to DCR.
195226
- Follow YAGNI — only implement what the task requires.
196227
- Use existing project patterns and conventions.
197228
- File a Design Change Request if the design is infeasible.
229+
- Suspend and escalate with a standardized DCR packet after 3 consecutive failures.
198230
- Report command-backed outcomes (what ran, what failed, what passed).
199231

200232
---
@@ -210,6 +242,7 @@ Update `tasks.md` in-place after each task using **precise edits** (target the s
210242
7. **Fail fast, recover cleanly.** Use task-local rollback from the pre-task snapshot. Avoid workspace-wide resets in dirty trees.
211243
8. **Context hygiene.** Pass minimal, relevant context. Summarize — don't dump.
212244
9. **Evidence over assertion.** Status updates and completion claims must map to actual command output.
245+
10. **Escalate deterministically.** After three consecutive failures, stop thrashing and route to `pb-refine` with a structured DCR.
213246

214247
---
215248

@@ -262,6 +295,7 @@ Before writing any code, verify the current workspace state:
262295
| **Confirm RED** | Run test suite. **Quote the error.** Classify: expected failure (proceed) vs bad failure (fix test first). | Failure confirmed |
263296
| **GREEN** | Write minimum implementation. Only edit files you read in Step 1. | Only what's needed |
264297
| **Confirm GREEN** | Run full test suite. If failure: read error, read code, then fix — do not blind-fix. | ALL tests pass |
298+
| **Runtime Verification (if applicable)** | Run runtime checks from task Verification, capture logs + probe output (or explicit `N/A` reason). | Runtime evidence captured |
265299
| **REFACTOR** | Clean up if needed | ALL tests still pass |
266300
| **SCOPE CHECK** | Confirm implemented changes match task contract and nothing extra. | Task scope respected |
267301

@@ -293,6 +327,8 @@ Fix any "no" answers before submitting.
293327
294328
### Verification
295329
- [How verification criterion was met]
330+
- Runtime logs: [command + key output, or `N/A` with reason]
331+
- Runtime probe: [command + key output/status, or `N/A` with reason]
296332
- Test suite: X passed, 0 failed
297333
298334
### Commands Run
@@ -311,6 +347,7 @@ Fix any "no" answers before submitting.
311347
- Do not modify, delete, or reformat `AGENTS.md` unless the user explicitly requests an `AGENTS.md` change.
312348
- Do not modify unrelated code.
313349
- Tests are mandatory — never submit without them.
350+
- Runtime evidence is mandatory when applicable — do not claim completion without logs/probe evidence for runtime-facing tasks.
314351
- **No Blind Edits:** Always read a file before editing it.
315352
- **Verify Imports:** Check dependency files before importing third-party libs.
316353
- **Quote Errors:** Always quote specific error messages before attempting fixes.
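
The standardized "Build Blocked" packet added to this prompt is plain text with fixed section headings, so it could be assembled mechanically from the recorded attempts. The following is a minimal sketch under that assumption. `render_build_blocked_packet` and its parameters are hypothetical and not part of pb-spec, and the task, commands, and error excerpts in the usage lines are made-up sample data.

```python
def render_build_blocked_packet(
    task_id: str,
    task_name: str,
    attempts: list[str],
    evidence: list[tuple[str, str]],
    suggested_changes: list[str],
    impacted_tasks: list[str],
    feature_name: str,
) -> str:
    """Render the 3-failure packet in the layout shown in the prompt (hypothetical helper)."""
    lines = [
        # \U0001F6D1 is the stop-sign emoji and \u2014 the dash used in the template header.
        f"\U0001F6D1 Build Blocked \u2014 Task {task_id}: {task_name}",
        "Reason: 3 consecutive failed attempts (initial + 2 retries)",
        "",
        "What We Tried:",
        *[f"- Attempt {n}: {summary}" for n, summary in enumerate(attempts, start=1)],
        "",
        "Failure Evidence:",
        *[f'- {command} -> "{excerpt}"' for command, excerpt in evidence],
        "",
        "Suggested Design Change:",
        *[f"- {change}" for change in suggested_changes],
        "",
        "Impact:",
        *[f"- {task}" for task in impacted_tasks],
        "",
        "Next Action:",
        f"- Run /pb-refine {feature_name} with this block, then re-run /pb-build {feature_name}.",
    ]
    return "\n".join(lines)


# Sample data below is invented purely to exercise the function.
packet = render_build_blocked_packet(
    task_id="2.3",
    task_name="Add health endpoint",
    attempts=["route added, test 404", "router registered, test 500", "handler rewritten, test 500"],
    evidence=[("pytest tests/test_health.py", "AssertionError: 500 != 200")],
    suggested_changes=["Define the app factory in design.md before Task 2.3"],
    impacted_tasks=["Task 2.4"],
    feature_name="health-check",
)
print(packet)
```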

src/pb_spec/templates/prompts/pb-plan.prompt.md

Lines changed: 16 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -158,6 +158,12 @@ Write a **flat task list** to `specs/<spec-dir>/tasks.md`:
158158
- [ ] Verification: ...
159159
```
160160

161+
For lightweight tasks that introduce or change runtime behavior (service startup, UI runtime flow, API availability, performance-critical paths), include runtime observability checks in `Verification`:
162+
163+
- Capture recent runtime logs (for example `tail -n 50 app.log` or project-equivalent command).
164+
- Capture a live probe result (for example `curl http://localhost:8080/health` or project-equivalent endpoint).
165+
- If runtime checks are not applicable, explicitly write `N/A` with the reason.
166+
161167
**Skip** phases, Summary & Timeline table, and Definition of Done boilerplate for lightweight specs.
162168

163169
## Step 5b: Output tasks.md — Full Mode (≥ 50 words)
@@ -173,6 +179,10 @@ Remove all instructional placeholder text (such as bracket examples) in the fina
173179
- **Task ID format:** Each task MUST have a unique ID: `Task X.Y` (e.g., `Task 1.1`, `Task 2.3`).
174180
- Ordered by dependency — no task references work from a later task.
175181
- Every task has a concrete **Verification** criterion.
182+
- For tasks that introduce or change runtime behavior (service startup, UI runtime flow, API/network availability, performance-sensitive code paths), **Verification must include runtime observability checks**:
183+
- Recent runtime logs (for example `tail -n 50 app.log` or equivalent).
184+
- A live health/probe command (for example `curl http://localhost:8080/health` or equivalent).
185+
- If not applicable, explicitly mark `N/A` with a reason.
176186
- **Reference reusable components** in task Context when the task should extend or use existing code.
177187
- Ensure every requirement from the Step 1 checklist is covered by at least one task or explicitly marked out-of-scope.
178188

@@ -202,7 +212,7 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
202212
3. **Right-sized output (YAGNI).** Match output detail to requirement complexity. Simple changes get compact specs; complex features get full specs.
203213
4. **Live codebase analysis.** Always search the actual codebase. Use `AGENTS.md` as complementary policy context, not a replacement for code inspection.
204214
5. **Task granularity: Logical Unit of Work.** Each task is a self-contained, meaningful change. Do not split based on arbitrary time estimates.
205-
6. **Verification per task.** Every task defines how to prove it is done.
215+
6. **Verification per task.** Every task defines how to prove it is done; runtime-facing tasks include runtime observability evidence.
206216
7. **Dependency order.** Phases and tasks flow foundational → dependent.
207217
8. **Project-aware.** Use existing conventions, patterns, and tech stack. Reuse existing components — do not reinvent.
208218
9. **Requirements coverage.** Track every requirement from input to design sections and tasks.
@@ -432,6 +442,7 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
432442
- [ ] **Step 1:** ...
433443
- [ ] **Step 2:** ...
434444
- [ ] **Verification:** [Concrete check]
445+
- [ ] **Runtime Verification (if applicable):** [Capture runtime signals — e.g., `tail -n 50 app.log` and `curl http://localhost:8080/health`; if not applicable, write `N/A` with reason]
435446

436447
---
437448

@@ -449,6 +460,7 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
449460
- [ ] **Step 1:** ...
450461
- [ ] **Step 2:** ...
451462
- [ ] **Verification:** ...
463+
- [ ] **Runtime Verification (if applicable):** [Logs + probe result, or `N/A` with reason]
452464

453465
---
454466

@@ -466,6 +478,7 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
466478
- [ ] **Step 1:** ...
467479
- [ ] **Step 2:** ...
468480
- [ ] **Verification:** ...
481+
- [ ] **Runtime Verification (if applicable):** [Logs + probe result, or `N/A` with reason]
469482

470483
---
471484

@@ -483,6 +496,7 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
483496
- [ ] **Step 1:** ...
484497
- [ ] **Step 2:** ...
485498
- [ ] **Verification:** ...
499+
- [ ] **Runtime Verification (if applicable):** [Logs + probe result, or `N/A` with reason]
486500

487501
---
488502

@@ -502,4 +516,5 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
502516
2. [ ] **Tested:** Unit tests covering added logic.
503517
3. [ ] **Formatted:** Code formatter applied.
504518
4. [ ] **Verified:** Task's specific Verification criterion met.
519+
5. [ ] **Runtime-Evidenced (when applicable):** Runtime logs and health/probe results are captured, or `N/A` is explicitly justified.
505520
```
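
The runtime verification rule in this prompt (capture a log tail and a live probe, or write `N/A` with a reason) amounts to running a command and recording what it printed. The sketch below shows one way that evidence line could be produced. It is an illustration, not pb-spec code: `capture_evidence` and its output format are assumptions, and the usage lines run the current Python interpreter as a stand-in for a project command such as the `tail -n 50 app.log` or `curl http://localhost:8080/health` examples from the prompt.

```python
import subprocess
import sys
from typing import Optional


def capture_evidence(
    label: str,
    command: Optional[list[str]] = None,
    na_reason: Optional[str] = None,
    timeout: float = 10,
) -> str:
    """Return one evidence line: command plus key output, or N/A with a reason."""
    if command is None:
        if not na_reason:
            # The prompt requires an explicit reason whenever a check is skipped.
            raise ValueError("a skipped runtime check needs an explicit N/A reason")
        return f"- {label}: N/A ({na_reason})"
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    output = (result.stdout or result.stderr).strip()
    last_line = output.splitlines()[-1] if output else "<no output>"
    return f"- {label}: `{' '.join(command)}` -> exit {result.returncode}, last line: {last_line}"


# Stand-in probe so the example runs anywhere; a real task would use its own command.
probe = capture_evidence("Runtime probe", [sys.executable, "-c", "print('healthy')"])
logs = capture_evidence("Runtime logs", na_reason="task adds no long-running service")
print(probe)
print(logs)
```

Recording the exit code next to the output keeps the evidence command-backed: a reviewer can see both what ran and whether it succeeded.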
