longcipher
diff --git a/README.md b/README.md
Lines changed: 10 additions & 4 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 2 additions & 2 deletions
diff --git a/src/pb_spec/templates/prompts/pb-build.prompt.md b/src/pb_spec/templates/prompts/pb-build.prompt.md
Lines changed: 45 additions & 8 deletions
diff --git a/src/pb_spec/templates/prompts/pb-plan.prompt.md b/src/pb_spec/templates/prompts/pb-plan.prompt.md
Lines changed: 16 additions & 1 deletion
@@ -22,6 +22,8 @@ pb-spec follows a **harness-first** philosophy: reliability comes from process d
 | [Plan-and-Solve Prompting](https://arxiv.org/abs/2305.04091) | Plan first to reduce missing-step errors | `design.md` + `tasks.md` are mandatory artifacts |
 | [ReAct](https://arxiv.org/abs/2210.03629) | Interleave reasoning and actions with environment feedback | `/pb-build` executes task-by-task with test/tool feedback loops |
 | [Reflexion](https://arxiv.org/abs/2303.11366) | Learn from failure signals via iterative retries | Retry/skip/abort and DCR flow in `pb-build` |
+| [Harness Engineering (OpenAI, 2026-02-11)](https://openai.com/index/harness-engineering/) | Treat runtime signals and checklists as first-class harness inputs | `pb-plan` requires runtime verification hooks; `pb-build` validates logs/health evidence before task closure |
+| [openai/symphony](https://github.com/openai/symphony) | Long-running agents need explicit observability and deterministic escalation | `pb-build` enforces bounded retries and emits standardized DCR packets for `pb-refine` |
 | [Effective Harnesses for Long-Running Agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) | Grounding, context hygiene, recovery, observability | State checks, minimal context handoff, task-local rollback guidance |
 | [Building Effective Agents](https://www.anthropic.com/engineering/building-effective-agents) | Prefer simple composable workflows over framework complexity | Small adapter-based CLI + explicit workflow prompts |
 | [Stop Using /init for AGENTS.md](https://addyosmani.com/blog/agents-md/) | Keep AGENTS.md focused and maintainable | `/pb-init` updates a managed snapshot block in `AGENTS.md` while preserving all user-authored constraints outside that block |
@@ -30,7 +32,9 @@ pb-spec follows a **harness-first** philosophy: reliability comes from process d
 
 - **Context Before Code:** `/pb-init` and `/pb-plan` establish project and requirement context before implementation starts.
 - **Verification by Design:** Planning requires explicit verification commands so completion is measurable.
+- **Observability as Context:** Service-facing tasks must capture runtime evidence (log tails and/or health probes), not only test output.
 - **Strict TDD Execution:** `/pb-build` enforces Red → Green → Refactor with per-task status tracking.
+- **Escalation Over Thrashing:** Three consecutive failures suspend the current task and route a standardized DCR packet to `/pb-refine`.
 - **Safe Failure Recovery:** Failed attempts use scoped recovery guidance to avoid polluting unrelated workspace state.
 - **Composable Architecture:** Platform differences stay in adapters; workflow semantics stay in shared templates.
 
@@ -140,11 +144,11 @@ The spec directory follows the naming format `YYYY-MM-DD-NO-feature-name` (e.g.,
 
 ### 3. `/pb-refine <feature-name>` — Design Iteration (Optional)
 
-Reads user feedback or Design Change Requests (from failed builds) and intelligently updates `design.md` and `tasks.md`. It maintains a revision history and cascades design changes to the task list without overwriting completed work. `AGENTS.md` remains read-only in this phase.
+Reads user feedback or Design Change Requests (from failed builds, including standardized 3-failure build-block packets) and intelligently updates `design.md` and `tasks.md`. It maintains a revision history and cascades design changes to the task list without overwriting completed work. `AGENTS.md` remains read-only in this phase.
 
 ### 4. `/pb-build <feature-name>` — Subagent-Driven Implementation
 
-Reads `specs/<YYYY-MM-DD-NO-feature-name>/tasks.md` and implements each task sequentially. Every task is executed by a fresh subagent following strict TDD (Red → Green → Refactor). Supports **Design Change Requests** if the planned design proves infeasible during implementation. Only the `<feature-name>` part is needed when invoking — the agent resolves the full directory automatically. `AGENTS.md` is read-only unless the user explicitly requests an `AGENTS.md` change.
+Reads `specs/<YYYY-MM-DD-NO-feature-name>/tasks.md` and implements each task sequentially. Every task is executed by a fresh subagent following strict TDD (Red → Green → Refactor), then runtime verification (log/health evidence when applicable). Supports **Design Change Requests** if the planned design proves infeasible during implementation, and auto-escalates to DCR after three consecutive task failures. Only the `<feature-name>` part is needed when invoking — the agent resolves the full directory automatically. `AGENTS.md` is read-only unless the user explicitly requests an `AGENTS.md` change.
 
 ## Skills Overview
 
@@ -168,13 +172,15 @@ pb-spec's prompt design is inspired by Anthropic's research on [Effective Harnes
 | **Context Hygiene** | Orchestrator passes only minimal, relevant context to each subagent — preventing context window pollution |
 | **Recovery Loop** | Failed tasks use pre-task snapshots + file-scoped recovery (`git restore` + task-local cleanup), and avoid workspace-wide restore in dirty trees |
 | **Verification Harness** | Design docs define explicit verification commands at planning time — subagents execute, not invent, verification |
+| **Observability as Context** | Task verification includes runtime signals (logs/health) for service-facing work, and build closure requires command-backed evidence |
+| **Escalation Loop** | Three consecutive failures trigger task suspension + standardized DCR handoff to `pb-refine` |
 | **Agent Rules** | `AGENTS.md` is treated as free-form policy context: `pb-init` manages only its marker block; `pb-plan`/`pb-refine`/`pb-build` read it without rewriting |
 
 ### Where Each Principle Lives
 
 - **Worker (Implementer):** `implementer_prompt.md` enforces grounding-first workflow and error quoting
-- **Architect (Planner):** `design_template.md` includes Critical Path Verification table
-- **Orchestrator (Builder):** `pb-build` SKILL enforces context hygiene and task-local recovery with safe rollback rules
+- **Architect (Planner):** `design_template.md` + `tasks_template.md` enforce verification criteria, including runtime signals when applicable
+- **Orchestrator (Builder):** `pb-build` SKILL enforces context hygiene, runtime verification gates, bounded retries, and DCR escalation
 - **Foundation (Init):** `pb-init` updates only the managed marker block in `AGENTS.md`, preserving all external user-authored constraints
 
 ## Development
 
@@ -4,7 +4,7 @@ build-backend = "uv_build"
 
 [project]
 name = "pb-spec"
-version = "0.5.0"
+version = "0.6.0"
 description = "Plan-Build Spec (pb-spec): A CLI tool for managing AI coding assistant skills"
 readme = "README.md"
 license = "Apache-2.0"
@@ -42,7 +42,7 @@ testpaths = ["tests"]
 dev = [
     "pytest>=9.0.2",
     "ruff>=0.15.4",
-    "ty>=0.0.19",
+    "ty>=0.0.20",
 ]
 
 [tool.ruff]
 
@@ -8,8 +8,8 @@ Run this when the user invokes `/pb-build <feature-name>`.
 
 - Complete unfinished tasks in `tasks.md` sequentially until done or explicitly blocked.
 - Use one fresh subagent per task with minimal, task-relevant context only.
-- Mark a task as done only after verification passes and task-scoped requirements are satisfied.
-- If blocked, fail clearly with exact task ID, failed command, and concrete next options (retry/skip/abort or DCR).
+- Mark a task as done only after tests pass, task verification passes, and runtime evidence is captured when applicable.
+- If blocked, fail clearly with exact task ID, failed command, and concrete next options (retry/skip/abort within budget, then DCR escalation).
 
 ---
 
@@ -69,10 +69,10 @@ For each unfinished task, in order:
    - The `AGENTS.md` (project constraints and hard rules; do not assume any fixed template layout).
    - The `design.md` (Feature Spec).
    - **Summary of previous tasks** — a one-line-per-task summary (e.g., "Task 1.1 created `models.py` with `User` class."). Do NOT pass raw logs or full outputs.
-4. **Subagent executes** the TDD cycle (see Implementer Prompt section).
+4. **Subagent executes** the TDD + runtime verification cycle (see Implementer Prompt section).
 5. **Mark completed** — update `- [ ]` to `- [x]` and Status to `🟢 DONE` in `tasks.md`.
    - **Use precise editing:** Use `sed`, string-replacement, or line-targeted edits to update the specific Task ID heading and its checkboxes. Do NOT rewrite the entire `tasks.md` file — this risks truncation and content loss in large files.
-   - **Completion gate:** Mark done only when task Verification is satisfied and tests are green.
+   - **Completion gate:** Mark done only when task Verification is satisfied, tests are green, and runtime checks (when applicable) are evidence-backed.
 
 > **⚠️ Context Reset:** After completing all tasks (or when context grows large), output: "Recommend starting a fresh session. Run `/pb-build <feature-name>` again to continue from where you left off."
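Reviewer sketch (not part of the patch): the "use precise editing" rule in step 5 above could look roughly like the following. It assumes task headings have the shape `### Task 1.1: ...`; the diff only fixes the `Task X.Y` ID format and the `- [ ]` checkboxes, so the heading layout is an assumption. `mark_task_done` is a hypothetical helper, not a pb-spec API, and it leaves the `Status` field alone because that field's exact layout is not shown in this diff.

```python
import re

# Assumed heading shape, e.g. "### Task 1.1: Create models" (hypothetical layout).
TASK_HEADING = re.compile(r"^#+\s.*?\bTask\s+(\d+\.\d+)\b")


def mark_task_done(tasks_md: str, task_id: str) -> str:
    """Flip `- [ ]` to `- [x]` only inside the section for `task_id`."""
    in_target = False
    found = False
    out = []
    for line in tasks_md.splitlines(keepends=True):
        if line.startswith("#"):
            # Any heading closes the previous section; only the matching
            # task heading opens the section we are allowed to edit.
            match = TASK_HEADING.match(line)
            in_target = match is not None and match.group(1) == task_id
            found = found or in_target
        elif in_target and line.lstrip().startswith("- [ ]"):
            line = line.replace("- [ ]", "- [x]", 1)
        out.append(line)
    if not found:
        raise ValueError(f"Task {task_id} not found in tasks.md")
    return "".join(out)
```

Every line outside the target section is copied through unchanged, which is the property the prompt asks for when it forbids rewriting the whole file.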
 
@@ -87,10 +87,36 @@ If a subagent fails:
    - If pre-task workspace was dirty: do NOT run workspace-wide restore commands. Report file-level cleanup options and wait for user choice.
 4. **Report** the failure — which task, what went wrong, specific error output.
    - Include the exact failing command and a short quoted error excerpt.
-5. Prompt the user:
-   - **Retry** — new subagent, fresh context, pass previous error as a hint constraint. Maximum 2 retries per task.
+5. **Track consecutive failures per task** (same task, same build run).
+   - Allowed budget is **3 consecutive failures total**: initial attempt + up to 2 retries.
+6. **If failure count is 1 or 2**, prompt the user:
+   - **Retry** — new subagent, fresh context, pass previous error as a hint constraint.
    - **Skip** — mark as `⏭️ SKIPPED`, move to next task.
    - **Abort** — stop the build, report progress so far.
+7. **If failure count reaches 3**, suspend the task and stop the build loop. Do not continue to later tasks. Output a standardized DCR packet:
+
+   ```text
+   🛑 Build Blocked — Task X.Y: [Task Name]
+   Reason: 3 consecutive failed attempts (initial + 2 retries)
+
+   What We Tried:
+   - Attempt 1: [summary]
+   - Attempt 2: [summary]
+   - Attempt 3: [summary]
+
+   Failure Evidence:
+   - [command] -> "[error excerpt]"
+   - [command] -> "[error excerpt]"
+
+   Suggested Design Change:
+   - [What should change in design.md/tasks.md]
+
+   Impact:
+   - [Which tasks are affected]
+
+   Next Action:
+   - Run /pb-refine <feature-name> with this block, then re-run /pb-build <feature-name>.
+   ```
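Reviewer sketch (not part of the patch): the retry budget in steps 5 to 7 above amounts to a small per-task counter. The example below is one way to express it, assuming the orchestrator keeps state in memory for a single build run. `FailureTracker`, `NextStep`, and `MAX_CONSECUTIVE_FAILURES` are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum

# Budget from the prompt: initial attempt + up to 2 retries.
MAX_CONSECUTIVE_FAILURES = 3


class NextStep(Enum):
    PROMPT_USER = "prompt_user"    # offer retry / skip / abort
    ESCALATE_DCR = "escalate_dcr"  # suspend the task and stop the build loop


@dataclass
class FailureTracker:
    """Counts consecutive failures per task within one build run."""

    counts: dict = field(default_factory=dict)

    def record_failure(self, task_id: str) -> NextStep:
        count = self.counts.get(task_id, 0) + 1
        self.counts[task_id] = count
        if count >= MAX_CONSECUTIVE_FAILURES:
            return NextStep.ESCALATE_DCR
        return NextStep.PROMPT_USER

    def record_success(self, task_id: str) -> None:
        # A success breaks the streak, so the counter starts over.
        self.counts.pop(task_id, None)
```

The first and second failures return `NextStep.PROMPT_USER`; the third returns `NextStep.ESCALATE_DCR`, at which point the prompt requires the build-blocked packet and no further tasks.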
 
 ### Design Change Requests
 
@@ -103,12 +129,14 @@ If during implementation a subagent discovers that the design is **infeasible or
    🔄 Design Change Request — Task X.Y: [Task Name]
 
    Problem: [What is infeasible and why]
+   What We Tried: [Attempt summaries and failed commands]
+   Failure Evidence: [Quoted errors from failed attempts]
    Suggested Change: [What should change in design.md]
    Impact: [Which other tasks are affected]
    ```
 
 3. The orchestrator pauses the build, reports the DCR to the user, and awaits a decision:
-   - **Accept** — user updates `design.md` (or approves the suggested change), then retries the task.
+   - **Accept** — user runs `/pb-refine <feature-name>` (or manually updates `design.md`/`tasks.md`), then retries the task.
    - **Override** — user provides an alternative approach.
    - **Abort** — stop the build.
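Reviewer sketch (not part of the patch): with the two new fields, the DCR block above can be assembled from structured attempt records, which keeps "What We Tried" and "Failure Evidence" tied to real commands. `Attempt` and `render_dcr` are hypothetical names, and joining several attempts with `; ` on one line is an assumption, since the template only shows bracketed placeholders.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Attempt:
    summary: str  # one-line description of what was tried
    command: str  # exact failing command
    error: str    # short quoted error excerpt


def render_dcr(task_id: str, task_name: str, problem: str,
               attempts: Sequence[Attempt], suggested_change: str,
               impact: str) -> str:
    """Render the DCR text block in the field order used by the prompt."""
    tried = "; ".join(
        f"Attempt {i}: {a.summary} ({a.command})"
        for i, a in enumerate(attempts, 1)
    )
    evidence = "; ".join(f'{a.command} -> "{a.error}"' for a in attempts)
    return "\n".join([
        # \u2014 is the dash character used in the prompt's template title.
        f"🔄 Design Change Request \u2014 Task {task_id}: {task_name}",
        "",
        f"Problem: {problem}",
        f"What We Tried: {tried}",
        f"Failure Evidence: {evidence}",
        f"Suggested Change: {suggested_change}",
        f"Impact: {impact}",
    ])
```

Deriving both new fields from the same `Attempt` records means the evidence line cannot mention a command that was never recorded as an attempt.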
 
@@ -184,17 +212,21 @@ Update `tasks.md` in-place after each task using **precise edits** (target the s
 - Rewrite the entire `tasks.md` file — use targeted edits only.
 - Mark a task as done without satisfying its Verification criteria.
 - Claim tests passed without running them.
+- Exceed the retry budget (initial attempt + 2 retries) for a single task in one build run.
+- Continue to later tasks after the third consecutive failure on the current task.
 
 ### ALWAYS
 
 - Mark completed tasks in `tasks.md` immediately.
 - Capture a pre-task workspace snapshot before spawning subagents.
 - Self-review before submitting each task.
 - Run full test suite after each task.
-- Report failures with retry/skip/abort options.
+- Run runtime verification checks for runtime-facing tasks and capture evidence (logs/probes).
+- Report failures with retry/skip/abort options within retry budget, then escalate to DCR.
 - Follow YAGNI — only implement what the task requires.
 - Use existing project patterns and conventions.
 - File a Design Change Request if the design is infeasible.
+- Suspend and escalate with a standardized DCR packet after 3 consecutive failures.
 - Report command-backed outcomes (what ran, what failed, what passed).
 
 ---
@@ -210,6 +242,7 @@ Update `tasks.md` in-place after each task using **precise edits** (target the s
 7. **Fail fast, recover cleanly.** Use task-local rollback from the pre-task snapshot. Avoid workspace-wide resets in dirty trees.
 8. **Context hygiene.** Pass minimal, relevant context. Summarize — don't dump.
 9. **Evidence over assertion.** Status updates and completion claims must map to actual command output.
+10. **Escalate deterministically.** After three consecutive failures, stop thrashing and route to `pb-refine` with a structured DCR.
 
 ---
 
@@ -262,6 +295,7 @@ Before writing any code, verify the current workspace state:
 | **Confirm RED** | Run test suite. **Quote the error.** Classify: expected failure (proceed) vs bad failure (fix test first). | Failure confirmed |
 | **GREEN** | Write minimum implementation. Only edit files you read in Step 1. | Only what's needed |
 | **Confirm GREEN** | Run full test suite. If failure: read error, read code, then fix — do not blind-fix. | ALL tests pass |
+| **Runtime Verification (if applicable)** | Run runtime checks from task Verification, capture logs + probe output (or explicit `N/A` reason). | Runtime evidence captured |
 | **REFACTOR** | Clean up if needed | ALL tests still pass |
 | **SCOPE CHECK** | Confirm implemented changes match task contract and nothing extra. | Task scope respected |
 
@@ -293,6 +327,8 @@ Fix any "no" answers before submitting.
 
 ### Verification
 - [How verification criterion was met]
+- Runtime logs: [command + key output, or `N/A` with reason]
+- Runtime probe: [command + key output/status, or `N/A` with reason]
 - Test suite: X passed, 0 failed
 
 ### Commands Run
@@ -311,6 +347,7 @@ Fix any "no" answers before submitting.
 - Do not modify, delete, or reformat `AGENTS.md` unless the user explicitly requests an `AGENTS.md` change.
 - Do not modify unrelated code.
 - Tests are mandatory — never submit without them.
+- Runtime evidence is mandatory when applicable — do not claim completion without logs/probe evidence for runtime-facing tasks.
 - **No Blind Edits:** Always read a file before editing it.
 - **Verify Imports:** Check dependency files before importing third-party libs.
 - **Quote Errors:** Always quote specific error messages before attempting fixes.
 
@@ -158,6 +158,12 @@ Write a **flat task list** to `specs/<spec-dir>/tasks.md`:
 - [ ] Verification: ...
 ```
 
+For lightweight tasks that introduce or change runtime behavior (service startup, UI runtime flow, API availability, performance-critical paths), include runtime observability checks in `Verification`:
+
+- Capture recent runtime logs (for example `tail -n 50 app.log` or project-equivalent command).
+- Capture a live probe result (for example `curl http://localhost:8080/health` or project-equivalent endpoint).
+- If runtime checks are not applicable, explicitly write `N/A` with the reason.
+
 **Skip** phases, Summary & Timeline table, and Definition of Done boilerplate for lightweight specs.
 
 ## Step 5b: Output tasks.md — Full Mode (≥ 50 words)
@@ -173,6 +179,10 @@ Remove all instructional placeholder text (such as bracket examples) in the fina
 - **Task ID format:** Each task MUST have a unique ID: `Task X.Y` (e.g., `Task 1.1`, `Task 2.3`).
 - Ordered by dependency — no task references work from a later task.
 - Every task has a concrete **Verification** criterion.
+- For tasks that introduce or change runtime behavior (service startup, UI runtime flow, API/network availability, performance-sensitive code paths), **Verification must include runtime observability checks**:
+  - Recent runtime logs (for example `tail -n 50 app.log` or equivalent).
+  - A live health/probe command (for example `curl http://localhost:8080/health` or equivalent).
+  - If not applicable, explicitly mark `N/A` with a reason.
 - **Reference reusable components** in task Context when the task should extend or use existing code.
 - Ensure every requirement from the Step 1 checklist is covered by at least one task or explicitly marked out-of-scope.
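Reviewer sketch (not part of the patch): the runtime observability rule added above asks for a log tail, a live probe, or an explicit `N/A` with a reason. A minimal Python equivalent of the `tail -n 50 app.log` and `curl .../health` examples might look like this. `tail`, `http_probe`, and `evidence_line` are hypothetical helpers, and the `- Label: value` output format is an assumption.

```python
import urllib.error
import urllib.request
from collections import deque
from pathlib import Path
from typing import Callable, Optional


def tail(path: Path, n: int = 50) -> str:
    """Return the last `n` lines of a text file, similar to `tail -n`."""
    with path.open(encoding="utf-8", errors="replace") as fh:
        return "".join(deque(fh, maxlen=n))


def http_probe(url: str, timeout: float = 5.0) -> str:
    """Probe an endpoint and describe the outcome instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"GET {url} -> {resp.status}"
    except urllib.error.HTTPError as exc:  # 4xx/5xx still count as evidence
        return f"GET {url} -> {exc.code}"
    except urllib.error.URLError as exc:
        return f"GET {url} -> unreachable ({exc.reason})"


def evidence_line(label: str, capture: Optional[Callable[[], str]],
                  na_reason: str = "") -> str:
    """Build one evidence line; N/A is accepted only with a reason."""
    if capture is None:
        if not na_reason.strip():
            raise ValueError(f"{label}: N/A needs an explicit reason")
        return f"- {label}: N/A ({na_reason})"
    return f"- {label}: {capture().strip()}"
```

A task would call `evidence_line("Runtime logs", lambda: tail(Path("app.log")))` and `evidence_line("Runtime probe", lambda: http_probe("http://localhost:8080/health"))`, using the same example path and URL as the prompt. Refusing a bare `N/A` mirrors the "explicitly mark `N/A` with a reason" requirement.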
 
@@ -202,7 +212,7 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
 3. **Right-sized output (YAGNI).** Match output detail to requirement complexity. Simple changes get compact specs; complex features get full specs.
 4. **Live codebase analysis.** Always search the actual codebase. Use `AGENTS.md` as complementary policy context, not a replacement for code inspection.
 5. **Task granularity: Logical Unit of Work.** Each task is a self-contained, meaningful change. Do not split based on arbitrary time estimates.
-6. **Verification per task.** Every task defines how to prove it is done.
+6. **Verification per task.** Every task defines how to prove it is done; runtime-facing tasks include runtime observability evidence.
 7. **Dependency order.** Phases and tasks flow foundational → dependent.
 8. **Project-aware.** Use existing conventions, patterns, and tech stack. Reuse existing components — do not reinvent.
 9. **Requirements coverage.** Track every requirement from input to design sections and tasks.
@@ -432,6 +442,7 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
 - [ ] **Step 1:** ...
 - [ ] **Step 2:** ...
 - [ ] **Verification:** [Concrete check]
+- [ ] **Runtime Verification (if applicable):** [Capture runtime signals — e.g., `tail -n 50 app.log` and `curl http://localhost:8080/health`; if not applicable, write `N/A` with reason]
 
 ---
 
@@ -449,6 +460,7 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
 - [ ] **Step 1:** ...
 - [ ] **Step 2:** ...
 - [ ] **Verification:** ...
+- [ ] **Runtime Verification (if applicable):** [Logs + probe result, or `N/A` with reason]
 
 ---
 
@@ -466,6 +478,7 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
 - [ ] **Step 1:** ...
 - [ ] **Step 2:** ...
 - [ ] **Verification:** ...
+- [ ] **Runtime Verification (if applicable):** [Logs + probe result, or `N/A` with reason]
 
 ---
 
@@ -483,6 +496,7 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
 - [ ] **Step 1:** ...
 - [ ] **Step 2:** ...
 - [ ] **Verification:** ...
+- [ ] **Runtime Verification (if applicable):** [Logs + probe result, or `N/A` with reason]
 
 ---
 
@@ -502,4 +516,5 @@ Please review the design and tasks. When ready, run /pb-build <feature-name> to
 2. [ ] **Tested:** Unit tests covering added logic.
 3. [ ] **Formatted:** Code formatter applied.
 4. [ ] **Verified:** Task's specific Verification criterion met.
+5. [ ] **Runtime-Evidenced (when applicable):** Runtime logs and health/probe results are captured, or `N/A` is explicitly justified.
 ```