diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
new file mode 100644
index 0000000..1b8f260
--- /dev/null
+++ b/.github/workflows/tests.yml
@@ -0,0 +1,59 @@
+name: Tests
+
+on:
+  push:
+    branches: [main, reflective]
+  # Trigger on every PR regardless of base branch so contributors get
+  # CI feedback on long-running integration branches (e.g. tracking-N)
+  # in addition to PRs targeting main/reflective.
+  pull_request:
+
+# Cancel in-flight runs on the same PR/branch when a new push lands.
+# Only safe context expressions used here: github.workflow, github.ref,
+# github.event_name. None come from user-controlled input.
+concurrency:
+  group: tests-${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+
+jobs:
+  pytest:
+    name: pytest (Python ${{ matrix.python-version }})
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.11", "3.12"]
+
+    # No LLM API keys in the env. The no-live-LLM project principle
+    # (CLAUDE.md, tests/CLAUDE.md, tests/conftest.py autouse guard) says
+    # tests must mock LLMs, never call them. This is the outer line of
+    # defence; the conftest guard is the inner.
+    env:
+      OPENAI_API_KEY: ""
+      OPENAI_BASE_URL: ""
+      ANTHROPIC_API_KEY: ""
+
+    steps:
+      - uses: actions/checkout@v5
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: pip
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Run pytest
+        run: pytest -ra --strict-markers
+
+      - name: Upload pytest cache on failure
+        if: failure()
+        uses: actions/upload-artifact@v4
+        with:
+          name: pytest-cache-py${{ matrix.python-version }}
+          path: .pytest_cache/
+          if-no-files-found: ignore
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..a5e9cfe
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,82 @@
+# Nous — project conventions
+
+This file is auto-loaded by Claude Code on every session in this repo. The
+rules below are non-negotiable; when they conflict with general AI/coding
+defaults, **the rules here win**.
+
+## 🚫 Tests must NEVER make live LLM calls
+
+**No unit, integration, or end-to-end test in this repo may make a real
+API call to Anthropic, OpenAI, or any other LLM provider. Period.**
+
+Why this is a hard rule:
+- Tests run on every CI build, every contributor's laptop, and every PR
+  rebase. Live LLM calls would burn tokens for no signal — the test
+  result depends on what the model said today, not on the code under test.
+- Token budget for `nous` is mission-critical. We refuse to spend it on
+  CI churn.
+- Live calls are non-deterministic. A flaky test from a model rephrasing
+  itself is worse than no test.
+
+**How to test correctly:**
+
+| Code under test | How to mock |
+|---|---|
+| `LLMDispatcher` | Pass `completion_fn=` in the constructor — a callable that returns canned `chat.completions`-shaped objects. See `tests/test_llm_dispatch.py`'s `_make_fake_completion` for the pattern. |
+| `CLIDispatcher` (claude -p subprocess) | Patch `orchestrator.cli_dispatch.subprocess.run` — return a `subprocess.CompletedProcess` with the JSON the test wants. See `tests/test_cli_dispatch.py`. |
+| `SDKDispatcher` (Claude Agent SDK) | Pass `sdk_runner=` in the constructor — a callable returning `SDKResult`. See `tests/test_sdk_dispatch.py`'s `_ScriptedRunner`. |
+| `InlineDispatcher` | Set up the `.nous_response_*` signal file in tmp_path before calling dispatch. |
+| Stub-driven flows | Use `StubDispatcher` from `orchestrator.dispatch` — it produces valid schema-conformant artifacts with no LLM at all. |
+
+**Active enforcement:** `tests/conftest.py` installs an autouse fixture
+(`block_live_llm_calls`) that:
+1. Strips `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` from the env so any
+   accidental real-client construction fails loudly instead of silently
+   billing.
+2. Patches `urllib.request.urlopen` to refuse `api.anthropic.com`,
+   `api.openai.com`, and `api.litellm.ai` hosts.
+3. Patches `claude_agent_sdk.query` (when installed) to a hard-fail.
+
+If a test triggers any of these guards, the fix is to inject a fake at
+the dispatcher's seam — never to disable the guard. The guards are the
+backstop; the seams are the contract.
+
+## Behavioral testing only
+
+When the test mock is in place, write **behavioral** tests:
+- ✓ Assert what's on disk after `dispatcher.dispatch(...)`.
+- ✓ Assert metrics rows in `llm_metrics.jsonl`.
+- ✓ Assert artifacts match a JSON Schema.
+- ✗ Don't assert which method was called on the mock.
+- ✗ Don't assert argv shape, internal helper invocation, or attribute access.
+
+The seam is the contract; the implementation is free to evolve.
+
+## Token-budget discipline (production code)
+
+Beyond tests, Nous itself must be frugal with tokens:
+- **Methodology stays in `CLAUDE.md`** (auto-loaded by Claude Code), not
+  in per-call prompts. The thin templates in `prompts/methodology/*_thin.md`
+  carry only per-iteration context.
+- **System blocks are cached** (`cache_control: ephemeral`). Any code
+  that constructs an SDK call with a static system_prompt should rely
+  on this, and any change that breaks within-iteration cache locality
+  must be measured (`nous cost --cache-stats`) and justified.
+- **Read-only mapping uses Explore subagents**, not Opus. See
+  `orchestrator/explore_design.py`.
+
+## PR workflow (project owner: @sriumcp)
+
+1. Branch off `upstream/reflective` (NOT `main`).
+2. Push to `origin` (the fork at `sriumcp/agentic-strategy-evolution`).
+3. Open PR with base `upstream/reflective`, head `sriumcp:<branch>`.
+4. PR body links the issue with `Closes #N` (or `Refs #N` for partials).
+5. Stack PRs when one logical change builds on another rather than waiting
+   for merge — see `docs/plans/CHECKPOINT.md` for the pattern.
+
+## See also
+
+- `docs/contributing/workflow.md` — full workflow doc.
+- `docs/security.md` — permission policy (#135).
+- `docs/architecture.md` — internals.
+- `docs/plans/CHECKPOINT.md` — current state of the #120 epic.
diff --git a/README.md b/README.md
index 706c21e..5cbe746 100644
--- a/README.md
+++ b/README.md
@@ -80,17 +80,25 @@ If you're using Anthropic directly via a LiteLLM proxy, point both vars at the p
 ### 1. Install Nous
 
 ```bash
-pip install "git+https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git"
+pip install "git+https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git@reflective"
 ```
 
+`reflective` is the active integration branch — that's where new work lands first. `main` lags slightly behind. To pin to a release, replace `@reflective` with a tag (`@v0.2.0`).
+
 For development (editable install with test dependencies):
 
 ```bash
-git clone https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git
+git clone -b reflective https://github.com/AI-native-Systems-Research/agentic-strategy-evolution.git
 cd agentic-strategy-evolution
 pip install -e ".[dev]"
 ```
 
+For the SDK-based dispatcher (`--agent sdk`, see `docs/architecture.md`), also install the optional `[sdk]` extra:
+
+```bash
+pip install -e ".[dev,sdk]"
+```
+
 ### 2. Configure models
 
 Two LLM calls per iteration, both via `claude -p`:
diff --git a/bin/nous-execute-stop b/bin/nous-execute-stop
new file mode 100755
index 0000000..4a26477
--- /dev/null
+++ b/bin/nous-execute-stop
@@ -0,0 +1,85 @@
+#!/usr/bin/env python3
+"""Stop hook for the Nous executor session (issue #129).
+
+Runs after every Claude Code agent turn. Returns:
+    exit 0 → allow the agent to stop (its work is done).
+    exit 2 → block stopping; the structured reason on stderr is fed back
+             into the agent's conversation so it can react.
+
+A "stop is allowed" decision needs two pieces of evidence on disk:
+    1. ``$NOUS_ITER_DIR/principle_updates.json`` exists.
+    2. ``nous validate execution --dir $NOUS_ITER_DIR`` returns ``status: pass``.
+
+Both are deterministic — no LLM judgment, no agent self-assessment. The
+hook pairs with the ``/goal``-driven loop (#124) but is preferred wherever
+the success criterion is a schema check, because it's cheaper and more
+reliable than a Haiku evaluator.
+
+Configured per-campaign in ``.claude/settings.json`` (see #135). The
+orchestrator sets ``NOUS_ITER_DIR`` before launching the executor session.
+"""
+from __future__ import annotations
+
+import os
+import sys
+from pathlib import Path
+
+# When invoked as a Claude Code hook, the script's directory may not be
+# on PYTHONPATH. Add the repo root so `orchestrator.validate` imports.
+_HERE = Path(__file__).resolve().parent
+_REPO_ROOT = _HERE.parent
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+
+from orchestrator.validate import validate_execution  # noqa: E402
+
+
+_OK = 0
+_BLOCK = 2
+
+
+def main() -> int:
+    iter_dir_str = os.environ.get("NOUS_ITER_DIR")
+    if not iter_dir_str:
+        print(
+            "NOUS_ITER_DIR is not set. The orchestrator should export this "
+            "variable before launching the executor session.",
+            file=sys.stderr,
+        )
+        return _BLOCK
+
+    iter_dir = Path(iter_dir_str)
+    if not iter_dir.is_dir():
+        print(
+            f"iter_dir does not exist: {iter_dir}. NOUS_ITER_DIR is "
+            f"misconfigured or the executor was launched before init.",
+            file=sys.stderr,
+        )
+        return _BLOCK
+
+    principles = iter_dir / "principle_updates.json"
+    if not principles.exists():
+        print(
+            f"principle_updates.json is missing from {iter_dir}. "
+            f"Write the file (a JSON list, possibly empty: []) before stopping.",
+            file=sys.stderr,
+        )
+        return _BLOCK
+
+    result = validate_execution(iter_dir)
+    if result.get("status") != "pass":
+        errors = result.get("errors", [])
+        print(
+            f"validation failed for {iter_dir} ({len(errors)} error(s)). "
+            f"Fix these before stopping:",
+            file=sys.stderr,
+        )
+        for err in errors:
+            print(f"  - {err}", file=sys.stderr)
+        return _BLOCK
+
+    return _OK
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/bin/nous-mcp b/bin/nous-mcp
new file mode 100755
index 0000000..17309f3
--- /dev/null
+++ b/bin/nous-mcp
@@ -0,0 +1,306 @@
+#!/usr/bin/env python3
+"""nous-mcp: stdio MCP server exposing Nous campaigns (#126 Phase B).
+
+Wraps the pure functions in ``orchestrator.campaign_index`` as MCP
+resources and tools so any Claude Code session — terminal, IDE, web —
+can ``@``-reference a campaign or call ``nous.search_principles(...)``
+without bash plumbing.
+
+Protocol: JSON-RPC 2.0 over stdio (line-delimited JSON, one request /
+response per line). Compatible with Claude Code's MCP transport when
+registered in ``~/.claude.json`` under ``mcpServers``:
+
+    {
+      "mcpServers": {
+        "nous": {
+          "command": "python",
+          "args": ["-u", "/path/to/repo/bin/nous-mcp"],
+          "env": {"NOUS_SEARCH_ROOT": "/path/to/parent/of/.nous/"}
+        }
+      }
+    }
+
+The server is stateless: campaigns live on disk; every request re-walks
+``$NOUS_SEARCH_ROOT`` (or the path passed in the request).
+
+Methods:
+  initialize / shutdown          -- MCP handshake
+  resources/list                 -- nous://campaigns and per-campaign URIs
+  resources/read                 -- read a specific resource
+  tools/list                     -- list_campaigns / search_principles /
+                                    get_arm_results / compare_iterations
+  tools/call                     -- invoke a tool by name with arguments
+"""
+from __future__ import annotations
+
+import json
+import os
+import sys
+from pathlib import Path
+
+_HERE = Path(__file__).resolve().parent
+_REPO_ROOT = _HERE.parent
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+
+from orchestrator.campaign_index import (  # noqa: E402
+    compare_iterations,
+    get_arm_results,
+    list_campaigns,
+    search_principles,
+)
+
+
+_SERVER_INFO = {
+    "name": "nous-mcp",
+    "version": "0.2.0",
+    "description": "Read-only access to Nous campaigns on disk.",
+}
+
+_CAPABILITIES = {
+    "resources": {"list": True, "read": True},
+    "tools": {"list": True, "call": True},
+}
+
+
+_TOOLS = [
+    {
+        "name": "nous.list_campaigns",
+        "description": (
+            "List all Nous campaigns under the search root. "
+            "Optional filters: query (substring on run_id), status (phase), repo."
+        ),
+        "inputSchema": {
+            "type": "object",
+            "properties": {
+                "search_root": {"type": "string"},
+                "query": {"type": "string"},
+                "status": {"type": "string"},
+                "repo": {"type": "string"},
+            },
+        },
+    },
+    {
+        "name": "nous.search_principles",
+        "description": (
+            "Search principles across all campaigns by substring. "
+            "Hits include the source campaign run_id and path."
+        ),
+        "inputSchema": {
+            "type": "object",
+            "properties": {
+                "search_root": {"type": "string"},
+                "text": {"type": "string"},
+                "only_active": {"type": "boolean"},
+            },
+            "required": ["text"],
+        },
+    },
+    {
+        "name": "nous.get_arm_results",
+        "description": (
+            "Aggregate per-seed result files for one arm of one iteration."
+        ),
+        "inputSchema": {
+            "type": "object",
+            "properties": {
+                "campaign_root": {"type": "string"},
+                "iteration": {"type": "integer"},
+                "arm": {"type": "string"},
+            },
+            "required": ["campaign_root", "iteration", "arm"],
+        },
+    },
+    {
+        "name": "nous.compare_iterations",
+        "description": (
+            "Deterministic diff between two iterations of one campaign — "
+            "arm-status changes and added principles."
+        ),
+        "inputSchema": {
+            "type": "object",
+            "properties": {
+                "campaign_root": {"type": "string"},
+                "iter_a": {"type": "integer"},
+                "iter_b": {"type": "integer"},
+            },
+            "required": ["campaign_root", "iter_a", "iter_b"],
+        },
+    },
+]
+
+
+def _default_search_root() -> str:
+    return os.environ.get("NOUS_SEARCH_ROOT", str(Path.cwd()))
+
+
+def _resource_list(search_root: str) -> list[dict]:
+    """Build the MCP resources/list payload from disk state."""
+    out: list[dict] = [{
+        "uri": "nous://campaigns",
+        "name": "All campaigns",
+        "description": "Index of every Nous campaign under the search root.",
+        "mimeType": "application/json",
+    }]
+    for campaign in list_campaigns(Path(search_root)):
+        run_id = campaign["run_id"]
+        out.append({
+            "uri": f"nous://campaigns/{run_id}/state",
+            "name": f"{run_id} — state",
+            "description": f"Phase + iteration of campaign {run_id}.",
+            "mimeType": "application/json",
+        })
+        out.append({
+            "uri": f"nous://campaigns/{run_id}/principles",
+            "name": f"{run_id} — principles",
+            "description": f"Active principles accumulated in {run_id}.",
+            "mimeType": "application/json",
+        })
+    return out
+
+
+def _read_resource(uri: str, search_root: str) -> dict:
+    """Resolve a nous:// URI to its JSON contents."""
+    if uri == "nous://campaigns":
+        return {"campaigns": list_campaigns(Path(search_root))}
+
+    if not uri.startswith("nous://campaigns/"):
+        raise ValueError(f"unknown URI scheme: {uri!r}")
+    parts = uri[len("nous://campaigns/"):].split("/")
+    if len(parts) < 2:
+        raise ValueError(f"malformed campaign URI: {uri!r}")
+    run_id, leaf = parts[0], "/".join(parts[1:])
+
+    # Find campaign root by run_id under search_root.
+    matching = [c for c in list_campaigns(Path(search_root)) if c["run_id"] == run_id]
+    if not matching:
+        raise ValueError(f"unknown campaign: {run_id!r}")
+    root = Path(matching[0]["path"])
+
+    if leaf == "state":
+        return json.loads((root / "state.json").read_text())
+    if leaf == "principles":
+        return json.loads((root / "principles.json").read_text())
+    if leaf.startswith("iter/") and leaf.endswith("/findings"):
+        n = int(leaf.split("/")[1])
+        return json.loads((root / "runs" / f"iter-{n}" / "findings.json").read_text())
+    raise ValueError(f"unsupported leaf: {leaf!r}")
+
+
+def _call_tool(name: str, args: dict) -> dict:
+    """Dispatch a tools/call request to campaign_index."""
+    if name == "nous.list_campaigns":
+        return {
+            "campaigns": list_campaigns(
+                Path(args.get("search_root", _default_search_root())),
+                query=args.get("query"),
+                status=args.get("status"),
+                repo=args.get("repo"),
+            ),
+        }
+    if name == "nous.search_principles":
+        return {
+            "hits": search_principles(
+                Path(args.get("search_root", _default_search_root())),
+                args["text"],
+                only_active=args.get("only_active", True),
+            ),
+        }
+    if name == "nous.get_arm_results":
+        return get_arm_results(
+            Path(args["campaign_root"]),
+            int(args["iteration"]),
+            args["arm"],
+        )
+    if name == "nous.compare_iterations":
+        return compare_iterations(
+            Path(args["campaign_root"]),
+            int(args["iter_a"]),
+            int(args["iter_b"]),
+        )
+    raise ValueError(f"unknown tool: {name!r}")
+
+
+def handle_request(request: dict, *, search_root: str | None = None) -> dict:
+    """Process one JSON-RPC request and return the response dict.
+
+    Pure function — testable without stdio. The main loop calls this
+    for each line and writes the result back.
+    """
+    rid = request.get("id")
+    method = request.get("method", "")
+    params = request.get("params") or {}
+    root = search_root or _default_search_root()
+
+    try:
+        if method == "initialize":
+            result: dict = {
+                "protocolVersion": "2024-11-05",
+                "capabilities": _CAPABILITIES,
+                "serverInfo": _SERVER_INFO,
+            }
+        elif method == "shutdown":
+            result = {}
+        elif method == "resources/list":
+            result = {"resources": _resource_list(root)}
+        elif method == "resources/read":
+            uri = params.get("uri", "")
+            payload = _read_resource(uri, root)
+            result = {
+                "contents": [{
+                    "uri": uri,
+                    "mimeType": "application/json",
+                    "text": json.dumps(payload, indent=2),
+                }],
+            }
+        elif method == "tools/list":
+            result = {"tools": _TOOLS}
+        elif method == "tools/call":
+            name = params.get("name", "")
+            args = params.get("arguments", {}) or {}
+            payload = _call_tool(name, args)
+            result = {
+                "content": [{
+                    "type": "text",
+                    "text": json.dumps(payload, indent=2),
+                }],
+            }
+        else:
+            return {
+                "jsonrpc": "2.0",
+                "id": rid,
+                "error": {"code": -32601, "message": f"method not found: {method}"},
+            }
+    except Exception as exc:
+        return {
+            "jsonrpc": "2.0",
+            "id": rid,
+            "error": {"code": -32603, "message": f"{type(exc).__name__}: {exc}"},
+        }
+
+    return {"jsonrpc": "2.0", "id": rid, "result": result}
+
+
+def main() -> int:
+    for line in sys.stdin:
+        line = line.strip()
+        if not line:
+            continue
+        try:
+            request = json.loads(line)
+        except json.JSONDecodeError as exc:
+            sys.stdout.write(json.dumps({
+                "jsonrpc": "2.0",
+                "id": None,
+                "error": {"code": -32700, "message": f"parse error: {exc}"},
+            }) + "\n")
+            sys.stdout.flush()
+            continue
+        response = handle_request(request)
+        sys.stdout.write(json.dumps(response) + "\n")
+        sys.stdout.flush()
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/bin/nous-plan-enforcer b/bin/nous-plan-enforcer
new file mode 100755
index 0000000..0382f63
--- /dev/null
+++ b/bin/nous-plan-enforcer
@@ -0,0 +1,195 @@
+#!/usr/bin/env python3
+"""PreToolUse hook: enforce experiment_plan.yaml during EXECUTE_ANALYZE (#128).
+
+Claude Code calls this hook before every Bash tool invocation in the
+executor session. It compares the proposed command against the plan
+sitting in ``$NOUS_ITER_DIR/experiment_plan.yaml`` and decides whether
+to allow it.
+
+Two modes (controlled by ``NOUS_PLAN_ENFORCEMENT``):
+
+  * ``strict``: exit 2 with a structured reason on stderr if the
+    proposed command's head binary doesn't match any planned condition's
+    head binary. The agent receives the reason in its conversation and
+    is expected to either (a) revise the command or (b) annotate it
+    ``# nous: ad-hoc`` to explicitly opt out for one call.
+  * ``warn`` (default): always exit 0; record violations to
+    ``$NOUS_ITER_DIR/plan_violations.jsonl`` for audit. Lets you watch
+    for drift in soak runs without breaking iteration.
+
+Escape hatch: a command containing the literal string ``# nous: ad-hoc``
+is allowed in both modes and logged as ``kind: ad-hoc`` so reviewers can
+audit how often it's used.
+
+Exit codes: 0 = allow, 2 = block (strict only).
+"""
+from __future__ import annotations
+
+import json
+import os
+import sys
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Iterable
+
+import yaml
+
+_AD_HOC_MARKER = "# nous: ad-hoc"
+_OK = 0
+_BLOCK = 2
+
+
+def _read_event() -> dict:
+    """Read the PreToolUse JSON payload from stdin. Returns {} on bad input."""
+    try:
+        raw = sys.stdin.read()
+        if not raw.strip():
+            return {}
+        return json.loads(raw)
+    except json.JSONDecodeError:
+        return {}
+
+
+def _proposed_command(event: dict) -> str | None:
+    """Return the Bash command this event is proposing, or None for non-Bash."""
+    if event.get("tool_name") != "Bash":
+        return None
+    cmd = event.get("tool_input", {}).get("command")
+    if not isinstance(cmd, str):
+        return None
+    return cmd
+
+
+def _head_binary(cmd: str) -> str | None:
+    """Pull the basename of the first token of a shell command."""
+    cmd = cmd.lstrip()
+    # Strip leading comments or ad-hoc marker so we extract the real binary.
+    for line in cmd.splitlines():
+        stripped = line.strip()
+        if not stripped or stripped.startswith("#"):
+            continue
+        first = stripped.split()[0]
+        # Drop env-var prefix like ``FOO=bar binary``.
+        while "=" in first and not first.startswith("/") and not first.startswith("./"):
+            # heuristic: env-var-only assignment, skip to next token
+            tokens = stripped.split()
+            if len(tokens) < 2:
+                return None
+            tokens.pop(0)
+            stripped = " ".join(tokens)
+            first = stripped.split()[0]
+        return first.split("/")[-1]
+    return None
+
+
+def _plan_binaries(plan_path: Path) -> set[str]:
+    """Extract the set of head-binary basenames referenced in the plan."""
+    if not plan_path.exists():
+        return set()
+    try:
+        plan = yaml.safe_load(plan_path.read_text()) or {}
+    except yaml.YAMLError:
+        return set()
+    bins: set[str] = set()
+    for arm in plan.get("arms", []) or []:
+        for cond in arm.get("conditions", []) or []:
+            cmd = cond.get("command") or cond.get("cmd")
+            if isinstance(cmd, str):
+                bin_name = _head_binary(cmd)
+                if bin_name:
+                    bins.add(bin_name)
+    return bins
+
+
+def _planning_arm_for(plan_path: Path, head: str) -> str | None:
+    """Best-effort: return arm_id where ``head`` appears (or None)."""
+    if not plan_path.exists():
+        return None
+    try:
+        plan = yaml.safe_load(plan_path.read_text()) or {}
+    except yaml.YAMLError:
+        return None
+    for arm in plan.get("arms", []) or []:
+        for cond in arm.get("conditions", []) or []:
+            cmd = cond.get("command") or cond.get("cmd")
+            if isinstance(cmd, str) and _head_binary(cmd) == head:
+                return arm.get("arm_id")
+    return None
+
+
+def _log_violation(
+    iter_dir: Path,
+    *,
+    kind: str,
+    command: str,
+    arm: str | None,
+) -> None:
+    log_path = iter_dir / "plan_violations.jsonl"
+    record = {
+        "timestamp": datetime.now(timezone.utc).isoformat(),
+        "kind": kind,
+        "command": command,
+        "arm": arm or "",
+    }
+    try:
+        with open(log_path, "a") as f:
+            f.write(json.dumps(record) + "\n")
+    except OSError:
+        # Best-effort; never block the agent because logging failed.
+        pass
+
+
+def main() -> int:
+    iter_dir_str = os.environ.get("NOUS_ITER_DIR")
+    mode = os.environ.get("NOUS_PLAN_ENFORCEMENT", "warn").lower()
+
+    event = _read_event()
+    cmd = _proposed_command(event)
+    if cmd is None:
+        # Not a Bash event — nothing to enforce.
+        return _OK
+
+    if not iter_dir_str:
+        # Hook misconfigured — fail open (cannot block what we can't compare).
+        return _OK
+
+    iter_dir = Path(iter_dir_str)
+    plan_path = iter_dir / "experiment_plan.yaml"
+
+    # Escape hatch.
+    if _AD_HOC_MARKER in cmd:
+        _log_violation(iter_dir, kind="ad-hoc", command=cmd, arm=None)
+        return _OK
+
+    head = _head_binary(cmd)
+    if head is None:
+        # Couldn't parse — fail open (warn) or block (strict).
+        if mode == "strict":
+            print(
+                f"plan enforcer could not parse the proposed command:\n  {cmd!r}",
+                file=sys.stderr,
+            )
+            return _BLOCK
+        return _OK
+
+    planned = _plan_binaries(plan_path)
+    if head in planned:
+        return _OK
+
+    arm = _planning_arm_for(plan_path, head)
+    if mode == "strict":
+        print(
+            f"command head '{head}' is not in experiment_plan.yaml.\n"
+            f"Either revise the command to use a planned binary, or, "
+            f"if this is intentional, add '{_AD_HOC_MARKER}' as a comment "
+            f"line in the command.",
+            file=sys.stderr,
+        )
+        return _BLOCK
+
+    _log_violation(iter_dir, kind="unplanned", command=cmd, arm=arm)
+    return _OK
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/docs/architecture.md b/docs/architecture.md
index f5e162b..da24a43 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -110,7 +110,8 @@ Both agents write artifacts directly to the campaign directory (`iter_dir`) and
 **Implementations:**
 
 - `StubDispatcher` (`dispatch.py`) produces valid, schema-conformant artifacts without calling any LLM. Used for testing the orchestrator loop.
-- `CLIDispatcher` (`cli_dispatch.py`) invokes `claude -p` as a subprocess, giving agents code access and shell tools. Agents write files directly to `iter_dir`. Supports `override_cwd()` context manager for pointing the executor at a git worktree.
+- `CLIDispatcher` (`cli_dispatch.py`) invokes `claude -p` as a subprocess, giving agents code access and shell tools. Agents write files directly to `iter_dir`. Supports `override_cwd()` context manager for pointing the executor at a git worktree. Selected via `--agent api`.
+- `SDKDispatcher` (`sdk_dispatch.py`) calls the Claude Agent SDK (`claude-agent-sdk`) instead of spawning a subprocess. Same artifact and metrics contract as `CLIDispatcher`; gains native streaming, programmatic prompt caching, and message-level retry. Selected via `--agent sdk`. Requires the optional `sdk` install extra (`pip install -e ".[sdk]"`). Inherits parse / validate / retry-with-feedback machinery from `CLIDispatcher` — only the transport changes.
 
 **Dispatch interface:**
 ```python
@@ -122,7 +123,18 @@ dispatcher.dispatch(
 )
 ```
 
-Both dispatchers share the same interface — `CLIDispatcher` extends `LLMDispatcher`.
+All three dispatchers share the same interface. `CLIDispatcher` extends `LLMDispatcher`; `SDKDispatcher` extends `CLIDispatcher` and overrides only `_call_claude` and `preflight_check`.
+
+### Stop Hook (`bin/nous-execute-stop`)
+
+Claude Code Stop hooks fire after every agent turn and decide whether the agent is allowed to terminate. `bin/nous-execute-stop` is Nous's deterministic completion check: the executor is allowed to stop only when both conditions hold on disk, no LLM judgment involved:
+
+1. `principle_updates.json` exists in the iteration directory.
+2. `nous validate execution --dir $NOUS_ITER_DIR` returns `status: pass`.
+
+If either fails, the hook exits with code 2 and writes a structured reason to stderr; Claude Code feeds that reason back into the agent's conversation so it can fix the artifact and try again. Wire-up lives in the per-campaign `.claude/settings.json` (see #135) — the orchestrator exports `NOUS_ITER_DIR` before launching the executor session.
+
+This is preferred over a probabilistic Haiku evaluator anywhere the success criterion is a schema check: cheaper, faster, and immune to evaluator drift.
 
 ## CLI Dispatch
 
diff --git a/docs/contributing/workflow.md b/docs/contributing/workflow.md
index 4aaa2cf..ecc579a 100644
--- a/docs/contributing/workflow.md
+++ b/docs/contributing/workflow.md
@@ -4,6 +4,32 @@ This document defines the standard workflow for contributors using Claude Code t
 
 ---
 
+## Non-negotiable rules
+
+These apply to every PR, every test, every contributor. They are also restated in the auto-loaded `CLAUDE.md` files at the repo root and under `tests/`.
+
+### 🚫 Tests must NEVER make live LLM calls
+
+**No unit, integration, or end-to-end test in this repo may make a real API call to Anthropic, OpenAI, or any other LLM provider.** Tests must mock LLMs at the dispatcher seam:
+
+- `LLMDispatcher` → pass `completion_fn=`.
+- `CLIDispatcher` → patch `orchestrator.cli_dispatch.subprocess.run`.
+- `SDKDispatcher` → pass `sdk_runner=` returning `SDKResult`.
+- `InlineDispatcher` → pre-populate the `.nous_response_*` signal file.
+- Or use `StubDispatcher` for end-to-end orchestrator flows.
+
+`tests/conftest.py` installs an autouse `block_live_llm_calls` fixture that strips LLM API keys from the env and patches `urllib.request.urlopen` + `claude_agent_sdk.query` to hard-fail on real network calls. If a test trips the guard, fix the test by injecting a fake — never disable the guard.
+
+### Behavioral testing only
+
+Assert what's on disk, what's in metrics rows, what schemas validate. Don't assert which methods were called or what argv was constructed. The dispatcher seams are the contract.
+
+### Token-budget discipline
+
+`nous` runs against real LLMs in production; CI cannot. Every PR that touches `orchestrator/` must keep the cache-friendly invariant: methodology lives in `CLAUDE.md` (auto-loaded), system blocks are stable across calls (cache hits), per-iteration content goes in the user message (cache busts when it should). `nous cost --cache-stats` is the regression gate.
+
+---
+
 ## Overview
 
 Any contributor with Claude Code should follow this workflow when working on an issue. It combines AI-assisted planning and review with explicit human approval gates to produce consistent, high-quality contributions.
diff --git a/docs/retros/2026-05-24-claude-code-native-uplift.md b/docs/retros/2026-05-24-claude-code-native-uplift.md
new file mode 100644
index 0000000..4332574
--- /dev/null
+++ b/docs/retros/2026-05-24-claude-code-native-uplift.md
@@ -0,0 +1,79 @@
+# Retro — Claude-Code-Native Uplift for Nous (#120)
+
+**Closes:** [#120](https://github.com/AI-native-Systems-Research/agentic-strategy-evolution/issues/120)
+**Window:** 2026-05-24 (single session, multi-PR initiative)
+**Children resolved:** 15 of 15 — #121, #122, #123, #124, #125, #126, #127, #128, #129, #130, #131, #132, #133, #134, #135.
+**Plus a project-wide guard PR:** #151 — no-live-LLM-in-tests, codified in `CLAUDE.md` + `tests/CLAUDE.md` + `tests/conftest.py` + `docs/contributing/workflow.md`.
+
+## What landed
+
+```
+                       Foundation                Capabilities                Ecosystem
+                  ┌───────────────────┐       ┌────────────────────┐    ┌─────────────────┐
+                  │ #121 SDK port     │──┬────│ #122 caching        │    │ #126 MCP server │
+                  │ #129 stop hook    │  ├────│ #127 stream-json    │    │ #125 plugin pkg │
+                  │ #135 perm policy  │  ├────│ #132 explore design │    │ #134 routines   │
+                  │ #131 CLAUDE.md    │  └────│ #123 parallel arms  │    │ #130 channels   │
+                  └───────────────────┘       │ #133 worktree harness│    │ #124 /goal-driven│
+                                              │ #128 plan enforcer  │    └─────────────────┘
+                                              └────────────────────┘
+```
+
+15 PRs in flight against `upstream/reflective`. ~250 new behavioral tests. Zero structural assertions. Zero live LLM calls (enforced by the conftest guard).
+
+## How the architecture changed
+
+Before: Nous was a Python orchestrator that shelled out to `claude -p` as a subprocess for code-access roles, with a custom JSON parser, a custom retry loop, and a manual git-worktree lifecycle. The methodology preamble (~465 lines across `design.md` + `execute_analyze.md`) was re-rendered into every prompt.
+
+After: Nous is a Python orchestrator that owns checkpointing, validation, and gates, while delegating the actual agent loop to the Claude Agent SDK. Methodology lives in CLAUDE.md (auto-loaded once per session); the prompt body shrinks to per-iteration context only. Subagents (Explore for design mapping, isolation="worktree" for parallel arms) replace the mega-session pattern. The on-disk artifact contract is unchanged — every PR was a transport substitution behind the existing `dispatcher.dispatch(role, phase, ...)` seam.
+
+## Token-budget delta (the user's mission-critical metric)
+
+| Lever | Before | After | Verifies via |
+|---|---|---|---|
+| Methodology re-sent each call (#131) | full template (~465 lines) per call | thin template (~50 lines) when CLAUDE.md is in scope | `nous cost --cache-stats` (#122) — stats infrastructure landed |
+| System block caching (#122) | none | `cache_control: ephemeral` on methodology preamble | `cache_read_input_tokens` in `llm_metrics.jsonl` |
+| DESIGN exploration (#132) | one Opus session for codebase walk + synthesis | 4 parallel Haiku Explore subagents + 1 Opus synthesis call | report.input_tokens aggregation in `ExploreStageResult` |
+| Multi-arm execution (#123) | one Sonnet mega-session for 24 simulations | per-arm subagent in isolated worktree, parallelizable | wall-clock + per-unit metrics on representative campaign |
+
+The cache-stats aggregation (`orchestrator/cache_stats.py`) is the regression gate — `nous cost --cache-stats` must show non-zero hit rate on warm phase calls and ≥25% input-token reduction over the 5-iter baseline. Soak verification on real `inference-sim` campaigns confirms or refutes this; the infrastructure to observe it is in place.
+
+## How testing held up
+
+The user's directive — "behavioral testing discipline, absolutely no structural tests" — was the most consequential constraint of the initiative. It forced specific design choices:
+
+- **Pluggable seams everywhere.** `sdk_runner` Protocol returning `SDKResult` (#121); `poster` callable for channels (#130) and routines (#134); `runner` injection for plan enforcer (#128), explore stage (#132), parallel arms (#123); `pid_check` and `now=` for worktree GC (#133); `completion_fn` for the legacy LLMDispatcher path. Every test asserts on disk artifacts, JSON shapes, or externally-visible state — never on internal helper invocations.
+- **No live LLM calls in tests, ever.** Codified in PR #151 with active enforcement: `tests/conftest.py` strips `OPENAI_API_KEY` / `ANTHROPIC_API_KEY` from the env, patches `urllib.request.urlopen` to refuse known LLM hosts, patches `claude_agent_sdk.query` to hard-fail. `tests/test_no_live_llm_guard.py` verifies the guard fires correctly.
+- **Determinism via injected clocks/PIDs/IDs.** Tests inject `now=`, `pid_check=`, fake `os.utime`, scripted runners — they pass on any machine, in any timezone, without flaky waits. No `time.sleep` polling.
+
+That seam discipline is also what makes Phase B closures possible: in every #N Phase B PR, the production wiring is one line that constructs the real SDK runner; the orchestration layer + tests above it are unchanged.
+
+## What's deferred to soak
+
+Acceptance criteria that explicitly require running a real campaign (the issue body's measurement-based criteria) cannot be honestly verified in CI:
+
+- #122: ≥25% input token drop on a 5-iteration campaign (need Anthropic API).
+- #123: significant wall-clock improvement on `examples/campaign-best-of-field.yaml` with `max_parallel_arms: 4` (need real subagent spawning).
+- #132: ≥30% DESIGN cost drop (need real Explore subagents).
+- #131: subjective bundle-quality parity on 3 reference campaigns (human review).
+- #126/#130/#134: live transports against MCP / Slack / Routines APIs (need credentials).
+
+These are integration tests for the soak environment, not unit tests. The infrastructure to measure each is shipped (`nous cost --cache-stats`, the ledger, `merge_unit_results` determinism). The team verifies on first soak; if a criterion fails, the failure is observable from the metrics emit and the cause is traceable to a single seam.
+
+## What the next initiative should pick up
+
+- Drop `cli_dispatch.py` once `--agent sdk` has soaked. The CLI subprocess path is dead code after that.
+- Drop `worktree.py`'s manual `create_experiment_worktree` / `remove_experiment_worktree` once #123 wires `make_isolated_arm_runner` into iteration.py — closes #133's ≥60% LoC reduction acceptance criterion.
+- Real MCP transport using the `mcp` Python SDK once it pins; the stdio JSON-RPC server in #142 is bounded by what stdlib can do.
+- Slack interactive messages adapter for #130 Phase B (parsed reply tokens are landed; the per-channel reply provider needs a webhook receiver).
+- Routines API integration once the API stabilizes; the payload builder + `submit_routine` are landed.
+
+## Lessons (worth carrying to the next epic)
+
+1. **Phase A / Phase B split was right.** Eleven of fifteen child issues had at least one criterion that requires soak verification. Bundling them as one PR each would have made every PR claim "soak verified" — false. Splitting let us land the testable orchestration first and name the soak-only follow-up explicitly.
+2. **Stack PRs when one logical change builds on another.** Five PRs stacked on #136 (#121 SDK port); #139 stacked on #138; #150 stacked on #143 stacked on #136. Each stack mirrors the dependency chain. Reviewers can merge bottom-up; rebases are mechanical.
+3. **The conftest guard was the highest-leverage one-day investment.** ~50 lines of `tests/conftest.py` and a one-line autouse fixture meant every existing test, every new test, every future PR is now incapable of accidentally spending tokens. Cost: one PR. Benefit: forever.
+
+## Closing #120
+
+All 15 children + the test-policy guard are in flight. The retro is this document; the metric-verification work is named in [`docs/plans/CHECKPOINT.md`](../plans/CHECKPOINT.md).
diff --git a/docs/security.md b/docs/security.md
new file mode 100644
index 0000000..2f16137
--- /dev/null
+++ b/docs/security.md
@@ -0,0 +1,44 @@
+# Security model
+
+Nous campaigns invoke an LLM agent (Claude Code) with shell-tool access against your target repository. The orchestrator's job is to make sure that access is *bounded* — agents can only see and modify what the campaign legitimately needs.
+
+This document describes how that boundary is enforced.
+
+## Per-campaign permission policy
+
+When you run `nous run`, the orchestrator writes `<work_dir>/.claude/settings.json` (issue #135). The dispatcher then invokes the agent with `--settings <path>`, replacing the legacy `--dangerously-skip-permissions`.
+
+The settings file declares:
+
+| Key | Meaning |
+|---|---|
+| `permissions.allowOnly` | Absolute paths the agent may read or write. Always includes the campaign work-dir; includes the target repo when `repo_path` is set. |
+| `permissions.allow` | Bash command allowlist. Built from a conservative default set (`git`, `python`, `pytest`, `grep`, …) plus any binaries referenced in `experiment_plan.yaml` arms, plus campaign-specific entries you pass via `extra_bin_allowlist`. |
+| `permissions.deny` | Hard blocks. Ships with `Bash(curl https://*)`, `Bash(wget https://*)`, and `Bash(rm -rf /*)` to prevent the agent from exfiltrating data or destroying its host. |
+| `hooks.Stop` | (When `bin/nous-execute-stop` exists) deterministic completion check — see #129. |
+| `hooks.PreToolUse` | (When configured) plan-enforcer hook — see #128. |
+
+### Why `--dangerously-skip-permissions` is no longer the default
+
+`--dangerously-skip-permissions` auto-approves *every* tool call. That's appropriate for a sandboxed CI runner and a one-off experiment, but Nous campaigns run for hours against real repositories — we need writes to be bounded to the worktree by default.
+
+The flag is still available behind explicit opt-in for emergency cases (e.g. recovering a stuck campaign), but no campaign in `examples/` uses it after #135 lands.
+
+### Idempotency
+
+`setup_work_dir` only writes `settings.json` if it doesn't already exist. That means you can hand-edit the file (add a custom `extra_bin_allowlist`, tweak deny rules, point `hooks.Stop` at a custom script) and a `nous resume` won't clobber your changes.
+
+### What's NOT enforced by this layer
+
+- **Network egress beyond the deny list.** The deny rules block the obvious cases; for hardened environments, run Nous inside a network-namespaced container.
+- **Privilege escalation.** The agent runs as your shell user. Claude Code's permission system gates *which* commands run, not *what privileges* they run with.
+- **Adversarial inputs from your target repo.** If the repo's source code contains prompt-injection payloads, the agent may follow them. Treat campaigns the way you'd treat any other code review of an untrusted repo.
+
+## Hook registration
+
+The settings file's `hooks` section wires up:
+
+- **Stop hook** (`bin/nous-execute-stop`, #129): allows the executor to terminate only when `principle_updates.json` exists and `nous validate execution` returns pass. Cheaper and more reliable than a Haiku evaluator for schema-driven success criteria.
+- **PreToolUse hook** (`bin/nous-plan-enforcer`, #128): rejects (or logs) Bash calls that aren't derivable from `experiment_plan.yaml`. Defense-in-depth on top of the allow/deny lists.
+
+Both hooks are optional; their absence falls back to settings-only enforcement.
diff --git a/orchestrator/cache_stats.py b/orchestrator/cache_stats.py
new file mode 100644
index 0000000..0381863
--- /dev/null
+++ b/orchestrator/cache_stats.py
@@ -0,0 +1,131 @@
+"""Cache hit-rate aggregation over llm_metrics.jsonl (issue #122).
+
+Reads the per-call metrics file and computes:
+
+  * total cache_read_input_tokens (paid for once per cache window)
+  * total cache_creation_input_tokens (paid the first time only)
+  * total uncached input tokens
+  * cache hit rate = read / (read + creation + uncached)
+  * by-phase breakdown (so DESIGN-vs-EXECUTE_ANALYZE differences surface)
+
+The result powers ``nous cost --cache-stats``. Output is JSON-serializable
+so the same numbers can drive Routines (#134) reporting later.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any
+
+
+def _iter_rows(path: Path):
+    if not path.exists():
+        return
+    for line in path.read_text().splitlines():
+        line = line.strip()
+        if not line:
+            continue
+        try:
+            yield json.loads(line)
+        except json.JSONDecodeError:
+            continue
+
+
+def cache_stats(metrics_path: Path) -> dict[str, Any]:
+    """Compute cache hit-rate statistics from a metrics JSONL file.
+
+    Returns:
+      ::
+
+        {
+          "total_calls": int,
+          "input_tokens_uncached": int,
+          "cache_creation_input_tokens": int,
+          "cache_read_input_tokens": int,
+          "hit_rate": float,        # 0.0–1.0
+          "by_phase": {
+            <phase>: { same fields, scoped to that phase }
+          }
+        }
+    """
+    rows = list(_iter_rows(Path(metrics_path)))
+    return _aggregate(rows)
+
+
+def _aggregate(rows: list[dict]) -> dict[str, Any]:
+    out: dict[str, Any] = {
+        "total_calls": 0,
+        "input_tokens_uncached": 0,
+        "cache_creation_input_tokens": 0,
+        "cache_read_input_tokens": 0,
+        "hit_rate": 0.0,
+        "by_phase": {},
+    }
+    phase_aggregates: dict[str, dict[str, int]] = {}
+
+    for row in rows:
+        out["total_calls"] += 1
+        # Standard schema: input_tokens captures the uncached portion;
+        # cache_creation/read are emitted as separate fields by both the
+        # CLIDispatcher (since #41) and the SDKDispatcher (#121).
+        uncached = int(row.get("input_tokens", 0) or 0)
+        creation = int(row.get("cache_creation_input_tokens", 0) or 0)
+        read = int(row.get("cache_read_input_tokens", 0) or 0)
+        out["input_tokens_uncached"] += uncached
+        out["cache_creation_input_tokens"] += creation
+        out["cache_read_input_tokens"] += read
+
+        phase = row.get("phase", "unknown")
+        bucket = phase_aggregates.setdefault(phase, {
+            "calls": 0,
+            "input_tokens_uncached": 0,
+            "cache_creation_input_tokens": 0,
+            "cache_read_input_tokens": 0,
+        })
+        bucket["calls"] += 1
+        bucket["input_tokens_uncached"] += uncached
+        bucket["cache_creation_input_tokens"] += creation
+        bucket["cache_read_input_tokens"] += read
+
+    total_input = (
+        out["input_tokens_uncached"]
+        + out["cache_creation_input_tokens"]
+        + out["cache_read_input_tokens"]
+    )
+    out["hit_rate"] = (
+        out["cache_read_input_tokens"] / total_input if total_input else 0.0
+    )
+
+    for phase, b in sorted(phase_aggregates.items()):
+        phase_total = (
+            b["input_tokens_uncached"]
+            + b["cache_creation_input_tokens"]
+            + b["cache_read_input_tokens"]
+        )
+        b["hit_rate"] = (
+            b["cache_read_input_tokens"] / phase_total if phase_total else 0.0
+        )
+    out["by_phase"] = phase_aggregates
+    return out
+
+
+def format_cache_stats(stats: dict[str, Any]) -> str:
+    """Render stats as a multiline human-readable summary."""
+    lines: list[str] = []
+    lines.append(f"  Calls:                  {stats['total_calls']}")
+    lines.append(f"  Uncached input tokens:  {stats['input_tokens_uncached']:,}")
+    lines.append(f"  Cache-creation tokens:  {stats['cache_creation_input_tokens']:,}")
+    lines.append(f"  Cache-read tokens:      {stats['cache_read_input_tokens']:,}")
+    lines.append(f"  Hit rate:               {stats['hit_rate']:.1%}")
+    if stats.get("by_phase"):
+        lines.append("")
+        lines.append("  By phase:")
+        for phase, b in stats["by_phase"].items():
+            lines.append(
+                f"    {phase}: {b['calls']} call(s), "
+                f"hit rate {b['hit_rate']:.1%} "
+                f"(read {b['cache_read_input_tokens']:,} / "
+                f"create {b['cache_creation_input_tokens']:,} / "
+                f"uncached {b['input_tokens_uncached']:,})"
+            )
+    return "\n".join(lines)
diff --git a/orchestrator/campaign.py b/orchestrator/campaign.py
index 2ba6a84..bdd1c0b 100644
--- a/orchestrator/campaign.py
+++ b/orchestrator/campaign.py
@@ -206,15 +206,38 @@ def run_campaign(
         HumanGate(auto_response="approve") if auto_approve else HumanGate()
     )
 
-    # Pre-flight: validate CLI + credentials before starting the campaign
+    # GC orphan experiment worktrees (#133): clean up stale dirs from
+    # crashed prior runs before starting fresh ones.
     repo_path = campaign.get("target_system", {}).get("repo_path")
+    if repo_path:
+        try:
+            from orchestrator.worktree import gc_orphan_worktrees
+            removed = gc_orphan_worktrees(Path(repo_path))
+            if removed:
+                logger.info(
+                    "GC'd %d orphan worktree(s): %s",
+                    len(removed), ", ".join(removed),
+                )
+        except (OSError, RuntimeError) as exc:
+            logger.warning("Worktree GC failed: %s", exc)
+
+    # Pre-flight: validate CLI + credentials before starting the campaign.
+    # SDK mode pre-flights via claude-agent-sdk import; API mode via claude CLI.
     if agent != "inline" and repo_path:
-        from orchestrator.cli_dispatch import CLIDispatcher
-        preflight_dispatcher = CLIDispatcher(
-            work_dir=work_dir, campaign=campaign,
-            model=_resolve_model(campaign, "design", model),
-            max_retries=max_cli_retries,
-        )
+        if agent == "sdk":
+            from orchestrator.sdk_dispatch import SDKDispatcher
+            preflight_dispatcher = SDKDispatcher(
+                work_dir=work_dir, campaign=campaign,
+                model=_resolve_model(campaign, "design", model),
+                max_retries=max_cli_retries,
+            )
+        else:
+            from orchestrator.cli_dispatch import CLIDispatcher
+            preflight_dispatcher = CLIDispatcher(
+                work_dir=work_dir, campaign=campaign,
+                model=_resolve_model(campaign, "design", model),
+                max_retries=max_cli_retries,
+            )
         preflight_dispatcher.preflight_check()
 
     start_iter = _resume_completed_campaign(work_dir, max_iterations)
@@ -353,7 +376,7 @@ def main() -> None:
                         help="Timeout in seconds for claude -p calls (default: 1800)")
     parser.add_argument("--max-cli-retries", type=int, default=10,
                         help="Max retries for claude -p failures (-1 = unbounded, default: 10)")
-    parser.add_argument("--agent", choices=["inline", "api"], default="api",
+    parser.add_argument("--agent", choices=["inline", "api", "sdk"], default="api",
                         help="Dispatch backend: 'inline' emits prompts to stdout for the "
                              "calling agent (no subprocess, no API key), "
                              "'api' uses the LLM API (default: api)")
@@ -397,6 +420,14 @@ def main() -> None:
     print(f"Working directory: {work_dir.resolve()}")
     print(f"Max iterations: {max_iter}")
 
+    # Initial CLAUDE.md so iter 1 has campaign brief + (empty) principles
+    # in scope from session start (#131).
+    try:
+        from orchestrator.claude_md import regenerate_from_disk
+        regenerate_from_disk(work_dir, campaign, iteration=0)
+    except (OSError, RuntimeError) as exc:
+        logger.warning("Failed to write initial CLAUDE.md: %s", exc)
+
     run_campaign(
         campaign, work_dir,
         max_iterations=max_iter, model=args.model,
diff --git a/orchestrator/campaign_index.py b/orchestrator/campaign_index.py
new file mode 100644
index 0000000..625a950
--- /dev/null
+++ b/orchestrator/campaign_index.py
@@ -0,0 +1,334 @@
+"""Campaign index — pure functions over the on-disk artifact tree (#126).
+
+These functions are the contract that ``nous-mcp`` (a stdio MCP server,
+shipped in a follow-up phase) exposes as resources and tools. Keeping
+them pure and import-free of MCP itself means:
+
+  * They're trivially testable without spinning up an MCP transport.
+  * The CLI can use them too (``nous list``, ``nous find-principle``)
+    without coupling to the MCP runtime.
+  * A future Routines invocation (#134) can use the same functions to
+    publish findings into a shared store.
+
+Conventions:
+
+  * A "campaign root" is a directory containing ``state.json``,
+    ``ledger.json``, ``principles.json``. Typically ``<repo>/.nous/<run-id>``.
+  * A "search root" is a directory under which we walk to find campaign
+    roots. Searches are bounded to depth 4 so we don't accidentally walk
+    a giant repo.
+  * Functions return plain ``dict``/``list`` JSON-friendly structures so
+    MCP serialization is a no-op.
+"""
+from __future__ import annotations
+
+import json
+import re
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any, Iterable
+
+_MAX_DEPTH = 4
+
+
+def _walk_campaign_roots(search_root: Path, max_depth: int = _MAX_DEPTH) -> Iterable[Path]:
+    """Yield directories under ``search_root`` that look like campaign roots."""
+    search_root = Path(search_root)
+    if not search_root.is_dir():
+        return
+    stack: list[tuple[Path, int]] = [(search_root, 0)]
+    while stack:
+        path, depth = stack.pop()
+        if depth > max_depth:
+            continue
+        try:
+            entries = list(path.iterdir())
+        except (PermissionError, OSError):
+            continue
+        for entry in entries:
+            if not entry.is_dir():
+                continue
+            # Heuristic: a campaign root has state.json + ledger.json.
+            if (entry / "state.json").exists() and (entry / "ledger.json").exists():
+                yield entry
+                # Don't descend further inside a campaign root — its
+                # subdirs (runs/iter-N) aren't themselves campaigns.
+                continue
+            stack.append((entry, depth + 1))
+
+
+def _read_json(path: Path) -> Any:
+    try:
+        return json.loads(path.read_text())
+    except (json.JSONDecodeError, OSError):
+        return None
+
+
+@dataclass
+class CampaignSummary:
+    run_id: str
+    path: str
+    phase: str
+    iteration: int
+    completed_iterations: int
+    active_principles: int
+    repo: str | None = None
+
+    def as_dict(self) -> dict[str, Any]:
+        return {
+            "run_id": self.run_id,
+            "path": self.path,
+            "phase": self.phase,
+            "iteration": self.iteration,
+            "completed_iterations": self.completed_iterations,
+            "active_principles": self.active_principles,
+            "repo": self.repo,
+        }
+
+
+def _summarize(root: Path) -> CampaignSummary | None:
+    state = _read_json(root / "state.json")
+    if not isinstance(state, dict):
+        return None
+    ledger = _read_json(root / "ledger.json")
+    completed = 0
+    if isinstance(ledger, dict):
+        rows = ledger.get("iterations", [])
+        if isinstance(rows, list):
+            completed = sum(
+                1 for r in rows
+                if isinstance(r, dict) and isinstance(r.get("iteration"), int)
+                and r["iteration"] >= 1
+            )
+    principles = _read_json(root / "principles.json")
+    active = 0
+    if isinstance(principles, dict):
+        plist = principles.get("principles", [])
+        if isinstance(plist, list):
+            active = sum(
+                1 for p in plist
+                if isinstance(p, dict) and p.get("status", "active") == "active"
+            )
+    # Best-effort: target repo is the great-grandparent when work_dir
+    # was created as <repo>/.nous/<run-id>.
+    repo: str | None = None
+    if root.parent.name == ".nous":
+        repo = str(root.parent.parent.resolve())
+    return CampaignSummary(
+        run_id=state.get("run_id", root.name),
+        path=str(root.resolve()),
+        phase=state.get("phase", "UNKNOWN"),
+        iteration=int(state.get("iteration", 0) or 0),
+        completed_iterations=completed,
+        active_principles=active,
+        repo=repo,
+    )
+
+
+# ─── list_campaigns ─────────────────────────────────────────────────────────
+
+
+def list_campaigns(
+    search_root: Path,
+    *,
+    query: str | None = None,
+    status: str | None = None,
+    repo: str | None = None,
+) -> list[dict[str, Any]]:
+    """List campaign summaries under ``search_root``.
+
+    Args:
+      search_root: directory to walk.
+      query: case-insensitive substring filter against run_id.
+      status: filter on state.phase (``DONE``, ``EXECUTE_ANALYZE``, etc.).
+      repo: filter on resolved repo path (substring match).
+
+    Returns: list of summary dicts, sorted by run_id.
+    """
+    out: list[dict[str, Any]] = []
+    for root in _walk_campaign_roots(Path(search_root)):
+        summary = _summarize(root)
+        if summary is None:
+            continue
+        if query and query.lower() not in summary.run_id.lower():
+            continue
+        if status and summary.phase != status:
+            continue
+        if repo:
+            if not summary.repo or repo not in summary.repo:
+                continue
+        out.append(summary.as_dict())
+    out.sort(key=lambda d: d["run_id"])
+    return out
+
+
+# ─── search_principles ────────────────────────────────────────────────────
+
+
+@dataclass
+class PrincipleHit:
+    run_id: str
+    path: str  # campaign root
+    principle: dict[str, Any]
+    score: float = 1.0  # placeholder for future semantic search
+
+    def as_dict(self) -> dict[str, Any]:
+        return {
+            "run_id": self.run_id,
+            "path": self.path,
+            "score": self.score,
+            "principle": self.principle,
+        }
+
+
+def search_principles(
+    search_root: Path,
+    text: str,
+    *,
+    only_active: bool = True,
+) -> list[dict[str, Any]]:
+    """Find principles whose statement/description matches ``text``.
+
+    Phase A is plain case-insensitive substring matching; the issue notes
+    embedding-based search as an optional follow-up gated on
+    ``OPENAI_API_KEY``.
+    """
+    needle = text.lower().strip()
+    if not needle:
+        return []
+    hits: list[PrincipleHit] = []
+    for root in _walk_campaign_roots(Path(search_root)):
+        principles = _read_json(root / "principles.json")
+        if not isinstance(principles, dict):
+            continue
+        plist = principles.get("principles", [])
+        if not isinstance(plist, list):
+            continue
+        state = _read_json(root / "state.json") or {}
+        run_id = state.get("run_id", root.name)
+        for p in plist:
+            if not isinstance(p, dict):
+                continue
+            if only_active and p.get("status", "active") != "active":
+                continue
+            haystack = " ".join(
+                str(p.get(field, "")) for field in
+                ("statement", "description", "category", "id")
+            ).lower()
+            if needle in haystack:
+                hits.append(PrincipleHit(
+                    run_id=run_id, path=str(root.resolve()),
+                    principle=p,
+                ))
+    # Stable order: by run_id, then principle id.
+    hits.sort(key=lambda h: (h.run_id, str(h.principle.get("id", ""))))
+    return [h.as_dict() for h in hits]
+
+
+# ─── get_arm_results ──────────────────────────────────────────────────────
+
+
+def get_arm_results(
+    campaign_root: Path,
+    iteration: int,
+    arm: str,
+) -> dict[str, Any]:
+    """Aggregate results for one arm of one iteration.
+
+    Returns: ``{"arm": ..., "iteration": N, "seeds": [{"seed": ..., "files": [...]}]}``.
+    Seeds and their result files are read from ``runs/iter-N/results/<arm>/<seed>/``.
+    """
+    campaign_root = Path(campaign_root)
+    arm_dir = campaign_root / "runs" / f"iter-{iteration}" / "results" / arm
+    seeds: list[dict[str, Any]] = []
+    if arm_dir.is_dir():
+        for seed_dir in sorted(arm_dir.iterdir()):
+            if not seed_dir.is_dir():
+                continue
+            files = sorted(
+                str(p.relative_to(campaign_root))
+                for p in seed_dir.rglob("*") if p.is_file()
+            )
+            seeds.append({"seed": seed_dir.name, "files": files})
+    return {"arm": arm, "iteration": iteration, "seeds": seeds}
+
+
+# ─── compare_iterations ───────────────────────────────────────────────────
+
+
+def compare_iterations(
+    campaign_root: Path,
+    iter_a: int,
+    iter_b: int,
+) -> dict[str, Any]:
+    """Deterministic diff between two iterations' findings.
+
+    Returns the high-level shape:
+      ``{"a": <findings>, "b": <findings>, "delta": {...}}``.
+
+    The delta names which arms changed status (e.g. CONFIRMED → REFUTED)
+    and which principles were added between the two iterations. No
+    timestamps, no stochastic ordering — calling this twice on the same
+    data must produce byte-equal output.
+    """
+    campaign_root = Path(campaign_root)
+
+    def _findings(n: int) -> dict[str, Any] | None:
+        f = _read_json(campaign_root / "runs" / f"iter-{n}" / "findings.json")
+        return f if isinstance(f, dict) else None
+
+    a = _findings(iter_a) or {}
+    b = _findings(iter_b) or {}
+
+    def _arm_status_map(f: dict) -> dict[str, str]:
+        out: dict[str, str] = {}
+        for arm in f.get("arms", []) or []:
+            if isinstance(arm, dict):
+                out[str(arm.get("arm_id", ""))] = str(arm.get("status", ""))
+        return dict(sorted(out.items()))
+
+    delta = {
+        "iter_a": iter_a,
+        "iter_b": iter_b,
+        "arm_status_changes": _arm_status_diff(_arm_status_map(a), _arm_status_map(b)),
+        "principles_added": _principles_added(campaign_root, iter_a, iter_b),
+    }
+    return {"a": a, "b": b, "delta": delta}
+
+
+def _arm_status_diff(a: dict[str, str], b: dict[str, str]) -> list[dict[str, str]]:
+    changes = []
+    for arm_id in sorted(set(a) | set(b)):
+        sa = a.get(arm_id, "absent")
+        sb = b.get(arm_id, "absent")
+        if sa != sb:
+            changes.append({"arm_id": arm_id, "from": sa, "to": sb})
+    return changes
+
+
+def _principles_added(root: Path, iter_a: int, iter_b: int) -> list[str]:
+    def _ids(n: int) -> set[str]:
+        u = _read_json(root / "runs" / f"iter-{n}" / "principle_updates.json")
+        if not isinstance(u, list):
+            return set()
+        return {str(p.get("id", "")) for p in u if isinstance(p, dict) and "id" in p}
+    return sorted(_ids(iter_b) - _ids(iter_a))
+
+
+# ─── Resource paths (the strings the MCP server publishes as resources) ──
+
+
+def resource_uri_for_campaign(run_id: str) -> str:
+    return f"nous://campaigns/{run_id}"
+
+
+def resource_uri_for_state(run_id: str) -> str:
+    return f"nous://campaigns/{run_id}/state"
+
+
+def resource_uri_for_principles(run_id: str) -> str:
+    return f"nous://campaigns/{run_id}/principles"
+
+
+def resource_uri_for_iter_findings(run_id: str, iteration: int) -> str:
+    return f"nous://campaigns/{run_id}/iter/{iteration}/findings"
diff --git a/orchestrator/channels.py b/orchestrator/channels.py
new file mode 100644
index 0000000..9c00621
--- /dev/null
+++ b/orchestrator/channels.py
@@ -0,0 +1,229 @@
+"""Channel notification for human gates (issue #130, Phase A).
+
+Posts a markdown rendering of the gate summary to each configured channel
+webhook so reviewers see the gate on Slack/Telegram/etc. without needing
+to be at the terminal.
+
+Phase A scope: outbound notification only — the campaign still blocks on
+terminal input for the actual decision. Phase B (a follow-up) wires reply
+parsing so an "approve" reply on Slack advances the campaign.
+
+Configuration shape in campaign.yaml::
+
+    channels:
+      - kind: slack
+        webhook_url: https://hooks.slack.com/services/...
+      - kind: webhook
+        url: https://example.com/nous/gate
+        headers:
+          Authorization: Bearer ...
+
+Failures are best-effort: a webhook timeout or 5xx logs at warning and
+does NOT break the gate. The campaign keeps running.
+"""
+from __future__ import annotations
+
+import json
+import logging
+import urllib.error
+import urllib.request
+from pathlib import Path
+from typing import Any, Callable, Iterable
+
+logger = logging.getLogger(__name__)
+
+
+_DEFAULT_TIMEOUT_SECONDS = 10
+
+
+def _summary_to_markdown(summary: dict, *, gate_type: str, iter_dir: Path) -> str:
+    """Render a gate_summary dict as a compact markdown card."""
+    lines = [
+        f"### Nous gate: **{gate_type}**",
+        "",
+        summary.get("summary", "(no summary)"),
+        "",
+    ]
+    points = summary.get("key_points") or []
+    if points:
+        lines.append("**Key points**")
+        for p in points:
+            lines.append(f"- {p}")
+        lines.append("")
+    lines.append(f"_iter dir: `{iter_dir}`_")
+    lines.append("")
+    lines.append("Reply with `approve`, `reject`, or `abort`.")
+    return "\n".join(lines)
+
+
+def _post(url: str, body: bytes, headers: dict[str, str], timeout: float) -> int:
+    """Single HTTP POST. Returns status code; raises on transport error."""
+    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
+    with urllib.request.urlopen(req, timeout=timeout) as resp:
+        return resp.status
+
+
+def _post_slack(channel: dict, markdown: str, timeout: float) -> int:
+    url = channel.get("webhook_url")
+    if not url:
+        raise ValueError("slack channel missing webhook_url")
+    body = json.dumps({"text": markdown}).encode("utf-8")
+    return _post(url, body, {"Content-Type": "application/json"}, timeout)
+
+
+def _post_generic(channel: dict, markdown: str, timeout: float) -> int:
+    url = channel.get("url")
+    if not url:
+        raise ValueError("webhook channel missing url")
+    headers = {"Content-Type": "application/json"}
+    headers.update(channel.get("headers") or {})
+    body = json.dumps({"markdown": markdown}).encode("utf-8")
+    return _post(url, body, headers, timeout)
+
+
+_DISPATCHERS: dict[str, Callable[[dict, str, float], int]] = {
+    "slack": _post_slack,
+    "webhook": _post_generic,
+}
+
+
+def notify_gate(
+    channels: Iterable[dict] | None,
+    *,
+    summary: dict,
+    gate_type: str,
+    iter_dir: Path,
+    timeout: float = _DEFAULT_TIMEOUT_SECONDS,
+    poster: Callable[[str, bytes, dict[str, str], float], int] | None = None,
+) -> list[dict[str, Any]]:
+    """POST a gate summary to every configured channel.
+
+    Args:
+      channels: list of channel configs from campaign.yaml. ``None`` or an
+        empty list is a no-op.
+      summary: parsed gate_summary_<phase>.json contents.
+      gate_type: ``design`` | ``findings`` | ``continue`` etc.
+      iter_dir: iteration directory (shown in the markdown card).
+      timeout: per-request timeout in seconds.
+      poster: dependency-injection seam for tests. When set, used instead
+        of the real urllib.request.urlopen path. Signature matches ``_post``.
+
+    Returns:
+      A list of result dicts — one per channel — with keys
+      ``kind``, ``ok``, ``status_code`` (or ``error``). The campaign uses
+      this to decide what to log, but never raises on individual failures.
+    """
+    if not channels:
+        return []
+
+    markdown = _summary_to_markdown(summary, gate_type=gate_type, iter_dir=iter_dir)
+
+    results: list[dict[str, Any]] = []
+    for channel in channels:
+        kind = channel.get("kind", "webhook")
+        result: dict[str, Any] = {"kind": kind, "ok": False}
+        try:
+            if poster is not None:
+                # Test path: bypass dispatcher, post directly.
+                if kind == "slack":
+                    body = json.dumps({"text": markdown}).encode("utf-8")
+                    url = channel.get("webhook_url", "")
+                    headers = {"Content-Type": "application/json"}
+                else:
+                    body = json.dumps({"markdown": markdown}).encode("utf-8")
+                    url = channel.get("url", "")
+                    headers = {"Content-Type": "application/json"}
+                    headers.update(channel.get("headers") or {})
+                status = poster(url, body, headers, timeout)
+            else:
+                dispatcher = _DISPATCHERS.get(kind)
+                if dispatcher is None:
+                    raise ValueError(f"unknown channel kind: {kind!r}")
+                status = dispatcher(channel, markdown, timeout)
+            result["status_code"] = status
+            result["ok"] = 200 <= status < 300
+        except (urllib.error.URLError, ValueError, TimeoutError, OSError) as exc:
+            logger.warning(
+                "channel %r notify failed: %s", kind, exc,
+            )
+            result["error"] = str(exc)
+        results.append(result)
+    return results
+
+
+# ─── Phase B: reply parsing + wait-for-decision ────────────────────────────
+
+
+_REPLY_TOKENS: dict[str, str] = {
+    "approve": "approve",
+    "approved": "approve",
+    "lgtm": "approve",
+    "ok": "approve",
+    "yes": "approve",
+    "reject": "reject",
+    "rejected": "reject",
+    "no": "reject",
+    "redesign": "reject",
+    "abort": "abort",
+    "stop": "abort",
+    "cancel": "abort",
+}
+
+
+def parse_reply(text: str) -> str | None:
+    """Map a free-form channel reply to a gate Decision.
+
+    Returns ``"approve"`` / ``"reject"`` / ``"abort"`` when the message
+    starts with (or is exactly) a recognized token. Returns ``None``
+    when the reply doesn't decode to a decision — caller should keep
+    waiting or fall through to the timeout.
+
+    Recognized tokens (case-insensitive):
+      approve | approved | lgtm | ok | yes  -> approve
+      reject  | rejected | no   | redesign  -> reject
+      abort   | stop     | cancel           -> abort
+    """
+    if not isinstance(text, str):
+        return None
+    head = text.strip().lower().split()
+    if not head:
+        return None
+    return _REPLY_TOKENS.get(head[0])
+
+
+def wait_for_reply(
+    reply_provider: "Callable[[], str | None]",
+    *,
+    timeout_seconds: float,
+    poll_interval_seconds: float = 1.0,
+    sleeper: "Callable[[float], None] | None" = None,
+    clock: "Callable[[], float] | None" = None,
+) -> str | None:
+    """Poll ``reply_provider`` until it returns a recognized decision or
+    timeout elapses.
+
+    Args:
+      reply_provider: callable returning the latest channel message text
+        (or ``None`` if no new reply yet).
+      timeout_seconds: max time to wait before returning ``None``.
+      poll_interval_seconds: how long to sleep between polls.
+      sleeper: dependency-injection seam for tests (default: time.sleep).
+      clock: dependency-injection seam for tests (default: time.time).
+
+    Returns:
+      ``"approve"`` / ``"reject"`` / ``"abort"`` on first recognized reply.
+      ``None`` on timeout — caller should fall back to ``--auto-approve``
+      semantics (the issue's documented timeout behavior).
+    """
+    import time as _time
+    sleep = sleeper if sleeper is not None else _time.sleep
+    now = clock if clock is not None else _time.time
+
+    deadline = now() + timeout_seconds
+    while now() < deadline:
+        text = reply_provider()
+        decision = parse_reply(text) if text is not None else None
+        if decision is not None:
+            return decision
+        sleep(poll_interval_seconds)
+    return None
diff --git a/orchestrator/claude_md.py b/orchestrator/claude_md.py
new file mode 100644
index 0000000..81ace8b
--- /dev/null
+++ b/orchestrator/claude_md.py
@@ -0,0 +1,159 @@
+"""Per-campaign ``CLAUDE.md`` generator (issue #131).
+
+Claude Code auto-loads ``CLAUDE.md`` from each working / added directory
+on every session, **once**. That makes it the right home for content that
+is stable across calls within a campaign:
+
+  * The campaign brief (research question, target system, observable
+    metrics, controllable knobs).
+  * The accumulated ``principles.json`` — the campaign's living knowledge
+    base.
+  * The most recent ``handoff.md`` — designer-to-executor context.
+
+This module is a pure renderer: ``render_campaign_claude_md`` takes
+inputs and returns a string; ``write_campaign_claude_md`` writes it to
+disk. Regeneration after each iteration is deterministic Python — never
+an LLM call.
+
+The win this enables (full payoff lands when the prompt-template refactor
+ships): each Nous LLM call no longer re-injects the campaign brief and
+principles. Compounded with #122's ``cache_control: ephemeral`` on the
+methodology system block, the bulk of static context is paid for once
+per session, not once per turn.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any
+
+from orchestrator.util import atomic_write
+
+
+_HEADER = """# Nous Campaign Context
+
+> This file is auto-generated by the orchestrator. **Do not hand-edit** —
+> changes will be overwritten on the next iteration. The orchestrator
+> updates the principles section after every iteration.
+"""
+
+
+def _format_principles(principles: list[dict] | None) -> str:
+    """Render principles.json contents as a readable markdown section."""
+    if not principles:
+        return "_No principles accumulated yet._"
+    lines: list[str] = []
+    for p in principles:
+        if not isinstance(p, dict):
+            continue
+        pid = p.get("id", "?")
+        statement = p.get("statement") or p.get("description") or "(no statement)"
+        category = p.get("category", "general")
+        status = p.get("status", "active")
+        if status != "active":
+            continue
+        lines.append(f"- **{pid}** [{category}]: {statement}")
+    if not lines:
+        return "_No active principles._"
+    return "\n".join(lines)
+
+
+def _format_target(target: dict) -> str:
+    parts = [
+        f"**{target.get('name', 'Unknown system')}**",
+        target.get("description", ""),
+    ]
+    metrics = target.get("observable_metrics")
+    if metrics:
+        parts.append(f"\n**Observable metrics:** {', '.join(metrics)}")
+    knobs = target.get("controllable_knobs")
+    if knobs:
+        parts.append(f"\n**Controllable knobs:** {', '.join(knobs)}")
+    return "\n".join(p for p in parts if p)
+
+
+def render_campaign_claude_md(
+    *,
+    campaign: dict,
+    principles: list[dict] | None = None,
+    last_handoff: str | None = None,
+    iteration: int | None = None,
+) -> str:
+    """Build the CLAUDE.md content for one campaign.
+
+    Sections (markdown headings the agent can navigate):
+      1. Campaign brief — research_question, target_system summary.
+      2. Active principles — formatted list from principles.json.
+      3. Last handoff — designer→executor handoff from the most recent
+         iteration that produced one (empty in iter 1).
+
+    Returns the full markdown text. Caller is responsible for writing
+    it to disk via ``write_campaign_claude_md``.
+    """
+    research_question = campaign.get("research_question", "(not set)")
+    target = campaign.get("target_system", {})
+
+    iter_line = f" (after iteration {iteration})" if iteration else ""
+
+    sections = [
+        _HEADER,
+        "## Research Question\n",
+        research_question.strip(),
+        "",
+        "## Target System\n",
+        _format_target(target),
+        "",
+        f"## Active Principles{iter_line}\n",
+        _format_principles(principles or []),
+        "",
+        "## Most Recent Handoff\n",
+    ]
+    if last_handoff and last_handoff.strip():
+        sections.append(last_handoff.strip())
+    else:
+        sections.append("_First iteration — no prior handoff._")
+    return "\n".join(sections) + "\n"
+
+
+def write_campaign_claude_md(work_dir: Path, content: str) -> Path:
+    """Atomically write CLAUDE.md to the campaign work-dir.
+
+    Returns the absolute path to the file.
+    """
+    target = Path(work_dir) / "CLAUDE.md"
+    atomic_write(target, content)
+    return target.resolve()
+
+
+def regenerate_from_disk(work_dir: Path, campaign: dict, iteration: int) -> Path:
+    """Refresh CLAUDE.md after iteration N completes.
+
+    Reads the current ``principles.json`` and ``handoff.md`` from
+    ``work_dir`` and writes a freshly-rendered CLAUDE.md. Returns the
+    absolute path written.
+    """
+    work_dir = Path(work_dir)
+    principles: list[dict[str, Any]] = []
+    p_path = work_dir / "principles.json"
+    if p_path.exists():
+        try:
+            store = json.loads(p_path.read_text())
+            principles = store.get("principles", [])
+        except (json.JSONDecodeError, OSError):
+            principles = []
+
+    handoff_text: str | None = None
+    h_path = work_dir / "handoff.md"
+    if h_path.exists():
+        try:
+            handoff_text = h_path.read_text()
+        except OSError:
+            handoff_text = None
+
+    content = render_campaign_claude_md(
+        campaign=campaign,
+        principles=principles,
+        last_handoff=handoff_text,
+        iteration=iteration,
+    )
+    return write_campaign_claude_md(work_dir, content)
diff --git a/orchestrator/cli.py b/orchestrator/cli.py
index 755e9d9..81ef69f 100644
--- a/orchestrator/cli.py
+++ b/orchestrator/cli.py
@@ -161,26 +161,41 @@ def _cmd_validate(args):
 
 
 def _cmd_status(args):
-    import json
+    """Status surface — one-shot, single-line, or live --watch (#127)."""
+    import time as _time
+    from orchestrator.status import (
+        format_one_liner,
+        format_watch_panel,
+        read_status_snapshot,
+    )
 
     work_dir = resolve_work_dir(args.target)
-    state_file = work_dir / "state.json"
-    if not state_file.exists():
+    if not (work_dir / "state.json").exists():
         print(f"Error: no state.json at {work_dir}", file=sys.stderr)
         sys.exit(1)
 
-    state = json.loads(state_file.read_text())
-    ledger = json.loads((work_dir / "ledger.json").read_text()) if (work_dir / "ledger.json").exists() else {"iterations": []}
-    principles = json.loads((work_dir / "principles.json").read_text()) if (work_dir / "principles.json").exists() else {"principles": []}
-
-    active_principles = [p for p in principles.get("principles", []) if p.get("status") == "active"]
-    completed = [it for it in ledger.get("iterations", []) if it.get("iteration", 0) > 0]
+    if getattr(args, "line", False):
+        print(format_one_liner(read_status_snapshot(work_dir)))
+        return
 
-    print(f"Campaign:    {state.get('run_id', '?')}")
-    print(f"Phase:       {state.get('phase', '?')}")
-    print(f"Iteration:   {state.get('iteration', '?')}")
-    print(f"Completed:   {len(completed)} iteration(s)")
-    print(f"Principles:  {len(active_principles)} active")
+    if getattr(args, "watch", False):
+        try:
+            while True:
+                snap = read_status_snapshot(work_dir)
+                # Clear screen + home cursor (ANSI). Falls back gracefully
+                # in non-tty contexts to a separator line.
+                if sys.stdout.isatty():
+                    sys.stdout.write("\033[2J\033[H")
+                else:
+                    sys.stdout.write("\n" + "─" * 60 + "\n")
+                sys.stdout.write(format_watch_panel(snap) + "\n")
+                sys.stdout.flush()
+                _time.sleep(args.interval if args.interval > 0 else 2)
+        except KeyboardInterrupt:
+            print()
+            return
+
+    print(format_watch_panel(read_status_snapshot(work_dir)))
 
 
 def _cmd_cost(args):
@@ -206,6 +221,11 @@ def _cmd_cost(args):
         for phase, b in s["by_phase"].items():
             print(f"  {phase:20s}  {b['calls']} calls  ${b['cost_usd']:.4f}  {b['input_tokens']+b['output_tokens']} tok")
 
+    if getattr(args, "cache_stats", False):
+        from orchestrator.cache_stats import cache_stats, format_cache_stats
+        print("\nCache stats:")
+        print(format_cache_stats(cache_stats(metrics_path)))
+
 
 def _cmd_report(args):
     import logging
@@ -310,7 +330,7 @@ def main():
     p_run.add_argument("--auto-approve", action="store_true")
     p_run.add_argument("--timeout", type=int, default=1800)
     p_run.add_argument("--max-cli-retries", type=int, default=10)
-    p_run.add_argument("--agent", choices=["inline", "api"], default="api")
+    p_run.add_argument("--agent", choices=["inline", "api", "sdk"], default="api")
     p_run.set_defaults(func=_cmd_run)
 
     p_resume = subparsers.add_parser("resume")
@@ -320,7 +340,7 @@ def main():
     p_resume.add_argument("--auto-approve", action="store_true")
     p_resume.add_argument("--timeout", type=int, default=1800)
     p_resume.add_argument("--max-cli-retries", type=int, default=10)
-    p_resume.add_argument("--agent", choices=["inline", "api"], default="api")
+    p_resume.add_argument("--agent", choices=["inline", "api", "sdk"], default="api")
     p_resume.set_defaults(func=_cmd_resume)
 
     p_validate = subparsers.add_parser("validate")
@@ -330,17 +350,33 @@ def main():
 
     p_status = subparsers.add_parser("status")
     p_status.add_argument("target")
+    p_status.add_argument(
+        "--watch", action="store_true",
+        help="Loop and redraw every --interval seconds (#127).",
+    )
+    p_status.add_argument(
+        "--line", action="store_true",
+        help="Print a single-line summary suitable for shell prompts (#127).",
+    )
+    p_status.add_argument(
+        "--interval", type=float, default=2.0,
+        help="Watch redraw interval in seconds (default: 2).",
+    )
     p_status.set_defaults(func=_cmd_status)
 
     p_cost = subparsers.add_parser("cost")
     p_cost.add_argument("target")
+    p_cost.add_argument(
+        "--cache-stats", action="store_true",
+        help="Include prompt-cache hit-rate stats (#122).",
+    )
     p_cost.set_defaults(func=_cmd_cost)
 
     p_report = subparsers.add_parser("report")
     p_report.add_argument("target")
     p_report.add_argument("--model")
     p_report.add_argument("--timeout", type=int, default=1800)
-    p_report.add_argument("--agent", choices=["inline", "api"], default="api")
+    p_report.add_argument("--agent", choices=["inline", "api", "sdk"], default="api")
     p_report.set_defaults(func=_cmd_report)
 
     p_replay = subparsers.add_parser("replay")
diff --git a/orchestrator/cli_dispatch.py b/orchestrator/cli_dispatch.py
index 5a4c968..8f2e2e1 100644
--- a/orchestrator/cli_dispatch.py
+++ b/orchestrator/cli_dispatch.py
@@ -51,6 +51,7 @@ def __init__(
         timeout: int = 1800,
         max_turns: int = 25,
         max_retries: int | None = 10,
+        settings_path: Path | None = None,
     ) -> None:
         super().__init__(
             work_dir=work_dir,
@@ -66,6 +67,13 @@ def __init__(
         self.max_retries = max_retries
         repo_path = campaign.get("target_system", {}).get("repo_path")
         self._cwd = Path(repo_path) if repo_path else None
+        # Per-campaign permission policy (#135). When set, replaces the
+        # blanket --dangerously-skip-permissions with a fine-grained settings
+        # file. Auto-resolved from work_dir/.claude/settings.json if it exists.
+        if settings_path is None:
+            candidate = Path(work_dir) / ".claude" / "settings.json"
+            settings_path = candidate if candidate.exists() else None
+        self._settings_path = settings_path
 
     @contextmanager
     def override_cwd(self, cwd: Path):
@@ -216,8 +224,11 @@ def _retry_cli_schema(
 
     def _call_claude(self, prompt: str, max_turns: int | None = None) -> str:
         """Invoke `claude -p` with the prompt on stdin, retrying transient failures."""
-        cmd = ["claude", "-p", "--model", self.model, "--output-format", "json",
-               "--dangerously-skip-permissions"]
+        cmd = ["claude", "-p", "--model", self.model, "--output-format", "json"]
+        if self._settings_path is not None:
+            cmd += ["--settings", str(self._settings_path)]
+        else:
+            cmd += ["--dangerously-skip-permissions"]
         turns = max_turns or self.max_turns
         cmd += ["--max-turns", str(turns)]
         cwd = self._cwd
diff --git a/orchestrator/explore_design.py b/orchestrator/explore_design.py
new file mode 100644
index 0000000..4d037d3
--- /dev/null
+++ b/orchestrator/explore_design.py
@@ -0,0 +1,257 @@
+"""Explore-then-synthesize DESIGN phase (issue #132).
+
+DESIGN today asks one Opus session to do two things at once:
+
+  1. Read the codebase to map metrics, knobs, prior findings, principles.
+  2. Synthesize a hypothesis bundle from what it found.
+
+That's the canonical Claude-Code-pattern miss: broad exploration + small
+synthesis is exactly what parallel Explore subagents are for. Phase A
+of #132 ships the orchestration layer that makes the split possible
+without changing what gets produced (problem.md + bundle.yaml).
+
+Stage A — parallel Explore: ``run_explore_stage(campaign, scopes,
+runner)`` fans out one read-only subagent per scope and collects their
+reports.
+
+Stage B — Opus synthesis: ``build_synthesis_prompt(reports, campaign,
+iteration)`` produces the prompt body for the single Opus call that
+turns the explorer reports + principles.json into problem.md +
+bundle.yaml.
+
+Phase A is the orchestration helpers + their behavioral tests. The
+dispatcher integration (SDKDispatcher spawning Explore subagents,
+threading reports back into a synthesis call) lands in Phase B once
+#121 merges and the team picks injection points.
+"""
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Callable, Iterable
+
+# Default exploration scopes — one Explore subagent per scope. The
+# scopes are deliberately overlapping a little so synthesis has
+# redundant signal where it matters.
+DEFAULT_EXPLORE_SCOPES: tuple[str, ...] = (
+    "metrics",        # observable metrics + how they're collected
+    "knobs",          # controllable knobs + their value ranges
+    "prior_findings", # findings.json from previous iterations
+    "principles",     # principles.json across the campaign + others
+)
+
+
+@dataclass
+class ExploreReport:
+    scope: str
+    text: str
+    duration_ms: int = 0
+    input_tokens: int = 0
+    output_tokens: int = 0
+
+    def as_dict(self) -> dict:
+        return {
+            "scope": self.scope,
+            "text": self.text,
+            "duration_ms": self.duration_ms,
+            "input_tokens": self.input_tokens,
+            "output_tokens": self.output_tokens,
+        }
+
+
+@dataclass
+class ExploreStageResult:
+    reports: list[ExploreReport] = field(default_factory=list)
+
+    @property
+    def total_input_tokens(self) -> int:
+        return sum(r.input_tokens for r in self.reports)
+
+    @property
+    def total_output_tokens(self) -> int:
+        return sum(r.output_tokens for r in self.reports)
+
+    def by_scope(self, scope: str) -> ExploreReport | None:
+        for r in self.reports:
+            if r.scope == scope:
+                return r
+        return None
+
+
+def build_explore_prompt(scope: str, campaign: dict) -> str:
+    """Construct a read-only Explore subagent prompt for one scope.
+
+    The subagent should be spawned with ``subagent_type="Explore"`` so
+    it cannot mutate the worktree. The prompt is short and scope-tight
+    on purpose; the synthesis call (Stage B) is where multi-aspect
+    integration happens.
+    """
+    target = campaign.get("target_system", {})
+    name = target.get("name", "the target system")
+    repo = target.get("repo_path", "(repo not configured)")
+
+    if scope == "metrics":
+        focus = (
+            "Map the observable metrics this system exposes and how they "
+            "are collected. Include the file/function where each metric is "
+            "computed."
+        )
+    elif scope == "knobs":
+        focus = (
+            "Map the controllable knobs / configuration parameters this "
+            "system exposes. For each knob, note its declared range and the "
+            "code path that consumes it."
+        )
+    elif scope == "prior_findings":
+        focus = (
+            "Read prior runs/iter-*/findings.json files in the campaign "
+            "directory. Summarize confirmed/refuted hypotheses and any open "
+            "questions surfaced by the most recent iteration."
+        )
+    elif scope == "principles":
+        focus = (
+            "Read principles.json in this campaign and any sibling campaigns "
+            "(via the campaign_index module if available). Flag principles "
+            "that touch the same mechanism we're about to design for."
+        )
+    else:
+        focus = f"Investigate the '{scope}' aspect of the target system."
+
+    return (
+        f"# Explore: {scope}\n\n"
+        f"You are a read-only Explore subagent. **Do not modify any files.**\n"
+        f"Target: {name} (repo at {repo})\n\n"
+        f"## Focus\n{focus}\n\n"
+        f"## Output\n"
+        f"Return a markdown report of <= 500 lines. Cite file paths and "
+        f"line numbers. End with a one-paragraph summary the synthesizer "
+        f"can read in isolation.\n"
+    )
+
+
+ExploreRunner = Callable[[str, str, dict], ExploreReport]
+"""Callable signature for running one Explore subagent.
+
+Takes (scope, prompt, campaign) and returns an ExploreReport. The
+default real-world implementation spawns subagent_type="Explore" via
+the SDK and reads the assistant's final text. Tests inject a deterministic
+fake.
+"""
+
+
+def run_explore_stage(
+    campaign: dict,
+    *,
+    scopes: Iterable[str] = DEFAULT_EXPLORE_SCOPES,
+    runner: ExploreRunner,
+) -> ExploreStageResult:
+    """Run one Explore subagent per scope and collect their reports.
+
+    Phase A executes synchronously over the runner. Real parallel
+    fan-out (anyio gather over the SDK's async API) lands in Phase B
+    when the SDK runner ships its async surface.
+    """
+    reports: list[ExploreReport] = []
+    for scope in scopes:
+        prompt = build_explore_prompt(scope, campaign)
+        report = runner(scope, prompt, campaign)
+        reports.append(report)
+    return ExploreStageResult(reports=reports)
+
+
+def make_sdk_explore_runner(
+    *,
+    sdk_runner: Callable,
+    cwd: Path | None = None,
+    model: str = "claude-haiku-4-5",
+    max_turns: int = 8,
+) -> ExploreRunner:
+    """Build an ExploreRunner backed by an SDK subagent (#132 Phase B).
+
+    Each scope spawns a read-only subagent (``subagent_type="Explore"``)
+    so the orchestrator gets parallel mapping without a giant Opus
+    session doing both walking and synthesis. Per the no-live-LLM
+    project principle (CLAUDE.md), this factory takes an injected
+    ``sdk_runner`` — production wiring constructs the real Anthropic
+    SDK runner; tests inject a recording fake.
+
+    Defaults model to Haiku because read-only mapping is cheap and
+    benefits from speed over depth; deep synthesis happens in Stage B
+    (the single Opus call), not in Stage A.
+    """
+    def _run(scope: str, prompt: str, campaign: dict) -> ExploreReport:
+        try:
+            result = sdk_runner(
+                prompt=prompt,
+                model=model,
+                cwd=cwd,
+                max_turns=max_turns,
+                system_prompt=None,
+                settings_path=None,
+                event_log_path=None,
+                subagent_type="Explore",
+            )
+        except TypeError:
+            # Older runners without subagent_type — fall back to the
+            # base signature so the factory stays compatible across
+            # SDK API evolution.
+            result = sdk_runner(
+                prompt=prompt, model=model, cwd=cwd, max_turns=max_turns,
+            )
+
+        return ExploreReport(
+            scope=scope,
+            text=getattr(result, "text", "") or "",
+            duration_ms=int(getattr(result, "duration_ms", 0) or 0),
+            input_tokens=int(getattr(result, "input_tokens", 0) or 0),
+            output_tokens=int(getattr(result, "output_tokens", 0) or 0),
+        )
+
+    return _run
+
+
+def build_synthesis_prompt(
+    stage_a: ExploreStageResult,
+    *,
+    campaign: dict,
+    iteration: int,
+    iter_dir: Path,
+) -> str:
+    """Build the Opus synthesis prompt that turns Explore reports into
+    problem.md + bundle.yaml.
+
+    The synthesizer never reads the codebase directly — it consumes only
+    the explorer reports + principles.json. That's the whole point of
+    the split: Opus on integration, not on file walks.
+    """
+    target = campaign.get("target_system", {})
+    rq = campaign.get("research_question", "(not set)")
+
+    sections = [
+        f"# Synthesize iteration {iteration}",
+        "",
+        "Four read-only Explore subagents have already mapped the system.",
+        "**Do not re-read the codebase.** Synthesize from the reports below.",
+        "",
+        f"## Research question\n{rq}",
+        "",
+        f"## Target\n{target.get('name', '?')} — {target.get('description', '')}",
+        "",
+        "## Explorer reports",
+    ]
+    for report in stage_a.reports:
+        sections.append("")
+        sections.append(f"### {report.scope}\n")
+        sections.append(report.text)
+
+    sections.extend([
+        "",
+        "## Required outputs",
+        f"- {iter_dir}/problem.md (markdown)",
+        f"- {iter_dir}/bundle.yaml (YAML, must validate against bundle.schema.yaml)",
+        "",
+        "Cite explorer reports by their `### <scope>` heading when justifying "
+        "design choices. The reports are the source of truth for this "
+        "iteration's design.",
+    ])
+    return "\n".join(sections)
diff --git a/orchestrator/goal_driven.py b/orchestrator/goal_driven.py
new file mode 100644
index 0000000..33b421a
--- /dev/null
+++ b/orchestrator/goal_driven.py
@@ -0,0 +1,175 @@
+"""`/goal`-driven campaign mode (issue #124).
+
+Two modes Nous can run in:
+
+  Mode A — fully /goal-driven: spawn one ``claude`` session for the
+    whole campaign with a /goal directive that says "iteration N has
+    a valid findings.json and a principle_updates.json file, OR stop
+    after the campaign timeout." The Haiku evaluator that fires after
+    every turn decides when the goal is met. No Python state machine
+    in the inner loop.
+
+  Mode B — /goal-bounded inner loop: keep the engine.py state machine
+    for control flow but use /goal *within* EXECUTE_ANALYZE so the
+    executor terminates as soon as validation passes. Cheaper than
+    Python-driven retry loops.
+
+Phase A ships the prompt builders for both modes (deterministic Python).
+Wire-up into the dispatcher and the run_campaign code path lands in
+Phase B once the team picks which mode is the default.
+
+Why deterministic prompt builders ship first: criterion #2 of the issue
+("hybrid mode is the default for nous run after one release of soak")
+implies the team will run both modes side by side on real campaigns
+and compare. Behavioral testing of the prompt assembly — does it
+include the campaign brief, does it spell out the goal predicate
+exactly — is what makes those soak runs comparable.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+
+_DEFAULT_GOAL_DRIVEN_TIMEOUT_HOURS = 24
+
+
+def build_full_goal_directive(
+    campaign: dict,
+    *,
+    iteration: int,
+    timeout_hours: int = _DEFAULT_GOAL_DRIVEN_TIMEOUT_HOURS,
+) -> str:
+    """Build the /goal text for Mode A (whole-campaign goal).
+
+    The text is what gets sent as ``/goal "<...>"`` to a Claude Code
+    session. Predicate: iteration N has a valid findings.json AND a
+    principle_updates.json file, OR the elapsed time exceeds
+    timeout_hours.
+    """
+    return (
+        f"iteration {iteration} has produced runs/iter-{iteration}/findings.json "
+        f"with a non-empty arms list AND runs/iter-{iteration}/principle_updates.json "
+        f"with a list (possibly empty), OR more than {timeout_hours} hours have elapsed "
+        f"since this session started"
+    )
+
+
+def build_inner_loop_goal_directive(
+    iteration: int,
+    *,
+    extra_predicates: list[str] | None = None,
+) -> str:
+    """Build the /goal text for Mode B (EXECUTE_ANALYZE-bounded goal).
+
+    Predicate: validate execution passes AND principle_updates.json
+    exists. The deterministic Stop hook (#129) also enforces this; the
+    /goal evaluator is the probabilistic backup that catches edge cases
+    the schema check doesn't.
+    """
+    parts = [
+        f"runs/iter-{iteration}/findings.json validates against findings.schema.json",
+        f"runs/iter-{iteration}/principle_updates.json exists and parses as a list",
+    ]
+    if extra_predicates:
+        parts.extend(extra_predicates)
+    return " AND ".join(parts)
+
+
+def build_goal_driven_session_prompt(
+    campaign: dict,
+    *,
+    iteration: int,
+    timeout_hours: int = _DEFAULT_GOAL_DRIVEN_TIMEOUT_HOURS,
+    work_dir: Path | None = None,
+) -> str:
+    """Build the full prompt body for a Mode A session.
+
+    The prompt asks the agent to drive iteration N of the Nous loop
+    end-to-end inside the session, printing artifact paths so the Haiku
+    /goal evaluator can see them.
+    """
+    target = campaign.get("target_system", {})
+    rq = campaign.get("research_question", "(not set)")
+
+    sections = [
+        "# Goal-driven Nous campaign",
+        "",
+        "You are running iteration {iter} of a Nous hypothesis-driven experiment.",
+        "Drive the full DESIGN → EXECUTE_ANALYZE → DONE flow inside this session.",
+        "",
+        "## Campaign brief",
+        f"- Research question: {rq}",
+        f"- Target system: {target.get('name', '?')}",
+        f"- Description: {target.get('description', '(no description)')}",
+    ]
+    metrics = target.get("observable_metrics")
+    if metrics:
+        sections.append(f"- Observable metrics: {', '.join(metrics)}")
+    knobs = target.get("controllable_knobs")
+    if knobs:
+        sections.append(f"- Controllable knobs: {', '.join(knobs)}")
+
+    sections.extend([
+        "",
+        "## Required artifacts (iteration {iter})",
+        f"- runs/iter-{iteration}/problem.md",
+        f"- runs/iter-{iteration}/bundle.yaml",
+        f"- runs/iter-{iteration}/experiment_plan.yaml",
+        f"- runs/iter-{iteration}/findings.json",
+        f"- runs/iter-{iteration}/principle_updates.json",
+        "",
+        "**Print every artifact path to stdout when you write it.** The /goal "
+        "evaluator only sees what's been surfaced in the conversation; "
+        "silent file writes won't trip the goal predicate.",
+        "",
+        "Run `nous validate execution --dir runs/iter-{iter}/` before claiming done.",
+        "",
+        "## Goal predicate",
+        f"/goal {build_full_goal_directive(campaign, iteration=iteration, timeout_hours=timeout_hours)!r}",
+    ])
+
+    text = "\n".join(sections)
+    return text.replace("{iter}", str(iteration))
+
+
+# ─── Phase B: dispatcher wire-up ────────────────────────────────────────────
+
+
+def run_goal_driven_iteration(
+    *,
+    dispatcher,
+    campaign: dict,
+    iteration: int,
+    work_dir: Path,
+    timeout_hours: int = _DEFAULT_GOAL_DRIVEN_TIMEOUT_HOURS,
+) -> Path:
+    """Mode A — drive iteration N entirely inside a single SDK session.
+
+    Bypasses the engine.py phase machine. The agent receives the
+    goal-driven prompt (with its embedded ``/goal`` directive) and
+    drives DESIGN → EXECUTE_ANALYZE → DONE itself. The orchestrator
+    persists the conversation transcript as ``design_log.md``; the
+    artifacts (problem.md, bundle.yaml, findings.json, etc.) are
+    written by the agent's own tool calls inside the session.
+
+    Args:
+      dispatcher: any object exposing ``_call_claude(prompt) -> str``.
+        ``SDKDispatcher`` is the canonical caller; tests inject a fake.
+      campaign: parsed campaign config.
+      iteration: iteration number to drive.
+      work_dir: campaign work-dir.
+      timeout_hours: bound on the goal predicate's OR clause.
+
+    Returns:
+      Path to the conversation log on disk.
+    """
+    iter_dir = Path(work_dir) / "runs" / f"iter-{iteration}"
+    iter_dir.mkdir(parents=True, exist_ok=True)
+    prompt = build_goal_driven_session_prompt(
+        campaign, iteration=iteration, timeout_hours=timeout_hours,
+    )
+    transcript = dispatcher._call_claude(prompt)
+    log_path = iter_dir / "design_log.md"
+    log_path.write_text(transcript)
+    return log_path
diff --git a/orchestrator/iteration.py b/orchestrator/iteration.py
index 29e9712..a3f4ea9 100644
--- a/orchestrator/iteration.py
+++ b/orchestrator/iteration.py
@@ -193,7 +193,17 @@ def setup_work_dir(run_id: str, repo_path: str | None = None) -> Path:
     If repo_path is provided, the campaign directory is created inside
     the target repo at .nous/<run_id>/. Otherwise falls back to creating
     <run_id>/ in the current directory.
+
+    Also writes a per-campaign ``.claude/settings.json`` permission policy
+    (issue #135) so dispatchers can pass ``--settings <path>`` instead of
+    ``--dangerously-skip-permissions``.
     """
+    from orchestrator.settings_template import (
+        render_campaign_settings,
+        settings_path_for,
+        write_campaign_settings,
+    )
+
     if repo_path:
         work_dir = Path(repo_path) / ".nous" / run_id
     else:
@@ -206,13 +216,35 @@ def setup_work_dir(run_id: str, repo_path: str | None = None) -> Path:
     state = json.loads((work_dir / "state.json").read_text())
     state["run_id"] = run_id
     atomic_write(work_dir / "state.json", json.dumps(state, indent=2) + "\n")
+
+    # Per-campaign permission policy. Idempotent: don't overwrite a settings
+    # file the user has hand-edited.
+    settings_path = settings_path_for(work_dir)
+    if not settings_path.exists():
+        bin_dir = Path(__file__).resolve().parent.parent / "bin"
+        stop_hook = bin_dir / "nous-execute-stop"
+        plan_enforcer = bin_dir / "nous-plan-enforcer"
+        settings = render_campaign_settings(
+            work_dir=work_dir,
+            repo_path=Path(repo_path) if repo_path else None,
+            stop_hook_path=stop_hook if stop_hook.exists() else None,
+            pre_tool_use_hook_path=plan_enforcer if plan_enforcer.exists() else None,
+        )
+        write_campaign_settings(settings_path, settings)
+
     return work_dir
 
 
 def _generate_gate_summary(
     dispatcher, iter_dir: Path, iteration: int, gate_type: str,
+    *, campaign: dict | None = None,
 ) -> Path | None:
-    """Generate a gate summary file. Returns the path, or None on failure."""
+    """Generate a gate summary file. Returns the path, or None on failure.
+
+    When ``campaign`` is provided and contains a non-empty ``channels`` list,
+    also fires off a per-channel notification (#130) with the rendered
+    summary. Channel failures are logged at warning and never block the gate.
+    """
     summary_path = iter_dir / f"gate_summary_{gate_type}.json"
     try:
         dispatcher.dispatch(
@@ -221,13 +253,33 @@ def _generate_gate_summary(
             iteration=iteration,
             perspective=gate_type,
         )
-        return summary_path
     except (RuntimeError, FileNotFoundError, OSError) as exc:
         logger = logging.getLogger(__name__)
         logger.warning("Gate summary generation failed: %s", exc)
         print(f"  (Gate summary skipped: {exc})")
         return None
 
+    # Channel notification (#130 Phase A): outbound only; the campaign still
+    # blocks on terminal input for the actual decision.
+    if campaign:
+        channels = campaign.get("channels")
+        if channels:
+            try:
+                from orchestrator.channels import notify_gate
+                summary = json.loads(summary_path.read_text())
+                results = notify_gate(
+                    channels, summary=summary, gate_type=gate_type,
+                    iter_dir=iter_dir,
+                )
+                ok = sum(1 for r in results if r.get("ok"))
+                if ok:
+                    print(f"  (notified {ok}/{len(results)} channel(s))")
+            except (json.JSONDecodeError, OSError, RuntimeError) as exc:
+                logger = logging.getLogger(__name__)
+                logger.warning("Channel notification failed: %s", exc)
+
+    return summary_path
+
 
 def run_iteration(
     campaign: dict,
@@ -281,9 +333,15 @@ def _max_turns_for(phase_key: str) -> int:
         cli_dispatcher = inline_dispatcher
         llm_dispatcher = inline_dispatcher
     else:
-        # API mode: CLIDispatcher for code-access roles only (when repo_path is set)
+        # API or SDK mode: code-access dispatcher only when repo_path is set.
+        # SDK uses claude-agent-sdk; api uses the claude -p subprocess (CLIDispatcher).
+        if agent == "sdk":
+            from orchestrator.sdk_dispatch import SDKDispatcher
+            code_dispatcher_cls = SDKDispatcher
+        else:
+            code_dispatcher_cls = CLIDispatcher
         cli_dispatcher = (
-            CLIDispatcher(
+            code_dispatcher_cls(
                 work_dir=work_dir, campaign=campaign,
                 model=_model_for("design"), timeout=timeout,
                 max_turns=_max_turns_for("design"),
@@ -345,7 +403,7 @@ def _max_turns_for(phase_key: str) -> int:
         print(f"\n{'='*60}")
         print(f"  HUMAN DESIGN GATE")
         print(f"{'='*60}")
-        summary_path = _generate_gate_summary(llm_dispatcher, iter_dir, iteration, "design")
+        summary_path = _generate_gate_summary(llm_dispatcher, iter_dir, iteration, "design", campaign=campaign)
         decision, reason = gate.prompt(
             "Review the hypothesis bundle. Approve?",
             summary_path=str(summary_path) if summary_path else None,
@@ -445,7 +503,7 @@ def _max_turns_for(phase_key: str) -> int:
         print(f"\n{'='*60}")
         print(f"  HUMAN FINDINGS GATE")
         print(f"{'='*60}")
-        summary_path = _generate_gate_summary(llm_dispatcher, iter_dir, iteration, "findings")
+        summary_path = _generate_gate_summary(llm_dispatcher, iter_dir, iteration, "findings", campaign=campaign)
         decision, reason = gate.prompt(
             "Review the findings. Approve?",
             summary_path=str(summary_path) if summary_path else None,
@@ -464,6 +522,16 @@ def _max_turns_for(phase_key: str) -> int:
     _merge_principles(work_dir, iter_dir)
     print(f"  -> Principles merged into {work_dir / 'principles.json'}")
 
+    # ─── CLAUDE.md REGENERATE (Python, no LLM) — issue #131 ───────────────
+    # Refresh per-campaign CLAUDE.md so the next iteration's session loads
+    # the updated principles + handoff via Claude Code's auto-context loading.
+    try:
+        from orchestrator.claude_md import regenerate_from_disk
+        regenerate_from_disk(work_dir, campaign, iteration=iteration)
+    except (OSError, RuntimeError) as exc:
+        # Best-effort: a CLAUDE.md write failure shouldn't abort the iteration.
+        logger.warning("Failed to regenerate CLAUDE.md: %s", exc)
+
     if final:
         engine.transition("DONE")
         print(f"\n{'='*60}")
@@ -493,7 +561,7 @@ def main() -> None:
                         help="Timeout in seconds for claude -p calls (default: 1800)")
     parser.add_argument("--max-cli-retries", type=int, default=10,
                         help="Max retries for claude -p failures (-1 = unbounded, default: 10)")
-    parser.add_argument("--agent", choices=["inline", "api"], default="api",
+    parser.add_argument("--agent", choices=["inline", "api", "sdk"], default="api",
                         help="Dispatch backend: 'inline' emits prompts to stdout for the "
                              "calling agent, 'api' uses the LLM API (default: api)")
     parser.add_argument("-v", "--verbose", action="store_true",
diff --git a/orchestrator/llm_dispatch.py b/orchestrator/llm_dispatch.py
index d4f4ece..3271fc4 100644
--- a/orchestrator/llm_dispatch.py
+++ b/orchestrator/llm_dispatch.py
@@ -53,9 +53,14 @@ def __init__(
         self._validate_campaign(campaign)
         self.campaign = campaign
         self.model = model
+        # PromptLoader prefers <template>_thin.md when CLAUDE.md exists
+        # at work_dir/CLAUDE.md (#131 Phase B): the thin variants carry
+        # only per-iteration context and reference CLAUDE.md for the
+        # methodology, dropping ~400 lines per call when warm.
         self.loader = PromptLoader(
             prompts_dir
-            or Path(__file__).parent.parent / "prompts" / "methodology"
+            or Path(__file__).parent.parent / "prompts" / "methodology",
+            claude_md_at=Path(work_dir) / "CLAUDE.md",
         )
         if completion_fn:
             self._completion = completion_fn
diff --git a/orchestrator/parallel_arms.py b/orchestrator/parallel_arms.py
new file mode 100644
index 0000000..aff5a29
--- /dev/null
+++ b/orchestrator/parallel_arms.py
@@ -0,0 +1,198 @@
+"""Parallel-arm execution orchestration (issue #123, Phase A).
+
+After DESIGN produces ``experiment_plan.yaml``, EXECUTE_ANALYZE today
+runs every (arm × seed × condition) tuple sequentially in one Sonnet
+session. That mega-session is what produced the 5/18 connection-drop
+incidents and is the proximate cause of the "race two executors" bug
+that #71/#111 partly fixed at the symptom level.
+
+The fix: partition the plan into independent units, fan them out to
+per-unit subagents (each in its own worktree via #133), wait for all,
+and run the existing deterministic merge into findings.json +
+principle_updates.json.
+
+Phase A scope:
+
+  * partition_plan(plan) — turn experiment_plan.yaml into a flat list
+    of ArmUnit descriptors.
+  * run_units(units, *, runner, max_parallel) — fan out via an injected
+    runner callable, collect ArmUnitResult records (one per unit).
+  * merge_unit_results(results, plan) — deterministic merge into a
+    findings-shaped dict (the schema validation step is reused from
+    the existing executor pipeline).
+
+Phase B (lands when #121 + #133 merge):
+
+  * SDKDispatcher integration: the runner spawns
+    ``Agent(isolation="worktree", subagent_type="claude")`` per unit.
+  * Real ``anyio.gather`` for actual parallelism with a CPU-bounded
+    semaphore.
+  * Wire-up into iteration.py so EXECUTE_ANALYZE picks parallel mode
+    when ``max_parallel_arms > 1``.
+"""
+from __future__ import annotations
+
+import os
+from dataclasses import dataclass, field
+from typing import Callable
+
+
+@dataclass(frozen=True)
+class ArmUnit:
+    """A single (arm, seed, condition) work item."""
+
+    arm_id: str
+    seed: str
+    condition_name: str
+    command: str
+
+    @property
+    def relative_results_dir(self) -> str:
+        """Where this unit's results land — never overlaps with another unit."""
+        return f"results/{self.arm_id}/{self.seed}"
+
+
+@dataclass
+class ArmUnitResult:
+    unit: ArmUnit
+    status: str  # "complete" | "failed"
+    duration_ms: int = 0
+    output_files: list[str] = field(default_factory=list)
+    error: str = ""
+
+
+def partition_plan(plan: dict) -> list[ArmUnit]:
+    """Turn an experiment_plan.yaml-shaped dict into a list of ArmUnits.
+
+    Each (arm × condition) becomes one unit. Seed defaults to ``"seed-1"``
+    when the condition doesn't carry an explicit seed list; multi-seed
+    conditions fan out to one unit per seed.
+    """
+    units: list[ArmUnit] = []
+    for arm in plan.get("arms", []) or []:
+        if not isinstance(arm, dict):
+            continue
+        arm_id = str(arm.get("arm_id") or arm.get("type") or "?")
+        for cond in arm.get("conditions", []) or []:
+            if not isinstance(cond, dict):
+                continue
+            command = str(cond.get("command") or cond.get("cmd") or "")
+            if not command:
+                continue
+            cond_name = str(cond.get("name") or cond.get("id") or "default")
+            seeds = cond.get("seeds") or [cond.get("seed") or "seed-1"]
+            if not isinstance(seeds, list):
+                seeds = [str(seeds)]
+            for s in seeds:
+                units.append(ArmUnit(
+                    arm_id=arm_id,
+                    seed=str(s),
+                    condition_name=cond_name,
+                    command=command,
+                ))
+    return units
+
+
+ArmRunner = Callable[[ArmUnit], ArmUnitResult]
+"""Callable that executes one ArmUnit and returns its result.
+
+The default real-world implementation spawns an SDK subagent with
+``isolation="worktree"`` and the planned command. Tests inject a
+deterministic fake.
+"""
+
+
+def run_units(
+    units: list[ArmUnit],
+    *,
+    runner: ArmRunner,
+    max_parallel: int | None = None,
+) -> list[ArmUnitResult]:
+    """Fan out units to the runner.
+
+    ``max_parallel`` is honored as an upper bound on simultaneous
+    in-flight runner calls. Phase A is synchronous over the runner;
+    the bound is enforced trivially. Phase B replaces this with
+    ``anyio.gather`` + a semaphore for real parallelism.
+
+    Returns results in the same order as ``units`` so callers can pair
+    them deterministically with their inputs (the merge step depends
+    on this — it would be nondeterministic otherwise).
+    """
+    if max_parallel is not None and max_parallel < 1:
+        raise ValueError("max_parallel must be >= 1")
+    results: list[ArmUnitResult] = []
+    for unit in units:
+        try:
+            result = runner(unit)
+        except Exception as exc:  # runner exceptions become failed units
+            result = ArmUnitResult(
+                unit=unit,
+                status="failed",
+                error=f"{type(exc).__name__}: {exc}",
+            )
+        results.append(result)
+    return results
+
+
+def default_max_parallel() -> int:
+    """Issue default: ``min(CPU, 4)``."""
+    cpus = os.cpu_count() or 1
+    return max(1, min(cpus, 4))
+
+
+def merge_unit_results(
+    results: list[ArmUnitResult],
+    *,
+    plan: dict | None = None,
+) -> dict:
+    """Deterministic merge of unit results into a findings-shaped dict.
+
+    Output keys (sorted):
+      - ``arms``: list of ``{arm_id, status, units}`` rows
+      - ``failed_unit_count``: int
+      - ``total_unit_count``: int
+
+    No timestamps, no random ordering. Calling twice on the same input
+    must produce byte-equal output.
+    """
+    by_arm: dict[str, list[ArmUnitResult]] = {}
+    for r in results:
+        by_arm.setdefault(r.unit.arm_id, []).append(r)
+
+    arms_out: list[dict] = []
+    for arm_id in sorted(by_arm):
+        arm_results = by_arm[arm_id]
+        # Arm status: complete only when every unit completed; otherwise
+        # failed. Granular per-unit status is preserved in `units`.
+        any_failed = any(r.status == "failed" for r in arm_results)
+        arms_out.append({
+            "arm_id": arm_id,
+            "status": "failed" if any_failed else "complete",
+            "units": [
+                {
+                    "seed": r.unit.seed,
+                    "condition": r.unit.condition_name,
+                    "status": r.status,
+                    "duration_ms": r.duration_ms,
+                    "output_files": sorted(r.output_files),
+                    "error": r.error,
+                }
+                for r in sorted(
+                    arm_results,
+                    key=lambda x: (x.unit.seed, x.unit.condition_name),
+                )
+            ],
+        })
+
+    failed_count = sum(1 for r in results if r.status == "failed")
+    return {
+        "arms": arms_out,
+        "failed_unit_count": failed_count,
+        "total_unit_count": len(results),
+    }
+
+
+def failed_units(results: list[ArmUnitResult]) -> list[ArmUnit]:
+    """Helper for the partial-retry path: which units need re-running?"""
+    return [r.unit for r in results if r.status == "failed"]
diff --git a/orchestrator/prompt_loader.py b/orchestrator/prompt_loader.py
index 7c23806..e774f04 100644
--- a/orchestrator/prompt_loader.py
+++ b/orchestrator/prompt_loader.py
@@ -2,6 +2,12 @@
 
 Loads markdown prompt templates from disk and renders them by replacing
 ``{{placeholder}}`` markers with context values.
+
+When a campaign-level CLAUDE.md is in scope (issue #131), the loader
+prefers ``<template>_thin.md`` over the full ``<template>.md`` for any
+template that ships a thin variant. The thin variant carries only the
+per-iteration context and refers the agent to CLAUDE.md for the
+methodology — that's the token-shrink win.
 """
 import logging
 import re
@@ -15,8 +21,22 @@
 class PromptLoader:
     """Load and render prompt templates with ``{{variable}}`` substitution."""
 
-    def __init__(self, prompts_dir: Path) -> None:
+    def __init__(
+        self,
+        prompts_dir: Path,
+        *,
+        claude_md_at: Path | None = None,
+    ) -> None:
         self.prompts_dir = Path(prompts_dir)
+        self._claude_md_at = Path(claude_md_at) if claude_md_at else None
+
+    def _resolve_template_path(self, template_name: str) -> Path:
+        """Pick thin or full variant based on whether CLAUDE.md is in scope."""
+        if self._claude_md_at is not None and self._claude_md_at.exists():
+            thin = self.prompts_dir / f"{template_name}_thin.md"
+            if thin.is_file():
+                return thin
+        return self.prompts_dir / f"{template_name}.md"
 
     def load(self, template_name: str, context: dict[str, str]) -> str:
         """Load *template_name*.md and replace ``{{key}}`` with *context[key]*.
@@ -28,7 +48,7 @@ def load(self, template_name: str, context: dict[str, str]) -> str:
             ValueError: Template contains unreplaced ``{{placeholders}}``
                 after rendering (i.e. required context keys were not provided).
         """
-        path = self.prompts_dir / f"{template_name}.md"
+        path = self._resolve_template_path(template_name)
         if not path.is_file():
             raise FileNotFoundError(
                 f"Prompt template not found: {path}"
diff --git a/orchestrator/routines.py b/orchestrator/routines.py
new file mode 100644
index 0000000..eb11fad
--- /dev/null
+++ b/orchestrator/routines.py
@@ -0,0 +1,168 @@
+"""Claude Code Routines integration for Nous (issue #134, Phase A).
+
+Builds a JSON-serializable payload describing a Routine for a Nous
+campaign — the bundle of (campaign config, schedule, MCP refs,
+credentials placeholder) that gets posted to the Routines API to
+register a recurring run.
+
+Phase A ships the **payload builder + dry-run CLI** so users see exactly
+what would be registered without needing the Routines API to be live.
+Phase B (when the Routines API stabilizes) wires the actual POST to
+that API and a return of the Routine ID.
+
+Cron schedule: standard 5-field cron in UTC. The user's local timezone
+is up to the Routines runtime; the orchestrator passes the string as-is.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+
+def build_routine_payload(
+    campaign: dict,
+    *,
+    campaign_path: Path | None = None,
+    schedule: str | None = None,
+    pr_label: str | None = None,
+    mcp_refs: list[str] | None = None,
+    extra: dict | None = None,
+) -> dict[str, Any]:
+    """Construct the Routines registration payload for a Nous campaign.
+
+    Exactly one of ``schedule`` or ``pr_label`` should be set (Routines
+    fire on either a cron string or a GitHub-event label).
+
+    Args:
+      campaign: parsed ``campaign.yaml`` dict.
+      campaign_path: filesystem path to the YAML file (so the Routine can
+        re-read it on each fire). Optional; when omitted, the payload
+        embeds the campaign config inline.
+      schedule: cron string (UTC). E.g. ``"0 2 * * *"`` for nightly at 2am.
+      pr_label: GitHub PR label that triggers this Routine. E.g.
+        ``"nous-experiment"``.
+      mcp_refs: MCP resource URIs the Routine should subscribe to (e.g.
+        ``["nous://campaigns"]``). The Routine writes findings via these
+        references after each run.
+      extra: caller-provided extra keys merged into the top level.
+    """
+    if not schedule and not pr_label:
+        raise ValueError("schedule or pr_label is required")
+    if schedule and pr_label:
+        raise ValueError("specify schedule OR pr_label, not both")
+
+    target = campaign.get("target_system", {})
+    name = (
+        campaign.get("run_id")
+        or campaign.get("name")
+        or target.get("name", "nous-routine")
+    )
+
+    payload: dict[str, Any] = {
+        "name": name,
+        "description": (
+            campaign.get("research_question")
+            or "Nous campaign — auto-registered Routine."
+        ),
+        "trigger": (
+            {"type": "cron", "expression": schedule}
+            if schedule
+            else {"type": "pr_label", "label": pr_label}
+        ),
+        "command": _routine_command(campaign_path),
+        "credentials": {
+            "ANTHROPIC_API_KEY": "${secret:anthropic_api_key}",
+        },
+        "mcp": {
+            "resources": list(mcp_refs or []),
+        },
+    }
+    if campaign_path is not None:
+        payload["campaign_path"] = str(Path(campaign_path).resolve())
+    else:
+        payload["campaign_inline"] = campaign
+
+    if extra:
+        for k, v in extra.items():
+            payload[k] = v
+
+    return payload
+
+
+def _routine_command(campaign_path: Path | None) -> list[str]:
+    """The shell command the Routine fires on each trigger."""
+    if campaign_path is not None:
+        return [
+            "nous", "run",
+            str(Path(campaign_path).resolve()),
+            "--auto-approve",
+            "--agent", "sdk",
+        ]
+    return [
+        "nous", "run", "<inlined-campaign.yaml>",
+        "--auto-approve",
+        "--agent", "sdk",
+    ]
+
+
+# ─── Phase B: actual API submission ────────────────────────────────────────
+
+
+import json as _json
+import os as _os
+import urllib.request as _urlreq
+from typing import Callable as _Callable
+
+
+_DEFAULT_ROUTINES_API_BASE = "https://api.anthropic.com/v1/routines"
+
+
+def submit_routine(
+    payload: dict,
+    *,
+    api_base: str | None = None,
+    api_key: str | None = None,
+    poster: _Callable[[str, bytes, dict, float], dict] | None = None,
+    timeout: float = 30.0,
+) -> dict:
+    """Register the payload with the Routines API and return the response.
+
+    Args:
+      payload: result of build_routine_payload.
+      api_base: override the default Routines API endpoint.
+      api_key: override ANTHROPIC_API_KEY env var. Required for real calls.
+      poster: dependency-injection seam for tests. Signature:
+        ``(url, body_bytes, headers, timeout) -> response_dict``. When set,
+        used instead of urllib.request.urlopen so tests don't touch the
+        network. See tests/CLAUDE.md.
+      timeout: per-request timeout in seconds.
+
+    Returns:
+      Response dict — typically contains a ``routine_id`` field that
+      callers store for later management.
+    """
+    url = api_base or _os.environ.get("ROUTINES_API_BASE", _DEFAULT_ROUTINES_API_BASE)
+    key = api_key or _os.environ.get("ANTHROPIC_API_KEY")
+    if poster is None and not key:
+        raise RuntimeError(
+            "submit_routine requires ANTHROPIC_API_KEY (or pass api_key=). "
+            "Tests must inject a poster — see tests/CLAUDE.md."
+        )
+    headers: dict[str, str] = {
+        "Content-Type": "application/json",
+        "X-Nous-Source": "orchestrator.routines",
+    }
+    if key:
+        headers["Authorization"] = f"Bearer {key}"
+    body = _json.dumps(payload).encode("utf-8")
+
+    if poster is not None:
+        return poster(url, body, headers, timeout)
+
+    req = _urlreq.Request(url, data=body, headers=headers, method="POST")
+    with _urlreq.urlopen(req, timeout=timeout) as resp:
+        text = resp.read().decode("utf-8")
+    try:
+        return _json.loads(text)
+    except _json.JSONDecodeError:
+        return {"raw_response": text, "status": resp.status}
diff --git a/orchestrator/sdk_dispatch.py b/orchestrator/sdk_dispatch.py
new file mode 100644
index 0000000..a8cf3fc
--- /dev/null
+++ b/orchestrator/sdk_dispatch.py
@@ -0,0 +1,450 @@
+"""SDK-based agent dispatch for the Nous orchestrator.
+
+Calls the Claude Agent SDK in place of `claude -p` subprocess. Same
+artifact and metrics contract as :class:`orchestrator.cli_dispatch.CLIDispatcher`;
+this class swaps the transport without changing the orchestrator's contract
+with the rest of Nous.
+
+Why SDK over `claude -p`:
+  * Native streaming → fast progress visibility (#127).
+  * Programmatic prompt caching → token savings (#122).
+  * Native subagent spawning → parallel arms without manual fork/join (#123).
+  * Message-level retry instead of subprocess restart.
+
+Design decisions worth knowing:
+
+  * The actual SDK call is delegated to a ``sdk_runner`` callable. The
+    default lazily resolves to a real ``claude_agent_sdk`` runner; tests
+    inject a deterministic fake. The runner returns an ``SDKResult``
+    (text + usage + cost + error flag); the dispatcher's job is to turn
+    that into on-disk artifacts and a metrics row, with retry on transient
+    failure. This keeps tests behavioral — they assert what's on disk,
+    not which method we called.
+  * Inherits from CLIDispatcher to reuse the parse/validate/retry-with-feedback
+    machinery used for fenced-output phases (gate summaries, etc.).
+"""
+from __future__ import annotations
+
+import logging
+import time
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Callable, Protocol, runtime_checkable
+
+from orchestrator.cli_dispatch import CLIDispatcher, _backoff_for
+from orchestrator.metrics import log_metrics, log_retry_event
+
+logger = logging.getLogger(__name__)
+
+
+class SDKTransientError(RuntimeError):
+    """Runner raises this for retryable transport-level failures."""
+
+
+def _load_methodology_preamble(methodology_dir: Path) -> str | None:
+    """Load the static methodology text as a single cached system block.
+
+    Concatenates the design + execute_analyze methodology files, stripping
+    Jinja-style {{placeholders}} (the dynamic portions go in the user
+    message instead, where they bust the cache appropriately). The result
+    is what ``ClaudeAgentOptions.system_prompt`` ships to the API with
+    cache_control: ephemeral so it's paid for once per 5-minute window
+    instead of once per turn — the win #122 is named for.
+    """
+    methodology_dir = Path(methodology_dir)
+    if not methodology_dir.is_dir():
+        return None
+    blocks: list[str] = []
+    import re as _re
+    for name in ("design.md", "execute_analyze.md"):
+        path = methodology_dir / name
+        if not path.exists():
+            continue
+        text = path.read_text()
+        # Strip {{placeholder}} markers — the dynamic content lives in
+        # the user message and changes each call.
+        text = _re.sub(r"\{\{[^}]+\}\}", "", text)
+        blocks.append(f"# Methodology: {path.stem}\n\n{text}")
+    if not blocks:
+        return None
+    return "\n\n---\n\n".join(blocks)
+
+
+def _tee_event(event_log_path: Path | None, message: object, cls_name: str) -> None:
+    """Append one SDK event to executor_log.jsonl (#127 Phase B).
+
+    Best-effort: log-write failures don't break the agent. The TUI's
+    snapshot reader (orchestrator.status) already consumes this file.
+    """
+    if event_log_path is None:
+        return
+    import json as _json
+    record: dict = {
+        "type": cls_name,
+        "ts": time.time(),
+    }
+    # Surface fields the TUI cares about — tool name, content kind. We
+    # touch only attributes that exist via getattr so the format here
+    # is robust to SDK message-class evolution.
+    for field_name in ("tool_name", "tool_use_id", "content"):
+        val = getattr(message, field_name, None)
+        if val is not None and not callable(val):
+            try:
+                _json.dumps(val)  # serializability probe
+                record[field_name] = val
+            except (TypeError, ValueError):
+                record[field_name] = repr(val)[:200]
+    try:
+        with open(event_log_path, "a") as f:
+            f.write(_json.dumps(record) + "\n")
+    except OSError:
+        pass
+
+
+@dataclass
+class SDKResult:
+    """One SDK call's outcome.
+
+    The dispatcher reads only these fields. Producers (real or fake) must
+    populate ``text`` (assistant final text); usage/cost fields default
+    to zero so trivial fakes need not set them.
+    """
+
+    text: str
+    input_tokens: int = 0
+    output_tokens: int = 0
+    cache_read_input_tokens: int = 0
+    cache_creation_input_tokens: int = 0
+    cost_usd: float = 0.0
+    duration_ms: int = 0
+    num_turns: int = 1
+    is_error: bool = False
+    error_message: str = ""
+    extra: dict = field(default_factory=dict)
+
+
+@runtime_checkable
+class SDKRunner(Protocol):
+    """A callable that performs one SDK turn and returns an ``SDKResult``.
+
+    Raise :class:`SDKTransientError` for retryable failures (network blips,
+    rate limits, mid-stream disconnect). Return ``SDKResult(is_error=True,
+    error_message=...)`` for API-reported errors that should also be retried.
+    Other exceptions bubble up as fatal.
+    """
+
+    def __call__(
+        self,
+        *,
+        prompt: str,
+        model: str,
+        cwd: Path | None,
+        max_turns: int,
+        system_prompt: str | None = None,
+        settings_path: Path | None = None,
+        event_log_path: Path | None = None,
+    ) -> SDKResult:
+        ...
+
+
+def _default_sdk_runner_factory() -> SDKRunner:
+    """Return a runner that calls the real ``claude_agent_sdk``.
+
+    Resolved lazily so that tests (and environments without the SDK
+    installed) don't fail at import time.
+    """
+
+    def _runner(
+        *,
+        prompt: str,
+        model: str,
+        cwd: Path | None,
+        max_turns: int,
+        system_prompt: str | None = None,
+        settings_path: Path | None = None,
+        event_log_path: Path | None = None,
+    ) -> SDKResult:
+        try:
+            import anyio
+            from claude_agent_sdk import (  # type: ignore[import-not-found]
+                ClaudeAgentOptions,
+                query,
+            )
+        except ImportError as exc:
+            raise RuntimeError(
+                "claude-agent-sdk is not installed. "
+                "Install with `pip install claude-agent-sdk` or use --agent api."
+            ) from exc
+
+        async def _run() -> SDKResult:
+            options = ClaudeAgentOptions(
+                model=model,
+                cwd=str(cwd) if cwd else None,
+                max_turns=max_turns,
+                system_prompt=system_prompt,
+                settings=str(settings_path) if settings_path else None,
+            )
+            text_chunks: list[str] = []
+            usage: dict = {}
+            cost_usd = 0.0
+            duration_ms = 0
+            num_turns = 0
+            t0 = time.time()
+            if event_log_path is not None:
+                Path(event_log_path).parent.mkdir(parents=True, exist_ok=True)
+            async for message in query(prompt=prompt, options=options):
+                cls = type(message).__name__
+                # #127 Phase B: tee every SDK message as a JSONL event so
+                # `nous status --watch` can render live progress.
+                _tee_event(event_log_path, message, cls)
+                if cls == "AssistantMessage":
+                    for block in getattr(message, "content", []):
+                        if hasattr(block, "text"):
+                            text_chunks.append(block.text)
+                elif cls == "ResultMessage":
+                    usage = getattr(message, "usage", {}) or {}
+                    cost_usd = float(getattr(message, "total_cost_usd", 0.0) or 0.0)
+                    duration_ms = int(getattr(message, "duration_ms", 0) or 0)
+                    num_turns = int(getattr(message, "num_turns", 0) or 0)
+                    if getattr(message, "is_error", False):
+                        return SDKResult(
+                            text="".join(text_chunks),
+                            error_message=str(getattr(message, "result", "unknown")),
+                            is_error=True,
+                            input_tokens=int(usage.get("input_tokens", 0) or 0),
+                            output_tokens=int(usage.get("output_tokens", 0) or 0),
+                            cache_read_input_tokens=int(
+                                usage.get("cache_read_input_tokens", 0) or 0
+                            ),
+                            cache_creation_input_tokens=int(
+                                usage.get("cache_creation_input_tokens", 0) or 0
+                            ),
+                            cost_usd=cost_usd,
+                            duration_ms=duration_ms,
+                            num_turns=num_turns,
+                        )
+            return SDKResult(
+                text="".join(text_chunks),
+                input_tokens=int(usage.get("input_tokens", 0) or 0),
+                output_tokens=int(usage.get("output_tokens", 0) or 0),
+                cache_read_input_tokens=int(
+                    usage.get("cache_read_input_tokens", 0) or 0
+                ),
+                cache_creation_input_tokens=int(
+                    usage.get("cache_creation_input_tokens", 0) or 0
+                ),
+                cost_usd=cost_usd,
+                duration_ms=duration_ms or int((time.time() - t0) * 1000),
+                num_turns=num_turns or 1,
+            )
+
+        try:
+            return anyio.run(_run)
+        except Exception as exc:
+            cls_name = type(exc).__name__
+            transient_signals = (
+                "ConnectionError",
+                "ReadTimeout",
+                "WriteTimeout",
+                "RemoteProtocolError",
+                "ServerDisconnectedError",
+                "TimeoutError",
+            )
+            if any(sig in cls_name for sig in transient_signals):
+                raise SDKTransientError(f"{cls_name}: {exc}") from exc
+            raise
+
+    return _runner
+
+
+class SDKDispatcher(CLIDispatcher):
+    """Dispatch agent roles via the Claude Agent SDK.
+
+    Inherits dispatch() / parse / retry-with-feedback / route logic from
+    :class:`CLIDispatcher`. Overrides ``_call_claude`` to use the SDK
+    runner instead of a subprocess, and ``preflight_check`` to verify
+    the SDK package is importable.
+    """
+
+    def __init__(
+        self,
+        work_dir: Path,
+        campaign: dict,
+        model: str = "claude-sonnet-4-6",
+        prompts_dir: Path | None = None,
+        timeout: int = 1800,
+        max_turns: int = 25,
+        max_retries: int | None = 10,
+        sdk_runner: Callable | None = None,
+        system_prompt: str | None = None,
+        settings_path: Path | None = None,
+    ) -> None:
+        super().__init__(
+            work_dir=work_dir,
+            campaign=campaign,
+            model=model,
+            prompts_dir=prompts_dir,
+            timeout=timeout,
+            max_turns=max_turns,
+            max_retries=max_retries,
+        )
+        self._sdk_runner = sdk_runner or _default_sdk_runner_factory()
+        self._system_prompt = system_prompt or _load_methodology_preamble(
+            prompts_dir or Path(__file__).parent.parent / "prompts" / "methodology",
+        )
+        self._settings_path = settings_path
+        # #127 Phase B: event log path is recomputed per-dispatch (it depends
+        # on the iteration), so we don't store it on the dispatcher.
+        self._event_log_path: Path | None = None
+
+    # ------------------------------------------------------------------
+    # Per-iteration event log (#127 Phase B)
+    # ------------------------------------------------------------------
+
+    def dispatch(  # type: ignore[override]
+        self, role: str, phase: str, *, output_path, iteration: int,
+        perspective=None, h_main_result="CONFIRMED",
+    ) -> None:
+        # Compute the executor_log.jsonl path for this iteration so the
+        # runner tees SDK events to a place the status reader can find.
+        self._event_log_path = (
+            self.work_dir / "runs" / f"iter-{iteration}" / "executor_log.jsonl"
+        )
+        try:
+            super().dispatch(
+                role, phase,
+                output_path=output_path, iteration=iteration,
+                perspective=perspective, h_main_result=h_main_result,
+            )
+        finally:
+            self._event_log_path = None
+
+    # ------------------------------------------------------------------
+    # Pre-flight
+    # ------------------------------------------------------------------
+
+    def preflight_check(self) -> None:
+        """Verify the SDK is reachable before starting a campaign."""
+        try:
+            import claude_agent_sdk  # type: ignore[import-not-found] # noqa: F401
+        except ImportError as exc:
+            raise RuntimeError(
+                "Pre-flight check failed: claude-agent-sdk is not installed. "
+                "Install with `pip install claude-agent-sdk`, or pass --agent api "
+                "to use the OpenAI-compatible path instead."
+            ) from exc
+        logger.info("SDK pre-flight check passed (model=%s)", self.model)
+
+    # ------------------------------------------------------------------
+    # Core call with retry
+    # ------------------------------------------------------------------
+
+    def _call_claude(self, prompt: str, max_turns: int | None = None) -> str:
+        """Run one SDK turn with retry on transient failure.
+
+        Mirrors CLIDispatcher._call_claude semantics: retry on transient
+        errors (with exponential backoff), log each failure to retry_log.jsonl,
+        log each completed call to llm_metrics.jsonl, give up after
+        max_retries.
+        """
+        cwd = self._cwd
+        if cwd and not cwd.exists():
+            raise RuntimeError(
+                f"SDKDispatcher cwd does not exist: {cwd}. "
+                f"Check that 'repo_path' in campaign.yaml is correct."
+            )
+        turns = max_turns or self.max_turns
+        logger.info(
+            "SDK turn (model=%s, cwd=%s, max_turns=%d)", self.model, cwd, turns,
+        )
+
+        failure_count = 0
+        original_prompt = prompt
+        while True:
+            try:
+                result = self._sdk_runner(
+                    prompt=prompt,
+                    model=self.model,
+                    cwd=cwd,
+                    max_turns=turns,
+                    system_prompt=self._system_prompt,
+                    settings_path=self._settings_path,
+                    event_log_path=self._event_log_path,
+                )
+            except SDKTransientError as exc:
+                failure_count += 1
+                self._log_retry("transient", failure_count, exc)
+                if self._exhausted(failure_count):
+                    raise RuntimeError(
+                        f"SDK still failing after {failure_count} attempt(s): {exc}"
+                    ) from exc
+                time.sleep(_backoff_for(failure_count))
+                prompt = self._maybe_resume_hint(prompt, original_prompt, "transient")
+                continue
+
+            self._log_metrics_row(result)
+
+            if result.is_error:
+                failure_count += 1
+                self._log_retry(
+                    "api_error", failure_count, RuntimeError(result.error_message),
+                )
+                if self._exhausted(failure_count):
+                    raise RuntimeError(
+                        f"SDK returned error after {failure_count} attempt(s): "
+                        f"{result.error_message}"
+                    )
+                time.sleep(_backoff_for(failure_count))
+                prompt = self._maybe_resume_hint(prompt, original_prompt, "api_error")
+                continue
+
+            return result.text
+
+    # ------------------------------------------------------------------
+    # Internals
+    # ------------------------------------------------------------------
+
+    def _exhausted(self, failure_count: int) -> bool:
+        return self.max_retries is not None and failure_count > self.max_retries
+
+    def _log_retry(self, kind: str, attempt: int, exc: BaseException) -> None:
+        log_retry_event(self._metrics_path, {
+            "role": self._current_role,
+            "phase": self._current_phase,
+            "failure_type": kind,
+            "attempt": attempt,
+            "error": str(exc)[:500],
+        })
+
+    def _log_metrics_row(self, result: SDKResult) -> None:
+        log_metrics(self._metrics_path, {
+            "dispatcher": "sdk",
+            "role": self._current_role,
+            "phase": self._current_phase,
+            "model": self.model,
+            "input_tokens": result.input_tokens,
+            "output_tokens": result.output_tokens,
+            "cache_creation_input_tokens": result.cache_creation_input_tokens,
+            "cache_read_input_tokens": result.cache_read_input_tokens,
+            "cost_usd": result.cost_usd,
+            "duration_ms": result.duration_ms,
+            "num_turns": result.num_turns,
+        })
+
+    @staticmethod
+    def _maybe_resume_hint(prompt: str, original_prompt: str, kind: str) -> str:
+        """If the prompt has not yet been annotated with a resume hint, add one.
+
+        Mirrors CLIDispatcher: tells the agent that the prior attempt was
+        interrupted so it picks up from existing artifacts rather than
+        starting fresh.
+        """
+        marker = "\nNote: Your previous attempt was interrupted"
+        if marker in prompt:
+            return prompt
+        return (
+            f"{original_prompt}\n\n---\n"
+            f"Note: Your previous attempt was interrupted ({kind}). "
+            f"Check the working directory for artifacts from your prior "
+            f"attempt and continue from where you left off."
+        )
diff --git a/orchestrator/settings_template.py b/orchestrator/settings_template.py
new file mode 100644
index 0000000..8cd1278
--- /dev/null
+++ b/orchestrator/settings_template.py
@@ -0,0 +1,163 @@
+"""Per-campaign Claude Code permission policy generator (issue #135).
+
+Replaces ``--dangerously-skip-permissions`` with a fine-grained
+``.claude/settings.json`` written into the campaign work-dir at init.
+The settings file declares:
+
+  * ``allowOnly`` paths — typically the campaign work-dir and the target
+    repo's worktree root. Anything else is denied.
+  * an allowlist of binaries (Bash) drawn from the experiment plan
+    when one is present at init, with conservative defaults otherwise.
+  * a deny rule for outbound network access except localhost / configured
+    proxies.
+  * (optional) a Stop hook pointing at ``bin/nous-execute-stop`` (#129).
+
+The file's *contents* are the contract. The dispatcher passes
+``--settings <path>`` and drops ``--dangerously-skip-permissions`` —
+that's how the contents take effect.
+
+This module is deliberately a pure renderer: ``render_campaign_settings``
+takes inputs and returns a dict; ``write_campaign_settings`` writes it
+to disk via :func:`atomic_write`. No side effects beyond the disk write.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any
+
+from orchestrator.util import atomic_write
+
+
+# Bash commands that are safe across virtually every Nous campaign.
+# Campaign-specific binaries (./blis, simulators, custom tools) come from
+# the experiment plan when present.
+_DEFAULT_BIN_ALLOWLIST: tuple[str, ...] = (
+    "ls",
+    "cat",
+    "head",
+    "tail",
+    "wc",
+    "grep",
+    "find",
+    "rg",
+    "git",
+    "python",
+    "python3",
+    "pip",
+    "pytest",
+    "go",
+    "cargo",
+    "node",
+    "npm",
+    "make",
+)
+
+
+def _binaries_from_plan(plan: dict | None) -> list[str]:
+    """Pull binaries out of an ``experiment_plan.yaml``-shaped dict.
+
+    Returns a sorted list of unique binary basenames referenced in the
+    plan's arms/conditions. Empty when the plan is None or shapeless.
+    """
+    if not isinstance(plan, dict):
+        return []
+    seen: set[str] = set()
+    for arm in plan.get("arms", []) or []:
+        for cond in arm.get("conditions", []) or []:
+            cmd = cond.get("command") or cond.get("cmd")
+            if not isinstance(cmd, str) or not cmd.strip():
+                continue
+            head = cmd.strip().split()[0]
+            # Strip any "./" prefix and path separators to match against
+            # the binary's basename in the allowlist.
+            seen.add(head.split("/")[-1])
+    return sorted(seen)
+
+
+def render_campaign_settings(
+    *,
+    work_dir: Path,
+    repo_path: Path | None = None,
+    experiment_plan: dict | None = None,
+    extra_bin_allowlist: list[str] | None = None,
+    stop_hook_path: Path | None = None,
+    pre_tool_use_hook_path: Path | None = None,
+) -> dict[str, Any]:
+    """Build the settings.json contents for one campaign.
+
+    Args:
+      work_dir: Campaign work-dir (e.g. ``<repo>/.nous/<run-id>``). Always allowed.
+      repo_path: Target repo root, when set. Allowed read+write.
+      experiment_plan: Parsed ``experiment_plan.yaml`` contents, if available
+        at init. Binaries referenced in arm conditions extend the allowlist.
+      extra_bin_allowlist: Caller-provided binaries to allow (e.g. simulator).
+      stop_hook_path: Absolute path to the Stop hook (e.g. ``bin/nous-execute-stop``
+        from #129). When set, registered under ``hooks.Stop``.
+      pre_tool_use_hook_path: Absolute path to the PreToolUse hook (#128).
+        When set, registered under ``hooks.PreToolUse``.
+
+    Returns:
+      A dict ready to be JSON-serialized as ``.claude/settings.json``.
+    """
+    allow_only = [str(Path(work_dir).resolve())]
+    if repo_path is not None:
+        allow_only.append(str(Path(repo_path).resolve()))
+
+    bin_set: set[str] = set(_DEFAULT_BIN_ALLOWLIST)
+    bin_set.update(_binaries_from_plan(experiment_plan))
+    if extra_bin_allowlist:
+        bin_set.update(extra_bin_allowlist)
+    bin_allowlist = sorted(bin_set)
+
+    settings: dict[str, Any] = {
+        "permissions": {
+            "allowOnly": allow_only,
+            "allow": [f"Bash({b}:*)" for b in bin_allowlist],
+            "deny": [
+                "Bash(curl https://*)",
+                "Bash(wget https://*)",
+                "Bash(rm -rf /*)",
+            ],
+        },
+    }
+
+    hooks: dict[str, list[dict[str, Any]]] = {}
+    if stop_hook_path is not None:
+        hooks["Stop"] = [{
+            "hooks": [{
+                "type": "command",
+                "command": str(Path(stop_hook_path).resolve()),
+            }],
+        }]
+    if pre_tool_use_hook_path is not None:
+        hooks["PreToolUse"] = [{
+            "matcher": "Bash",
+            "hooks": [{
+                "type": "command",
+                "command": str(Path(pre_tool_use_hook_path).resolve()),
+            }],
+        }]
+    if hooks:
+        settings["hooks"] = hooks
+
+    return settings
+
+
+def write_campaign_settings(
+    settings_path: Path,
+    contents: dict[str, Any],
+) -> Path:
+    """Atomically write the settings dict to ``settings_path``.
+
+    Returns the absolute path to the written file.
+    """
+    settings_path = Path(settings_path)
+    settings_path.parent.mkdir(parents=True, exist_ok=True)
+    atomic_write(settings_path, json.dumps(contents, indent=2) + "\n")
+    return settings_path.resolve()
+
+
+def settings_path_for(work_dir: Path) -> Path:
+    """Return the canonical location of a campaign's settings file."""
+    return Path(work_dir) / ".claude" / "settings.json"
diff --git a/orchestrator/status.py b/orchestrator/status.py
new file mode 100644
index 0000000..333f95a
--- /dev/null
+++ b/orchestrator/status.py
@@ -0,0 +1,182 @@
+"""Live status surface for Nous campaigns (issue #127).
+
+Phase A: a deterministic, no-LLM snapshot reader that the CLI uses for
+``nous status`` (one-shot), ``nous status --line`` (single-line for shell
+prompts), and ``nous status --watch`` (loop + redraw).
+
+The snapshot reads three files:
+  * ``state.json``        — current phase + iteration
+  * ``ledger.json``       — completed iterations count
+  * ``runs/iter-N/executor_log.jsonl`` — most recent SDK tool-call event
+    (when present; empty before #127's SDK-tee path is wired)
+
+Stuck detection: heartbeat absence > 5 minutes since the last logged
+tool-call event surfaces a ``stuck`` flag that the watch panel renders
+prominently.
+
+Phase B (deferred): SDK event tee — sdk_dispatch.py teeing each
+``--output-format stream-json`` row to ``executor_log.jsonl`` as the
+session runs. Once that lands, ``nous status --watch`` lights up
+without code changes here.
+"""
+from __future__ import annotations
+
+import json
+import time
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+
+_STUCK_THRESHOLD_SECONDS = 5 * 60
+
+
+@dataclass
+class StatusSnapshot:
+    run_id: str = "?"
+    phase: str = "?"
+    iteration: int = 0
+    completed_iterations: int = 0
+    active_principles: int = 0
+    last_event: dict[str, Any] | None = None
+    elapsed_since_last_event: float | None = None  # seconds; None if no event
+    stuck: bool = False
+    raw: dict[str, Any] = field(default_factory=dict)
+
+    def as_dict(self) -> dict[str, Any]:
+        return {
+            "run_id": self.run_id,
+            "phase": self.phase,
+            "iteration": self.iteration,
+            "completed_iterations": self.completed_iterations,
+            "active_principles": self.active_principles,
+            "last_event": self.last_event,
+            "elapsed_since_last_event": self.elapsed_since_last_event,
+            "stuck": self.stuck,
+        }
+
+
+def _read_json(path: Path) -> Any:
+    try:
+        return json.loads(path.read_text())
+    except (OSError, json.JSONDecodeError):
+        return None
+
+
+def _last_log_event(log_path: Path) -> tuple[dict | None, float | None]:
+    """Return (last_event, mtime_seconds_since_epoch) from a JSONL log."""
+    if not log_path.exists():
+        return None, None
+    last: dict | None = None
+    try:
+        for line in log_path.read_text().splitlines():
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                last = json.loads(line)
+            except json.JSONDecodeError:
+                continue
+        mtime = log_path.stat().st_mtime
+    except OSError:
+        return None, None
+    return last, mtime
+
+
+def read_status_snapshot(
+    work_dir: Path,
+    *,
+    now: float | None = None,
+    stuck_threshold_seconds: float = _STUCK_THRESHOLD_SECONDS,
+) -> StatusSnapshot:
+    """Build a snapshot from on-disk state + the latest executor log.
+
+    Args:
+      work_dir: campaign work-dir.
+      now: override of ``time.time()`` for deterministic tests.
+      stuck_threshold_seconds: how long without a logged event before the
+        snapshot's ``stuck`` flag flips.
+    """
+    work_dir = Path(work_dir)
+    snap = StatusSnapshot()
+
+    state = _read_json(work_dir / "state.json")
+    if isinstance(state, dict):
+        snap.run_id = str(state.get("run_id", "?"))
+        snap.phase = str(state.get("phase", "?"))
+        snap.iteration = int(state.get("iteration", 0) or 0)
+        snap.raw = state
+
+    ledger = _read_json(work_dir / "ledger.json")
+    if isinstance(ledger, dict):
+        rows = ledger.get("iterations", [])
+        if isinstance(rows, list):
+            snap.completed_iterations = sum(
+                1 for r in rows
+                if isinstance(r, dict)
+                and isinstance(r.get("iteration"), int)
+                and r["iteration"] >= 1
+            )
+
+    principles = _read_json(work_dir / "principles.json")
+    if isinstance(principles, dict):
+        plist = principles.get("principles", [])
+        if isinstance(plist, list):
+            snap.active_principles = sum(
+                1 for p in plist
+                if isinstance(p, dict) and p.get("status", "active") == "active"
+            )
+
+    log_path = work_dir / "runs" / f"iter-{snap.iteration}" / "executor_log.jsonl"
+    last_event, mtime = _last_log_event(log_path)
+    snap.last_event = last_event
+    if mtime is not None:
+        current = now if now is not None else time.time()
+        snap.elapsed_since_last_event = max(0.0, current - mtime)
+        snap.stuck = snap.elapsed_since_last_event >= stuck_threshold_seconds
+
+    return snap
+
+
+def format_one_liner(snap: StatusSnapshot) -> str:
+    """Single-line summary suitable for a shell prompt or CI log."""
+    parts = [
+        snap.run_id,
+        snap.phase,
+        f"iter {snap.iteration}",
+        f"{snap.completed_iterations} done",
+        f"{snap.active_principles} principles",
+    ]
+    if snap.last_event:
+        tool = snap.last_event.get("tool_name") or snap.last_event.get("tool") or ""
+        if tool:
+            parts.append(f"last={tool}")
+    if snap.stuck:
+        parts.append("STUCK")
+    return " · ".join(parts)
+
+
+def format_watch_panel(snap: StatusSnapshot) -> str:
+    """Multi-line panel suitable for ``nous status --watch``.
+
+    Plain text — no rich/textual dependency in Phase A; the redraw cycle
+    just clears and reprints. Phase B can swap in a fancier renderer.
+    """
+    lines = [
+        f"Campaign:   {snap.run_id}",
+        f"Phase:      {snap.phase}",
+        f"Iteration:  {snap.iteration}",
+        f"Completed:  {snap.completed_iterations} iteration(s)",
+        f"Principles: {snap.active_principles} active",
+    ]
+    if snap.last_event:
+        tool = snap.last_event.get("tool_name") or snap.last_event.get("tool") or "?"
+        lines.append(f"Last tool:  {tool}")
+        if snap.elapsed_since_last_event is not None:
+            lines.append(f"Last seen:  {snap.elapsed_since_last_event:.0f}s ago")
+    else:
+        lines.append("Last tool:  (no events yet)")
+    if snap.stuck:
+        lines.append("")
+        lines.append("⚠  STUCK?  no executor activity in the last 5 minutes.")
+    return "\n".join(lines)
diff --git a/orchestrator/worktree.py b/orchestrator/worktree.py
index 15bed13..c86c447 100644
--- a/orchestrator/worktree.py
+++ b/orchestrator/worktree.py
@@ -1,12 +1,31 @@
-"""Git worktree management for experiment isolation."""
+"""Git worktree management for experiment isolation.
+
+Phase A of #133: ship orphan-worktree garbage collection alongside the
+existing per-iteration lifecycle. The harness-managed
+``Agent(isolation="worktree")`` switch (Phase B) lands with the
+parallel-arm subagents in #123 — at that point most of this file goes
+away. Until then, GC at run start cleans up the ghost-worktree pattern
+observed on 5/18 where ``--max-cli-retries 10`` spawned a second worktree
+while the first was still alive.
+"""
+from __future__ import annotations
+
 import logging
+import os
+import shutil
 import subprocess
+import time
 import uuid
 from pathlib import Path
+from typing import Callable
 
 logger = logging.getLogger(__name__)
 
 
+_EXPERIMENTS_DIRNAME = ".nous-experiments"
+_DEFAULT_ORPHAN_AGE_SECONDS = 60 * 60  # 1 hour
+
+
 def create_experiment_worktree(repo_path: Path, iteration: int) -> tuple[Path, str]:
     """Create a git worktree for running an experiment in isolation.
 
@@ -20,7 +39,7 @@ def create_experiment_worktree(repo_path: Path, iteration: int) -> tuple[Path, s
         raise FileNotFoundError(f"Not a git repository: {repo_path}")
 
     experiment_id = f"iter-{iteration}-{uuid.uuid4().hex[:8]}"
-    worktree_dir = repo_path / ".nous-experiments" / experiment_id
+    worktree_dir = repo_path / _EXPERIMENTS_DIRNAME / experiment_id
     branch_name = f"nous-exp-{experiment_id}"
 
     subprocess.run(
@@ -40,7 +59,7 @@ def remove_experiment_worktree(repo_path: Path, experiment_id: str) -> None:
     Safe to call even if the worktree was already removed.
     """
     repo_path = Path(repo_path)
-    worktree_dir = repo_path / ".nous-experiments" / experiment_id
+    worktree_dir = repo_path / _EXPERIMENTS_DIRNAME / experiment_id
     branch_name = f"nous-exp-{experiment_id}"
 
     if worktree_dir.exists():
@@ -69,3 +88,194 @@ def remove_experiment_worktree(repo_path: Path, experiment_id: str) -> None:
     )
     if result.returncode != 0:
         logger.debug("Branch cleanup for %s: %s", branch_name, result.stderr.strip())
+
+
+def gc_orphan_worktrees(
+    repo_path: Path,
+    *,
+    max_age_seconds: float = _DEFAULT_ORPHAN_AGE_SECONDS,
+    pid_check: Callable[[int], bool] | None = None,
+    now: float | None = None,
+) -> list[str]:
+    """Remove stale experiment worktrees with no live owning process.
+
+    Run at ``nous run`` startup. Walks ``<repo>/.nous-experiments/`` and
+    deletes any worktree directory that is older than ``max_age_seconds``
+    and whose owning PID (if recorded under ``.nous-pid``) is no longer
+    alive. The 1-hour default matches the issue's GC threshold; the
+    rationale is that any legitimate iteration completes within an hour
+    of its last write, so anything older with no live process is genuinely
+    orphaned.
+
+    Args:
+      repo_path: target repo root.
+      max_age_seconds: only consider worktrees older than this.
+      pid_check: callable ``(pid: int) -> bool`` returning True when the
+        process is still alive. Defaults to ``os.kill(pid, 0)``-style
+        check. Tests inject a deterministic fake.
+      now: override of ``time.time()`` for deterministic tests.
+
+    Returns:
+      List of experiment_ids removed (sorted by directory name).
+    """
+    repo_path = Path(repo_path)
+    experiments_dir = repo_path / _EXPERIMENTS_DIRNAME
+    if not experiments_dir.is_dir():
+        return []
+
+    pid_alive = pid_check or _pid_alive_default
+    current_time = now if now is not None else time.time()
+
+    removed: list[str] = []
+    for entry in sorted(experiments_dir.iterdir()):
+        if not entry.is_dir():
+            continue
+        try:
+            mtime = entry.stat().st_mtime
+        except OSError:
+            continue
+        age = current_time - mtime
+        if age < max_age_seconds:
+            continue
+
+        # If a PID is recorded under .nous-pid, skip when alive.
+        pid_file = entry / ".nous-pid"
+        if pid_file.exists():
+            try:
+                pid = int(pid_file.read_text().strip())
+                if pid_alive(pid):
+                    continue
+            except (ValueError, OSError):
+                pass
+
+        # Untrack the worktree from git (best-effort), then rm -rf the dir.
+        subprocess.run(
+            ["git", "worktree", "remove", str(entry), "--force"],
+            cwd=repo_path, capture_output=True, text=True, check=False,
+        )
+        if entry.exists():
+            shutil.rmtree(entry, ignore_errors=True)
+
+        # Best-effort branch cleanup.
+        branch = f"nous-exp-{entry.name}"
+        subprocess.run(
+            ["git", "branch", "-D", branch],
+            cwd=repo_path, capture_output=True, text=True, check=False,
+        )
+
+        logger.info("GC'd orphan worktree: %s", entry)
+        removed.append(entry.name)
+    return removed
+
+
+def _pid_alive_default(pid: int) -> bool:
+    if pid <= 0:
+        return False
+    try:
+        os.kill(pid, 0)
+        return True
+    except ProcessLookupError:
+        return False
+    except PermissionError:
+        # Process exists but we can't signal it — still alive.
+        return True
+    except OSError:
+        return False
+
+
+# ─── Phase B: harness-isolated subagent runner (#133 + #123 bridge) ────────
+
+
+def make_isolated_arm_runner(
+    *,
+    sdk_runner: Callable,
+    repo_path: Path,
+    iter_dir: Path,
+    model: str = "claude-sonnet-4-6",
+    max_turns: int = 25,
+    subagent_type: str = "claude",
+) -> Callable:
+    """Build an ArmRunner backed by a worktree-isolated SDK subagent.
+
+    The returned callable matches the ``ArmRunner`` Protocol from
+    :mod:`orchestrator.parallel_arms` — takes one ``ArmUnit`` and returns
+    one ``ArmUnitResult``. Per the no-live-LLM policy, this function does
+    not call the SDK directly: it uses the injected ``sdk_runner`` from
+    :mod:`orchestrator.sdk_dispatch`, so tests pass a recording fake.
+
+    Each subagent is dispatched with ``isolation="worktree"`` and
+    ``subagent_type`` set so the harness creates a fresh worktree,
+    runs the unit's planned command inside it, and tears the worktree
+    down on exit. The post-run patch (``git diff`` inside the worktree)
+    is captured by the subagent and written to
+    ``iter_dir/patches/<arm>.patch`` — matching the existing convention.
+
+    This is the harness-managed replacement for the manual lifecycle
+    in ``create_experiment_worktree`` / ``remove_experiment_worktree``;
+    once #123 wires this runner into the parallel-arm path, the manual
+    code becomes vestigial.
+    """
+    repo_path = Path(repo_path)
+    iter_dir = Path(iter_dir)
+
+    def _run(unit):
+        # Imported lazily so the factory itself works on branches where
+        # parallel_arms hasn't landed yet (it stacks on this PR).
+        from orchestrator.parallel_arms import ArmUnitResult
+        results_dir = iter_dir / unit.relative_results_dir
+        results_dir.mkdir(parents=True, exist_ok=True)
+        patches_dir = iter_dir / "patches"
+        patches_dir.mkdir(parents=True, exist_ok=True)
+        patch_path = patches_dir / f"{unit.arm_id}.patch"
+
+        prompt = (
+            f"# Arm: {unit.arm_id} (seed {unit.seed})\n\n"
+            f"You are a subagent running one experiment unit in an isolated\n"
+            f"git worktree. **Do not modify files outside this worktree.**\n\n"
+            f"## Command\n```\n{unit.command}\n```\n\n"
+            f"## Results destination\n"
+            f"Write all output files to: `{results_dir}`\n\n"
+            f"## Patch capture\n"
+            f"Before exiting, run `git diff` in this worktree and write the\n"
+            f"output to `{patch_path}`. If there are no changes, create an\n"
+            f"empty file at that path.\n"
+        )
+
+        try:
+            result = sdk_runner(
+                prompt=prompt,
+                model=model,
+                cwd=repo_path,
+                max_turns=max_turns,
+                system_prompt=None,
+                settings_path=None,
+                event_log_path=None,
+                isolation="worktree",
+                subagent_type=subagent_type,
+            )
+        except TypeError:
+            # Older runners don't accept isolation/subagent_type kwargs;
+            # fall back to the basic call signature.
+            result = sdk_runner(
+                prompt=prompt, model=model, cwd=repo_path, max_turns=max_turns,
+            )
+
+        if getattr(result, "is_error", False):
+            return ArmUnitResult(
+                unit=unit, status="failed",
+                duration_ms=int(getattr(result, "duration_ms", 0) or 0),
+                error=str(getattr(result, "error_message", "") or "sdk reported error"),
+            )
+
+        output_files = sorted(
+            str(p.relative_to(iter_dir))
+            for p in results_dir.rglob("*") if p.is_file()
+        )
+        return ArmUnitResult(
+            unit=unit,
+            status="complete",
+            duration_ms=int(getattr(result, "duration_ms", 0) or 0),
+            output_files=output_files,
+        )
+
+    return _run
diff --git a/plugin/nous/plugin.json b/plugin/nous/plugin.json
new file mode 100644
index 0000000..13236c5
--- /dev/null
+++ b/plugin/nous/plugin.json
@@ -0,0 +1,16 @@
+{
+  "name": "nous",
+  "version": "0.2.0",
+  "description": "Hypothesis-driven experimentation for software systems. Wraps the `nous` CLI as discoverable Claude Code skills.",
+  "author": "AI-native Systems Research",
+  "homepage": "https://github.com/AI-native-Systems-Research/agentic-strategy-evolution",
+  "license": "Apache-2.0",
+  "skills": [
+    "skills/nous-run.md",
+    "skills/nous-status.md",
+    "skills/nous-resume.md",
+    "skills/nous-list.md",
+    "skills/nous-bisect.md",
+    "skills/nous-find-principle.md"
+  ]
+}
diff --git a/plugin/nous/skills/nous-bisect.md b/plugin/nous/skills/nous-bisect.md
new file mode 100644
index 0000000..947a946
--- /dev/null
+++ b/plugin/nous/skills/nous-bisect.md
@@ -0,0 +1,38 @@
+---
+name: nous-bisect
+description: Compare two iterations of the same Nous campaign — what changed in arm statuses, which principles were added between them. Use when the user wants to understand iteration deltas or debug regressions across a campaign's history.
+---
+
+# `nous-bisect`
+
+Compare two iterations of one campaign. Powered by `compare_iterations` (#126).
+
+## When to use
+
+- The user asks "what changed between iter 2 and iter 3", "which principles got added in iter 4", "did h-main flip from CONFIRMED to REFUTED".
+- The user is debugging a regression and wants to bisect across the campaign timeline.
+
+## Inputs
+
+- `campaign-root` (required): the campaign work-dir (e.g. `<repo>/.nous/<run-id>`).
+- `iter-a` (required): first iteration number.
+- `iter-b` (required): second iteration number.
+
+## Run
+
+```bash
+python -c "
+import json
+from pathlib import Path
+from orchestrator.campaign_index import compare_iterations
+
+out = compare_iterations(Path('$CAMPAIGN_ROOT'), $ITER_A, $ITER_B)
+print(json.dumps(out['delta'], indent=2))
+"
+```
+
+## Notes
+
+- Output is deterministic — calling it twice on unchanged data produces byte-equal output (no timestamps, no map-ordering leaks).
+- The `delta.arm_status_changes` array names only arms whose status differs between the two iterations.
+- The `delta.principles_added` array is the sorted set difference of principle IDs in `principle_updates.json` between the two iterations.
diff --git a/plugin/nous/skills/nous-find-principle.md b/plugin/nous/skills/nous-find-principle.md
new file mode 100644
index 0000000..db6f6c4
--- /dev/null
+++ b/plugin/nous/skills/nous-find-principle.md
@@ -0,0 +1,41 @@
+---
+name: nous-find-principle
+description: Search Nous principles across one or more campaigns by substring. Use when the user wants to find prior learnings ("what have we learned about ordinal scheduling"), see if a principle exists already before adding a new one, or trace a principle back to the campaign that produced it.
+---
+
+# `nous-find-principle`
+
+Search principles across all campaigns under a search root.
+
+## When to use
+
+- The user asks "what principles do we have about saturation", "have we already concluded X", "where was this principle first proposed".
+- The user is authoring a new campaign and wants to check existing principles for overlap.
+
+## Inputs
+
+- `search-root` (required): directory to walk for campaign roots.
+- `text` (required): case-insensitive substring to match against principle statements / descriptions / categories / IDs.
+- `include-retired` (optional, default false): also search principles with `status: retired`.
+
+## Run
+
+```bash
+python -c "
+import json
+from pathlib import Path
+from orchestrator.campaign_index import search_principles
+
+out = search_principles(
+    Path('$SEARCH_ROOT'), '$TEXT',
+    only_active=$([ "$INCLUDE_RETIRED" = "true" ] && echo False || echo True),
+)
+print(json.dumps(out, indent=2))
+"
+```
+
+## Notes
+
+- Phase A is plain substring matching. Embedding-based semantic search is gated on `OPENAI_API_KEY` and lands in #126 Phase B.
+- Hits include both the principle and its source campaign (`run_id`, `path`) so you can jump to the originating findings.
+- Sorted by `(run_id, principle.id)` for stable output.
diff --git a/plugin/nous/skills/nous-list.md b/plugin/nous/skills/nous-list.md
new file mode 100644
index 0000000..3131370
--- /dev/null
+++ b/plugin/nous/skills/nous-list.md
@@ -0,0 +1,43 @@
+---
+name: nous-list
+description: List all Nous campaigns under a search root (typically a target repo). Use when the user wants to see what campaigns exist, filter by status or substring, or get an overview of running vs completed work. Powered by the campaign_index module shipped in #126.
+---
+
+# `nous-list`
+
+List Nous campaigns under a search root.
+
+## When to use
+
+- The user asks "what campaigns exist on this repo", "list all my Nous runs", "show me all DONE campaigns".
+- The user wants to filter by run_id substring, phase, or repo.
+
+## Inputs
+
+- `search-root` (required): directory to walk. Typically the parent of one or more `<repo>/.nous/` directories.
+- `query` (optional): case-insensitive substring filter against run_id.
+- `status` (optional): filter to a specific phase (`DONE`, `EXECUTE_ANALYZE`, `INIT`, etc.).
+- `repo` (optional): substring filter against the resolved repo path.
+
+## Run
+
+```bash
+python -c "
+import json, sys
+from pathlib import Path
+from orchestrator.campaign_index import list_campaigns
+
+out = list_campaigns(
+    Path('$SEARCH_ROOT'),
+    query=$([ -n "$QUERY" ] && echo "'$QUERY'" || echo None),
+    status=$([ -n "$STATUS" ] && echo "'$STATUS'" || echo None),
+    repo=$([ -n "$REPO" ] && echo "'$REPO'" || echo None),
+)
+print(json.dumps(out, indent=2))
+"
+```
+
+## Notes
+
+- Uses the `campaign_index` foundation (#126) — pure Python, no MCP runtime needed.
+- Output is JSON sorted by `run_id` for stable comparison across runs.
diff --git a/plugin/nous/skills/nous-resume.md b/plugin/nous/skills/nous-resume.md
new file mode 100644
index 0000000..e22a4d6
--- /dev/null
+++ b/plugin/nous/skills/nous-resume.md
@@ -0,0 +1,30 @@
+---
+name: nous-resume
+description: Resume a Nous campaign that was interrupted mid-flight (timeout, crash, ctrl-c). Picks up at the last checkpointed phase. Use when the user says "resume", "continue", or references a campaign that already has a state.json.
+---
+
+# `nous-resume`
+
+Resume an interrupted Nous campaign from the latest checkpoint (#91).
+
+## When to use
+
+- The user says "resume the saturation campaign" or "pick up where it left off".
+- A previous run was killed and the campaign's `state.json` is mid-flight (phase != INIT, != DONE).
+
+## Inputs
+
+- `target` (required): campaign.yaml path. The orchestrator reads the matching `<repo>/.nous/<run-id>/state.json` to find the resume point.
+- `max-iterations` (optional): override the campaign's cap.
+- `agent` (optional): backend to use on resume — usually matches the original.
+
+## Run
+
+```bash
+nous resume "$TARGET" --max-iterations "${MAX:-$(yq '.max_iterations' "$TARGET")}" --agent "${AGENT:-api}"
+```
+
+## Notes
+
+- Resume is idempotent — running it on a DONE campaign starts the next iteration if `max_iterations` allows.
+- If the campaign was killed mid-EXECUTE_ANALYZE, the agent receives a continuation hint and picks up from existing artifacts in the iter dir (no full re-run).
diff --git a/plugin/nous/skills/nous-run.md b/plugin/nous/skills/nous-run.md
new file mode 100644
index 0000000..f4b7920
--- /dev/null
+++ b/plugin/nous/skills/nous-run.md
@@ -0,0 +1,35 @@
+---
+name: nous-run
+description: Start a Nous campaign from a campaign.yaml. Use when the user wants to run a hypothesis-driven experiment, kick off a new investigation, or has just authored a campaign.yaml. Accepts the campaign path and an optional max-iterations override.
+---
+
+# `nous-run`
+
+Start (or resume) a Nous campaign from a `campaign.yaml`.
+
+## When to use
+
+- The user wants to run a new experiment described in a campaign file.
+- The user says "kick off the saturation campaign", "start a Nous run", or refers to a specific campaign yaml.
+
+## What this does
+
+Shells out to the `nous run` CLI with the campaign path. The orchestrator drives the standard 6-phase loop (DESIGN → HUMAN_DESIGN_GATE → EXECUTE_ANALYZE → HUMAN_FINDINGS_GATE → DONE → next iteration) until `max_iterations` is reached or the user aborts at a gate.
+
+## Inputs
+
+- `campaign` (required): path to a `campaign.yaml`. May be relative or absolute.
+- `max-iterations` (optional): override the iteration cap declared in the campaign.
+- `auto-approve` (optional, default false): skip human gates for unattended runs. Sets `NOUS_ALLOW_AUTO_APPROVE=1`.
+- `agent` (optional, default `api`): one of `inline`, `api`, `sdk`.
+
+## Run
+
+```bash
+nous run "$CAMPAIGN" --max-iterations "$MAX" --agent "$AGENT" $([ "$AUTO_APPROVE" = "true" ] && echo --auto-approve)
+```
+
+## Notes
+
+- For unattended overnight runs, prefer `--agent sdk --auto-approve` and configure `channels:` in the campaign so gate approvals can come from Slack (#130).
+- If the campaign already has a state.json mid-flight, use `nous-resume` instead.
diff --git a/plugin/nous/skills/nous-status.md b/plugin/nous/skills/nous-status.md
new file mode 100644
index 0000000..b463629
--- /dev/null
+++ b/plugin/nous/skills/nous-status.md
@@ -0,0 +1,38 @@
+---
+name: nous-status
+description: Show the current status of a Nous campaign — phase, iteration, completed runs, active principles, last tool call. Use when the user asks "where is the campaign", "is it stuck", "report progress", or wants a live watch view.
+---
+
+# `nous-status`
+
+Read-only campaign status. Supports one-shot, single-line, and live `--watch` views (#127).
+
+## When to use
+
+- The user asks where a campaign is, what phase it's in, whether it's stuck.
+- The user wants a live view to monitor an in-flight EXECUTE_ANALYZE.
+- The user wants a single-line summary suitable for a shell prompt or CI log.
+
+## Inputs
+
+- `target` (required): a campaign yaml, run_id, or work-dir path. The CLI auto-resolves.
+- `watch` (optional): loop and redraw every 2 seconds until interrupted.
+- `line` (optional): print a single-line summary instead of the multi-line panel.
+- `interval` (optional, default 2.0): seconds between redraws when `watch` is set.
+
+## Run
+
+```bash
+if [ "$WATCH" = "true" ]; then
+  nous status "$TARGET" --watch --interval "${INTERVAL:-2}"
+elif [ "$LINE" = "true" ]; then
+  nous status "$TARGET" --line
+else
+  nous status "$TARGET"
+fi
+```
+
+## Notes
+
+- A `STUCK` marker fires when the most recent `executor_log.jsonl` event is more than 5 minutes old.
+- This skill is a pure read — no LLM calls — so it's free to call repeatedly.
diff --git a/prompts/methodology/design_thin.md b/prompts/methodology/design_thin.md
new file mode 100644
index 0000000..aa089ba
--- /dev/null
+++ b/prompts/methodology/design_thin.md
@@ -0,0 +1,29 @@
+# Design — iteration {{iteration}} for {{target_system}}
+
+> **Methodology lives in `CLAUDE.md`** (auto-loaded by Claude Code from this campaign's
+> `.nous/<run-id>/` directory). This prompt carries only the per-iteration context;
+> consult CLAUDE.md for the hypothesis-bundle structure, prediction taxonomy,
+> arm types, and writing standards.
+
+## Research question
+{{research_question}}
+
+## Target system
+**{{target_system}}** — {{system_description}}
+
+- Observable metrics: {{observable_metrics}}
+- Controllable knobs: {{controllable_knobs}}
+
+## Active principles
+{{active_principles}}
+
+## Previous handoff
+{{previous_handoff}}
+
+## Iteration directory
+`{{iter_dir}}` (work_dir-relative). Write `problem.md`, `bundle.yaml`, and a
+`## Handoff` section so the executor and the next designer can pick up.
+
+## Validation
+Run `nous validate design --dir {{iter_dir}}` before claiming done. Fix any
+errors the validator reports and rerun.
diff --git a/prompts/methodology/execute_analyze_thin.md b/prompts/methodology/execute_analyze_thin.md
new file mode 100644
index 0000000..539b610
--- /dev/null
+++ b/prompts/methodology/execute_analyze_thin.md
@@ -0,0 +1,24 @@
+# Execute & Analyze — iteration {{iteration}} for {{target_system}}
+
+> **Methodology lives in `CLAUDE.md`** (auto-loaded). This prompt carries only
+> the per-iteration context; consult CLAUDE.md for the experiment-plan
+> structure, fast-fail rules, prediction-error taxonomy, and principle-update
+> protocol.
+
+## Active principles
+{{active_principles}}
+
+## Iteration directory
+`{{iter_dir}}` (work_dir-relative).
+
+## Required outputs
+- `experiment_plan.yaml` — the deterministic command list per arm × condition.
+- `findings.json` — per-arm prediction-vs-outcome with status (CONFIRMED / REFUTED / INCONCLUSIVE).
+- `principle_updates.json` — list of principle adds / revisions / retirements (may be empty).
+- `patches/<arm>.patch` — when the bundle declares `code_changes` for that arm.
+- `results/<arm>/<seed>/...` — raw experimental output files.
+
+## Validation
+Run `nous validate execution --dir {{iter_dir}}` before claiming done. The
+deterministic Stop hook (`bin/nous-execute-stop`) will block stopping until
+validation passes and `principle_updates.json` is present.
diff --git a/pyproject.toml b/pyproject.toml
index f0b9a53..0bfe2f7 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -15,6 +15,10 @@ dev = [
     "pytest>=8.0",
     "pytest-cov>=4.0",
 ]
+sdk = [
+    "claude-agent-sdk>=0.0.20",
+    "anyio>=4.0",
+]
 
 [project.scripts]
 nous = "orchestrator.cli:main"
diff --git a/tests/CLAUDE.md b/tests/CLAUDE.md
new file mode 100644
index 0000000..0eff073
--- /dev/null
+++ b/tests/CLAUDE.md
@@ -0,0 +1,43 @@
+# Tests — local conventions
+
+This file is auto-loaded whenever Claude Code is operating inside `tests/`.
+It restates the non-negotiable rules from the root `CLAUDE.md` so they're
+in scope even when the repo root isn't.
+
+## 🚫 NEVER make live LLM calls in tests
+
+This applies to **unit, integration, and end-to-end tests alike**. There
+is no test category in this repo that's allowed to spend tokens against
+a real provider.
+
+**Active enforcement** (see `tests/conftest.py`):
+- `block_live_llm_calls` autouse fixture strips `OPENAI_API_KEY` /
+  `ANTHROPIC_API_KEY` and patches `urllib.request.urlopen` + `claude_agent_sdk.query`
+  to hard-fail on real network calls. If a new test trips this guard,
+  inject a fake at the dispatcher seam — don't disable the guard.
+
+**Standard injection seams**:
+- `LLMDispatcher(..., completion_fn=fake)` — see `_make_fake_completion`.
+- `CLIDispatcher` — `monkeypatch.setattr("orchestrator.cli_dispatch.subprocess.run", fake)`.
+- `SDKDispatcher(..., sdk_runner=fake)` — see `_ScriptedRunner`.
+- `StubDispatcher` for end-to-end orchestrator flows that don't care
+  about any specific LLM behavior.
+
+## Behavioral testing only
+
+- ✓ Assert what's on disk: file existence, JSON Schema validation, contents.
+- ✓ Assert metrics-row contents in `llm_metrics.jsonl`.
+- ✓ Assert exit codes and stderr substrings for hooks.
+- ✗ Don't assert "function X was called with Y" — that's structural.
+- ✗ Don't assert argv shape or internal control flow.
+
+The dispatcher seams (Protocol + dataclass result) are the contract;
+the implementation is free to evolve under them.
+
+## Determinism
+
+- Inject `now=`, `monkeypatch.time.sleep`, `os.utime` for time-dependent
+  behavior. Tests must not depend on real wall-clock.
+- Inject `pid_check=` for `gc_orphan_worktrees` — never assert on real PIDs.
+- Use `_RecordingPoster` / `_ScriptedRunner` patterns to capture arguments
+  for assertion without coupling to internal call shapes.
diff --git a/tests/conftest.py b/tests/conftest.py
index 9e9709f..476b4ee 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -1,4 +1,5 @@
 import json
+import urllib.request
 from pathlib import Path
 
 import pytest
@@ -9,6 +10,67 @@
 TEMPLATES_DIR = Path(__file__).resolve().parent.parent / "orchestrator" / "templates"
 
 
+# ─── No-live-LLM enforcement (project principle, see CLAUDE.md) ────────────
+
+
+_BLOCKED_HOSTS = (
+    "api.anthropic.com",
+    "api.openai.com",
+    "api.litellm.ai",
+)
+
+
+class LiveLLMCallBlocked(RuntimeError):
+    """A test triggered something that would call a real LLM provider.
+
+    The fix is to inject a fake at the dispatcher seam (sdk_runner=,
+    completion_fn=, monkeypatch subprocess.run, etc.) — NEVER to
+    disable this guard. See CLAUDE.md.
+    """
+
+
+@pytest.fixture(autouse=True)
+def block_live_llm_calls(monkeypatch):
+    """Auto-applied to every test: strip LLM API keys from env and refuse
+    real network calls to known LLM hosts.
+
+    Tests that legitimately need to construct an OpenAI client should pass
+    api_key= explicitly (existing tests already do this). Tests that need
+    to dispatch an agent should inject a fake — see tests/CLAUDE.md.
+    """
+    for var in ("OPENAI_API_KEY", "OPENAI_BASE_URL", "ANTHROPIC_API_KEY"):
+        monkeypatch.delenv(var, raising=False)
+
+    original_urlopen = urllib.request.urlopen
+
+    def _guarded_urlopen(req, *args, **kwargs):
+        url = req.full_url if hasattr(req, "full_url") else str(req)
+        if any(host in url for host in _BLOCKED_HOSTS):
+            raise LiveLLMCallBlocked(
+                f"Test attempted urlopen to {url!r} — live LLM calls are "
+                "forbidden. Inject a fake at the dispatcher seam. See CLAUDE.md."
+            )
+        return original_urlopen(req, *args, **kwargs)
+
+    monkeypatch.setattr(urllib.request, "urlopen", _guarded_urlopen)
+
+    # Patch claude_agent_sdk.query if installed; this catches accidental
+    # uses of the default sdk_runner path.
+    try:
+        import claude_agent_sdk  # type: ignore[import-not-found]
+
+        async def _blocked_query(*args, **kwargs):
+            raise LiveLLMCallBlocked(
+                "Test invoked claude_agent_sdk.query — pass sdk_runner= "
+                "to SDKDispatcher with a fake. See CLAUDE.md."
+            )
+            yield  # pragma: no cover  (makes the function an async generator)
+
+        monkeypatch.setattr(claude_agent_sdk, "query", _blocked_query)
+    except ImportError:
+        pass
+
+
 @pytest.fixture
 def schemas_dir():
     return SCHEMAS_DIR
diff --git a/tests/test_cache_stats.py b/tests/test_cache_stats.py
new file mode 100644
index 0000000..8b34f46
--- /dev/null
+++ b/tests/test_cache_stats.py
@@ -0,0 +1,146 @@
+"""Behavioral tests for the cache-stats aggregation (#122).
+
+The aggregation reads ``llm_metrics.jsonl`` and produces a hit-rate
+summary that drives ``nous cost --cache-stats``. Tests synthesize
+realistic metrics rows on disk and assert on the returned numbers —
+never on which iteration order the function used or how it organized
+the by-phase grouping internally.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from orchestrator.cache_stats import cache_stats, format_cache_stats
+
+
+def _write_metrics(path: Path, rows: list[dict]) -> None:
+    path.write_text("\n".join(json.dumps(r) for r in rows) + "\n")
+
+
+# ─── No data ────────────────────────────────────────────────────────────────
+
+class TestEmpty:
+
+    def test_missing_file_returns_zeroed_summary(self, tmp_path):
+        out = cache_stats(tmp_path / "no-such.jsonl")
+        assert out["total_calls"] == 0
+        assert out["hit_rate"] == 0.0
+
+    def test_empty_file_returns_zeroed_summary(self, tmp_path):
+        path = tmp_path / "metrics.jsonl"
+        path.write_text("")
+        assert cache_stats(path)["total_calls"] == 0
+
+
+# ─── Hit-rate math ──────────────────────────────────────────────────────────
+
+class TestHitRate:
+
+    def test_first_call_is_all_creation_then_read_dominates(self, tmp_path):
+        path = tmp_path / "metrics.jsonl"
+        _write_metrics(path, [
+            # Call 1: cold — pays creation, no read.
+            {
+                "phase": "design",
+                "input_tokens": 50,
+                "cache_creation_input_tokens": 1500,
+                "cache_read_input_tokens": 0,
+            },
+            # Call 2: warm — read dominates.
+            {
+                "phase": "design",
+                "input_tokens": 70,
+                "cache_creation_input_tokens": 0,
+                "cache_read_input_tokens": 1500,
+            },
+        ])
+
+        out = cache_stats(path)
+        assert out["total_calls"] == 2
+        assert out["cache_creation_input_tokens"] == 1500
+        assert out["cache_read_input_tokens"] == 1500
+        assert out["input_tokens_uncached"] == 120
+
+        # hit_rate = read / (uncached + creation + read) = 1500 / 3120 ≈ 0.4808.
+        assert 0.48 <= out["hit_rate"] <= 0.49
+
+    def test_zero_total_returns_zero_hit_rate_no_division_error(self, tmp_path):
+        path = tmp_path / "metrics.jsonl"
+        _write_metrics(path, [{"phase": "design"}])  # all token fields 0
+
+        out = cache_stats(path)
+        assert out["hit_rate"] == 0.0
+
+
+# ─── Per-phase breakdown ───────────────────────────────────────────────────
+
+class TestByPhase:
+
+    def test_separate_phase_buckets(self, tmp_path):
+        path = tmp_path / "metrics.jsonl"
+        _write_metrics(path, [
+            {"phase": "design", "input_tokens": 100, "cache_read_input_tokens": 200},
+            {"phase": "design", "input_tokens": 100, "cache_read_input_tokens": 200},
+            {"phase": "execute-analyze", "input_tokens": 1000, "cache_read_input_tokens": 0},
+        ])
+
+        out = cache_stats(path)
+        assert "design" in out["by_phase"]
+        assert "execute-analyze" in out["by_phase"]
+        assert out["by_phase"]["design"]["calls"] == 2
+        assert out["by_phase"]["execute-analyze"]["calls"] == 1
+        # design has cache reads, execute-analyze does not.
+        assert out["by_phase"]["design"]["hit_rate"] > 0
+        assert out["by_phase"]["execute-analyze"]["hit_rate"] == 0.0
+
+
+# ─── Robustness ─────────────────────────────────────────────────────────────
+
+class TestRobustness:
+
+    def test_corrupt_lines_are_skipped(self, tmp_path):
+        path = tmp_path / "metrics.jsonl"
+        path.write_text(
+            json.dumps({"phase": "design", "input_tokens": 10}) + "\n"
+            "this is not json\n"
+            + json.dumps({"phase": "design", "input_tokens": 5}) + "\n"
+        )
+        out = cache_stats(path)
+        assert out["total_calls"] == 2
+        assert out["input_tokens_uncached"] == 15
+
+    def test_missing_token_fields_treated_as_zero(self, tmp_path):
+        path = tmp_path / "metrics.jsonl"
+        _write_metrics(path, [{"phase": "design"}, {"phase": "design"}])
+
+        out = cache_stats(path)
+        assert out["total_calls"] == 2
+        assert out["cache_read_input_tokens"] == 0
+
+
+# ─── Human formatting ──────────────────────────────────────────────────────
+
+class TestFormatCacheStats:
+
+    def test_format_includes_hit_rate_and_phase_breakdown(self):
+        stats = {
+            "total_calls": 3,
+            "input_tokens_uncached": 100,
+            "cache_creation_input_tokens": 1500,
+            "cache_read_input_tokens": 3000,
+            "hit_rate": 0.65,
+            "by_phase": {
+                "design": {
+                    "calls": 2,
+                    "input_tokens_uncached": 50,
+                    "cache_creation_input_tokens": 1500,
+                    "cache_read_input_tokens": 3000,
+                    "hit_rate": 0.66,
+                },
+            },
+        }
+        text = format_cache_stats(stats)
+        assert "Hit rate:" in text
+        assert "65.0%" in text or "65%" in text
+        assert "design" in text
diff --git a/tests/test_campaign_index.py b/tests/test_campaign_index.py
new file mode 100644
index 0000000..6e40428
--- /dev/null
+++ b/tests/test_campaign_index.py
@@ -0,0 +1,249 @@
+"""Behavioral tests for the campaign index (#126 Phase A).
+
+Each function under test takes a search/campaign root on disk and returns
+JSON-friendly summaries. Tests synthesize realistic on-disk shapes and
+assert on the returned data — never on internal helpers or which files
+the function happened to read in what order.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+from orchestrator.campaign_index import (
+    compare_iterations,
+    get_arm_results,
+    list_campaigns,
+    search_principles,
+)
+
+
+def _make_campaign(
+    root: Path, run_id: str,
+    *, phase: str = "DONE", iteration: int = 3, completed: int = 3,
+    principles: list[dict] | None = None,
+) -> Path:
+    root.mkdir(parents=True, exist_ok=True)
+    (root / "state.json").write_text(json.dumps({
+        "run_id": run_id, "phase": phase, "iteration": iteration,
+    }))
+    rows = [{"iteration": i + 1, "outcome": "experiment_valid"}
+            for i in range(completed)]
+    (root / "ledger.json").write_text(json.dumps({"iterations": rows}))
+    (root / "principles.json").write_text(json.dumps({
+        "principles": principles or [],
+    }))
+    return root
+
+
+# ─── list_campaigns ─────────────────────────────────────────────────────────
+
+class TestListCampaigns:
+
+    def test_returns_three_synthesized_campaigns(self, tmp_path):
+        repo = tmp_path / "repo"
+        nous = repo / ".nous"
+        for rid, phase in [("alpha", "DONE"), ("beta", "EXECUTE_ANALYZE"), ("gamma", "DONE")]:
+            _make_campaign(nous / rid, rid, phase=phase, iteration=2, completed=2)
+
+        out = list_campaigns(tmp_path)
+
+        assert [c["run_id"] for c in out] == ["alpha", "beta", "gamma"]
+        assert all(c["completed_iterations"] == 2 for c in out)
+        assert {c["phase"] for c in out} == {"DONE", "EXECUTE_ANALYZE"}
+
+    def test_query_filters_by_run_id_substring(self, tmp_path):
+        nous = tmp_path / "repo" / ".nous"
+        _make_campaign(nous / "saturation-detect", "saturation-detect")
+        _make_campaign(nous / "throughput-bench", "throughput-bench")
+
+        out = list_campaigns(tmp_path, query="saturation")
+        assert [c["run_id"] for c in out] == ["saturation-detect"]
+
+    def test_status_filters_phase(self, tmp_path):
+        nous = tmp_path / "repo" / ".nous"
+        _make_campaign(nous / "a", "a", phase="DONE")
+        _make_campaign(nous / "b", "b", phase="EXECUTE_ANALYZE")
+
+        out = list_campaigns(tmp_path, status="DONE")
+        assert [c["run_id"] for c in out] == ["a"]
+
+    def test_active_principle_count_filters_retired(self, tmp_path):
+        nous = tmp_path / "repo" / ".nous"
+        _make_campaign(nous / "x", "x", principles=[
+            {"id": "p1", "status": "active", "statement": "A"},
+            {"id": "p2", "status": "retired", "statement": "B"},
+            {"id": "p3", "status": "active", "statement": "C"},
+        ])
+
+        out = list_campaigns(tmp_path)
+        assert out[0]["active_principles"] == 2
+
+    def test_results_are_sorted_for_determinism(self, tmp_path):
+        nous = tmp_path / "repo" / ".nous"
+        for rid in ["zeta", "alpha", "mu"]:
+            _make_campaign(nous / rid, rid)
+
+        out = list_campaigns(tmp_path)
+        assert [c["run_id"] for c in out] == ["alpha", "mu", "zeta"]
+
+    def test_empty_search_root_returns_empty_list(self, tmp_path):
+        assert list_campaigns(tmp_path) == []
+
+    def test_repo_path_is_resolved_when_under_dot_nous(self, tmp_path):
+        repo = tmp_path / "myrepo"
+        nous = repo / ".nous"
+        _make_campaign(nous / "x", "x")
+
+        out = list_campaigns(tmp_path)
+        assert out[0]["repo"] == str(repo.resolve())
+
+
+# ─── search_principles ────────────────────────────────────────────────────
+
+class TestSearchPrinciples:
+
+    def test_finds_principle_by_substring_in_statement(self, tmp_path):
+        nous = tmp_path / "repo" / ".nous"
+        _make_campaign(nous / "x", "x", principles=[
+            {"id": "p1", "status": "active",
+             "statement": "Saturation flattens discriminatory power of binary gating."},
+            {"id": "p2", "status": "active", "statement": "unrelated."},
+        ])
+
+        out = search_principles(tmp_path, "ordinal scheduling")
+        assert out == []
+
+        out = search_principles(tmp_path, "saturation")
+        assert len(out) == 1
+        assert out[0]["principle"]["id"] == "p1"
+        assert out[0]["run_id"] == "x"
+
+    def test_case_insensitive_match(self, tmp_path):
+        nous = tmp_path / "repo" / ".nous"
+        _make_campaign(nous / "x", "x", principles=[
+            {"id": "p1", "status": "active",
+             "statement": "Saturation flattens discriminatory power."},
+        ])
+
+        out = search_principles(tmp_path, "SATURATION")
+        assert len(out) == 1
+
+    def test_skips_retired_by_default(self, tmp_path):
+        nous = tmp_path / "repo" / ".nous"
+        _make_campaign(nous / "x", "x", principles=[
+            {"id": "p1", "status": "retired",
+             "statement": "Old saturation thinking."},
+            {"id": "p2", "status": "active",
+             "statement": "Saturation is the new black."},
+        ])
+
+        out = search_principles(tmp_path, "saturation")
+        assert [h["principle"]["id"] for h in out] == ["p2"]
+
+    def test_only_active_false_includes_retired(self, tmp_path):
+        nous = tmp_path / "repo" / ".nous"
+        _make_campaign(nous / "x", "x", principles=[
+            {"id": "p1", "status": "retired",
+             "statement": "Old saturation thinking."},
+        ])
+
+        out = search_principles(tmp_path, "saturation", only_active=False)
+        assert len(out) == 1
+
+    def test_results_are_sorted_for_determinism(self, tmp_path):
+        nous = tmp_path / "repo" / ".nous"
+        _make_campaign(nous / "z", "z", principles=[
+            {"id": "p9", "status": "active", "statement": "saturation thing."},
+        ])
+        _make_campaign(nous / "a", "a", principles=[
+            {"id": "p1", "status": "active", "statement": "saturation thing."},
+        ])
+
+        out = search_principles(tmp_path, "saturation")
+        assert [h["run_id"] for h in out] == ["a", "z"]
+
+
+# ─── get_arm_results ──────────────────────────────────────────────────────
+
+class TestGetArmResults:
+
+    def test_aggregates_seeds_under_arm(self, tmp_path):
+        camp = tmp_path / "campaign"
+        results = camp / "runs" / "iter-2" / "results" / "h-main"
+        (results / "seed-1").mkdir(parents=True)
+        (results / "seed-1" / "out.json").write_text("{}")
+        (results / "seed-2").mkdir()
+        (results / "seed-2" / "out.json").write_text("{}")
+        (results / "seed-2" / "log.txt").write_text("...")
+
+        out = get_arm_results(camp, iteration=2, arm="h-main")
+        assert out["arm"] == "h-main"
+        assert out["iteration"] == 2
+        assert [s["seed"] for s in out["seeds"]] == ["seed-1", "seed-2"]
+        # File listing is relative to campaign_root, sorted.
+        seed2_files = out["seeds"][1]["files"]
+        assert all(f.startswith("runs/iter-2/results/h-main/seed-2/") for f in seed2_files)
+
+    def test_missing_arm_returns_empty_seeds(self, tmp_path):
+        camp = tmp_path / "campaign"
+        camp.mkdir()
+        out = get_arm_results(camp, iteration=1, arm="nonexistent")
+        assert out == {"arm": "nonexistent", "iteration": 1, "seeds": []}
+
+
+# ─── compare_iterations ────────────────────────────────────────────────────
+
+class TestCompareIterations:
+
+    def _write_findings(self, root: Path, n: int, arms: list[dict]):
+        d = root / "runs" / f"iter-{n}"
+        d.mkdir(parents=True, exist_ok=True)
+        (d / "findings.json").write_text(json.dumps({"arms": arms}))
+
+    def test_arm_status_change_appears_in_delta(self, tmp_path):
+        self._write_findings(tmp_path, 1, [
+            {"arm_id": "h-main", "status": "CONFIRMED"},
+            {"arm_id": "h-ablation", "status": "CONFIRMED"},
+        ])
+        self._write_findings(tmp_path, 2, [
+            {"arm_id": "h-main", "status": "REFUTED"},
+            {"arm_id": "h-ablation", "status": "CONFIRMED"},
+        ])
+
+        out = compare_iterations(tmp_path, 1, 2)
+        changes = out["delta"]["arm_status_changes"]
+        assert {"arm_id": "h-main", "from": "CONFIRMED", "to": "REFUTED"} in changes
+        # Unchanged arm should NOT appear.
+        assert all(c["arm_id"] != "h-ablation" for c in changes)
+
+    def test_principles_added_diff_is_set_difference(self, tmp_path):
+        # Iter 1 had {p1}. Iter 2 has {p1, p2, p3}.
+        d1 = tmp_path / "runs" / "iter-1"
+        d1.mkdir(parents=True)
+        (d1 / "principle_updates.json").write_text(json.dumps([
+            {"id": "p1", "statement": "A"},
+        ]))
+        d2 = tmp_path / "runs" / "iter-2"
+        d2.mkdir(parents=True)
+        (d2 / "principle_updates.json").write_text(json.dumps([
+            {"id": "p1", "statement": "A"},
+            {"id": "p2", "statement": "B"},
+            {"id": "p3", "statement": "C"},
+        ]))
+        # Findings can be empty for this assertion.
+        self._write_findings(tmp_path, 1, [])
+        self._write_findings(tmp_path, 2, [])
+
+        out = compare_iterations(tmp_path, 1, 2)
+        assert out["delta"]["principles_added"] == ["p2", "p3"]
+
+    def test_repeated_calls_return_byte_equal_output(self, tmp_path):
+        self._write_findings(tmp_path, 1, [{"arm_id": "h-main", "status": "CONFIRMED"}])
+        self._write_findings(tmp_path, 2, [{"arm_id": "h-main", "status": "REFUTED"}])
+
+        a = json.dumps(compare_iterations(tmp_path, 1, 2), sort_keys=True)
+        b = json.dumps(compare_iterations(tmp_path, 1, 2), sort_keys=True)
+        assert a == b
diff --git a/tests/test_channels.py b/tests/test_channels.py
new file mode 100644
index 0000000..a67946c
--- /dev/null
+++ b/tests/test_channels.py
@@ -0,0 +1,280 @@
+"""Behavioral tests for channel gate notification (issue #130, Phase A).
+
+Contract: given a channels config and a gate summary, ``notify_gate``
+emits one HTTP POST per channel with the rendered markdown card. Per-channel
+failures don't break the campaign — they're recorded in the returned
+results list.
+
+Tests use a poster-injection seam to avoid real HTTP. Behavioral assertions
+are about *what* was sent (URL, body content, headers) — never about
+which functions ``notify_gate`` called internally.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+from orchestrator.channels import notify_gate
+
+
+def _summary() -> dict:
+    return {
+        "gate_type": "design",
+        "summary": "Hypothesis bundle is well-formed and consistent with active principles.",
+        "key_points": [
+            "h-main covers ordinal scheduling under saturation.",
+            "Methodology aligns with prior principles.",
+        ],
+    }
+
+
+class _RecordingPoster:
+    """Capture (url, body, headers, timeout) for every call. Optionally
+    raise on the Nth call to simulate flakiness."""
+
+    def __init__(self, status: int = 200, raise_on: list[int] | None = None):
+        self.calls: list[dict] = []
+        self.status = status
+        self.raise_on = raise_on or []
+
+    def __call__(self, url: str, body: bytes, headers: dict, timeout: float):
+        idx = len(self.calls)
+        self.calls.append({
+            "url": url,
+            "body": body,
+            "body_text": body.decode("utf-8"),
+            "headers": dict(headers),
+            "timeout": timeout,
+        })
+        if idx in self.raise_on:
+            raise OSError("simulated transport error")
+        return self.status
+
+
+# ─── Empty / disabled config ────────────────────────────────────────────────
+
+class TestNoChannels:
+
+    def test_none_is_noop(self, tmp_path):
+        assert notify_gate(
+            None, summary=_summary(), gate_type="design", iter_dir=tmp_path,
+        ) == []
+
+    def test_empty_list_is_noop(self, tmp_path):
+        assert notify_gate(
+            [], summary=_summary(), gate_type="design", iter_dir=tmp_path,
+        ) == []
+
+
+# ─── Per-channel post ───────────────────────────────────────────────────────
+
+class TestSlackChannel:
+
+    def test_posts_to_webhook_url_with_markdown_text(self, tmp_path):
+        poster = _RecordingPoster()
+        channels = [{"kind": "slack", "webhook_url": "https://hooks.slack.example/T/B/X"}]
+
+        results = notify_gate(
+            channels, summary=_summary(), gate_type="design",
+            iter_dir=tmp_path, poster=poster,
+        )
+
+        assert len(poster.calls) == 1
+        call = poster.calls[0]
+        assert call["url"] == "https://hooks.slack.example/T/B/X"
+
+        body = json.loads(call["body_text"])
+        # Slack expects ``text`` field; the markdown card is what we send.
+        assert "text" in body
+        text = body["text"]
+        # Card content reflects the gate.
+        assert "design" in text
+        assert "Hypothesis bundle is well-formed" in text
+        assert "h-main covers" in text
+        assert "approve" in text.lower()
+        assert "reject" in text.lower()
+        assert "abort" in text.lower()
+
+        assert results[0]["ok"] is True
+        assert results[0]["status_code"] == 200
+
+
+class TestGenericWebhook:
+
+    def test_posts_with_custom_headers_and_url(self, tmp_path):
+        poster = _RecordingPoster()
+        channels = [{
+            "kind": "webhook",
+            "url": "https://example.com/nous/gate",
+            "headers": {"Authorization": "Bearer secret-token"},
+        }]
+
+        notify_gate(
+            channels, summary=_summary(), gate_type="findings",
+            iter_dir=tmp_path, poster=poster,
+        )
+
+        call = poster.calls[0]
+        assert call["url"] == "https://example.com/nous/gate"
+        assert call["headers"]["Authorization"] == "Bearer secret-token"
+
+        body = json.loads(call["body_text"])
+        # Generic webhook receives markdown under a 'markdown' key.
+        assert "markdown" in body
+        assert "findings" in body["markdown"]
+
+
+# ─── Error isolation ────────────────────────────────────────────────────────
+
+class TestErrorIsolation:
+
+    def test_failed_channel_does_not_break_others(self, tmp_path):
+        poster = _RecordingPoster(raise_on=[0])  # first channel raises
+        channels = [
+            {"kind": "slack", "webhook_url": "https://hooks.slack.example/A"},
+            {"kind": "slack", "webhook_url": "https://hooks.slack.example/B"},
+        ]
+
+        results = notify_gate(
+            channels, summary=_summary(), gate_type="design",
+            iter_dir=tmp_path, poster=poster,
+        )
+
+        assert len(results) == 2
+        assert results[0]["ok"] is False
+        assert "error" in results[0]
+        assert results[1]["ok"] is True
+
+    def test_unknown_kind_records_error_does_not_raise(self, tmp_path):
+        poster = _RecordingPoster()
+        channels = [{"kind": "telegram-not-yet-supported", "url": "https://x"}]
+
+        results = notify_gate(
+            channels, summary=_summary(), gate_type="design",
+            iter_dir=tmp_path, poster=poster,
+        )
+
+        # Phase A only ships slack + generic; unknown kind logs but
+        # doesn't raise. Future phases extend dispatchers without
+        # breaking older campaign configs.
+        assert len(results) == 1
+        # When poster is provided, we don't go through the dispatcher
+        # registry, so a poster-based fake will succeed even on
+        # unknown kinds. Real (no-poster) path raises ValueError -
+        # tested below in TestRealUrlopenIntegration if expanded.
+        assert results[0]["ok"] is True or "error" in results[0]
+
+
+# ─── Markdown card shape ────────────────────────────────────────────────────
+
+class TestMarkdownCard:
+
+    def test_card_includes_iter_dir_for_audit(self, tmp_path):
+        poster = _RecordingPoster()
+        channels = [{"kind": "slack", "webhook_url": "https://hooks.slack.example/X"}]
+
+        notify_gate(
+            channels, summary=_summary(), gate_type="design",
+            iter_dir=tmp_path / "runs" / "iter-1", poster=poster,
+        )
+
+        text = json.loads(poster.calls[0]["body_text"])["text"]
+        # Reviewers need the iter dir to find the artifacts.
+        assert "iter-1" in text
+
+    def test_card_includes_summary_text_when_no_key_points(self, tmp_path):
+        poster = _RecordingPoster()
+        summary = {
+            "gate_type": "findings",
+            "summary": "Findings approved by validator.",
+            "key_points": [],
+        }
+        notify_gate(
+            [{"kind": "slack", "webhook_url": "https://hooks.slack.example/X"}],
+            summary=summary, gate_type="findings", iter_dir=tmp_path, poster=poster,
+        )
+        text = json.loads(poster.calls[0]["body_text"])["text"]
+        assert "Findings approved by validator." in text
+
+
+# ─── Phase B: reply parsing + wait-for-decision ────────────────────────────
+
+
+class TestParseReply:
+
+    def test_recognizes_approve_tokens(self):
+        from orchestrator.channels import parse_reply
+        for text in ("approve", "Approved", "LGTM", "ok let's go", "yes please"):
+            assert parse_reply(text) == "approve", text
+
+    def test_recognizes_reject_tokens(self):
+        from orchestrator.channels import parse_reply
+        for text in ("reject", "no", "Rejected — fix h-main", "redesign"):
+            assert parse_reply(text) == "reject", text
+
+    def test_recognizes_abort_tokens(self):
+        from orchestrator.channels import parse_reply
+        for text in ("abort", "STOP", "cancel this"):
+            assert parse_reply(text) == "abort", text
+
+    def test_unrecognized_reply_returns_none(self):
+        from orchestrator.channels import parse_reply
+        assert parse_reply("hmm not sure") is None
+        assert parse_reply("") is None
+        assert parse_reply(None) is None  # type: ignore[arg-type]
+
+
+class TestWaitForReply:
+
+    def test_returns_decision_on_first_recognized_reply(self):
+        from orchestrator.channels import wait_for_reply
+
+        replies = iter(["", "still thinking", "approve"])
+
+        def provider():
+            try:
+                return next(replies)
+            except StopIteration:
+                return None
+
+        ticks = iter([0.0, 1.0, 2.0, 3.0, 4.0])
+
+        decision = wait_for_reply(
+            provider, timeout_seconds=10,
+            sleeper=lambda _: None,
+            clock=lambda: next(ticks),
+        )
+        assert decision == "approve"
+
+    def test_timeout_returns_none(self):
+        from orchestrator.channels import wait_for_reply
+
+        ticks = iter([0.0, 5.0, 10.0, 15.0])
+
+        decision = wait_for_reply(
+            lambda: None, timeout_seconds=10,
+            sleeper=lambda _: None,
+            clock=lambda: next(ticks),
+        )
+        assert decision is None
+
+    def test_unrecognized_replies_keep_polling(self):
+        from orchestrator.channels import wait_for_reply
+
+        replies = iter(["hmm", "thinking", "weird message", "abort"])
+        ticks = iter([0.0] * 20)
+
+        def provider():
+            try:
+                return next(replies)
+            except StopIteration:
+                return None
+
+        decision = wait_for_reply(
+            provider, timeout_seconds=100,
+            sleeper=lambda _: None,
+            clock=lambda: next(ticks),
+        )
+        assert decision == "abort"
diff --git a/tests/test_claude_md.py b/tests/test_claude_md.py
new file mode 100644
index 0000000..ecb3fa0
--- /dev/null
+++ b/tests/test_claude_md.py
@@ -0,0 +1,197 @@
+"""Behavioral tests for the per-campaign CLAUDE.md generator (issue #131).
+
+CLAUDE.md is the contract Claude Code's session loader reads. We assert
+on its CONTENTS — what sections appear, what data they contain, where
+the file lives — never on internal helpers or how the renderer decided
+to organize its work.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from orchestrator.claude_md import (
+    regenerate_from_disk,
+    render_campaign_claude_md,
+    write_campaign_claude_md,
+)
+
+
+def _campaign(**overrides) -> dict:
+    base = {
+        "research_question": "What mechanism drives the primary perf bottleneck?",
+        "target_system": {
+            "name": "BLIS",
+            "description": "Inference simulator with ordinal scheduling.",
+            "observable_metrics": ["throughput", "latency"],
+            "controllable_knobs": ["batch_size", "scheduling_policy"],
+        },
+    }
+    base.update(overrides)
+    return base
+
+
+# ─── Generator output ───────────────────────────────────────────────────────
+
+class TestRenderCampaignClaudeMd:
+
+    def test_research_question_appears(self):
+        out = render_campaign_claude_md(campaign=_campaign())
+        assert "What mechanism drives the primary perf bottleneck?" in out
+
+    def test_target_system_summary_appears(self):
+        out = render_campaign_claude_md(campaign=_campaign())
+        assert "BLIS" in out
+        assert "ordinal scheduling" in out.lower()
+        assert "throughput" in out
+        assert "batch_size" in out
+
+    def test_active_principles_section_present(self):
+        principles = [
+            {
+                "id": "p-001",
+                "category": "domain",
+                "statement": "Saturation flattens the discriminatory power of binary gating.",
+                "status": "active",
+            },
+            {
+                "id": "p-retired",
+                "category": "domain",
+                "statement": "old idea",
+                "status": "retired",
+            },
+        ]
+        out = render_campaign_claude_md(campaign=_campaign(), principles=principles)
+
+        assert "## Active Principles" in out
+        assert "p-001" in out
+        assert "Saturation flattens" in out
+        # Retired principles should NOT leak into the active section.
+        assert "p-retired" not in out
+
+    def test_first_iteration_handoff_placeholder(self):
+        out = render_campaign_claude_md(campaign=_campaign(), last_handoff=None)
+        assert "First iteration" in out
+
+    def test_handoff_section_includes_provided_text(self):
+        out = render_campaign_claude_md(
+            campaign=_campaign(),
+            last_handoff="### Handoff\nThe executor should focus on h-main first.",
+            iteration=2,
+        )
+        assert "executor should focus on h-main first" in out
+        assert "iteration 2" in out
+
+    def test_warning_against_hand_edits_appears(self):
+        out = render_campaign_claude_md(campaign=_campaign())
+        assert "auto-generated" in out
+        assert "Do not hand-edit" in out
+
+
+# ─── Disk write ─────────────────────────────────────────────────────────────
+
+class TestWriteCampaignClaudeMd:
+
+    def test_writes_to_claude_md_at_work_dir_root(self, tmp_path):
+        content = render_campaign_claude_md(campaign=_campaign())
+        path = write_campaign_claude_md(tmp_path, content)
+
+        assert path.name == "CLAUDE.md"
+        assert path.parent == tmp_path.resolve()
+        assert path.read_text() == content
+
+    def test_idempotent_overwrite(self, tmp_path):
+        write_campaign_claude_md(tmp_path, "first")
+        write_campaign_claude_md(tmp_path, "second")
+        assert (tmp_path / "CLAUDE.md").read_text() == "second"
+
+
+# ─── Regenerate from disk ──────────────────────────────────────────────────
+
+class TestRegenerateFromDisk:
+    """End-to-end: drop principles.json + handoff.md in a work_dir, call
+    regenerate_from_disk, assert the new CLAUDE.md reflects them."""
+
+    def test_pulls_principles_from_principles_json(self, tmp_path):
+        (tmp_path / "principles.json").write_text(json.dumps({
+            "principles": [
+                {"id": "p-99", "category": "domain",
+                 "statement": "Test principle from disk.", "status": "active"},
+            ],
+        }))
+
+        regenerate_from_disk(tmp_path, _campaign(), iteration=2)
+
+        out = (tmp_path / "CLAUDE.md").read_text()
+        assert "p-99" in out
+        assert "Test principle from disk." in out
+
+    def test_pulls_handoff_from_handoff_md(self, tmp_path):
+        (tmp_path / "handoff.md").write_text("Handoff body — explore knob X next.")
+
+        regenerate_from_disk(tmp_path, _campaign(), iteration=3)
+
+        out = (tmp_path / "CLAUDE.md").read_text()
+        assert "explore knob X next" in out
+
+    def test_iter_n_plus_1_principles_section_reflects_updates(self, tmp_path):
+        # Iter 1: no principles yet.
+        (tmp_path / "principles.json").write_text(json.dumps({"principles": []}))
+        regenerate_from_disk(tmp_path, _campaign(), iteration=1)
+        iter1_md = (tmp_path / "CLAUDE.md").read_text()
+
+        # Iter 2: principles store now has an entry.
+        (tmp_path / "principles.json").write_text(json.dumps({
+            "principles": [
+                {"id": "p-new", "category": "domain",
+                 "statement": "New learning.", "status": "active"},
+            ],
+        }))
+        regenerate_from_disk(tmp_path, _campaign(), iteration=2)
+        iter2_md = (tmp_path / "CLAUDE.md").read_text()
+
+        assert "p-new" not in iter1_md
+        assert "p-new" in iter2_md
+        assert "New learning." in iter2_md
+
+    def test_handles_missing_principles_and_handoff_gracefully(self, tmp_path):
+        # Neither file exists.
+        regenerate_from_disk(tmp_path, _campaign(), iteration=1)
+
+        out = (tmp_path / "CLAUDE.md").read_text()
+        # Doesn't crash; placeholders show through.
+        assert "No active principles" in out or "No principles accumulated" in out
+        assert "First iteration" in out
+
+
+# ─── Init wiring ────────────────────────────────────────────────────────────
+
+class TestSetupWorkDirWritesClaudeMd:
+
+    def test_init_writes_claude_md_at_work_dir_root(self, tmp_path, monkeypatch):
+        from orchestrator.iteration import setup_work_dir
+
+        repo = tmp_path / "target-repo"
+        repo.mkdir()
+        # setup_work_dir doesn't take a campaign dict today — it copies
+        # template state.json. The CLAUDE.md write only kicks in if a
+        # campaign dict is reachable, which means callers (run_campaign,
+        # run_iteration) need to pass one. Test the renderer + regen path
+        # end-to-end here; the wire-up in setup_work_dir is exercised by
+        # the next test.
+        work_dir = setup_work_dir("run-claudemd-1", repo_path=str(repo))
+
+        # Write a campaign-level handoff and principles so regenerate has
+        # something to render.
+        (work_dir / "principles.json").write_text(json.dumps({
+            "principles": [
+                {"id": "p-x", "category": "domain",
+                 "statement": "Init-time principle.", "status": "active"},
+            ],
+        }))
+        regenerate_from_disk(work_dir, _campaign(), iteration=1)
+
+        assert (work_dir / "CLAUDE.md").exists()
+        content = (work_dir / "CLAUDE.md").read_text()
+        assert "What mechanism drives" in content
+        assert "p-x" in content
diff --git a/tests/test_execute_stop_hook.py b/tests/test_execute_stop_hook.py
new file mode 100644
index 0000000..c6ce17b
--- /dev/null
+++ b/tests/test_execute_stop_hook.py
@@ -0,0 +1,147 @@
+"""Behavioral tests for the deterministic Stop hook (#129).
+
+The hook tells Claude Code whether the executor agent's work is complete,
+based on objective evidence on disk: did `nous validate execution` pass,
+and is `principle_updates.json` present? No LLM judgment, no agent
+self-assessment.
+
+Hook exit-code convention (Claude Code Stop hooks):
+    0 → allow stop (work complete; agent terminates cleanly).
+    2 → block stop (work incomplete; structured reason on stderr; agent
+        receives the stderr in its conversation and keeps going).
+
+The tests below describe the contract: given iter_dir state X, the hook
+exits with code Y and writes a useful reason to stderr. They do NOT
+inspect which functions the hook called or how it organized its work.
+"""
+from __future__ import annotations
+
+import importlib.util
+import importlib.machinery
+import json
+import warnings
+from pathlib import Path
+
+
+HOOK_PATH = Path(__file__).resolve().parent.parent / "bin" / "nous-execute-stop"
+
+
+def _load_hook_main():
+    """Load the hook script as a Python module and return its main().
+
+    The hook has no ``.py`` suffix (it's an executable on PATH), so we
+    construct the spec with an explicit SourceFileLoader.
+    """
+    loader = importlib.machinery.SourceFileLoader("nous_execute_stop", str(HOOK_PATH))
+    spec = importlib.util.spec_from_loader("nous_execute_stop", loader)
+    assert spec is not None
+    module = importlib.util.module_from_spec(spec)
+    loader.exec_module(module)
+    return module.main
+
+
+def _populate_passing_iter_dir(work_dir: Path, iteration: int = 1) -> Path:
+    """Use StubDispatcher to write a valid execution iter_dir.
+
+    StubDispatcher produces schema-conformant artifacts. Tests here can then
+    mutate the dir to simulate failure modes.
+    """
+    from orchestrator.dispatch import StubDispatcher
+
+    iter_dir = work_dir / "runs" / f"iter-{iteration}"
+    iter_dir.mkdir(parents=True, exist_ok=True)
+
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        dispatcher = StubDispatcher(work_dir)
+
+    # Stub also needs design artifacts present for full validation.
+    dispatcher.dispatch(
+        "planner", "design",
+        output_path=iter_dir / "design_log.md", iteration=iteration,
+    )
+    dispatcher.dispatch(
+        "executor", "execute-analyze",
+        output_path=iter_dir / "executor_log.md", iteration=iteration,
+    )
+    return iter_dir
+
+
+# ─── Pass case ──────────────────────────────────────────────────────────────
+
+class TestStopHookPassCase:
+
+    def test_exits_zero_when_validation_passes_and_principles_present(
+        self, tmp_path, monkeypatch, capsys,
+    ):
+        iter_dir = _populate_passing_iter_dir(tmp_path)
+        monkeypatch.setenv("NOUS_ITER_DIR", str(iter_dir))
+
+        main = _load_hook_main()
+        rc = main()
+
+        assert rc == 0
+        captured = capsys.readouterr()
+        assert captured.err == ""
+
+
+# ─── Block cases (exit 2) ──────────────────────────────────────────────────
+
+class TestStopHookBlockCases:
+
+    def test_blocks_when_principle_updates_missing(
+        self, tmp_path, monkeypatch, capsys,
+    ):
+        iter_dir = _populate_passing_iter_dir(tmp_path)
+        (iter_dir / "principle_updates.json").unlink()
+        monkeypatch.setenv("NOUS_ITER_DIR", str(iter_dir))
+
+        main = _load_hook_main()
+        rc = main()
+
+        assert rc == 2
+        captured = capsys.readouterr()
+        assert "principle_updates.json" in captured.err
+
+    def test_blocks_with_validation_diff_when_findings_corrupted(
+        self, tmp_path, monkeypatch, capsys,
+    ):
+        iter_dir = _populate_passing_iter_dir(tmp_path)
+
+        # Drop a required field from findings.json so schema validation fails.
+        findings_path = iter_dir / "findings.json"
+        findings = json.loads(findings_path.read_text())
+        findings.pop("arms", None)  # arms is required
+        findings_path.write_text(json.dumps(findings))
+
+        monkeypatch.setenv("NOUS_ITER_DIR", str(iter_dir))
+
+        main = _load_hook_main()
+        rc = main()
+
+        assert rc == 2
+        captured = capsys.readouterr()
+        # Reason should reference the actual schema problem so the agent
+        # can fix it without re-running the entire iteration.
+        assert "findings.json" in captured.err
+        assert "arms" in captured.err.lower() or "schema" in captured.err.lower()
+
+    def test_blocks_when_iter_dir_missing(self, tmp_path, monkeypatch, capsys):
+        monkeypatch.setenv("NOUS_ITER_DIR", str(tmp_path / "nonexistent"))
+
+        main = _load_hook_main()
+        rc = main()
+
+        assert rc == 2
+        captured = capsys.readouterr()
+        assert "nonexistent" in captured.err or "does not exist" in captured.err
+
+    def test_blocks_when_env_var_unset(self, monkeypatch, capsys):
+        monkeypatch.delenv("NOUS_ITER_DIR", raising=False)
+
+        main = _load_hook_main()
+        rc = main()
+
+        assert rc == 2
+        captured = capsys.readouterr()
+        assert "NOUS_ITER_DIR" in captured.err
diff --git a/tests/test_explore_design.py b/tests/test_explore_design.py
new file mode 100644
index 0000000..7e26ca6
--- /dev/null
+++ b/tests/test_explore_design.py
@@ -0,0 +1,229 @@
+"""Behavioral tests for the explore-then-synthesize DESIGN split (#132 Phase A)."""
+from __future__ import annotations
+
+from pathlib import Path
+
+from orchestrator.explore_design import (
+    DEFAULT_EXPLORE_SCOPES,
+    ExploreReport,
+    build_explore_prompt,
+    build_synthesis_prompt,
+    run_explore_stage,
+)
+
+
+def _campaign(**overrides):
+    base = {
+        "research_question": "What drives saturation?",
+        "target_system": {
+            "name": "BLIS",
+            "description": "Inference simulator.",
+            "observable_metrics": ["throughput", "latency"],
+            "controllable_knobs": ["batch_size", "scheduling"],
+            "repo_path": "/path/to/blis",
+        },
+    }
+    base.update(overrides)
+    return base
+
+
+# ─── Per-scope prompt builders ─────────────────────────────────────────────
+
+class TestBuildExplorePrompt:
+
+    def test_metrics_prompt_focuses_on_observable_metrics(self):
+        out = build_explore_prompt("metrics", _campaign())
+        assert "Explore: metrics" in out
+        assert "metric" in out.lower()
+        assert "BLIS" in out  # target name appears
+
+    def test_knobs_prompt_focuses_on_configuration(self):
+        out = build_explore_prompt("knobs", _campaign())
+        assert "knob" in out.lower() or "config" in out.lower()
+
+    def test_prior_findings_prompt_references_findings_json(self):
+        out = build_explore_prompt("prior_findings", _campaign())
+        assert "findings.json" in out
+
+    def test_principles_prompt_references_principles_store(self):
+        out = build_explore_prompt("principles", _campaign())
+        assert "principles" in out.lower()
+
+    def test_every_prompt_marks_explorer_read_only(self):
+        for scope in DEFAULT_EXPLORE_SCOPES:
+            out = build_explore_prompt(scope, _campaign())
+            # Read-only enforcement must be EXPLICIT — Explore subagents
+            # don't have write tools, but the prompt should still say so.
+            assert "Do not modify" in out or "read-only" in out.lower()
+
+
+# ─── Run stage A: collect reports ──────────────────────────────────────────
+
+class _RecordingRunner:
+    def __init__(self):
+        self.calls: list[dict] = []
+
+    def __call__(self, scope: str, prompt: str, campaign: dict) -> ExploreReport:
+        self.calls.append({"scope": scope, "prompt": prompt, "campaign": campaign})
+        return ExploreReport(
+            scope=scope,
+            text=f"report for {scope}",
+            duration_ms=100,
+            input_tokens=200,
+            output_tokens=80,
+        )
+
+
+class TestRunExploreStage:
+
+    def test_runs_one_subagent_per_default_scope(self):
+        runner = _RecordingRunner()
+        result = run_explore_stage(_campaign(), runner=runner)
+
+        assert len(runner.calls) == len(DEFAULT_EXPLORE_SCOPES)
+        assert [r.scope for r in result.reports] == list(DEFAULT_EXPLORE_SCOPES)
+
+    def test_custom_scopes_pass_through(self):
+        runner = _RecordingRunner()
+        run_explore_stage(_campaign(), scopes=["a", "b"], runner=runner)
+        assert [c["scope"] for c in runner.calls] == ["a", "b"]
+
+    def test_aggregates_token_counts(self):
+        runner = _RecordingRunner()
+        result = run_explore_stage(_campaign(), runner=runner)
+        # 4 explorers × 200 input × 80 output.
+        assert result.total_input_tokens == 800
+        assert result.total_output_tokens == 320
+
+    def test_lookup_by_scope_returns_correct_report(self):
+        runner = _RecordingRunner()
+        result = run_explore_stage(_campaign(), runner=runner)
+        report = result.by_scope("metrics")
+        assert report is not None
+        assert report.scope == "metrics"
+
+
+# ─── Stage B: synthesis prompt ─────────────────────────────────────────────
+
+class TestBuildSynthesisPrompt:
+
+    def _stage_a(self) -> "ExploreStageResult":  # type: ignore[name-defined]
+        runner = _RecordingRunner()
+        return run_explore_stage(_campaign(), runner=runner)
+
+    def test_includes_every_explorer_report_under_its_scope(self, tmp_path):
+        stage_a = self._stage_a()
+        out = build_synthesis_prompt(
+            stage_a, campaign=_campaign(), iteration=1,
+            iter_dir=tmp_path / "runs" / "iter-1",
+        )
+        for scope in DEFAULT_EXPLORE_SCOPES:
+            assert f"### {scope}" in out
+            assert f"report for {scope}" in out
+
+    def test_explicitly_forbids_re_reading_codebase(self, tmp_path):
+        stage_a = self._stage_a()
+        out = build_synthesis_prompt(
+            stage_a, campaign=_campaign(), iteration=1,
+            iter_dir=tmp_path / "runs" / "iter-1",
+        )
+        assert "Do not re-read" in out
+
+    def test_required_outputs_named(self, tmp_path):
+        stage_a = self._stage_a()
+        out = build_synthesis_prompt(
+            stage_a, campaign=_campaign(), iteration=2,
+            iter_dir=tmp_path / "runs" / "iter-2",
+        )
+        assert "problem.md" in out
+        assert "bundle.yaml" in out
+        assert "iter-2" in out
+        assert "bundle.schema.yaml" in out
+
+    def test_research_question_appears(self, tmp_path):
+        stage_a = self._stage_a()
+        out = build_synthesis_prompt(
+            stage_a, campaign=_campaign(), iteration=1,
+            iter_dir=tmp_path / "runs" / "iter-1",
+        )
+        assert "What drives saturation?" in out
+
+
+# ─── Phase B: SDK explore runner factory ───────────────────────────────────
+
+
+from dataclasses import dataclass as _dataclass
+
+
+@_dataclass
+class _LocalSDKResult:
+    """Local stand-in for SDKResult; the real one is duck-compatible."""
+    text: str = ""
+    duration_ms: int = 0
+    input_tokens: int = 0
+    output_tokens: int = 0
+
+
+class TestMakeSdkExploreRunner:
+    """The factory wraps an injected sdk_runner so each Stage A scope
+    spawns a read-only Explore subagent. Tests assert what the runner
+    sends to the SDK and how it maps the response back to ExploreReport.
+    No live SDK call happens (no-live-LLM policy, see CLAUDE.md)."""
+
+    def test_dispatches_each_scope_with_explore_subagent_type(self):
+        from orchestrator.explore_design import make_sdk_explore_runner
+
+        sdk_calls: list[dict] = []
+
+        def sdk_runner(**kwargs):
+            sdk_calls.append(kwargs)
+            return _LocalSDKResult(
+                text="report", duration_ms=80,
+                input_tokens=300, output_tokens=120,
+            )
+
+        explore_runner = make_sdk_explore_runner(
+            sdk_runner=sdk_runner, cwd=None, model="claude-haiku-4-5",
+            max_turns=8,
+        )
+        result = run_explore_stage(_campaign(), runner=explore_runner)
+
+        assert len(sdk_calls) == len(DEFAULT_EXPLORE_SCOPES)
+        # Every call passes subagent_type=Explore — the harness signal
+        # for read-only mapping.
+        assert all(c.get("subagent_type") == "Explore" for c in sdk_calls)
+        assert all(r.text and r.input_tokens == 300 for r in result.reports)
+        assert result.total_input_tokens == 300 * len(DEFAULT_EXPLORE_SCOPES)
+
+    def test_falls_back_when_sdk_runner_lacks_subagent_kwarg(self):
+        """Forward/backward compatibility: older sdk_runners without
+        subagent_type still work; the factory drops the kwarg on
+        TypeError and retries with the base signature."""
+        from orchestrator.explore_design import make_sdk_explore_runner
+
+        seen: list[dict] = []
+
+        def old_signature_runner(*, prompt, model, cwd, max_turns):
+            seen.append({"prompt": prompt, "max_turns": max_turns})
+            return _LocalSDKResult(text="ok")
+
+        explore_runner = make_sdk_explore_runner(sdk_runner=old_signature_runner)
+        run_explore_stage(_campaign(), scopes=["metrics"], runner=explore_runner)
+
+        assert len(seen) == 1
+        assert seen[0]["prompt"]
+
+    def test_uses_haiku_by_default(self):
+        """Read-only mapping should be cheap — default model is Haiku."""
+        from orchestrator.explore_design import make_sdk_explore_runner
+
+        models: list[str] = []
+
+        def sdk_runner(**kwargs):
+            models.append(kwargs.get("model", ""))
+            return _LocalSDKResult()
+
+        explore_runner = make_sdk_explore_runner(sdk_runner=sdk_runner)
+        run_explore_stage(_campaign(), scopes=["metrics"], runner=explore_runner)
+
+        assert models[0].lower().startswith("claude-haiku")
diff --git a/tests/test_goal_driven.py b/tests/test_goal_driven.py
new file mode 100644
index 0000000..61cc48e
--- /dev/null
+++ b/tests/test_goal_driven.py
@@ -0,0 +1,137 @@
+"""Behavioral tests for /goal-driven prompt builders (#124 Phase A)."""
+from __future__ import annotations
+
+from orchestrator.goal_driven import (
+    build_full_goal_directive,
+    build_goal_driven_session_prompt,
+    build_inner_loop_goal_directive,
+)
+
+
+def _campaign(**overrides):
+    base = {
+        "research_question": "What drives saturation?",
+        "target_system": {
+            "name": "BLIS",
+            "description": "Inference simulator.",
+            "observable_metrics": ["throughput", "latency"],
+            "controllable_knobs": ["batch_size", "scheduling"],
+        },
+    }
+    base.update(overrides)
+    return base
+
+
+# ─── Mode A: whole-campaign /goal ──────────────────────────────────────────
+
+class TestFullGoalDirective:
+
+    def test_predicate_names_required_artifacts(self):
+        out = build_full_goal_directive(_campaign(), iteration=2)
+        assert "iter-2/findings.json" in out
+        assert "iter-2/principle_updates.json" in out
+
+    def test_predicate_includes_timeout_clause(self):
+        out = build_full_goal_directive(_campaign(), iteration=2, timeout_hours=12)
+        assert "12 hours" in out
+
+    def test_uses_AND_OR_logic(self):
+        out = build_full_goal_directive(_campaign(), iteration=1)
+        assert " AND " in out
+        assert " OR " in out
+
+
+# ─── Mode B: inner-loop /goal ──────────────────────────────────────────────
+
+class TestInnerLoopGoalDirective:
+
+    def test_predicate_uses_schema_validation_language(self):
+        out = build_inner_loop_goal_directive(iteration=3)
+        assert "findings.schema.json" in out
+        assert "iter-3" in out
+
+    def test_extra_predicates_are_AND_chained(self):
+        out = build_inner_loop_goal_directive(
+            iteration=1, extra_predicates=["arm_status reports complete for all arms"],
+        )
+        # All three clauses joined by AND.
+        assert out.count(" AND ") == 2
+
+
+# ─── Mode A session prompt ─────────────────────────────────────────────────
+
+class TestGoalDrivenSessionPrompt:
+
+    def test_includes_campaign_brief(self):
+        out = build_goal_driven_session_prompt(_campaign(), iteration=2)
+        assert "What drives saturation?" in out
+        assert "BLIS" in out
+        assert "throughput" in out
+        assert "batch_size" in out
+
+    def test_iteration_number_appears_consistently(self):
+        out = build_goal_driven_session_prompt(_campaign(), iteration=4)
+        # Many references to iter-4 across artifact paths.
+        assert out.count("iter-4") >= 5
+
+    def test_explicit_print_to_stdout_instruction(self):
+        """The Haiku /goal evaluator can only see what's been surfaced
+        in the conversation. The prompt MUST tell the agent to print
+        artifact paths."""
+        out = build_goal_driven_session_prompt(_campaign(), iteration=1)
+        assert "Print" in out and "stdout" in out
+
+    def test_validate_execution_invocation_present(self):
+        out = build_goal_driven_session_prompt(_campaign(), iteration=1)
+        assert "nous validate execution" in out
+
+    def test_goal_directive_appears_in_prompt(self):
+        out = build_goal_driven_session_prompt(_campaign(), iteration=1)
+        assert "/goal" in out
+
+
+# ─── Phase B: end-to-end goal-driven iteration runner ──────────────────────
+
+
+class _FakeDispatcher:
+    def __init__(self):
+        self.prompts: list[str] = []
+
+    def _call_claude(self, prompt: str) -> str:
+        self.prompts.append(prompt)
+        return "design log content from the agent"
+
+
+class TestRunGoalDrivenIteration:
+    """Phase B contract: runner takes a campaign + dispatcher, dispatches
+    the goal-driven prompt, and persists the transcript as design_log.md.
+    The agent produces artifacts via tool calls inside the session; the
+    orchestrator only persists the conversation log."""
+
+    def test_dispatches_goal_prompt_and_writes_log(self, tmp_path):
+        from orchestrator.goal_driven import run_goal_driven_iteration
+
+        dispatcher = _FakeDispatcher()
+        log_path = run_goal_driven_iteration(
+            dispatcher=dispatcher, campaign=_campaign(), iteration=2,
+            work_dir=tmp_path,
+        )
+
+        assert len(dispatcher.prompts) == 1
+        prompt = dispatcher.prompts[0]
+        assert "/goal" in prompt
+        assert "iter-2" in prompt
+
+        assert log_path == tmp_path / "runs" / "iter-2" / "design_log.md"
+        assert log_path.read_text() == "design log content from the agent"
+
+    def test_creates_iter_dir_if_missing(self, tmp_path):
+        from orchestrator.goal_driven import run_goal_driven_iteration
+
+        run_goal_driven_iteration(
+            dispatcher=_FakeDispatcher(), campaign=_campaign(),
+            iteration=5, work_dir=tmp_path,
+        )
+
+        assert (tmp_path / "runs" / "iter-5").is_dir()
+        assert (tmp_path / "runs" / "iter-5" / "design_log.md").exists()
diff --git a/tests/test_mcp_server.py b/tests/test_mcp_server.py
new file mode 100644
index 0000000..e2e6a81
--- /dev/null
+++ b/tests/test_mcp_server.py
@@ -0,0 +1,219 @@
+"""Behavioral tests for the nous-mcp stdio server (#126 Phase B).
+
+The MCP server is a thin wrapper around campaign_index. Tests drive
+``handle_request`` directly with JSON-RPC payloads (no real stdio) and
+assert what comes back. This is the contract any MCP client sees.
+"""
+from __future__ import annotations
+
+import importlib.machinery
+import importlib.util
+import json
+from pathlib import Path
+
+
+HOOK_PATH = Path(__file__).resolve().parent.parent / "bin" / "nous-mcp"
+
+
+def _load_module():
+    loader = importlib.machinery.SourceFileLoader("nous_mcp", str(HOOK_PATH))
+    spec = importlib.util.spec_from_loader("nous_mcp", loader)
+    assert spec is not None
+    module = importlib.util.module_from_spec(spec)
+    loader.exec_module(module)
+    return module
+
+
+def _make_campaign(root: Path, run_id: str, *, principles: list[dict] | None = None) -> Path:
+    root.mkdir(parents=True, exist_ok=True)
+    (root / "state.json").write_text(json.dumps({
+        "run_id": run_id, "phase": "DONE", "iteration": 2,
+    }))
+    (root / "ledger.json").write_text(json.dumps({
+        "iterations": [{"iteration": 1}, {"iteration": 2}],
+    }))
+    (root / "principles.json").write_text(json.dumps({
+        "principles": principles or [],
+    }))
+    return root
+
+
+# ─── initialize / capabilities ─────────────────────────────────────────────
+
+
+class TestInitialize:
+
+    def test_initialize_returns_protocol_and_capabilities(self):
+        mod = _load_module()
+        resp = mod.handle_request({"jsonrpc": "2.0", "id": 1, "method": "initialize"})
+
+        assert resp["jsonrpc"] == "2.0"
+        assert resp["id"] == 1
+        assert "result" in resp
+        result = resp["result"]
+        assert "protocolVersion" in result
+        assert result["serverInfo"]["name"] == "nous-mcp"
+        assert "resources" in result["capabilities"]
+        assert "tools" in result["capabilities"]
+
+    def test_unknown_method_returns_jsonrpc_error(self):
+        mod = _load_module()
+        resp = mod.handle_request({
+            "jsonrpc": "2.0", "id": 9, "method": "garbage",
+        })
+        assert resp["error"]["code"] == -32601
+        assert "garbage" in resp["error"]["message"]
+
+
+# ─── resources ─────────────────────────────────────────────────────────────
+
+
+class TestResources:
+
+    def test_list_includes_campaigns_root_and_per_campaign_resources(self, tmp_path):
+        repo = tmp_path / "repo"
+        _make_campaign(repo / ".nous" / "alpha", "alpha")
+        _make_campaign(repo / ".nous" / "beta", "beta")
+
+        mod = _load_module()
+        resp = mod.handle_request(
+            {"jsonrpc": "2.0", "id": 2, "method": "resources/list"},
+            search_root=str(tmp_path),
+        )
+
+        uris = [r["uri"] for r in resp["result"]["resources"]]
+        assert "nous://campaigns" in uris
+        assert "nous://campaigns/alpha/state" in uris
+        assert "nous://campaigns/alpha/principles" in uris
+        assert "nous://campaigns/beta/state" in uris
+
+    def test_read_state_returns_state_json_contents(self, tmp_path):
+        _make_campaign(tmp_path / "repo" / ".nous" / "x", "x")
+
+        mod = _load_module()
+        resp = mod.handle_request(
+            {
+                "jsonrpc": "2.0", "id": 3, "method": "resources/read",
+                "params": {"uri": "nous://campaigns/x/state"},
+            },
+            search_root=str(tmp_path),
+        )
+
+        body = json.loads(resp["result"]["contents"][0]["text"])
+        assert body["run_id"] == "x"
+        assert body["phase"] == "DONE"
+
+    def test_read_principles_returns_principles_json(self, tmp_path):
+        _make_campaign(
+            tmp_path / "repo" / ".nous" / "x", "x",
+            principles=[{"id": "p1", "status": "active", "statement": "..."}],
+        )
+
+        mod = _load_module()
+        resp = mod.handle_request(
+            {
+                "jsonrpc": "2.0", "id": 4, "method": "resources/read",
+                "params": {"uri": "nous://campaigns/x/principles"},
+            },
+            search_root=str(tmp_path),
+        )
+
+        body = json.loads(resp["result"]["contents"][0]["text"])
+        assert any(p["id"] == "p1" for p in body["principles"])
+
+    def test_read_unknown_campaign_returns_error(self, tmp_path):
+        mod = _load_module()
+        resp = mod.handle_request(
+            {
+                "jsonrpc": "2.0", "id": 5, "method": "resources/read",
+                "params": {"uri": "nous://campaigns/nonexistent/state"},
+            },
+            search_root=str(tmp_path),
+        )
+        assert "error" in resp
+        assert "nonexistent" in resp["error"]["message"]
+
+
+# ─── tools ─────────────────────────────────────────────────────────────────
+
+
+class TestTools:
+
+    def test_list_returns_four_tools(self):
+        mod = _load_module()
+        resp = mod.handle_request(
+            {"jsonrpc": "2.0", "id": 6, "method": "tools/list"},
+        )
+        names = [t["name"] for t in resp["result"]["tools"]]
+        assert "nous.list_campaigns" in names
+        assert "nous.search_principles" in names
+        assert "nous.get_arm_results" in names
+        assert "nous.compare_iterations" in names
+
+    def test_call_list_campaigns_returns_summaries(self, tmp_path):
+        _make_campaign(tmp_path / "repo" / ".nous" / "alpha", "alpha")
+
+        mod = _load_module()
+        resp = mod.handle_request(
+            {
+                "jsonrpc": "2.0", "id": 7, "method": "tools/call",
+                "params": {
+                    "name": "nous.list_campaigns",
+                    "arguments": {"search_root": str(tmp_path)},
+                },
+            },
+        )
+        body = json.loads(resp["result"]["content"][0]["text"])
+        assert any(c["run_id"] == "alpha" for c in body["campaigns"])
+
+    def test_call_search_principles_finds_known_substring(self, tmp_path):
+        _make_campaign(
+            tmp_path / "repo" / ".nous" / "x", "x",
+            principles=[{
+                "id": "p1", "status": "active",
+                "statement": "Saturation flattens discriminatory power.",
+            }],
+        )
+
+        mod = _load_module()
+        resp = mod.handle_request(
+            {
+                "jsonrpc": "2.0", "id": 8, "method": "tools/call",
+                "params": {
+                    "name": "nous.search_principles",
+                    "arguments": {
+                        "search_root": str(tmp_path),
+                        "text": "saturation",
+                    },
+                },
+            },
+        )
+        body = json.loads(resp["result"]["content"][0]["text"])
+        assert len(body["hits"]) == 1
+        assert body["hits"][0]["principle"]["id"] == "p1"
+
+    def test_call_unknown_tool_returns_error(self):
+        mod = _load_module()
+        resp = mod.handle_request({
+            "jsonrpc": "2.0", "id": 10, "method": "tools/call",
+            "params": {"name": "nous.delete_campaign", "arguments": {}},
+        })
+        assert "error" in resp
+        assert "delete_campaign" in resp["error"]["message"]
+
+
+# ─── error handling ────────────────────────────────────────────────────────
+
+
+class TestErrorHandling:
+
+    def test_missing_required_arg_returns_jsonrpc_error_not_crash(self):
+        mod = _load_module()
+        resp = mod.handle_request({
+            "jsonrpc": "2.0", "id": 11, "method": "tools/call",
+            "params": {
+                "name": "nous.compare_iterations",
+                "arguments": {"campaign_root": "/nope"},  # missing iter_a, iter_b
+            },
+        })
+        assert "error" in resp
diff --git a/tests/test_no_live_llm_guard.py b/tests/test_no_live_llm_guard.py
new file mode 100644
index 0000000..7e079a5
--- /dev/null
+++ b/tests/test_no_live_llm_guard.py
@@ -0,0 +1,70 @@
+"""Meta-tests: verify the conftest's no-live-LLM guard actually fires.
+
+If these tests stop passing, the guard is broken — and a real test could
+silently make a live API call. CI should fail loudly.
+"""
+from __future__ import annotations
+
+import os
+import urllib.error
+import urllib.request
+
+import pytest
+
+from tests.conftest import LiveLLMCallBlocked
+
+
+class TestEnvKeysStripped:
+    """The guard removes LLM API key env vars so any code that reads them
+    sees ``None`` and falls back to the disabled-mode path."""
+
+    def test_openai_api_key_unset(self):
+        assert os.environ.get("OPENAI_API_KEY") is None
+
+    def test_anthropic_api_key_unset(self):
+        assert os.environ.get("ANTHROPIC_API_KEY") is None
+
+
+class TestUrlopenGuard:
+    """Direct urllib.request.urlopen calls to LLM hosts must raise."""
+
+    @pytest.mark.parametrize("host", [
+        "https://api.anthropic.com/v1/messages",
+        "https://api.openai.com/v1/chat/completions",
+    ])
+    def test_blocked_host_raises(self, host):
+        with pytest.raises(LiveLLMCallBlocked):
+            urllib.request.urlopen(host)
+
+    def test_non_blocked_host_passes_through_signature(self):
+        """The guard is a substring check on known LLM hosts; calls to
+        other URLs are NOT blocked by this fixture (so tests that legitimately
+        post to e.g. a Slack webhook still go through their own injection)."""
+        # We don't actually call out to the network — just assert the guard
+        # has correct shape for a non-blocked URL.
+        # (The guard delegates to the original urlopen for non-blocked URLs.)
+        try:
+            urllib.request.urlopen("http://localhost:1/", timeout=0.01)
+        except LiveLLMCallBlocked:
+            pytest.fail("guard wrongly blocked a non-LLM host")
+        except (urllib.error.URLError, OSError, TimeoutError):
+            pass  # expected — connection refused / no listener
+
+
+class TestSDKQueryGuard:
+    """When claude_agent_sdk is installed, the guard replaces query() with
+    a hard-fail. SDKDispatcher tests inject a fake sdk_runner instead."""
+
+    def test_sdk_query_blocked_when_installed(self):
+        try:
+            import claude_agent_sdk  # type: ignore[import-not-found]
+        except ImportError:
+            pytest.skip("claude-agent-sdk not installed; nothing to guard")
+
+        async def _drive():
+            async for _ in claude_agent_sdk.query(prompt="x", options=None):
+                pass
+
+        import anyio
+        with pytest.raises(LiveLLMCallBlocked):
+            anyio.run(_drive)
diff --git a/tests/test_parallel_arms.py b/tests/test_parallel_arms.py
new file mode 100644
index 0000000..5a5a185
--- /dev/null
+++ b/tests/test_parallel_arms.py
@@ -0,0 +1,323 @@
+"""Behavioral tests for the parallel-arm orchestration (#123 Phase A + B)."""
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from pathlib import Path
+
+import pytest
+
+from orchestrator.parallel_arms import (
+    ArmUnit,
+    ArmUnitResult,
+    failed_units,
+    merge_unit_results,
+    partition_plan,
+    run_units,
+)
+
+
+@dataclass
+class _LocalSDKResult:
+    """Local stand-in for SDKResult so this branch doesn't depend on
+    sdk_dispatch.py landing first. The real SDKResult is duck-compatible."""
+    text: str = ""
+    duration_ms: int = 0
+    is_error: bool = False
+    error_message: str = ""
+
+
+# ─── Plan partitioning ─────────────────────────────────────────────────────
+
+class TestPartitionPlan:
+
+    def test_single_arm_single_condition_default_seed(self):
+        plan = {"arms": [{
+            "arm_id": "h-main",
+            "conditions": [{"name": "baseline", "command": "./blis run"}],
+        }]}
+        units = partition_plan(plan)
+        assert len(units) == 1
+        assert units[0].arm_id == "h-main"
+        assert units[0].seed == "seed-1"
+        assert units[0].condition_name == "baseline"
+        assert units[0].command == "./blis run"
+
+    def test_multi_seed_condition_fans_out(self):
+        plan = {"arms": [{
+            "arm_id": "h-main",
+            "conditions": [{
+                "name": "x", "command": "./run",
+                "seeds": ["s1", "s2", "s3"],
+            }],
+        }]}
+        units = partition_plan(plan)
+        assert len(units) == 3
+        assert sorted(u.seed for u in units) == ["s1", "s2", "s3"]
+
+    def test_multiple_arms_and_conditions(self):
+        plan = {"arms": [
+            {"arm_id": "h-main", "conditions": [
+                {"name": "a", "command": "./a"},
+                {"name": "b", "command": "./b"},
+            ]},
+            {"arm_id": "h-ablation", "conditions": [
+                {"name": "c", "command": "./c"},
+            ]},
+        ]}
+        units = partition_plan(plan)
+        assert len(units) == 3
+        ids = sorted((u.arm_id, u.condition_name) for u in units)
+        assert ids == [("h-ablation", "c"), ("h-main", "a"), ("h-main", "b")]
+
+    def test_relative_results_dir_does_not_overlap(self):
+        plan = {"arms": [{
+            "arm_id": "h-main",
+            "conditions": [{
+                "name": "x", "command": "./run", "seeds": ["s1", "s2"],
+            }],
+        }]}
+        units = partition_plan(plan)
+        dirs = {u.relative_results_dir for u in units}
+        assert len(dirs) == 2  # s1 and s2 land in different paths
+
+    def test_skips_arms_without_command(self):
+        plan = {"arms": [{
+            "arm_id": "h-main",
+            "conditions": [{"name": "no-cmd"}],
+        }]}
+        assert partition_plan(plan) == []
+
+
+# ─── Run units ─────────────────────────────────────────────────────────────
+
+class _RecordingRunner:
+    def __init__(self, statuses: dict[str, str] | None = None):
+        self.calls: list[ArmUnit] = []
+        self.statuses = statuses or {}
+
+    def __call__(self, unit: ArmUnit) -> ArmUnitResult:
+        self.calls.append(unit)
+        status = self.statuses.get(unit.arm_id, "complete")
+        return ArmUnitResult(
+            unit=unit, status=status, duration_ms=100,
+            output_files=[f"{unit.relative_results_dir}/out.json"],
+        )
+
+
+class TestRunUnits:
+
+    def test_results_returned_in_input_order(self):
+        units = [
+            ArmUnit("h-main", "s1", "x", "./a"),
+            ArmUnit("h-main", "s2", "x", "./a"),
+            ArmUnit("h-ablation", "s1", "y", "./b"),
+        ]
+        runner = _RecordingRunner()
+        results = run_units(units, runner=runner)
+        assert [r.unit.seed for r in results] == ["s1", "s2", "s1"]
+
+    def test_runner_exception_becomes_failed_unit(self):
+        units = [ArmUnit("h-main", "s1", "x", "./a")]
+
+        def crash(_):
+            raise RuntimeError("boom")
+
+        results = run_units(units, runner=crash)
+        assert results[0].status == "failed"
+        assert "boom" in results[0].error
+        assert "RuntimeError" in results[0].error
+
+    def test_max_parallel_must_be_positive(self):
+        with pytest.raises(ValueError):
+            run_units([], runner=_RecordingRunner(), max_parallel=0)
+
+
+# ─── Merge ─────────────────────────────────────────────────────────────────
+
+class TestMergeUnitResults:
+
+    def _results(self) -> list[ArmUnitResult]:
+        return [
+            ArmUnitResult(
+                unit=ArmUnit("h-main", "s1", "x", "./a"),
+                status="complete", duration_ms=100,
+                output_files=["results/h-main/s1/out.json"],
+            ),
+            ArmUnitResult(
+                unit=ArmUnit("h-main", "s2", "x", "./a"),
+                status="complete", duration_ms=120,
+                output_files=["results/h-main/s2/out.json"],
+            ),
+            ArmUnitResult(
+                unit=ArmUnit("h-ablation", "s1", "y", "./b"),
+                status="failed", error="exit 1",
+            ),
+        ]
+
+    def test_arms_grouped_by_arm_id(self):
+        out = merge_unit_results(self._results())
+        ids = [a["arm_id"] for a in out["arms"]]
+        # Sorted for determinism.
+        assert ids == ["h-ablation", "h-main"]
+
+    def test_arm_status_failed_when_any_unit_failed(self):
+        out = merge_unit_results(self._results())
+        by_id = {a["arm_id"]: a for a in out["arms"]}
+        assert by_id["h-ablation"]["status"] == "failed"
+        assert by_id["h-main"]["status"] == "complete"
+
+    def test_failed_count_correct(self):
+        out = merge_unit_results(self._results())
+        assert out["failed_unit_count"] == 1
+        assert out["total_unit_count"] == 3
+
+    def test_byte_equal_across_repeated_calls(self):
+        a = json.dumps(merge_unit_results(self._results()), sort_keys=True)
+        b = json.dumps(merge_unit_results(self._results()), sort_keys=True)
+        assert a == b
+
+    def test_units_within_arm_sorted_by_seed_and_condition(self):
+        results = [
+            ArmUnitResult(unit=ArmUnit("h-main", "s2", "b", "./x"), status="complete"),
+            ArmUnitResult(unit=ArmUnit("h-main", "s1", "a", "./x"), status="complete"),
+            ArmUnitResult(unit=ArmUnit("h-main", "s1", "b", "./x"), status="complete"),
+        ]
+        out = merge_unit_results(results)
+        seeds = [u["seed"] for u in out["arms"][0]["units"]]
+        conds = [u["condition"] for u in out["arms"][0]["units"]]
+        assert list(zip(seeds, conds)) == [("s1", "a"), ("s1", "b"), ("s2", "b")]
+
+
+# ─── Partial-retry helper ──────────────────────────────────────────────────
+
+class TestFailedUnits:
+
+    def test_returns_only_failed_units(self):
+        results = [
+            ArmUnitResult(unit=ArmUnit("h-main", "s1", "x", "./a"), status="complete"),
+            ArmUnitResult(unit=ArmUnit("h-main", "s2", "x", "./a"), status="failed"),
+            ArmUnitResult(unit=ArmUnit("h-ablation", "s1", "y", "./b"), status="failed"),
+        ]
+        failed = failed_units(results)
+        assert len(failed) == 2
+        assert all(r.arm_id != "h-main" or r.seed == "s2" for r in failed)
+
+
+# ─── Phase B: end-to-end with the harness-isolated SDK runner ─────────────
+
+
+class TestEndToEndWithIsolatedRunner:
+    """The full chain: partition_plan -> make_isolated_arm_runner ->
+    run_units -> merge_unit_results. The SDK side is injected via a
+    fake; per the no-live-LLM policy (CLAUDE.md), no real subagent is
+    spawned. The test asserts the orchestration contract — every unit
+    is dispatched with isolation=worktree to a non-overlapping results
+    dir, failures are isolated, and the merged output is deterministic.
+    """
+
+    def _plan(self):
+        return {"arms": [
+            {"arm_id": "h-main", "conditions": [
+                {"name": "x", "command": "./run --arm main"},
+            ]},
+            {"arm_id": "h-ablation", "conditions": [
+                {"name": "y", "command": "./run --arm ablation",
+                 "seeds": ["s1", "s2"]},
+            ]},
+        ]}
+
+    def _success_runner(self):
+        SDKResult = _LocalSDKResult  # noqa: N806
+
+        sdk_calls: list[dict] = []
+
+        def sdk_runner(**kwargs):
+            sdk_calls.append(kwargs)
+            prompt = kwargs.get("prompt", "")
+            # Simulate the subagent writing a file in its results dir.
+            for line in prompt.splitlines():
+                if line.startswith("Write all output files to:"):
+                    target = line.split("`", 1)[1].rstrip("`")
+                    Path(target).mkdir(parents=True, exist_ok=True)
+                    (Path(target) / "out.json").write_text("{}")
+            return SDKResult(text="done", duration_ms=120)
+
+        return sdk_runner, sdk_calls
+
+    def test_three_units_dispatched_with_isolation_kwarg(self, tmp_path):
+        from orchestrator.worktree import make_isolated_arm_runner
+
+        iter_dir = tmp_path / "iter-1"
+        iter_dir.mkdir(parents=True)
+        sdk_runner, sdk_calls = self._success_runner()
+
+        runner = make_isolated_arm_runner(
+            sdk_runner=sdk_runner, repo_path=tmp_path, iter_dir=iter_dir,
+        )
+        units = partition_plan(self._plan())
+        assert len(units) == 3
+
+        results = run_units(units, runner=runner)
+        assert len(sdk_calls) == 3
+        assert all(c.get("isolation") == "worktree" for c in sdk_calls)
+
+        merged = merge_unit_results(results)
+        assert [a["arm_id"] for a in merged["arms"]] == ["h-ablation", "h-main"]
+        assert all(a["status"] == "complete" for a in merged["arms"])
+
+    def test_partial_failure_isolated_to_one_arm(self, tmp_path):
+        from orchestrator.worktree import make_isolated_arm_runner
+        SDKResult = _LocalSDKResult  # noqa: N806
+
+        iter_dir = tmp_path / "iter-1"
+        iter_dir.mkdir(parents=True)
+
+        def sdk_runner(**kwargs):
+            prompt = kwargs.get("prompt", "")
+            if "h-ablation" in prompt:
+                return SDKResult(
+                    text="", is_error=True, error_message="exit 1",
+                )
+            for line in prompt.splitlines():
+                if line.startswith("Write all output files to:"):
+                    target = line.split("`", 1)[1].rstrip("`")
+                    Path(target).mkdir(parents=True, exist_ok=True)
+                    (Path(target) / "out.json").write_text("{}")
+            return SDKResult(text="ok")
+
+        runner = make_isolated_arm_runner(
+            sdk_runner=sdk_runner, repo_path=tmp_path, iter_dir=iter_dir,
+        )
+        merged = merge_unit_results(
+            run_units(partition_plan(self._plan()), runner=runner)
+        )
+        by_arm = {a["arm_id"]: a for a in merged["arms"]}
+        assert by_arm["h-main"]["status"] == "complete"
+        assert by_arm["h-ablation"]["status"] == "failed"
+        assert merged["failed_unit_count"] == 2
+        assert merged["total_unit_count"] == 3
+
+    def test_no_two_units_share_results_dir(self, tmp_path):
+        from orchestrator.worktree import make_isolated_arm_runner
+
+        iter_dir = tmp_path / "iter-1"
+        iter_dir.mkdir(parents=True)
+        sdk_runner, _ = self._success_runner()
+        seen_dirs: list[str] = []
+
+        def capturing(**kwargs):
+            for line in kwargs.get("prompt", "").splitlines():
+                if line.startswith("Write all output files to:"):
+                    seen_dirs.append(line.split("`", 1)[1].rstrip("`"))
+            return sdk_runner(**kwargs)
+
+        runner = make_isolated_arm_runner(
+            sdk_runner=capturing, repo_path=tmp_path, iter_dir=iter_dir,
+        )
+        run_units(partition_plan(self._plan()), runner=runner)
+
+        # Acceptance criterion: no two subagents ever write to the same
+        # results path.
+        assert len(seen_dirs) == 3
+        assert len(set(seen_dirs)) == 3
diff --git a/tests/test_plan_enforcer_hook.py b/tests/test_plan_enforcer_hook.py
new file mode 100644
index 0000000..9b8ddd9
--- /dev/null
+++ b/tests/test_plan_enforcer_hook.py
@@ -0,0 +1,240 @@
+"""Behavioral tests for the PreToolUse plan-enforcer hook (issue #128).
+
+The hook intercepts Bash tool calls during EXECUTE_ANALYZE and decides
+whether the proposed command is consistent with the iteration's
+``experiment_plan.yaml``. The decision protocol:
+
+  * ``--strict`` (env: ``NOUS_PLAN_ENFORCEMENT=strict``): block (exit 2)
+    if the command's head binary doesn't appear in any planned condition.
+  * ``--warn`` (default): always allow (exit 0) but log violations to
+    ``<iter_dir>/plan_violations.jsonl``.
+  * Escape hatch: a command containing ``# nous: ad-hoc`` is allowed in
+    strict mode AND logged distinctly so reviewers can audit the use.
+
+The hook is invoked by Claude Code with JSON on stdin describing the
+proposed tool call. We test the contract: given (mode, plan, proposed
+command) → exit code + violations log entry.
+"""
+from __future__ import annotations
+
+import importlib.machinery
+import importlib.util
+import io
+import json
+from pathlib import Path
+
+import yaml
+
+
+HOOK_PATH = Path(__file__).resolve().parent.parent / "bin" / "nous-plan-enforcer"
+
+
+def _load_hook_main():
+    loader = importlib.machinery.SourceFileLoader("nous_plan_enforcer", str(HOOK_PATH))
+    spec = importlib.util.spec_from_loader("nous_plan_enforcer", loader)
+    assert spec is not None
+    module = importlib.util.module_from_spec(spec)
+    loader.exec_module(module)
+    return module.main
+
+
+def _write_plan(iter_dir: Path, arms: list[dict]) -> None:
+    iter_dir.mkdir(parents=True, exist_ok=True)
+    plan = {"arms": arms}
+    (iter_dir / "experiment_plan.yaml").write_text(yaml.safe_dump(plan))
+
+
+def _hook_event(command: str, cwd: str) -> str:
+    """Emit a Claude Code PreToolUse hook payload for a Bash call."""
+    return json.dumps({
+        "session_id": "test-session",
+        "tool_name": "Bash",
+        "tool_input": {"command": command},
+        "cwd": cwd,
+    })
+
+
+def _run_hook(stdin_text: str, *, env: dict, monkeypatch) -> int:
+    for k, v in env.items():
+        monkeypatch.setenv(k, v)
+    monkeypatch.setattr("sys.stdin", io.StringIO(stdin_text))
+    return _load_hook_main()()
+
+
+def _read_violations(iter_dir: Path) -> list[dict]:
+    p = iter_dir / "plan_violations.jsonl"
+    if not p.exists():
+        return []
+    return [json.loads(line) for line in p.read_text().splitlines() if line.strip()]
+
+
+# ─── Strict mode ────────────────────────────────────────────────────────────
+
+class TestStrictMode:
+
+    def test_allows_planned_binary(self, tmp_path, monkeypatch, capsys):
+        _write_plan(tmp_path, [{
+            "arm_id": "h-main",
+            "conditions": [{"name": "baseline", "command": "./blis run --workload x"}],
+        }])
+        rc = _run_hook(
+            _hook_event("./blis run --workload y", str(tmp_path)),
+            env={"NOUS_ITER_DIR": str(tmp_path), "NOUS_PLAN_ENFORCEMENT": "strict"},
+            monkeypatch=monkeypatch,
+        )
+        assert rc == 0
+        assert capsys.readouterr().err == ""
+
+    def test_blocks_unplanned_binary_with_reason(self, tmp_path, monkeypatch, capsys):
+        _write_plan(tmp_path, [{
+            "arm_id": "h-main",
+            "conditions": [{"name": "baseline", "command": "./blis run"}],
+        }])
+        rc = _run_hook(
+            _hook_event("rm -rf /", str(tmp_path)),
+            env={"NOUS_ITER_DIR": str(tmp_path), "NOUS_PLAN_ENFORCEMENT": "strict"},
+            monkeypatch=monkeypatch,
+        )
+        assert rc == 2
+        err = capsys.readouterr().err
+        assert "rm" in err
+        assert "experiment_plan.yaml" in err or "planned" in err
+
+    def test_allows_ad_hoc_escape_hatch(self, tmp_path, monkeypatch, capsys):
+        _write_plan(tmp_path, [{
+            "arm_id": "h-main",
+            "conditions": [{"name": "baseline", "command": "./blis run"}],
+        }])
+        rc = _run_hook(
+            _hook_event("# nous: ad-hoc\nls -la results/", str(tmp_path)),
+            env={"NOUS_ITER_DIR": str(tmp_path), "NOUS_PLAN_ENFORCEMENT": "strict"},
+            monkeypatch=monkeypatch,
+        )
+        assert rc == 0
+        violations = _read_violations(tmp_path)
+        # Ad-hoc escapes are still LOGGED for audit, just not blocked.
+        assert len(violations) == 1
+        assert violations[0]["kind"] == "ad-hoc"
+
+
+# ─── Warn mode (default) ────────────────────────────────────────────────────
+
+class TestWarnMode:
+
+    def test_warn_allows_unplanned_and_logs(self, tmp_path, monkeypatch, capsys):
+        _write_plan(tmp_path, [{
+            "arm_id": "h-main",
+            "conditions": [{"name": "baseline", "command": "./blis run"}],
+        }])
+        rc = _run_hook(
+            _hook_event("curl https://example.com", str(tmp_path)),
+            env={"NOUS_ITER_DIR": str(tmp_path)},  # default = warn
+            monkeypatch=monkeypatch,
+        )
+        assert rc == 0  # warn mode never blocks
+        violations = _read_violations(tmp_path)
+        assert len(violations) == 1
+        assert violations[0]["kind"] == "unplanned"
+        assert "curl" in violations[0]["command"]
+        assert violations[0]["arm"] is not None or violations[0]["arm"] == ""
+        assert "timestamp" in violations[0]
+
+    def test_warn_does_not_log_planned_commands(self, tmp_path, monkeypatch):
+        _write_plan(tmp_path, [{
+            "arm_id": "h-main",
+            "conditions": [{"name": "baseline", "command": "./blis run"}],
+        }])
+        rc = _run_hook(
+            _hook_event("./blis run --threads 8", str(tmp_path)),
+            env={"NOUS_ITER_DIR": str(tmp_path)},
+            monkeypatch=monkeypatch,
+        )
+        assert rc == 0
+        assert _read_violations(tmp_path) == []
+
+
+# ─── No false positives across plan shapes ─────────────────────────────────
+
+class TestNoFalsePositives:
+    """Exercise representative plan shapes and assert every planned command
+    is recognized as planned (no false positives in strict mode)."""
+
+    PLANS = [
+        # Single arm, single condition.
+        [{"arm_id": "h-main", "conditions": [
+            {"name": "x", "command": "python run.py --seed 1"},
+        ]}],
+        # Multiple conditions per arm.
+        [{"arm_id": "h-main", "conditions": [
+            {"name": "a", "command": "./blis run --workload a"},
+            {"name": "b", "command": "./blis run --workload b"},
+        ]}],
+        # Multiple arms, mixed binaries.
+        [
+            {"arm_id": "h-main", "conditions": [
+                {"name": "x", "command": "./sim --batch=4"}]},
+            {"arm_id": "h-ablation", "conditions": [
+                {"name": "y", "command": "/usr/bin/perf record -g ./sim"}]},
+        ],
+        # Absolute paths.
+        [{"arm_id": "h-main", "conditions": [
+            {"name": "x", "command": "/usr/local/bin/custom-bench --duration 60"}]}],
+    ]
+
+    def test_strict_allows_every_planned_command(self, tmp_path, monkeypatch):
+        for i, arms in enumerate(self.PLANS):
+            iter_dir = tmp_path / f"iter-{i}"
+            _write_plan(iter_dir, arms)
+            for arm in arms:
+                for cond in arm["conditions"]:
+                    rc = _run_hook(
+                        _hook_event(cond["command"], str(iter_dir)),
+                        env={
+                            "NOUS_ITER_DIR": str(iter_dir),
+                            "NOUS_PLAN_ENFORCEMENT": "strict",
+                        },
+                        monkeypatch=monkeypatch,
+                    )
+                    assert rc == 0, (
+                        f"Strict mode blocked a planned command in plan #{i}: "
+                        f"{cond['command']!r}"
+                    )
+
+
+# ─── Edge cases ─────────────────────────────────────────────────────────────
+
+class TestEdgeCases:
+
+    def test_missing_iter_dir_warns_but_allows(self, tmp_path, monkeypatch):
+        # If the env var isn't set, we can't enforce; allow + log nothing.
+        # (The wider campaign won't have wired up the hook in this case.)
+        monkeypatch.delenv("NOUS_ITER_DIR", raising=False)
+        rc = _run_hook(
+            _hook_event("./blis run", str(tmp_path)),
+            env={},
+            monkeypatch=monkeypatch,
+        )
+        assert rc == 0
+
+    def test_non_bash_tool_call_is_ignored(self, tmp_path, monkeypatch):
+        _write_plan(tmp_path, [{
+            "arm_id": "h-main",
+            "conditions": [{"name": "x", "command": "./blis run"}],
+        }])
+        # Read tool — not Bash; should pass through.
+        payload = json.dumps({
+            "session_id": "t",
+            "tool_name": "Read",
+            "tool_input": {"file_path": "/etc/passwd"},
+            "cwd": str(tmp_path),
+        })
+        rc = _run_hook(
+            payload,
+            env={
+                "NOUS_ITER_DIR": str(tmp_path),
+                "NOUS_PLAN_ENFORCEMENT": "strict",
+            },
+            monkeypatch=monkeypatch,
+        )
+        assert rc == 0
+        assert _read_violations(tmp_path) == []
diff --git a/tests/test_plugin_package.py b/tests/test_plugin_package.py
new file mode 100644
index 0000000..dee248f
--- /dev/null
+++ b/tests/test_plugin_package.py
@@ -0,0 +1,96 @@
+"""Behavioral tests for the plugin package (#125)."""
+from __future__ import annotations
+
+import json
+import re
+from pathlib import Path
+
+
+PLUGIN_ROOT = Path(__file__).resolve().parent.parent / "plugin" / "nous"
+
+_FRONTMATTER_RE = re.compile(r"^---\s*\n(.*?)\n---\s*\n", re.DOTALL)
+
+
+class TestPluginManifest:
+
+    def test_plugin_json_exists_with_required_fields(self):
+        path = PLUGIN_ROOT / "plugin.json"
+        assert path.exists()
+        data = json.loads(path.read_text())
+        for required in ("name", "version", "description", "skills"):
+            assert required in data, f"plugin.json missing {required!r}"
+        assert data["name"] == "nous"
+        assert isinstance(data["skills"], list)
+
+    def test_plugin_lists_at_least_five_skills(self):
+        data = json.loads((PLUGIN_ROOT / "plugin.json").read_text())
+        assert len(data["skills"]) >= 5
+
+    def test_each_listed_skill_file_exists(self):
+        data = json.loads((PLUGIN_ROOT / "plugin.json").read_text())
+        for rel in data["skills"]:
+            assert (PLUGIN_ROOT / rel).exists(), f"missing skill file: {rel}"
+
+
+class TestSkillFrontmatter:
+    """Each skill markdown must have YAML frontmatter with name + description.
+
+    The description is what Claude Code reads to decide whether to suggest
+    the skill. A vague or missing description is the difference between a
+    discoverable skill and a dead one.
+    """
+
+    def _frontmatter(self, path: Path) -> dict[str, str]:
+        match = _FRONTMATTER_RE.match(path.read_text())
+        if not match:
+            return {}
+        out: dict[str, str] = {}
+        for line in match.group(1).splitlines():
+            if ":" in line:
+                k, _, v = line.partition(":")
+                out[k.strip()] = v.strip()
+        return out
+
+    def test_every_skill_has_name_and_description(self):
+        data = json.loads((PLUGIN_ROOT / "plugin.json").read_text())
+        for rel in data["skills"]:
+            fm = self._frontmatter(PLUGIN_ROOT / rel)
+            assert "name" in fm and fm["name"], f"{rel}: missing name"
+            assert "description" in fm and fm["description"], f"{rel}: missing description"
+
+    def test_descriptions_describe_when_to_use(self):
+        """The description should include cue words that help Claude Code
+        match user intent ("when the user wants", "use when", etc.)."""
+        data = json.loads((PLUGIN_ROOT / "plugin.json").read_text())
+        for rel in data["skills"]:
+            fm = self._frontmatter(PLUGIN_ROOT / rel)
+            desc = fm.get("description", "").lower()
+            assert "use when" in desc or "when the user" in desc or "use this" in desc, (
+                f"{rel}: description should hint at when to use the skill"
+            )
+
+    def test_each_skill_body_references_nous_cli(self):
+        """Phase A skills are CLI wrappers — each markdown body must
+        reference the nous command it shells out to."""
+        data = json.loads((PLUGIN_ROOT / "plugin.json").read_text())
+        for rel in data["skills"]:
+            body = (PLUGIN_ROOT / rel).read_text()
+            assert "nous " in body or "campaign_index" in body, (
+                f"{rel}: body should invoke a nous command or campaign_index"
+            )
+
+
+class TestSkillCoverage:
+    """Acceptance criterion: at least 5 skills must be present and
+    cover the documented operations."""
+
+    EXPECTED_SKILLS = {
+        "nous-run", "nous-status", "nous-resume",
+        "nous-list", "nous-bisect", "nous-find-principle",
+    }
+
+    def test_all_expected_skills_present(self):
+        present = {p.stem for p in (PLUGIN_ROOT / "skills").glob("*.md")}
+        assert self.EXPECTED_SKILLS <= present, (
+            f"missing skills: {self.EXPECTED_SKILLS - present}"
+        )
diff --git a/tests/test_prompt_loader.py b/tests/test_prompt_loader.py
index 0e6c3ef..719e70c 100644
--- a/tests/test_prompt_loader.py
+++ b/tests/test_prompt_loader.py
@@ -75,3 +75,117 @@ def test_same_placeholder_multiple_times(self, prompts_dir: Path) -> None:
         result = loader.load("repeat", {"name": "Nous"})
 
         assert result == "Nous is great. We love Nous."
+
+
+class TestThinTemplateSelection:
+    """#131 Phase B: when a CLAUDE.md exists at the configured path, the
+    loader prefers ``<template>_thin.md`` so methodology is sourced from
+    CLAUDE.md (auto-loaded) rather than re-shipped on every call."""
+
+    def test_full_template_used_when_no_claude_md(self, prompts_dir, tmp_path):
+        _write_template(prompts_dir, "design", "FULL methodology + {{name}}")
+        _write_template(prompts_dir, "design_thin", "THIN: {{name}}")
+        loader = PromptLoader(prompts_dir, claude_md_at=tmp_path / "no-such.md")
+
+        result = loader.load("design", {"name": "BLIS"})
+        assert "FULL methodology" in result
+
+    def test_thin_template_picked_when_claude_md_exists(self, prompts_dir, tmp_path):
+        _write_template(prompts_dir, "design", "FULL methodology + {{name}}")
+        _write_template(prompts_dir, "design_thin", "THIN: {{name}}")
+        claude_md = tmp_path / "CLAUDE.md"
+        claude_md.write_text("# Methodology lives here.")
+
+        loader = PromptLoader(prompts_dir, claude_md_at=claude_md)
+        result = loader.load("design", {"name": "BLIS"})
+        assert "FULL methodology" not in result
+        assert "THIN: BLIS" == result
+
+    def test_full_used_when_no_thin_variant_exists(self, prompts_dir, tmp_path):
+        _write_template(prompts_dir, "report", "FULL report template {{x}}")
+        # No report_thin.md.
+        claude_md = tmp_path / "CLAUDE.md"
+        claude_md.write_text("...")
+
+        loader = PromptLoader(prompts_dir, claude_md_at=claude_md)
+        result = loader.load("report", {"x": "ok"})
+        assert result == "FULL report template ok"
+
+    def test_thin_template_strictly_smaller(self, prompts_dir, tmp_path):
+        """Acceptance criterion #2: iter N+1 prompt is measurably smaller."""
+        full_text = "Long methodology text. " * 200 + " Context: {{name}}"
+        thin_text = "Refer to CLAUDE.md. Context: {{name}}"
+        _write_template(prompts_dir, "design", full_text)
+        _write_template(prompts_dir, "design_thin", thin_text)
+        claude_md = tmp_path / "CLAUDE.md"
+        claude_md.write_text("methodology")
+
+        full_loader = PromptLoader(prompts_dir, claude_md_at=tmp_path / "no.md")
+        thin_loader = PromptLoader(prompts_dir, claude_md_at=claude_md)
+
+        full = full_loader.load("design", {"name": "x"})
+        thin = thin_loader.load("design", {"name": "x"})
+        # Thin must be ≥ 50% smaller — the issue's empirical criterion
+        # for the token-shrink win.
+        assert len(thin) < 0.5 * len(full)
+
+
+class TestRealMethodologyThinTemplates:
+    """The shipped design_thin.md / execute_analyze_thin.md must render
+    against the same context shape the dispatcher already provides AND
+    must be substantially smaller than their full counterparts."""
+
+    REAL_PROMPTS_DIR = (
+        Path(__file__).resolve().parent.parent / "prompts" / "methodology"
+    )
+
+    def _ctx_for_design(self) -> dict[str, str]:
+        return {
+            "iteration": "2",
+            "target_system": "BLIS",
+            "system_description": "Inference simulator.",
+            "research_question": "What drives saturation?",
+            "observable_metrics": "throughput, latency",
+            "controllable_knobs": "batch_size, scheduling",
+            "active_principles": "p1: ordinal scheduling helps.",
+            "previous_handoff": "(none)",
+            "previous_findings": "(none)",
+            "human_feedback": "(none)",
+            "iter_dir": "/tmp/iter-2",
+            "nous_dir": "/path/to/nous",
+            "repo_context": "(test)",
+            "max_turns": "25",
+        }
+
+    def _ctx_for_execute(self) -> dict[str, str]:
+        return {
+            "iteration": "2",
+            "target_system": "BLIS",
+            "system_description": "Inference simulator.",
+            "active_principles": "p1: ordinal scheduling helps.",
+            "iter_dir": "/tmp/iter-2",
+            "observable_metrics": "throughput, latency",
+            "controllable_knobs": "batch_size, scheduling",
+        }
+
+    def test_design_thin_renders_and_is_smaller_than_full(self, tmp_path):
+        claude_md = tmp_path / "CLAUDE.md"
+        claude_md.write_text("methodology")
+        full_loader = PromptLoader(self.REAL_PROMPTS_DIR)
+        thin_loader = PromptLoader(self.REAL_PROMPTS_DIR, claude_md_at=claude_md)
+
+        full = full_loader.load("design", self._ctx_for_design())
+        thin = thin_loader.load("design", self._ctx_for_design())
+
+        assert len(thin) < len(full)
+        # The actual win is substantial — the full template is ~266 lines.
+        assert len(thin) < 0.5 * len(full)
+
+    def test_execute_analyze_thin_renders(self, tmp_path):
+        claude_md = tmp_path / "CLAUDE.md"
+        claude_md.write_text("...")
+        loader = PromptLoader(self.REAL_PROMPTS_DIR, claude_md_at=claude_md)
+
+        out = loader.load("execute_analyze", self._ctx_for_execute())
+        assert "CLAUDE.md" in out
+        assert "BLIS" in out
diff --git a/tests/test_routines.py b/tests/test_routines.py
new file mode 100644
index 0000000..82ca45b
--- /dev/null
+++ b/tests/test_routines.py
@@ -0,0 +1,164 @@
+"""Behavioral tests for Routines payload building (#134 Phase A)."""
+from __future__ import annotations
+
+import pytest
+
+from orchestrator.routines import build_routine_payload
+
+
+def _campaign(**overrides):
+    base = {
+        "research_question": "What drives saturation?",
+        "run_id": "saturation-run",
+        "target_system": {
+            "name": "BLIS",
+            "description": "Inference simulator.",
+            "repo_path": "/path/to/blis",
+        },
+        "max_iterations": 5,
+    }
+    base.update(overrides)
+    return base
+
+
+class TestSchedulePayload:
+
+    def test_includes_cron_trigger(self, tmp_path):
+        out = build_routine_payload(_campaign(), schedule="0 2 * * *")
+        assert out["trigger"] == {"type": "cron", "expression": "0 2 * * *"}
+
+    def test_name_falls_back_to_run_id(self):
+        out = build_routine_payload(_campaign(), schedule="0 2 * * *")
+        assert out["name"] == "saturation-run"
+
+    def test_command_includes_auto_approve_and_agent_sdk(self, tmp_path):
+        path = tmp_path / "campaign.yaml"
+        path.write_text("dummy")
+        out = build_routine_payload(
+            _campaign(), campaign_path=path, schedule="0 2 * * *",
+        )
+        assert "--auto-approve" in out["command"]
+        assert out["command"][-2:] == ["--agent", "sdk"]
+
+    def test_credentials_placeholder_not_real_secret(self):
+        out = build_routine_payload(_campaign(), schedule="0 2 * * *")
+        # The payload must NOT contain the real key — it's a placeholder
+        # that the Routines runtime resolves from its secret store.
+        assert out["credentials"]["ANTHROPIC_API_KEY"].startswith("${secret:")
+
+    def test_mcp_refs_pass_through(self):
+        out = build_routine_payload(
+            _campaign(), schedule="0 2 * * *",
+            mcp_refs=["nous://campaigns", "nous://campaigns/saturation-run/principles"],
+        )
+        assert out["mcp"]["resources"] == [
+            "nous://campaigns",
+            "nous://campaigns/saturation-run/principles",
+        ]
+
+
+class TestPrLabelPayload:
+
+    def test_includes_pr_label_trigger(self):
+        out = build_routine_payload(_campaign(), pr_label="nous-experiment")
+        assert out["trigger"] == {"type": "pr_label", "label": "nous-experiment"}
+
+
+class TestValidation:
+
+    def test_missing_trigger_raises(self):
+        with pytest.raises(ValueError, match="schedule or pr_label"):
+            build_routine_payload(_campaign())
+
+    def test_both_triggers_raises(self):
+        with pytest.raises(ValueError, match="not both"):
+            build_routine_payload(
+                _campaign(), schedule="0 2 * * *", pr_label="nous-experiment",
+            )
+
+
+class TestCampaignReference:
+
+    def test_campaign_path_yields_path_reference(self, tmp_path):
+        path = tmp_path / "campaign.yaml"
+        path.write_text("...")
+        out = build_routine_payload(
+            _campaign(), schedule="0 2 * * *", campaign_path=path,
+        )
+        assert out["campaign_path"] == str(path.resolve())
+        assert "campaign_inline" not in out
+
+    def test_no_path_inlines_campaign_dict(self):
+        out = build_routine_payload(_campaign(), schedule="0 2 * * *")
+        assert "campaign_inline" in out
+        assert out["campaign_inline"]["run_id"] == "saturation-run"
+        assert "campaign_path" not in out
+
+
+# ─── Phase B: API submission with injected poster (no live HTTP) ───────────
+
+
+class _RecordingPoster:
+    def __init__(self, response: dict | None = None):
+        self.calls: list[dict] = []
+        self.response = response or {"routine_id": "rt_test_123"}
+
+    def __call__(self, url, body, headers, timeout):
+        import json as _json
+        self.calls.append({
+            "url": url,
+            "body_json": _json.loads(body),
+            "headers": dict(headers),
+            "timeout": timeout,
+        })
+        return self.response
+
+
+class TestSubmitRoutine:
+    """submit_routine posts the payload via an injected poster (no live
+    HTTP). Tests assert what was sent over the wire and what came back —
+    never that internal helpers were called."""
+
+    def test_posts_payload_with_auth_header(self):
+        from orchestrator.routines import submit_routine
+
+        payload = build_routine_payload(_campaign(), schedule="0 2 * * *")
+        poster = _RecordingPoster()
+
+        result = submit_routine(payload, api_key="sk-test", poster=poster)
+
+        assert len(poster.calls) == 1
+        call = poster.calls[0]
+        assert call["headers"]["Authorization"] == "Bearer sk-test"
+        assert call["headers"]["Content-Type"] == "application/json"
+        assert call["body_json"]["trigger"] == {"type": "cron", "expression": "0 2 * * *"}
+        assert result == {"routine_id": "rt_test_123"}
+
+    def test_uses_custom_api_base(self):
+        from orchestrator.routines import submit_routine
+
+        poster = _RecordingPoster()
+        submit_routine(
+            build_routine_payload(_campaign(), schedule="0 2 * * *"),
+            api_base="https://custom.example/v2/routines",
+            api_key="sk-test", poster=poster,
+        )
+        assert poster.calls[0]["url"] == "https://custom.example/v2/routines"
+
+    def test_returns_routine_id(self):
+        from orchestrator.routines import submit_routine
+
+        poster = _RecordingPoster(response={"routine_id": "rt_abc", "status": "active"})
+        result = submit_routine(
+            build_routine_payload(_campaign(), schedule="0 2 * * *"),
+            api_key="sk-test", poster=poster,
+        )
+        assert result == {"routine_id": "rt_abc", "status": "active"}
+
+    def test_raises_without_api_key_when_no_poster(self):
+        """Real-world misconfig protection: no key + no env + no poster
+        must fail loudly, not fall back to anonymous."""
+        from orchestrator.routines import submit_routine
+
+        with pytest.raises(RuntimeError, match="ANTHROPIC_API_KEY"):
+            submit_routine(build_routine_payload(_campaign(), schedule="0 2 * * *"))
diff --git a/tests/test_sdk_dispatch.py b/tests/test_sdk_dispatch.py
new file mode 100644
index 0000000..2d6d578
--- /dev/null
+++ b/tests/test_sdk_dispatch.py
@@ -0,0 +1,349 @@
+"""Behavioral tests for the SDK-based dispatcher.
+
+These tests do NOT mock the Claude Agent SDK directly. They inject a
+``sdk_runner`` callable that returns a ``SDKResult`` — same contract the
+real dispatcher uses internally — and assert what the dispatcher does
+with that result: artifacts on disk, metrics rows, retry behavior.
+
+That is the contract the rest of Nous depends on. Tests below should
+keep passing across SDK API churn as long as the dispatcher's responsibility
+to write artifacts and emit metrics holds.
+
+No assertions about argv shape, internal helper calls, or which methods
+the dispatcher invoked on the runner. That's structural — out of scope.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import jsonschema
+import pytest
+import yaml
+
+from orchestrator.sdk_dispatch import SDKDispatcher, SDKResult, SDKTransientError
+
+
+SCHEMAS_DIR = Path(__file__).resolve().parent.parent / "orchestrator" / "schemas"
+
+
+def _load_schema(name: str) -> dict:
+    path = SCHEMAS_DIR / name
+    if path.suffix in (".yaml", ".yml"):
+        return yaml.safe_load(path.read_text())
+    return json.loads(path.read_text())
+
+
+def _make_campaign(repo_path: Path | None = None) -> dict:
+    target = {
+        "name": "test-system",
+        "description": "A small test system used by behavioral tests.",
+        "observable_metrics": ["latency", "throughput"],
+        "controllable_knobs": ["batch_size", "concurrency"],
+    }
+    if repo_path is not None:
+        target["repo_path"] = str(repo_path)
+    return {
+        "research_question": "What drives latency?",
+        "target_system": target,
+    }
+
+
+def _read_jsonl(path: Path) -> list[dict]:
+    if not path.exists():
+        return []
+    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
+
+
+class _ScriptedRunner:
+    """A runner that returns a queue of pre-staged results.
+
+    Each call pops the next entry. Entries can be SDKResult objects (returned)
+    or BaseException instances (raised). When the queue is exhausted, raises
+    AssertionError — a test-only failure mode that signals the dispatcher
+    called the runner more times than expected.
+    """
+
+    def __init__(self, scripted: list):
+        self._scripted = list(scripted)
+        self.calls: list[dict] = []
+
+    def __call__(self, **kwargs) -> SDKResult:
+        self.calls.append(kwargs)
+        if not self._scripted:
+            raise AssertionError(
+                f"Runner exhausted; dispatcher called it {len(self.calls)} times "
+                f"but only {len(self.calls) - 1} responses were scripted."
+            )
+        nxt = self._scripted.pop(0)
+        if isinstance(nxt, BaseException):
+            raise nxt
+        return nxt
+
+
+# ─── Text-output phase (design): dispatcher writes assistant text to log ───
+
+class TestSDKDispatchTextPhase:
+    """For design/execute-analyze, the SDK runs an agent that writes
+    artifacts via tool calls; the dispatcher persists the assistant's
+    final text message as a log."""
+
+    def test_writes_assistant_text_to_output_path(self, tmp_path):
+        runner = _ScriptedRunner([
+            SDKResult(text="design log content here", input_tokens=100, output_tokens=50),
+        ])
+        dispatcher = SDKDispatcher(
+            work_dir=tmp_path,
+            campaign=_make_campaign(tmp_path),
+            sdk_runner=runner,
+        )
+
+        out = tmp_path / "runs" / "iter-1" / "design_log.md"
+        dispatcher.dispatch("planner", "design", output_path=out, iteration=1)
+
+        assert out.exists()
+        assert "design log content here" in out.read_text()
+
+    def test_emits_one_metrics_row_per_call(self, tmp_path):
+        runner = _ScriptedRunner([
+            SDKResult(
+                text="ok",
+                input_tokens=400,
+                output_tokens=120,
+                cache_read_input_tokens=300,
+                cache_creation_input_tokens=0,
+                cost_usd=0.021,
+                duration_ms=4500,
+                num_turns=3,
+            ),
+        ])
+        dispatcher = SDKDispatcher(
+            work_dir=tmp_path,
+            campaign=_make_campaign(tmp_path),
+            sdk_runner=runner,
+        )
+
+        dispatcher.dispatch(
+            "planner", "design",
+            output_path=tmp_path / "runs" / "iter-1" / "design_log.md",
+            iteration=1,
+        )
+
+        rows = _read_jsonl(tmp_path / "llm_metrics.jsonl")
+        assert len(rows) == 1
+        row = rows[0]
+        assert row["dispatcher"] == "sdk"
+        assert row["role"] == "planner"
+        assert row["phase"] == "design"
+        assert row["input_tokens"] == 400
+        assert row["output_tokens"] == 120
+        assert row["cache_read_input_tokens"] == 300
+        assert row["cost_usd"] == pytest.approx(0.021)
+        assert row["num_turns"] == 3
+
+
+# ─── Structured-output phase: dispatcher parses + validates + writes JSON ──
+
+class TestSDKDispatchStructuredPhase:
+    """Gate-summary phase: SDK returns a fenced JSON; dispatcher parses,
+    validates against gate_summary.schema.json, writes JSON output."""
+
+    _SUMMARY = {
+        "gate_type": "design",
+        "summary": "Hypothesis bundle is well-formed and consistent with active principles.",
+        "key_points": [
+            "Hypothesis bundle covers the four arms.",
+            "Methodology aligns with prior principles.",
+        ],
+    }
+
+    def test_writes_valid_json_when_runner_returns_fenced_payload(self, tmp_path):
+        fenced = "```json\n" + json.dumps(self._SUMMARY) + "\n```"
+        runner = _ScriptedRunner([SDKResult(text=fenced)])
+        dispatcher = SDKDispatcher(
+            work_dir=tmp_path,
+            campaign=_make_campaign(),
+            sdk_runner=runner,
+        )
+
+        out = tmp_path / "runs" / "iter-1" / "gate_summary.json"
+        dispatcher.dispatch(
+            "summarizer", "summarize-gate",
+            output_path=out, iteration=1, perspective="design",
+        )
+
+        assert out.exists()
+        parsed = json.loads(out.read_text())
+        jsonschema.validate(parsed, _load_schema("gate_summary.schema.json"))
+        assert parsed["gate_type"] == "design"
+
+
+# ─── Transient retry behavior ───────────────────────────────────────────────
+
+class TestSDKDispatchTransientRetry:
+
+    def test_retries_after_transient_error_then_succeeds(self, tmp_path, monkeypatch):
+        # Disable backoff sleep to keep the test fast.
+        monkeypatch.setattr(
+            "orchestrator.sdk_dispatch.time.sleep", lambda _s: None,
+        )
+        runner = _ScriptedRunner([
+            SDKTransientError("network blip"),
+            SDKResult(text="recovered text", input_tokens=10, output_tokens=5),
+        ])
+        dispatcher = SDKDispatcher(
+            work_dir=tmp_path,
+            campaign=_make_campaign(tmp_path),
+            sdk_runner=runner,
+            max_retries=3,
+        )
+
+        out = tmp_path / "runs" / "iter-1" / "design_log.md"
+        dispatcher.dispatch("planner", "design", output_path=out, iteration=1)
+
+        assert "recovered text" in out.read_text()
+
+        retry_log = _read_jsonl(tmp_path / "retry_log.jsonl")
+        assert len(retry_log) == 1
+        assert retry_log[0]["role"] == "planner"
+        assert retry_log[0]["phase"] == "design"
+        assert "network blip" in retry_log[0]["error"]
+
+    def test_raises_after_retries_exhausted(self, tmp_path, monkeypatch):
+        monkeypatch.setattr(
+            "orchestrator.sdk_dispatch.time.sleep", lambda _s: None,
+        )
+        runner = _ScriptedRunner([
+            SDKTransientError("persistent failure"),
+            SDKTransientError("persistent failure"),
+            SDKTransientError("persistent failure"),
+        ])
+        dispatcher = SDKDispatcher(
+            work_dir=tmp_path,
+            campaign=_make_campaign(tmp_path),
+            sdk_runner=runner,
+            max_retries=2,
+        )
+
+        with pytest.raises(RuntimeError, match="still failing"):
+            dispatcher.dispatch(
+                "planner", "design",
+                output_path=tmp_path / "runs" / "iter-1" / "design_log.md",
+                iteration=1,
+            )
+
+        retry_log = _read_jsonl(tmp_path / "retry_log.jsonl")
+        # Three failures = three retry-log rows.
+        assert len(retry_log) == 3
+
+
+# ─── #122 Phase B: methodology preamble cached as system_prompt ────────────
+
+class TestMethodologyPreambleCached:
+    """When the methodology files are on disk, SDKDispatcher loads them as
+    a single ``system_prompt`` so the Anthropic API marks them cached.
+    Tests assert the wiring contract: same system_prompt across calls,
+    placeholders stripped (otherwise dynamic content in system_prompt
+    would bust the cache)."""
+
+    def test_runner_receives_preamble_in_system_prompt(self, tmp_path):
+        prompts_dir = tmp_path / "prompts"
+        prompts_dir.mkdir()
+        # Use a placeholder that IS in the dispatcher's context so the
+        # regular template-load path doesn't reject it; the preamble
+        # loader still strips them before placing in system_prompt.
+        (prompts_dir / "design.md").write_text(
+            "# Design methodology\n\nStable text for {{target_system}}.\n"
+        )
+        (prompts_dir / "execute_analyze.md").write_text(
+            "# Execute methodology\n\nMore stable text for {{target_system}}.\n"
+        )
+
+        captured: list[dict] = []
+
+        def runner(**kwargs):
+            captured.append(kwargs)
+            return SDKResult(text="ok")
+
+        dispatcher = SDKDispatcher(
+            work_dir=tmp_path,
+            campaign=_make_campaign(tmp_path),
+            sdk_runner=runner,
+            prompts_dir=prompts_dir,
+        )
+        dispatcher.dispatch(
+            "planner", "design",
+            output_path=tmp_path / "runs" / "iter-1" / "design_log.md",
+            iteration=1,
+        )
+
+        assert len(captured) == 1
+        sp = captured[0]["system_prompt"]
+        assert sp is not None
+        assert "Design methodology" in sp
+        assert "Execute methodology" in sp
+        # Placeholders are stripped — dynamic content lives in the user
+        # message; otherwise the cache would never hit.
+        assert "{{target_system}}" not in sp
+        assert "{{" not in sp
+
+    def test_two_calls_reuse_same_system_prompt(self, tmp_path):
+        prompts_dir = tmp_path / "prompts"
+        prompts_dir.mkdir()
+        (prompts_dir / "design.md").write_text(
+            "# Design methodology\n\nText for {{target_system}}.\n"
+        )
+
+        captured: list[dict] = []
+
+        def runner(**kwargs):
+            captured.append(kwargs)
+            return SDKResult(text="ok")
+
+        dispatcher = SDKDispatcher(
+            work_dir=tmp_path,
+            campaign=_make_campaign(tmp_path),
+            sdk_runner=runner,
+            prompts_dir=prompts_dir,
+        )
+        for i in range(1, 3):
+            dispatcher.dispatch(
+                "planner", "design",
+                output_path=tmp_path / "runs" / f"iter-{i}" / "design_log.md",
+                iteration=i,
+            )
+
+        # Same system_prompt across both calls — the property the cache
+        # relies on.
+        assert captured[0]["system_prompt"] == captured[1]["system_prompt"]
+
+
+# ─── Error result path ──────────────────────────────────────────────────────
+
+class TestSDKDispatchErrorResult:
+    """When the SDK returns is_error=True (e.g. API rejected the request),
+    the dispatcher treats it as transient unless explicitly fatal."""
+
+    def test_is_error_treated_as_transient_and_retried(self, tmp_path, monkeypatch):
+        monkeypatch.setattr(
+            "orchestrator.sdk_dispatch.time.sleep", lambda _s: None,
+        )
+        runner = _ScriptedRunner([
+            SDKResult(text="", is_error=True, error_message="rate limit exceeded"),
+            SDKResult(text="finally got through", input_tokens=10, output_tokens=5),
+        ])
+        dispatcher = SDKDispatcher(
+            work_dir=tmp_path,
+            campaign=_make_campaign(tmp_path),
+            sdk_runner=runner,
+            max_retries=3,
+        )
+
+        out = tmp_path / "runs" / "iter-1" / "design_log.md"
+        dispatcher.dispatch("planner", "design", output_path=out, iteration=1)
+
+        assert "finally got through" in out.read_text()
+
+        retry_log = _read_jsonl(tmp_path / "retry_log.jsonl")
+        assert len(retry_log) == 1
+        assert "rate limit exceeded" in retry_log[0]["error"]
diff --git a/tests/test_settings_template.py b/tests/test_settings_template.py
new file mode 100644
index 0000000..ee12b9f
--- /dev/null
+++ b/tests/test_settings_template.py
@@ -0,0 +1,216 @@
+"""Behavioral tests for the per-campaign permission policy (issue #135).
+
+These tests describe the contract of ``render_campaign_settings`` and
+``write_campaign_settings``: given inputs (work_dir, repo_path, plan,
+hook paths), the resulting on-disk ``.claude/settings.json`` has
+specific, externally-visible properties — what's in ``allowOnly``,
+which Bash commands are allowed, where outbound network is denied.
+
+No assertions here about how the function organized its work, what
+helpers it called, or the literal Python control flow. The contract
+is the file's contents.
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from orchestrator.settings_template import (
+    render_campaign_settings,
+    settings_path_for,
+    write_campaign_settings,
+)
+
+
+# ─── Generator: shape of the returned dict ──────────────────────────────────
+
+class TestRenderCampaignSettings:
+
+    def test_allow_only_includes_work_dir(self, tmp_path):
+        work_dir = tmp_path / "campaign-A"
+        work_dir.mkdir()
+
+        settings = render_campaign_settings(work_dir=work_dir)
+
+        assert str(work_dir.resolve()) in settings["permissions"]["allowOnly"]
+
+    def test_allow_only_includes_repo_path_when_provided(self, tmp_path):
+        work_dir = tmp_path / "campaign-A"
+        repo = tmp_path / "target-repo"
+        work_dir.mkdir()
+        repo.mkdir()
+
+        settings = render_campaign_settings(work_dir=work_dir, repo_path=repo)
+
+        allow_only = settings["permissions"]["allowOnly"]
+        assert str(work_dir.resolve()) in allow_only
+        assert str(repo.resolve()) in allow_only
+
+    def test_default_bin_allowlist_contains_python_and_git(self, tmp_path):
+        settings = render_campaign_settings(work_dir=tmp_path)
+
+        allow = settings["permissions"]["allow"]
+        # Each Bash allow entry has shape ``Bash(<bin>:*)`` — assert a few
+        # canonical ones are present without prescribing their order.
+        assert any("Bash(python:*)" == entry for entry in allow)
+        assert any("Bash(git:*)" == entry for entry in allow)
+        assert any("Bash(grep:*)" == entry for entry in allow)
+
+    def test_plan_binaries_added_to_allowlist(self, tmp_path):
+        plan = {
+            "arms": [
+                {
+                    "arm_id": "h-main",
+                    "conditions": [
+                        {"name": "baseline", "command": "./blis run --workload x"},
+                        {"name": "treatment", "command": "/usr/local/bin/sim --batch=4"},
+                    ],
+                },
+            ],
+        }
+        settings = render_campaign_settings(
+            work_dir=tmp_path, experiment_plan=plan,
+        )
+        allow = settings["permissions"]["allow"]
+
+        assert "Bash(blis:*)" in allow
+        assert "Bash(sim:*)" in allow
+
+    def test_extra_bin_allowlist_extends_defaults(self, tmp_path):
+        settings = render_campaign_settings(
+            work_dir=tmp_path,
+            extra_bin_allowlist=["custom-bench", "trace-tool"],
+        )
+        allow = settings["permissions"]["allow"]
+
+        assert "Bash(custom-bench:*)" in allow
+        assert "Bash(trace-tool:*)" in allow
+        # Defaults still present.
+        assert "Bash(git:*)" in allow
+
+    def test_deny_blocks_outbound_https(self, tmp_path):
+        settings = render_campaign_settings(work_dir=tmp_path)
+
+        deny = settings["permissions"]["deny"]
+        assert any("https" in entry for entry in deny)
+
+    def test_no_hooks_section_when_no_hook_paths(self, tmp_path):
+        settings = render_campaign_settings(work_dir=tmp_path)
+
+        assert "hooks" not in settings
+
+    def test_stop_hook_registered_when_path_provided(self, tmp_path):
+        hook = tmp_path / "bin" / "nous-execute-stop"
+        hook.parent.mkdir(parents=True)
+        hook.write_text("#!/bin/sh\nexit 0\n")
+
+        settings = render_campaign_settings(
+            work_dir=tmp_path, stop_hook_path=hook,
+        )
+
+        assert "Stop" in settings["hooks"]
+        stop_cfg = settings["hooks"]["Stop"]
+        assert stop_cfg[0]["hooks"][0]["command"] == str(hook.resolve())
+        assert stop_cfg[0]["hooks"][0]["type"] == "command"
+
+    def test_pre_tool_use_hook_registered_when_path_provided(self, tmp_path):
+        hook = tmp_path / "bin" / "nous-plan-enforcer"
+        hook.parent.mkdir(parents=True)
+        hook.write_text("#!/bin/sh\nexit 0\n")
+
+        settings = render_campaign_settings(
+            work_dir=tmp_path, pre_tool_use_hook_path=hook,
+        )
+
+        assert "PreToolUse" in settings["hooks"]
+        ptu = settings["hooks"]["PreToolUse"]
+        assert ptu[0]["matcher"] == "Bash"
+        assert ptu[0]["hooks"][0]["command"] == str(hook.resolve())
+
+
+# ─── Disk write ─────────────────────────────────────────────────────────────
+
+class TestWriteCampaignSettings:
+
+    def test_write_creates_parent_dir_and_writes_json(self, tmp_path):
+        work_dir = tmp_path / "campaign-X"
+        work_dir.mkdir()
+        settings = render_campaign_settings(work_dir=work_dir)
+
+        target = settings_path_for(work_dir)
+        path = write_campaign_settings(target, settings)
+
+        assert path.exists()
+        # Re-read and confirm round-trip equivalence — that's the contract:
+        # whatever the renderer produced is what's on disk.
+        on_disk = json.loads(path.read_text())
+        assert on_disk == settings
+
+    def test_settings_path_for_returns_dot_claude_subdir(self, tmp_path):
+        path = settings_path_for(tmp_path)
+
+        assert path.parent.name == ".claude"
+        assert path.name == "settings.json"
+
+
+# ─── No-`--dangerously` invariant ───────────────────────────────────────────
+
+class TestSetupWorkDirWritesSettings:
+    """Init-time wiring: ``setup_work_dir`` writes ``.claude/settings.json``
+    so the dispatcher can pick it up automatically."""
+
+    def test_init_writes_settings_in_dot_claude(self, tmp_path):
+        from orchestrator.iteration import setup_work_dir
+
+        repo = tmp_path / "target-repo"
+        repo.mkdir()
+        work_dir = setup_work_dir("run-123", repo_path=str(repo))
+
+        settings_path = work_dir / ".claude" / "settings.json"
+        assert settings_path.exists()
+
+        on_disk = json.loads(settings_path.read_text())
+        # work_dir and repo are both in allowOnly.
+        assert str(work_dir.resolve()) in on_disk["permissions"]["allowOnly"]
+        assert str(repo.resolve()) in on_disk["permissions"]["allowOnly"]
+
+    def test_init_does_not_overwrite_existing_settings(self, tmp_path):
+        from orchestrator.iteration import setup_work_dir
+
+        repo = tmp_path / "target-repo"
+        repo.mkdir()
+        work_dir = Path(repo) / ".nous" / "run-456"
+        work_dir.mkdir(parents=True)
+        settings_dir = work_dir / ".claude"
+        settings_dir.mkdir()
+        custom_settings = {"permissions": {"allowOnly": ["/custom"], "allow": [], "deny": []}}
+        (settings_dir / "settings.json").write_text(json.dumps(custom_settings))
+
+        # Re-running setup must NOT clobber the user's hand edits.
+        setup_work_dir("run-456", repo_path=str(repo))
+
+        on_disk = json.loads((settings_dir / "settings.json").read_text())
+        assert on_disk == custom_settings
+
+
+class TestNoDangerouslyFlag:
+    """Settings file is the *replacement* for ``--dangerously-skip-permissions``.
+
+    The contract is: when the dispatcher invokes claude with ``--settings <path>``
+    and this file is at <path>, the agent operates under deny-by-default rules
+    rather than auto-approval. We assert the produced file imposes a non-empty
+    allowOnly and at least one deny rule — the two properties that make the
+    settings file *meaningfully* restrictive vs ``--dangerously``.
+    """
+
+    def test_settings_imposes_allowonly_and_deny(self, tmp_path):
+        settings = render_campaign_settings(work_dir=tmp_path)
+
+        assert settings["permissions"]["allowOnly"], (
+            "allowOnly must be non-empty; otherwise everything is permitted, "
+            "which is the very property --dangerously gave us."
+        )
+        assert settings["permissions"]["deny"], (
+            "deny must be non-empty so writes/network outside the worktree "
+            "are blocked."
+        )
diff --git a/tests/test_status.py b/tests/test_status.py
new file mode 100644
index 0000000..94479a8
--- /dev/null
+++ b/tests/test_status.py
@@ -0,0 +1,257 @@
+"""Behavioral tests for the status snapshot reader (#127 Phase A).
+
+Tests synthesize a campaign work-dir on disk, set timestamps explicitly
+(via os.utime), and assert on the returned ``StatusSnapshot`` and the
+two formatter outputs. Determinism comes from injected ``now=`` and
+explicit mtimes — no real wall-clock dependency.
+"""
+from __future__ import annotations
+
+import json
+import os
+from pathlib import Path
+
+from orchestrator.status import (
+    StatusSnapshot,
+    format_one_liner,
+    format_watch_panel,
+    read_status_snapshot,
+)
+
+
+def _write_state(work_dir: Path, *, run_id: str, phase: str, iteration: int) -> None:
+    work_dir.mkdir(parents=True, exist_ok=True)
+    (work_dir / "state.json").write_text(json.dumps({
+        "run_id": run_id, "phase": phase, "iteration": iteration,
+    }))
+
+
+def _write_ledger(work_dir: Path, completed: int) -> None:
+    rows = [{"iteration": i + 1, "outcome": "experiment_valid"}
+            for i in range(completed)]
+    (work_dir / "ledger.json").write_text(json.dumps({"iterations": rows}))
+
+
+def _write_principles(work_dir: Path, principles: list[dict]) -> None:
+    (work_dir / "principles.json").write_text(json.dumps({
+        "principles": principles,
+    }))
+
+
+def _write_log(work_dir: Path, iteration: int, events: list[dict], mtime: float) -> Path:
+    iter_dir = work_dir / "runs" / f"iter-{iteration}"
+    iter_dir.mkdir(parents=True, exist_ok=True)
+    log = iter_dir / "executor_log.jsonl"
+    log.write_text("\n".join(json.dumps(e) for e in events) + "\n")
+    os.utime(log, (mtime, mtime))
+    return log
+
+
+# ─── Snapshot reader ────────────────────────────────────────────────────────
+
+class TestReadSnapshot:
+
+    def test_minimal_state_only(self, tmp_path):
+        _write_state(tmp_path, run_id="r1", phase="DESIGN", iteration=1)
+
+        snap = read_status_snapshot(tmp_path)
+        assert snap.run_id == "r1"
+        assert snap.phase == "DESIGN"
+        assert snap.iteration == 1
+        assert snap.completed_iterations == 0
+        assert snap.last_event is None
+        assert snap.stuck is False
+
+    def test_completed_iterations_from_ledger(self, tmp_path):
+        _write_state(tmp_path, run_id="r1", phase="DONE", iteration=3)
+        _write_ledger(tmp_path, completed=3)
+
+        snap = read_status_snapshot(tmp_path)
+        assert snap.completed_iterations == 3
+
+    def test_active_principles_excludes_retired(self, tmp_path):
+        _write_state(tmp_path, run_id="r1", phase="DESIGN", iteration=2)
+        _write_principles(tmp_path, [
+            {"id": "p1", "status": "active"},
+            {"id": "p2", "status": "retired"},
+            {"id": "p3", "status": "active"},
+        ])
+
+        snap = read_status_snapshot(tmp_path)
+        assert snap.active_principles == 2
+
+    def test_last_event_picked_up_from_executor_log(self, tmp_path):
+        _write_state(tmp_path, run_id="r1", phase="EXECUTE_ANALYZE", iteration=1)
+        mtime = 1_000_000.0
+        _write_log(tmp_path, 1, [
+            {"tool_name": "Bash", "ts": "..."},
+            {"tool_name": "Edit", "ts": "..."},
+        ], mtime=mtime)
+
+        snap = read_status_snapshot(tmp_path, now=mtime + 30)
+        assert snap.last_event["tool_name"] == "Edit"
+        assert 25 <= snap.elapsed_since_last_event <= 35
+        assert snap.stuck is False
+
+    def test_stuck_flag_set_after_threshold(self, tmp_path):
+        _write_state(tmp_path, run_id="r1", phase="EXECUTE_ANALYZE", iteration=1)
+        mtime = 1_000_000.0
+        _write_log(tmp_path, 1, [{"tool_name": "Bash"}], mtime=mtime)
+
+        snap = read_status_snapshot(tmp_path, now=mtime + 6 * 60)
+        assert snap.stuck is True
+        assert snap.elapsed_since_last_event > 5 * 60
+
+    def test_corrupt_state_json_does_not_crash(self, tmp_path):
+        (tmp_path / "state.json").write_text("not json")
+        snap = read_status_snapshot(tmp_path)
+        assert snap.run_id == "?"
+        assert snap.stuck is False
+
+    def test_corrupt_executor_log_lines_skipped(self, tmp_path):
+        _write_state(tmp_path, run_id="r1", phase="EXECUTE_ANALYZE", iteration=1)
+        iter_dir = tmp_path / "runs" / "iter-1"
+        iter_dir.mkdir(parents=True)
+        log = iter_dir / "executor_log.jsonl"
+        log.write_text(
+            json.dumps({"tool_name": "Bash"}) + "\n"
+            "not json\n"
+            + json.dumps({"tool_name": "Edit"}) + "\n"
+        )
+        os.utime(log, (1_000_000.0, 1_000_000.0))
+
+        snap = read_status_snapshot(tmp_path, now=1_000_000.0 + 5)
+        # The last *valid* event is what wins — the corrupt line in the
+        # middle is skipped.
+        assert snap.last_event["tool_name"] == "Edit"
+
+
+# ─── #127 Phase B: SDK event tee wiring ────────────────────────────────────
+
+class TestSDKEventTeeIntegration:
+    """SDKDispatcher passes event_log_path to its runner so the runner
+    can append every SDK message as a JSONL row that the status reader
+    picks up. Verify the wiring contract."""
+
+    def _campaign(self, repo_path: Path) -> dict:
+        return {
+            "research_question": "?",
+            "target_system": {
+                "name": "test", "description": "test",
+                "repo_path": str(repo_path),
+            },
+        }
+
+    def test_runner_receives_event_log_path_for_iteration(self, tmp_path):
+        from orchestrator.sdk_dispatch import SDKDispatcher, SDKResult
+
+        captured: list[dict] = []
+
+        def runner(**kwargs):
+            captured.append(kwargs)
+            return SDKResult(text="ok")
+
+        dispatcher = SDKDispatcher(
+            work_dir=tmp_path,
+            campaign=self._campaign(tmp_path),
+            sdk_runner=runner,
+        )
+        dispatcher.dispatch(
+            "planner", "design",
+            output_path=tmp_path / "runs" / "iter-3" / "design_log.md",
+            iteration=3,
+        )
+
+        elp = captured[0]["event_log_path"]
+        assert elp == tmp_path / "runs" / "iter-3" / "executor_log.jsonl"
+
+    def test_each_iteration_gets_its_own_event_log(self, tmp_path):
+        from orchestrator.sdk_dispatch import SDKDispatcher, SDKResult
+
+        captured: list[dict] = []
+
+        def runner(**kwargs):
+            captured.append(kwargs)
+            return SDKResult(text="ok")
+
+        dispatcher = SDKDispatcher(
+            work_dir=tmp_path,
+            campaign=self._campaign(tmp_path),
+            sdk_runner=runner,
+        )
+        dispatcher.dispatch(
+            "planner", "design",
+            output_path=tmp_path / "runs" / "iter-1" / "design_log.md",
+            iteration=1,
+        )
+        dispatcher.dispatch(
+            "planner", "design",
+            output_path=tmp_path / "runs" / "iter-2" / "design_log.md",
+            iteration=2,
+        )
+
+        assert "iter-1" in str(captured[0]["event_log_path"])
+        assert "iter-2" in str(captured[1]["event_log_path"])
+
+
+# ─── Formatters ─────────────────────────────────────────────────────────────
+
+class TestFormatOneLiner:
+
+    def test_single_line_no_newlines(self):
+        snap = StatusSnapshot(
+            run_id="saturation-detect", phase="EXECUTE_ANALYZE", iteration=2,
+            completed_iterations=1, active_principles=5,
+            last_event={"tool_name": "Bash"},
+        )
+        out = format_one_liner(snap)
+        assert "\n" not in out
+        assert "saturation-detect" in out
+        assert "EXECUTE_ANALYZE" in out
+        assert "iter 2" in out
+        assert "Bash" in out
+
+    def test_stuck_marker_appears(self):
+        snap = StatusSnapshot(
+            run_id="r1", phase="EXECUTE_ANALYZE", iteration=1,
+            stuck=True, last_event={"tool_name": "Bash"},
+        )
+        assert "STUCK" in format_one_liner(snap)
+
+    def test_stable_when_no_new_events(self):
+        snap = StatusSnapshot(
+            run_id="r1", phase="DESIGN", iteration=1,
+            completed_iterations=0, active_principles=0,
+        )
+        # Two consecutive renderings of the same snapshot — must match
+        # exactly. This is the property prompt-embedders rely on.
+        assert format_one_liner(snap) == format_one_liner(snap)
+
+
+class TestFormatWatchPanel:
+
+    def test_multi_line_panel_includes_phase_iter_principles(self):
+        snap = StatusSnapshot(
+            run_id="r1", phase="DESIGN", iteration=2,
+            completed_iterations=1, active_principles=3,
+        )
+        out = format_watch_panel(snap)
+        assert "Phase:" in out
+        assert "DESIGN" in out
+        assert "Iteration:" in out
+        assert "Principles" in out
+
+    def test_stuck_warning_rendered_distinctly(self):
+        snap = StatusSnapshot(
+            run_id="r1", phase="EXECUTE_ANALYZE", iteration=1,
+            last_event={"tool_name": "Bash"},
+            elapsed_since_last_event=400,
+            stuck=True,
+        )
+        out = format_watch_panel(snap)
+        assert "STUCK" in out
+
+    def test_no_events_renders_placeholder(self):
+        snap = StatusSnapshot(run_id="r1", phase="DESIGN", iteration=1)
+        out = format_watch_panel(snap)
+        assert "no events" in out.lower() or "(no events" in out
diff --git a/tests/test_worktree_gc.py b/tests/test_worktree_gc.py
new file mode 100644
index 0000000..60ebaed
--- /dev/null
+++ b/tests/test_worktree_gc.py
@@ -0,0 +1,198 @@
+"""Behavioral tests for orphan-worktree GC (#133 Phase A).
+
+Synthesizes ``<repo>/.nous-experiments/<id>`` directories with controlled
+mtimes and PID files, calls gc_orphan_worktrees, asserts which were
+removed. Tests inject a fake clock + fake pid_check so they're
+deterministic across machines.
+"""
+from __future__ import annotations
+
+import os
+import subprocess
+from pathlib import Path
+
+from orchestrator.worktree import gc_orphan_worktrees
+
+
+def _init_git_repo(repo: Path) -> None:
+    repo.mkdir(parents=True, exist_ok=True)
+    subprocess.run(["git", "init", "-q"], cwd=repo, check=True)
+    subprocess.run(["git", "config", "user.email", "t@t"], cwd=repo, check=True)
+    subprocess.run(["git", "config", "user.name", "t"], cwd=repo, check=True)
+    (repo / "f.txt").write_text("x")
+    subprocess.run(["git", "add", "."], cwd=repo, check=True, capture_output=True)
+    subprocess.run(
+        ["git", "commit", "-q", "-m", "init"], cwd=repo, check=True,
+        capture_output=True,
+    )
+
+
+def _make_worktree_dir(
+    repo: Path, exp_id: str, *, mtime: float, pid: int | None = None,
+) -> Path:
+    d = repo / ".nous-experiments" / exp_id
+    d.mkdir(parents=True, exist_ok=True)
+    (d / "marker").write_text("x")
+    if pid is not None:
+        (d / ".nous-pid").write_text(str(pid))
+    os.utime(d, (mtime, mtime))
+    return d
+
+
+class TestGcOrphanWorktrees:
+
+    def test_no_experiments_dir_returns_empty(self, tmp_path):
+        _init_git_repo(tmp_path)
+        assert gc_orphan_worktrees(tmp_path) == []
+
+    def test_removes_old_worktree_with_no_pid_file(self, tmp_path):
+        _init_git_repo(tmp_path)
+        old_mtime = 1000.0  # well in the past
+        _make_worktree_dir(tmp_path, "iter-1-aaaa", mtime=old_mtime)
+
+        removed = gc_orphan_worktrees(
+            tmp_path, max_age_seconds=60, now=old_mtime + 3600,
+        )
+
+        assert removed == ["iter-1-aaaa"]
+        assert not (tmp_path / ".nous-experiments" / "iter-1-aaaa").exists()
+
+    def test_keeps_recent_worktree(self, tmp_path):
+        _init_git_repo(tmp_path)
+        recent = 5000.0
+        _make_worktree_dir(tmp_path, "iter-2-bbbb", mtime=recent)
+
+        removed = gc_orphan_worktrees(
+            tmp_path, max_age_seconds=3600, now=recent + 30,
+        )
+
+        assert removed == []
+        assert (tmp_path / ".nous-experiments" / "iter-2-bbbb").exists()
+
+    def test_keeps_old_worktree_when_pid_alive(self, tmp_path):
+        _init_git_repo(tmp_path)
+        old = 1000.0
+        _make_worktree_dir(tmp_path, "iter-3-cccc", mtime=old, pid=12345)
+
+        # Inject an "always alive" pid_check; the dir should be kept
+        # despite being older than max_age_seconds.
+        removed = gc_orphan_worktrees(
+            tmp_path, max_age_seconds=60, now=old + 3600,
+            pid_check=lambda pid: True,
+        )
+
+        assert removed == []
+        assert (tmp_path / ".nous-experiments" / "iter-3-cccc").exists()
+
+    def test_removes_old_worktree_when_pid_dead(self, tmp_path):
+        _init_git_repo(tmp_path)
+        old = 1000.0
+        _make_worktree_dir(tmp_path, "iter-4-dddd", mtime=old, pid=12345)
+
+        removed = gc_orphan_worktrees(
+            tmp_path, max_age_seconds=60, now=old + 3600,
+            pid_check=lambda pid: False,
+        )
+
+        assert removed == ["iter-4-dddd"]
+        assert not (tmp_path / ".nous-experiments" / "iter-4-dddd").exists()
+
+    def test_invalid_pid_file_treated_as_no_pid(self, tmp_path):
+        _init_git_repo(tmp_path)
+        old = 1000.0
+        d = _make_worktree_dir(tmp_path, "iter-5-eeee", mtime=old)
+        (d / ".nous-pid").write_text("not-an-int")
+        os.utime(d, (old, old))
+
+        removed = gc_orphan_worktrees(
+            tmp_path, max_age_seconds=60, now=old + 3600,
+        )
+        assert removed == ["iter-5-eeee"]
+
+    def test_multiple_worktrees_partial_removal_is_sorted(self, tmp_path):
+        _init_git_repo(tmp_path)
+        old = 1000.0
+        recent = 5000.0
+        _make_worktree_dir(tmp_path, "iter-1-aaaa", mtime=old)
+        _make_worktree_dir(tmp_path, "iter-2-bbbb", mtime=recent)
+        _make_worktree_dir(tmp_path, "iter-3-cccc", mtime=old)
+
+        removed = gc_orphan_worktrees(
+            tmp_path, max_age_seconds=60, now=recent + 30,
+        )
+        # recent (iter-2) should still exist; old ones gone.
+        assert removed == ["iter-1-aaaa", "iter-3-cccc"]
+        assert (tmp_path / ".nous-experiments" / "iter-2-bbbb").exists()
+
+    def test_zero_leftover_worktrees_after_gc_for_age_match(self, tmp_path):
+        """Acceptance criterion: <repo>/.nous-experiments/ has zero
+        leftover entries after a multi-arm campaign that GC'd everything."""
+        _init_git_repo(tmp_path)
+        old = 1000.0
+        for i in range(5):
+            _make_worktree_dir(tmp_path, f"iter-{i}-x", mtime=old)
+
+        gc_orphan_worktrees(tmp_path, max_age_seconds=60, now=old + 3600)
+
+        leftovers = [
+            p for p in (tmp_path / ".nous-experiments").iterdir() if p.is_dir()
+        ]
+        assert leftovers == []
+
+
+# ─── Phase B: harness-isolated subagent runner factory ─────────────────────
+
+
+class TestMakeIsolatedArmRunner:
+    """The factory returns an ArmRunner-shaped callable that delegates to
+    the injected sdk_runner with isolation=worktree. Tests assert what
+    the runner sends to the SDK and how it interprets the response —
+    never that internal helpers were called."""
+
+    def _unit(self):
+        # Local stand-in for parallel_arms.ArmUnit so this test runs on
+        # the #133 branch before #123's parallel_arms.py lands. The real
+        # ArmUnit is duck-compatible with this shape.
+        from dataclasses import dataclass
+
+        @dataclass(frozen=True)
+        class _Unit:
+            arm_id: str
+            seed: str
+            condition_name: str
+            command: str
+
+            @property
+            def relative_results_dir(self) -> str:
+                return f"results/{self.arm_id}/{self.seed}"
+
+        return _Unit("h-main", "s1", "x", "./blis run")
+
+    def test_returns_callable(self, tmp_path):
+        try:
+            from orchestrator.parallel_arms import ArmUnit  # noqa: F401
+        except ImportError:
+            import pytest
+            pytest.skip("parallel_arms not on this branch yet (lands in #123)")
+        from orchestrator.worktree import make_isolated_arm_runner
+
+        runner = make_isolated_arm_runner(
+            sdk_runner=lambda **kw: None,
+            repo_path=tmp_path,
+            iter_dir=tmp_path / "iter-1",
+        )
+        assert callable(runner)
+
+    def test_factory_accepts_documented_kwargs(self, tmp_path):
+        """The factory's keyword surface is the public contract."""
+        from orchestrator.worktree import make_isolated_arm_runner
+        # Just verify the signature accepts what the docstring promises;
+        # construction must not raise.
+        make_isolated_arm_runner(
+            sdk_runner=lambda **kw: None,
+            repo_path=tmp_path,
+            iter_dir=tmp_path,
+            model="claude-sonnet-4-6",
+            max_turns=10,
+            subagent_type="claude",
+        )