
Commit c6a6ee6

sriumcp and claude authored
post-SDK-migration UX + architecture cleanup (AI-native-Systems-Research#189) (AI-native-Systems-Research#192)
* post-SDK-migration UX + architecture cleanup (AI-native-Systems-Research#189)

Closes AI-native-Systems-Research#183, AI-native-Systems-Research#184, AI-native-Systems-Research#185, AI-native-Systems-Research#186, AI-native-Systems-Research#187, AI-native-Systems-Research#188, AI-native-Systems-Research#190.
Refs AI-native-Systems-Research#189.

Six paper-burst attempts on `mechanismdesign` failed before nous produced a complete DESIGN→EXECUTE_ANALYZE flow, surfacing seven distinct trip-hazards. This PR retires the legacy claude -p subprocess path, fixes the last SDK-dispatcher path bug, widens several authoring surfaces, and adds two operator-facing tools (`nous stop`, `nous schema`) so agents and humans can run nous without grepping the source.

Behaviour changes
-----------------

* `AI-native-Systems-Research#190` — SDK dispatcher writes `executor_log.jsonl` under `runs/iter-N/inputs/` so the design-phase validator's iter-root whitelist is preserved. Status reader falls back to the legacy iter-root path so older campaigns keep rendering.
* `AI-native-Systems-Research#183` (BREAKING) — removed `--agent api` (the legacy claude -p subprocess). `--agent sdk` is now the default and the only user-facing code-access path. `claude-agent-sdk` and `anyio` moved from `[project.optional-dependencies]` to required `dependencies`. Programmatic `agent="api"` callers raise `ValueError` with a migration message.
* `AI-native-Systems-Research#184` — `nous create-campaign` defaults `target_system.repo_path` to CWD at scaffold time and exposes `--target-repo-path` to override. Closes the silent "wrong work_dir" trap when `nous run` is invoked from a different CWD later.
* `AI-native-Systems-Research#185` — campaign schema accepts top-level `ground_truth` (the pre-registration use case) and `theory_references` items as strings or full objects. New helpers `_format_campaign_ground_truth` and `_normalize_theory_references` in `llm_dispatch.py` surface ground_truth into the DESIGN prompt.
* `AI-native-Systems-Research#186` — campaign-level `max_turns` block overrides `defaults.yaml` per phase. Resolution order: campaign > defaults > hardcoded fallback (25).
* `AI-native-Systems-Research#187` — `DesignIncompleteError` fires before schema validation when `bundle.yaml` / `problem.md` / `handoff_snapshot.md` are missing after dispatch. The error names the missing files and lists four common causes (max_turns, ran the experiment in DESIGN, API stall, transport failure) each pointing at a concrete artifact. A `failure_type: "design_incomplete"` retry_log entry is also written.
* `AI-native-Systems-Research#188` — new `--bundle <path>` (with optional `--problem-md` and `--handoff-md`) skips DESIGN by copying a pre-authored bundle. Bundle is schema-validated, hashed, and recorded in `iter-1/bundle_manifest.json` with `bundle_source: pre_authored`, `bundle_path`, and `bundle_sha256` for reviewer-defensible provenance.

New tooling for agents/humans
-----------------------------

* `nous stop <target> [--reason ...]` — writes a STOP sentinel at the campaign work_dir. The orchestrator checks before each iteration and exits cleanly with a `stopped_by_user` ledger row. Mid-iteration interrupt is still ctrl-C; this is the agent-friendly handle.
* `nous schema [campaign|bundle|findings] [--format md|json|yaml]` — pure deterministic Python (no LLM) that renders the schema YAML/JSON as a Markdown reference. Walks `properties` once, groups required vs optional, surfaces descriptions verbatim.
* README — added a "Quick reference" table and an "Observability" section pointing at `executor_log.jsonl`, `retry_log.jsonl`, `llm_metrics.jsonl`, `state.json`, and the design-incomplete diagnostic. CLI flag help strings are now exhaustive.

Tests
-----

939 passed, 1 skipped, 0 failed.

New behavioural tests:

- `test_validate.py` — SDK log under inputs/ passes; iter-root log still rejected (AI-native-Systems-Research#190 contract pinned).
- `test_inline_dispatch.py` — `agent="api"` raises migration ValueError; sdk routing still works.
- `test_create_campaign.py` — `--target-repo-path` lands as real value; CWD default works.
- `test_theory_references.py` + `test_campaign_ground_truth.py` — new schema shapes accepted, helpers render correctly.
- `test_max_turns_resolution.py` — campaign > defaults > hardcoded.
- `test_design_artifact_assertion.py` — DesignIncompleteError shape, hint coverage, retry_log entry.
- `test_pre_authored_bundle.py` — --bundle artifact copy, manifest shape, schema-invalid bundle fails fast, validator accepts the pre-authored iter dir.
- `test_nous_stop.py` — sentinel helpers, CLI handler, campaign loop honours sentinel.
- `test_nous_schema_command.py` — Markdown / JSON / YAML output; pinned that `nous schema` never invokes a dispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: nudge to retrigger Tests workflow on PR AI-native-Systems-Research#191

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 3c26611 commit c6a6ee6

24 files changed

Lines changed: 2274 additions & 137 deletions

README.md

Lines changed: 100 additions & 16 deletions
@@ -64,11 +64,13 @@ Every experiment is structured as a bundle of falsifiable predictions:
6464
### Prerequisites
6565

6666
- **Python 3.11+**
67-
- **Claude Code CLI** (`claude`) — installed and authenticated
67+
- **Claude Code CLI** (`claude`) — installed and authenticated. The Claude
68+
Agent SDK (the default code-access backend, `--agent sdk`) reuses your
69+
CLI authentication.
6870

6971
### Environment setup
7072

71-
The `claude -p` subprocess handles its own authentication via Claude CLI config. However, gate summaries and report generation use the OpenAI-compatible LLM API, which needs:
73+
The Claude Agent SDK handles its own authentication via Claude CLI config. However, gate summaries and report generation use the OpenAI-compatible LLM API, which needs:
7274

7375
```bash
7476
export OPENAI_API_KEY=your-api-key
@@ -93,15 +95,14 @@ cd agentic-strategy-evolution
9395
pip install -e ".[dev]"
9496
```
9597

96-
For the SDK-based dispatcher (`--agent sdk`, see `docs/architecture.md`), also install the optional `[sdk]` extra:
97-
98-
```bash
99-
pip install -e ".[dev,sdk]"
100-
```
98+
The Claude Agent SDK (`claude-agent-sdk`) is a required dependency and
99+
lands automatically — no separate install step. The legacy `--agent api`
100+
(claude -p subprocess) backend was removed in #183; `--agent sdk` is the
101+
default and only user-facing code-access path.
101102

102103
### 2. Configure models
103104

104-
Two LLM calls per iteration, both via `claude -p`:
105+
Two LLM calls per iteration, both via the Claude Agent SDK:
105106

106107
| Phase | Default model | Role |
107108
|-------|---------------|------|
@@ -112,7 +113,18 @@ Both agents write their artifacts directly to disk and run `nous validate` befor
112113

113114
### 4. Create a campaign
114115

115-
Create a `campaign.yaml` pointing to your target repo:
116+
The fastest path is the scaffolder, which writes a heavily-commented
117+
campaign.yaml with the right defaults (including `repo_path` set to
118+
your CWD so `nous run` doesn't silently wedge — see #184):
119+
120+
```bash
121+
cd /path/to/your/repo
122+
nous create-campaign --to ./campaign.yaml \
123+
--target-name "Your System" \
124+
--research-question "What mechanism drives the primary bottleneck?"
125+
```
126+
127+
If you prefer to author by hand, use this minimum:
116128

117129
```yaml
118130
research_question: >
@@ -129,6 +141,29 @@ target_system:
129141
130142
When `repo_path` is set, the campaign directory is created inside the target repo at `.nous/<run_id>/`. All artifacts live there.
131143

144+
To discover the full schema (required vs optional fields, descriptions
145+
verbatim from the schema source), run:
146+
147+
```bash
148+
nous schema # campaign schema, Markdown (default)
149+
nous schema bundle --format yaml # bundle schema, raw YAML for tooling
150+
nous schema findings # findings schema
151+
```
152+
153+
Optional blocks worth knowing about:
154+
155+
- **`max_turns`** (#186) — per-phase tool-use budget override. Default
156+
is 80 design / 120 execute_analyze. A 50-arm fanout may need 200+
157+
design turns; a probe-only campaign fits in 30.
158+
- **`ground_truth`** (#185) — pre-register the immutable direction
159+
claim and pass condition before any iteration runs. Surfaces in the
160+
agent's prompt verbatim alongside `target_system.description`.
161+
- **`models`** — pin per-phase models. Defaults: Opus for DESIGN,
162+
Sonnet for EXECUTE_ANALYZE.
163+
- **`theory_references`** — declare external theory anchors (Little's
164+
Law, M/G/K stability bound, etc.). Items can be plain strings or
165+
full `{name, statement, ...}` objects.
166+
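The `max_turns` override resolves in the order campaign > defaults > hardcoded fallback (25), per the commit message. A minimal sketch of that lookup, assuming the block is a mapping keyed by phase name (`design`, `execute_analyze`); the function name `resolve_max_turns` and the dictionary shapes are illustrative, not the project's actual code:

```python
from typing import Any, Mapping, Optional

# Fallback value named in the commit message for #186.
HARDCODED_FALLBACK = 25


def resolve_max_turns(
    phase: str,
    campaign: Optional[Mapping[str, Any]] = None,
    defaults: Optional[Mapping[str, Any]] = None,
) -> int:
    """Return the turn budget for `phase`: campaign > defaults > fallback."""
    # Check the campaign first, then defaults; the first source that
    # defines a value for this phase wins.
    for source in (campaign, defaults):
        block = (source or {}).get("max_turns") or {}
        value = block.get(phase)
        if value is not None:
            return int(value)
    return HARDCODED_FALLBACK


# Hypothetical inputs: defaults mirror the 80 / 120 figures quoted above.
defaults = {"max_turns": {"design": 80, "execute_analyze": 120}}
campaign = {"max_turns": {"design": 200}}

print(resolve_max_turns("design", campaign, defaults))           # 200
print(resolve_max_turns("execute_analyze", campaign, defaults))  # 120
print(resolve_max_turns("design"))                               # 25
```

A campaign that overrides only one phase still inherits the other phase's budget from the defaults, which is what makes a partial `max_turns` block safe to write.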
132167
The planner explores the codebase to discover metrics, knobs, and execution methods. You can optionally provide `observable_metrics` and `controllable_knobs` as hints — see [examples/campaign.yaml](examples/campaign.yaml) for all options.
133168

134169
### 5. Run a campaign
@@ -152,6 +187,11 @@ Options:
152187
nous run campaign.yaml --max-iterations 5 -v # verbose
153188
nous run campaign.yaml --auto-approve # skip gates (for CI/non-interactive)
154189
nous run campaign.yaml --auto-approve --max-iterations 1 # quick unattended run
190+
191+
# Skip DESIGN entirely with a pre-authored bundle (#188 — paper repro).
192+
# Bundle is schema-validated, hashed, and recorded in
193+
# iter-1/bundle_manifest.json for reviewer-defensible provenance.
194+
nous run campaign.yaml --bundle ./fig7_bundle.yaml --auto-approve
155195
```
156196

157197
### Overnight / long-running campaigns
@@ -174,7 +214,7 @@ nous run campaign.yaml \
174214

175215
| Flag | Default | Description |
176216
|------|---------|-------------|
177-
| `--timeout` | 1800 (30 min) | Per-phase time limit for `claude -p` |
217+
| `--timeout` | 1800 (30 min) | Per-phase time limit for the Agent SDK call |
178218
| `--max-cli-retries` | 10 | Retries per phase before giving up |
179219
| `--max-iterations` | 10 | Total experiment iterations |
180220

@@ -215,14 +255,57 @@ your-repo/.nous/<run_id>/
215255
### Other CLI commands
216256
217257
```bash
218-
nous status campaign.yaml # show campaign phase, iteration, principles
219-
nous cost campaign.yaml # token/cost summary from llm_metrics.jsonl
220-
nous report campaign.yaml # generate report.md (uses LLM)
221-
nous resume campaign.yaml # resume a paused/interrupted campaign
222-
nous replay campaign.yaml --iter 1 # re-run iteration 1 commands in fresh worktree (no LLM)
258+
nous status campaign.yaml # one-shot campaign phase, iteration, principles
259+
nous status campaign.yaml --watch # live redraw; STUCK marker after 5 min of silence
260+
nous status campaign.yaml --line # one-line summary (shell prompt / parent agent)
261+
nous cost campaign.yaml # token/cost summary from llm_metrics.jsonl
262+
nous cost campaign.yaml --cache-stats # include prompt-cache hit-rate stats (#122)
263+
nous report campaign.yaml # generate report.md (uses LLM)
264+
nous resume campaign.yaml # resume a paused/interrupted campaign
265+
nous replay campaign.yaml --iter 1 # re-run iteration 1 commands in fresh worktree (no LLM)
223266
nous validate design --dir .nous/run/runs/iter-1/ # validate artifacts (agent-facing)
267+
nous schema [campaign|bundle|findings] [--format md|json|yaml] # print artifact schema
268+
nous stop campaign.yaml --reason "out of budget" # halt at next iteration boundary
224269
```
225270

271+
### Quick reference: how to run nous correctly
272+
273+
| You want to... | Command |
274+
|---|---|
275+
| Discover the campaign.yaml shape | `nous schema` |
276+
| Scaffold a starter campaign | `nous create-campaign --to ./campaign.yaml` |
277+
| Run a campaign end-to-end | `nous run campaign.yaml` (uses `--agent sdk` by default) |
278+
| Skip DESIGN with a pre-authored bundle | `nous run campaign.yaml --bundle path/to/bundle.yaml` |
279+
| Watch progress live | `nous status campaign.yaml --watch` |
280+
| Cleanly halt a running campaign | `nous stop campaign.yaml` |
281+
| Resume after halt or interruption | `nous resume campaign.yaml` |
282+
| Diagnose a failed iteration | `cat .nous/<run>/runs/iter-N/inputs/executor_log.jsonl` (#190) and `cat .nous/<run>/retry_log.jsonl` |
283+
| Audit token spend | `nous cost campaign.yaml --cache-stats` |
284+
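`nous stop` works by writing a sentinel that the campaign loop checks before each iteration. A minimal sketch of that pattern, assuming the sentinel is a plain file named `STOP` in the campaign work_dir; the filename, `request_stop`, `stop_requested`, and `run_iterations` are all illustrative names, not the project's actual helpers:

```python
from pathlib import Path
from typing import Callable, Optional

SENTINEL_NAME = "STOP"  # assumed filename, not confirmed by the source


def request_stop(work_dir: Path, reason: str = "") -> Path:
    """Write the sentinel; the running loop notices it at the next boundary."""
    sentinel = work_dir / SENTINEL_NAME
    sentinel.write_text(reason + "\n", encoding="utf-8")
    return sentinel


def stop_requested(work_dir: Path) -> Optional[str]:
    """Return the recorded reason if a stop was requested, else None."""
    sentinel = work_dir / SENTINEL_NAME
    if not sentinel.is_file():
        return None
    return sentinel.read_text(encoding="utf-8").strip()


def run_iterations(
    work_dir: Path, max_iterations: int, run_one: Callable[[int], None]
) -> str:
    """Run iterations, checking the sentinel before each one starts."""
    for iteration in range(1, max_iterations + 1):
        if stop_requested(work_dir) is not None:
            return "stopped_by_user"
        run_one(iteration)
    return "completed"
```

Because the check happens only between iterations, an in-flight iteration always finishes; interrupting mid-iteration still needs ctrl-C.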
285+
### Observability (when nous looks stuck or wrong)
286+
287+
When a campaign is mid-iteration and you can't tell what's happening:
288+
289+
1. **`nous status --watch`** — live redraw of phase / iteration / last
290+
tool call. Prints `STUCK` after 5 min of dispatcher silence.
291+
2. **`runs/iter-N/inputs/executor_log.jsonl`** (#190) — every SDK
292+
streaming event with timestamps. Tail it: `tail -f .nous/<run>/runs/iter-N/inputs/executor_log.jsonl`.
293+
3. **`retry_log.jsonl`** at the campaign root — every transient failure
294+
with attempt count, backoff, error string. The DESIGN-incomplete
295+
case (#187) writes a `failure_type: "design_incomplete"` entry with
296+
the missing-files list and the active `max_turns`.
297+
4. **`llm_metrics.jsonl`** at the campaign root — per-call tokens,
298+
cost, cache hits. `nous cost --cache-stats` aggregates this.
299+
5. **`state.json`** — the engine's atomic phase + iteration. Safe to
300+
`cat` mid-run. Resume picks up from here.
301+
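The `.jsonl` logs listed above are line-delimited JSON, so they can be filtered with a few lines of Python. A minimal sketch, assuming only that each line is one JSON object and that design-incomplete entries carry `failure_type: "design_incomplete"` as stated above; the helper names are illustrative:

```python
import json
from pathlib import Path
from typing import Iterator, List


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per line, skipping blanks and a half-written last line."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # The writer may be mid-append while we read; ignore the fragment.
                continue


def design_incomplete_entries(retry_log: Path) -> List[dict]:
    """Return retry_log entries recorded for the DESIGN-incomplete case."""
    return [
        entry
        for entry in iter_jsonl(retry_log)
        if entry.get("failure_type") == "design_incomplete"
    ]
```

Tolerating a truncated final line keeps the reader safe to run while a campaign is still appending to the log.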
302+
When DESIGN exits without producing `bundle.yaml` / `problem.md` /
303+
`handoff_snapshot.md`, the orchestrator raises a structured
304+
`DesignIncompleteError` (#187) naming the missing files and listing
305+
the four common causes — `max_turns` exhaustion, agent ran the
306+
experiment in DESIGN, API stall, transport failure — each with a
307+
concrete file to inspect.
308+
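The missing-artifact check described above amounts to a file-existence assertion that runs before schema validation. A minimal sketch, assuming the three filenames are looked up directly in the iteration directory; the class here is a stand-in that borrows the `DesignIncompleteError` name, and `assert_design_artifacts` is a hypothetical helper (the real error also lists likely causes, which this sketch omits):

```python
from pathlib import Path
from typing import Iterable

# Artifact names taken from the paragraph above.
REQUIRED_DESIGN_ARTIFACTS = ("bundle.yaml", "problem.md", "handoff_snapshot.md")


class DesignIncompleteError(RuntimeError):
    """Stand-in for the orchestrator's error; carries the missing filenames."""

    def __init__(self, missing: Iterable[str]) -> None:
        self.missing = list(missing)
        super().__init__("DESIGN finished without: " + ", ".join(self.missing))


def assert_design_artifacts(iter_dir: Path) -> None:
    """Raise if any required DESIGN artifact is absent from iter_dir."""
    missing = [
        name for name in REQUIRED_DESIGN_ARTIFACTS if not (iter_dir / name).is_file()
    ]
    if missing:
        raise DesignIncompleteError(missing)
```

Failing here, rather than in the schema validator, means the operator sees "these files are missing" instead of a less direct validation error.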
226309
### Run tests
227310

228311
```bash
@@ -238,7 +321,8 @@ orchestrator/ Python orchestrator (deterministic, not an LLM)
238321
engine.py State machine with atomic checkpoint/resume
239322
validate.py Artifact validation CLI (nous validate design/execution)
240323
dispatch.py Stub agent dispatch (for testing without LLM)
241-
cli_dispatch.py Code-access agent dispatch via claude -p
324+
sdk_dispatch.py Code-access agent dispatch via Claude Agent SDK (default)
325+
cli_dispatch.py Private base class for sdk_dispatch (legacy claude -p path retired in #183)
242326
prompt_loader.py Template loading with {{placeholder}} rendering
243327
gates.py Human approval gates with summaries
244328
ledger.py Deterministic ledger append (no LLM)

docs/architecture.md

Lines changed: 13 additions & 12 deletions
@@ -98,8 +98,8 @@ The dispatcher invokes AI agents by role and phase, passing structured input and
9898

9999
| Role | Invoked During | Produces |
100100
|---|---|---|
101-
| **Planner** (Opus, `claude -p`) | DESIGN | `problem.md`, `bundle.yaml`, `handoff_snapshot.md` |
102-
| **Executor** (Sonnet, `claude -p`) | EXECUTE_ANALYZE | `experiment_plan.yaml`, `findings.json`, `principle_updates.json`, `patches/`, `results/` |
101+
| **Planner** (Opus, Claude Agent SDK) | DESIGN | `problem.md`, `bundle.yaml`, `handoff_snapshot.md` |
102+
| **Executor** (Sonnet, Claude Agent SDK) | EXECUTE_ANALYZE | `experiment_plan.yaml`, `findings.json`, `principle_updates.json`, `patches/`, `results/` |
103103

104104
Both agents write artifacts directly to the campaign directory (`iter_dir`) and run `nous validate` before claiming done. If validation fails, the agent reads the errors, fixes the artifacts, and retries. The orchestrator runs a post-check as a safety net.
105105

@@ -110,8 +110,8 @@ Both agents write artifacts directly to the campaign directory (`iter_dir`) and
110110
**Implementations:**
111111

112112
- `StubDispatcher` (`dispatch.py`) produces valid, schema-conformant artifacts without calling any LLM. Used for testing the orchestrator loop.
113-
- `CLIDispatcher` (`cli_dispatch.py`) invokes `claude -p` as a subprocess, giving agents code access and shell tools. Agents write files directly to `iter_dir`. Supports `override_cwd()` context manager for pointing the executor at a git worktree. Selected via `--agent api`.
114-
- `SDKDispatcher` (`sdk_dispatch.py`) calls the Claude Agent SDK (`claude-agent-sdk`) instead of spawning a subprocess. Same artifact and metrics contract as `CLIDispatcher`; gains native streaming, programmatic prompt caching, and message-level retry. Selected via `--agent sdk`. Requires the optional `sdk` install extra (`pip install -e ".[sdk]"`). Inherits parse / validate / retry-with-feedback machinery from `CLIDispatcher` — only the transport changes.
113+
- `SDKDispatcher` (`sdk_dispatch.py`, default and only user-facing code-access backend post-#183) calls the Claude Agent SDK (`claude-agent-sdk`) directly, giving agents code access and shell tools through native streaming, programmatic prompt caching, and message-level retry. Agents write files directly to `iter_dir`. Selected via `--agent sdk` (the default). Requires `claude-agent-sdk` and `anyio`, both required dependencies of `nous` so `pip install nous` is sufficient.
114+
- `CLIDispatcher` (`cli_dispatch.py`) is retained as a private base class that `SDKDispatcher` inherits from for the parse / validate / retry-with-feedback machinery. The legacy `--agent api` (claude -p subprocess) path was removed in #183; the class is no longer reachable from the CLI.
115115

116116
**Dispatch interface:**
117117
```python
@@ -136,13 +136,13 @@ If either fails, the hook exits with code 2 and writes a structured reason to st
136136

137137
This is preferred over a probabilistic Haiku evaluator anywhere the success criterion is a schema check: cheaper, faster, and immune to evaluator drift.
138138

139-
## CLI Dispatch
139+
## SDK Dispatch
140140

141-
`CLIDispatcher` invokes `claude -p` for both agent roles.
141+
`SDKDispatcher` (`--agent sdk`, the default) invokes the Claude Agent SDK for both agent roles. The legacy `--agent api` (claude -p subprocess) backend was removed in #183; only the SDK path is reachable from the CLI.
142142

143143
### Retry and Resilience
144144

145-
**Pre-flight check:** At campaign start, Nous validates that the CLI is installed and credentials work via a quick `claude -p` test call. Environment problems are caught in seconds, not hours into an overnight run.
145+
**Pre-flight check:** At campaign start, Nous validates that the SDK is importable and credentials work. Environment problems are caught in seconds, not hours into an overnight run.
146146

147147
**All failures are retried** with exponential backoff (5s → 30s → 120s → 300s → 600s). There is no permanent/transient classification — the only hard failures are CLI-not-found and repo-path-missing, which are caught before the retry loop. Configurable via `--max-cli-retries` (default 10) and `--timeout` (default 1800s).
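The backoff schedule above can be sketched as a fixed table of delays. Two assumptions are made here because the source does not settle them: the delay holds at 600s once the five listed values are used up (the default of 10 retries is longer than the schedule), and the retry count is treated as the total number of attempts. `delay_for_attempt` and `call_with_retries` are illustrative names, not the project's API:

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Delays quoted in the paragraph above, in seconds.
BACKOFF_SECONDS = (5, 30, 120, 300, 600)


def delay_for_attempt(attempt: int, schedule: Sequence[int] = BACKOFF_SECONDS) -> int:
    """Delay after the given 1-based failed attempt; holds at the last value."""
    return schedule[min(attempt, len(schedule)) - 1]


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, retrying every failure with backoff."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as exc:  # no permanent/transient split: retry everything
            last_error = exc
            if attempt < max_retries:
                sleep(delay_for_attempt(attempt))
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error
```

Passing `sleep` as a parameter lets a test substitute a recorder for `time.sleep`, so the schedule can be checked without waiting.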
148148

@@ -163,7 +163,7 @@ Prompts are templates in `prompts/methodology/` (one per role). At dispatch time
163163

164164
### EXECUTE_ANALYZE: Merged Execution Pipeline
165165

166-
The executor agent (Sonnet, `claude -p`) handles the entire execution pipeline in a single session:
166+
The executor agent (Sonnet, via the Claude Agent SDK) handles the entire execution pipeline in a single session:
167167

168168
1. Receives the approved hypothesis bundle
169169
2. Explores the target repo, discovers build commands
@@ -176,7 +176,7 @@ After execution, the orchestrator validates artifacts (schema check) and merges
176176

177177
### Model Configuration
178178

179-
Two `claude -p` calls per iteration:
179+
Two Claude Agent SDK calls per iteration:
180180

181181
| Phase | Model | Role |
182182
|-------|-------|------|
@@ -368,10 +368,11 @@ The orchestrator is designed for crash-safe operation:
368368
369369
### Using a Different Dispatcher
370370
371-
Nous ships with two dispatchers:
371+
Nous ships with three dispatchers:
372372
373-
- `StubDispatcher` — deterministic stubs for testing
374-
- `CLIDispatcher` — real agent calls via `claude -p`
373+
- `StubDispatcher` — deterministic stubs for testing.
374+
- `InlineDispatcher` — emits prompts to stdout for an enclosing agent framework (no subprocess, no API key).
375+
- `SDKDispatcher` — real agent calls via the Claude Agent SDK (default and only user-facing code-access path post-#183).
375376
376377
To create a custom dispatcher, extend `LLMDispatcher`. Your dispatcher must produce artifacts that pass schema validation — the orchestrator trusts the schema contract, not the content.
377378
